I am trying to create a function that takes overlapping periods into account in this data, which is ordered by id, start_ins and rxdate:

library(data.table)

dt <- data.table(
    id = c(1, 2, 2, 3, 3, 3),
    start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
    rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-08", "2003-12-29", "2004-01-12", "2004-03-10")),
    amount = c(30, 28, 100, 60, 1, 10),
    rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-07-17", "2004-02-27", "2004-01-13", "2004-03-20"))
)

In this data set, observations are grouped by id and start_ins. Each row records a prescription (Rx) purchase: the purchase date (rxdate) and the number of pills (amount). rx_end is computed as

dt[, rx_end := rxdate + amount]

However, if you look closely, you will find that rx_end for row 2 comes after rxdate for row 3, so the start of row 3 overlaps with the end of the previous prescription. I want to assume that a person finishes one prescription before starting the next. In this case, rxdate for row 3 should be moved to the day after rx_end of row 2, AND rx_end for row 3 should be updated to reflect the new rxdate. The same logic applies to rows 4 and 5. The new data set would then look something like this:

dt1 <- data.table(
    id = c(1, 2, 2, 3, 3, 3),
    start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
    rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-28", "2003-12-29", "2004-02-28", "2004-03-10")),
    amount = c(30, 28, 100, 60, 1, 10),
    rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-08-06", "2004-02-27", "2004-02-29", "2004-03-20"))
)
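
As a quick check of the arithmetic behind the two changed rows, the sketch below recomputes them by hand with base R Date arithmetic. The variable names are illustrative only and do not appear in the original data.

```r
# Row 3 (id 2): the previous prescription ends on 2010-04-27
prev_end   <- as.Date("2010-04-27")
new_rxdate <- prev_end + 1        # start the day after the previous end
new_rx_end <- new_rxdate + 100    # amount = 100 pills
format(c(new_rxdate, new_rx_end))
# "2010-04-28" "2010-08-06"

# Row 5 (id 3): the previous prescription ends on 2004-02-27
prev_end5   <- as.Date("2004-02-27")
new_rxdate5 <- prev_end5 + 1
new_rx_end5 <- new_rxdate5 + 1    # amount = 1 pill; 2004 is a leap year
format(c(new_rxdate5, new_rx_end5))
# "2004-02-28" "2004-02-29"
```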

I tried writing this function using data.table:

overlapper <- function (dt){
    dt[,{
        prev_end_date <- shift(rx_end, 1, type = "lag")

        for (i in 2:.N){
            if(!is.na(prev_end_date[i-1]) & prev_end_date >= dt[i, rxdate]){
                dt[i, rxdate := prev_end_date[i-1] +1]
                dt[i, rx_end := rxdate + amount]
            }
            prev_end_date[i] <- dt[i, rx_end]
        }
        return(dt)}, by = .(id, start_ins)]}

Keep in mind that some people have only one observation, so these should not be taken into account, which is why I am doing a grouped operation by id and start_ins. I always want to look at row n and compare it to row n-1, updating row n ONLY if there is an overlap. If there is no overlap, I would keep the row as is and move on to the next one.

asked Nov 18, 2024 at 10:35 by Youknowme; edited Nov 18, 2024 at 11:04 by SamR
  • Does your data really require a sequential approach? Your toy data looks like it can be done without. – Friede, Nov 18, 2024 at 11:27

2 Answers

You can do this using data.table::shift().

  1. Within each group (id, start_ins), look up the end date of the previous prescription with shift().
  2. If there's an overlap with the previous prescription, move the start date to the day after the previous end date.
  3. Make sure we never replace the first row in a group (we need to state this explicitly given your note in the comment about the change from < to <=).
  4. Recompute the end dates of the moved rows from the new start dates and the duration (amount).
  5. Continue until this is stable.

Here are steps 1-4, which work with your sample data:

dt[, c("rxdate", "rx_end") := {
    i <- seq(.N) # never replace first row
    shifted_rx_end <- shift(rx_end, 1, fill = rxdate[1])
    shift_dates <- fifelse(i == 1, FALSE, rxdate <= shifted_rx_end)
    rxdate <- fifelse(shift_dates, shifted_rx_end + 1, rxdate)
    rx_end[shift_dates] <- rxdate[shift_dates] + amount[shift_dates]
    .(rxdate, rx_end)
}, .(id, start_ins)]

identical(dt, dt1)
# [1] TRUE

However, imagine this new data frame:

      id  start_ins     rxdate amount     rx_end
   <num>     <Date>     <Date>  <num>     <Date>
1:     1 2000-10-12 2010-01-04     30 2010-02-03
2:     2 2000-03-31 2010-03-30     28 2010-04-27
3:     2 2000-03-31 2010-04-08    100 2010-07-17
4:     2 2000-03-31 2010-07-20     10 2010-07-30 # new row
5:     2 2000-03-31 2010-08-15     17 2010-09-01 # new row
6:     3 2000-01-11 2003-12-29     60 2004-02-27
7:     3 2000-01-11 2004-01-12      1 2004-01-13
8:     3 2000-01-11 2004-03-10     10 2004-03-20

If we apply this approach once, row 3 is updated correctly, but row 4 is compared against row 3's old rx_end, so it is not shifted, and the update that row 5 should then receive is never triggered.

To resolve this we can put it in a recursive function that calls itself until there are no more changes:

shift_dates <- function(dt, grp = c("id", "start_ins")) {
    old_state <- dt[, .(rxdate, rx_end)]
    dt[, c("rxdate", "rx_end") := {
        i <- seq(.N) # never replace first row
        shifted_rx_end <- shift(rx_end, 1, fill = rxdate[1])
        shift_dates <- fifelse(i == 1, FALSE, rxdate <= shifted_rx_end)
        rxdate <- fifelse(shift_dates, shifted_rx_end + 1, rxdate)
        rx_end[shift_dates] <- rxdate[shift_dates] + amount[shift_dates]
        .(rxdate, rx_end)
    }, grp]

    if (identical(dt[, .(rxdate, rx_end)], old_state)) {
        return(dt)
    }
    shift_dates(dt, grp) # pass grp along so a custom grouping survives the recursion
}

This means the changes from previous rows will propagate until complete:

shift_dates(dt2)
      id  start_ins     rxdate amount     rx_end
   <num>     <Date>     <Date>  <num>     <Date>
1:     1 2000-10-12 2010-01-04     30 2010-02-03
2:     2 2000-03-31 2010-03-30     28 2010-04-27
3:     2 2000-03-31 2010-04-28    100 2010-08-06
4:     2 2000-03-31 2010-08-07     10 2010-08-17 # change propagates
5:     2 2000-03-31 2010-08-18     17 2010-09-04 # change propagates
6:     3 2000-01-11 2003-12-29     60 2004-02-27
7:     3 2000-01-11 2004-02-28      1 2004-02-29
8:     3 2000-01-11 2004-03-10     10 2004-03-20

New data (dt2, used above)

dt2 <- data.table(
    id = c(1, 2, 2, 2, 2, 3, 3, 3),
    start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31", "2000-03-31", "2000-03-31", "2000-01-11", "2000-01-11", "2000-01-11")),
    rxdate = as.Date(c("2010-01-04", "2010-03-30", "2010-04-08", "2010-07-20", "2010-08-15", "2003-12-29", "2004-01-12", "2004-03-10")),
    amount = c(30, 28, 100, 10, 17, 60, 1, 10),
    rx_end = as.Date(c("2010-02-03", "2010-04-27", "2010-07-17", "2010-07-30", "2010-09-01", "2004-02-27", "2004-01-13", "2004-03-20"))
)
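
For comparison, here is a minimal sketch of a single sequential pass per group, which avoids the repeated passes of the recursive version. The helper name shift_dates_loop is hypothetical and not part of data.table. It assumes the data is already sorted by id, start_ins and rxdate, and that rx_end equals rxdate + amount.

```r
library(data.table)

# Hypothetical helper: one sequential pass per group instead of recursion.
# Assumes rows are sorted by the grouping columns and then by rxdate.
shift_dates_loop <- function(dt, grp = c("id", "start_ins")) {
    out <- copy(dt) # work on a copy so the input is not modified by reference
    out[, c("rxdate", "rx_end") := {
        start <- rxdate
        end <- rx_end
        for (i in seq_len(.N)[-1]) { # empty loop for single-row groups
            if (start[i] <= end[i - 1]) { # overlap with the (updated) previous row
                start[i] <- end[i - 1] + 1
                end[i] <- start[i] + amount[i]
            }
        }
        .(start, end)
    }, by = grp]
    out[]
}

# The id 2 rows of dt2, where the change has to propagate through several rows
chain <- data.table(
    id = 2,
    start_ins = as.Date("2000-03-31"),
    rxdate = as.Date(c("2010-03-30", "2010-04-08", "2010-07-20", "2010-08-15")),
    amount = c(28, 100, 10, 17)
)
chain[, rx_end := rxdate + amount]

res <- shift_dates_loop(chain)
format(res$rxdate)
# "2010-03-30" "2010-04-28" "2010-08-07" "2010-08-18"
format(res$rx_end)
# "2010-04-27" "2010-08-06" "2010-08-17" "2010-09-04"
```

Because each row is compared against the already updated previous row, one pass is enough here. Whether this is faster than the vectorised recursion depends on the number and size of the groups, so it is worth timing both on real data.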

Use SamR's answer (a better approach, much faster, and your data is already stored as a data.table)! This answer is intended to explicitly demonstrate the for loops in case a sequential approach is indeed needed.

Base R

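for (i in seq(nrow(l))[-1]) {
  t0 = l[["rx_end"]][i-1] # end date of the previous row (already updated if needed)
  if (t0 - l[["rxdate"]][i] >= 0) { # overlap: previous end is on or after this start
    u = t0 + 1 # new start: the day after the previous end
    l[["rxdate"]][i] = u
    l[["rx_end"]][i] = u + l[["amount"]][i]
  }
}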

where l is a data.frame. This is the inner loop running over the observations (rows) of each id. We also need an outer loop which iterates over each id. A convenient approach is to split dt on id and start_ins and iterate over the resulting list of data.frames with lapply().
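
As a small illustration of that splitting step, the sketch below shows what split() returns for the grouping columns. The data.frame df is a cut-down copy of the question's data made up for this example, and the formula interface of split() assumes R >= 4.1.

```r
# Cut-down copy of the question's data (illustrative only)
df <- data.frame(
  id = c(1, 2, 2, 3, 3, 3),
  start_ins = as.Date(c("2000-10-12", "2000-03-31", "2000-03-31",
                        "2000-01-11", "2000-01-11", "2000-01-11")),
  amount = c(30, 28, 100, 60, 1, 10)
)

# One data.frame per observed id/start_ins combination;
# drop = TRUE discards combinations that never occur
groups <- split(df, ~ id + start_ins, drop = TRUE)

length(groups)        # 3 groups
sapply(groups, nrow)  # rows per group; single-row groups need no loop
```

The list elements are ordered by the levels of the grouping factors, which is not guaranteed to match the original row order. That is presumably why the combined result is sorted again at the end.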

To collect results we use

# combine results 
unlist2d = \(l) do.call("rbind", l) |> `row.names<-`(NULL)

All together

dt0 = as.data.frame(dt)
dt0 =
  lapply(split(dt0, ~id+start_ins, drop = TRUE), \(l) 
         if (nrow(l) > 1) { # more than one row per group
           for (i in seq(nrow(l))[-1]) {
             t0 = l[["rx_end"]][i-1]
             if (t0 - l[["rxdate"]][i] >= 0) {
               u = t0 + 1 
               l[["rxdate"]][i] = u
               l[["rx_end"]][i] = u + l[["amount"]][i] }
             }
           l
           } 
         else l
         ) |> 
  unlist2d() |> 
  sort_by(~id+start_ins+rxdate)

giving

> identical(as.data.table(dt0), dt1)
[1] TRUE

Assumption: dt is ordered by id+start_ins+rxdate.
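
If that ordering is not guaranteed, one way to establish it first is data.table's setorder(), which sorts in place. The sketch below uses a deliberately shuffled copy of the id 3 rows; the object name dt_unsorted is made up for this example.

```r
library(data.table)

# Deliberately shuffled rows (values copied from the question's id 3 group)
dt_unsorted <- data.table(
    id = c(3, 3, 3),
    start_ins = as.Date(rep("2000-01-11", 3)),
    rxdate = as.Date(c("2004-03-10", "2003-12-29", "2004-01-12")),
    amount = c(10, 60, 1)
)

# Sort in place by the grouping columns, then by purchase date
setorder(dt_unsorted, id, start_ins, rxdate)

# Compute the end dates only after sorting
dt_unsorted[, rx_end := rxdate + amount]
dt_unsorted
```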
