-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Last observation carried forward with time constraints #7
Comments
Here's an attempt using library(data.table)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:data.table':
#>
#> hour, isoweek, mday, minute, month, quarter, second, wday,
#> week, yday, year
#> The following object is masked from 'package:base':
#>
#> date
test_locf <- tibble::tribble(
~patid, ~start_date , ~value,
1, ymd_hms("2010-01-05 12:00:00"), 5,
1, ymd_hms("2010-01-05 13:00:00"), NA,
1, ymd_hms("2010-01-05 15:59:59"), NA,
1, ymd_hms("2010-01-05 17:00:00"), NA,
1, ymd_hms("2010-01-05 18:00:00"), 10,
2, ymd_hms("2010-01-05 13:00:00"), NA,
2, ymd_hms("2010-01-05 14:00:00"), NA,
2, ymd_hms("2010-01-05 15:00:00"), 2,
2, ymd_hms("2010-01-05 15:31:01"), NA,
2, ymd_hms("2010-01-06 16:00:00"), NA
)
locf_window <- function(df, value, by, date, window, unit = "hours"){
# Parameters
# df = data.frame with value, by, date columns
# value = column name to perform LOCF
# by = grouping column (e.g. ID)
# date = date column (created using lubridate package)
# window = maximum allowed time window for LOCF
# unit = time unit (passed to lubridate::time_length function)
DT <- data.table(df, key = c(by,date))
setnames(DT, c(value, by, date), c("value", "by", "date"))
# Create a temporary grouping variable per non-missing observation per patient
DT[!is.na(value), tmp_group := seq_along(value)]
# Impute grouping for missing obs with zoo::na.locf (na.rm=F will keep missing first observation)
DT[, tmp_group := zoo::na.locf(tmp_group, na.rm=F), by=by]
# Fill in missing observation if time_length < window by group
DT[ , value := ifelse(time_length(`[`(date, 1L) %--% date, unit = unit) < window,
`[`(value, 1L), NA),
by=c("by", "tmp_group")]
setnames(DT, c("value", "by", "date"), c(value, by, date))
return(DT[,-"tmp_group"])
}
locf_window(test_locf, "value", "patid", "start_date", 4)
#> patid start_date value
#> 1: 1 2010-01-05 12:00:00 5
#> 2: 1 2010-01-05 13:00:00 5
#> 3: 1 2010-01-05 15:59:59 5
#> 4: 1 2010-01-05 17:00:00 NA
#> 5: 1 2010-01-05 18:00:00 10
#> 6: 2 2010-01-05 13:00:00 NA
#> 7: 2 2010-01-05 14:00:00 NA
#> 8: 2 2010-01-05 15:00:00 2
#> 9: 2 2010-01-05 15:31:01 2
#> 10: 2 2010-01-06 16:00:00 NA Created on 2019-08-08 by the reprex package (v0.3.0) The code is a bit lengthy as I'm not very familiar with data.table syntaxes for programming (hence the double |
First, thanks Albert for looking into this, really appreciated! Second, sorry for the absolutely crap "reproducible example" which is absolutely not reproducible. Just tried it and it failed horribly. That happens if you screenread something from the Data Safe Haven, then make changes for it to use tidyverse instead of data.table, and then not even try to run it... Example should now be fixed. Third, I love the temp_group approach and particularly how you get the first element in each group with simple list subsetting: Finally, the main problem why this solution doesn't work for dataset is speed. The main problem are the by-statements, which call na.locf() and time_length(), which are both expensive functions, separately for each temp_group:
Running the two solutions on a dataset of 100,000 observations takes 83 milliseconds for the code without by and 27 seconds for the code with by (a factor of 325). So the goal is to avoid the by. Here is quick and dirty code to run the timing of large samples yourself (I will clean it up if I have time):
|
Hi all,
I have created a function that performs a last observation carried forward that takes account of how long ago the last observation was recorded. Code for the function below.
The function generally seems to work, but it seems overly complicated. Does anyone have an idea of how to simplify or speed up the function?
P
The text was updated successfully, but these errors were encountered: