I'm writing a pipeline that is supposed to do the following:
I am using
This works, but all comparisons are recomputed whenever a new file is added (because the list of comparisons changes). What I want is to compute comparisons only for newly added files, so e.g., if I already have 100 files in the directory and [...]

I read the documentation and played around with this idea for quite a while, but I'm feeling stuck. Would you mind pointing me in the right direction on how to implement this workflow? Thanks!
Replies: 1 comment 8 replies
I recommend something like this: #615. In your case, you could first identify pairs of files, then track them with format = "file", then read in the contents. Sketch:

# _targets.R file:
library(targets)

# Set up the files if they do not already exist (for the sake of this reprex).
fs::dir_create("dir")
files <- c("dir/file1.txt", "dir/file2.txt", "dir/file3.txt", "dir/file4.txt")
purrr::walk(files, ~if (!file.exists(.x)) writeLines(basename(.x), .x))
files <- list.files("dir", full.names = TRUE) # continuously refresh as a global variable here

list(
  # One row per pair of adjacent files. n = 1 matches the output shown below.
  tar_target(
    pair_files,
    tibble::tibble(
      # as.character() removes global attributes that could invalidate all files.
      first = as.character(na.omit(dplyr::lag(files, n = 1))),
      last = as.character(na.omit(dplyr::lead(files, n = 1)))
    )
  ),
  # One branch per pair; each branch tracks the two files in that pair.
  tar_target(
    watch_file_pairs,
    c(pair_files$first, pair_files$last),
    pattern = map(pair_files),
    format = "file" # Best to do this right before you read the file contents.
  ),
  # One branch per tracked pair; reads the contents of both files.
  tar_target(
    read_files,
    tibble::tibble(
      target_name = tar_name(),
      read_files = purrr::map_chr(watch_file_pairs, readLines)
    ),
    pattern = map(watch_file_pairs)
  )
)

# R console
tar_make()
#> • start target pair_files
#> • built target pair_files
#> • start branch watch_file_pairs_e5cd7ab5
#> • built branch watch_file_pairs_e5cd7ab5
#> • start branch watch_file_pairs_b5c3157d
#> • built branch watch_file_pairs_b5c3157d
#> • start branch watch_file_pairs_56ac7c58
#> • built branch watch_file_pairs_56ac7c58
#> • built pattern watch_file_pairs
#> • start branch read_files_31b27878
#> • built branch read_files_31b27878
#> • start branch read_files_f23744cc
#> • built branch read_files_f23744cc
#> • start branch read_files_42f8fea5
#> • built branch read_files_42f8fea5
#> • built pattern read_files
#> • end pipeline
tar_read(read_files)
#> # A tibble: 6 × 2
#>   target_name         read_files
#>   <chr>               <chr>
#> 1 read_files_31b27878 file1.txt
#> 2 read_files_31b27878 file2.txt
#> 3 read_files_f23744cc file2.txt
#> 4 read_files_f23744cc file3.txt
#> 5 read_files_42f8fea5 file3.txt
#> 6 read_files_42f8fea5 file4.txt
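
If you need every pairwise comparison rather than only adjacent files, a helper along these lines could build the pairs inside the pair_files target. This is only a sketch in base R: all_file_pairs() is a hypothetical name, not part of targets, and the file paths are the ones from the reprex above. My understanding is that dynamic branches are keyed by the data they map over, so rows that already existed should stay up to date and only the added rows should run, but please verify that with tar_make() on your own pipeline.

```r
# Hypothetical helper (not a targets function): one row per unordered pair
# of files, so each pair can become its own dynamic branch.
all_file_pairs <- function(files) {
  files <- sort(unique(as.character(files)))
  if (length(files) < 2L) {
    return(data.frame(
      first = character(0),
      last = character(0),
      stringsAsFactors = FALSE
    ))
  }
  pairs <- utils::combn(files, 2) # 2-row matrix, one column per pair
  data.frame(first = pairs[1, ], last = pairs[2, ], stringsAsFactors = FALSE)
}

# Pairs before and after a fourth file appears in the directory.
old <- all_file_pairs(c("dir/file1.txt", "dir/file2.txt", "dir/file3.txt"))
new <- all_file_pairs(
  c("dir/file1.txt", "dir/file2.txt", "dir/file3.txt", "dir/file4.txt")
)

# Rows that exist only after the new file was added.
added <- new[!paste(new$first, new$last) %in% paste(old$first, old$last), ]
print(added)
```

Every row in added involves dir/file4.txt, and the three original pairs are unchanged. Those unchanged rows are the ones that should not need to be recomputed.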