Acting on branches in sequence in pairs of two in downstream target #719

januz · 2021-12-06T17:29:49Z


I'm writing a pipeline that is supposed to do the following:

1) read in files from a directory
   - files are named in a regular manner including dates, so listing the directory will always result in the same order
   - new files are added to the directory over time, always "added at the end"
2) compare file content in pairs by sequence
   - file_1 vs. file_2, file_2 vs. file_3, file_3 vs. file_4, ...
   - output of the comparison is a tibble/df with defined columns
3) return the results of all comparisons in one tibble/df
   - just dplyr::bind_rows() the single comparison results from 2)

I am using tarchetypes::tar_files() to read in and branch over the files dynamically, i.e., only new files are read in for 1). I am struggling with how to implement 2) dynamically, though. What I am doing so far is:

- generate a list of comparisons (1 vs. 2, 2 vs. 3, 3 vs. 4, ...) based on the length of the target (i.e., the number of branches) generated in step 1)
- map over this list and compare the single files from the target generated in step 1) to generate the intended output (step 3).

This works but all comparisons are computed anew when a new file is added (as the list of comparisons changes). What I would want is to only compute the comparisons for newly added files, so e.g., if I already have 100 files in the directory and {targets} has already computed the result for all 99 comparisons (1 vs. 2, 2 vs. 3, ... 99 vs. 100), I only want it to compute the comparison 100 vs. 101 (and update the output df to now bind all 100 comparisons) instead of recomputing the 99 comparisons.

I read the documentation and played around with this idea for quite a while but I'm feeling stuck. Would you mind pointing me in the right direction how to implement this workflow? Thanks!


wlandau (Maintainer) · 2021-12-06T21:30:03Z

I recommend something like this: #615. In your case, you could first identify pairs of files, then track them with format = "file", then read in the contents. Sketch:

# _targets.R file:
library(targets)  

# Set up the files if they do not already exist (for the sake of this reprex)
fs::dir_create("dir")
files <- c("dir/file1.txt", "dir/file2.txt", "dir/file3.txt", "dir/file4.txt")
purrr::walk(files, ~if (!file.exists(.x)) writeLines(basename(.x), .x))

files <- list.files("dir", full.names = TRUE) # continuously refresh as a global variable here

list(
  tar_target(
    pair_files,
    tibble::tibble(
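      # Note: n = 2 pairs file i with file i + 2; the console output below shows
      # consecutive pairs, which correspond to the default n = 1.
      first = as.character(na.omit(dplyr::lag(files, n = 2))), # as.character() removes global attributes that could invalidate all files.
      last = as.character(na.omit(dplyr::lead(files, n = 2)))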
    )
  ),
  tar_target(
    watch_file_pairs,
    c(pair_files$first, pair_files$last),
    pattern = map(pair_files),
    format = "file" # Best to do this right before you read the file contents.
  ),
  tar_target(
    read_files,
    tibble::tibble(
      target_name = tar_name(),
      read_files = purrr::map_chr(watch_file_pairs, readLines) # map_chr() is exported, so :: is enough
    ),
    pattern = map(watch_file_pairs)
  )
)

# R console
tar_make()
#> • start target pair_files
#> • built target pair_files
#> • start branch watch_file_pairs_e5cd7ab5
#> • built branch watch_file_pairs_e5cd7ab5
#> • start branch watch_file_pairs_b5c3157d
#> • built branch watch_file_pairs_b5c3157d
#> • start branch watch_file_pairs_56ac7c58
#> • built branch watch_file_pairs_56ac7c58
#> • built pattern watch_file_pairs
#> • start branch read_files_31b27878
#> • built branch read_files_31b27878
#> • start branch read_files_f23744cc
#> • built branch read_files_f23744cc
#> • start branch read_files_42f8fea5
#> • built branch read_files_42f8fea5
#> • built pattern read_files
#> • end pipeline

tar_read(read_files)
#> # A tibble: 6 × 2
#>   target_name         read_files
#>   <chr>               <chr>     
#> 1 read_files_31b27878 file1.txt 
#> 2 read_files_31b27878 file2.txt 
#> 3 read_files_f23744cc file2.txt 
#> 4 read_files_f23744cc file3.txt 
#> 5 read_files_42f8fea5 file3.txt 
#> 6 read_files_42f8fea5 file4.txt
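
For the comparison step from the original question, the read_files command could be swapped for a function that compares the two paths of one branch. The following is only a minimal sketch: compare_pair() and its output columns are hypothetical names chosen for illustration and do not come from {targets} or from this thread. Since the 6 × 2 tibble above shows the branches of read_files coming back combined, branches that each return a one-row data frame should likewise end up in one combined result.

```r
# Hypothetical comparison helper (not part of {targets}): takes the two file
# paths held by one branch of watch_file_pairs and returns a one-row data frame.
compare_pair <- function(paths) {
  stopifnot(length(paths) == 2)
  first_lines <- readLines(paths[1])
  last_lines <- readLines(paths[2])
  data.frame(
    first = basename(paths[1]),
    last = basename(paths[2]),
    same_content = identical(first_lines, last_lines),
    # Number of distinct lines in the later file that the earlier file lacks.
    n_new_lines = length(setdiff(last_lines, first_lines)),
    stringsAsFactors = FALSE
  )
}

# Try the helper outside the pipeline on two temporary files.
tmp_dir <- tempfile("pairs_")
dir.create(tmp_dir)
paths <- file.path(tmp_dir, c("file1.txt", "file2.txt"))
writeLines(c("a", "b"), paths[1])
writeLines(c("a", "b", "c"), paths[2])
result <- compare_pair(paths)
print(result)

# Inside _targets.R, the helper would be used in place of read_files, e.g.:
# tar_target(
#   comparisons,
#   compare_pair(watch_file_pairs),
#   pattern = map(watch_file_pairs)
# )
```

Because each branch depends only on its own two files, adding a new file at the end of the directory should create one new branch rather than invalidate the existing ones.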

8 replies

januz Dec 8, 2021
Author

> wrap both calls in as.character() to remove those attributes

Thanks, this works!

> in this case, there is a target that dynamically branches over pair_files, so it turns out pair_files actually does have to be a target in this case.

Yes, I was wondering about this. I tried defining the object outside of the pipeline, but then I couldn't have {targets} map over it like you allude to.

Thanks for all your help, Will!

januz Dec 16, 2021
Author

@wlandau I implemented the above solution, and it works well for the described scenario, in which files are incrementally added to the directory and existing ones aren't changed. To explore this setup further, I tested whether changes to existing files would be picked up, too. I had expected that if I changed a file's content (leaving its name as is) after it had already been analyzed, {targets} would pick up the change and rerun the targets related to this file. This does not seem to be the case. It isn't a problem for the project I'm currently working on, but I would still like to understand why that is, and how I could modify the code to keep the file-pair functionality while adding the change detection {targets} typically provides for file inputs.

Thanks so much!

wlandau Dec 16, 2021
Maintainer

That's odd. Did you have a format = "file" target like in the example? Here is a reprex of it working. (Also, I forgot n = 2 in dplyr lead() and lag() before, and I just edited the original answer to include it.)

library(targets)

fs::dir_create("dir")
files <- c("dir/file1.txt", "dir/file2.txt", "dir/file3.txt", "dir/file4.txt")
purrr::walk(files, ~if (!file.exists(.x)) writeLines(basename(.x), .x))

tar_script({
  files <- list.files("dir", full.names = TRUE) # continuously refresh as a global variable here
  
  list(
    tar_target(
      pair_files,
      tibble::tibble(
        first = as.character(na.omit(dplyr::lag(files, n = 2))), # as.character() removes global attributes that could invalidate all files.
        last = as.character(na.omit(dplyr::lead(files, n = 2)))
      )
    ),
    tar_target(
      watch_file_pairs,
      c(pair_files$first, pair_files$last),
      pattern = map(pair_files),
      format = "file" # Best to do this right before you read the file contents.
    ),
    tar_target(
      read_files,
      tibble::tibble(
        target_name = tar_name(),
        read_files = purrr::map_chr(watch_file_pairs, readLines) # map_chr() is exported, so :: is enough
      ),
      pattern = map(watch_file_pairs)
    )
  )
})

tar_make() # Runs everything.
#> • start target pair_files
#> • built target pair_files
#> • start branch watch_file_pairs_7e284f77
#> • built branch watch_file_pairs_7e284f77
#> • start branch watch_file_pairs_ccda31d0
#> • built branch watch_file_pairs_ccda31d0
#> • built pattern watch_file_pairs
#> • start branch read_files_a60b93f7
#> • built branch read_files_a60b93f7
#> • start branch read_files_ce34d0ed
#> • built branch read_files_ce34d0ed
#> • built pattern read_files
#> • end pipeline

tar_make() # Skips everything.
#> ✓ skip target pair_files
#> ✓ skip branch watch_file_pairs_7e284f77
#> ✓ skip branch watch_file_pairs_ccda31d0
#> ✓ skip pattern watch_file_pairs
#> ✓ skip branch read_files_a60b93f7
#> ✓ skip branch read_files_ce34d0ed
#> ✓ skip pattern read_files
#> ✓ skip pipeline

writeLines("new stuff", "dir/file3.txt")

tar_make() # Reruns the targets that depend on dir/file3.txt.
#> ✓ skip target pair_files
#> • start branch watch_file_pairs_7e284f77
#> • built branch watch_file_pairs_7e284f77
#> ✓ skip branch watch_file_pairs_ccda31d0
#> • built pattern watch_file_pairs
#> • start branch read_files_a60b93f7
#> • built branch read_files_a60b93f7
#> ✓ skip branch read_files_ce34d0ed
#> • built pattern read_files
#> • end pipeline

Created on 2021-12-16 by the reprex package (v2.0.1)

januz Dec 17, 2021
Author

Mh, weird. When I tested it yesterday, it didn't recompute, but I must have done something wrong. When I checked again today, it worked as expected.

I think I might have mixed it up with the phenomenon that when I re-create the same files (i.e., same name and content, just newer date stamp), {targets} does not recompute the targets. But I assume that is just due to the check being based on checksum only and not on modification/creation date + checksum, is that correct?

Thanks so much as always for developing such a great package and on top of that always helping me understand it better!!

P.S. For my use case, the version without n = 2 was the correct one.
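
To make the difference between the two versions concrete, here is a small base-R sketch of the pairing step. pair_files_by_offset() is a hypothetical helper written for illustration; it uses head() and tail() instead of dplyr::lag() and dplyr::lead(), which should give the same pairs without leaving any na.omit() attributes behind.

```r
# Hypothetical helper: pair each file with the one `n` positions after it.
pair_files_by_offset <- function(files, n = 1) {
  stopifnot(n >= 1)
  data.frame(
    first = head(files, -n), # everything except the last n files
    last = tail(files, -n),  # everything except the first n files
    stringsAsFactors = FALSE
  )
}

files <- c("dir/file1.txt", "dir/file2.txt", "dir/file3.txt", "dir/file4.txt")

# n = 1: consecutive pairs (1 vs. 2, 2 vs. 3, 3 vs. 4), as asked for in the question.
adjacent <- pair_files_by_offset(files, n = 1)
print(adjacent)

# n = 2: every file paired with the one two positions later (1 vs. 3, 2 vs. 4).
skipping <- pair_files_by_offset(files, n = 2)
print(skipping)
```

With four files, n = 1 yields three pairs and n = 2 yields two, which matches the three branches in the first console output and the two branches in the later reprex.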

wlandau Dec 17, 2021
Maintainer

Yeah, it comes down to hashes, not time stamps. Sometimes a target reruns but the actual results don’t change, and hashes detect that so you don’t waste time rerunning stuff downstream.
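
The hash-versus-time-stamp point can be illustrated outside of {targets} with a short sketch. tools::md5sum() is used here only as a stand-in for a content hash; that choice is an assumption made for illustration and is not necessarily the hash function {targets} uses internally.

```r
# Write a file and record its content hash and modification time.
path <- tempfile(fileext = ".txt")
writeLines("same content", path)
hash_before <- unname(tools::md5sum(path))
mtime_before <- file.mtime(path)

# Re-create the file with identical content and push its time stamp forward,
# mimicking "same name and content, just newer date stamp".
writeLines("same content", path)
Sys.setFileTime(path, Sys.time() + 60)
hash_same <- unname(tools::md5sum(path))
mtime_after <- file.mtime(path)

# Now actually change the content.
writeLines("new stuff", path)
hash_changed <- unname(tools::md5sum(path))

print(identical(hash_before, hash_same))    # content hash unaffected by the rewrite
print(identical(hash_before, hash_changed)) # content hash differs after a real change
print(c(before = format(mtime_before), after = format(mtime_after)))
```

A check based on the content hash therefore treats the re-created file as unchanged, while a real edit produces a different hash.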
