-
Hi there! I am working on a system to generate synthetic responses for jsPsych experiments (jsPsychMonkeys). I use targets to create one docker container per participant using map(). Each container opens a browser instance, each browser runs the task for the participant (this can take from a minute to more than an hour), and in the last step the docker container is destroyed. See gif below. When I run a few participants everything works fine, but when I run a big number of participants the containers eat my RAM. This happens because right now targets runs the nodes by layers: first all the docker containers, then all the remote drivers, etc. (see gif). So I need enough RAM to keep as many docker containers alive as there are participants. In brief: is there a way to finish each branch before starting a new one?
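To make the structure concrete, here is a minimal sketch of what I mean (placeholder function names, not my real code):
# One set of targets per participant (placeholder functions)
library(targets)
library(tarchetypes)

list(
  tar_map(
    values = list(participant = 1:50),
    tar_target(container, start_docker(participant)),          # one docker container each
    tar_target(driver, open_browser(container)),               # remote driver inside the container
    tar_target(responses, run_task(driver, participant)),      # a minute to more than an hour
    tar_target(cleanup, destroy_docker(container, responses))  # tear the container down
  )
)
With the current scheduling, all the start_docker targets tend to run before the later steps, so all 50 containers can be alive at the same time.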
Thanks for the help!
-
In your case, it would probably be easiest to change the memory management strategy, e.g. tar_option_set(memory = "transient", garbage_collection = TRUE). If you did that, I suspect target execution order would matter less. (Please keep in mind that targets assumes each target produces a single serializable/exportable return value and that any untracked side effects can be safely discarded.) An alternative would be to manually set the priority of each target. It is difficult to do this in tar_map() but may be easier with tar_eval():
# _targets.R
library(rlang)
library(targets)
library(tarchetypes)
tar_eval(
  values = list(
    target1 = syms(c("target1_a", "target1_b")),
    target2 = syms(c("target2_a", "target2_b")),
    x = c("a", "b"),
    priority = c(0.5, 1)
  ),
  list(
    # priority is between 0 and 1; targets with higher priority dispatch first,
    # so here the whole "b" branch tends to finish before the "a" branch
    tar_target(target1, f(x), priority = priority),
    tar_target(target2, g(target1), priority = priority)
  )
)
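For the first suggestion, a minimal sketch of where that option would go (assuming the rest of the pipeline stays as it is):
# _targets.R (sketch)
library(targets)
tar_option_set(memory = "transient", garbage_collection = TRUE)
# ... target definitions as before ...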
-
Thanks for the answer @wlandau. Sorry for the long message. The TLDR version is: it seems the future workers move row-wise more often than column-wise, which keeps the number of simultaneous containers, and therefore the RAM use, lower. If it were possible to set an option for row-wise movement as the default for targets, it would help cases where a lot of independent parallel computations with a high RAM cost are needed.
Long version: I've been doing some testing on a different computer with more RAM (64 GB vs 24 GB), and I tried running the transient-memory and normal targets pipelines in 4 different configurations (type of protocol x type of make).
For the sake of simplicity I am not showing all of the results here.
I was puzzled about the lower memory requirements of some of the configurations. It seems the future workers tend to move row-wise (finishing the steps of one branch) more often than column-wise (running the same step across all branches), which would explain the difference.
I have not tried setting the priorities for each individual target. As you can see, I use tar_map(). Just in case it may be useful, this is my _targets.R file:
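(Only a simplified sketch of it here; the real functions come from jsPsychMonkeys and the names and values below are placeholders.)
# _targets.R (simplified sketch, placeholder names)
library(targets)
library(tarchetypes)

# one of the configurations tested: transient memory plus garbage collection
tar_option_set(memory = "transient", garbage_collection = TRUE)

# plan used when launching with tar_make_future()
future::plan(future::multisession)

# ... tar_map() over participants: create container -> remote driver ->
# run task -> destroy container, as in my first message ...
A parallel run is then launched with something like tar_make_future(workers = 8) (the number of workers here is just an example).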
-
My initial version had a big function taking care of everything. One critical bit of my use case is that the side effect (a csv file with the participant's responses) is the desired result. The memory demands also seem unavoidable given the "column-wise" default direction in targets and the need for as many parallel workers as possible. A "row-wise" mode would be a blessing, making it possible to launch many more parallel participants. If at any point you think it may be a good idea to develop this "row-wise" mode, I would be happy to help test it. Thanks for all the help.
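As a side note on that csv side effect, one way to let targets track it (just a sketch; run_task_and_save() and its arguments are made-up names) is to have the task target return the path of the csv it writes and declare format = "file":
# Sketch: track the per-participant csv as a file target (placeholder names)
tar_target(
  responses_file,
  run_task_and_save(driver, participant),  # writes the csv and returns its path
  format = "file"
)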