
Automatic task cleanup #3849

Draft · bentsherman wants to merge 40 commits into master

Conversation

@bentsherman (Member) commented Apr 10, 2023

Alternative to #3818

Instead of adding a temporary option to output paths, this PR enables automatic cleanup through the cleanup config option. By setting cleanup = 'eager', Nextflow will automatically delete task directories during the workflow run. Caveats are documented in the PR.
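
For reference, a minimal sketch of what enabling this looks like in nextflow.config, based solely on the option described above (the strategy names may still change, per the TODO list below):

```groovy
// nextflow.config -- minimal sketch of enabling eager cleanup as described in this PR.
// The 'eager' strategy name comes from the PR description and may change before merge.
cleanup = 'eager'
```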

TODO:

  • wait for outputs to be published
  • warn about incompatible publish modes
  • resumability
  • refactor based on workflow output DSL
  • remove lazy, eager strategies, make aggressive strategy resumable

@bentsherman (Member, Author)

As I mentioned in the other PR, this eager cleanup currently won't work correctly with file publishing because the publishing is asynchronous. So for each task we need to wait for any files to be published first...

@bentsherman changed the base branch from ben-task-graph to master on April 28, 2023
@netlify netlify bot commented Jul 7, 2023

Deploy Preview for nextflow-docs-staging canceled.

Latest commit: 9ce10dc
Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/65c1015b846889000810a7cc

@JohnHadish commented Mar 21, 2024

If I understand correctly, the current implementation waits for the outputs of each process to be produced and then performs cleanup. It may be better to wait until all processes for a single input have finished and then delete. That would probably be simpler, since you do not need to worry about which files still need to be kept, and it would still offer a significant amount of space saving.

Example:
I have 10,000 files to process and a machine that can run 100 jobs at a time. My Nextflow workflow has 5 processes, each of which produces files I do not care about but need for the next step. The run begins and the first 100 files are selected for processing. File 1 goes through all 5 processes; File 1 is now done, and all of its temporary files can be deleted. Something interrupts the run. File 2 had just completed process 3 at the time of the interruption. The user restarts, and File 2 continues where it was, since all of its intermediate files are still present.

@bentsherman (Member, Author)

The cleanup strategy needs to be reworked due to the upcoming workflow output DSL #4670.

If the publish definition is moved to the workflow level, the task has no idea which of its outputs will be published, and the cleanup observer can't delete a task until it knows that all of the outputs that are going to be published have been published.

A simple solution would be to mark an output for deletion when it is published (and downstream tasks are done with it, etc). The downside is that outputs that are not published are not deleted.

Thinking further, the current POC of the output DSL just appends some "publish" operators to the DAG, so I might be able to trace each process output through the DAG to see if it's connected to a publish op. That way we know when an output cannot be published and can delete it sooner. It still misses files that could be published but aren't at runtime, e.g. because they get filtered out by a filter op, but I suspect this is an edge case that can be avoided with good pipeline design.
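
A rough, self-contained sketch of that reachability idea (the class and method below are illustrative only, not actual Nextflow internals):

```groovy
// Illustrative sketch: decide whether a process output can ever reach a
// "publish" operator by walking the DAG. These types are hypothetical and
// do not correspond to Nextflow's real DAG classes.
class DagNode {
    String name
    boolean isPublishOp = false
    List<DagNode> downstream = []
}

// Returns true if any path from this output node leads to a publish operator.
boolean canBePublished(DagNode node, Set<DagNode> seen = [] as Set) {
    if (!seen.add(node)) return false      // already visited
    if (node.isPublishOp) return true
    return node.downstream.any { canBePublished(it, seen) }
}

// Outputs for which canBePublished() is false are safe to delete as soon as
// all downstream tasks are done with them.
```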

Finally, we can always fall back to the existing strategy of "delete whatever is left at the end". As long as the eager cleanup can delete enough files early on, it should be enough to move many pipelines from un-runnable to runnable.

@pbieberstein

Many thanks for working on this. I know many people will benefit from it even if resumability doesn't work, but I wanted to share our use case for why resumability is important, since I haven't seen anyone mention it yet :)

Our pipeline will be the backbone of a platform that continuously accepts new samples and archives all the results. The first part of the pipeline is QC, filtering, genome assembly, etc., followed by various genome annotation methods.

Since annotation methods tend to evolve over time, we want to be able to quickly re-run all samples we have already analyzed. That's why we want to keep the work directory long term (but it needs to be as small as possible), so we can skip the first parts of the pipeline up to the annotation steps. This will require some flexibility in deciding which files should persist through cleanup: we want to keep finished assemblies, since those files aren't so large and the updated annotation methods usually start from them (plus the raw reads). Occasionally an earlier step will also be updated, which is why it would be great to have a smart resume mechanism that can figure out what truly needs to re-run. But I think a user needs to be able to define which files should persist through cleanup, because automatic purging would also delete the finished assemblies, and then the pipeline would have to compute everything from scratch even if only the annotation method changed. That would be the same as simply deleting the entire work directory.

Hoping that this is a good use case to have in mind while developing the resume functionality. For now we will try the GEMmaker method.

Thank you!!

@bentsherman (Member, Author) commented Jul 18, 2024

One way to handle that would be to publish the files that you don't want to lose. They will still be deleted from the work directory, but when the automatic cleanup recovers a deleted task on resume, it will also verify that the published files are up to date (the file checksums will be stored in the .nextflow cache), so it could use the published files for downstream tasks. But I guess it would need to re-download a published file if it's in a remote location.
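
For example, an illustrative process (not part of this PR; the script and params.outdir are placeholders) where the assembly is published and would therefore survive eager cleanup of the work directory:

```groovy
// Illustrative only: publishing the assembly keeps a copy outside the work
// directory, so a resumed run could fall back to the published file after
// the task directory has been cleaned up.
process ASSEMBLE {
    publishDir "${params.outdir}/assemblies", mode: 'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("${sample_id}.fasta")

    script:
    """
    assemble.sh ${reads} > ${sample_id}.fasta   # assemble.sh is a placeholder
    """
}
```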

@pbieberstein

Nice, it would make sense that it can reference published files. Is that logic already implemented in the nf-boost plugin?

@bentsherman (Member, Author)

No, nf-boost doesn't do anything with resume
