Improve `-dump-hashes` output #4369

bentsherman · 2023-10-04T13:57:37Z

Re-structures the -dump-hashes output to be on a single line per task, which makes it easier to produce a diff from the log files.

To test it, do an initial run, modify the pipeline, then a resumed run:

./launch.sh -log run_fresh.log run rnaseq-nf -profile docker -dump-hashes

# modify the rnaseq-nf pipeline locally
vim ~/.nextflow/assets/nextflow-io/rnaseq-nf/modules/fastqc/main.nf

./launch.sh -log run_resume.log run rnaseq-nf -profile docker -dump-hashes -resume

Then run the following bash script to filter the logs and produce the diff:

#!/bin/bash

if [[ $# != 2 ]] ; then
    >&2 echo "usage: $0 <log1> <log2>"
    exit 1
fi

get_hashes() {
    cat $1 \
    | grep 'cache hash:' \
    | cut -d ' ' -f 10- \
    | sort \
    | awk '{ print; print ""; }'
}

get_hashes $1 > diff.1.log
get_hashes $2 > diff.2.log

diff diff.1.log diff.2.log

./diff-hashes.sh run_fresh.log run_resume.log

Signed-off-by: Ben Sherman <[email protected]>

netlify · 2023-10-04T13:57:42Z

✅ Deploy Preview for nextflow-docs-staging canceled.

Name	Link
🔨 Latest commit	`4c43753`
🔍 Latest deploy log	https://app.netlify.com/sites/nextflow-docs-staging/deploys/652bf01e3c172c0008971af3

pditommaso · 2023-10-05T14:27:20Z

To tell the true I find this output much more confusing compared the current one

[sayHello (4)] cache hash: 341e08ddb2f780263ffe665202f0db6c; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edexecutor >  local (4)
[7d/105df3] process > sayHello (1) [100%] 4 of 4 ✔
[sayHello (4)] cache hash: 341e08ddb2f780263ffe665202f0db6c; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edc930becff0 [java.lang.String] x | f5e76d4e64af0c5d859ff08ab3b720b7 [java.lang.String] Hola | 4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ | 16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
[sayHello (2)] cache hash: b37742150cadbd69cb6072d8dffc51c9; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edc930becff0 [java.lang.String] x | 442e002ddd8b0a2b10ed51352f8c0488 [java.lang.String] Ciao | 4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ | 16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
[sayHello (1)] cache hash: eb8e64ef291404fc09bdd89324c9a984; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edc930becff0 [java.lang.String] x | c9273e5a7ac3508ef910437c4bb35a90 [java.lang.String] Bonjour | 4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ | 16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
[sayHello (3)] cache hash: b86b4bc95cebc58c8be5f1c06fa470e8; mode: STANDARD; entries: 0a01126af407b31e29e31a5778ce7e99 [java.util.UUID] 7bda72f1-8d94-4ca7-818b-4db206314bf1 | 3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello | e605d10a140bca2c087a238261356639 [java.lang.String] """\n    echo '$x world!'\n    """\n | 769f897d21d56476ad01edc930becff0 [java.lang.String] x | 0ab6632d52e811e9ef7c044666ac496a [java.lang.String] Hello | 4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $ | 16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true

bentsherman · 2023-10-05T14:38:02Z

Why? It's the same content but on a single line per task. And it's not really meant to be viewed in that form... imagine a pipeline run with 10,000 tasks, you wouldn't be viewing those logs.

Instead, you should use the bash script to extract and sort the logs, then use a diff viewer to highlight the changes.

pditommaso · 2023-10-05T14:42:51Z

because the current format allows to diff two logs and find where it diverges. With the new one is quite everything in one row

bentsherman · 2023-10-05T14:45:33Z

In other words, there's no point in making the output prettier, because the tasks will be logged in arbitrary order anyway. You have to take some extra steps to get it into a usable format, which primarily means one line per task.

pditommaso · 2023-10-05T14:55:06Z

It can be added as an optional format, but it should be human readable first

bentsherman · 2023-10-05T15:04:15Z

It's still human readable though

bentsherman · 2023-10-05T15:06:36Z

because the current format allows to diff two logs and find where it diverges. With the new one is quite everything in one row

but you can't really diff the logs because the tasks are printed in arbitrary order. you have to re-order the tasks manually.

I don't see the issue with putting everything on a single line, just use line wrap in your editor

bentsherman · 2023-10-05T15:12:30Z

Sorry about the flurry of messages, got out of sync with you. Here is a screenshot of a diff I produced with the bash script, using a diff viewer in VS code with line wrap:

pditommaso · 2023-10-06T08:41:57Z

Most of the time the order is maintained (and using process.fair should be consistent). Above all having one hash per liner at least, It can be figure out quickly where it start to diverge.

If the need is a format for programmatic processing, something more structured should be used (json?), instead of stacking hacks over hacks

bentsherman · 2023-10-06T14:16:29Z

I don't see how your screenshot illustrates your point... it is mostly noise 😄

Fair execution would improve somewhat, but for large pipelines you could still have different processes executing in arbitrary order. So the current format is still very difficult to use

But I agree that I would rather not layer hacks upon hacks. That is why in the original issue I suggested that we find a way for Nextflow to automate the whole thing and produce the final diff:

Make sure the task cache has all the components to re-create the hash

Add a new command to query the task cache for two runs with the same session id and compare the tasks for each run:

nextflow diff kickass_pare nostalgic_murdock

and produces some human-readable output, like:

Task `RNASEQ:FASTQC (1)` differs between the two runs:

  script:
  """
- fastqc.sh "$sample_id" "$reads"
+ fastqc.sh "$sample_id" "$reads" -t 2
  """

But that solution is more involved, and restructuring the hash dump to one line per task is a simple fix that greatly improves the overall debugging experience, even if some manual effort is still required. So here we are 😄

If it will make you happy, I will change the one-line format to json and make it opt-in with -dump-hashes json.

Signed-off-by: Ben Sherman <[email protected]>

bentsherman · 2023-10-06T15:42:24Z

Okay it's ready

Signed-off-by: Paolo Di Tommaso <[email protected]>

modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy

…sor.groovy [ci skip] Signed-off-by: Paolo Di Tommaso <[email protected]>

Signed-off-by: Ben Sherman <[email protected]> Signed-off-by: Paolo Di Tommaso <[email protected]> Co-authored-by: Paolo Di Tommaso <[email protected]>

Improve -dump-hashes output

f0f38f3

Signed-off-by: Ben Sherman <[email protected]>

bentsherman requested a review from pditommaso October 4, 2023 13:57

bentsherman mentioned this pull request Oct 4, 2023

Add docs page on caching and resuming #4371

Merged

bentsherman added the ready-to-merge label Oct 4, 2023

pditommaso removed the ready-to-merge label Oct 5, 2023

bentsherman added 2 commits October 6, 2023 10:15

Merge branch 'master' into 4367-improve-dump-hashes

c10e0d4

Output as JSON, make optional with -dump-hashes json

f2aa2ee

Signed-off-by: Ben Sherman <[email protected]>

bentsherman added the ready-to-merge label Oct 6, 2023

pditommaso added 2 commits October 15, 2023 15:38

Merge branch 'master' into 4367-improve-dump-hashes

fb8d2fd

Signed-off-by: Paolo Di Tommaso <[email protected]>

Minor changes [ci fast]

50c7797

Signed-off-by: Paolo Di Tommaso <[email protected]>

pditommaso reviewed Oct 15, 2023

View reviewed changes

modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy Outdated Show resolved Hide resolved

Update modules/nextflow/src/main/groovy/nextflow/processor/TaskProces…

4c43753

…sor.groovy [ci skip] Signed-off-by: Paolo Di Tommaso <[email protected]>

pditommaso merged commit 5bdaac9 into master Oct 15, 2023
5 checks passed

pditommaso deleted the 4367-improve-dump-hashes branch October 15, 2023 14:19

bentsherman mentioned this pull request Apr 15, 2024

[feature] Resume: print cache invalidation reason to console #4894

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve `-dump-hashes` output #4369

Improve `-dump-hashes` output #4369

bentsherman commented Oct 4, 2023

netlify bot commented Oct 4, 2023 •

edited

Loading

pditommaso commented Oct 5, 2023

bentsherman commented Oct 5, 2023

pditommaso commented Oct 5, 2023

bentsherman commented Oct 5, 2023

pditommaso commented Oct 5, 2023

bentsherman commented Oct 5, 2023

bentsherman commented Oct 5, 2023

bentsherman commented Oct 5, 2023

pditommaso commented Oct 6, 2023

bentsherman commented Oct 6, 2023

bentsherman commented Oct 6, 2023

Improve -dump-hashes output #4369

Improve -dump-hashes output #4369

Conversation

bentsherman commented Oct 4, 2023

netlify bot commented Oct 4, 2023 • edited Loading

✅ Deploy Preview for nextflow-docs-staging canceled.

pditommaso commented Oct 5, 2023

bentsherman commented Oct 5, 2023

pditommaso commented Oct 5, 2023

bentsherman commented Oct 5, 2023

pditommaso commented Oct 5, 2023

bentsherman commented Oct 5, 2023

bentsherman commented Oct 5, 2023

bentsherman commented Oct 5, 2023

pditommaso commented Oct 6, 2023

bentsherman commented Oct 6, 2023

bentsherman commented Oct 6, 2023

Improve `-dump-hashes` output #4369

Improve `-dump-hashes` output #4369

netlify bot commented Oct 4, 2023 •

edited

Loading