Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix createyml random image matching #221

Merged
merged 10 commits into from
Aug 26, 2024

Conversation

ViktorHy
Copy link
Collaborator

@ViktorHy ViktorHy commented Aug 21, 2024

Description and reviewer info

In this PR I aim to fix two issues.

First, eklipse images from trios would always be random, or rather the image that was created last in the pipeline, To fix this I have added proband ID information to create_yaml process in main.nf and use that information in create_yml.pl to select correct image.

Second, some genelist would go missing without any reason from yaml-files. I tracked the issue down to a crappy regex that matched anything with test and ignored them. This resulted in missing all intestinal genelists. I removed the regex-part and cleaned up the code around genelist a bit.

Moreover, I also changed some variable names to be more readable, removed some errors and made sure to only output to STDERR.

Type of change

  • Documentation
  • Patch
  • Minor change
  • Major change

Checklist

  • Self-review of my code
  • Update the CHANGELOG
  • Tag the latest commit (vX.Y.Z format)
  • Log samples used for testing in the Verification_samples_log Excel sheet

Patch

  • Stub run completes without errors or new warnings
  • At least one other person has reviewed and approved my code (not required for trivial changes)

Test/review documentation

Review performed by

  • Alexander
  • Jakob
  • Paul
  • Ryan
  • Viktor

(Add if missing)

Testing performed by

  • Alexander
  • Jakob
  • Paul
  • Ryan
  • Viktor

@ViktorHy
Copy link
Collaborator Author

I just realized I could also filter the eklipse channel to only pass on proband image. Would require less create_yml.pl fixes. I dont know what you guys prefer? @alkc @Jakob37

@Jakob37
Copy link
Contributor

Jakob37 commented Aug 22, 2024

Can we briefly discuss the overall strategy here @ViktorHy before I wrap the reviewing.

This seems fragile to me, to do this filtering out in the process based on the position. If we change something upstreams of yml_diag later, this might very well break.

set group, id, sex, mother, father, phenotype, diagnosis, type, assay, clarity_sample_id, ffpe, analysis, type, file(ped), file(INFO) from yml_diag.filter { it[7] == 'proband' }.join(ped_scout).join(yaml_INFO).view()

On the other hand, it seems like script itself does not need, and thus shouldn't, know about whether it is processing a proband file or not.

So the right place to handle this seems to be on the nextflow level.

I tested if you could filter on "name" instead of position in the yml_diag channel. It seems this cannot be done after converting it into a tuple, but it could be done before (tested using nextflow console).

Channel
    .fromPath("trio.csv")
    .splitCsv(header: true)
    .filter { row -> row.type == "proband" }
    .map { row -> tuple(row.type, row.group, row.id, row.sex) }.view() 

As the yml_diag only seems to be used in this process. Could we do a separate channel with only the proband info instead? And then feed the csv_proband_only. I think this would give the same result as what you have here 🤔

I just realized I could also filter the eklipse channel to only pass on proband image. Would require less create_yml.pl fixes. I dont know what you guys prefer? @alkc @Jakob37

If we could handle that in a named way as the example above, then I think you are right - we should offload create_yml.pl this responsibility and handle it on the nextflow level.

Copy link
Contributor

@Jakob37 Jakob37 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this works in testing - great! Some minor comments.

my @g_c = split/,/,$opt{g};
### Proband ### Could differ from group, needed to select correct eklipse image
### Clarity-ID ###
my @g_c;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of this variable name is not clear to me (I realize this is not from this PR)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes its an abbreviated name, group_clarity. And the --g flag was from it being only group in the beginning.

@@ -132,52 +132,75 @@
}

### Group ###
if (!defined $opt{g}) { print STDERR "need group name"; exit;}
my @g_c = split/,/,$opt{g};
### Proband ### Could differ from group, needed to select correct eklipse image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition - early fail and clear user info when calling with wrong input. I like it

print STDERR $_,"\n";
if ($tmp[0] eq "BAM") {
$INFO{BAM}->{$tmp[1]} = $tmp[2];
my $category = $tmp[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

}
close INFO;
print Dumper(%INFO);
my $info_json = to_json(\%INFO, { pretty => 1, indent => 4 });
print STDERR ($info_json);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For debugging only (to be removed) or for production (to be kept)? (Just wondering)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I thought I'd keep it, but then again the input file shows the same information

@@ -121,7 +121,7 @@
if ($opt{assay}) {
my @a_a = split/,/,$opt{assay};
$assay = $a_a[0];
if ($a_a[1] ne 'false' && $a_a[1]) {
if ($a_a[1]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -203,9 +219,12 @@
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding a STDERR message in the if statement here. To give us a tiny chance to see this if ever triggered

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it really should be triggered everytime, since the regex really does not look for anything being added. Maybe I should just remove this. The funciton this was to solve has been solved elsewhere by using flags in loqusdb processes instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK removing sounds even better!

foreach my $key (@{$data}) {
if (ref $key->{institute} eq 'ARRAY') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, tricky to follow here. Guessing you know this chunk well enough, so that no new issues are introduced here ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is some really old legacy code. It's from the time we had a manually cured list of genepanels. Now it was just confusing to keep

@@ -1695,14 +1695,18 @@ process run_eklipse {
publishDir "${OUTDIR}/plots/mito", mode: 'copy', overwrite: 'true', pattern: '*.png'

input:
set group, id, file(bam), file(bai) from eklipse_bam

set group, id, file(bam), file(bai), sex, type from eklipse_bam.join(meta_eklipse, by: [0,1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the [0, 1] due to the id and group flipping? We should really sort that out soon ... Maybe we can replace all the group + id with a meta object. Then we could also bunch the type and sex into that one. An issue for another day though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not so far away though. Seems like something to sort out before transitioning to DSL2

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not because of the flipping. It's just to say that it should match on both. The result is that the group and id channel does not become a wierd list.
if I only match on group it will be:
group, id1, bam, bai, id1/2/3(random), sex, type.
if I only match on id it will be:
[group,group], id, bam, bai, sex, type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see!

main.nf Show resolved Hide resolved
main.nf Show resolved Hide resolved
@ViktorHy
Copy link
Collaborator Author

Tests passed for wgs single, trio and onko samples

@ViktorHy ViktorHy merged commit 1d86c92 into master Aug 26, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants