Add RNU-gene and reference files cleanup #231

Merged
merged 3 commits into from
Sep 26, 2024

Conversation

Jakob37
Contributor

@Jakob37 Jakob37 commented Sep 24, 2024

Description and reviewer info

  • Added RNU2-4 gene to bed file
  • Moved input data to a sub dir in reference_tools
  • Updated the config to point to the new intersect bed location
  • Checked the diffs using bedtools intersect -a <> -b <> -v, confirming that the changes were some added ClinVar entries (~1000), a few removed, and the added RNU gene (please double-check as a reviewer)
  • Organized the .gitignore a bit (just shifting things around)
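
For reviewers who want to reason about the diff check without bedtools at hand, the idea behind `bedtools intersect -a A -b B -v` (report entries in A that overlap nothing in B) can be sketched in plain Python. This is only an illustration, not the tool's implementation: the function names and the toy intervals are made up for the example, and it assumes half-open BED coordinates.

```python
from collections import defaultdict


def parse_bed(lines):
    """Parse BED-like lines into (chrom, start, end, name) tuples."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name = fields[3] if len(fields) > 3 else ""
        records.append((fields[0], int(fields[1]), int(fields[2]), name))
    return records


def entries_without_overlap(a_records, b_records):
    """Return records from A that overlap no record in B (like `intersect -v`)."""
    b_by_chrom = defaultdict(list)
    for chrom, start, end, _ in b_records:
        b_by_chrom[chrom].append((start, end))

    unique_to_a = []
    for record in a_records:
        chrom, start, end, _ = record
        # Half-open intervals overlap when each starts before the other ends.
        overlaps = any(start < b_end and b_start < end
                       for b_start, b_end in b_by_chrom[chrom])
        if not overlaps:
            unique_to_a.append(record)
    return unique_to_a


# Toy data (hypothetical, not taken from the real bed files).
old_bed = parse_bed(["1 100 200 exonA", "1 500 600 exonB"])
new_bed = parse_bed(["1 100 200 exonA", "1 550 650 exonB2", "2 10 20 added"])

added = entries_without_overlap(new_bed, old_bed)
removed = entries_without_overlap(old_bed, new_bed)
print(len(added), len(removed))
```

Running the comparison in both directions gives the "added" and "removed" sets separately. An entry that merely shifted but still overlaps its old position shows up in neither.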

I wonder if we should move the config and things like reference data out of the repo itself. The workflow overall should be location-agnostic, but these parts are specific to us in Lund (and include things we might rather keep private, such as exact file locations on our servers). Discussion for the future.

I'll test run onco, wgs and wgs trio, and verify that things look OK.

Type of change

  • Documentation
  • Patch
  • Minor change
  • Major change

Checklist

  • Self-review of my code
  • Update the CHANGELOG
  • Tag the latest commit (vX.Y.Z format)
  • Log samples used for testing in the Verification_samples_log Excel sheet

Documentation

  • At least one other person has reviewed my changes (not required for trivial changes)

Patch

  • Stub run completes without errors or new warnings
  • At least one other person has reviewed and approved my code (not required for trivial changes)

Major / Minor change

  • Stub run completes without errors or new warnings
  • onco run finishes without any new warnings/errors and the results can
    be loaded into scout
  • wgs single run finishes without any new warnings/errors and the results
    can be loaded into scout
  • wgs trio run finishes without any new warnings/errors and the results
    can be loaded into scout
  • At least one other person has reviewed and approved my code
  • I have made corresponding changes to the documentation (software versions, etc.)

Test/review documentation

Review performed by

  • Alexander
  • Jakob
  • Paul
  • Ryan
  • Viktor

(Add if missing)

Testing performed by

  • Alexander
  • Jakob
  • Paul
  • Ryan
  • Viktor

@Jakob37 Jakob37 marked this pull request as ready for review September 24, 2024 13:42
@Jakob37 Jakob37 requested a review from alkc September 24, 2024 13:42
@Jakob37 Jakob37 changed the title from "Update reference tools structure and update config to new version" to "Add RNU-gene and reference files cleanup" Sep 24, 2024
Contributor

@alkc alkc left a comment

Got one question below about added variants in the updated bed file.

@@ -113,7 +113,7 @@ profiles {
params.outdir = "${params.resultsdir}${params.dev_suffix}"
params.subdir = 'wgs'
params.crondir = "${params.outdir}/cron/"
params.intersect_bed = "${params.refpath}/bed/wgsexome/exons_108padded20bp_clinvar-20231230padded5bp.bed"
Contributor

Checked the diffs using bedtools intersect -a <> -b <> -v, confirming that the changes was some added ClinVar (~1000), a few removed, and the added RNU-gene (please double check as a reviewer)

Sure about that ~1000 added, or is it a typo? I get 168 new entries when comparing the new file with the old:

(alkc-base) alkc@MTLUCMDS1:~$ bedtools intersect -a /fs1/resources/ref/hg38//bed/wgsexome/exons_108padded20bp_clinvar-20240825padded5bp.bed -b /fs1/resources/ref/hg38//bed/wgsexome/exons_108padded20bp_clinvar-20231230padded5bp.bed -v | wc -l
168
(alkc-base) alkc@MTLUCMDS1:~$ bedtools intersect -b /fs1/resources/ref/hg38//bed/wgsexome/exons_108padded20bp_clinvar-20240825padded5bp.bed -a /fs1/resources/ref/hg38//bed/wgsexome/exons_108padded20bp_clinvar-20231230padded5bp.bed -v | wc -l
33

RNU confirmed added w/ correct coordinates:

(alkc-base) alkc@MTLUCMDS1:~$ bedtools intersect -a /fs1/resources/ref/hg38//bed/wgsexome/exons_108padded20bp_clinvar-20240825padded5bp.bed -b /fs1/resources/ref/hg38//bed/wgsexome/exons_108padded20bp_clinvar-20231230padded5bp.bed -v | grep RNU
17      43387226        43387417        RNU2-4
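
A name-based lookup like the `grep RNU` above could also be scripted, for example as a small regression check. The sketch below is only one way to do it: `find_by_name` is a hypothetical helper, and the input is the single line reported above rather than the real bed file.

```python
def find_by_name(bed_lines, name_prefix):
    """Return (chrom, start, end, name) for entries whose name starts with name_prefix."""
    hits = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[3].startswith(name_prefix):
            hits.append((fields[0], int(fields[1]), int(fields[2]), fields[3]))
    return hits


# The single line reported by the grep above; a real check would read the bed file.
bed_lines = ["17\t43387226\t43387417\tRNU2-4"]

hits = find_by_name(bed_lines, "RNU")
expected = ("17", 43387226, 43387417, "RNU2-4")
print(expected in hits)
```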

Contributor Author

Sure about that ~1000 added, or is it a typo? I get 168 new entries when comparing the new file with the old:

Here is the log output:

INFO: Clinvar in common between versions: 231560
INFO: Added new (unique targets): 42111 (1305)
INFO: Removed old (unique targets): 18330 (37)

But, this comparison is between the ClinVar vcfs, not between the final intersect files. Maybe most of the new entries are overlapping with the exons.

I'll see if I can verify that hypothesis.

Contributor Author

OK, I understand this now. The logged numbers are for targets added to and removed from the bed file based on the new ClinVar, i.e. what isn't present in the exons, Agilent and our custom bed.

I think the number is correct, but I'll see if I can clarify the code / output a bit so this is clear next time around.
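
To illustrate why the variant count and the target count can differ so much, here is a minimal sketch of the filtering idea described above. It is not the actual reference_tools code. It assumes that each ClinVar variant is padded by 5 bp (as the bed file name suggests) and only counts as a new target when it is not already fully inside a region from the other sources. All names and data are hypothetical.

```python
def pad_variant(chrom, pos, padding=5):
    """Turn a 1-based variant position into a padded, half-open BED interval."""
    start = max(0, pos - 1 - padding)
    end = pos + padding
    return (chrom, start, end)


def is_covered(interval, regions):
    """True if the interval lies entirely inside one of the existing regions."""
    chrom, start, end = interval
    return any(r_chrom == chrom and r_start <= start and end <= r_end
               for r_chrom, r_start, r_end in regions)


# Hypothetical existing regions (exons, capture kit and custom bed combined).
existing_regions = [("1", 1000, 2000), ("2", 500, 900)]

# Hypothetical ClinVar variants as (chrom, 1-based position).
clinvar_variants = [("1", 1500), ("1", 5000), ("1", 5002), ("2", 700), ("3", 42)]

padded = [pad_variant(chrom, pos) for chrom, pos in clinvar_variants]
new_targets = [iv for iv in padded if not is_covered(iv, existing_regions)]

print(f"ClinVar variants: {len(clinvar_variants)}")
print(f"Targets not already covered: {len(new_targets)}")
```

Whether "covered" should mean full containment or any overlap is a design choice; the sketch uses full containment, which may differ from what the real script does.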

I'll re-request review when I am done with this.

@@ -0,0 +1 @@
17 43387226 43387417 RNU2-4
Contributor

@alkc alkc Sep 25, 2024

Coordinates check out 👍

@Jakob37
Contributor Author

Jakob37 commented Sep 26, 2024

OK, I have worked through the ClinVar adding/removing logic again. What confused me before was the difference between two numbers:

  1. The total number of ClinVar entries added into the bed (i.e. independent of previously being present)
  2. The number of these not previously present

I extended the log to include this:

INFO: New ClinVar variants (total added ClinVar sites) (new ClinVar sites): 42111 (1305) (183)
INFO: Removed ClinVar variants (old removed ClinVar sites): 18330 (37)

This does not change the number of entries included in the output, only the logging.

Beyond this, I have meditated for quite some time about why 183. When reducing overlapping ranges it goes down to 175. The remaining bunch are a few Pathogenic entries which for some reason aren't seen as new compared to the previous run. But they are included in the final version, so I don't think it is dangerous.
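
The drop when reducing overlapping ranges can be pictured with a small merge routine. This is a sketch under the assumption that overlapping or touching padded sites are collapsed into one range; the function name and the sites are hypothetical and not taken from the pipeline.

```python
def merge_overlapping(intervals):
    """Merge overlapping or touching half-open intervals per chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of adding a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical padded ClinVar sites; two of them overlap.
sites = [("17", 100, 111), ("17", 105, 116), ("17", 300, 311), ("2", 50, 61)]
merged = merge_overlapping(sites)
print(len(sites), "->", len(merged))
```

Nearby variants whose padded windows overlap end up as a single range, so the number of ranges is at most the number of sites.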

@Jakob37 Jakob37 requested a review from alkc September 26, 2024 07:05
@Jakob37 Jakob37 merged commit 24e7e40 into master Sep 26, 2024
1 check passed
2 participants