DSL2: Sharding tweaks #1104

Open
TCLamnidis opened this issue Dec 20, 2024 · 0 comments

TCLamnidis commented Dec 20, 2024

One current issue with sharding is that when the mapped BAMs are collected again for merging, the order of the files is inconsistent between runs, so the downstream BAMs and their headers do not have stable checksums.

The inconsistency comes from the @PG header record of the mapping command, which contains the shard filename. Example output from the test profile:

@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa samse -r @RG\tID:ILLUMINA-JK2782_JK2782_TGGCCGATCAACGA\tSM:JK2782\tLB:JK2782_TGGCCGATCAACGA\tPL:illumina\tPU:ILLUMINA-JK2782_TGGCCGATCAACGA-double_stranded-SE ./bwa/Mammoth_MT_Krause JK2782_Mammoth_MT_Krause.sai JK2782_JK2782_TGGCCGATCAACGA_L1.merged.part_002.fastq.gz

I think we could use some channel manipulation to sort the filenames going into shard merging, so that the headers of downstream BAMs are consistent across runs (i.e. the recorded command always shows part_001).
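To illustrate the ordering being proposed, here is a minimal Python sketch of the sort logic only, not of the pipeline's channel code. It assumes shard filenames carry a `.part_NNN.` token as in the example header above. The helper names `shard_index` and `sort_shards` are hypothetical, and the mapped BAM names may follow a different pattern than the FASTQ name shown.

```python
import re

# Assumption: shard names contain ".part_<digits>." as in
# "JK2782_JK2782_TGGCCGATCAACGA_L1.merged.part_002.fastq.gz".
_PART_RE = re.compile(r"\.part_(\d+)\.")


def shard_index(filename: str) -> int:
    """Return the numeric shard index embedded in a shard filename (hypothetical helper)."""
    match = _PART_RE.search(filename)
    if match is None:
        raise ValueError(f"no shard index found in {filename!r}")
    return int(match.group(1))


def sort_shards(filenames):
    """Return filenames ordered by shard index, with the name as a tie-breaker."""
    return sorted(filenames, key=lambda name: (shard_index(name), name))


shards = [
    "JK2782_JK2782_TGGCCGATCAACGA_L1.merged.part_002.fastq.gz",
    "JK2782_JK2782_TGGCCGATCAACGA_L1.merged.part_010.fastq.gz",
    "JK2782_JK2782_TGGCCGATCAACGA_L1.merged.part_001.fastq.gz",
]
ordered = sort_shards(shards)
```

Sorting on the parsed integer rather than the raw string keeps the order stable even if the zero padding of the part number ever changes. With the shards in a fixed order, the first file passed to the merge step would always be part_001.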

I think at that point we could also turn sharding on by default, since it makes sense to use it in most cases, instead of requiring users to configure the resources for their mapper manually.

@TCLamnidis TCLamnidis changed the title Sharding tweaks DSL2: Sharding tweaks Dec 20, 2024