Fix issue 81, "call empty droplets" #301
tested only with kallisto aligner (both with and without automated kallisto filtering with bustools --filter parameter)
Python linting (
tuple val(meta), path ("*.count/counts_unfiltered"), emit: raw_counts // TODO: Add to nf-core/modules before merging PR
tuple val(meta), path ("*.count/counts_filtered") , emit: filtered_counts, optional: true // TODO: Add to nf-core/modules before merging PR
I am aware that this modification belongs in nf-core/modules rather than here, so I added the TODO as a reminder to remove it once testing is done.
…(tested with lamanno)
…l-empty-droplets
…d in the testings (cellranger)
Hi @grst and @apeltzer ,
The empty-drops module
First of all, I have added a script that performs the empty-drops calling and filtering using a library that is available in bioconda. With that, I added a simple module: it takes a matrix file and performs the empty-drops call on it, generating another matrix file.
The inclusion in the workflow
With the module generated, I could then include it in the workflow.
// Run emptydrops calling module
if ( !params.skip_emptydrops ) {
    //
    // emptydrops should only run on the raw matrices, so filter out the filtered results of the aligners that can produce them
    //
    if ( params.aligner in [ 'cellranger', 'cellrangerarc', 'kallisto', 'star' ] ) {
        ch_mtx_matrices_for_emptydrops =
            ch_mtx_matrices.filter { meta, mtx_files ->
                mtx_files.toString().contains("raw_feature_bc_matrix") || // cellranger
                mtx_files.toString().contains("counts_unfiltered")     || // kallisto
                mtx_files.toString().contains("raw")                      // star
            }
    } else {
        ch_mtx_matrices_for_emptydrops = ch_mtx_matrices
    }
    EMPTYDROPS_CELL_CALLING( ch_mtx_matrices_for_emptydrops )
    ch_mtx_matrices = ch_mtx_matrices.mix( EMPTYDROPS_CELL_CALLING.out.filtered_matrices )
}
One thing to note above is that, as discussed previously, I had to add a checker/filter so that only the raw/unprocessed matrices generated by the aligners are passed on, because I think it does not make sense to run the module on the already filtered/processed matrices.
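For readers less familiar with Nextflow channel operators, the selection rule above can be sketched in plain Python. This is only an illustration of the predicate, not pipeline code: the function name `select_for_emptydrops`, the constants, and the example paths are hypothetical, and the marker strings are copied from the filter above.

```python
# Substrings that mark an aligner's raw (unfiltered) output, as in the filter above.
RAW_MARKERS = ("raw_feature_bc_matrix", "counts_unfiltered", "raw")

# Aligners for which the workflow snippet applies the raw-only filter.
ALIGNERS_WITH_FILTER = {"cellranger", "cellrangerarc", "kallisto", "star"}


def select_for_emptydrops(aligner, matrices):
    """Return the (meta, path) pairs that should reach the emptydrops step.

    `matrices` is a list of (meta, path) pairs, standing in for the
    channel items; this is an assumption made for the sketch.
    """
    if aligner not in ALIGNERS_WITH_FILTER:
        # No filter for this aligner: everything is passed on.
        return list(matrices)
    return [
        (meta, path)
        for meta, path in matrices
        if any(marker in str(path) for marker in RAW_MARKERS)
    ]


# Hypothetical kallisto outputs for one sample.
example = [
    ({"id": "pbmc8k"}, "pbmc8k.count/counts_unfiltered"),
    ({"id": "pbmc8k"}, "pbmc8k.count/counts_filtered"),
]
kept = select_for_emptydrops("kallisto", example)
```

Because the check is a plain substring match, any path that happens to contain one of the markers (for example `raw` in a parent directory name) would also pass.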
Changes in conversion modules
Because we now have both the data coming directly from the aligners and a custom-made filtering module, I had to change the conversion modules a bit so that they can tell the raw/filtered matrices of the aligners themselves apart from the output of the custom empty-drops filter. To avoid confusing the user, I added suffixes to the generated converted files, so now we write data with these suffixes: *_{raw,filtered,custom_emptydrops_filter}_matrix.{h5ad,rds} In this case, the meanings are:
I also had to update the two conversion scripts for rds and h5ad, because they now also have to understand this newly generated matrix, and to allow them to also transpose the matrices from cellranger, because, when we do the normal conversion of cellranger matrices, we use the
Integrating with the aligners
Finally, the changes made to the aligner connections to integrate this module have to be discussed, one by one.
Alevin
Because alevin only produces one matrix result, rather than a raw/filtered pair as cellranger, star and kallisto do, the integration was seamless and required no major change, just connecting it to the module. When using it, the results generated are the following:
alevin_run/
├── alevin
│   ├── mtx_conversions
│   │   ├── combined_custom_emptydrops_filter_matrix.h5ad
│   │   ├── combined_raw_matrix.h5ad
│   │   ├── pbmc8k
│   │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│   │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│   │   │   ├── pbmc8k_raw_matrix.h5ad
│   │   │   └── pbmc8k_raw_matrix.rds
│   │   └── versions.yml
│   ├── pbmc8k
│   │   └── emptydrops_filtered
│   │       ├── quants_mat_cols.txt
│   │       ├── quants_mat.mtx
│   │       └── quants_mat_rows.txt
│   ├── pbmc8k_alevin_results
│   │   ├── af_map
│   │   │   ├── alevin
│   │   │   ├── aux_info
│   │   │   ├── cmd_info.json
│   │   │   ├── libParams
│   │   │   ├── logs
│   │   │   ├── map.rad
│   │   │   └── unmapped_bc_count.bin
│   │   ├── af_quant
│   │   │   ├── alevin
│   │   │   ├── all_freq.bin
│   │   │   ├── collate.json
│   │   │   ├── featureDump.txt
│   │   │   ├── generate_permit_list.json
│   │   │   ├── map.collated.rad
│   │   │   ├── permit_freq.bin
│   │   │   ├── permit_map.bin
│   │   │   ├── quant.json
│   │   │   └── unmapped_bc_count_collated.bin
│   │   └── simpleaf_quant_log.json
├── alevinqc
├── fastqc
├── multiqc
└── pipeline_info
Kallisto
For kallisto, I first had to include a new parameter in the pipeline, called,
Finally, to make sure I had a channel that could be used properly, and that we could separate raw from filtered data to choose what to pass on to the empty-drops filter module, I had to modify the channels generated by the module (see here).
tuple val(meta), path ("*.count") , emit: count
tuple val(meta), path ("*.count/counts_unfiltered"), emit: raw_counts // TODO: Add to nf-core/modules before merging PR
tuple val(meta), path ("*.count/counts_filtered") , emit: filtered_counts, optional: true // TODO: Add to nf-core/modules before merging PR
Then, of course, I updated the downstream code in the sub-workflows and the workflow to handle it.
Results look like this, when
kallisto_lamanno_run
├── fastqc
├── kallisto
│   ├── mtx_conversions
│   │   ├── combined_custom_emptydrops_filter_matrix.h5ad
│   │   ├── combined_filtered_matrix.h5ad
│   │   ├── combined_raw_matrix.h5ad
│   │   ├── pbmc8k
│   │   │   ├── pbmc8k_spliced_matrix.h5ad
│   │   │   ├── pbmc8k_spliced_matrix.rds
│   │   │   ├── pbmc8k_unspliced_matrix.h5ad
│   │   │   └── pbmc8k_unspliced_matrix.rds
│   │   └── versions.yml
│   ├── pbmc8k.count
│   │   ├── 10x_version2_whitelist.txt
│   │   ├── counts_filtered
│   │   │   ├── spliced.barcodes.txt
│   │   │   ├── spliced.genes.txt
│   │   │   ├── spliced.mtx
│   │   │   ├── unspliced.barcodes.txt
│   │   │   ├── unspliced.genes.txt
│   │   │   └── unspliced.mtx
│   │   ├── counts_unfiltered
│   │   │   ├── spliced.barcodes.txt
│   │   │   ├── spliced.genes.txt
│   │   │   ├── spliced.mtx
│   │   │   ├── unspliced.barcodes.txt
│   │   │   ├── unspliced.genes.txt
│   │   │   └── unspliced.mtx
│   │   ├── emptydrops_filtered
│   │   │   ├── spliced.barcodes.txt
│   │   │   ├── spliced.genes.txt
│   │   │   ├── spliced.mtx
│   │   │   ├── unspliced.barcodes.txt
│   │   │   ├── unspliced.genes.txt
│   │   │   └── unspliced.mtx
│   │   ├── filter_barcodes.txt
│   │   ├── inspect.json
│   │   ├── inspect.spliced.json
│   │   ├── inspect.unspliced.json
│   │   ├── kb_info.json
│   │   ├── matrix.ec
│   │   ├── output.bus
│   │   ├── output.filtered.bus
│   │   ├── output.unfiltered.bus
│   │   ├── run_info.json
│   │   ├── spliced.filtered.bus
│   │   ├── spliced.unfiltered.bus
│   │   ├── transcripts.txt
│   │   ├── unspliced.filtered.bus
│   │   └── unspliced.unfiltered.bus
│   └── versions.yml
├── multiqc
└── pipeline_info
kallisto_run
├── fastqc
├── kallisto
│   ├── mtx_conversions
│   │   ├── combined_custom_emptydrops_filter_matrix.h5ad
│   │   ├── combined_raw_matrix.h5ad
│   │   ├── pbmc8k
│   │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│   │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│   │   │   ├── pbmc8k_raw_matrix.h5ad
│   │   │   └── pbmc8k_raw_matrix.rds
│   │   └── versions.yml
│   ├── pbmc8k.count
│   │   ├── 10x_version2_whitelist.txt
│   │   ├── counts_unfiltered
│   │   │   ├── cells_x_genes.barcodes.txt
│   │   │   ├── cells_x_genes.genes.txt
│   │   │   └── cells_x_genes.mtx
│   │   ├── emptydrops_filtered
│   │   │   ├── cells_x_genes.barcodes.txt
│   │   │   ├── cells_x_genes.genes.txt
│   │   │   └── cells_x_genes.mtx
│   │   ├── inspect.json
│   │   ├── kb_info.json
│   │   ├── matrix.ec
│   │   ├── output.bus
│   │   ├── output.unfiltered.bus
│   │   ├── run_info.json
│   │   └── transcripts.txt
│   └── versions.yml
├── multiqc
└── pipeline_info
Cellranger
For cellranger, basically the same happened as for kallisto. The difference is that cellranger always produces a raw/filtered pair, so I just had to modify the channels to account for that, which makes the filtering later on easier (see here). I also had to update the conversion scripts because the emptydrops filter module does not produce a
Results for it look like this:
cellranger_run/
├── cellranger
│   ├── count
│   │   ├── pbmc8k
│   │   │   ├── emptydrops_filtered
│   │   │   └── outs
│   │   └── versions.yml
│   ├── mkgtf
│   │   └── genome_genes.filtered.gtf
│   ├── mkref
│   │   ├── cellranger_reference
│   │   │   ├── fasta
│   │   │   ├── genes
│   │   │   ├── reference.json
│   │   │   └── star
│   │   └── versions.yml
│   └── mtx_conversions
│       ├── combined_custom_emptydrops_filter_matrix.h5ad
│       ├── combined_filtered_matrix.h5ad
│       ├── combined_raw_matrix.h5ad
│       ├── pbmc8k
│       │   ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
│       │   ├── pbmc8k_custom_emptydrops_filter_matrix.rds
│       │   ├── pbmc8k_filtered_matrix.h5ad
│       │   ├── pbmc8k_filtered_matrix.rds
│       │   ├── pbmc8k_raw_matrix.h5ad
│       │   └── pbmc8k_raw_matrix.rds
│       └── versions.yml
├── fastqc
├── multiqc
└── pipeline_info
STAR
For star, it is basically the same as for cellranger. It always produces a raw/filtered pair, but I had to adjust the out-channels to make the filtering/selection easier, and of course adjust all the downstream channel selections to account for them. The results for it look like this:
star_run/
├── fastqc
├── multiqc
├── pipeline_info
└── star
    ├── mtx_conversions
    │   ├── combined_custom_emptydrops_filter_matrix.h5ad
    │   ├── combined_filtered_matrix.h5ad
    │   ├── combined_raw_matrix.h5ad
    │   ├── pbmc8k
    │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.h5ad
    │   │   ├── pbmc8k_custom_emptydrops_filter_matrix.rds
    │   │   ├── pbmc8k_filtered_matrix.h5ad
    │   │   ├── pbmc8k_filtered_matrix.rds
    │   │   ├── pbmc8k_raw_matrix.h5ad
    │   │   └── pbmc8k_raw_matrix.rds
    │   └── versions.yml
    └── pbmc8k
        ├── emptydrops_filtered
        │   ├── barcodes.tsv
        │   ├── features.tsv
        │   └── matrix.mtx
        ├── pbmc8k.Aligned.sortedByCoord.out.bam
        ├── pbmc8k.Log.final.out
        ├── pbmc8k.Log.out
        ├── pbmc8k.Log.progress.out
        ├── pbmc8k.SJ.out.tab
        ├── pbmc8k.Solo.out
        │   ├── Barcodes.stats
        │   └── Gene
        └── versions.yml
UniverSC and cellrangerarc
I could not even run them, so they could be neither tested nor integrated. Just not sure what should go first.
About the out-channels
You will see in the I do this to guarantee all results are going to the downstream analysis.
Looks good to me apart from some parts:
- Changes in the module code --> need to go upstream, ideally open PRs already for this
- Upgrades in the respective upstream modules might be necessary; check out the cellranger update PR, which I believe could be merged prior to updating the modules here. That should be easy to do.
- Some more docs on what this feature does would be helpful, so that people can see it both in the changelog and in the main documentation
Yes, the changes in the modules I added as a About the docs, I will work on it.
…l-empty-droplets
sounds good
@nf-core-bot fix linting
Now that kallisto was updated and the workflows it provides are different, I will have to test it with them as well.
Hi @grst , Can you take a look again at the changes?
The only one missing is the last one, which is currently running.
I'm wondering if, instead of updating all those modules, it would be easier to do something like
ch_filtered = ch_out.map {
    meta, files -> [meta, files.findAll { it -> it.toString().contains("filtered") }]
}
That was my first try. But many of them use ** to grab the files, so the channel catches the files and not the directories. When filtering, we would then select only the files that have "filtered" in their own name, instead of the directory and everything inside it. I added some information in the last section of this comment #301 (comment)
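The difference described in this reply can be illustrated with a small Python sketch. It is a simplified model rather than Nextflow behaviour: the file list is a hypothetical stand-in for what a recursive `**` glob might emit, and the helper names `by_file_name` and `by_directory` are made up for the example.

```python
from pathlib import PurePosixPath

# Hypothetical flat file list, standing in for the output of a "**" glob.
files = [
    "pbmc8k.count/counts_filtered/cells_x_genes.mtx",
    "pbmc8k.count/counts_filtered/cells_x_genes.barcodes.txt",
    "pbmc8k.count/counts_unfiltered/cells_x_genes.mtx",
    "pbmc8k.count/output.filtered.bus",
]


def by_file_name(paths, token):
    """Keep paths whose own file name contains the token."""
    return [p for p in paths if token in PurePosixPath(p).name]


def by_directory(paths, dirname):
    """Keep paths that live under a directory with exactly this name."""
    return [p for p in paths if dirname in PurePosixPath(p).parts[:-1]]


# Matching on the file name misses the matrix files inside counts_filtered/.
name_matches = by_file_name(files, "filtered")

# Matching on the directory name returns the directory's contents instead.
# An exact name is used because "unfiltered" also contains "filtered".
dir_matches = by_directory(files, "counts_filtered")
```

Emitting the directories themselves as separate outputs, as the modified modules do, avoids having to reconstruct them from a flat file list.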
I see, fair enough then
…rs to avoid file collision
Good to go as soon as the tests pass!
Pipeline execution terminated. As said here, the nf-core modules were updated and the TODOs removed. The documentation was updated. Also, the workflow (kallisto) was updated so that the new modules work for both non-standard kallisto workflows, lamanno and nac. The results structure and naming are produced as said here. Finally, all tests passed, so I am merging the PR 😄
Right now, I am just opening a draft PR as an attempt to solve issue #81, so that it is easier to keep track of modifications.
Once the work is "done", I will add a thorough overview of the changes, with explanations of the main modifications and a list of TODOs to be addressed before merging.
Only then will I add some reviewers.