Modes
In principle, the mgatk CLI produces the same final output files for each of the major processing modes. However, depending on the source and format of your input data and the computational resources available, you should select the mode that suits your situation best. Here is a breakdown of the key considerations:
If you have a collection of .bam files, one per sample, use the call mode.
If you have data from a 10x Genomics library, use the tenx mode.
If you have single-cell data, or a setting where multiple samples are in one .bam file but have unconventional barcodes, or you don't yet know which barcodes you want to analyze, use the bcall mode.
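The considerations above can be condensed into a small decision helper. This is only an illustrative sketch: the function name `choose_mode` and its three boolean parameters are hypothetical and are not part of mgatk.

```python
def choose_mode(one_bam_per_sample: bool, is_10x: bool, barcodes_known: bool) -> str:
    """Suggest an mgatk mode from the considerations listed above.

    Hypothetical helper for illustration only; mgatk does not ship this function.
    """
    if one_bam_per_sample:
        # A folder of .bam files, each treated as its own sample.
        return "call"
    if is_10x and barcodes_known:
        # 10x Genomics data with a known barcode list.
        return "tenx"
    # Unconventional barcodes, or barcodes not known in advance.
    return "bcall"


print(choose_mode(one_bam_per_sample=False, is_10x=True, barcodes_known=True))
```

Treat the result as a starting point; the sections on each mode describe the trade-offs in more detail.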
The call mode tells mgatk to look at a directory, identify all .bam files, and treat each file as its own sample. This mode is well suited to 1) Fluidigm C1, Smart-seq2, and other plate-based assays or 2) bulk genomics samples. To run this mode, specify the path to the directory and add any other options you need:
mgatk call -i folder_of_bam_files ...
The bcall mode is the most flexible for calling genotypes from single-cell data, but it comes at a cost. In brief, this mode uses a user-specified SAM tag (e.g. CB) to identify distinct cells. The cells can either come from a known list of barcodes (using the -b FILE option to point to a FILE of barcodes, one per line) or be identified as barcodes with more than X mtDNA reads (using the -mb X option). The price for this flexibility is that mgatk will split the master .bam file into (often) thousands of single-cell .bam files and then process them. While the snakemake scheduler achieves this relatively quickly, opening thousands of files can still strain the file system. One can use the -ns flag to reduce the number of files open at once, depending on the file system specifications, though the more files that can be processed in parallel, the faster the computation will complete.
To use bcall, specify a valid .bam file containing mtDNA reads and either the -b or -mb option with valid parameters:
mgatk bcall -i path_to_bam_file ...
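To make the idea behind the read-count threshold concrete, here is a minimal sketch of selecting barcodes with more than X reads. It works on a plain list of barcode strings rather than a real .bam file, and it is not mgatk's implementation; the function name `barcodes_above_threshold` and the toy barcodes are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List


def barcodes_above_threshold(read_barcodes: Iterable[str], min_reads: int) -> List[str]:
    """Return the barcodes seen in strictly more than `min_reads` reads.

    `read_barcodes` holds one barcode per read, e.g. the value of the
    barcode SAM tag for each mtDNA read (hypothetical input for this sketch).
    """
    counts = Counter(read_barcodes)
    return sorted(bc for bc, n in counts.items() if n > min_reads)


# Toy data: three reads for AAAC, two for GGTT, one for CCCA.
reads = ["AAAC", "AAAC", "AAAC", "GGTT", "GGTT", "CCCA"]
print(barcodes_above_threshold(reads, 1))  # -> ['AAAC', 'GGTT']
```

A higher threshold keeps fewer barcodes, which also means fewer per-cell files to split and process.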
Another benefit of bcall is that, since each cell is split into its own .bam file, one can elect to keep these filtered, deduplicated, per-cell .bam files by passing the -qc flag, which may be useful for other downstream applications.
The tenx mode uses the conventions of the 10x Genomics .bam file. Specifically, the 16 bp barcode and the optional UMI enable more intelligent processing that avoids splitting the original .bam file into thousands of separate files, so the run time is also shorter. Importantly, for appropriate input (i.e., CellRanger or CellRanger-ATAC output), bcall and tenx will give identical results.
The basic input requires both a .bam file and a plaintext file specifying known high-quality (HQ) barcodes for analysis, such as those produced by the CellRanger knee call:
mgatk tenx -i path_to_bam_file -b known_barcodes_file ...
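Before handing a barcode file to the command above, it can help to sanity-check its contents. The sketch below deduplicates the lines and checks that each one looks like a 16 bp barcode. The optional "-1" style suffix is an assumption based on typical CellRanger output, and the helper `clean_barcodes` is hypothetical; the barcodes in your file must match the barcode tag values in your .bam file, whatever their format.

```python
import re
from typing import Iterable, List

# 16 nucleotides, optionally followed by a "-<digits>" suffix.
# Assumption: CellRanger-style barcodes such as "AAACCCAAGAAACACT-1".
BARCODE_PATTERN = re.compile(r"^[ACGTN]{16}(-\d+)?$")


def clean_barcodes(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blanks and duplicates, and validate the format."""
    seen = set()
    kept = []
    for line in lines:
        barcode = line.strip()
        if not barcode or barcode in seen:
            continue
        if not BARCODE_PATTERN.match(barcode):
            raise ValueError(f"unexpected barcode format: {barcode!r}")
        seen.add(barcode)
        kept.append(barcode)
    return kept


example = ["AAACCCAAGAAACACT-1\n", "\n", "AAACCCAAGAAACACT-1\n", "AAACCCAAGAAACCAT-1\n"]
print(clean_barcodes(example))
```

Writing the returned list back out, one barcode per line, gives a file in the one-per-line layout described for the -b option.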
The three modes highlighted above will be the meat and potatoes of using mgatk, though the following two may also come in handy:
The check mode functions as a drop-in for the examples shown above. Simply replace bcall, call, or tenx with check and run your command with all appropriate flags. mgatk will then attempt an execution and report any problems (e.g. incorrect file paths, an errant reference genome specification, or missing dependencies) up front. This helps ensure that you don't waste minutes or hours only to find that the genotyping command failed at the last step.
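One way to use this in a script is to run the pre-flight command first and only start the real run if it succeeds. The sketch below assumes the pre-flight mode is invoked as `mgatk check`; the helper names `with_preflight` and `run_all` and the file names are hypothetical.

```python
import subprocess
from typing import List, Sequence


def with_preflight(mode: str, args: Sequence[str]) -> List[List[str]]:
    """Build the pre-flight command followed by the real command.

    Assumption: the pre-flight mode accepts the same flags as the real mode.
    """
    if mode not in {"call", "bcall", "tenx"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    return [["mgatk", "check", *args], ["mgatk", mode, *args]]


def run_all(commands: Sequence[Sequence[str]]) -> None:
    """Run the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(list(command), check=True)


# Hypothetical file names, shown only to illustrate the command layout.
commands = with_preflight("tenx", ["-i", "possorted.bam", "-b", "barcodes.txt"])
for command in commands:
    print(" ".join(command))
# run_all(commands)  # uncomment on a machine where mgatk is installed
```

Because `subprocess.run(..., check=True)` raises on a non-zero exit status, the second command would not start if the first one fails.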
The support mode is a simple mode that shows which built-in contigs (reference genomes) are available:
Sun Aug 09 21:01:09 PDT 2020: mgatk v0.5.7
Sun Aug 09 21:01:09 PDT 2020: List of built-in genomes supported in mgatk:
Sun Aug 09 21:01:09 PDT 2020: ['GRCh37', 'GRCh38', 'GRCm38', 'GRCz10', 'NC_012920', 'hg19', 'hg19_chrM', 'hg38', 'mm10', 'mm9', 'rCRS']
Sun Aug 09 21:01:09 PDT 2020: Specify one of these genomes or provide your own .fasta file with the --mito-genome flag
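If you want to check programmatically whether a genome is built in, the list line of the log above can be parsed. This is a sketch that assumes the log keeps the format shown here (a Python-style list at the end of the line); `parse_genome_list` is a hypothetical helper, not an mgatk function.

```python
import ast
from typing import List


def parse_genome_list(log_line: str) -> List[str]:
    """Extract the bracketed list of genome names from one log line."""
    start = log_line.index("[")
    end = log_line.rindex("]") + 1
    # literal_eval safely parses the list literal without executing code.
    genomes = ast.literal_eval(log_line[start:end])
    return [str(genome) for genome in genomes]


line = (
    "Sun Aug 09 21:01:09 PDT 2020: ['GRCh37', 'GRCh38', 'GRCm38', 'GRCz10', "
    "'NC_012920', 'hg19', 'hg19_chrM', 'hg38', 'mm10', 'mm9', 'rCRS']"
)
genomes = parse_genome_list(line)
print("rCRS" in genomes)  # -> True
```

A genome that is not in the list can still be used by providing your own .fasta file with the --mito-genome flag, as the log output notes.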
See this part of the documentation if you don't see the reference genome of interest.
If you have a reference genome that you anticipate will be widely used but is currently absent, consider submitting a pull request that adds the .fasta file to the fasta annotation package folder.
Please raise an issue here