Caleb Lareau edited this page Aug 10, 2020 · 2 revisions

TLDR

In principle, the mgatk CLI produces the same final output files in each of the major processing modes. Which mode is best suited depends on the source and format of your input data and on the computational resources available. Here's a breakdown of key considerations:

If you have a collection of .bam files, one per sample, use the call mode

If you have data from a 10x Genomics library, use the tenx mode

If you have single-cell data or a setting where multiple samples are in one .bam file but have unconventional barcodes, or you don't yet know which barcodes you want to analyze, use the bcall mode

call

The call mode tells mgatk to look at a directory, identify all .bam files, and treat each file like it is its own sample. This mode is great for 1) Fluidigm C1, Smart-seq2, and other plate-based assays or 2) bulk genomics samples. To run this mode, simply specify the file path and add any additional user options desired.

mgatk call -i folder_of_bam_files ...
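To preview what a "one .bam file per sample" layout looks like before running call, the sketch below lists the .bam files in a folder and derives a sample name from each file name. This is only an illustration: the helper name list_bam_samples is hypothetical, and using the file name without its extension as the sample name is an assumption, not a statement about how mgatk names samples internally.

```python
from pathlib import Path
import tempfile

def list_bam_samples(folder):
    """Map a sample name (file name without .bam) to each .bam file in folder.

    Hypothetical helper for inspection only; mgatk does its own discovery.
    """
    folder = Path(folder)
    # "*.bam" matches whole file names, so index files like x.bam.bai are skipped
    return {p.stem: p for p in sorted(folder.glob("*.bam"))}

# Demo on a throwaway directory containing empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["cell_A.bam", "cell_B.bam", "cell_A.bam.bai", "notes.txt"]:
        (Path(tmp) / name).touch()
    samples = list_bam_samples(tmp)
    sample_names = list(samples)
    print(sample_names)
```

A quick listing like this can catch stray or missing files before a long run is started.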

bcall

This mode is the most flexible for calling genotypes from single-cell data, but it comes at a cost. In brief, this mode uses a user-specified SAM tag (e.g. CB) to identify distinct cells. The cells to analyze can be given either as a known list of barcodes (using the -b FILE option to point to a particular FILE of barcodes, one per line) or by keeping barcodes with greater than X mtDNA reads (using the -mb X option). The price for this flexibility is that mgatk will split the master .bam file into (often) thousands of single-cell .bam files and then process each of them. While the snakemake scheduler achieves this relatively quickly, opening thousands of files can still strain the file system. One can use the -ns flag to reduce the number of files open at once, depending on the file system specifications, though the more files that can be processed in parallel, the faster the computation will complete.
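The idea behind selecting barcodes by a minimum number of mtDNA reads can be sketched in a few lines. The example below works on SAM-formatted text lines (such as those printed by samtools view) and counts reads per barcode tag on the mitochondrial contig. It is a minimal sketch, not mgatk's implementation: the function name barcodes_above_threshold, the contig name chrM, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter

def barcodes_above_threshold(sam_lines, min_reads, tag="CB", contig="chrM"):
    """Return barcodes with more than min_reads reads aligned to contig.

    Hypothetical helper; 'chrM' is an assumed contig name and may differ
    in your reference (e.g. 'MT').
    """
    prefix = tag + ":Z:"
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[2] != contig:
            continue                      # malformed line or not an mtDNA read
        for opt in fields[11:]:           # optional TAG:TYPE:VALUE fields
            if opt.startswith(prefix):
                counts[opt[len(prefix):]] += 1
                break
    return sorted(bc for bc, n in counts.items() if n > min_reads)

def _read(name, contig, barcode):
    """Build a toy SAM record with the 11 mandatory fields plus a CB tag."""
    return "\t".join([name, "0", contig, "1", "60", "4M", "*", "0", "0",
                      "ACGT", "FFFF", "CB:Z:" + barcode])

demo = ["@SQ\tSN:chrM\tLN:16569"]
demo += [_read(f"r{i}", "chrM", "AAAC-1") for i in range(3)]
demo += [_read("r3", "chrM", "TTTG-1"), _read("r4", "chr1", "TTTG-1")]

selected = barcodes_above_threshold(demo, min_reads=2)
print(selected)
```

Here only the barcode with three mtDNA reads passes a threshold of 2; the read on chr1 is ignored because it is not mitochondrial.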

To use bcall, specify a valid .bam file containing mtDNA reads and further specify either the -b or -mb option with valid parameters:

mgatk bcall -i path_to_bam_file ...

The other benefit to using bcall is that since each cell is split into its own .bam file, one can elect to keep these filtered, deduplicated, per-cell .bam files by throwing the -qc flag, which may be useful for other downstream applications.

tenx

This mode uses the conventions of the 10x Genomics .bam file. Specifically, the 16bp barcode and the optional UMI are used to enable more intelligent processing that circumvents splitting the original .bam file into thousands of separate files. The run time is also faster. Importantly, for appropriate input (i.e. CellRanger or CellRanger-ATAC output), bcall and tenx will give identical results.

The basic input requires both a .bam file and a plaintext file specifying known high-quality (HQ) barcodes for analysis, such as those produced by the CellRanger knee call:

mgatk tenx -i path_to_bam_file -b known_barcodes_file ...
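If the barcode list from your pipeline is gzip-compressed, it can be converted to a plaintext file with one barcode per line using a short script. The sketch below is an illustration under assumptions: the file names barcodes.tsv.gz and known_barcodes.txt and the helper write_barcode_list are hypothetical, the exact location and format of the barcode file depend on your CellRanger version, and it is assumed that the barcodes should match the barcode tag values in the .bam file exactly (including any suffix such as -1).

```python
import gzip
import tempfile
from pathlib import Path

def write_barcode_list(barcodes_gz, out_txt):
    """Write barcodes from a gzip-compressed file to plaintext, one per line.

    Hypothetical helper; blank lines are dropped, barcodes are kept verbatim.
    Returns the number of barcodes written.
    """
    with gzip.open(barcodes_gz, "rt") as fin:
        barcodes = [line.strip() for line in fin if line.strip()]
    Path(out_txt).write_text("".join(bc + "\n" for bc in barcodes))
    return len(barcodes)

# Demo with a small made-up barcode file in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "barcodes.tsv.gz"
    dst = Path(tmp) / "known_barcodes.txt"
    with gzip.open(src, "wt") as fout:
        fout.write("AAACGAAAGAGTCACG-1\nAAACGAAAGCGATAGC-1\n\n")
    n_written = write_barcode_list(src, dst)
    written = dst.read_text().splitlines()
    print(n_written, written)
```

The resulting plaintext file is the kind of input the -b option expects.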

Other modes

The three modes highlighted above will be the meat and potatoes of using mgatk, though the following two may also come in handy:

check

This mode functions as a drop-in for the examples shown above. Simply switch out bcall, call, or tenx with check and run your command with all appropriate flags. The mgatk main function will then attempt an execution and raise any problems (e.g. incorrect file paths, errant reference genome specification, or missing dependencies) prospectively. This is useful to make sure that you don't waste minutes or hours of your life only to realize that the genotyping command failed at the last step.
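When commands are generated by a script, the swap to check can be automated so that every run is validated first. The sketch below rewrites the mode word of an mgatk command string and leaves all other arguments untouched. The helper name as_check_command is hypothetical, and the example command reuses the placeholder arguments from this page.

```python
import shlex

# Processing modes described on this page
MODES = {"call", "bcall", "tenx"}

def as_check_command(command):
    """Return the same mgatk command with its mode replaced by 'check'.

    Hypothetical helper; it only edits the command string and runs nothing.
    """
    args = shlex.split(command)
    if len(args) < 2 or args[0] != "mgatk" or args[1] not in MODES:
        raise ValueError("expected 'mgatk <call|bcall|tenx> ...'")
    args[1] = "check"
    return shlex.join(args)  # requires Python 3.8+

check_cmd = as_check_command(
    "mgatk tenx -i path_to_bam_file -b known_barcodes_file"
)
print(check_cmd)
```

Running the printed command first, and the original command only if it succeeds, keeps the two invocations identical apart from the mode.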

support

This is a simple mode that shows what built-in contigs are available:

Sun Aug 09 21:01:09 PDT 2020: mgatk v0.5.7
Sun Aug 09 21:01:09 PDT 2020: List of built-in genomes supported in mgatk:
Sun Aug 09 21:01:09 PDT 2020: ['GRCh37', 'GRCh38', 'GRCm38', 'GRCz10', 'NC_012920', 'hg19', 'hg19_chrM', 'hg38', 'mm10', 'mm9', 'rCRS']
Sun Aug 09 21:01:09 PDT 2020: Specify one of these genomes or provide your own .fasta file with the --mito-genome flag

See this part of the documentation if you don't see the reference genome of interest.

If you have a reference genome that you anticipate will be widely used but is currently absent, consider submitting a pull request with the .fasta file added to the fasta annotation package folder.