salmon v0.10.0

@rob-p rob-p released this 29 May 21:55

Salmon 0.10.0 is a major feature release. It includes a family of algorithms for single-cell analysis, as well as a number of new features and performance enhancements. We highly recommend that all users upgrade when they have the chance.

Note : Due to the inclusion of the SHA512 hash in the salmon index (see "other changes" below), existing salmon indices should be rebuilt.

alevin

Welcome alevin to the salmon family !

Working under the salmon engine, alevin brings new algorithms and infrastructure to perform single-cell quantification and analysis based on 3' tagged-end sequencing. The alevin mode is activated by using the alevin command, and currently supports quantification of the Drop-seq (--dropseq) and 10x v1/2 (--chromium) single-cell protocols (v1 chemistry requires the use of a special wrapper). Alevin works on raw FASTA/Q files and performs the following tasks:

  • Initial Whitelisting: If --whitelist is not given (an already known set of whitelisted barcodes, e.g. as produced by Cell Ranger), alevin finds a rough estimate of the set of whitelisted CBs (Cellular Barcodes) based on their frequency.

  • Barcode Correction: In the first pass over the CB file, alevin constructs a dictionary for correcting CBs that are not on the whitelist, mapping each CB within 1 edit distance of a whitelisted CB to that whitelisted CB. In case of multiple whitelist candidates, preference is given to SNPs over indels. Optionally, a probabilistic model can be used to soft-assign barcodes, although that behavior is disabled by default (i.e. --noSoftMap is true).

  • UMI Correction & Deduplication: alevin introduces a novel method for deduplicating the UMIs (Unique Molecular Identifiers) present in a sample. Alevin's algorithm uses equivalence-class-level information to infer when the same UMI must arise from different isoforms of a gene (to avoid over-collapsing UMI counts), but also accounts for the fact that collisions between UMIs within a gene are expected to be very rare (i.e. if UMIs arise within different equivalence classes of a gene, they are most likely to derive from different positions in the same underlying molecule). To use a baseline (i.e. simple gene-level) UMI deduplication algorithm, alevin can be run with --naive to disable its collision correction.

  • CB classification: Alevin uses various features in a machine-learning-based framework to classify the set of observed CBs that are likely to derive from valid captured cells (i.e. final whitelisting). This approach to CB classification is similar to that of the method of Petukhov et al. Alevin uses features such as the abundance of mitochondrial genes (--mrna), ribosomal genes (--rrna) and others for classification.

  • Cell-Gene count Matrix: By default, alevin outputs a cell-by-gene count matrix in a compressed binary format. However, --dumpCsvCounts can be used to dump a human-readable count matrix.

  • Other features: --dumpfq quickly concatenates the corrected CB onto the read names in the FASTQ file that contains the sequences; --dumpFeatures dumps the features and counts used by alevin to perform the ML-based CB classification; --dumpBfh dumps the full CB-Eqclass-UMI-Count data structure used internally by alevin.
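The barcode-correction step described above can be sketched roughly as follows. This is a minimal illustration, not alevin's implementation: the function names `edit_kind` and `correct_barcode` are hypothetical, barcodes are treated as plain strings (ignoring the fixed-length barcode window of real protocols), and ties between candidates of the same kind are broken arbitrarily by taking the lexicographically smallest one, which the notes above do not specify.

```python
def edit_kind(observed, target):
    """Classify the single edit separating two strings.

    Returns "sub" for exactly one substitution, "indel" for exactly one
    inserted/deleted base, and None otherwise (including identical strings).
    """
    if len(observed) == len(target):
        mismatches = sum(a != b for a, b in zip(observed, target))
        return "sub" if mismatches == 1 else None
    if abs(len(observed) - len(target)) == 1:
        longer, shorter = sorted((observed, target), key=len, reverse=True)
        # One indel: deleting some base of the longer string gives the shorter.
        for i in range(len(longer)):
            if longer[:i] + longer[i + 1:] == shorter:
                return "indel"
    return None


def correct_barcode(observed, whitelist):
    """Map an observed barcode to a whitelisted one within 1 edit, or None."""
    if observed in whitelist:
        return observed
    subs, indels = [], []
    for cb in sorted(whitelist):  # sorted only to make tie-breaking deterministic
        kind = edit_kind(observed, cb)
        if kind == "sub":
            subs.append(cb)
        elif kind == "indel":
            indels.append(cb)
    # Substitution candidates are preferred over indel candidates.
    candidates = subs or indels
    return candidates[0] if candidates else None


# Toy whitelist, made up purely for illustration.
whitelist = {"ACGT", "ACG", "TTTT"}
corrected = correct_barcode("ACGA", whitelist)
```

Here "ACGA" is one substitution away from "ACGT" and one deletion away from "ACG"; the substitution candidate is chosen. A barcode with no whitelisted neighbor within one edit is left uncorrected (`None`).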

Note : We are actively developing and improving alevin, and are happy and excited to get feedback from the community. If you encounter an issue when using alevin, please be sure to tag it with the alevin tag when reporting it via GitHub.

mapping validation

Mapping validation is a new feature that allows salmon to validate its mappings via a traditional (affine-gap penalty) alignment procedure; it is enabled by passing the flag --validateMappings. This validation is made efficient (and fast) through a combination of :

  • using the very-efficient and highly-vectorized alignment implementation of @lh3's ksw2 library.

  • devising a novel caching heuristic that avoids re-aligning reads when sub-problems are redundant (this turns out to be a major computational bottleneck when aligning against the transcriptome).

Using the --validateMappings flag has two main potential benefits. First, it helps prevent salmon from considering potentially spurious mappings (i.e., mappings supported by only a few MMPs but which nonetheless would not support a high-quality read alignment). Second, it helps assign more appropriate mapping scores to reads that map to similar (but not identical) reference sequences, essentially down-weighting sub-optimal mappings. Along with this flag, salmon introduces flags to set the match score (--ma), mismatch penalty (--mp), and gap open (--go) and gap extension (--ge) scores used when computing the alignment. It also allows the user to specify the minimum relative alignment score that will be considered a valid mapping (--minScoreFraction). While these can all be customized, the defaults should be reasonable for typical use cases.
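To make the role of these parameters concrete, here is a small sketch of affine-gap scoring followed by a relative-score cutoff. It is not salmon's code (salmon uses ksw2) and makes several assumptions: the alignment is global, a gap of length k costs go + k*ge, the "relative" score is taken as the alignment score divided by the best possible score (ma times the read length), and the numeric values are placeholders rather than salmon's documented defaults. The function names are hypothetical.

```python
NEG_INF = float("-inf")


def affine_gap_score(read, ref, ma=2, mp=4, go=5, ge=3):
    """Best global alignment score under an affine gap penalty (Gotoh-style DP)."""
    n, m = len(read), len(ref)
    # M: last column aligns read[i-1] with ref[j-1]
    # X: last column aligns read[i-1] with a gap
    # Y: last column aligns ref[j-1] with a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(go + i * ge)
    for j in range(1, m + 1):
        Y[0][j] = -(go + j * ge)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = ma if read[i - 1] == ref[j - 1] else -mp
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap costs go + ge; extending an existing one costs ge.
            X[i][j] = max(M[i - 1][j] - (go + ge),
                          X[i - 1][j] - ge,
                          Y[i - 1][j] - (go + ge))
            Y[i][j] = max(M[i][j - 1] - (go + ge),
                          Y[i][j - 1] - ge,
                          X[i][j - 1] - (go + ge))
    return max(M[n][m], X[n][m], Y[n][m])


def is_valid_mapping(read, ref, min_score_fraction=0.65, ma=2, mp=4, go=5, ge=3):
    """Keep a mapping only if its score reaches a fraction of the best possible score."""
    best_possible = ma * len(read)  # assumption: every base of the read matches
    score = affine_gap_score(read, ref, ma, mp, go, ge)
    return score >= min_score_fraction * best_possible


read = "ACGTACGTACGTACGTACGT"
one_mismatch_ref = "ACGTACGTACCTACGTACGT"  # differs from read at one position
```

With these placeholder values, a 20-base read with a single mismatch scores 19*2 - 4 = 34 against a cutoff of 0.65 * 40 = 26 and is kept, whereas a mapping whose alignment is dominated by mismatches or gaps falls below the cutoff and is discarded.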

other changes

  • Salmon now enables the alignment error model by default in alignment-based mode. Since this behavior is now the default, the --useErrorModel flag is no longer valid and has been removed. A new flag, --noErrorModel, has been added in its place; passing it to salmon in alignment-based mode turns the alignment error model off.

  • Related to the above, the alignment error model works best in conjunction with range factorization. Thus, the default behavior is now to turn on range-based factorization in alignment mode (in conjunction with the error model).

  • New default VB prior : The default per-nucleotide VB prior has been changed to 1e-5. While this is still an ongoing area of research, a considerable amount of testing suggests that variational Bayesian optimization with a sparsity-inducing prior regularly leads to more accurate abundance estimates than the default EM algorithm. While we are leaving the EM algorithm as the default for the offline phase in the current release, this may change in future versions. We encourage users who are not already doing so to explore the variational Bayesian-based offline optimization feature of salmon (enabled with --useVBOpt).

  • Library type compatibility is now enforced strictly. Previously, mappings that disagreed with the inferred or provided library type simply had their probability decreased. Now, the default behavior is to discard such mappings. The new behavior is equivalent to running with the option --incompatPrior 0. The older behavior can be obtained by setting --incompatPrior to a small non-zero value.

  • The library format count statistics are now computed in a different (and hopefully less confusing) manner. Specifically, rather than being computed over the number of mappings of each type, the statistics are computed over the number of fragments that have at least one mapping of that type. This means that, e.g., if a fragment maps to 2 places in the forward orientation and 1 place in the reverse-complement orientation, it will now contribute only 1 count each to the forward and reverse-complement compatibilities. This should help reduce any reference bias when computing these summary statistics.

  • The default value of --gcSpeedSamp has been set to 5.

  • Inclusion of SHA512 hashes for the salmon index : When indexing, salmon now computes both SHA256 and SHA512 hashes for the reference. This is done to allow future compatibility with GA4GH hashes (which will use a truncated variant of SHA512).

  • The default k-mer class (used for certain operations within salmon) has been migrated from the jellyfish implementation to a custom implementation. This results in a small performance increase on our testing systems under Linux, and a moderate performance increase under OSX.

  • Salmon is now compiled in C++14 mode (i.e. --std=c++14) by default rather than C++11 mode. This is the last salmon release that will support C++11 (by compiling with -DCONDA_BUILD=TRUE). Moving forward, C++14 compliance will be considered the minimum requirement to compile salmon from source and C++14 features will be used in new code.
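The change to the library format count statistics described above can be illustrated with a short sketch. This is not salmon's code; the function names and the orientation labels "fwd" and "rc" are hypothetical stand-ins for the mapping types salmon tracks.

```python
from collections import Counter


def per_mapping_counts(fragments):
    """Older style: every mapping of every fragment contributes one count."""
    counts = Counter()
    for mappings in fragments:
        counts.update(mappings)
    return counts


def per_fragment_counts(fragments):
    """Newer style: a fragment contributes at most one count per mapping type."""
    counts = Counter()
    for mappings in fragments:
        counts.update(set(mappings))
    return counts


# One fragment mapping to 2 forward locations and 1 reverse-complement location.
fragments = [["fwd", "fwd", "rc"]]
old_style = per_mapping_counts(fragments)
new_style = per_fragment_counts(fragments)
```

For the single fragment in this example, per-mapping counting gives 2 forward and 1 reverse-complement, while per-fragment counting gives 1 of each, matching the example in the list above.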