Contact: Ruibang Luo, Zhenxian Zheng, Lei Chen
Email: {rbluo,zxzheng,lchen}@cs.hku.hk
ClairS-TO (Somatic Tumor-Only) is a tool in the Clair series to support long-read somatic small variant calling with only tumor samples available.
Without a normal sample, non-somatic noises cannot be identified by finding common signals between a paired tumor and normal. The variant caller itself needs to be more proficient in telling noises from somatic signals.
In ClairS-TO, we use an ensemble of two neural networks with opposite objectives. With the same input, an Affirmative NN determines how likely a candidate is a somatic variant - P(YAff), and a Negational NN determines how likely a candidate is NOT a somatic variant - P(YNeg). A conditional probability P(YAff | YNeg) that determines how likely a candidate is a somatic variant given the probability that the candidate is not a somatic variant is calculated from the probability of both networks. A somatic variant candidate that doesn't look like a noise usually has a high P(YAff) but a low P(YNeg), while a somatic variant candidate that can also be a noise can have both a high P(YAff) and a high P(YNeg).
Below is a workflow of ClairS-TO.
Like other tumor-only somatic variant callers, ClairS-TO accepts panel of normals (PoNs) as input to remove non-somatic variants.
For somatic variant calling using paired tumor/normal samples, please try ClairS.
The latest performance figures as of Oct. 10th, 2024 (ClairS-TO v0.3.0) is available in this technical note.
- Latest Updates
- Installation
- Quick Demo
- Pre-trained Models
- Usage
- Tagging non-somatic variant using panel of normals
- Disclaimer
v0.3.0 (Oct. 11, 2024) : This version is a major update. The new features and benchmarks are explained in a technical note titled “Improving the performance of ClairS and ClairS-TO with new real cancer cell-line datasets and PoN”. A summary of changes: 1. Starting from this version, ClairS-TO will provide two model types. ssrs
is a model trained initially with synthetic samples and then real samples augmented (e.g., ont_r10_dorado_sup_5khz_ssrs
), ss
is a model trained from synthetic samples (e.g., ont_r10_dorado_sup_5khz_ss
). The ssrs
model provides better performance and fits most usage scenarios. ss
model can be used when missing a cancer-type in model training is a concern. In v0.3.0, four real cancer cell-line datasets (HCC1937, HCC1954, H1437, and H2009) covering two cancer types (breast cancer, lung cancer) published by Park et al. were used for ssrs
model training. 2. Added using CoLoRSdb (Consortium of Long Read Sequencing Database) as a PoN for tagging non-somatic variant. The idea was inspired by Park et al., 2024. The F1-score improved by ~10-20% for both SNV and Indel by using CoLoRSdb. 3. Added tagging indels at sequence with low entropy as LowSeqEntropy
. 4. Added the --indel_min_af
option and adjusted the default minimum allelic fraction requirement to 0.1 for Indels in ONT platform. 5. Removed limiting Indel calling to only confident and necessary regions (whole genome - GIAB stratification v3.3 all difficult regions + CMRG v1.0 regions). The practice was started in v0.1.0, and is deemed unnecessary and removed in v0.3.0. User can use --calling_indels_only_in_these_regions
option to specify Indel calling regions.
v0.2.0 (Jul. 12, 2024): 1. Added a module called verdict
to statistically classify a called variant into either a germline, somatic, or subclonal somatic variant based on the copy number alterations (CNA) profile and tumor purity estimation. To disable, use --disable_verdict
option. Please check out more technical details about Verdict here.
v0.1.0 (Apr. 25, 2024): 1. Added support for somatic Indel calling. To disable, use --disable_indel_calling
option. Indels are called only in the BED regions specified by the --calling_indels_only_in_these_regions
option. The default regions are (whole genome - GIAB stratification v3.3 all difficult regions + CMRG v1.0 regions). 2. Added --panel_of_normals_require_allele_matching
option that takes comma separated booleans to indicate whether to require allele matching for each of the PoNs given in --panel_of_normals
. By default, allele matching is enabled when using germline variants sources (e.g., gnomAD, dbSNP) for non-somatic tagging, and is disabled when using panels (e.g., 1000G PoN). 3. Added multiple filters to remove as many spurious calls as possible. Including the use of i. phasing information: how good the alternative alleles are from a single haplotype after phasing (Simpson, 2024); ii. ancestral haplotype support: can an ancestral haplotype be found for reads that contain the alternative allele (Zheng et al., 2023); iii. BQ, MQ of the alternative allele reads; iv. variant position in read: whether the supporting alleles are gathered at the start or end of reads; v. strand bias; vi. realignment effect: for short read, whether both the count of supporting alt alleles and AF decreased after realignment. 4. Added --qual_cutoff_phaseable_region
and --qual_cutoff_unphaseable_region
to allow different qual cutoffs for tagging (as LowQual) the variants in the phaseable and unphaseable regions. Variants in unphaseable regions are suitable for a higher quality cutoff than those in the phaseable regions. 5. Added tags: i. H
to indicate a variant is found in phaseable region; ii. SB
showing the p-value of Fisher’s exact test on strand bias.
v0.0.2 (Jan. 26, 2024): 1. Added ONT Guppy 5kHz HAC (-p ont_r10_guppy_hac_5khz
) and Dorado 4kHz HAC (-p ont_r10_dorado_hac_4khz
) models, check here for more details. 2. Added FAU
, FCU
, FGU
, FTU
, RAU
, RCU
, RGU
, and RTU
tags for the count of forward/reverse strand reads supporting A/C/G/T. 3. Revamped the way how panel of normals (PoNs) are inputted. Population databases are also considered as PoNs, and users can disable default population databases and add multiple other PoNs. 4. Added file
and md5
information of the PoNs to the VCF output header. 5. Enabled somatic variant calling in sex chromosomes. 6. Fixed an issue that misses PoNs tagging for low-quality variants.
v0.0.1 (Dec. 4, 2023): Initial release for early access.
- Oxford Nanopore (ONT) Q20+ data as input, see ONT Quick Demo.
- PacBio HiFi Revio data as input, see PacBio HiFi Quick Demo.
- Illumina NGS data as input, see Illumina Quick Demo.
After following installation, you can run ClairS-TO with one command:
./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz
## Final SNV output VCF file: output/snv.vcf.gz
## Final Indel output VCF file: output/indel.vcf.gz
Check Usage for more options.
ClairS-TO trained both Affirmative and Negational models using GIAB samples, and carry on benchmarking on HCC1395 tumor sample dataset. All models were trained with chr20 excluded (including only chr1-19, 21, 22).
Platform | Model name | Chemistry /Instruments | Basecaller | Latest update | Option (-p/--platform ) |
Reference | Aligner |
---|---|---|---|---|---|---|---|
ONT 1 | r1041_e82_400bps_sup_v420 | R10.4.1, 5khz | Dorado SUP | Sep. 30, 2024 | ont_r10_dorado_sup_5khz_ssrs |
GRCh38_no_alt | Minimap2 |
ONT 1 | r1041_e82_400bps_sup_v420 | R10.4.1, 5khz | Dorado SUP | Nov. 10, 2023 | ont_r10_dorado_sup_5khz_ss |
GRCh38_no_alt | Minimap2 |
ONT | r1041_e82_400bps_sup_v420 | R10.4.1, 5khz | Dorado SUP | Nov. 10, 2023 | ont_r10_dorado_sup_5khz |
GRCh38_no_alt | Minimap2 |
ONT | r1041_e82_400bps_sup_v410 | R10.4.1, 4khz | Dorado SUP | Nov. 10, 2023 | ont_r10_dorado_sup_4khz |
GRCh38_no_alt | Minimap2 |
ONT | r1041_e82_400bps_hac_v410 | R10.4.1, 4khz | Dorado HAC | Jan. 19, 2024 | ont_r10_dorado_hac_4khz |
GRCh38_no_alt | Minimap2 |
ONT | r1041_e82_400bps_sup_g615 | R10.4.1, 4khz | Guppy6 SUP | Nov. 10, 2023 | ont_r10_guppy_sup_4khz |
GRCh38_no_alt | Minimap2 |
ONT | r1041_e82_400bps_hac_g657 | R10.4.1, 5khz | Guppy6 HAC | Jan. 21, 2024 | ont_r10_guppy_hac_5khz |
GRCh38_no_alt | Minimap2 |
Illumina | ilmn | NovaSeq/HiseqX | - | Nov. 10, 2023 | ilmn |
GRCh38 | BWA-MEM |
PacBio HiFi | hifi_revio | Revio with SMRTbell prep kit 3.0 | - | Nov. 10, 2023 | hifi_revio |
GRCh38_no_alt | Minimap2 |
Caveats 1: Starting from v0.3.0 version, ClairS-TO will provide two model types. ssrs
is a model trained initially with synthetic samples and then real samples augmented (e.g., ont_r10_dorado_sup_5khz_ssrs
), ss
is a model trained from synthetic samples (e.g., ont_r10_dorado_sup_5khz_ss
). The ssrs
model provides better performance and fits most usage scenarios. ss
model can be used when missing a cancer-type in model training is a concern. In v0.3.0, four real cancer cell-line datasets (HCC1937, HCC1954, H1437, and H2009) covering two cancer types (breast cancer, lung cancer) published by Park et al. were used for ssrs
model training.
A pre-built docker image is available at DockerHub.
Caution: Absolute path is needed for both INPUT_DIR
and OUTPUT_DIR
in docker.
docker run -it \
-v ${INPUT_DIR}:${INPUT_DIR} \
-v ${OUTPUT_DIR}:${OUTPUT_DIR} \
hkubal/clairs-to:latest \
/opt/bin/run_clairs_to \
--tumor_bam_fn ${INPUT_DIR}/tumor.bam \ ## use your tumor bam file name here
--ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
--threads ${THREADS} \ ## maximum threads to be used
--platform ${PLATFORM} \ ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_hac_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy_sup_4khz, ont_r10_guppy_hac_5khz, ilmn, hifi_revio}
--output_dir ${OUTPUT_DIR} ## output path prefix
Check Usage for more options.
Caution: Absolute path is needed for both INPUT_DIR
and OUTPUT_DIR
in singularity.
INPUT_DIR="[YOUR_INPUT_FOLDER]" # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]" # e.g. /home/user1/output (absolute path needed)
mkdir -p ${OUTPUT_DIR}
conda config --add channels defaults
conda create -n singularity-env -c conda-forge singularity -y
conda activate singularity-env
# singularity pull docker pre-built image
singularity pull docker://hkubal/clairs-to:latest
# run the sandbox like this afterward
singularity exec \
-B ${INPUT_DIR},${OUTPUT_DIR} \
clairs-to_latest.sif \
/opt/bin/run_clairs_to \
--tumor_bam_fn ${INPUT_DIR}/tumor.bam \ ## use your tumor bam file name here
--ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
--threads ${THREADS} \ ## maximum threads to be used
--platform ${PLATFORM} \ ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_hac_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy_sup_4khz, ont_r10_guppy_hac_5khz, ilmn, hifi_revio}
--output_dir ${OUTPUT_DIR} \ ## output path prefix
--conda_prefix /opt/micromamba/envs/clairs-to
Check here to install the tools step by step.
Use micromamba (recommended):
Please install micromamba using the official guide or using the commands below:
wget -O linux-64_micromamba-1.5.1-2.tar.bz2 https://micro.mamba.pm/api/micromamba/linux-64/latest
mkdir micromamba
tar -xvjf linux-64_micromamba-1.5.1-2.tar.bz2 -C micromamba
cd micromamba
./bin/micromamba shell init -s bash -p .
source ~/.bashrc
Or use anaconda:
Please install anaconda using the official guide or using the commands below:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x ./Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
Install ClairS-TO using micromamba step by step:
# create and activate an environment named clairs-to
# install pypy and packages in the environment
# for micromamba
micromamba create -n clairs-to -c bioconda -c pytorch -c conda-forge pytorch tqdm clair3 bcftools einops scipy scikit-learn python=3.9.0 -y
micromamba activate clairs-to
## for anaconda
#conda create -n clairs-to -c bioconda -c pytorch -c conda-forge pytorch tqdm clair3 bcftools einops python=3.9.0 -y
#source activate clairs-to
git clone https://github.com/HKU-BAL/ClairS-TO.git
cd ClairS-TO
# make sure in clairs-to environment
# download pre-trained models and other resources
echo ${CONDA_PREFIX}
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_models
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_databases
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_cna_data
wget http://www.bio8.cs.hku.hk/clairs-to/models/clairs-to_models.tar.gz
wget http://www.bio8.cs.hku.hk/clairs-to/databases/clairs-to_databases.tar.gz
wget http://www.bio8.cs.hku.hk/clairs-to/cna_data/reference_files.tar.gz
tar -zxvf clairs-to_models.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_models/
tar -zxvf clairs-to_databases.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_databases/
tar -zxvf reference_files.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_cna_data/
#CLAIRSTO_PATH=`pwd`
## to enable realignment module
#sudo apt install g++ libboost-all-dev -y
#cd ${CLAIRSTO_PATH}/src/realign && g++ -std=c++14 -O1 -shared -fPIC -o realigner ssw_cpp.cpp ssw.c realigner.cpp && g++ -std=c++11 -shared -fPIC -o debruijn_graph -O3 debruijn_graph.cpp
## to install allele counter for verdict module
#sudo apt install curl zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev gcc -y
#cd ${CLAIRSTO_PATH}/src/verdict/allele_counter && chmod +x ./setup.sh && /bin/bash ./setup.sh ${CLAIRSTO_PATH}/src/verdict/allele_counter
#cd ${CLAIRSTO_PATH}
./run_clairs_to --help
This is the same as Option 1 except that you are building a docker image yourself. Please refer to Option 1 for usage.
git clone https://github.com/HKU-BAL/ClairS-TO.git
cd ClairS-TO
# build a docker image named hkubal/clairs-to:latest
# might require docker authentication to build docker image
docker build -f ./Dockerfile -t hkubal/clairs-to:latest .
# run the docker image like Option 1
docker run -it hkubal/clairs-to:latest /opt/bin/run_clairs_to --help
./run_clairs_to \
--tumor_bam_fn ${INPUT_DIR}/tumor.bam \ ## use your tumor bam file name here
--ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
--threads ${THREADS} \ ## maximum threads to be used
--platform ${PLATFORM} \ ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_hac_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy_sup_4khz, ont_r10_guppy_hac_5khz, ilmn, hifi_revio}
--output_dir ${OUTPUT_DIR} ## output path prefix
## Final SNV output VCF file: output/snv.vcf.gz
## Final Indel output VCF file: output/indel.vcf.gz
Required parameters:
-T, --tumor_bam_fn TUMOR_BAM_FN Tumor BAM file input. The input file must be samtools indexed.
-R, --ref_fn FASTA Reference file input. The input file must be samtools indexed.
-o, --output_dir OUTPUT_DIR VCF output directory.
-t, --threads THREADS Max threads to be used.
-p, --platform PLATFORM Select the sequencing platform of the input. Possible options {ont_r10_dorado_sup_4khz, ont_r10_dorado_hac_4khz, ont_r10_dorado_sup_5khz, ont_r10_dorado_sup_5khz_ss, ont_r10_dorado_sup_5khz_ssrs, ont_r10_guppy_sup_4khz, ont_r10_guppy_hac_5khz, ilmn, hifi_revio}.
Commonly used parameters:
-s SAMPLE_NAME, --sample_name SAMPLE_NAME
Define the sample name to be shown in the VCF file. Default: SAMPLE.
-c CTG_NAME, --ctg_name CTG_NAME
The name of the contigs to be processed. Split by ',' for multiple contigs. Default: all contigs will be processed.
--include_all_ctgs Call variants on all contigs, otherwise call in chr{1..22,X,Y} and {1..22,X,Y}.
-r REGION, --region REGION
A region to be processed. Format: `ctg_name:start-end` (start is 1-based, including both end positions).
-b BED_FN, --bed_fn BED_FN
Path to a BED file. Call variants only in the provided BED regions.
-G VCF_FN, --genotyping_mode_vcf_fn VCF_FN
VCF file input containing candidate sites to be genotyped. Variants will only be called at the sites in the VCF file if provided.
-H VCF_FN, --hybrid_mode_vcf_fn VCF_FN
Enable hybrid calling mode that combines the de novo calling results and genotyping results at the positions in the VCF file given.
--print_ref_calls Show reference calls (0/0) in VCF file in genotyping or hybrid mode.
--disable_indel_calling
Disable Indel calling. Default: Enabled.
--snv_min_af FLOAT
Minimal SNV AF required for a variant to be called. Decrease SNV_MIN_AF might increase a bit of sensitivity, but in trade of precision, speed and accuracy. Default: 0.05.
--indel_min_af FLOAT
Minimal Indel AF required for a variant to be called. Default: 0.1.
--min_coverage INT
Minimal coverage required for a variant to be called. Default: 4.
-q INT, --qual INT If set, variants with >INT will be tagged as PASS, or LowQual otherwise. Default: ONT - 12 , PacBio HiFi - 8, Illumina - 4.
--qual_cutoff_phaseable_region INT
If set, variants called in phaseable regions with >INT will be tagged as PASS, or LowQual otherwise. Supersede by `--qual`.
--qual_cutoff_unphaseable_region INT
If set, variants called in unphaseable regions with >INT will be tagged as PASS, or LowQual otherwise. Supersede by `--qual`.
--panel_of_normals FILENAMES
The path of the panel of normals (PoNs) used for tagging non-somatic variants. Split by ',' if using multiple PoNs. Default: 'gnomad.r2.1.af-ge-0.001.sites.vcf.gz,dbsnp.b138.non-somatic.sites.vcf.gz,1000g-pon.sites.vcf.gz,CoLoRSdb.GRCh38.v1.1.0.deepvariant.glnexus.af-ge-0.001.vcf.gz'.
--panel_of_normals_require_allele_matching BOOLEANS
Use together with `--panel_of_normals`. Whether to require allele matching for each PoN. Split by ',' if using multiple PoNs. Default: 'True,True,False,False'.
--snv_output_prefix PATH_PREFIX
Prefix for SNV output VCF filename. Default: snv.
--indel_output_prefix PATH_PREFIX
Prefix for Indel output VCF filename. Default: indel.
--call_indels_only_in_these_regions BED_FN
Call Indel only in the provided regions. Supersede by `--bed_fn`. To call Indel in the whole genome, input a BED covering the whole genome.
--do_not_print_nonsomatic_calls
Do not print those non-somatic variants tagged by `--panel_of_normals`.
Other parameters:
--snv_pileup_affirmative_model_path PATH
Specify the path to your own SNV pileup affirmative model.
--snv_pileup_negational_model_path PATH
Specify the path to your own SNV pileup negational model.
--indel_pileup_affirmative_model_path PATH
Specify the path to your own Indel pileup affirmative model.
--indel_pileup_negational_model_path PATH
Specify the path to your own Indel pileup negational model.
-d, --dry_run Print the commands that will be ran, but do not run them.
--chunk_size INT
The size of each chuck for parallel processing. Default: 5000000.
--remove_intermediate_dir
Remove the intermediate directory before finishing to save disk space.
--python PATH Absolute path of python, python3 >= 3.9 is required.
--pypy PATH Absolute path of pypy3, pypy3 >= 3.6 is required.
--samtools PATH Absolute path of samtools, samtools version >= 1.10 is required.
--parallel PATH Absolute path of parallel, parallel >= 20191122 is required.
--longphase PATH
Absolute path of longphase, longphase >= 1.3 is required.
--whatshap PATH Absolute path of whatshap, whatshap >= 1.0 is required.
--use_longphase_for_intermediate_phasing
Use longphase for intermediate phasing.
--use_whatshap_for_intermediate_phasing
Use whatshap for phasing.
--use_longphase_for_intermediate_haplotagging USE_LONGPHASE_FOR_INTERMEDIATE_HAPLOTAGGING
Use longphase instead of whatshap for intermediate haplotagging.
--disable_intermediate_phasing
Disable intermediate phasing, runs faster but reduces precision.
--disable_nonsomatic_tagging
Disable non-somatic variants tagging and ignore `--panel_of_normals`.
--disable_verdict
Disable using verdict to tag the variants in CNA regions. We suggest using the parameter only for sample with tumor purity estimation lower than 0.8. Default: Enabled.
./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz -C chr21,chr22
./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz -r chr20:1000000-2000000
Call Variants at interested variant sites (genotyping) using the -G/--genotyping_mode_vcf_fn
parameter
./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz -G input.vcf
We highly recommended using BED file to define multiple regions of interest like:
echo -e "${CTG1}\t${START_POS_1}\t${END_POS_1}" > input.bed
echo -e "${CTG2}\t${START_POS_2}\t${END_POS_2}" >> input.bed
...
Then:
./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz -B input.bed
ClairS-TO by default tags variants if they exist in provided panel of normals (PoNs, i.e., gnomad.r2.1.af-ge-0.001.sites.vcf.gz
, dbsnp.b138.non-somatic.sites.vcf.gz
, 1000g-pon.sites.vcf.gz
, and CoLoRSdb.GRCh38.v1.1.0.deepvariant.glnexus.af-ge-0.001.vcf.gz
), and pass the filters listed in the table below.
Users can also use their own PoNs for tagging using the --panel_of_normals
option.
Particularly, if the --panel_of_normals
option is not specified, the four default PoNs will be included. And if users want to use all/part/none of the default PoNs as well as their own PoNs, corresponding file paths of the default PoNs (i.e., ${CONDA_PREFIX}/bin/clairs-to_databases/gnomad.r2.1.af-ge-0.001.sites.vcf.gz
, ${CONDA_PREFIX}/bin/clairs-to_databases/dbsnp.b138.non-somatic.sites.vcf.gz
, ${CONDA_PREFIX}/bin/clairs-to_databases/1000g-pon.sites.vcf.gz
, and ${CONDA_PREFIX}/bin/clairs-to_databases/CoLoRSdb.GRCh38.v1.1.0.deepvariant.glnexus.af-ge-0.001.vcf.gz
), and their own PoNs, should be included in the --panel_of_normals
option, split by ,
.
In addition, we recommend using --panel_of_normals_require_allele_matching
option that takes comma separated booleans to indicate whether to require allele matching for each of the PoNs given in --panel_of_normals
. By default, allele matching is enabled when using germline variants sources (e.g., gnomAD, dbSNP) for non-somatic tagging, and is disabled when using panels (e.g., 1000G PoN, CoLoRSdb).
NOTE: the content of this research code repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.