The m_xanthus_10_assemblies.unitigs.gfa
is produced from the 10 assemblies mentioned in the supplementary materials of the paper. The assemblies were concatenated together and then given to bcalm
to produce a de Briujn Graph with k-mer size of 41. The command used to produce the unitigs is:
$ bcalm -in m_xanthus_10_assemblies.fa -kmer-size 41 -abundance-min 1 -nb-cores 5
This produced the file m_xanthus_10_assemblies.unitigs.fa
which can be converted to a gfa file using this script from bcalm
:
$ python3 convertToGFA.py m_xanthus_10_assemblies.unitigs.fa m_xanthus_10_assemblies.unitigs.gfa 41
To detect bubble chains, the bchains
subcommand was used without arguments, which will only detect bubble chains and produces some statistics:
$ BubbleGun -g m_xanthus_10_assemblies.unitigs.gfa bchains
The number of Simple Bubbles is 40381
The number of Superbubbles is 41356
The number of insertions is 0
Sequence coverage of the bubble chains is 91.79359052995717%
Node coverage of the bubble chains is 94.32604093154552%
The longest chain seq-wise has 746155 bp
The longest chain bubble_wise has 916 bubbles
One can extract the biggest connected component in terms of number of nodes in the graph by using the following command:
$ BubbleGun -g m_xanthus_10_assemblies.unitigs.gfa biggestcomp m_xanthus_10_assemblies.unitigs.biggest_comp.gfa
To extract the chains into a separate GFA file, the following command can be used:
$ BubbleGun -g m_xanthus_10_assemblies.unitigs.gfa bchains --chains_gfa m_xanthus_10_assemblies.unitigs.bchains.gfa
To show that the new subgraph is actually only made of bubble chains. If we run BubbleGun again, we will get 100% coverage now and the same number of simple and super bubbles:
$ BubbleGun -g m_xanthus_10_assemblies.unitigs.bchains.gfa bchains
The number of Simple Bubbles is 40381
The number of Superbubbles is 41356
The number of insertions is 0
Sequence coverage of the bubble chains is 100.0%
Node coverage of the bubble chains is 100.0%
The longest chain seq-wise has 746155 bp
The longest chain bubble_wise has 916 bubbles
To produce the JSON file with all bubbles information, the following command can be used:
$ BubbleGun -g m_xanthus_10_assemblies.unitigs.gfa bchains --bubble_json m_xanthus_10_assemblies.unitigs.bchains.json
The JSON file has the following structure:
{
"str_chain_id": { "chain_id": "str_chain_id", "end": ["str_start_node_id", "str_end_node_id"], "bubbles": [ { "type": "type", "id": "int_bubble_id", "ends": ["str_start_node_id", "str_end_node_id"], "inside": ["str_node_id","..."] }, { "..." }, { "..." } ], "parent_sb": "int_id", "parent_chain": "int_id" }}
So it starts with a string id of the chain, inside of that there are these keys 'chain_id', 'ends', 'bubbles', 'parent_sb', 'parent_chain'
, if this chains was not nested, then there won't be a parent sb and parent chain.
One can for example load this JSON file into python and inspect the different chains:
import json
with open("m_xanthus_10_assemblies.unitigs.bchains.json", "r") as in_file:
bubble_chains = json.load(in_file)
# let us for example get a list of chains ids with length of 10 bubbles
chain_list = []
for chain_key, chain in bubble_chains.items():
if len(chain['bubbles']) == 10: chain_list.append(chain_key)# chain_list will be a list of strings that are the ids, and to get any chain, simple use that id as key
bubble_chains['425']
# {'chain_id': 425,
# 'ends': ['434125', '436190'],
# 'bubbles': [{'type': 'super',
# 'id': 1298,
# 'ends': ['434125', '324091'],
# 'inside': ['130678', '291683', '271121', '340902', '20122']},
# {'type': 'simple',
# 'id': 1299,
# 'ends': ['541197', '324091'],
# 'inside': ['454725', '485937']},
# {'type': 'simple',
# 'id': 1300,
# 'ends': ['541197', '436190'],
# 'inside': ['144223', '255360']}],
# 'parent_sb': 31898, # 'parent_chain': 8046}
You can separate a certain chain using chainout
subcommand. For example, we can extract the parent of chain 425 from the previous example,
to see if it was actually nested:
$ BubbleGun -g m_xanthus_10_assemblies.unitigs.bchains.gfa chainout --json_file m_xanthus_10_assemblies.unitigs.bchains.json --chain_ids 8046 --output_chain chain_8046.gfa
This will produce a separate GFA file with only that parent chain which we can easily visualize in tools like Bandage or gfaviz.
To show that BubbleGun also works on bluntified graphs, we constructed a pangenome from the 10 Myxococcus xanthus assemblies using minigraph. The command used was:
$ minigraph -xggs -t 2 GCA_000340515.1_Myxococcus_xanthus_DZF1_genomic.fna GCA_900106535.1_IMG-taxon_2693429903_annotated_assembly_genomic.fna GCA_000012685.1_ASM1268v1_genomic.fna GCA_000278585.2_ASM27858v2_genomic.fna GCA_006400955.1_ASM640095v1_genomic.fna GCA_006401215.1_ASM640121v1_genomic.fna GCA_006401635.1_ASM640163v1_genomic.fna GCA_006402015.1_ASM640201v1_genomic.fna GCA_006402415.1_ASM640241v1_genomic.fna GCA_006402735.1_ASM640273v1_genomic.fna > minigraph_m_xanthus_10_assemblies_pangenome.gfa
.
The graph output of minigraph has 3775 nodes and 5043 edges, and bubble gun was able to detect bubble chains in 0.4 seconds and reported:
$ BubbleGun -g minigraph_m_xanthus_10_assemblies_pangenome.gfa bchains
The number of Simple Bubbles is 655
The number of Superbubbles is 314
The number of insertions is 75
Sequence coverage of the bubble chains is 88.48293973667816%
Node coverage of the bubble chains is 98.72847682119205%
The longest chain seq-wise has 885640 bp
The longest chain bubble_wise has 82 bubbles
First of all, the public HG002 short reads was used, and corrected using Lighter, the following command was used:
$ lighter -r D1_S1_L001_R1_001.fastq.gz -r D1_S1_L001_R1_002.fastq.gz -r D1_S1_L001_R1_003.fastq.gz -r D1_S1_L001_R1_004.fastq.gz -r D1_S1_L001_R1_005.fastq.gz -r D1_S1_L001_R1_006.fastq.gz -r D1_S1_L001_R1_007.fastq.gz -r D1_S1_L001_R1_008.fastq.gz -r D1_S1_L001_R1_009.fastq.gz -r D1_S1_L001_R1_010.fastq.gz -r D1_S1_L001_R1_011.fastq.gz -r D1_S1_L001_R1_012.fastq.gz -r D1_S1_L001_R1_013.fastq.gz -r D1_S1_L001_R1_014.fastq.gz -r D1_S1_L001_R1_015.fastq.gz -r D1_S1_L001_R1_016.fastq.gz -r D1_S1_L001_R1_017.fastq.gz -r D1_S1_L001_R2_001.fastq.gz -r D1_S1_L001_R2_002.fastq.gz -r D1_S1_L001_R2_003.fastq.gz -r D1_S1_L001_R2_004.fastq.gz -r D1_S1_L001_R2_005.fastq.gz -r D1_S1_L001_R2_006.fastq.gz -r D1_S1_L001_R2_007.fastq.gz -r D1_S1_L001_R2_008.fastq.gz -r D1_S1_L001_R2_009.fastq.gz -r D1_S1_L001_R2_010.fastq.gz -r D1_S1_L001_R2_011.fastq.gz -r D1_S1_L001_R2_012.fastq.gz -r D1_S1_L001_R2_013.fastq.gz -r D1_S1_L001_R2_014.fastq.gz -r D1_S1_L001_R2_015.fastq.gz -r D1_S1_L001_R2_016.fastq.gz -r D1_S1_L001_R2_017.fastq.gz -r D1_S1_L002_R1_001.fastq.gz -r D1_S1_L002_R1_002.fastq.gz -r D1_S1_L002_R1_003.fastq.gz -r D1_S1_L002_R1_004.fastq.gz -r D1_S1_L002_R1_005.fastq.gz -r D1_S1_L002_R1_006.fastq.gz -r D1_S1_L002_R1_007.fastq.gz -r D1_S1_L002_R1_008.fastq.gz -r D1_S1_L002_R1_009.fastq.gz -r D1_S1_L002_R1_010.fastq.gz -r D1_S1_L002_R1_011.fastq.gz -r D1_S1_L002_R1_012.fastq.gz -r D1_S1_L002_R1_013.fastq.gz -r D1_S1_L002_R1_014.fastq.gz -r D1_S1_L002_R1_015.fastq.gz -r D1_S1_L002_R1_016.fastq.gz -r D1_S1_L002_R1_017.fastq.gz -r D1_S1_L002_R2_001.fastq.gz -r D1_S1_L002_R2_002.fastq.gz -r D1_S1_L002_R2_003.fastq.gz -r D1_S1_L002_R2_004.fastq.gz -r D1_S1_L002_R2_005.fastq.gz -r D1_S1_L002_R2_006.fastq.gz -r D1_S1_L002_R2_007.fastq.gz -r D1_S1_L002_R2_008.fastq.gz -r D1_S1_L002_R2_009.fastq.gz -r D1_S1_L002_R2_010.fastq.gz -r D1_S1_L002_R2_011.fastq.gz -r D1_S1_L002_R2_012.fastq.gz -r D1_S1_L002_R2_013.fastq.gz -r D1_S1_L002_R2_014.fastq.gz -r D1_S1_L002_R2_015.fastq.gz -r D1_S1_L002_R2_016.fastq.gz -r D1_S1_L002_R2_017.fastq.gz -t 30 -K 21 6000000000 &
1- The resulting corrected files are concatenated and the de Bruijn Graph was constructed using bcalm2 with the following command on the corrected reads:
$ bcalm -nb-cores 40 -in all_reads.cor.fq.gz -kmer-size 61 -abundance-min 3
2- The bcalm
command will generate a fasta file with unitigs that can be turned into a graph file in GFA format using this script, with the following command:
$ python3 convertToGFA.py all_reads.cor.fq.unitigs.fa all_reads.cor.fq.unitigs.gfa 61
3- Tips were removed using the UntipRelative.cpp from GraphAligner, the following command was used:
$ ./UntipRelative 100 200 0.05 < all_reads.cor.fq.unitigs.gf > HG002_short_reads_k61_unitigs_untipped.gfa
The file HG002_short_reads_k61_unitigs_untipped.gfa
is uploaded here and the user can use to test BubbelGun
with.
1- First thing we need to do is to compact the graph, as removing some tips might have generated linear stretches of nodes that can be compacted, the following copmmand is used:
$ BubbleGun -g HG002_short_reads_k61_unitigs_untipped.gfa compact HG002_short_reads_k61_unitigs_untipped.compacted.gfa
2- Now that the graph is compacted, we can detect bubbles and produce the bubbles information json file and a text file with simple stats about the bubbles using the following command:
$ BubbleGun -g HG002_short_reads_k61_unitigs_untipped.compacted.gfa bchains --bubble_json hg002_bubbles.json > hg002_bubbles_stats.txt
3- To validate the simple bubbles and compare them to the GIAB high-confidence variants, we used this Snakemake pipeline that takes two FASTA file where each haplotype is in a file. To produce such haplotypes from simple bubbles chains using BubbleGun
the following command was used:
$ BubbleGun -g HG002_short_reads_k61_unitigs_untipped.compacted.gfa bchains --only_simple --out_haplos
Where this command will produce two FASTA files named haplotype1.fasta
and haplotype2.fasta
, where each chain of simple bubbles is split into two complementary linear paths (two haplotypes) and outputted into the fasta files.
4- To use the Snakemake pipeline to produce variants, this Snakemake pipeline is used, and the user just need to provide a reference and the two haplotype FASTA files produced in the config file, example config.json
:
{
"reference":
{
"filename": "GRCh38_full_analysis_set_plus_decoy_hla.fa",
"prefix": "chr"
},
"assemblies":
{
"sample1" : ["abundance_min_3/haplotype1.fasta", "abundance_min_3/haplotype2.fasta"]
},
"trios":
{
},
"scripts": "scripts",
"outdir" : "merged/"
}