-
Notifications
You must be signed in to change notification settings - Fork 9
07. Comparison
In this workflow, sourmash version 2.1.0 performs pairwise-comparisons of the k-mer content across multiple samples without the use of a reference database. Because many variables can impact k-mer content estimates, such as varying depths of sequencing in different samples, it is recommended that this workflow only be used to compare the k-mer content of the same sample after using different trimming and assembly parameters. This allows users to evaluate how their sample's overall k-mer content was impacted by read filtering and assembly. Logistically, this means only one sample name should be present in the config file when executing rules within the comparison workflow.
The comparison workflow first generates MinHash signatures on the filtered reads (i.e., the Read Filtering Workflow output) or the assembled contigs (i.e., the Assembly Workflow output) with sourmash compute
. The sourmash compare
functionality then compares signatures from multiple datasets to each other and outputs their pairwise comparisons as Jaccard indexes in a final *.csv file. Ideally, this workflow is designed to compare MinHash signatures of different quality trimmed values and assemblers for the same sample and output results in csv files and heatmap visualizations.
This workflow has been tested to run offline following the execution of the Read Filtering and Assembly workflows. If you have not already, you will need to perform the Offline Setup for the comparison workflow (note: make sure you have activated you metscale environment):
[user@localhost ~]$ conda activate metscale
(metscale)[user@localhost ~]$ cd metscale/workflows
(metscale)[user@localhost ~]$ python download_offline_files.py --workflow comparison
In the metscale/container_images/
directory, you should see the following Singularity images that were created when running the comparison or all flag during the Offline Setup:
File Name | File Size |
---|---|
sourmash_2.1.0--py27he1b5a44_0.sif |
271 MB |
If you are missing this image, you should re-run the setup command, as per instructions in the Offline Setup.
The comparison workflow can use filtered reads (outputs from the Read Filtering Workflow) and/or assembled contigs (outputs from the Assembly Workflow) as its inputs. These input files should be located in the metscale/workflows/data
directory:
File Name | File Size |
---|---|
SRR606249_subset10_1_reads_trim2_1.fq.gz |
365 MB |
SRR606249_subset10_1_reads_trim2_2.fq.gz |
359 MB |
SRR606249_subset10_1_reads_trim30_1.fq.gz |
313 MB |
SRR606249_subset10_1_reads_trim30_2.fq.gz |
300 MB |
SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa |
153 MB |
SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa |
142 MB |
SRR606249_subset10_1_reads_trim2.megahit.contigs.fa |
127 MB |
SRR606249_subset10_1_reads_trim30.megahit.contigs.fa |
115 MB |
If these files look good to go, then you may proceed to run the rest of the workflow offline for the example dataset.
Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.
After the config file is ready, be sure to specify the Singularity bind path from the metscale/workflows
directory before running the comparison workflow.
cd metscale/workflows
export SINGULARITY_BINDPATH="data:/tmp"
Execution of the comparison workflow can then be performed with the following command:
snakemake --use-singularity {rules} {other options}
The following rules are available for execution in the comparison workflow (yellow star indicates the terminal rule):
The comparison rules and their parameters are listed under "workflows" in the metscale/workflows/config/default_workflowconfig.settings
config file.
Rule | Description |
---|---|
comparison_reads_workflow |
Sourmash computes the MinHash signatures of filtered reads, and then performs pairwise comparisons to generate Jaccard indexes |
comparison_assembly_workflow |
Sourmash computes the MinHash signatures of assembled contigs, and then performs pairwise comparisons to generate Jaccard indexes |
comparison_reads_assembly_workflow |
Sourmash computes the MinHash signatures of filtered reads and assembled contigs, and then performs pairwise comparisons to generate Jaccard indexes |
comparison_output_heatmap_plots_reads_workflow |
Generates a heatmap plot to visualize the csv output from the compared reads |
comparison_output_heatmap_plots_assembly_workflow |
Generates a heatmap plot to visualize the csv output from the compared assemblies |
comparison_output_heatmap_plots_reads_assembly_workflow |
Generates a heatmap plot to visualize the csv output from the compared reads and assemblies |
comparison_output_heatmap_plots_all_workflow |
Generates a heatmap plot to visualize all csv outputs. When run alone, this rule essentially performs all the former rules to generate all comparisons and data outputs. |
For this workflow, both the filtered reads and the assembled contigs can be compared with the following command:
snakemake --use-singularity comparison_reads_assembly_workflow
The following command will compare only filtered reads:
snakemake --use-singularity comparison_reads_workflow
The following command will compare only assembled contigs:
snakemake --use-singularity comparison_assembly_workflow
The following command will view the read comparisons as a heatmap visualization:
snakemake --use-singularity comparison_output_heatmap_plots_reads_workflow
The following command will view the assembly comparisons as a heatmap visualization:
snakemake --use-singularity comparison_output_heatmap_plots_assembly_workflow
The following command will view the combined read and assembly comparison as a heatmap visualization:
snakemake --use-singularity comparison_output_heatmap_plots_reads_assembly_workflow
In the event that you wish to execute all the available options for the comparison workflow, you may do so by simply executing the following rule:
snakemake --use-singularity comparison_output_heatmap_plots_all_workflow
Additional options for snakemake can be found in the snakemake documentation.
To specify your own parameters for this or any of the workflows prior to execution, see Workflow Architecture for more information.
After successful execution of the comparison workflow, you will find all of your outputs in the metscale/workflows/data/
directory. You should expect to see the following files for each sample and quality threshold value:
Tool Output | File Name | Description |
---|---|---|
Sourmash | {sample}_1_reads_trim{qual}_ scaled{scaled_value}.k{value_of_k}.sig |
MinHash signatures of filtered reads generated by sourmash compute |
Sourmash | {sample}_1_reads_trim{qual}.{assembler}_ scaled{scaled_value}.k{value_of_k}.sig |
MinHash signatures of assembled contigs generated by sourmash compute |
Sourmash | {sample}_1_reads_trim{qual}and{qual} _read_comparison.k{value_of_k}.csv |
Jaccard indexes generated by sourmash compare from the MinHash signatures of filtered reads |
Sourmash | {sample}_1_reads_trim{qual}and{qual} _assembly_comparison.k{value_of_k}.csv |
Jaccard indexes generated by sourmash compare from the MinHash signatures of assemblies |
Sourmash | {sample}_1_reads_and{sample}_1_reads_ read_assembly_comparison.k{value_of_k}.csv |
Jaccard indexes generated by sourmash compare from the MinHash signatures of filtered reads and assemblies |
Custom R script | {sample}_1_reads_trim{qual}and{qual}_ read_comparison.{value_of_k}.png |
Heatmap visualization of the Jaccard indexes from comparisons of filtered reads |
Custom R script | {sample}_1_reads_trim{qual}and{qual}_ assembly_comparison.{value_of_k}.png |
Heatmap visualization of the Jaccard indexes from comparisons of assembled contigs |
Custom R script | {sample}_1_readsand{sample}_reads_ read_assembly_comparison.{value_of_k}.png |
Heatmap visualization of the Jaccard indexes from comparisons of filtered reads and assembled contigs |
MinHash signature files are not meant for human interpretation, but these are critical to the comparison workflow, as well as to sourmash gather in the Taxonomic Classification workflow.
The first line of the output *.csv file with the Jaccard indexes describes the order of sample names in the comparison matrix. A sample compared to itself will have a Jaccard index of 1, and a sample with no similarity to another will have a Jaccard index of 0.
To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools. Note that the file names in the below examples may not be exact replicates of the file naming conventions in the current workflows, but the commands are equivalent.
Note that MinHash signatures for different k-mer lengths (i.e., values of k) are merged in the comparison workflow to produce a single signature file, but there is one command for each value of k in the examples below. The information is all the same, but more individual signature files are produced by the below example commands (i.e., one signature file for each value of k) than are generated by the snakemake workflow.
The MinHash signatures of reads filtered with a quality score threshold of 2 are computed with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compute --merge {sample}_reads_trim2_1.fq.gz {sample}_reads_trim2_2.fq.gz --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute --merge SRR606249_subset10_reads_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2_scaled10000.k21.sig
sourmash compute --merge SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2_scaled10000.k31.sig
sourmash compute --merge SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2_scaled10000.k51.sig
The MinHash signatures of reads filtered with a quality score threshold of 30 are computed with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compute --merge {sample}_reads_trim30_1.fq.gz {sample}_reads_trim30_2.fq.gz --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30_scaled10000.k21.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30_scaled10000.k31.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30_scaled10000.k51.sig
After multiple MinHash signatures have been generated from the filtered reads, they are compared with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compare {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}_trim2and30_read_comparison.{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k21.sig SRR606249_subset10_trim30_scaled10000.k21.sig --csv SRR606249_subset10_trim2and30_read_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k31.sig SRR606249_subset10_trim30_scaled10000.k31.sig --csv SRR606249_subset10_trim2and30_read_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k51.sig SRR606249_subset10_trim30_scaled10000.k51.sig --csv SRR606249_subset10_trim2and30_read_comparison.k51.csv
The comparison of assembled contigs from both MEGAHIT and metaSPAdes is also performed in a multiple step process, first by running sourmash compute
on both sets of assembled contigs to generate signatures, then by running sourmash compare
to determine the similarity between these signature files. This is equivalent to running the commands below.
The MinHash signatures of metaSPAdes and MEGAHIT contigs assembled from reads filtered with a quality score threshold of 2 are computed with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compute {sample}_trim2.{assembler}/contigs.fasta --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim2.{assembler}_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2.megahit_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2.megahit_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2.megahit_scaled10000.k51.sig
The MinHash signatures of metaSPAdes and MEGAHIT contigs assembled from reads filtered with a quality score threshold of 30 are computed with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compute {sample}_trim30.{assembler}/contigs.fasta --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim30.{assembler}_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute SRR606249_subset10_trim30_metaspades/contigs.fasta --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim30.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim30.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30.megahit_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30.megahit_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30.megahit_scaled10000.k51.sig
After multiple MinHash signatures have been generated from the assembled contigs, they are compared with the following commands (one command for each value of k - 21, 31, and 51):
sourmash compare {sample}_trim2.{assembler}_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30.{assembler}_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}_trim2and30_assembly_comparison.k{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig SRR606249_subset10_trim2.megahit_scaled10000.k21.sig SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig SRR606249_subset10_trim30.megahit_scaled10000.k21.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig SRR606249_subset10_trim2.megahit_scaled10000.k31.sig SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig SRR606249_subset10_trim30.megahit_scaled10000.k31.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig SRR606249_subset10_trim2.megahit_scaled10000.k51.sig SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig SRR606249_subset10_trim30.megahit_scaled10000.k51.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k51.csv
The comparison of both reads and assembled contigs is performed by first running all the sourmash compute
commands listed above to generate signatures for all of the trimmed reads and assembled contigs. Then sourmash compare
is run with all of these signature files, which is equivalent to running the following command:
sourmash compare {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}and{sample}_read_assembly_comparison.k{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k21.sig SRR606249_subset10_trim30_scaled10000.k21.sig SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig SRR606249_subset10_trim2.megahit_scaled10000.k21.sig SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig SRR606249_subset10_trim30.megahit_scaled10000.k21.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.31.sig SRR606249_subset10_trim30_scaled10000.k31.sig SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig SRR606249_subset10_trim2.megahit_scaled10000.k31.sig SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig SRR606249_subset10_trim30.megahit_scaled10000.k31.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.51.sig SRR606249_subset10_trim30_scaled10000.k51.sig SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig SRR606249_subset10_trim2.megahit_scaled10000.k51.sig SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig SRR606249_subset10_trim30.megahit_scaled10000.k51.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k51.csv
Below is a more detailed description of the output files expected in the metscale/workflows/data/
directory after the comparison workflow has been successfully run.
Using the filtered reads generated by the Read Filtering Workflow and/or assembled contigs generated by the Assembly Workflow:
File Name | File Size |
---|---|
SRR606249_subset10_1_reads_trim2_1.fq.gz |
365 MB |
SRR606249_subset10_1_reads_trim2_2.fq.gz |
359 MB |
SRR606249_subset10_1_reads_trim30_1.fq.gz |
313 MB |
SRR606249_subset10_1_reads_trim30_2.fq.gz |
300 MB |
SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa |
153 MB |
SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa |
142 MB |
SRR606249_subset10_1_reads_trim2.megahit.contigs.fa |
127 MB |
SRR606249_subset10_1_reads_trim30.megahit.contigs.fa |
115 MB |
The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different filtered read datasets with the comparison_workflow_reads
rule:
File Name | File Size |
---|---|
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig |
1 MB |
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig |
903 KB |
SRR606249_subset10_1_reads_trim2and30_read_comparison.k21.csv |
0.1 KB |
SRR606249_subset10_1_reads_trim2and30_read_comparison.k31.csv |
0.1 KB |
SRR606249_subset10_1_reads_trim2and30_read_comparison.k51.csv |
0.1 KB |
The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different assembled contig datasets with the comparison_workflow_assembly
rule:
File Name | File Size |
---|---|
SRR606249_subset10_1_reads_trim2.metaspades_scaled10000.k21_31_51.sig |
702 KB |
SRR606249_subset10_1_reads_trim30.metaspades_scaled10000.k21_31_51.sig |
646 KB |
SRR606249_subset10_1_reads_trim2.megahit_scaled10000.k21_31_51.sig |
605 KB |
SRR606249_subset10_1_reads_trim30.megahit_scaled10000.k21_31_51.sig |
547 KB |
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k21.csv |
0.4 KB |
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k31.csv |
0.4 KB |
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k51.csv |
0.4 KB |
The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different filtered read and assembled contig datasets with the comparison_workflow_reads_assembly
rule:
File Name | File Size |
---|---|
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig |
1.1 MB |
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig |
883 KB |
SRR606249_subset10_1_reads_trim2.metaspades_scaled10000.k21_31_51.sig |
693 KB |
SRR606249_subset10_1_reads_trim30.metaspades_scaled10000.k21_31_51.sig |
639 KB |
SRR606249_subset10_1_reads_trim2.megahit_scaled10000.k21_31_51.sig |
593 KB |
SRR606249_subset10_1_reads_trim30.megahit_scaled10000.k21_31_51.sig |
536 KB |
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads_ read_assembly_comparison.k21.csv |
0.9 KB |
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads read_assembly_comparison.k31.csv |
0.9 KB |
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads_ read_assembly_comparison.k51.csv |
0.9 KB |
The tables below summarize the Jaccard indexes in the final *.csv files after analyzing the SRR606249_subset10_1_reads dataset.
k21 results | reads_trim2 |
reads_trim30 |
megahit_trim2 |
metaspades_trim2 |
megahit_trim30 |
metaspades_trim30 |
---|---|---|---|---|---|---|
reads_trim2 |
1.00 |
0.91 |
0.61 |
0.72 |
0.56 |
0.67 |
reads_trim30 |
0.91 |
1.00 |
0.71 |
0.82 |
0.66 |
0.79 |
megahit_trim2 |
0.61 |
0.71 |
1.00 |
0.82 |
0.89 |
0.82 |
metaspades_trim2 |
0.72 |
0.82 |
0.82 |
1.00 |
0.75 |
0.91 |
megahit_trim30 |
0.56 |
0.66 |
0.89 |
0.75 |
1.00 |
0.81 |
metaspades_trim30 |
0.67 |
0.79 |
0.82 |
0.91 |
0.81 |
1.00 |
k31 results | reads_trim2 |
reads_trim30 |
megahit_trim2 |
metaspades_trim2 |
megahit_trim30 |
metaspades_trim30 |
---|---|---|---|---|---|---|
reads_trim2 |
1.00 |
0.90 |
0.60 |
0.70 |
0.55 |
0.65 |
reads_trim30 |
0.90 |
1.00 |
0.70 |
0.81 |
0.66 |
0.78 |
megahit_trim2 |
0.60 |
0.70 |
1.00 |
0.81 |
0.88 |
0.81 |
metaspades_trim2 |
0.70 |
0.81 |
0.81 |
1.00 |
0.75 |
0.90 |
megahit_trim30 |
0.55 |
0.66 |
0.88 |
0.75 |
1.00 |
0.80 |
metaspades_trim30 |
0.65 |
0.78 |
0.81 |
0.90 |
0.80 |
1.00 |
k51 results | reads_trim2 |
reads_trim30 |
megahit_trim2 |
metaspades_trim2 |
megahit_trim30 |
metaspades_trim30 |
---|---|---|---|---|---|---|
reads_trim2 |
1.00 |
0.87 |
0.61 |
0.69 |
0.56 |
0.64 |
reads_trim30 |
0.87 |
1.00 |
0.70 |
0.77 |
0.67 |
0.76 |
megahit_trim2 |
0.61 |
0.70 |
1.00 |
0.81 |
0.87 |
0.81 |
metaspades_trim2 |
0.69 |
0.77 |
0.81 |
1.00 |
0.74 |
0.90 |
megahit_trim30 |
0.56 |
0.67 |
0.87 |
0.74 |
1.00 |
0.80 |
metaspades_trim30 |
0.64 |
0.76 |
0.81 |
0.90 |
0.80 |
1.00 |
Overall, this example showed the same trends in similarity across different k-mer lengths, regardless of trimming quality threshold or assembly status. Additional observations from this example include the following:
- Datasets showed slightly more similarity to each other when evaluated with smaller k-mer lengths (i.e., k = 21) than larger k-mer lengths (i.e., k = 51)
- Raw reads trimmed at different quality thresholds were slightly more similar to each other than the assembled contigs
- Assembled contigs shared slightly more of their k-mer content if they were created by the same assembler