Skip to content

05. Read Filtering

Krista Ternus edited this page Nov 1, 2020 · 16 revisions

Read Filtering

Table of Contents

Workflow Overview

The tools within this workflow perform read filtering and adapter trimming with Trimmomatic version 0.36, quality assessment of paired-end Illumina reads with FastQC version 0.11.7 and MultiQC version 1.4, and additional read processing options with scripts from Khmer version 2.1. To successfully run this workflow, it is assumed that you first ran through the Install and Offline Setup pages of this documentation.

Required Files

If you have not already, you will need to activate your metscale environment and perform the Offline Setup for the read filtering workflow before proceeding:

[user@localhost ~]$ source activate metscale

(metscale)[user@localhost ~]$ cd metscale/workflows 

(metscale)[user@localhost workflows]$ python download_offline_files.py --workflow test_files 

(metscale)[user@localhost workflows]$ python download_offline_files.py --workflow read_filtering

Singularity Images

In the metscale/container_images/ directory, you should see the following Singularity images that were created when running the read_filtering or all flag during the Offline Setup:

File Name File Size
trimmomatic_0.36--5.sif 76 MB
fastqc_0.11.7--pl5.22.0_2.sif 107 MB
multiqc_1.4--py35_0.sif 453 MB
khmer_2.1--py35_0.sif 47 MB

Input Files

If you ran the setup with the test_files and read_filtering flags (or the all flag), the following files should be present in the metscale/workflows/data/ directory.

File Name Description
SRR606249_subset10_1_reads.fq.gz Forward reads, downloaded with the test_files flag
SRR606249_subset10_2_reads.fq.gz Reverse reads, downloaded with the test_files flag
adapters_combined_256_unique.fasta Adapter file downloaded with the read_filtering flag
File Name File Size MD5 Checksum
SRR606249_subset10_1_reads.fq.gz 358 MB 44f730ecada62ee6bd4a3543e092c907
SRR606249_subset10_2_reads.fq.gz 352 MB 0d469c0cc4e220e94e2c312f3187d34e
adapters_combined_256_unique.fasta 26 KB 755f8f496747675bcdb56b90f7ac51a2

If you are missing any of these files, you should re-run the appropriate offline setup command in the metscale/workflows directory, as per instructions in the Offline Setup.

It is recommended that users edit and run custom configs based on their samples' default {sample}_1_reads.fq.gz or Illumina {sample}_*_S*_L*_R1_001.fastq.gz naming pattern.

Workflow Execution

Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.

After the config file is ready, be sure to specify the Singularity bind path from the metscale/workflows directory before running the read filtering workflow.

cd metscale/workflows
export SINGULARITY_BINDPATH="data:/tmp"

You can then execute of the workflow through snakemake using the following command:

snakemake --use-singularity {rules} {other options}

The following rules are available for execution in the read filtering workflow (yellow stars indicate terminal rules):

The read filtering rules and their parameters are listed under "workflows" in the metscale/workflows/config/default_workflowconfig.settings config file.

Rule Description
read_filtering_pretrim_workflow FastQC generates quality reports for raw reads
read_filtering_posttrim_workflow Trimmomatic trims raw reads, and FastQC generates quality reports for filtered reads
read_filtering_multiqc_workflow MultiQC aggregates all FastQC outputs into a single report for each sample
read_filtering_khmer_interleave_reads_workflow Khmer interleaves the quality trimmed, paired-end reads (forward and reverse) into a single *.fq.gz file
read_filtering_khmer_count_unique_kmers_workflow Khmer counts the number of unique k-mers in the interleaved dataset
read_filtering_khmer_subsample_interleaved_reads_workflow Khmer subsamples a smaller percentage of the reads from the full interleaved dataset
read_filtering_khmer_split_interleaved_reads_workflow Khmer splits the subsampled interleaved read file into two paired-end read files

For the read filtering workflow, rules can be run independently, or they can be run together by listing them back to back in the command as such:

snakemake --use-singularity read_filtering_pretrim_workflow read_filtering_posttrim_workflow read_filtering_multiqc_workflow read_filtering_khmer_interleave_reads_workflow read_filtering_khmer_count_unique_kmers_workflow read_filtering_khmer_subsample_interleaved_reads_workflow read_filtering_khmer_split_interleaved_reads_workflow

Note that the default config will count the number of unique k-mers in the original interleaved dataset.

Additional options for snakemake can be found in the snakemake documentation: https://snakemake.readthedocs.io/en/stable/

To specify your own parameters for this or any of the workflows prior to execution, see the Workflow Architecture page for more information.

Output

After successful execution of the read filtering workflow, you will find all of your outputs in the metscale/workflows/data/ directory. The output files generated with the test dataset were created with the default input file pattern of {sample}_1_reads_{direction}.fq.gz, so the outputs include the {sample}_1_reads_* pattern in their file names. If the default Illumina FASTQ naming convention or another one is used, then a pattern other than {sample}_1_reads_* will appear in the output file names. For the input file pattern of {sample}_1_reads_{direction}.fq.gz, the following files will be created for each sample and read direction:

Tool Output File Name Description
FastQC {sample}_1_reads_{direction}_reads_fastqc.html FastQC HTML report before running Trimmomatic
FastQC {sample}_1_reads_{direction}_reads_fastqc.zip FastQC files before running Trimmomatic
Trimmomatic {sample}_1_reads_trim{quality_threshold} _{direction}.fq.gz Reads that retained their corresponding paired-end after running Trimmomatic
Trimmomatic {sample}_1_reads_trim{quality_threshold} _{direction}_se Reads that lost their paired-end partner after running Trimmomatic; now single end (SE) reads
Trimmomatic {sample}_1_reads_trim{quality_threshold} _trimmomatic_pe.log Log file generated by Trimmomatic
FastQC {sample}_1_reads_trim{quality_threshold} _{direction}_fastqc.html FastQC HTML report after running Trimmomatic
FastQC {sample}_1_reads_trim{quality_threshold} _{direction}_fastqc.zip FastQC files after running Trimmomatic
MultiQC {sample}_1_reads_fastqc_multiqc_report.html MultiQC HTML report, aggregating multiple FastQC outputs
MultiQC {sample}_1_reads_fastqc_multiqc_report_data/ Directory with additional MultiQC data and statistics
khmer script: interleave-reads.py {sample}_1_reads_trim{quality_threshold} _interleaved_reads.fq.gz Interleaved forward and reverse paired-end reads
khmer script: unique-kmers.py {sample}_1_reads_trim{quality_threshold} _interleaved_uniqueK{k-mer_length}.txt A text file for each k-mer length (default: k=21, k=31, and k=51) that lists the number of unique k-mers counted in an interleaved file
khmer script: sample-reads-randomly.py {sample}_1_reads_trim{quality_threshold} _subset{percent}_interleaved_reads.fq.gz A smaller subsampled percentage of the interleaved reads
khmer script: split-paired-reads.py {sample}_1_reads_trim{quality_threshold} _subset{percent}_{direction}.fq.gz Paired-end FASTQ reads from the subsampled data

After you have reviewed these files, make sure they stay in the metscale/workflows/data/ directory. You are now ready to proceed to the Assembly workflow page. Alternatively, the trimmed reads can now be used with specific rules in the Taxonomic Classification, Comparison, or Functional Inference workflows.

Additional Information

Command Line Equivalents

To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools. Note that the file names in the below examples may not be exact replicates of the file naming conventions in the current workflows, but the commands are equivalent.

The pre-trimming quality control step is equivalent to running FastQC with these commands:

fastqc {input_1_reads.fq.gz} -o {output_1_reads_fastqc}
fastqc {input_2_reads.fq.gz} -o {output_2_reads_fastqc}
fastqc SRR606249_subset10_1_reads.fq.gz -o SRR606249_subset10_1_reads_fastqc
fastqc SRR606249_subset10_2_reads.fq.gz -o SRR606249_subset10_2_reads_fastqc

Read filtering with a quality score threshold of 2 is equivalent to running Trimmomatic with these commands:

trimmomatic PE {input_1_reads.fq.gz} {input_2_reads.fq.gz} {trimmomatic_output_trim2_1_fq.gz} {trimmomatic_output_trim2_1_se} {trimmomatic_output_trim2_2_fq.gz} {trimmomatic_output_trim2_2_se} ILLUMINACLIP:adapters_combined_256_unique.fasta:2:30:15:1:true LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
trimmomatic PE SRR606249_subset10_1_reads.fq.gz SRR606249_subset10_2_reads.fq.gz SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_1_se SRR606249_subset10_trim2_2.fq.gz SRR606249_subset10_trim2_2_se ILLUMINACLIP:adapters_combined_256_unique.fasta:2:30:15:1:true LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25

Read filtering with a quality score threshold of 30 is equivalent to running Trimmomatic with these commands:

trimmomatic PE {input_1_reads.fq.gz} {input_2_reads.fq.gz} {trimmomatic_output_trim30_1_fq.gz} {trimmomatic_output_trim30_1_se} {trimmomatic_output_trim30_2_fq.gz} {trimmomatic_output_trim30_2_se} ILLUMINACLIP:adapters_combined_256_unique.fasta:2:30:15:1:true LEADING:30 TRAILING:30 SLIDINGWINDOW:4:30 MINLEN:25
trimmomatic PE SRR606249_subset10_1_reads.fq.gz SRR606249_subset10_2_reads.fq.gz SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_1_se SRR606249_subset10_trim30_2.fq.gz SRR606249_subset10_trim30_2_se ILLUMINACLIP:adapters_combined_256_unique.fasta:2:30:15:1:true LEADING:30 TRAILING:30 SLIDINGWINDOW:4:30 MINLEN:25

The post-trimming quality control step is the equivalent of running FastQC with these commands:

fastqc {input_trim2_1_fq.gz} -o {output_trim2_1_fastqc}
fastqc {input_trim2_2_fq.gz} -o {output_trim2_2_fastqc}
fastqc {input_trim30_1_fq.gz} -o {output_trim30_1_fastqc}
fastqc {input_trim30_2_fq.gz} -o {output_trim30_2_fastqc}
fastqc SRR606249_subset10_trim2_1.fq.gz -o SRR606249_subset10_trim2_1_fastqc
fastqc SRR606249_subset10_trim2_2.fq.gz -o SRR606249_subset10_trim2_2_fastqc
fastqc SRR606249_subset10_trim30_1.fq.gz -o SRR606249_subset10_trim30_1_fastqc
fastqc SRR606249_subset10_trim30_2.fq.gz -o SRR606249_subset10_trim30_2_fastqc

The MultiQC step in the read filtering workflow is the equivalent of running this command (where each input file name is individually listed):

multiqc {output_reads_1_fastqc.zip} {output_reads_2_fastqc.zip} {output_trim2_1_fastqc.zip} {output_trim2_2_fastqc.zip} {output_trim30_1_fastqc.zip} {output_trim30_2_fastqc.zip} -n {sample_multiqc_fastqc_report}
multiqc SRR606249_subset10_1_reads_fastqc.zip SRR606249_subset10_2_reads_fastqc.zip SRR606249_subset10_trim2_1_fastqc.zip SRR606249_subset10_trim2_2_fastqc.zip SRR606249_subset10_trim30_1_fastqc.zip SRR606249_subset10_trim30_2_fastqc.zip -n SRR606249_subset10_multiqc_fastqc_report

An equivalent MultiQC command could also be run by inputting all FastQC results in the directory (which in this case is from the same sample):

multiqc *fastqc.zip -n {sample_multiqc_fastqc_report}
multiqc *fastqc.zip -n SRR606249_subset10_multiqc_fastqc_report

The khmer interleave-reads.py script produces an interleaved file from each set of trimmed paired-end reads with the equivalent of these commands:

interleave-reads.py {sample_trim2_1.fq.gz} {sample_trim2_2.fq.gz} --no-reformat -o {sample_trim2_interleave_reads.fq.gz} --gzip
interleave-reads.py {sample_trim30_1.fq.gz} {sample_trim30_2.fq.gz} --no-reformat -o {sample_trim30_interleave_reads.fq.gz} --gzip
interleave-reads.py SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --no-reformat -o SRR606249_subset10_trim2_interleave_reads.fq.gz --gzip
interleave-reads.py SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --no-reformat -o SRR606249_subset10_trim30_interleave_reads.fq.gz --gzip

The khmer unique-kmers.py script estimates the number of unique k-mers within an interleaved file at specified k-mer lengths (default: k=21, k=31, k=51) with the equivalent of these commands:

unique-kmers.py -k 21 {sample_trim2_interleave_reads.fq.gz} -R {sample_trim2_interleaved_uniqueK21.txt}
unique-kmers.py -k 31 {sample_trim2_interleave_reads.fq.gz} -R {sample_trim2_interleaved_uniqueK31.txt}
unique-kmers.py -k 51 {sample_trim2_interleave_reads.fq.gz} -R {sample_trim2_interleaved_uniqueK51.txt}
unique-kmers.py -k 21 {sample_trim30_interleave_reads.fq.gz} -R {sample_trim30_interleaved_uniqueK21.txt}
unique-kmers.py -k 31 {sample_trim30_interleave_reads.fq.gz} -R {sample_trim30_interleaved_uniqueK31.txt}
unique-kmers.py -k 51 {sample_trim30_interleave_reads.fq.gz} -R {sample_trim30_interleaved_uniqueK51.txt}
unique-kmers.py -k 21 SRR606249_subset10_trim2_interleaved_reads.fq.gz -R SRR606249_subset10_trim2_interleaved_uniqueK21.txt
unique-kmers.py -k 31 SRR606249_subset10_trim2_interleaved_reads.fq.gz -R SRR606249_subset10_trim2_interleaved_uniqueK31.txt
unique-kmers.py -k 51 SRR606249_subset10_trim2_interleaved_reads.fq.gz -R SRR606249_subset10_trim2_interleaved_uniqueK51.txt
unique-kmers.py -k 21 SRR606249_subset10_trim30_interleaved_reads.fq.gz -R SRR606249_subset10_trim30_interleaved_uniqueK21.txt
unique-kmers.py -k 31 SRR606249_subset10_trim30_interleaved_reads.fq.gz -R SRR606249_subset10_trim30_interleaved_uniqueK31.txt
unique-kmers.py -k 51 SRR606249_subset10_trim30_interleaved_reads.fq.gz -R SRR606249_subset10_trim30_interleaved_uniqueK51.txt

The khmer sample-reads-randomly.py script uniformly subsamples reads from an interleaved file, using reservoir sampling. Note that the command line equivalent requires the exact number of subsampled reads to be specified (-N), so the read filtering workflow converts the specified percentage in the config file to a corresponding read number by multiplying it with the total number of reads in the sample. The following are the command line equivalents for subsampling reads:

sample-reads-randomly.py -N {10%_of_both_paired-ends_in_full_interleaved_file} -M {subsample_interleave_max_reads} -o {sample_trim2_subset_interleaved_reads.fq.gz} --gzip {sample_trim2_subset_interleaved_reads.fq.gz}
sample-reads-randomly.py -N {10%_of_both_paired-ends_in_full_interleaved_file} -M {subsample_interleave_max_reads} -o {sample_trim30_subset_interleaved_reads.fq.gz} --gzip {sample_trim30_subset_interleaved_reads.fq.gz}
sample-reads-randomly.py -N {number of reads representing a smaller percentage of the full interleaved file} -M 100000000 -o SRR606249_subset10_trim2_subset_interleaved_reads.fq.gz --gzip SRR606249_subset10_trim2_interleaved_reads.fq.gz 
sample-reads-randomly.py -N {number of reads representing a smaller percentage of the full interleaved file} -M 100000000 -o SRR606249_subset10_trim30_subset_interleaved_reads.fq.gz --gzip SRR606249_subset10_trim30_interleaved_reads.fq.gz 

The khmer split-paired-reads.py script splits an interleaved file into two paired-end reads. The following command line equivalent is used to split the subsampled interleaved file into two paired-end reads:

split-paired-reads.py {sample_trim2_subset_interleaved_reads.fq.gz} -1 {sample_trim2_subset10_1.fq.gz} -2 {sample_trim2_subset10_2.fq.gz} --gzip
split-paired-reads.py {sample_trim30_subset_interleaved_reads.fq.gz} -1 {sample_trim30_subset10_1.fq.gz} -2 {sample_trim30_subset10_2.fq.gz} --gzip
split-paired-reads.py SRR606249_subset10_trim2_subset_interleaved_reads.fq.gz -1 SRR606249_subset10_trim2_subset10_1.fq.gz -2 SRR606249_subset10_trim2_subset10_2.fq.gz --gzip
split-paired-reads.py SRR606249_subset10_trim30_subset_interleaved_reads.fq.gz -1 SRR606249_subset10_trim30_subset10_1.fq.gz -2 SRR606249_subset10_trim30_subset10_2.fq.gz --gzip

Expected Output Files for the Example Dataset

Below is a more detailed description of the output files expected in the metscale/workflows/data/ directory after the read filtering workflow has been successfully run.

Using these example raw FASTQ files:

File Name File Size
SRR606249_subset10_1_reads.fq.gz 358 MB
SRR606249_subset10_2_reads.fq.gz 352 MB

The following files are produced by FastQC after running the read_filtering_pretrim_workflow rule to generate quality reports on the raw reads:

File Name File Size
SRR606249_subset10_1_reads_1_reads_fastqc.html 218 KB
SRR606249_subset10_1_reads_1_reads_fastqc.zip 230 KB
SRR606249_subset10_1_reads_2_reads_fastqc.html 220 KB
SRR606249_subset10_1_reads_2_reads_fastqc.zip 233 KB

There are 5,400,000 reads of 101bp in length within these FastQC HTML reports.

Trimmomatic generates the following files after running the read_filtering_posttrim_workflow rule with a quality score threshold of 2:

File Name File Size
SRR606249_subset10_1_reads_trim2_1.fq.gz 365 MB
SRR606249_subset10_1_reads_trim2_1_se 3.1 MB
SRR606249_subset10_1_reads_trim2_2.fq.gz 359 MB
SRR606249_subset10_1_reads_trim2_2_se 4.1 MB
SRR606249_subset10_1_reads_trim2_trimmomatic_pe.log 348 MB

Trimmomatic generates the following files after running the read_filtering_posttrim_workflow rule with a quality score threshold of 30:

File Name File Size
SRR606249_subset10_1_reads_trim30_1.fq.gz 313 MB
SRR606249_subset10_1_reads_trim30_1_se 23 MB
SRR606249_subset10_1_reads_trim30_2.fq.gz 300 MB
SRR606249_subset10_1_reads_trim30_2_se 21 MB
SRR606249_subset10_1_reads_trim30_trimmomatic_pe.log 343 MB

FastQC generates the following files after running the read_filtering_posttrim_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2_1_fastqc.html 215 KB
SRR606249_subset10_1_reads_trim2_1_fastqc.zip 224 KB
SRR606249_subset10_1_reads_trim2_2_fastqc.html 216 KB
SRR606249_subset10_1_reads_trim2_2_fastqc.zip 226 KB
SRR606249_subset10_1_reads_trim30_1_fastqc.html 222 KB
SRR606249_subset10_1_reads_trim30_1_fastqc.zip 233 KB
SRR606249_subset10_1_reads_trim30_2_fastqc.html 223 KB
SRR606249_subset10_1_reads_trim30_2_fastqc.zip 235 KB

There are 5,351,062 reads of 25-101bp in length within the FastQC HTML reports after conservative trimming (trim2) and 5,054,386 reads of 25-101bp in length after aggressive trimming (trim30).

MultiQC produces the following HTML report of aggregated FastQC results, as well as a directory with supporting data files, after running the read_filtering_multiqc_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_fastqc_multiqc_report.html 1.2 MB
SRR606249_subset10_1_reads_fastqc_multiqc_report_data 408 KB

MultiQC allows all of the FastQC HTML reports to be visualized at one time. Compared to the raw reads, the quality of the example dataset improves slightly after trim2 and moderately after trim30. Overall, this dataset was of fairly high quality before and after trimming, so the filtered reads do not look drastically different than the raw reads. In a lower quality dataset, the difference between raw and filtered reads would be more pronounced.

Khmer is used within the read filtering workflow to process quality filtered reads. The following files are generated after interleaving paired-end reads with the khmer script interleave-reads.py, which is called by the read_filtering_khmer_interleave_reads_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2_interleaved_reads.fq.gz 688 MB
SRR606249_subset10_1_reads_trim30_interleaved_reads.fq.gz 579 MB

There are 10,702,124 reads in the interleaved example dataset that was trimmed with a quality threshold of 2, and there are 10,108,772 reads in the interleaved example dataset that was trimmed with a quality threshold of 30. This is the expected result, since there were individually 5,351,062 reads in the paired-end files from trim2 (5,351,062 forward reads + 5,351,062 reverse reads = 10,702,124 total interleaved reads) and 5,054,386 reads in trim30 (5,054,386 forward reads + 5,054,386 reverse reads = 10,108,772 total interleaved reads).

The number of unique k-mers with default lengths of k=21, k=31, and k=51 are estimated within an interleaved file using the khmer script unique-kmers.py. The following files are created when that script is called by the read_filtering_khmer_count_unique_kmers_workflow rule:

File Name File Size
SRR606249_subset10_1_reads_trim2_interleaved_uniqueK21.txt 1.9 KB
SRR606249_subset10_1_reads_trim2_interleaved_uniqueK31.txt 1.9 KB
SRR606249_subset10_1_reads_trim2_interleaved_uniqueK51.txt 1.9 KB
SRR606249_subset10_1_reads_trim30_interleaved_uniqueK21.txt 1.9 KB
SRR606249_subset10_1_reads_trim30_interleaved_uniqueK31.txt 1.9 KB
SRR606249_subset10_1_reads_trim30_interleaved_uniqueK51.txt 1.9 KB

These are the number of unique k-mers recorded within the above text files, as estimated at different k-mer lengths from the original interleaved files (note that as the length of the k-mer size increased, the number of unique k-mers decreased):

File Name Unique K-mer Count
SRR606249_subset10_1_reads_trim2_interleaved_uniqueK21.txt 203,920,417
SRR606249_subset10_1_reads_trim2_interleaved_uniqueK31.txt 201,392,577
SRR606249_subset10_1_reads_trim2_interleaved_uniqueK51.txt 186,474,457
SRR606249_subset10_1_reads_trim30_interleaved_uniqueK21.txt 170,720,241
SRR606249_subset10_1_reads_trim30_interleaved_uniqueK31.txt 164,955,240
SRR606249_subset10_1_reads_trim30_interleaved_uniqueK51.txt 148,631,398

In cases where there is expected to be substantial redundancy in sequence information (beyond what is needed to capture the diversity of organisms within a sample), it may be helpful to work with a smaller percentage of the original dataset. The khmer script sample-reads-randomly.py can be used to down-sample a smaller percentage of the interleaved file with the read_filtering_khmer_subsample_interleaved_reads_workflow rule, which will generate the following files for the example dataset:

File Name File Size
SRR606249_subset10_1_reads_trim2_subset10_interleaved_reads.fq.gz 138 MB
SRR606249_subset10_1_reads_trim30_subset10_interleaved_reads.fq.gz 116 MB

There are 2,140,424 reads in the subsampled SRR606249_subset10_1_reads_trim2_subset10_interleaved_reads.fq.gz dataset (2,140,424 subsampled reads / 10,702,124 reads in the original SRR606249_subset10_1_reads_trim2_interleaved_reads.fq.gz dataset = 20% of the original dataset) and 2,021,754 reads in the subsampled SRR606249_subset10_1_reads_trim30_subset10_interleaved_reads.fq.gz dataset (2,021,754 subsampled reads / 10,108,772 reads in the original SRR606249_subset10_1_reads_trim30_interleaved_reads.fq.gz dataset = 20% of the original dataset). Trimmomatic is not run again on the subsampled dataset, since the subsampling was performed on data that had already been trimmed.

In this example, the original dataset was actually a subsample of the full Shakya dataset, so the nomenclature has multiple "subset" instances in the file name. The second instance of "subset" after its trim quality information is meant to indicate that it is a smaller subsample of the original dataset, and in this case the first instance of "subset" occurred because original sample name was named "SRR606249_subset10."