Directory and file structure
All fastq files to be analyzed have to be provided in one directory with the following layout:
/path/to/fastqs/sample1/PAIREDEND/fastq1_R1.fastq.gz
/path/to/fastqs/sample1/PAIREDEND/fastq1_R2.fastq.gz
/path/to/fastqs/sample1/PAIREDEND/fastq2_R1.fastq.gz
/path/to/fastqs/sample1/PAIREDEND/fastq2_R2.fastq.gz
/path/to/fastqs/sample2/PAIREDEND/fastq1_R1.fastq.gz
/path/to/fastqs/sample2/PAIREDEND/fastq1_R2.fastq.gz
/path/to/fastqs/sample2/PAIREDEND/fastq2_R1.fastq.gz
/path/to/fastqs/sample2/PAIREDEND/fastq2_R2.fastq.gz
/path/to/fastqs/sample1/SINGLEEND/fastq1.fastq.gz
/path/to/fastqs/sample1/SINGLEEND/fastq2.fastq.gz
/path/to/fastqs/sample2/SINGLEEND/fastq1.fastq.gz
/path/to/fastqs/sample2/SINGLEEND/fastq2.fastq.gz
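Before starting a run it can help to check that a directory actually follows this layout. The following is a minimal sketch, not part of the pipeline itself: the function names `collect_fastqs` and `unpaired_mates` are hypothetical, and it assumes that paired-end files always end in `_R1.fastq.gz` or `_R2.fastq.gz`, as in the listing above.

```python
from collections import defaultdict
from pathlib import Path


def collect_fastqs(fastq_root):
    """Group fastq file names by (sample, mode) for <root>/<sample>/<mode>/<file>."""
    groups = defaultdict(list)
    for path in sorted(Path(fastq_root).glob("*/*/*.fastq.gz")):
        mode = path.parent.name            # PAIREDEND or SINGLEEND
        sample = path.parent.parent.name   # e.g. sample1
        if mode in ("PAIREDEND", "SINGLEEND"):
            groups[(sample, mode)].append(path.name)
    return dict(groups)


def unpaired_mates(file_names):
    """Return paired-end file names that have no matching _R1/_R2 partner."""
    names = set(file_names)
    problems = []
    for name in sorted(names):
        if name.endswith("_R1.fastq.gz"):
            partner = name[: -len("_R1.fastq.gz")] + "_R2.fastq.gz"
        elif name.endswith("_R2.fastq.gz"):
            partner = name[: -len("_R2.fastq.gz")] + "_R1.fastq.gz"
        else:
            # Assumption: anything else does not follow the paired-end naming.
            problems.append(name)
            continue
        if partner not in names:
            problems.append(name)
    return problems
```

Calling `collect_fastqs("/path/to/fastqs")` and passing each `PAIREDEND` list to `unpaired_mates` gives a quick overview of files that would not form a complete pair.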
So that the read mapper can create read groups correctly, each pair of fastq files (paired-end mode) and each single fastq file (single-end mode) must be accompanied by a tab-separated .tsv file. The .tsv file must contain the following keywords:
RUN_NAME_FOLDER # sample
LANE_NUMBER # lane of the reads in the fastq file
SAMPLE_CODE # the library used for the reads
SAMPLE_TYPE # the technique used to generate the reads, e.g. ILLUMINA
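Such a file can be generated with a few lines of Python. The sketch below is only an illustration under stated assumptions: it assumes that each keyword sits on its own line followed by a tab and its value, which the text above does not spell out, and the helper names and example values are hypothetical. Check an existing .tsv file from a working run before relying on this format.

```python
from pathlib import Path

# The four keywords listed above.
REQUIRED_KEYS = ("RUN_NAME_FOLDER", "LANE_NUMBER", "SAMPLE_CODE", "SAMPLE_TYPE")


def write_read_group_tsv(tsv_path, values):
    """Write one 'KEY<TAB>value' line per keyword (assumed layout)."""
    missing = [key for key in REQUIRED_KEYS if key not in values]
    if missing:
        raise ValueError("missing keywords: " + ", ".join(missing))
    lines = [f"{key}\t{values[key]}" for key in REQUIRED_KEYS]
    Path(tsv_path).write_text("\n".join(lines) + "\n")


def read_read_group_tsv(tsv_path):
    """Read the file back into a dict, skipping empty lines."""
    entries = {}
    for line in Path(tsv_path).read_text().splitlines():
        if line.strip():
            key, value = line.split("\t", 1)
            entries[key] = value
    return entries
```

A call could look like `write_read_group_tsv("fastq1.tsv", {"RUN_NAME_FOLDER": "sample1", "LANE_NUMBER": "1", "SAMPLE_CODE": "lib1", "SAMPLE_TYPE": "ILLUMINA"})`, where the file name and all values are placeholders.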
In many cases a fixed file structure is already in place and cannot be modified. The required structure can then be emulated with symbolic links:
ln -s source/file.txt target/file.txt
A more sophisticated example is:
localrules: linkFastqs
rule linkFastqs:
    input:
        fastq = OUTDIR + 'sra/{srr}/PAIREDEND/{srr}_{mate}.fastq.gz'
    output:
        fastq = OUTDIR + 'sra/{sample}/PAIREDEND/{srr}_R{mate}.fastq.gz'
    params:
        outdir = OUTDIR + 'sra/{sample}/PAIREDEND/'
    shell:
        'cd {params.outdir}; ln -s ../../{wildcards.srr}/PAIREDEND/{wildcards.srr}_{wildcards.mate}.fastq.gz {wildcards.srr}_R{wildcards.mate}.fastq.gz && touch -h {wildcards.srr}_R{wildcards.mate}.fastq.gz'
This rule is used in the wes example to rearrange the fastq files downloaded from the SRA repository.
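Outside of Snakemake, the same kind of relative symbolic link can be created with the Python standard library. This is a minimal sketch, assuming the sample/mode directory layout described above; the function name `link_fastq` and its parameters are hypothetical and not part of the pipeline.

```python
import os
from pathlib import Path


def link_fastq(source, fastq_root, sample, mode, link_name):
    """Create <fastq_root>/<sample>/<mode>/<link_name> as a relative symlink to source."""
    target_dir = Path(fastq_root) / sample / mode
    target_dir.mkdir(parents=True, exist_ok=True)
    link = target_dir / link_name
    # A relative link target keeps the link valid if the whole tree is moved together.
    relative_source = os.path.relpath(Path(source).resolve(), target_dir.resolve())
    if link.is_symlink():
        link.unlink()  # replace a link left over from an earlier run
    link.symlink_to(relative_source)
    return link
```

For example, `link_fastq("downloads/run1_1.fastq.gz", "/path/to/fastqs", "sample1", "PAIREDEND", "run1_R1.fastq.gz")` would expose a downloaded file under the expected `_R1` name without copying it; the paths here are placeholders.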
This file contains the necessary information about the samples, for instance which samples belong together in the test-control setting. The file contains four columns:
- column 1: experiment id
- column 2: sample id
- column 3: sample type; possible values are T for tumor and N for normal
- column 4: time point (this information is currently not used, but in future extensions time series data will be included)
An example could be:
exp1 PCL-016_TEST T 1
exp1 PCL-016_CTRL N 1
exp2 PCL-019_TEST T 1
exp2 PCL-019_CTRL N 1
exp3 PCL-026_TEST T 1
exp3 PCL-026_CTRL N 1
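To illustrate how the four columns define the test-control setting, the sketch below reads such a file and lists the tumor/normal pairs per experiment. It is an illustration only, not the pipeline's own parser: the function names are hypothetical, and it assumes that columns are separated by whitespace (tabs or spaces) and that ids contain no whitespace.

```python
from collections import defaultdict


def read_sample_map(path):
    """Return {experiment id: {'T': [sample ids], 'N': [sample ids]}}."""
    experiments = defaultdict(lambda: {"T": [], "N": []})
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # skip empty lines
            if len(fields) != 4:
                raise ValueError(f"line {line_number}: expected 4 columns, got {len(fields)}")
            experiment, sample, sample_type, _time_point = fields
            if sample_type not in ("T", "N"):
                raise ValueError(f"line {line_number}: sample type must be T or N")
            experiments[experiment][sample_type].append(sample)
    return dict(experiments)


def tumor_normal_pairs(experiments):
    """List (experiment, tumor sample, normal sample) for every combination."""
    pairs = []
    for experiment, samples in experiments.items():
        for tumor in samples["T"]:
            for normal in samples["N"]:
                pairs.append((experiment, tumor, normal))
    return pairs
```

Applied to the example above, each experiment would yield exactly one pair, for instance the tumor sample PCL-016_TEST together with the normal sample PCL-016_CTRL for exp1.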