Directory and file structure

singerj edited this page Jun 27, 2017 · 5 revisions

Fastq files

All fastq files to be analyzed have to be provided in one directory with the following layout:

Paired-end fastq files

/path/to/fastqs/sample/PAIREDEND/fastq1_R1.fastq.gz
/path/to/fastqs/sample/PAIREDEND/fastq1_R2.fastq.gz
/path/to/fastqs/sample/PAIREDEND/fastq2_R1.fastq.gz
/path/to/fastqs/sample/PAIREDEND/fastq2_R2.fastq.gz

Single-end fastq files

/path/to/fastqs/sample/SINGLEEND/fastq1.fastq.gz
/path/to/fastqs/sample/SINGLEEND/fastq2.fastq.gz
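
The layout above can be checked before starting an analysis. The following is a minimal sketch, not part of the pipeline: the function name `collect_fastqs` is hypothetical, and it assumes that paired-end files always end in `_R1.fastq.gz` and `_R2.fastq.gz` as in the listing.

```python
from pathlib import Path


def collect_fastqs(root):
    """Return {sample: {"paired": [(R1, R2), ...], "single": [fastq, ...]}}.

    Assumes the layout <root>/<sample>/PAIREDEND/*_R1.fastq.gz (+ matching _R2)
    and <root>/<sample>/SINGLEEND/*.fastq.gz.
    """
    layout = {}
    for sample_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        paired = []
        paired_dir = sample_dir / "PAIREDEND"
        if paired_dir.is_dir():
            for r1 in sorted(paired_dir.glob("*_R1.fastq.gz")):
                # Derive the expected mate name from the R1 file name.
                r2 = r1.with_name(r1.name.replace("_R1.fastq.gz", "_R2.fastq.gz"))
                if not r2.exists():
                    raise FileNotFoundError(f"missing mate for {r1}")
                paired.append((r1, r2))
        single = []
        single_dir = sample_dir / "SINGLEEND"
        if single_dir.is_dir():
            single = sorted(single_dir.glob("*.fastq.gz"))
        layout[sample_dir.name] = {"paired": paired, "single": single}
    return layout
```

Calling `collect_fastqs("/path/to/fastqs")` returns one entry per sample directory and raises an error if an `_R1` file has no `_R2` mate.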

So that the read mapper can create appropriate read groups, each pair of fastq files (paired-end mode) or each single fastq file (single-end mode) must be accompanied by a tab-separated .tsv file. The .tsv file must contain the following keywords:

TSV files

RUN_NAME_FOLDER # sample
LANE_NUMBER     # lane on which the reads of the fastq file were sequenced
SAMPLE_CODE     # the library used for the reads
SAMPLE_TYPE     # the technique used to generate the reads, e.g. ILLUMINA
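
This page does not show a complete .tsv file, so the exact layout is not confirmed here. The sketch below assumes one `KEYWORD<TAB>value` pair per line and only checks that the four keywords listed above are present; the function name `parse_read_group_tsv` is hypothetical.

```python
REQUIRED_KEYS = ("RUN_NAME_FOLDER", "LANE_NUMBER", "SAMPLE_CODE", "SAMPLE_TYPE")


def parse_read_group_tsv(text):
    """Parse the content of a .tsv file into a dict.

    ASSUMPTION: each non-empty line has the form KEYWORD<TAB>value.
    """
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        entries[key.strip()] = value.strip()
    missing = [key for key in REQUIRED_KEYS if key not in entries]
    if missing:
        raise ValueError("missing keywords: " + ", ".join(missing))
    return entries
```

A file with all four keywords yields a dictionary with one value per keyword; a file lacking any of them raises a `ValueError` that names the missing keywords.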

Adapting existing file structures

In many cases a fixed file structure is already in place and cannot be modified. In that case, symbolic links can be used to simulate the required structure without moving or copying any files.

ln -s source/file.txt target/file.txt

A more sophisticated example is:

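# Link each downloaded fastq file into the directory of its sample and add the _R prefix to the mate number.
# The link is relative (../../), so the directory tree stays relocatable; touch -h updates the timestamp of the link itself.
localrules: linkFastqs
rule linkFastqs:
    input:
        fastq = OUTDIR + 'sra/{srr}/PAIREDEND/{srr}_{mate}.fastq.gz'
    output:
        fastq = OUTDIR + 'sra/{sample}/PAIREDEND/{srr}_R{mate}.fastq.gz'
    params:
        outdir = OUTDIR + 'sra/{sample}/PAIREDEND/'
    shell:
        'cd {params.outdir} && '
        'ln -s ../../{wildcards.srr}/PAIREDEND/{wildcards.srr}_{wildcards.mate}.fastq.gz {wildcards.srr}_R{wildcards.mate}.fastq.gz && '
        'touch -h {wildcards.srr}_R{wildcards.mate}.fastq.gz'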

which is used in the wes example to rearrange the fastq files downloaded from the SRA repository.
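
Outside of Snakemake, the same relinking can be sketched in plain Python. This is only an illustration under assumptions: the function `link_into_layout` and its parameters are hypothetical and not part of the pipeline. It creates a relative symbolic link inside the `sample/PAIREDEND` or `sample/SINGLEEND` directory described above.

```python
import os
from pathlib import Path


def link_into_layout(source, fastq_root, sample, mode="PAIREDEND", name=None):
    """Create <fastq_root>/<sample>/<mode>/<name> as a relative symlink to source.

    `name` defaults to the file name of `source`; pass it to rename the link,
    e.g. to add the _R1/_R2 suffix the layout expects.
    """
    source = Path(source).resolve()
    target_dir = Path(fastq_root) / sample / mode
    target_dir.mkdir(parents=True, exist_ok=True)
    link = target_dir / (name or source.name)
    # Use a path relative to the link's directory, like ../../ in the rule.
    relative_source = os.path.relpath(source, start=target_dir.resolve())
    link.symlink_to(relative_source)  # raises FileExistsError if the link exists
    return link
```

For example, `link_into_layout("existing/run_1.fastq.gz", "/path/to/fastqs", "sample", name="fastq1_R1.fastq.gz")` leaves the existing file untouched and only adds a link in the expected location.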

Sample mapping file

This file contains the necessary information about the samples, for instance which samples belong together in a test-control setting. The file contains four columns:

    1. column: experiment id
    2. column: sample id
    3. column: sample type: possible values are T for tumor and N for normal
    4. column: time point (this information is currently not used, but in future extensions time series data will be included)

An example could be:

exp1    PCL-016_TEST    T   1
exp1    PCL-016_CTRL    N   1
exp2    PCL-019_TEST    T   1
exp2    PCL-019_CTRL    N   1
exp3    PCL-026_TEST    T   1
exp3    PCL-026_CTRL    N   1
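
To illustrate how the four columns relate samples to each other, the following sketch groups the tumor and normal samples of each experiment. It is not the pipeline's own parser: the function name `read_sample_map` is hypothetical, and the columns are assumed to be separated by whitespace (tabs or spaces).

```python
from collections import defaultdict


def read_sample_map(text):
    """Group sample ids by experiment id and sample type (T or N).

    ASSUMPTION: four whitespace-separated columns per line:
    experiment id, sample id, sample type, time point.
    """
    experiments = defaultdict(lambda: {"T": [], "N": []})
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(fields)}")
        experiment, sample, sample_type, _time_point = fields
        if sample_type not in ("T", "N"):
            raise ValueError(f"line {lineno}: sample type must be T or N")
        experiments[experiment][sample_type].append(sample)
    return dict(experiments)
```

Applied to the example above, each of `exp1`, `exp2` and `exp3` ends up with one tumor sample under `T` and its matching control under `N`.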