WARNING: This program is under active development and this documentation might not reflect reality. Please file a GitHub issue and we will take care of it as soon as we can.
- 'build' is for building a bloom filter from a reference file. It supports large genome files (>4GB), such as the human genome.
- 'query' is for querying a fastq/fasta file against the bloom filter.
- 'remove' is for removing contamination sequences from a fastq/fasta file.
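To make the idea behind these three commands concrete, here is a minimal Python 3 sketch of a bloom filter applied to sequence k-mers. It is a toy illustration only and is not how FACS implements its filter: the class `ToyBloomFilter`, the helper `kmers`, the k-mer size `K`, and the salted SHA-256 hashing scheme are all assumptions made for this example.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative bloom filter; NOT the data structure FACS uses internally."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


K = 8  # hypothetical k-mer size chosen for this toy example

# "build": insert every k-mer of the reference into the filter.
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
bloom = ToyBloomFilter()
for kmer in kmers(reference, K):
    bloom.add(kmer)

# "query": count how many k-mers of a read are (possibly) in the reference.
read = "CGTTAGCCGATAGG"  # taken from the reference, so every k-mer must hit
hits = sum(1 for kmer in kmers(read, K) if kmer in bloom)
print(hits, "of", len(read) - K + 1, "k-mers hit")
```

A bloom filter never produces false negatives, so a read taken from the reference always hits on all of its k-mers. It can produce false positives, which is why the metrics reported by a query are best read as estimates.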
In order to fetch the source code, compile and run tests:
$ git clone https://github.com/tzcoolman/DRASS.git && cd DRASS && make -j8 && make tests
Please note that Python's virtualenv is needed to run the tests.
FACS uses a command-line structure similar to the one found in the popular bwa. There are three main commands: build, query and remove. Their flags differ slightly, but they should behave similarly.
$ ./facs -h
Program: facs (Sequence decontamination using bloom filters)
Version: 0.1
Contact: Enze Liu <[email protected]>
Usage: facs <command> [options]
Command: build      build a bloom filter from a FASTA reference file
         query      query a bloom filter given a FASTQ/FASTA file
         remove     remove (contamination) sequences from FASTQ/FASTA file
For example, to build a bloom filter out of a FASTA reference genome, one should type:
$ ./facs build -r ecoli.fasta -o ecoli.bloom
This generates an E. coli bloom filter that can be used to query a FASTQ file:
$ ./facs query -b ecoli.bloom -r contaminated_sample.fastq.gz
Both plaintext and gzip-compressed FASTQ files are supported transparently. The query returns metrics indicating how many reads in that particular sample might be contaminated with E. coli:
{
  "total_read_count": 201,
  "contaminated_reads": 1,
  "total_hits": 90,
  "contamination_rate": 0.004975,
  "bloom_filename": "tests/data/bloom/U00096.2.bloom"
}
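The report is plain JSON, so it can be consumed by scripts. The Python 3 sketch below parses the sample output shown above and recomputes the contamination rate. It assumes that "contamination_rate" is "contaminated_reads" divided by "total_read_count"; the sample numbers are consistent with that reading, but the documentation does not state the formula.

```python
import json

# The sample report shown above, pasted in as a string.
report = json.loads("""
{
  "total_read_count": 201,
  "contaminated_reads": 1,
  "total_hits": 90,
  "contamination_rate": 0.004975,
  "bloom_filename": "tests/data/bloom/U00096.2.bloom"
}
""")

# Assumption: rate = contaminated reads / total reads.
recomputed = report["contaminated_reads"] / report["total_read_count"]

# Allow for the six-decimal rounding used in the report.
consistent = abs(recomputed - report["contamination_rate"]) < 1e-6
print(f"recomputed rate: {recomputed:.6f}, matches report: {consistent}")
```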
Finally, if one wants to remove those reads from the sample, one should run the following command:
$ ./facs remove -b ecoli.bloom -r contaminated_sample.fastq.gz -o discarded_reads.fastq
Here, "discarded_reads.fastq" contains the reads that were filtered out of the original FASTQ file.
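The three steps above can be chained from a script. The following Python 3 sketch only assembles the command lines, using the flags shown in the examples above (-r, -o, -b); the helpers `pipeline_commands` and `run_pipeline` are hypothetical names introduced for this illustration and are not part of FACS.

```python
import shlex
import subprocess

FACS = "./facs"  # path to the compiled binary, as in the examples above


def pipeline_commands(reference, sample, bloom, discarded):
    """Return the build, query and remove invocations as argv lists."""
    return [
        [FACS, "build", "-r", reference, "-o", bloom],
        [FACS, "query", "-b", bloom, "-r", sample],
        [FACS, "remove", "-b", bloom, "-r", sample, "-o", discarded],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = pipeline_commands(
    "ecoli.fasta",
    "contaminated_sample.fastq.gz",
    "ecoli.bloom",
    "discarded_reads.fastq",
)

# Print the commands for review; call run_pipeline(commands) to execute them.
for argv in commands:
    print(shlex.join(argv))
```

Passing argument lists to `subprocess.run` rather than a single shell string avoids quoting problems with unusual file names.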
A Python C extension provides a very simple API to build, query and remove sequences, just as described above for the plain C-based command line.
$ python
Python 2.6.6 (r266:84292, Jun 18 2012, 09:57:52)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import facs
>>> facs.build("ecoli.fasta", "ecoli.bloom")
>>> facs.query("contaminated_sample.fastq.gz", "ecoli.bloom")
>>> facs.remove("contaminated_sample.fastq.gz", "ecoli.bloom")
- All three commands can be executed on both Linux and Mac systems. However, building and loading large bloom filters is not supported on Mac.
- FACS supports the FASTA and FASTQ formats. Make sure you use the correct file extension: .fna or .fasta for FASTA, and .fastq for FASTQ.
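As a rough illustration of the extension convention, the Python 3 sketch below checks a file name before handing it to FACS and opens plaintext or gzip-compressed input uniformly. The helpers `sequence_format` and `open_sequence_file` are hypothetical. How FACS itself detects formats and compression is not documented here, so the case-insensitive matching, the handling of a trailing ".gz", and the magic-byte check are all assumptions.

```python
import gzip
import os

FASTA_EXTENSIONS = {".fna", ".fasta"}
FASTQ_EXTENSIONS = {".fastq"}


def sequence_format(path):
    """Return "fasta" or "fastq" based on the extension, ignoring a trailing .gz."""
    name = os.path.basename(path).lower()  # assumption: case-insensitive match
    if name.endswith(".gz"):
        name = name[:-3]
    extension = os.path.splitext(name)[1]
    if extension in FASTA_EXTENSIONS:
        return "fasta"
    if extension in FASTQ_EXTENSIONS:
        return "fastq"
    raise ValueError(f"unrecognized sequence file extension: {path}")


def open_sequence_file(path):
    """Open a plaintext or gzip-compressed file as text."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "r")


for example in ("ecoli.fasta", "genome.fna", "contaminated_sample.fastq.gz"):
    print(example, "->", sequence_format(example))
```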