CS 225 Extreme Extra Credit Projects
Our project attempts to reproduce the genome of the bacterium S. aureus. The data is sourced from the National Library of Medicine's Sequence Read Archive. The data is processed and a De Bruijn graph is constructed; all edges with a weight of one are removed, and the graph is traversed. An FM-index is then constructed for the k-mers visited during the De Bruijn graph traversal, and the most frequently repeated patterns are highlighted and extracted using the backward search algorithm. Finally, the generated string is compared to the output genome using the Needleman-Wunsch global alignment algorithm.
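As an illustration of the final comparison step, here is a minimal sketch of Needleman-Wunsch global alignment scoring. It is not the project's implementation: the names Scoring and needlemanWunschScore and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for this example, and the sketch returns only the optimal score, without the traceback that recovers the alignment itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scoring parameters; the project's actual values are not stated.
struct Scoring {
    int match = 1;
    int mismatch = -1;
    int gap = -1;
};

// Returns the optimal global alignment score of a and b (Needleman-Wunsch).
int needlemanWunschScore(const std::string& a, const std::string& b,
                         const Scoring& s = Scoring{}) {
    // The sketch expects gaps to be penalised and matches to beat mismatches.
    assert(s.gap <= 0 && s.mismatch <= s.match);
    const std::size_t n = a.size();
    const std::size_t m = b.size();

    // score[i][j] = best score for aligning a[0..i) with b[0..j).
    std::vector<std::vector<int>> score(n + 1, std::vector<int>(m + 1, 0));

    // Aligning a prefix against the empty string costs one gap per character.
    for (std::size_t i = 1; i <= n; ++i) score[i][0] = static_cast<int>(i) * s.gap;
    for (std::size_t j = 1; j <= m; ++j) score[0][j] = static_cast<int>(j) * s.gap;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            // Align a[i-1] with b[j-1] (match or mismatch).
            const int diag = score[i - 1][j - 1] +
                             (a[i - 1] == b[j - 1] ? s.match : s.mismatch);
            // Align a[i-1] with a gap, or b[j-1] with a gap.
            const int up = score[i - 1][j] + s.gap;
            const int left = score[i][j - 1] + s.gap;
            score[i][j] = std::max({diag, up, left});
        }
    }
    return score[n][m];
}
```

With these assumed scores, two identical strings of length n score n, and each inserted or deleted base costs one point. The table needs O(nm) memory, so a whole-genome comparison would likely need a more memory-efficient variant.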
All code files can be found in the code/ directory.
To run the code:
- Compile using
make exec
- Run using
./bin/exec
- Enter the data set to be used, e.g.
data/small.fasta
- Enter the length of the k-mers to be used, e.g.
7
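To illustrate what the k-mer length controls, the following is a minimal sketch of building a weighted De Bruijn graph from reads and then dropping edges seen only once. It is not the code in this repo: the names EdgeWeights, buildDeBruijn, and pruneSingleWeightEdges are hypothetical, and the sketch assumes each k-mer becomes an edge from its (k-1)-length prefix to its (k-1)-length suffix, with the edge weight counting how often that k-mer occurs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Edge (prefix -> suffix) mapped to the number of times its k-mer was seen.
using EdgeWeights = std::map<std::pair<std::string, std::string>, int>;

// Builds a weighted De Bruijn graph: every k-mer in every read contributes
// one edge from its (k-1)-length prefix to its (k-1)-length suffix.
EdgeWeights buildDeBruijn(const std::vector<std::string>& reads, std::size_t k) {
    assert(k >= 2);  // nodes are (k-1)-mers, so k must be at least 2
    EdgeWeights edges;
    for (const std::string& read : reads) {
        // Reads shorter than k contain no k-mer; the loop body never runs.
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            const std::string prefix = read.substr(i, k - 1);
            const std::string suffix = read.substr(i + 1, k - 1);
            ++edges[{prefix, suffix}];
        }
    }
    return edges;
}

// Returns a copy of the graph without the edges whose weight is one.
EdgeWeights pruneSingleWeightEdges(const EdgeWeights& edges) {
    EdgeWeights kept;
    for (const auto& entry : edges) {
        if (entry.second > 1) kept.insert(entry);
    }
    return kept;
}
```

A smaller k produces more repeated k-mers and therefore heavier edges, while a larger k produces more distinct nodes and more weight-one edges that the pruning step removes.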
A few test cases check the construction of the DeBruijnGraph and the ReadFile method. To run them, use make tests followed by ./bin/tests.
Our data was originally sourced from the Sequence Read Archive (SRA), where it is available for download.
The data files in this repo are subsets of the original file:
- data/small.fasta
- data/evensmaller.fasta
data/smallest.fasta is not a subset of the original file; it was updated throughout the project to test the functionality of the code.
The graph and other generated data are also written to the data/ directory. Running the program generates data/outputgraph.txt.
Our signed contract and development log can be found in the documents/ directory.
All feedback from our project mentor can be found in the feedback/ directory.