Table of contents:
- Develop a complex project in a team by applying good software engineering practices such as agile development, version control, and data persistence along with testing and code instrumentation.
- Demonstrate effective teamwork as a member or a leader of a team.
- Design, implement, and evaluate a computing-based solution to a given set of computing requirements for a problem from a specific domain.
- Learn how to implement a BTree external data structure on the disk.
- Demonstrate how to use caching to improve performance of an application.
- Understand how bitwise operators reduce the memory footprint of data.
- Learn how to run an application in the cloud.
This repository contains:
- the expected project package structure, in the src/ folder
- some partial implementation of classes (including the interface to the BTree class), in the src/main/java/cs321/ folder
- sample JUnit tests, in the src/test/java/cs321/ folder
- sample input data, in the data/ folder
- sample expected results, in the results/ folder
- script for integration testing at the top-level of the repository
- a wrapper for the gradle build tool, which simplifies installing and running gradle. In turn, gradle facilitates and handles:
- Java library (e.g., JUnit) dependency management
- Compiling the code
- Generating self-containing jars
- Running classes
- Running unit tests
β NOTE: Do NOT fork this repository, because the forked repository cannot have its own GitHub issues, which will be used as Scrum tasks.
β NOTE: Do NOT modify the package structure in the src/ folder, otherwise the project may not be built correctly using gradle.
Use the following command to check our installed JDK version:
$ javac -version
This project does not work with JDK 24 (to be released in March 2025).
It is recommended to use any of the following versions:
- JDK 8 (LTS)
- JDK 11 (LTS)
- JDK 17 (LTS)
- JDK 21 (LTS)
π See this wiki page for additional details regarding the supported Java versions and links to download the correct JDK versions.
One team member should:
- Create a new private GitHub repository and
- Make sure to name the private repository as specified by your instructor
- Give repository write permissions to all the other team members
- Add your instructor and teaching assistants as collaborators (they will provide you with their GitHub ids)
- Clone (not fork) this repository and verify that gradle can be run.
$ git clone https://github.com/BoiseState/CS321_Bioinformatics.git
$ cd CS321_Bioinformatics
$ ./gradlew tasks
The last command should perform a one-time gradle setup, followed by listing all the available gradle tasks and their descriptions.
NOTE: On Windows, the ./gradlew
command should be replaced with gradlew
(which will call the gradlew.bat
file)
The same team member should push the cloned repository to the new private repository. This can be done by changing the remote URL of the cloned repository to the new private repository's URL.
$ git remote set-url origin NEW_URL_OF_YOUR_NEW_PRIVATE_REPOSITORY
$ git remote -v
$ git push
The other team members should then clone the newly created student repository containing the starter code.
Gradle allows running unit tests and code from IDEs, or the command line, as described below.
Run all the JUnit tests and print a summary of the results:
$ ./gradlew test
Run the main
method from GeneBankCreateBTree.java
and pass the appropriate <arguments>
:
$ ./gradlew createJarGeneBankCreateBTree
$ java -jar build/libs/GeneBankCreateBTree.jar <arguments>
Run the main
method from GeneBankSearchBTree.java
and pass the appropriate <arguments>
:
$ ./gradlew createJarGeneBankSearchBTree
$ java -jar build/libs/GeneBankSearchBTree.jar <arguments>
This repository is an Eclipse project, and can be directly opened in Eclipse.
π See this wiki page for additional instructions to run this project in Eclipse.
This project can be opened with IntelliJ IDEA.
π‘ HINT: As a student, you can get IntelliJ IDEA for free (using an academic license) by signing up with your Boise State email.
π See this wiki page for additional instructions to run this project in IntelliJ IDEA.
Alternatively, this project can be opened with VSCode.
π See this wiki page for detailed instructions to run this project in VSCode.
We can add as many classes as we want in src/main/java
, and gradle should build
them automatically. In other words, we should not have to make any changes to the build.gradle
.
Also, we can add new test files with new tests cases in src/test/java
and those will be run
automatically by gradle or our IDE.
The focus of this project is to learn about data structures, while working effectively in a group. In addition, given the small project scope, and the fixed set of requirements that are already defined (and will not need to be elicited with the use of a Product Owner), the team can customize the Scrum process learned in CS-HU 208 and focus exclusively on:
- creating tasks
- linking commits to task IDs (e.g.,
Implements task #123
) - Test-Driven Development and unit testing. The starter code already contains a few sample unit tests that can be run from the command line.
Creating the tasks upfront will allow dividing and assigning the work in order to provide transparency and accountability within the team.
Use the Projects tab (i.e., a simplified version of ZenHub) to configure our own team Scrum board, based on this project example (feel free to copy the contents of these tasks to your Scrum board).
Your Scrum board should contain the following columns (pipelines):
Column Name | Description |
---|---|
Product Backlog | All (unassigned) tasks that are going to be completed by the team throughout the duration of the project |
Sprint Backlog | Tasks proposed to be implemented in the current week (sprint), assigned to developers |
In Progress | Tasks currently being worked on |
Review/QA | Tasks ready to be reviewed by another team member |
Closed | Completed tasks, whose corresponding code is integrated in the master branch |
Tasks should be assigned to the developer working on them. Each team
member should add to the project log file, Project-Log.md, the tasks (e.g.,
https://github.com/StudentUserNameHostingRepo/CS321_Bioinformatics/issues/123
) completed that
week, as described in the progress reports.
Here is an example of a valid task
written in engineering language that is assigned to a developer. This task should be referenced
by a commit containing a message similar to Implements task #3
.
As a warm up exercise, each team member should create a task similar to task #2
and then edit the
README-submission.md
file with their information.
Here is a sample project log from a team from a previous semester: Project-Log-sample.md This has been included in the starter repository so you use it as a template.
Table of contents:
- Introduction
- Background
- Specifications
- Design Issues
- Implementation
- Using a Cache
- Using a Database
- Useful Examples
- Test Scripts
- Testing in the Cloud
- Progress Reports
- Submission
In this assignment, we will solve a problem from the field of Bioinformatics using BTrees. The amount of data that we have to handle is large and any data structure is not likely to fit in memory. Hence, a BTree is a good choice for the data structure.
Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. One of the primary aims of Bioinformatics is to attempt to determine the structure and meaning of the human genome. The human genome is a complete set of human DNA. The Human Genome project was started in 1990 by the United States Department of Energy and the U.S. National Institutes of Health. By April 14, 2003 99% of the Human Genome had been sequenced with 99.9% accuracy. The Human Genome is a big strand of 4 different organic chemicals, known as bases, which are:
- Adenine
- Cytosine
- Thiamine
- Guanine
Biologists often call them A
, C
, T
, G
for short. The bases A
and T
are always
paired together. Similarly, the bases C
and G
are always paired together. So when we look
at the DNA representation, only one side is listed. For example: the DNA sequence: AATGC
actually represents two sequences: AATGC
and its complement TTACG
(replace A
by T
,
T
by A
, C
by G
and G
by C
). Even with only half the bases represented in a DNA
sequence, the human genome is about 2.87 billion characters long!
See below an image of the DNA as well as the chemical structure of the bases.
The primary source for getting the human genome (as well as all other mapped organisms) is in the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov/). See this page for downloading human genome data
We will be using the GeneBank files from NCBI. The format is described with a sample file at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Most of the information in a GeneBank file is of interest to biologists. We will only be interested in the actual DNA sequences that are embedded in these files rather than in the intervening annotations.
The GeneBank files have a bunch of annotations followed by the keyword ORIGIN
. The DNA sequences
start from the next line. Each line has 60 characters (one of A
, T
, C
, G
, could be
lower/upper case) until the end of sequence, which is denoted by //
on a line by itself.
Sometimes we will see the character N
or 'n', which denotes that the sequence is not known
at that character. You would skip these characters as described below.
when we reach a character n
, we assume that the sequence has ended. similarly, when we reach
//
, we also assume that the sequence has ended. so at those points, we reset the sequence
that we were building and start over when we find the next valid character after seeing a n
or when we find the next origin
tag.
One GeneBank file may have several DNA sequences in it, each marked by ORIGIN
and //
tags.
Note that the GeneBank files typically have all bases and the N
character in lowercase but they
could be in upper or lower case. For all our output, we will always have everything in lower case.
Here is a sample GeneBank file with the ORIGIN
and //
tags as well as some the N
characters
highlighted.
Sample GeneBank files (having the *.gbk
extension) can be found in the
data/files_gbk/ folder.
[amit@fedora-linux files_gbk(master)]$ ls -lh
total 2.8M
-rw-r--r--. 1 amit amit 11K Nov 1 07:55 test0.gbk
-rw-r--r--. 1 amit amit 11K Nov 1 07:55 test1.gbk
-rw-r--r--. 1 amit amit 21K Nov 1 07:55 test2.gbk
-rw-r--r--. 1 amit amit 21K Nov 1 07:55 test3.gbk
-rw-r--r--. 1 amit amit 4.1K Nov 1 07:55 test4.gbk
-rw-r--r--. 1 amit amit 2.7M Nov 1 07:55 test5.gbk
The files test0.gbk
through test4.gbk
are small files to help in debugging and testing our
code. The test5.gbk
is a relatively large file that we can also use to test the performance
of our solution.
The data/queries/ folder contains:
- sample query files
- a sample program named
QueryGenerator.java
that generates random queries for testing
An optional large test file is the Y chromosome for human beings. That is a large file (31MB, but the smallest of all human chromosomes). To run our program on this file is optional as it may take a long time! Note that this file isn't available on GitHub due to their limitation on file size. You can download it from here: hs_ref_chrY.gbk
The biological motivation for the problem is to study the frequency of different length subsequences to see if they are random or that some subsequences are more likely to be found in the DNA.
For a given GeneBank file, we want to convert it into a BTree with each object being a DNA
subsequence of specified length k
(where 1
β€ k
β€ 31
) We will take the DNA sequence
from the GeneBank file and break it into subsequences of length k
each and insert them into
a BTree. Each subsequence will act as a key. We are interested in all subsequences with length
k
. For example, in the sequence AATTCG
, the subsequences of length three are: AAT
, ATT
,
TTC
and TCG
.
Once we have a BTree for a length k
, we want to be able to search for query subsequences of
length k
. The search returns the frequency of occurrence of the query string (which can be
zero if it is not found).
We will also create a SQL database (for a specific length k
) of subsequences and their
frequency to aid in searching. A database will be directly created from the provided
BTree dump files.
We could represent each DNA base as a character. In Java, each character is stored in Unicode,
which requires 16 bits (or 2 bytes). Since we only have four possible bases (A
, C
, G
and T
), we can optimize on space by converting each DNA base to a 2 bit binary numbers. This
is because, a 2-bit binary number can represent four unique values, one for each DNA base. The
following table show the encoding we will use for the four bases (assuming lowercase for A, T,
C, G). We have included the Unicode values so we can see that it takes 8 times more space!
DNA Base | 2-bit binary | Unicode (decimal) | 16-bit Unicode |
---|---|---|---|
A | 00 | 97 | 0000 0000 0110 0001 |
T | 11 | 116 | 0000 0000 0111 0100 |
C | 01 | 100 | 0000 0000 0110 0100 |
G | 10 | 103 | 0000 0000 0110 0111 |
Note that we have made the binary representations for complementary bases be binary complements
as well. For example, complement of the base A
is T
--- the complement of 00
is 11
.
With this compact representation, we can store a 31 length subsequence in a 64-bit long
primitive type in Java.
Hint: The reason why the upper limit of the sequence length is 31 but not 32 can be explained as follows. The long type has 8 bytes or 64 bits to represent integers between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807. The first bit is been reserved for the sign (+ or -) of the number and thus the effective number of bits for storing data is 63, which can hold up to 31 2-bit binary codes (e.g., 00 for A, etc), using up 62 bits and then we run out of space to add another 2-bit representation of a DNA base.
Note that the binary compact representation of the subsequences will result in a unique 64-bit integer value. Hence, we can directly use that as our key value.
We will need a BTree
class as well as a BTreeNode
class. The objects that we store in the
BTree will be similar to the objects we stored in the previous Hashtable assignment. You may
call the relevant class TreeObject
to represent the objects.
We will create three programs:
- one that creates a BTree from a given GeneBank file
- a second for searching in the specified BTree for subsequences of given length. The search program assumes that the user specified the proper BTree to use depending upon the query length.
- a third for searching in the SQL database for subsequences of specified length. This database should be created by loading provided btree dump files using the SQLite interface outside our program.
The main Java classes should be named GeneBankCreateBTree
, GeneBankSearchBTree
, and
GeneBankSearchDatabase
.
The required arguments for the three programs are shown below:
java -jar build/libs/GeneBankCreateBTree.jar --cache=<0|1> --degree=<btree-degree>
--gbkfile=<gbk-file> --length=<sequence-length> [--cachesize=<n>] [--debug=0|1]
java -jar build/libs/GeneBankSearchBTree.jar --cache=<0/1> --degree=<btree-degree>
--btreefile=<b-tree-file> --length=<sequence-length> --queryfile=<query-file>
[--cachesize=<n>] [--debug=0|1]
java -jar build/libs/GeneBankSearchDatabase.jar --database=<SQLite-database-path>
--queryfile=<query-file>
Note that the arguments can be provided in any order.
If the name of the GeneBank file is xyz.gbk
, the subsequence length is <k>
and the B-Tree degree
is <t>
, then the name of the B-Tree file should be xyz.gbk.btree.data.<k>.<t>
-
<cache>
specifies whether the program should use cache (value1
) or no cache (value0
); if the value is1
, the<cache-size>
has to be specified -
<degree>
is the degree to be used for the B-Tree. If the user specifies0
, then our program should choose the optimum degree based on a disk block size of4096
bytes and the size of our B-Tree node on disk -
<gbk-file>
is the input*.gbk
file containing the input DNA sequences -
<sequence-length>
is an integer that must be between1
and31
(inclusive) -
<b-tree-file>
is the name of the B-Tree file generated by theGeneBankCreateBTree
program -
<query-file>
contains all the DNA strings of a specific subsequence length that we want to search for in the specified B-Tree file. The strings are one per line and they all must have the same length as the DNA subsequences in the B-Tree file. The DNA strings useA
,C
,T
, andG
(either lower or upper case) -
[<cache-size>]
is an integer between100
and10000
(inclusive) that represents the maximum number ofBTreeNode
objects that can be stored in memory -
<SQLite-database-path>
the path to a SQL database created by loading a dump file using SQLite's .import command. The name of the database file should bexyz.k.db
where the sequence length is<k>
, and the GeneBank file isxyz.gbk
. -
[<debug-level>]
is an optional argument with a default value of zero-
It must support at least the following values for
GeneBankSearchBTree
:-
0
: The output of the queries should be printed on the standard output stream. Any diagnostic messages, help and status messages must be printed on the standard error stream -
'1': The program displays more verbose messages. For example, it can show details of each search process
-
-
It must support at least the following values for
GeneBankCreateBTree
:-
0
: Any diagnostic messages, help and status messages must be printed on standard error stream -
1
: The program writes a text file nameddump
, containing the frequency and the DNA string (corresponding to the key stored) in an inorder traversal, and has the following line format:
-
-
<DNA string> <frequency>
The following shows a segment of the dumpfile test0.gbk.dump.5
. It has a total of 954 lines in it.
aaaaa 21
aaaac 16
aaaag 10
aaaat 18
aaaca 5
aaacc 5
aaacg 10
aaact 12
...
Write the root node to disk file only at the end of the program and read it in when the program
starts up. In addition, our program can only hold a few nodes in memory. In other words,
we can only use a few BTreeNode variables (including the root) in our program (e.g., root,
parent, current, child, temporary node for split). However, if the cache is enabled, we can
store <cache-size>
BTreeNode
objects in the cache.
We need to store some metadata about the B-Tree on disk. For example, we can store the degree of the tree, the byte offset of the root node (so we can find it), the number of nodes, and other information. This information should be stored at the beginning of the B-Tree file. We read the metadata when we open the B-Tree file and we write it back (as it may have changed) when we close the B-Tree file at the end of the program.
The B-Tree is stored as a binary file on disk. This is the most efficient and compact way to store the B-Tree data structure so it persists beyond the program. While it is possible to store the B-Tree as a text file, it will lead to severe slowdown in the runtime.
The BTree data file will have an initial metadata section. The metadata section should contain at least the byte offset of the root node. It may also optionally contain the degree of the BTree and the number of nodes. After the metadata, the rest of the file consists of BTreeNodes laid out one after the other. A new node is added to the end of the file.
If the name of the GeneBank file is xyz.gbk
, the subsequence length is <k>
and the B-Tree
degree is <t>
, then the name of the B-Tree file should be xyz.gbk.btree.data
.<k>.<t>
.
-
We will read/write one BTreeNode at a time. If the degree
<t>
is small, this would be inefficient. To improve efficient, we should set the degree<t>
to an optimum value such that a BTreeNode fits one disk block (we are using 4096 bytes for the disk block size) as close as possible with some empty padding space at the end (if needed). -
We will store the byte offset of a node on disk as the child pointers in the BTreeNodes. Note that we never need real child pointers in memory.
-
We will use RandomAccessFile and FileChannel classes to read/write to the BTree data file. This allows us to quickly set the file cursor to anywhere in the file in O(1) time using the
position(long pos)
method. We will use the ByteBuffer class to read/write to the BTree data file. Please see the example of writing to a random access binary data file shown in DiskReadWrite.java in the Disk IO example folder in CS321-resources repo. This example shows a complete binary search tree as an external data structure in a binary file on disk.
Note that our BTree stores only one side of the DNA strand but there is a complementary strand as well. This implies that when we search for a subsequence, we also need to search for its complement to get the correct answer.
For example: the DNA sequence: AATGC
actually represents two sequences: AATGC
and its
complement TTACG
(replace A
by T
, T
by A
, C
by G
and G
by C
). To search for
AATGC
, we will search for both AATGC
and its complement TTACG
and then add the frequencies
to get the result.
We will incorporate the generic Cache class from Project 1
to improve the performance of
our B-Tree implementation. The size of the cache should be a command line argument. An entry
in the cache is a BTreeNode
. With the cache enabled command line option, the <cache-size>
needs to be specified as well.
π Report the time improvement for sequences of length 20 on test5.gbk
using a cache of size
100
, 500
, 1000
, 5000
, and 10,000
in your README-submission.md
file.
For example, the table below shows the improvement the instructors got on their solution. Note that your times will be different due to different hardware and differences in the implementation.
gbk file | degree | sequence length | cache | cache size | cache hit rate | run time |
---|---|---|---|---|---|---|
test5.gbk | 102 | 20 | no | 0 | 0% | 29.52s |
test5.gbk | 102 | 20 | yes | 100 | 66.14% | 12.56s |
test5.gbk | 102 | 20 | yes | 500 | 77.84% | 10.22s |
test5.gbk | 102 | 20 | yes | 1000 | 81.45% | 10.05s |
test5.gbk | 102 | 20 | yes | 5000 | 95.08% | 5.08s |
test5.gbk | 102 | 20 | yes | 10000 | 99.51% | 3.15s |
Using a cache sped up the execution by a factor of 9.37! Using a cache will also speed up search in
larger BTrees. The test5.gbk
BTree isn't large enough because a search only takes
We were able to create a B-Tree for the full human Y chromosome (data/files_gbk/hs_ref_chrY.gbk
)
in about 14 minutes using a cache of size 10,000. With a cache of size 100,000 (about 400 MB
of memory), we were able to bring the time to create the BTree down to only 2m19s.
A common technique is to provide project results in a common format. The results will then be accessible to other researchers much more easily.
In this project, your team will practice this technique by creating an SQL database that has subsequence data in it.
In the real world, the GeneBankCreateBtree program would need to be modified to create
the database from an existing BTree. In this project, the correct BTree dump files are
provided for test0.gbk
and test5.gbk
.
Using these dumpfiles, load the data into an SQLite database using the SQLite .import command. See the documentation here.
Loading the data from the dumpfiles prevents your team from needing a fully functioning BTree to complete this requirement.
Then create a separate search program named GeneBankSearchDatabase
that uses the database instead of the BTree. This is a common pattern in real life
applications, where we may crunch lots of data using a data
structure and then store the results in a database for ease of access.
Note: Since correct dumpfiles are provided in the results folder, GeneBankSearchDatabase can be started and completed before GeneBankCreateBTree.
$ ./gradlew createJarGeneBankSearchDatabase
$ java -jar build/libs/GeneBankSearchDatabase.jar --database=<SQLite-database-path>
--queryfile=<query-file>
Use the embedded SQLite database for searching the database. The SQLite driver is fully contained in a jar file that gradle will automatically pull down for us. See the database example in the section below on how to use SQLite.
The following examples from the class examples repository will be useful for this project.
- Disk IO example: In particular, look at DiskReadWrite.java. It shows the implementation of an external binary search tree on disk.
- SQLite example: A quick starter example on how to set up and use SQLite.
- Bitwise operators example: In particular, look at SequenceUtils.java for helpful sequence utility code. A copy of this has also been provided in the starter code for the project.
Several test gbk files are provided in the folder: data/files_gbk.
An optional large test file is the Y chromosome for human beings. That is a large file (but the smallest of all chromosomes!). To run our program on this file is optional as it may take a long time!! For this test, we recommend a cache size of 10,000 nodes. Note that this file isn't available on GitHub due to their limitation on file size. You can download it from here: hs_ref_chrY.gbk
If you want to compare your dumpfiles for the Y chromosome, you can download our dump files from here: hs_ref_chrY dumpfiles.
Several pre-generated query files are provided in the folder:
data/queries.
To generate additional queries, we can use the data/queries/QueryGenerator.java
program.
The expected dump files and query results are provided for test0.gbk
and test5.gbk
in the folders:
results/dumpfiles,
results/query-results
In particular, results are provided for query1
,..., query10
, query20
, and query31
query files
for test0.gbk
and test5.gbk
GeneBank files.
Three test scripts are provided at the top-level of the project (for integration testing). These compare your results to the results files mentioned above.
./create-btrees.sh
Usage: create-btrees.sh <datafile>
./check-dumpfiles.sh
Usage: check-dumpfiles.sh <datafile>
./check-queries.sh
Usage: check-queries.sh <datafile>
Please note that the scripts require the name of the test file and then it will add the
appropriate path to the name to find it. The check-btrees.sh
script creates B-Trees for the given
data file for strings of length 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 31. The check-dumpfiles.sh
compares the dump files from our code to the reference dump files. The check-queries.sh
script compares the results of queries to our program with the reference results.
The results are provided for test0.gbk and test5.gbk data files. You can use the test scripts to run and compare results using the three test scripts as follows.
./gradlew createJarGeneBankCreateBTree
./gradlew createJarGeneBankSearchBTree
./create-btrees.sh test0.gbk
./check-dumpfiles.sh test0.gbk
./check-queries.sh test0.gbk
Then, repeat for test5.gbk
. The instructors will use these test scripts for the final testing of
your project.
Start off by running tests on your machine. If you do need to run them on onyx
please only
run the smallest test (test0.gbk
) to avoid overloading the onyx
server.
Using the AWS Accounts provided earlier in the course, you can run the large Y-Chromosome file on a cloud instance, provided you create a larger instance.
Creating a BTree with the large file is very time intensive. It will take too long to run unless your cache implementation is efficient and your BTree is well-designed.
To be rewarded with the extra credit, request permissions to create a larger instance. Then, capture a screenshot of check-queries.sh completed and include it with your submission.
βοΈ π
Please see the AWS notes for a step-by-step guide on running your project on the AWS cloud.
Each team member will fill out a progress report (via a survey) each week. The link to the survey will be provided by the instructor.
In addition, each team member should log their project-related
activities for the week, including the URL to the tasks (e.g.,
https://github.com/StudentUserNameHostingRepo/CS321_Bioinformatics/issues/123
) completed that
week, in a separate file named Project-Log.md
.
Here is a sample log file: Project-log-sample.md
As a reminder, each commit should link (reference) in the commit message the completed task
(e.g., Implements task #123
), in order to automatically link the task to the commit, and make
the code changes directly available from the task itself.
It is expected that each team should have at least one meeting every week.
Progress reports are confidential.
Before submission, make sure that you:
- that your
README-submission.md
file is complete! - can compile and run the program from the command line and obtain the expected results (just try test0.gbk)
- run the test scripts
- add a note in your
README-submission.md
that the team is done and then commit and push.
Make sure the instructor and the teaching assistant(s) have access to the repository.