Ontogeny tools are designed for biologists with no background in bioinformatics. They use a lot of color and keep things simple on the command line to make the transition from wet lab to computer lab more manageable.
The name ontogeny refers to the development of an individual from embryo to maturity. I chose this name as my hope is that these tools help you go from terrified of the command line to proficient in bioinformatics.
They are bash shell scripts cobbled together while learning how to work with biological data on UNIX/Linux servers (data wrangling).
They follow kent command conventions. This means executing a command with no arguments will show its usage/help. Most also follow UNIX/Linux conventions by showing usage when run with the -h or --help flags.
Most UNIX software is designed to be minimalist. This is ideal for UNIX power tools, as it makes dealing with data easier in pipelines.
On the other hand, most of my software is not designed to be part of a pipeline. These tools were designed to format the data for non-programmers to read more easily. Output tends to have columns formatted to align, lots of color, and spacing on the top, left and bottom. This would throw a wrench in the gears of most data pipelines.
- Contribute
- General purpose software (standalone scripts)
- Ontogeny Toolkit (extensions to your bash startup file)
- Data Wrangling
- Installation
$ git clone https://github.com/claymfischer/ontogeny.git
$ vi file.txt
$ git pull
$ git add file.txt
$ git commit -m "Adding file.txt"
$ git diff --stat --cached origin/master
$ git log --stat --pretty=short --graph
$ git push
This software is not specific to internal work projects, and much of it can be employed for any general command-line use in bioinformatics.
- [view source] Highlight
- [view source] Color-code columns
- [view source] Color-code sequence data
- [view source] Transfer files
- [view source] New ls and list
- [view source] About
- [view source] List contents
- [view source] Change your command prompt for no good reason
Highlight any number of search patterns with a variety of colors. It can accept stdin (piped input) or use files, and can pipe out (for example to less -R). It also has extensive regex support. Protips and specifics are available in the documentation and usage.
$ highlight file.txt pattern1 pattern2 ... pattern{n}
Input: stdin, pipedinput, file.txt, "multiple.txt files.txt" or file.*
Input examples:
$ highlight *.txt pattern1 pattern2 ... pattern{n}
$ highlight "file.txt file2.txt" pattern1 pattern2 ... pattern{n}
$ cat file.txt | grep pattern1 | highlight stdin pattern2 pattern3 | less -R
$ cat file.txt | grep pattern1 | highlight pipedinput pattern2 pattern3 | less -R
pipedinput and stdin are both the same, but stdin will show you a color legend of what you're highlighting.
Note: adding multiple files will filter to only lines containing all the patterns. You can trick it into filtering within a single file by also including the empty file /dev/null, for example: $ highlight "/dev/null file.txt" pattern1 pattern2
As this can handle any number of patterns (and will color them randomly with 256 colors), it's useful for a lot of QA purposes, making visual connections easier. For example, you could use command substitution to generate your pattern list:
$ highlight file.txt $( cat listOfAssemblyNames.tsv | cut -f 2 | awk NF | sort | uniq | tr '\n' ' ' )
Note: there are patterns with special meaning, such as CLEANUP, to help locate errant tabs and spaces in biological data storage.
In bioinformatics we deal with the lowest-common-denominator format for data, which is generally plain text in tab-separated columns. These tab-separated columns are more computer-readable than human-readable, as the columns do not line up. It can be difficult to tell which column you are looking at when you have a screen of line-wrapped text.
This takes advantage of a simple grep loop to color-code the columns. It accepts stdin; you'll need to provide the argument stdin instead of file.tsv. There are color legends both at the top and bottom, allowing you to pipe to head or tail.
$ cat example.tsv
$ columns example.tsv
Any additional arguments will color specific columns for comparison. This example also shows how to use stdin.
$ cat example.tsv | columns stdin 3 6 9 10 17 25
Color-codes bases in a gzipped fastq file.
$ fastq SRR123.fastq.gz
You can also color-code the quality score by setting any third argument.
$ fastq SRR123.fastq.gz x
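The core idea can be sketched in a few lines of awk. The function below is a hypothetical helper, not the actual fastq script: in a fastq file, sequence lines are every 4th line starting at line 2, and each base gets wrapped in an ANSI background color.

```shell
# Sketch of color-coding bases in a gzipped fastq (hypothetical helper,
# not the actual fastq script). Sequence lines are lines 2, 6, 10, ...
colorbases() {
    zcat "$1" | awk 'NR % 4 == 2 {
        gsub(/A/, "\033[42mA\033[0m")   # green background
        gsub(/C/, "\033[44mC\033[0m")   # blue
        gsub(/G/, "\033[43mG\033[0m")   # yellow
        gsub(/T/, "\033[41mT\033[0m")   # red
    } { print }'
}
```

The header, separator and quality lines pass through untouched, so the read structure stays intact.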
This lists directories first, then files. It can color-code different types of files.
If you are new to shell scripting, these are fantastic examples to begin modifying. They were written as tutorials for how to write shell scripts. They are similar, except list will also do a line count for text files.
This will tell you about any file or directory. It has lazy usage, or more verbose usage that allows detailed previews of the file.
This was my first shell script, and really is not a great example of code. However, it's fast and it does what it needs so I've never updated it.
### About files
It will tell you the file size, encoding (ASCII or non-ASCII), when the file was last modified in human terms (seconds, minutes, days, weeks, months, years), how many lines it has (and of those, how many are non-blank and how many are actual content, not comments), and how many columns (default delimiter is a tab, but you can set it). It also previews the head and foot of a file.
$ about file.txt
Gives you the real and apparent size of a directory (e.g. if transferring the contents over a network), the number of files in the top level as well as in all subdirectories, when the directory was last modified, and any file extensions and examples with those extensions, and groups files by date modified.
This is an extension of a script I found in 'Wicked Cool Shell Scripts.'
This is a simple script that generates a color-coded SCP command to upload or download files. It was written as a tutorial in bash shell scripting.
Learn more
$ transfer file1.txt file2.txt ... file{n}.txt
It also takes advantage of filename expansion:
$ transfer *.txt
This is a silly piece of software with no practical purpose; it was written as an exercise challenge when learning bash shell scripting.
It allows you to change your command prompt to any character. It can give you a new character at each prompt, keep the same character, or return you to your old command prompt when done. The prompts chosen require changing the LC_ALL setting to allow Unicode, so this will affect sort behavior.
If you'd like to start using colors, here is the output from bin/paletteTest.sh:
The ontogeny_toolkit.sh script extends your .bashrc by adding aliases to the above software and adding the following functionality:
- noWrap
- l
- showMatches
- grabBetween
- grabLines
- checkFastq
- fixLastLine
- fixNewLines
- deleteBlankLines
- reduceMultipleBlankLines
- reduceMultipleBlankSpaces
- screenHelp and changes to your prompt when in a screen
- howToGrep
- mkdirRand, mkdirNow, MkdirTime, foo
- cleanUp
Execute noWrap to temporarily halt line wrapping in your terminal. After 20 seconds your terminal is back to default.
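This is not the actual implementation, but the underlying trick is the DECAWM terminal escape sequence (also reachable as `tput rmam`/`tput smam`); a minimal sketch, with a hypothetical function name:

```shell
# Sketch of a noWrap-style helper (hypothetical name).
# ESC[?7l disables line wrapping (DECAWM); ESC[?7h restores it.
nowrap_demo() {
    seconds=${1:-20}
    printf '\033[?7l'     # stop wrapping long lines
    sleep "$seconds"
    printf '\033[?7h'     # back to default wrapping
}
```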
Execute l (lowercase L) to list everything in the directory in a more human-readable fashion, including the time stamps. It's a simple alias.
Execute showMatches file.txt pattern to show all matches (highlighted) with context. Add another argument to set the amount of context you want to include: showMatches file.txt pattern 10.
Very fast and useful for parsing files with multiple matches, for example looking for a certain type of error in an error log.
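A comparable effect can be had with plain grep's context and color options, which is roughly the idea a showMatches-style helper builds on (function name hypothetical, not the toolkit's implementation):

```shell
# Sketch of a showMatches-style helper: highlighted matches with context.
# Usage: show_matches file.txt pattern [context_lines]
show_matches() {
    grep --color=always -C "${3:-2}" -E "$2" "$1"
}
```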
Patterns have extensive regex support.
grabBetween file.txt pattern1 pattern2
Note that this will grab the first match of the pattern found, and will ignore further matches.
Patterns have extensive regex support
grabLines file.txt 100 250
This will return all content between line numbers.
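In spirit this is a one-liner around sed's line-range addressing; a minimal sketch with a hypothetical name:

```shell
# Sketch of a grabLines-style helper (hypothetical name):
# print lines $2 through $3 of file $1, inclusive.
grab_lines() {
    sed -n "${2},${3}p" "$1"
}
```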
checkFastq grabs the content between specific line numbers in a gzipped fastq file. Same usage as grabLines.
cat file.txt | fixLastLine > file2.txt
Pipe to this to fix a missing line break at the end of a file. Very common with data saved from spreadsheets or text files from Windows PCs.
Note that this is useful because many programs define a line as ending in a line break. If the last line does not end with a line break, it may cause issues with some software.
cat file.txt | fixNewLines > file2.txt
Pipe to this to fix CRLF lines in a file. Very common with data saved from spreadsheets or text files from Windows PCs.
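Both line-ending fixes are small filters at heart. Sketches with hypothetical names (not the actual toolkit functions): deleting carriage returns handles CRLF, and a plain `awk '1'` treats a trailing partial line as a record and prints it with a newline.

```shell
# Sketches of the two line-ending fixes (hypothetical names):
strip_crlf() {           # fixNewLines-style: delete carriage returns
    tr -d '\r'
}
ensure_final_newline() { # fixLastLine-style: awk treats a trailing
    awk '1'              # partial line as a record, printed with \n
}
```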
cat file.txt | deleteBlankLines > file2.txt
Removes blank lines from a file. Used in a pipe.
cat file.txt | reduceMultipleBlankLines > file2.txt
This will fix up a file by reducing regions with multiple blank lines to only one blank line.
cat file.txt | reduceMultipleBlankSpaces > file2.txt
This will clean up a file, reducing areas with more than one space to only one space.
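These three cleanup filters map closely onto standard tools; minimal sketches with hypothetical names (GNU `cat -s` assumed for the blank-line squeeze):

```shell
# Sketches of the cleanup filters (hypothetical names):
delete_blank_lines() {   # drop empty/whitespace-only lines
    grep -v '^[[:space:]]*$'
}
squeeze_blank_lines() {  # collapse runs of blank lines to one
    cat -s
}
squeeze_spaces() {       # collapse runs of spaces to one
    tr -s ' '
}
```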
Your prompt will automatically change when entering a screen to alert you that you're in a screen session.
You can also invoke help by simply running screenHelp, either in the screen session or on the command line, for a quick refresher. It will also show you a list of running screen sessions, or the name of your current screen if in one.
Since grep is such an important tool for bioinformaticians to learn, there's also a howToGrep refresher.
If you find yourself making a lot of tmp, temp or foo directories and getting them mixed up, here are a few commands to make a unique directory that you can keep track of.
- inspect
- formatted
- align
- alternateRows
- colorRows
- blocks
- grid
- linesNotEmpty
- linesContent
- writing
- nonascii
- ascii
inspect file.txt
Default is to include first and last 5 lines, but you can set a different number: inspect file.txt 20
cat file.txt | processing | formatted
This will allow you to align files in a pipe. Can set any delimiter, but defaults to tab. For example, a csv file: cat file.txt | formatted ","
align file.txt "delimiter"
Aligns a file according to delimiter. If no delimiter set, defaults to tab.
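The alignment itself can be sketched as a two-pass awk: pass one measures the widest cell in each column, pass two pads every cell to that width. (Where available, `column -t -s $'\t'` achieves much the same thing.) The function name is hypothetical, and this fixed-tab version is a sketch, not the actual script:

```shell
# Sketch of what formatted/align do: measure each column's widest cell,
# then pad every cell to that width (tab-delimited input assumed).
align_demo() {
    awk -F '\t' '
        NR == FNR {                       # pass 1: record column widths
            for (i = 1; i <= NF; i++)
                if (length($i) > w[i]) w[i] = length($i)
            next
        }
        {                                 # pass 2: print padded cells
            for (i = 1; i <= NF; i++)
                printf("%-" (w[i] + 2) "s", $i)
            print ""
        }
    ' "$1" "$1"
}
```

Passing the file twice lets `NR == FNR` distinguish the measuring pass from the printing pass.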
cat file.txt | alternateRows
Gives every other row a gray background color.
cat file.txt | colorRows
Gives every other row a random color. Set an argument to give it a background color, instead.
blocks file.tab
Can set any delimiter as the second argument, defaults to tab.
grid file.tab
Can set any delimiter as the second argument, defaults to tab.
A third argument will truncate each column, for example to 10 characters: grid file.tab tab 10
It can also truncate to the average (average or avg) of the column character lengths: grid file.tab tab avg
cat file.txt | linesNotEmpty
Returns number of lines that are not empty or white space.
cat file.txt | linesContent
Returns the number of lines containing content and which do not begin with a hashtag.
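Both counters reduce to a grep -c; minimal sketches with hypothetical names (not the toolkit's implementations):

```shell
# Sketches of the line-counting helpers (hypothetical names):
lines_not_empty() {   # lines containing something besides whitespace
    grep -c '[^[:space:]]'
}
lines_content() {     # non-blank lines whose first character isn't #
    grep -c '^[[:space:]]*[^[:space:]#]'
}
```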
Simple trick to see if a directory size changes over one second.
Note: this uses du, then sleeps and uses du again to determine the speed of writing. Large directories can take a while to run du, so the rate of writing may be inaccurate.
cat file.txt | ascii
cat file.txt | nonascii
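A sketch of how such filters can work (hypothetical names, not the actual toolkit functions): with LC_ALL=C, a bracket expression over the tab and printable-ASCII range treats everything else, including bytes above 0x7F, as suspect.

```shell
# Sketches of ascii/nonascii-style filters (hypothetical names).
# LC_ALL=C makes the byte ranges literal rather than locale-dependent.
# Note: control characters other than tab are also flagged here.
only_ascii() {        # lines with only tab and printable ASCII
    LC_ALL=C grep -v $'[^\t -~]'
}
non_ascii() {         # lines with at least one byte outside that set
    LC_ALL=C grep $'[^\t -~]'
}
```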
cleanUp and highlight file.tab CLEANUP
- whichColumn, whichColumns
- describeColumns
- summarizeColumns
- cutColumns
- columnAverage
- columnLengths
- numColumns
- maxColumns
- minColumns
Visually locate multiple spaces/tabs, helpful when data isn't validating the way you expected.
Tip: highlight has a special pattern which works even better: highlight file.txt CLEANUP.
cat file.txt | whichColumns
Figure out which column number you need.
It previews the second line of the file to help you confirm it's the correct column.
Tip: sometimes the file has funky line breaks if copied and pasted from a spreadsheet, so try cat file.txt | fixNewLines | whichColumns if you encounter troubles.
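The basic idea is small enough to sketch (hypothetical name, not the actual script): take the second line, turn tabs into newlines, and number the result.

```shell
# Sketch of the whichColumns idea (hypothetical name): number the
# fields of the second line so you can see which column is which.
which_columns_demo() {
    sed -n '2p' "$1" | tr '\t' '\n' | cat -n
}
```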
describeColumns file.tsv
A fork of whichColumns, and also provides the column number, column header and first-row value for a tab-separated file.
summarizeColumns file.tsv
will give a detailed overview of each column and let you know if the column numbers are inconsistent or the file uses Windows-style CRLF line breaks. You can set any delimiter, it defaults to tab.
Note that it gives 5 random values from each column so you get an idea of what's going on. You can instruct it to give a specific number of examples, and even truncate each example so they all fit on your screen.
cutColumns file.tsv 1 2 3
Returns the file, but without the columns specified as arguments. Can be in any order.
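For tab-separated files, GNU cut's --complement flag does most of the work; a sketch with a hypothetical name (the real script likely does more):

```shell
# Sketch of a cutColumns-style helper using GNU cut --complement
# (hypothetical name). Drops the listed columns, in any order.
cut_columns_demo() {
    f=$1; shift
    cut --complement -f "$(echo "$*" | tr ' ' ',')" "$f"
}
```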
cat file.tsv | cut -f 1 | columnAverage
This will return the average number of characters. This is for piped input, one column of data.
cat file.tsv | columnLengths
This will return the average characters in each column. Used in a pipe.
numColumns file.tsv
Returns the number of columns in a tab-separated file.
cat file.tsv | maxColumns
Returns the highest number of columns found in a tab-separated file.
cat file.tsv | minColumns
Returns the lowest number of columns found in a tab-separated file.
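All three column counters are short awk one-liners at heart; sketches with hypothetical names (not the actual toolkit functions):

```shell
# Sketches of the column-counting helpers (hypothetical names):
num_columns() {     # columns in the first line of a file
    head -n 1 "$1" | awk -F '\t' '{ print NF }'
}
max_columns() {     # highest column count on any line (reads stdin)
    awk -F '\t' 'NF > max { max = NF } END { print max }'
}
min_columns() {     # lowest column count on any line (reads stdin)
    awk -F '\t' 'NR == 1 || NF < min { min = NF } END { print min }'
}
```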
Tab-separated data can be difficult to read if the rows vary in character length. Here's an example of using the format alias. Note that to align this, a character needs to be placed in columns or rows with blanks. This will insert a period (.) character. Seeing it aligned can be easier to read than coloring the columns.
It's even easier to read than the color-coded columns program from above:
If you want to contribute some bash functions, there's a library of functions available for handling argument validity (checking if integers, etc), checking for files existing and making suggestions, etc.
The library functions begin with the prefix lib_. There are example bash functions as well: allTheArguments shows how to handle multiple files and accept unlimited arguments (as well as color results randomly), and functionFlags shows how to use flags in a bash function.
The following software is developed for specific use in data wrangling work. I do keep a repository of it here so we can all collaboratively develop (and the source code may be useful to some), but it is unlikely to find general-purpose use.
Learn more about our internal file formats and the software to work with them
A lot of this software is designed to work for:
ra file, or Tag Storm
An ra (relational-alternative) file establishes a record as a set of related tags and values in a blank-line-delimited stanza (block of text). Parent stanzas convey tags and values shared with the rest of the file. Indented stanzas inherit from parent stanzas, and can override parent settings.
These are designed to be human-readable, and reduce redundancy of tab-separated files.
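As an illustration of the shape described above (the tag names here are invented for the example, not taken from any real schema), a tiny tag storm might look like this, where the two indented sample stanzas inherit project and species from the parent stanza:

```
project mouse-timecourse
species mouse

    sample s1
    tissue liver

    sample s2
    tissue brain
```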
manifest file
This is a list of files with a unique identifier to link each file with metadata about it. Tab-separated columns.
spreadsheets
In collaborating with off-site folks who are not familiar with the command-line, it can often be easier to share Google Sheets or Excel Spreadsheets. There is some software to generate input for spreadsheets.
- Check submission
- Generate spreadsheet input
- Generate a tag storm summary
- Generate a tag summary
cat meta.txt | processing | emptyTags
Useful to see if your processing messes up any tags.
cat meta.txt | processing | removeEmptyTags > meta.new
listAllTags meta.txt
head -n 1 meta.tab | convertMisceFields >> fixedHeader.txt
This will lower-case the header, convert spaces and dashes to underbars, and even change camelCase to tag format.
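That kind of header normalization can be sketched with sed and tr (hypothetical name, not the actual convertMisceFields implementation): split camelCase with underbars, turn spaces and dashes into underbars, then lower-case everything.

```shell
# Sketch of header normalization (hypothetical name): camelCase and
# "Spaced-Dashed Names" become lower_snake_case tag names.
normalize_header() {
    sed -E 's/([a-z0-9])([A-Z])/\1_\2/g; s/[ -]+/_/g' |
        tr '[:upper:]' '[:lower:]'
}
```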
This gives a summary of a relational-alternative, or ra, file.
If an md5sum file is present, it will also validate that there are no collisions and compare it to the md5sum file.
This takes a tag storm as input, does some calculations and gives a tab-separated output for importing into a tag reconciliation spreadsheet.
This gives you a tag-by-tag count of values and totals them for you. Very useful for a high-level look at a tag storm.
This gives a summary of a tag from a tag storm, providing counts and showing all the different values and the stanza indentation for each.
Clone
First you need to clone. This will create a directory called ontogeny wherever you run this command:
$ git clone https://github.com/claymfischer/ontogeny.git
If you want to learn more about git and why it is useful when dealing with biological data, I highly recommend the book Bioinformatics Data Skills. It has a fantastic chapter on git and what you need to know, and explains it in a no-nonsense manner, assuming you have no background in bioinformatics. The entire book is an amazing resource well worth every penny.
bash startup file
Add the following to your .bashrc
and edit the ONTOGENY_INSTALL_PATH
:
# Ontogeny repository path:
ONTOGENY_INSTALL_PATH=/path/to/the/repository
source $ONTOGENY_INSTALL_PATH/lib/ontogeny_toolkit.sh
Protip: put this at the top of your .bashrc file. This way it won't override your own settings of the same variables. For instance, if you have a PS1 set in your .bashrc, it won't get overridden if this is sourced at the top.
make
Currently looking into enabling users to simply run make install from the repository directory to copy executables to where they need to be.