Ontogeny tools are designed for biologists with no background in bioinformatics. They use a lot of color and keep things simple on the command line to make the transition from wet lab to computer lab more manageable.
The name ontogeny refers to the development of an individual from embryo to maturity. I chose this name as my hope is that these tools help you go from terrified of the command line to proficient in bioinformatics.
They are bash shell scripts cobbled together while learning how to work with biological data on UNIX/Linux servers (data wrangling).
They follow kent command conventions. This means executing a command with no arguments will show its usage/help. Most also follow UNIX/Linux conventions by showing usage when run with the -h or --help flags.
Most UNIX software is designed to be minimalist. This is ideal for UNIX power tools, as it makes dealing with data easier in pipelines.
On the other hand, most of my software is not designed to be part of a pipeline. These tools were designed to format the data for non-programmers to read more easily. Output tends to have columns formatted to align, lots of color, and spacing on the top, left and bottom. This would throw a wrench in the gears of most data pipelines.
- Contribute
- General purpose software (standalone scripts)
- Ontogeny Toolkit (extensions to your bash startup file)
- Data Wrangling
- Installation
$ git clone https://github.com/claymfischer/ontogeny.git
$ vi file.txt
$ git pull
$ git add file.txt
$ git commit -m "Adding file.txt"
$ git diff --stat --cached origin/master
$ git log --stat --pretty=short --graph
$ git push
This software is not specific to internal work projects, and much of it can be employed for any general command-line use in bioinformatics.
- [view source] Highlight
- [view source] Color-code columns
- [view source] Color-code sequence data
- [view source] Transfer files
- [view source] New ls and list
- [view source] About
- [view source] List contents
- [view source] Change your command prompt for no good reason
Highlight any number of search patterns with a variety of colors. It can accept stdin (piped input) or use files, and can pipe out (for example to less -R). It also has extensive regex support. Protips and specifics are available in the documentation and usage.
$ highlight file.txt pattern1 pattern2 ... pattern{n}
Input: stdin, pipedinput, file.txt, "multiple.txt files.txt" or file.*
Input examples:
$ highlight *.txt pattern1 pattern2 ... pattern{n}
$ highlight "file.txt file2.txt" pattern1 pattern2 ... pattern{n}
$ cat file.txt | grep pattern1 | highlight stdin pattern2 pattern3 | less -R
$ cat file.txt | grep pattern1 | highlight pipedinput pattern2 pattern3 | less -R
pipedinput and stdin are both the same, but stdin will show you a color legend of what you're highlighting.
Note: adding multiple files will filter to only lines containing all the patterns. You can trick it into filtering within a single file by also including the empty file /dev/null, for example: $ highlight "/dev/null file.txt" pattern1 pattern2
As this can handle any number of patterns (and will color them randomly with 256 colors), it's useful for a lot of QA purposes, making visual connections easier. For example, you could use command substitution to generate your pattern list:
$ highlight file.txt $( cat listOfAssemblyNames.tsv | cut -f 2 | awk NF | sort | uniq | tr '\n' ' ' )
Note: there are patterns with special meaning, such as CLEANUP, to help locate errant tabs and spaces in biological data storage.
In bioinformatics we deal with the lowest-common-denominator format for data, which is generally plain text in tab-separated columns. These tab-separated columns are more computer-readable than human-readable, as the columns do not line up. It can be difficult to tell which column you are looking at when you have a screen of line-wrapped text.
This takes advantage of a simple grep loop to color-code the columns. It accepts stdin; you'll need to provide the argument stdin instead of file.tsv. There are color legends both at the top and bottom, allowing you to pipe to head or tail.
$ cat example.tsv
$ columns example.tsv
Any additional arguments will color specific columns for comparison. This example also shows how to use stdin.
$ cat example.tsv | columns stdin 3 6 9 10 17 25
Color-codes bases in a gzipped fastq file.
$ fastq SRR123.fastq.gz
You can also color-code the quality score by setting any third argument.
$ fastq SRR123.fastq.gz x
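The core idea can be sketched in a few lines of awk. The function below is a hypothetical helper, not the actual fastq script: in a fastq file, sequence lines are every 4th line starting at line 2, and each base gets wrapped in an ANSI background color.

```shell
# Sketch of color-coding bases in a gzipped fastq (hypothetical helper,
# not the actual fastq script). Sequence lines are lines 2, 6, 10, ...
colorbases() {
    zcat "$1" | awk 'NR % 4 == 2 {
        gsub(/A/, "\033[42mA\033[0m")   # green background
        gsub(/C/, "\033[44mC\033[0m")   # blue
        gsub(/G/, "\033[43mG\033[0m")   # yellow
        gsub(/T/, "\033[41mT\033[0m")   # red
    } { print }'
}
```

The header, separator and quality lines pass through untouched, so the read structure stays intact.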
This lists directories first, then files. It can color-code different types of files.
If you are new to shell scripting, these are fantastic examples to begin modifying. They were written as tutorials for how to write shell scripts. They are similar, except list will also do a line count for text files.
This will tell you about any file or directory. It has lazy usage, or more verbose usage that allows detailed previews of the file.
This was my first shell script, and really is not a great example of code. However, it's fast and it does what it needs so I've never updated it.
### About files
It will tell you the file size, encoding (ASCII or non-ASCII), when the file was last modified in human terms (seconds, minutes, days, weeks, months, years), how many lines it has (and of those, how many are non-blank and how many are actual content, not comments), and how many columns (default delimiter is a tab, but you can set it). It also previews the head and foot of a file.
$ about file.txt
Gives you the real and apparent size of a directory (e.g. if transferring the contents over a network), the number of files in the top level as well as in all subdirectories, when the directory was last modified, and any file extensions and examples with those extensions, and groups files by date modified.
This is an extension of a script I found in 'Wicked Cool Shell Scripts.'
This is a simple script that generates a color-coded SCP command to upload or download files. It was written as a tutorial in bash shell scripting.
Learn more
$ transfer file1.txt file2.txt ... file{n}.txt
It also takes advantage of filename expansion:
$ transfer *.txt
This is a silly piece of software with no practical purpose; it was written as an exercise challenge when learning bash shell scripting.
It allows you to change your command prompt to any character. It can give you a new character at each prompt, keep the same character, or return you to your old command prompt when done. The prompts chosen require changing the LC_ALL setting to allow Unicode, so this will affect sort behavior.
If you'd like to start using colors, here is the output from bin/paletteTest.sh:
The ontogeny_toolkit.sh script extends your .bashrc by adding aliases to the above software and adding the following functionality:
- noWrap
- l
- showMatches
- grabBetween
- grabLines
- checkFastq
- fixLastLine
- fixNewLines
- deleteBlankLines
- reduceMultipleBlankLines
- reduceMultipleBlankSpaces
- screenHelp and changes to your prompt when in a screen
- howToGrep
- mkdirRand, mkdirNow, MkdirTime, foo
- cleanUp
Execute noWrap to temporarily halt line wrapping in your terminal. After 20 seconds your terminal is back to default.
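This is not the actual implementation, but the underlying trick is the DECAWM terminal escape sequence (also reachable as `tput rmam`/`tput smam`); a minimal sketch, with a hypothetical function name:

```shell
# Sketch of a noWrap-style helper (hypothetical name).
# ESC[?7l disables line wrapping (DECAWM); ESC[?7h restores it.
nowrap_demo() {
    seconds=${1:-20}
    printf '\033[?7l'     # stop wrapping long lines
    sleep "$seconds"
    printf '\033[?7h'     # back to default wrapping
}
```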
Execute l (lowercase L) to list everything in the directory in a more human-readable fashion, including the time stamps. It's a simple alias.
Execute showMatches file.txt pattern to show all matches (highlighted) with context. Add another argument to set the amount of context you want to include: showMatches file.txt pattern 10.
Very fast and useful for parsing files with multiple matches, for example looking for a certain type of error in an error log.
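A comparable effect can be had with plain grep's context and color options, which is roughly the idea a showMatches-style helper builds on (function name hypothetical, not the toolkit's implementation):

```shell
# Sketch of a showMatches-style helper: highlighted matches with context.
# Usage: show_matches file.txt pattern [context_lines]
show_matches() {
    grep --color=always -C "${3:-2}" -E "$2" "$1"
}
```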
Patterns have extensive regex support.
grabBetween file.txt pattern1 pattern2
Note that this will grab the first match of the pattern found, and will ignore further matches.
Patterns have extensive regex support
grabLines file.txt 100 250
This will return all content between line numbers.
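In spirit this is a one-liner around sed's line-range addressing; a minimal sketch with a hypothetical name:

```shell
# Sketch of a grabLines-style helper (hypothetical name):
# print lines $2 through $3 of file $1, inclusive.
grab_lines() {
    sed -n "${2},${3}p" "$1"
}
```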
checkFastq grabs the content between specific line numbers in a gzipped fastq file. Same usage as grabLines.
cat file.txt | fixLastLine > file2.txt
Pipe to this to fix a missing line break at the end of a file. Very common with data saved from spreadsheets or text files from Windows PCs.
Note that this is useful because many programs define a line as ending in a line break. If the last line does not end with a line break, it may cause issues with some software.
cat file.txt | fixNewLines > file2.txt
Pipe to this to fix CRLF lines in a file. Very common with data saved from spreadsheets or text files from Windows PCs.
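Both line-ending fixes are small filters at heart. Sketches with hypothetical names (not the actual toolkit functions): deleting carriage returns handles CRLF, and a plain `awk '1'` treats a trailing partial line as a record and prints it with a newline.

```shell
# Sketches of the two line-ending fixes (hypothetical names):
strip_crlf() {           # fixNewLines-style: delete carriage returns
    tr -d '\r'
}
ensure_final_newline() { # fixLastLine-style: awk treats a trailing
    awk '1'              # partial line as a record, printed with \n
}
```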
cat file.txt | deleteBlankLines > file2.txt
Removes blank lines from a file. Used in a pipe.
cat file.txt | reduceMultipleBlankLines > file2.txt
This will fix up a file by reducing regions with multiple blank lines to only one blank line.
cat file.txt | reduceMultipleBlankSpaces > file2.txt
This will clean up a file, reducing areas with more than one space to only one space.
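These three cleanup filters map closely onto standard tools; minimal sketches with hypothetical names (GNU `cat -s` assumed for the blank-line squeeze):

```shell
# Sketches of the cleanup filters (hypothetical names):
delete_blank_lines() {   # drop empty/whitespace-only lines
    grep -v '^[[:space:]]*$'
}
squeeze_blank_lines() {  # collapse runs of blank lines to one
    cat -s
}
squeeze_spaces() {       # collapse runs of spaces to one
    tr -s ' '
}
```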
Your prompt will automatically change when entering a screen to alert you that you're in a screen session.
You can also invoke help by simply running screenHelp, either in the screen session or on the command line, for a quick refresher. It will also show you a list of running screen sessions, or the name of your current screen if in one.
Since grep is such an important tool for bioinformaticians to learn, there's also a howToGrep refresher.
If you find yourself making a lot of tmp, temp or foo directories and getting them mixed up, here are a few commands to make a unique directory that you can keep track of.
- inspect
- formatted
- align
- alternateRows
- colorRows
- blocks
- grid
- linesNotEmpty
- linesContent
- writing
- nonascii
- ascii
inspect file.txt
Default is to include first and last 5 lines, but you can set a different number: inspect file.txt 20
cat file.txt | processing | formatted
This will allow you to align files in a pipe. Can set any delimiter, but defaults to tab. For example, a csv file: cat file.txt | formatted ","
align file.txt "delimiter"
Aligns a file according to delimiter. If no delimiter set, defaults to tab.
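The alignment itself can be sketched as a two-pass awk: pass one measures the widest cell in each column, pass two pads every cell to that width. (Where available, `column -t -s $'\t'` achieves much the same thing.) The function name is hypothetical, and this fixed-tab version is a sketch, not the actual script:

```shell
# Sketch of what formatted/align do: measure each column's widest cell,
# then pad every cell to that width (tab-delimited input assumed).
align_demo() {
    awk -F '\t' '
        NR == FNR {                       # pass 1: record column widths
            for (i = 1; i <= NF; i++)
                if (length($i) > w[i]) w[i] = length($i)
            next
        }
        {                                 # pass 2: print padded cells
            for (i = 1; i <= NF; i++)
                printf("%-" (w[i] + 2) "s", $i)
            print ""
        }
    ' "$1" "$1"
}
```

Passing the file twice lets `NR == FNR` distinguish the measuring pass from the printing pass.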
cat file.txt | alternateRows
Gives every other row a gray background color.
cat file.txt | colorRows
Gives every other row a random color. Set an argument to give it a background color, instead.
blocks file.tab
Can set any delimiter as the second argument, defaults to tab.
grid file.tab
Can set any delimiter as the second argument, defaults to tab.
A third argument will truncate each column, for example to 10 characters: grid file.tab tab 10
It can also truncate to the average (average or avg) of the column character lengths: grid file.tab tab avg
cat file.txt | linesNotEmpty
Returns number of lines that are not empty or white space.
cat file.txt | linesContent
Returns the number of lines containing content and which do not begin with a hashtag.
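Both counters reduce to a grep -c; minimal sketches with hypothetical names (not the toolkit's implementations):

```shell
# Sketches of the line-counting helpers (hypothetical names):
lines_not_empty() {   # lines containing something besides whitespace
    grep -c '[^[:space:]]'
}
lines_content() {     # non-blank lines whose first character isn't #
    grep -c '^[[:space:]]*[^[:space:]#]'
}
```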
Simple trick to see if a directory size changes over one second.
Note: this uses du, then sleeps and uses du again to determine the speed of writing. Large directories can take a while to run du, so the rate of writing may be inaccurate.
cat file.txt | ascii
cat file.txt | nonascii
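A sketch of how such filters can work (hypothetical names, not the actual toolkit functions): with LC_ALL=C, a bracket expression over the tab and printable-ASCII range treats everything else, including bytes above 0x7F, as suspect.

```shell
# Sketches of ascii/nonascii-style filters (hypothetical names).
# LC_ALL=C makes the byte ranges literal rather than locale-dependent.
# Note: control characters other than tab are also flagged here.
only_ascii() {        # lines with only tab and printable ASCII
    LC_ALL=C grep -v $'[^\t -~]'
}
non_ascii() {         # lines with at least one byte outside that set
    LC_ALL=C grep $'[^\t -~]'
}
```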
cleanUp and highlight file.tab CLEANUP
- whichColumn, whichColumns
- describeColumns
- summarizeColumns
- cutColumns
- columnAverage
- columnLengths
- numColumns
- maxColumns
- minColumns
Visually locate multiple spaces/tabs, helpful when data isn't validating the way you expected.
Tip: highlight has a special pattern which works even better: highlight file.txt CLEANUP.
cat file.txt | whichColumns
Figure out which column number you need.
It previews the second line of the file to help you confirm it's the correct column.
Tip: sometimes the file has funky line breaks if copied and pasted from a spreadsheet, so try cat file.txt | fixNewLines | whichColumns if you encounter troubles.
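The basic idea is small enough to sketch (hypothetical name, not the actual script): take the second line, turn tabs into newlines, and number the result.

```shell
# Sketch of the whichColumns idea (hypothetical name): number the
# fields of the second line so you can see which column is which.
which_columns_demo() {
    sed -n '2p' "$1" | tr '\t' '\n' | cat -n
}
```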
describeColumns file.tsv
A fork of whichColumns, and also provides the column number, column header and first-row value for a tab-separated file.
summarizeColumns file.tsv
will give a detailed overview of each column and let you know if the column numbers are inconsistent or the file uses Windows-style CRLF line breaks. You can set any delimiter, it defaults to tab.
Note that it gives 5 random values from each column so you get an idea of what's going on. You can instruct it to give a specific number of examples, and even truncate each example so they all fit on your screen.
cutColumns file.tsv 1 2 3
Returns the file, but without the columns specified as arguments. Can be in any order.
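For tab-separated files, GNU cut's --complement flag does most of the work; a sketch with a hypothetical name (the real script likely does more):

```shell
# Sketch of a cutColumns-style helper using GNU cut --complement
# (hypothetical name). Drops the listed columns, in any order.
cut_columns_demo() {
    f=$1; shift
    cut --complement -f "$(echo "$*" | tr ' ' ',')" "$f"
}
```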
cat file.tsv | cut -f 1 | columnAverage
This will return the average number of characters. This is for piped input, one column of data.
cat file.tsv | columnLengths
This will return the average characters in each column. Used in a pipe.
numColumns file.tsv
Returns the number of columns in a tab-separated file.
cat file.tsv | maxColumns
Returns the highest number of columns found in a tab-separated file.
cat file.tsv | minColumns
Returns the lowest number of columns found in a tab-separated file.
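All three column counters are short awk one-liners at heart; sketches with hypothetical names (not the actual toolkit functions):

```shell
# Sketches of the column-counting helpers (hypothetical names):
num_columns() {     # columns in the first line of a file
    head -n 1 "$1" | awk -F '\t' '{ print NF }'
}
max_columns() {     # highest column count on any line (reads stdin)
    awk -F '\t' 'NF > max { max = NF } END { print max }'
}
min_columns() {     # lowest column count on any line (reads stdin)
    awk -F '\t' 'NR == 1 || NF < min { min = NF } END { print min }'
}
```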
Tab-separated data can be difficult to read if the rows vary in character length. Here's an example of using the format alias. Note that to align this, a character needs to be placed in columns or rows with blanks. This will insert a period (.) character. Seeing it aligned can be easier to read than coloring the columns.
It's even easier to read than the color-coded columns program from above:
If you want to contribute some bash functions, there's a library of functions available for handling argument validity (checking if integers, etc), checking for files existing and making suggestions, etc.
The library functions begin with the prefix lib_. There are example bash functions as well: allTheArguments shows how to handle multiple files and accept unlimited arguments (as well as color results randomly), and functionFlags shows how to use flags in a bash function.
The following software is developed for specific use in data wrangling work. I do keep a repository of it here so we can all collaboratively develop (and the source code may be useful to some), but it is unlikely to find general-purpose use.
Learn more about our internal file formats and the software to work with them
A lot of this software is designed to work for:
ra file, or Tag Storm
An ra (relational-alternative) file establishes a record as a set of related tags and values in a blank-line-delimited stanza (block of text). Parent stanzas convey tags and values shared with the rest of the file. Indented stanzas inherit from parent stanzas, and can override parent settings.
These are designed to be human-readable, and reduce redundancy of tab-separated files.
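As an illustration of the shape described above (the tag names here are invented for the example, not taken from any real schema), a tiny tag storm might look like this, where the two indented sample stanzas inherit project and species from the parent stanza:

```
project mouse-timecourse
species mouse

    sample s1
    tissue liver

    sample s2
    tissue brain
```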
manifest file
This is a list of files with a unique identifier to link each file with metadata about it. Tab-separated columns.
spreadsheets
In collaborating with off-site folks who are not familiar with the command-line, it can often be easier to share Google Sheets or Excel Spreadsheets. There is some software to generate input for spreadsheets.
- Check submission
- Generate spreadsheet input
- Generate a tag storm summary
- Generate a tag summary
cat meta.txt | processing | emptyTags
Useful to see if your processing messes up any tags.
cat meta.txt | processing | removeEmptyTags > meta.new
listAllTags meta.txt
head -n 1 meta.tab | convertMisceFields >> fixedHeader.txt
This will lower-case the header, convert spaces and dashes to underbars, and even change camelCase to tag format.
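That kind of header normalization can be sketched with sed and tr (hypothetical name, not the actual convertMisceFields implementation): split camelCase with underbars, turn spaces and dashes into underbars, then lower-case everything.

```shell
# Sketch of header normalization (hypothetical name): camelCase and
# "Spaced-Dashed Names" become lower_snake_case tag names.
normalize_header() {
    sed -E 's/([a-z0-9])([A-Z])/\1_\2/g; s/[ -]+/_/g' |
        tr '[:upper:]' '[:lower:]'
}
```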
This gives a summary of a relational-alternative, or ra, file.
If an md5sum file is present, it will also validate that there are no collisions and compare it to the md5sum file.
This takes a tag storm as input, does some calculations and gives a tab-separated output for importing into a tag reconciliation spreadsheet.
This gives you a tag-by-tag count of values and totals them for you. Very useful for a high-level look at a tag storm.
This gives a summary of a tag from a tag storm, providing counts and showing all the different values and the stanza indentation for each.
Clone
First you need to clone. This will create a directory called ontogeny wherever you run this command:
$ git clone https://github.com/claymfischer/ontogeny.git
If you want to learn more about git and why it is useful when dealing with biological data, I highly recommend the book Bioinformatics Data Skills. It has a fantastic chapter on git and what you need to know, and explains it in a no-nonsense manner, assuming you have no background in bioinformatics. The entire book is an amazing resource well worth every penny.
bash startup file
Add the following to your .bashrc
and edit the ONTOGENY_INSTALL_PATH
:
# Ontogeny repository path:
ONTOGENY_INSTALL_PATH=/path/to/the/repository
source $ONTOGENY_INSTALL_PATH/lib/ontogeny_toolkit.sh
Protip: put this at the top of your .bashrc file. This way it won't override your own settings of the same variables. For instance, if you have a PS1 set in your .bashrc, it won't get overridden if this is sourced at the top.
make
Currently looking into enabling users to simply run make install from the repository directory to copy executables to where they need to be.