BIOINFORMATICS IN ONE LINE

In biology, there are many ways to solve a task. However, biologists sometimes need simple ways to get easy tasks done quickly, without compromising efficiency or accuracy. If that is what you need, the following list might help. Below I have gathered some commands that I wrote or found useful for solving different bioinformatics problems. Some of these scripts are fairly specific, so bear that in mind when adapting them to your own work.

Fasta / FASTQ

Convert multiple-line Fasta to single-line Fasta

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < INPUT.fasta | tail -n +2 > OUTPUT.fasta

Count bases (A, C, G, T) and missing (N, ?) per sample in single-line Fasta (slow for long files!)

while IFS= read -r line; do echo "$line" | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c | paste - - - - | tr "\n" "\t" ;  echo "$line" | grep -v '>' | grep -o "[?N]" | sort | uniq -c | sort -k2r | paste - - ; echo "$line" | grep '>' | tr "\n" "\t" ; done < INPUT.fasta
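
For long files, a single awk pass avoids starting several processes per line. The sketch below is an alternative to the loop above, not a drop-in replacement: the function name count_bases and the output layout are made up for illustration, and it assumes lowercase bases should be counted together with uppercase ones.

```shell
# Hypothetical helper: per-sequence counts of A, C, G, T and missing (N, ?)
# in one awk pass. Accepts single- or multi-line Fasta, from files or stdin.
count_bases() {
  awk '
    # Print the counts collected for the current sequence, if any.
    function report() {
      if (name != "")
        printf "%s\tA=%d\tC=%d\tG=%d\tT=%d\tmissing=%d\n", name, a, c, g, t, m
    }
    # Header line: emit the previous record and reset the counters.
    /^>/ { report(); name = substr($0, 2); a = c = g = t = m = 0; next }
    # Sequence line: gsub returns the number of matches it replaced.
    {
      s = toupper($0)
      a += gsub(/A/, "", s)
      c += gsub(/C/, "", s)
      g += gsub(/G/, "", s)
      t += gsub(/T/, "", s)
      m += gsub(/[N?]/, "", s)
    }
    END { report() }
  ' "$@"
}

# Usage: count_bases INPUT.fasta
```

Each sequence gets one tab-separated line, which is easier to load into a spreadsheet or R than the output of the loop.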

Count total number of bases in Fasta (across all sequences)

grep -v ">" INPUT.fasta | wc | awk '{print $3-$1}'

Count number of sequences in FASTQ.gz file

parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: INPUT.gz
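
If GNU parallel is not installed, a plain loop does the same job, one file at a time. This is a minimal sketch that assumes gzip-compressed FASTQ with exactly four lines per record; the function name count_fastq_reads is hypothetical.

```shell
# Hypothetical helper: print "<file><TAB><number of reads>" for each
# gzip-compressed FASTQ file given as an argument.
# Assumes four lines per record (no wrapped sequence or quality lines).
count_fastq_reads() {
  for f in "$@"; do
    printf '%s\t%s\n' "$f" "$(gunzip -c "$f" | awk 'END { print NR / 4 }')"
  done
}

# Usage: count_fastq_reads *.fastq.gz
```

A count that is not a whole number suggests a truncated file or wrapped records.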

Using a list of identifiers (one sequence name per line), extract sequences from Fasta/FASTQ files (needs seqtk)

seqtk subseq INPUT.fq name.list > out.fq

Parse a Fasta (useful to modify identifiers)

awk '/^>/{printf $0"\n",++i; next}{print}' INPUT.fasta
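
As one example of modifying identifiers with this pattern, the sketch below appends a running index to every header. The function name number_headers and the "_N" suffix are illustrative choices, not part of the original command. Passing the header as a printf argument rather than as the format string keeps headers that contain "%" intact.

```shell
# Hypothetical helper: append a running index to every Fasta header,
# so ">seqA" becomes ">seqA_1". Sequence lines pass through unchanged.
number_headers() {
  awk '/^>/ { printf "%s_%d\n", $0, ++i; next } { print }' "$@"
}

# Usage: number_headers INPUT.fasta > OUTPUT.fasta
```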

Sequence length from Fasta

awk '/^>/ {if (NR > 1) print c; c=0; printf "%s\t", substr($0,2,200); next} {c+=length($0)} END {print c}' INPUT.fasta

Alignments

For a particular sample in a phylip alignment, count occurrences of a single character (e.g. "A")

grep "SAMPLE" ALIGNMENT.phylip | awk '{print $2}' | awk -F"A" '{print NF-1}'

Phylogenetic trees

Fast way to display a tree in newick format (needs 'ete3')

ete3 view --text -t TREE.nw

Compare topology of a list of trees vs reference tree (needs 'ete3')

ete3 compare --src_tree_list TREES.list -r REFERENCE.nw

Reroot newick tree (needs 'newick_utils')

nw_reroot -l INPUT.tree OUTGROUP

SNPs

Percentage of heterozygous SNPs in a beagle file (set the shell variable total, the denominator, beforehand)

echo $(awk '{if ($3 != $4) print $3, $4 }' INPUT.bgl | wc -l )/$total*100 | bc -l
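
The command above relies on a variable named total that has to be defined elsewhere. The sketch below computes both the numerator and the denominator in one awk pass, under the assumption that the denominator is simply the number of rows with at least four columns and that columns 3 and 4 hold the two alleles of the sample of interest. The function name het_percent is hypothetical, and any header rows in your file would need to be filtered out first.

```shell
# Hypothetical helper: percentage of rows whose 3rd and 4th columns differ.
# Assumption: every row with at least four columns is one SNP.
het_percent() {
  awk 'NF >= 4 { total++; if ($3 != $4) het++ }
       END {
         if (total) printf "%.2f\n", 100 * het / total
         else print "NA"
       }' "$@"
}

# Usage: het_percent INPUT.bgl
```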

Others

General use:

Compare two unsorted lists (prints the lines found only in file2)

comm -13 <(sort file1) <(sort file2)

Delete all files with size 0 in current directory

find . -type f -empty -delete

Add extension (e.g. ".tre") to multiple files

find . -type f -print0 | xargs -0 -I{} mv "{}" "{}".tre

Remove extension (e.g. ".txt") for multiple files

find . -type f -name '*.txt' | while IFS= read -r f; do mv "$f" "${f%.txt}"; done

Fast way (multi-core) to compress or decompress big files (using pigz)

tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz

pigz -dc target.tar.gz | tar xf -

Sum a column of numbers

<cmd> | paste -sd+ - | bc
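
The paste and bc pipeline fails on empty input and on blank lines. An awk version is a little more forgiving and can pick the column directly. This is a sketch: the function name sum_column is made up, and the column number defaults to 1.

```shell
# Hypothetical helper: sum one whitespace-separated column read from stdin.
# First argument is the column number (default 1).
# Blank lines and non-numeric fields count as 0.
sum_column() {
  awk -v col="${1:-1}" '{ s += $col } END { print s + 0 }'
}

# Usage: <cmd> | sum_column 2
```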

Specific use:

Count the occurrences of a specific string (e.g. "NA") in each line (also prints the 1st word of the line, delimited by space)

paste <(awk '{print $1}' INPUT.file) <(awk -F'NA' '{print NF-1}' INPUT.file)

biomendi/BIOINFORMATICS-IN-ONE-LINE
