HMMER3Di - A fork of HMMER 3.3.2 and Easel 0.48, patched to support the Foldseek (3Di) alphabet.
This program was used in the work:
Johnson, Sean R., et al. “Sensitive Remote Homology Search by Local Alignment of Small Positional Embeddings from Protein Language Models.” eLife, vol. 12, Feb. 2024. elifesciences.org, https://doi.org/10.7554/eLife.91415.2.
Note that this patched version of HMMER doesn't seem to perform any better on 3Di sequences than the original (amino acid tuned) version. I'm not sure exactly why.
The original HMMER and Easel repositories are available on GitHub.
% git clone git@github.com:seanrjohnson/hmmer3di.git
% cd hmmer3di
% autoconf
% ./configure --prefix /your/install/path
% make
% source copy_executables.sh
% # copy_executables.sh creates a new directory called hmmer3Di and copies
% # hmmalign, hmmbuild, hmmpress, hmmsearch, hmmscan, and phmmer into it,
% # with 3Di_ prepended to their names. From there you can execute them
% # or copy them into a directory on your $PATH
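Once the renamed executables exist, a typical session might look like the sketch below. It assumes the patched tools keep the upstream HMMER argument order (hmmbuild takes an output HMM file and an input alignment, hmmsearch takes an HMM file and a sequence file). The variable HMMER3DI_DIR and the file names family_3di.sto, family_3di.hmm, and targets_3di.fasta are placeholders, not files shipped with this repository.

```shell
# Directory created by copy_executables.sh; adjust if you moved it (assumed location).
HMMER3DI_DIR="${HMMER3DI_DIR:-$PWD/hmmer3Di}"
PATH="$HMMER3DI_DIR:$PATH"
export PATH

if command -v 3Di_hmmbuild >/dev/null 2>&1; then
    # Build a profile from a 3Di alignment (placeholder file names).
    3Di_hmmbuild family_3di.hmm family_3di.sto
    # Search a 3Di sequence file with that profile.
    3Di_hmmsearch family_3di.hmm targets_3di.fasta
else
    echo "3Di_hmmbuild not found in $HMMER3DI_DIR" >&2
fi
```

Both the alignment and the target sequences need to be 3Di strings, for example as exported from a Foldseek database, rather than amino acid sequences.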
To generate a set of 3Di MSAs, we converted the AlphaFold UniProt Foldseek database (Jumper et al., 2021; van Kempen et al., 2023; Varadi et al., 2022) to a 3Di fasta file. We then looked up every sequence name from the Pfam 35 seed file in the UniProt 3Di fasta file and, for cases where the corresponding sequence was identifiable, extracted the sub-sequence corresponding to the Pfam 35 seed. 3Di seeds from each profile were aligned using MAFFT. MSA columns with more than 10 rows were used to calculate background frequencies (3Di_background_frequencies.txt) and Dirichlet priors (pfam_35_3Di_msa_counts_lb_10.mixdchlet.txt) using the HMMER3 program esl-mixdchlet fit with options -s 17 9 20.
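The column-counting step described above can be sketched in shell. This is a minimal illustration, not the script used for the paper. The helper name column_counts is hypothetical, "more than 10 rows" is interpreted here as more than 10 non-gap residues in a column, and the output layout (one whitespace-separated count vector per line, in the order ACDEFGHIKLMNPQRSTVWY) is an assumption about what esl-mixdchlet expects as input.

```shell
# column_counts MIN: hypothetical helper, not part of HMMER3Di.
# Reads an aligned FASTA file on stdin and prints one line of 20 letter counts
# for every alignment column that has more than MIN non-gap residues.
column_counts() {
    awk -v min="${1:-10}" '
        BEGIN { alphabet = "ACDEFGHIKLMNPQRSTVWY"; n = 0 }
        /^>/  { n++; next }                      # header line starts a new sequence
              { seq[n] = seq[n] toupper($0) }    # join wrapped sequence lines
        END {
            width = length(seq[1])
            for (col = 1; col <= width; col++) {
                split("", count)                 # clear the per-column counts
                total = 0
                for (i = 1; i <= n; i++) {
                    c = substr(seq[i], col, 1)
                    if (c != "" && index(alphabet, c) > 0) { count[c]++; total++ }
                }
                if (total > min) {
                    line = count[substr(alphabet, 1, 1)] + 0
                    for (k = 2; k <= 20; k++)
                        line = line " " (count[substr(alphabet, k, 1)] + 0)
                    print line
                }
            }
        }'
}

# Example use (file names are placeholders):
#   column_counts 10 < family_3di.afa >> counts.txt
# Assumed argument layout for the fit: seed 17, 9 mixture components, alphabet size 20.
#   esl-mixdchlet fit -s 17 9 20 counts.txt priors.mixdchlet.txt
```

Appending the counts from every Pfam family to a single file before fitting would give one mixture prior for the whole data set.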
A full list of changes can be seen in the following diff: https://github.com/seanrjohnson/hmmer3di/compare/2637afc..87a5d15