DenseHMM is a modification of Hidden Markov Models (HMMs) that makes it possible to learn dense vector representations of both the hidden states and the observables via gradient-descent methods. The code accompanies our paper "DenseHMM: Learning Hidden Markov Models by Learning Dense Representations" and can be used to reproduce the results therein.
- DenseHMM uses a parameter-efficient, non-linear matrix factorization to describe transition probabilities of HMMs.
- Two approaches for model training: a) EM-optimization with a gradient-based M-step or b) direct optimization of observation co-occurrences, which provides better scalability compared to EM-based multi-step schemes.
- Competitive model performance in extensive empirical evaluations.
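The factorization idea from the first bullet can be illustrated with a minimal sketch. It assumes that each hidden state has two embedding vectors and that transition probabilities are obtained by a row-wise softmax over their inner products; the function names, array names, and sizes below are hypothetical and are not taken from models.py.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def dense_transition_matrix(z, u):
    """Build an (n, n) row-stochastic matrix from two (n, d) embedding arrays.

    Entry (i, j) is proportional to exp(z[i] . u[j]). This is an assumed
    parameterization for illustration, not the repository's implementation.
    """
    return softmax_rows(z @ u.T)

rng = np.random.default_rng(0)
n_states, dim = 5, 3                    # hypothetical sizes
z = rng.normal(size=(n_states, dim))    # embeddings of the current state (assumed name)
u = rng.normal(size=(n_states, dim))    # embeddings of the next state (assumed name)
A = dense_transition_matrix(z, u)       # each row is a probability distribution
```

With this kind of parameterization, the matrix is described by 2 * n_states * dim numbers instead of n_states * n_states, which is smaller whenever dim is less than half the number of states.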
DenseHMM is compared to various hidden Markov models. We base our code on the hmmlearn library.
We used a conda environment on Debian Linux 9. Use the provided dense_hmm.yml to create this environment as follows:
conda env create --name dense_hmm --file=dense_hmm.yml
We use the Natural Language Toolkit (the nltk Python module) to download the Penn Treebank dataset.
Version of the module: nltk=3.4.5 (as specified in dense_hmm.yml).
We obtained the sequences in April 2020 using (as in data.py):
import nltk
from nltk.corpus import treebank
nltk.download('treebank')  # fetch the corpus before it is first accessed
sequences = treebank.tagged_sents()
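HMM training operates on sequences of discrete symbols, so the tagged sentences have to be turned into integer sequences at some point. The following is a minimal sketch of one way to do that. It assumes that the part-of-speech tags serve as observations, and the function name and toy sentences are hypothetical; the actual pre-processing is in data.py and may differ.

```python
def encode_tag_sequences(tagged_sents):
    """Map the tags of each sentence to integer ids.

    `tagged_sents` is an iterable of sentences, each a list of (word, tag)
    pairs, which is the shape returned by treebank.tagged_sents().
    Returns the encoded sequences and the tag-to-id dictionary.
    """
    tag_to_id = {}
    sequences = []
    for sent in tagged_sents:
        ids = []
        for _word, tag in sent:
            if tag not in tag_to_id:
                tag_to_id[tag] = len(tag_to_id)  # assign ids in order of first appearance
            ids.append(tag_to_id[tag])
        sequences.append(ids)
    return sequences, tag_to_id

# Toy data standing in for the real corpus (hypothetical sentences).
toy_sents = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("A", "DT"), ("cat", "NN")],
]
encoded, tag_to_id = encode_tag_sequences(toy_sents)
```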
We downloaded the RCSB PDB protein sequences in October 2019 from https://www.rcsb.org/#Subcategory-download_sequences. We used the gzipped FASTA file containing all PDB sequences. Once downloaded, put the pdb_seqres.txt.gz file in the data directory.
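To check that the download is intact before starting an experiment, the file can be read directly with the standard library. The sketch below assumes the usual FASTA layout (header lines starting with ">" followed by one or more sequence lines); the function name is hypothetical, and the repository's own loading code in data.py may work differently.

```python
import gzip

def read_fasta(path):
    """Yield (header, sequence) pairs from a gzipped FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

For example, `next(read_fasta("data/pdb_seqres.txt.gz"))` would return the first record, assuming the file sits in a data directory relative to the working directory.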
- The following Jupyter notebook contains the source for running the experiments of section 4: start_matrix_fit_experiment.ipynb. Just run all cells of the notebook. This creates a new directory in the same folder as the notebook, in which the results are stored.
- The following files contain the source for running the experiments of section 5:
  - data.py (data pre-processing)
  - experiment.py (parses experiment parameters, starts experiments)
  - models.py (standard HMM and DenseHMM models)
  - utils.py (various utility functions used throughout the package)
  - hmmc/_hmmc.c (from hmmlearn; the function for the E-step was modified to log additional data)
  - start_protein_experiment.ipynb
  - start_synthetic_experiment.ipynb
  - start_penntree_experiment.ipynb
- In the Jupyter notebooks listed above, please set the ROOT_PATH variable to the directory containing the source files (ROOT_PATH must end with /).
- During training, log-likelihood scores, model parameters and sequence samples are stored in a new directory that is created in ROOT_PATH. These values are stored in a dictionary that is subsequently used for evaluation and to create visualizations.
- Run the Jupyter notebooks to start the respective model training.
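Because ROOT_PATH must end with a separator, it can help to normalize the value instead of typing the trailing slash by hand. This is a small sketch using only the standard library; the helper name and the example directory are hypothetical.

```python
import os

def with_trailing_separator(path):
    """Append a trailing path separator if the path does not already end with one."""
    # Joining with an empty component adds the separator only when it is missing.
    return os.path.join(path, "")

# Hypothetical location of the source files; replace with your own directory.
ROOT_PATH = with_trailing_separator("/home/user/dense_hmm")
```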
- The following Jupyter notebook contains the source for evaluating the experiments of section 4: evaluate_matrix_fit_experiment.ipynb. Fill in the exp_dir path in the notebook and run all cells.
- The following files contain the source for evaluating the experiments of section 5: utils.py, plot.py, evaluate.ipynb.
- Fill in the paths in evaluate.ipynb and run the cells to evaluate and plot the results.
- Due to random train-test splits and random initializations, the obtained results might deviate slightly from those reported in the paper.
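The effect of random splits on run-to-run variation can be seen in a minimal sketch of a seeded train-test split. This is an illustration only: the function name, the default test fraction, and the seed argument are assumptions, and the repository's notebooks may neither split the data this way nor expose a seed.

```python
import random

def train_test_split(sequences, test_fraction=0.2, seed=0):
    """Shuffle indices with a seeded generator and split into (train, test)."""
    rng = random.Random(seed)            # local generator, leaves global state untouched
    indices = list(range(len(sequences)))
    rng.shuffle(indices)
    n_test = int(round(test_fraction * len(sequences)))
    test = [sequences[i] for i in indices[:n_test]]
    train = [sequences[i] for i in indices[n_test:]]
    return train, test

# Toy sequences (hypothetical data).
toy = [[i, i + 1] for i in range(10)]
train_a, test_a = train_test_split(toy, seed=42)
train_b, test_b = train_test_split(toy, seed=42)   # same seed, same split
```

Fixing the seed makes the split repeatable, while different seeds give different splits and therefore slightly different scores.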
All experiments were conducted on an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz and an NVIDIA Tesla V100 GPU.
Using the training parameters specified in the Jupyter notebooks, we observed the following approximate runtimes:
The matrix fit experiment usually takes less than 34 h.
Penn Treebank training:
- Training a standard HMM usually takes less than 4 min.
- Training a DenseHMM in cooc mode usually takes less than 2 min.
- Training a DenseHMM in EM mode usually takes less than 6 min.
- A single experiment usually takes less than 16 min.
- Whole experiment run (100 experiments) usually takes less than 27 h.
Protein training:
- Fitting a DenseHMM model in EM mode usually takes less than 12 min.
- Fitting a DenseHMM model in cooc mode usually takes less than 1 min.
- Fitting standard HMM models usually takes less than 8 min.
- A single experiment run usually takes less than 30 min.
- Whole experiment run (100 experiments) usually takes less than 48 h.
Synthetic training:
- Fitting the standard HMM models usually takes less than 20 s.
- Fitting the DenseHMM models usually takes less than 40 s.
- A single experiment usually takes less than 2 min.
- Whole experiment run (100 experiments) usually takes less than 4 h.
DenseHMM is released under the MIT license.
If you use or reference DenseHMM in your research, please use the following BibTeX entry.
@article{densehmm,
author = {Joachim Sicking and Maximilian Pintz and Maram Akila and Tim Wirtz},
title = {DenseHMM: Learning Hidden Markov Models by Learning Dense Representations},
journal = {NeurIPS 2020 Workshop on Learning Meaningful Representations of Life (LMRL)},
year = {2020}
}