Showing 35 changed files with 48 additions and 81 deletions.
@@ -3,7 +3,11 @@ This repository contains code and instructions to reproduce the results presente

## Requirements
- [geniml](https://github.com/databio/geniml)
- beautifulsoup4
- python=3.9
- pybedtools
- [bedtools](https://bedtools.readthedocs.io/en/latest/content/installation.html)

```
git clone git@github.com:databio/geniml.git
cd geniml
@@ -25,7 +29,7 @@ EVAL_RESULTS_FOLDER: folder that stores all the evaluation results
### Download the dataset
Run the following command:
```bash
-python download_dataset.py
+python -m src.download_dataset
```
Or download all the [content](http://big.databio.org/region2vec_eval/tfbs_dataset/) to `DATA_FOLDER`.
### Prepare universes
@@ -34,17 +38,17 @@ We provided all the seven universes used in the paper at [hg19 universes](http:/
We used the following code to generate all universes except the DHS universe, which is an external universe. To generate universes from your own data, use the same code and change only `DATA_FOLDER` in `config.py` and the total number of files passed to `-n`.
```bash
# The Merge (100) universe
-python gen_universe.py -m merge -n 690 -d 100
+python -m src.gen_universe -m merge -n 690 -d 100
# The Merge (1k) universe
-python gen_universe.py -m merge -n 690 -d 1000
+python -m src.gen_universe -m merge -n 690 -d 1000
# The Merge (10k) universe
-python gen_universe.py -m merge -n 690 -d 10000
+python -m src.gen_universe -m merge -n 690 -d 10000
# The Tiling (1k) universe
-python gen_universe.py -m tile -v hg19 -n 690 -t 1000
+python -m src.gen_universe -m tile -v hg19 -n 690 -t 1000
# The Tiling (5k) universe
-python gen_universe.py -m tile -v hg19 -n 690 -t 5000
+python -m src.gen_universe -m tile -v hg19 -n 690 -t 5000
# The Tiling (25k) universe
-python gen_universe.py -m tile -v hg19 -n 690 -t 25000
+python -m src.gen_universe -m tile -v hg19 -n 690 -t 25000
```
### Train embedding models
You can download all the trained models from [models](http://big.databio.org/region2vec_eval/tfbs_models/) to `MODELS_FOLDER` (set in `config.py`). Note that `Large`, `Medium` and `Small` correspond to `Merge (100)`, `Merge (1k)` and `Merge (10k)`, respectively, in the paper.
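
The six `gen_universe` invocations in the hunk above differ only in the mode and one size parameter, so they could be generated from a small table instead of being typed by hand. The following is a minimal sketch, not part of the repository: the helper name `universe_commands` and the `UNIVERSES` table are hypothetical, while the flags (`-m`, `-n`, `-d`, `-v`, `-t`) and their values are copied from the commands shown in the diff. It only builds and prints the command lines; it does not run them.

```python
import shlex

# (label, mode, size) for each generated universe listed in the README.
# For "merge" the size is passed to -d, for "tile" it is passed to -t.
UNIVERSES = [
    ("Merge (100)", "merge", 100),
    ("Merge (1k)", "merge", 1000),
    ("Merge (10k)", "merge", 10000),
    ("Tiling (1k)", "tile", 1000),
    ("Tiling (5k)", "tile", 5000),
    ("Tiling (25k)", "tile", 25000),
]


def universe_commands(num_files=690, assembly="hg19", python="python"):
    """Return {label: argv} mirroring the README commands (hypothetical helper)."""
    commands = {}
    for label, mode, size in UNIVERSES:
        argv = [python, "-m", "src.gen_universe", "-m", mode]
        if mode == "tile":
            # Tiling universes also take the genome assembly via -v.
            argv += ["-v", assembly, "-n", str(num_files), "-t", str(size)]
        else:
            argv += ["-n", str(num_files), "-d", str(size)]
        commands[label] = argv
    return commands


if __name__ == "__main__":
    for label, argv in universe_commands().items():
        print(f"# The {label} universe")
        print(shlex.join(argv))
```

To use your own data, pass the number of input files as `num_files`, which maps to the `-n` flag described above.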
@@ -53,7 +57,7 @@ We used the following steps to get all the models.

1. Generate training scripts via
```bash
-python gen_train_scripts.py
+python -m src.gen_train_scripts
```
2. Then, go to the `TRAIN_SCRIPTS_FOLDER` (specified in `config.py`) folder, and run all the scripts there to get trained models.

@@ -62,22 +66,22 @@ We used the following steps to get all the models.

3. After training Region2Vec models, run the following code to generate base embeddings, namely Binary, PCA-10D, and PCA-100D, for each of the seven universes.
```bash
-python get_base_embeddings.py
+python -m src.get_base_embeddings
```

To obtain the results in Table S2, run the following code
```bash
-python assess_universe.py
+python -m src.assess_universe
```
Note that we do not assess the original universes. Because Region2Vec filters out some low-frequency regions in a universe based on the training data, we focused on the actual universes, which contain only regions that have embeddings.

## Evaluate region embeddings
Run the following scripts to obtain the evaluation results.
```bash
-python eval_script.py --type GDSS
-python eval_script.py --type NPS
-python eval_script.py --type CTS
-python eval_script.py --type RCS
+python -m src.eval_script --type GDSS
+python -m src.eval_script --type NPS
+python -m src.eval_script --type CTS
+python -m src.eval_script --type RCS
```

To speed up the process, you can split the universes into batches (Line 209, `eval_script.py`)
@@ -90,7 +94,7 @@ batches = [
```
Then, run the evaluation on each batch in parallel. For example,
```bash
-python eval_script.py --type GDSS --batch 0
+python -m src.eval_script --type GDSS --batch 0
```
will evaluate models for the Tiling (1k) and Tiling (25k) universes.
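
Running each batch in parallel, as described above, could be scripted along the following lines. This is a sketch under stated assumptions rather than the repository's own tooling: the helper names `build_commands` and `run_all` are hypothetical, the number of batches is not visible in this diff (the `batches` list is elided) and must be supplied by the caller, and only `--type GDSS --batch 0` is confirmed by the README, so combining `--batch` with the other evaluation types is an assumption.

```python
import shlex
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Evaluation types taken from the README commands.
EVAL_TYPES = ["GDSS", "NPS", "CTS", "RCS"]


def build_commands(eval_types, num_batches, python=sys.executable):
    """Build one argv per (type, batch) pair. num_batches is an assumption:
    it must match the length of the `batches` list in eval_script.py."""
    return [
        [python, "-m", "src.eval_script", "--type", t, "--batch", str(b)]
        for t in eval_types
        for b in range(num_batches)
    ]


def run_all(commands, max_workers=2, runner=subprocess.run):
    """Run the commands concurrently and return (argv, returncode) pairs.
    Threads are sufficient here because each worker only waits on a child process."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(runner, commands))
    return [(cmd, res.returncode) for cmd, res in zip(commands, results)]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_commands(["GDSS"], num_batches=2, python="python"):
        print(shlex.join(cmd))
```

Calling `run_all(build_commands(["GDSS"], num_batches=N))` from the repository root would launch the batches; choose `max_workers` according to the memory each evaluation needs.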

@@ -99,7 +103,7 @@ We designed cell type and antibody type classification tasks for the trained reg

Run the classification using the following script:
```bash
-python classification.py
+python -m src.classification
```

## Analyze results
@@ -110,5 +114,5 @@ The visualizations of different sets of region embeddings can be found at [embed

We used the following command to generate UMAP visualizations of all sets of region embeddings.
```bash
-python visualization.py
+python -m src.visualization
```
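
Every command in this README diff changes from `python <script>.py` to `python -m src.<script>`. With `-m`, Python imports the module as part of the `src` package and puts the current directory on the import path, so the commands need to be run from the directory that contains `src/`. The sketch below illustrates one practical consequence using a throwaway package in a temporary directory. The file names `config.py` and `tool.py` and the relative import are illustrative assumptions; this diff does not show how the repository's scripts actually import `config`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def demo():
    """Run the same file with `-m` and as a plain script; return both results."""
    with tempfile.TemporaryDirectory() as root:
        pkg = Path(root) / "src"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "config.py").write_text("DATA_FOLDER = '/tmp/data/'\n")
        # A module that relies on a package-relative import (assumed layout).
        (pkg / "tool.py").write_text(
            "from .config import DATA_FOLDER\nprint(DATA_FOLDER)\n"
        )
        # `-m src.tool` runs tool.py with its package set to `src`.
        as_module = subprocess.run(
            [sys.executable, "-m", "src.tool"],
            cwd=root, capture_output=True, text=True,
        )
        # Running the file by path gives it no parent package.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "tool.py")],
            cwd=root, capture_output=True, text=True,
        )
    return as_module, as_script


if __name__ == "__main__":
    as_module, as_script = demo()
    print("python -m src.tool  -> exit", as_module.returncode)
    print("python src/tool.py  -> exit", as_script.returncode)
```

In this toy layout the `-m` form succeeds while the path form fails with a relative-import error, which is one common reason for moving to `python -m` after scripts are relocated into a package directory.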
@@ -1,8 +1,16 @@
DATA_URL = "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/"


-TRAIN_SCRIPTS_FOLDER = '/root_path/region2vec_eval/train_scripts/'
-DATA_FOLDER = '/root_path/region2vec_eval/tfbs_datasets/'
-MODELS_FOLDER = '/root_path/region2vec_eval/tfbs_region2vec_models/'
-UNIVERSES_FOLDER = '/root_path/region2vec_eval/hg19_universes/'
-EVAL_RESULTS_FOLDER = '/root_path/region2vec_eval/eval_results/'
+# TRAIN_SCRIPTS_FOLDER = '/root_path/region2vec_eval/train_scripts/'
+# DATA_FOLDER = '/root_path/region2vec_eval/tfbs_datasets/'
+# MODELS_FOLDER = '/root_path/region2vec_eval/tfbs_region2vec_models/'
+# UNIVERSES_FOLDER = '/root_path/region2vec_eval/hg19_universes/'
+# EVAL_RESULTS_FOLDER = '/root_path/region2vec_eval/eval_results/'
+
+
+
+TRAIN_SCRIPTS_FOLDER = '/bigtemp/gz5hp/region2vec_eval/train_scripts/'
+DATA_FOLDER = '/bigtemp/gz5hp/region2vec_eval/tfbs_datasets/'
+MODELS_FOLDER = '/bigtemp/gz5hp/region2vec_eval/tfbs_region2vec_models/'
+UNIVERSES_FOLDER = '/bigtemp/gz5hp/region2vec_eval/hg19_universes/'
+EVAL_RESULTS_FOLDER = '/bigtemp/gz5hp/region2vec_eval/eval_results/'
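
This hunk replaces the generic `/root_path/` placeholders with one user's absolute paths and keeps the placeholders as comments. One possible alternative, sketched here as a suggestion and not as what the commit does, is to derive all five folders from a single root that can be overridden per machine. The environment variable name `REGION2VEC_EVAL_ROOT` and the helper `resolve_folders` are hypothetical; the folder names and the `/root_path/region2vec_eval` default are taken from the diff.

```python
import os

# Hypothetical environment variable; not defined anywhere in the repository.
ROOT_ENV_VAR = "REGION2VEC_EVAL_ROOT"
# Placeholder root from the original config.py.
DEFAULT_ROOT = "/root_path/region2vec_eval"

# Constant name -> subfolder, copied from the paths in the diff.
SUBFOLDERS = {
    "TRAIN_SCRIPTS_FOLDER": "train_scripts",
    "DATA_FOLDER": "tfbs_datasets",
    "MODELS_FOLDER": "tfbs_region2vec_models",
    "UNIVERSES_FOLDER": "hg19_universes",
    "EVAL_RESULTS_FOLDER": "eval_results",
}


def resolve_folders(environ=os.environ):
    """Return the five folder paths under the configured root.
    Paths keep the trailing slash used by the original constants."""
    root = environ.get(ROOT_ENV_VAR, DEFAULT_ROOT).rstrip("/")
    return {name: f"{root}/{sub}/" for name, sub in SUBFOLDERS.items()}


_folders = resolve_folders()
TRAIN_SCRIPTS_FOLDER = _folders["TRAIN_SCRIPTS_FOLDER"]
DATA_FOLDER = _folders["DATA_FOLDER"]
MODELS_FOLDER = _folders["MODELS_FOLDER"]
UNIVERSES_FOLDER = _folders["UNIVERSES_FOLDER"]
EVAL_RESULTS_FOLDER = _folders["EVAL_RESULTS_FOLDER"]
```

With this layout, setting `REGION2VEC_EVAL_ROOT=/bigtemp/gz5hp/region2vec_eval` in the shell would reproduce the paths added by this commit without editing the tracked file.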
File renamed without changes. (29 files)