
Commit 3296ebb ("update")
gtzheng committed Jun 14, 2024
1 parent 5ce4ada
Showing 35 changed files with 48 additions and 81 deletions.
38 changes: 21 additions & 17 deletions README.md
@@ -3,7 +3,11 @@ This repository contains code and instructions to reproduce the results presented

## Requirements
- [geniml](https://github.com/databio/geniml)
- beautifulsoup4
- python=3.9
- pybedtools
- [bedtools](https://bedtools.readthedocs.io/en/latest/content/installation.html)

```
git clone [email protected]:databio/geniml.git
cd geniml
```
@@ -25,7 +29,7 @@ EVAL_RESULTS_FOLDER: folder that stores all the evaluation results
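For reference, the path variables in `config.py` look like the sketch below; `/your_root/` is a placeholder to replace with a directory on your system (the `DATA_URL` value is the ENCODE TFBS source used by the download script):
```python
# config.py -- example values; /your_root/ is a placeholder prefix
DATA_URL = "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/"

TRAIN_SCRIPTS_FOLDER = "/your_root/region2vec_eval/train_scripts/"    # generated training scripts
DATA_FOLDER = "/your_root/region2vec_eval/tfbs_datasets/"             # downloaded TFBS files
MODELS_FOLDER = "/your_root/region2vec_eval/tfbs_region2vec_models/"  # trained embedding models
UNIVERSES_FOLDER = "/your_root/region2vec_eval/hg19_universes/"       # universe files
EVAL_RESULTS_FOLDER = "/your_root/region2vec_eval/eval_results/"      # evaluation outputs
```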
### Download the dataset
Run the following command:
```bash
python download_dataset.py
python -m src.download_dataset
```
Or download all the [content](http://big.databio.org/region2vec_eval/tfbs_dataset/) to `DATA_FOLDER`.
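Either way, a quick sanity check helps before moving on. The sketch below assumes the files sit directly in `DATA_FOLDER` and that all 690 TFBS files (the `-n` value used below) were fetched:
```python
# Sanity check: count the downloaded TFBS files; the paper uses 690 of them.
import glob
import os

from config import DATA_FOLDER

files = glob.glob(os.path.join(DATA_FOLDER, "*"))
print(f"{len(files)} files in {DATA_FOLDER}")
```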
### Prepare universes
@@ -34,17 +38,17 @@ We provide all seven universes used in the paper at [hg19 universes](http:/
We used the following code to generate all the universes except the DHS universe, which is an external universe. You can use the same code to generate universes from your own data; just change `DATA_FOLDER` in `config.py` and the total number of files passed to `-n`.
```bash
# The Merge (100) universe
python gen_universe.py -m merge -n 690 -d 100
python -m src.gen_universe -m merge -n 690 -d 100
# The Merge (1k) universe
python gen_universe.py -m merge -n 690 -d 1000
python -m src.gen_universe -m merge -n 690 -d 1000
# The Merge (10k) universe
python gen_universe.py -m merge -n 690 -d 10000
python -m src.gen_universe -m merge -n 690 -d 10000
# The Tiling (1k) universe
python gen_universe.py -m tile -v hg19 -n 690 -t 1000
python -m src.gen_universe -m tile -v hg19 -n 690 -t 1000
# The Tiling (5k) universe
python gen_universe.py -m tile -v hg19 -n 690 -t 5000
python -m src.gen_universe -m tile -v hg19 -n 690 -t 5000
# The Tiling (25k) universe
python gen_universe.py -m tile -v hg19 -n 690 -t 25000
python -m src.gen_universe -m tile -v hg19 -n 690 -t 25000
```
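Conceptually, each merge universe joins intervals from the pooled region files whenever they lie within `-d` base pairs of each other (see `bedtools_merge` in `src/gen_universe.py`). A minimal pybedtools sketch of that idea, using a hypothetical pre-sorted concatenation of all input files:
```python
# Illustration only: the Merge (100) idea with pybedtools.
# "all_regions.sorted.bed" is a hypothetical sorted concatenation
# of the 690 TFBS files.
import pybedtools

regions = pybedtools.BedTool("all_regions.sorted.bed")
merge100 = regions.merge(d=100)  # join intervals <= 100 bp apart
merge100.saveas("merge100_universe.bed")
```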
### Train embedding models
You can download all the trained models from [models](http://big.databio.org/region2vec_eval/tfbs_models/) into `MODELS_FOLDER` (set in `config.py`). Note that `Large`, `Medium`, and `Small` correspond to `Merge (100)`, `Merge (1k)`, and `Merge (10k)`, respectively, in the paper.
@@ -53,7 +57,7 @@ We used the following steps to get all the models.

1. Generate training scripts via
```bash
python gen_train_scripts.py
python -m src.gen_train_scripts
```
2. Then, go to the `TRAIN_SCRIPTS_FOLDER` folder (specified in `config.py`) and run all the scripts there to get the trained models (a driver sketch follows this list).

@@ -62,22 +66,22 @@ We used the following steps to get all the models.

3. After training Region2Vec models, run the following code to generate base embeddings, namely Binary, PCA-10D, and PCA-100D, for each of the seven universes.
```bash
python get_base_embeddings.py
python -m src.get_base_embeddings
```
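As referenced in step 2, a small driver can execute every generated script. This is a hypothetical convenience sketch; each generated file is a plain list of `python -m src.train_region2vec ...` commands, so `bash` can run it:
```python
# Hypothetical driver for step 2: run each generated training script in turn.
import glob
import os
import subprocess

from config import TRAIN_SCRIPTS_FOLDER

for script in sorted(glob.glob(os.path.join(TRAIN_SCRIPTS_FOLDER, "*"))):
    subprocess.run(["bash", script], check=True)
```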

To obtain the results in Table S2, run the following code:
```bash
python assess_universe.py
python -m src.assess_universe
```
Note that we do not assess the original universes. Since Region2Vec filters out some low-frequency regions in a universe based on the training data, we focus on the actual universes, i.e., the regions that have embeddings.

## Evaluate region embeddings
Run the following scripts to obtain the evaluation results.
```bash
python eval_script.py --type GDSS
python eval_script.py --type NPS
python eval_script.py --type CTS
python eval_script.py --type RCS
python -m src.eval_script --type GDSS
python -m src.eval_script --type NPS
python -m src.eval_script --type CTS
python -m src.eval_script --type RCS
```

To speed up the process, you can split the universes into batches (line 209 in `src/eval_script.py`):
@@ -90,7 +94,7 @@
```python
batches = [
```
Then, run the evaluation on each batch in parallel. For example,
```bash
python eval_script.py --type GDSS --batch 0
python -m src.eval_script --type GDSS --batch 0
```
will evaluate models for the Tiling (1k) and Tiling (25k) universes.
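To launch several batches at once, a launcher along these lines works; the batch indices and score type below are illustrative and should match the `batches` list you defined:
```python
# Hypothetical launcher: evaluate GDSS on batches 0 and 1 in parallel.
import subprocess

procs = [
    subprocess.Popen(
        ["python", "-m", "src.eval_script", "--type", "GDSS", "--batch", str(b)]
    )
    for b in (0, 1)
]
for p in procs:
    p.wait()  # wait for all batches to finish
```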

@@ -99,7 +103,7 @@ We designed cell type and antibody type classification tasks for the trained region embeddings

Run the classification using the following script:
```bash
python classification.py
python -m src.classification
```
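The actual task setup lives in `src/classification.py`; purely as an illustration of the shape of such a task (everything below is synthetic and hypothetical), a classifier over per-file embeddings can be scored like this:
```python
# Synthetic illustration only: cross-validated classification on embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(690, 100))   # placeholder: one 100-d embedding per file
y = rng.integers(0, 5, size=690)  # placeholder: cell-type labels
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```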

## Analyze results
Expand All @@ -110,5 +114,5 @@ The visualizations of different sets of region embeddings can be found at [embed

We used the following command to generate UMAP visualizations of all sets of region embeddings.
```bash
python visualization.py
python -m src.visualization
```
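For orientation, the core of a UMAP projection is only a few lines. This sketch assumes `umap-learn` is installed and uses a hypothetical embedding file name (the full pipeline is `src/visualization.py`):
```python
# Minimal UMAP sketch; "region_embeddings.npy" is a hypothetical input file.
import numpy as np
import umap

embeddings = np.load("region_embeddings.npy")  # shape: (n_regions, embed_dim)
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
print(coords.shape)  # (n_regions, 2)
```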
18 changes: 13 additions & 5 deletions config.py
@@ -1,8 +1,16 @@
DATA_URL = "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/"


TRAIN_SCRIPTS_FOLDER = '/root_path/region2vec_eval/train_scripts/'
DATA_FOLDER = '/root_path/region2vec_eval/tfbs_datasets/'
MODELS_FOLDER = '/root_path/region2vec_eval/tfbs_region2vec_models/'
UNIVERSES_FOLDER = '/root_path/region2vec_eval/hg19_universes/'
EVAL_RESULTS_FOLDER = '/root_path/region2vec_eval/eval_results/'
# TRAIN_SCRIPTS_FOLDER = '/root_path/region2vec_eval/train_scripts/'
# DATA_FOLDER = '/root_path/region2vec_eval/tfbs_datasets/'
# MODELS_FOLDER = '/root_path/region2vec_eval/tfbs_region2vec_models/'
# UNIVERSES_FOLDER = '/root_path/region2vec_eval/hg19_universes/'
# EVAL_RESULTS_FOLDER = '/root_path/region2vec_eval/eval_results/'



TRAIN_SCRIPTS_FOLDER = '/bigtemp/gz5hp/region2vec_eval/train_scripts/'
DATA_FOLDER = '/bigtemp/gz5hp/region2vec_eval/tfbs_datasets/'
MODELS_FOLDER = '/bigtemp/gz5hp/region2vec_eval/tfbs_region2vec_models/'
UNIVERSES_FOLDER = '/bigtemp/gz5hp/region2vec_eval/hg19_universes/'
EVAL_RESULTS_FOLDER = '/bigtemp/gz5hp/region2vec_eval/eval_results/'
File renamed without changes.
File renamed without changes.
54 changes: 9 additions & 45 deletions result_analysis.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -23,7 +23,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
@@ -81,7 +81,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
@@ -104,7 +104,7 @@
"outputs": [],
"source": [
"def get_cts_results(universe):\n",
" paths = glob.glob(f\"{EVAL_RESULTS_FOLDER}/cts_results_uniform/{universe}/cts_eval_seed*\")\n",
" paths = glob.glob(f\"{EVAL_RESULTS_FOLDER}/cts_results/{universe}/cts_eval_seed*\")\n",
" cts_scores = {}\n",
" for path in paths:\n",
" with open(path, \"rb\") as f:\n",
@@ -246,51 +246,15 @@
" fig.savefig(f\"{EVAL_RESULTS_FOLDER}/scores_visualization/rcs_results/rcs_{universe}.svg\", bbox_inches=\"tight\")\n",
" plt.close()\n",
" # plt.show()\n",
"# for u in universes:\n",
"# get_rcs_plot(u, True)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"eval_rcs_pd = []\n",
"eval_rcs_pd_10d = []\n",
"for u in universes:\n",
" rcs_scores, rcs_stds, names = get_rcs_results(u)\n",
" eval_rcs_pd.append(pd.DataFrame({'universe':[u]*len(names), 'RCS':rcs_scores, 'model':names})) \n",
" \n",
" paths = glob.glob(f\"{EVAL_RESULTS_FOLDER}/rcs_10d_results/{u}/rcs_eval_seed*\")\n",
" rcs_scores, rcs_stds, names = get_rcs_results(u, paths)\n",
" eval_rcs_pd_10d.append(pd.DataFrame({'universe':[u]*len(names), 'RCS_10D':rcs_scores, 'model':names})) \n",
"eval_rcs_pd = pd.concat(eval_rcs_pd)\n",
"eval_rcs_pd_10d = pd.concat(eval_rcs_pd_10d)\n",
"eval_rcs_pd = eval_rcs_pd.merge(eval_rcs_pd_10d, how='left', on=['universe', 'model'])"
" get_rcs_plot(u, True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"corrs = []\n",
"for u in universes:\n",
" condition = (eval_rcs_pd['universe']==u)\n",
" rcss = eval_rcs_pd[condition]['RCS'].to_numpy()\n",
" rcss_10d = eval_rcs_pd[condition]['RCS_10D'].to_numpy()\n",
" corr, _ = scipy.stats.pearsonr(rcss, rcss_10d)\n",
" corrs.append((u,corr))\n",
"corrs"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"def get_gdss_results(result_path):\n",
" universe = result_path.split('/')[-2]\n",
@@ -356,14 +320,14 @@
" fig.savefig(file_name, bbox_inches=\"tight\")\n",
" plt.show()\n",
"\n",
"gdss_result_batch = [f\"{EVAL_RESULTS_FOLDER}/gdss_results/{}/gdss_eval_seed*\".format(u) for u in universes]\n",
"gdss_result_batch = [f\"{EVAL_RESULTS_FOLDER}/gdss_results/{u}/gdss_eval_seed*\" for u in universes]\n",
"for path in gdss_result_batch:\n",
" produce_gdss_plots(path, f\"{EVAL_RESULTS_FOLDER}/scores_visualization/gdss_results/\", True)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
@@ -433,7 +397,7 @@
" fig.clf()\n",
" plt.close()\n",
" \n",
"nps_result_batch = [f\"{EVAL_RESULTS_FOLDER}/nps_results/{}/nps_eval_seed*\".format(u) for u in universes]\n",
"nps_result_batch = [f\"{EVAL_RESULTS_FOLDER}/nps_results/{u}/nps_eval_seed*\" for u in universes]\n",
"for path in nps_result_batch:\n",
" produce_nps_plots(path, f\"{EVAL_RESULTS_FOLDER}/scores_visualization/nps_results/\", 9)"
]
@@ -603,7 +567,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
File renamed without changes.
File renamed without changes.
11 changes: 1 addition & 10 deletions download_dataset.py → src/download_dataset.py
@@ -1,23 +1,14 @@
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import urllib
import re
import gensim, logging
from tqdm import tqdm
import hashlib
import os
import gzip, bz2
import gzip
import shutil
import pickle
import time
import matplotlib.pyplot as plt
import logging
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from config import DATA_FOLDER, DATA_URL
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class TBFS_Downloader:
File renamed without changes.
3 changes: 1 addition & 2 deletions gen_train_scripts.py → src/gen_train_scripts.py
@@ -33,10 +33,9 @@ def combine_all(list_of_lists):
file_ptrs[universes[i]] = open(name, 'w')
params = combine_all([universes, init_lrs, embed_dims, win_sizes])
save_folder = MODELS_FOLDER
program_path = glob.glob("train_region2vec.py")[0]
for param in params:
universe, init_lr, embed_dim, win_size = param
script = f'python {program_path} --embed_dim {embed_dim} --init_lr {init_lr} --win_size {win_size} --universe {universe} --num_shufflings 100 --save_folder {save_folder}'
script = f'python -m src.train_region2vec --embed_dim {embed_dim} --init_lr {init_lr} --win_size {win_size} --universe {universe} --num_shufflings 100 --save_folder {save_folder}'
file_ptrs[universe].write(script)
file_ptrs[universe].write('\n')
for u in file_ptrs:
5 changes: 3 additions & 2 deletions gen_universe.py → src/gen_universe.py
@@ -60,7 +60,7 @@ def bedtools_merge(base, df, merge_dist):


def save_segmentation(name, segmentation, numfiles, overlap, folder='../segmentations/'):
fp = "{}_{}overlap_{}files_tfbs_universe.txt".format(name, overlap, numfiles)
fp = "{}_{}overlap_{}files_tfbs_universe.bed".format(name, overlap, numfiles)
path = os.path.join(folder, fp)
print("saving segmentation to ", path)
segmentation.to_csv(path, sep='\t', index=False, header=False)
@@ -117,6 +117,7 @@ def segmentation_tile(tile_size, chrom_sizes, assembly, save_folder):
creates a segmentation from tiling the genome and saves it to the folder
returns: None
'''
os.makedirs(save_folder, exist_ok=True)
save_path = os.path.join(save_folder, f"tiles{tile_size}.{assembly}.bed")
count = 0
with open(save_path, "w") as fout:
@@ -159,7 +160,7 @@ def segmentation_tile(tile_size, chrom_sizes, assembly, save_folder):

elif args.m == "tile":
if args.v == "hg19":
chrom_sizes = "./hg19_chrom.sizes"
chrom_sizes = "data/hg19_chrom.sizes"
segmentation_tile(args.t, chrom_sizes, "hg19", UNIVERSES_FOLDER)
else:
raise ValueError(f"not implemented for args.v={args.v}")
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
