cbiohub is a Python package that provides convenience functions for analyzing
data files from cBioPortal. Although several Python API clients exist, they work
on slices of the cBioPortal data retrieved via the REST API and do not make it
easy to analyze all the data files in bulk. This package aims to provide a more
user-friendly interface for accessing cBioPortal data, such as the files stored
in the public datahub. By using parquet files, a columnar format, rather than
flat CSV/TSV files, the data can be analyzed much more quickly and efficiently.
For example, you can download the cBioPortal datahub files:
git clone git@github.com:cbioportal/datahub ~/git/datahub
Now ingest them, i.e. convert them into parquet files on your local machine:
cbiohub ingest ~/git/datahub/public/
By default, all the data gets stored in ~/cbiohub/. Combine the data from all studies into a single combined study:
cbiohub combine
Now you can use the cbiohub package to analyze the data quickly. For example,
you can load the combined study data into a pandas DataFrame:
import cbiohub
df = cbiohub.get_combined_df()
Or you can use the cbiohub CLI to do quick analyses:
> cbiohub find BRAF V600E
✅ Variant found in 3595 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...
Or search for the same BRAF V600E variant by its genomic coordinates and a specific nucleotide change (A>T):
> cbiohub find 7 140453136 140453136 A T
✅ Variant found in 3571 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...
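A comparable lookup can be sketched in pandas. This is not how the cbiohub CLI is implemented; it is a rough illustration under stated assumptions. The column names are standard MAF columns, `study_id` is a hypothetical column name, and whether `cbiohub.get_combined_df()` returns exactly these columns is an assumption. A small made-up DataFrame stands in for the real combined data.

```python
import pandas as pd


def find_variant(df, gene, protein_change):
    """Return rows matching a gene symbol and a short protein change."""
    mask = (df["Hugo_Symbol"] == gene) & (df["HGVSp_Short"] == f"p.{protein_change}")
    return df[mask]


def find_genomic_change(df, chrom, start, end, ref, alt):
    """Return rows matching exact genomic coordinates and alleles."""
    mask = (
        (df["Chromosome"] == chrom)
        & (df["Start_Position"] == start)
        & (df["End_Position"] == end)
        & (df["Reference_Allele"] == ref)
        & (df["Tumor_Seq_Allele2"] == alt)
    )
    return df[mask]


# Made-up stand-in for the combined data; with cbiohub installed this might
# instead be df = cbiohub.get_combined_df() (columns are an assumption).
df = pd.DataFrame(
    {
        "study_id": ["study_a", "study_a", "study_b"],  # hypothetical column
        "Tumor_Sample_Barcode": ["SAMPLE-1", "SAMPLE-2", "SAMPLE-3"],
        "Hugo_Symbol": ["BRAF", "KRAS", "BRAF"],
        "HGVSp_Short": ["p.V600E", "p.G12D", "p.V600E"],
        "Chromosome": ["7", "12", "7"],
        "Start_Position": [140453136, 25398284, 140453136],
        "End_Position": [140453136, 25398284, 140453136],
        "Reference_Allele": ["A", "C", "A"],
        "Tumor_Seq_Allele2": ["T", "T", "T"],
    }
)

by_protein = find_variant(df, "BRAF", "V600E")
by_position = find_genomic_change(df, "7", 140453136, 140453136, "A", "T")

n_samples = by_protein["Tumor_Sample_Barcode"].nunique()
n_studies = by_protein["study_id"].nunique()
print(f"Variant found in {n_samples} samples across {n_studies} studies")
```

Matching on the protein change can return more rows than matching on coordinates, since several nucleotide changes can produce the same amino acid substitution; this is consistent with the two CLI searches above reporting different sample counts.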
To remove all local parquet files:
cbiohub clean
To set up the development environment, install the development dependencies:
poetry install
You can run the CLI using e.g.:
poetry run cbiohub ingest ~/git/datahub/public/
and
poetry run cbiohub find BRAF V600E
You can also use IPython for interactive exploration:
poetry run ipython
- Add a GitHub Action to datahub that uses cbiohub to push the combined parquet data to Hugging Face (https://huggingface.co/datasets/cBioPortal/datahub)