Build a DuckDB database for publicly available cancer data

t-silvers/public-cancer-db

Latest commit: 003571c by t-silvers, Mar 15, 2024 (21 commits)

Create a Persistent DuckDB Database for Public Cancer Data

Description

Build a DuckDB database for performant querying of publicly available cancer data. The code is functional and intended for use in personal projects. It works as intended, but several parts are quick-and-dirty implementations.

Data Sources

Consult links for details on data collection, processing, and guidelines on usage.

Data Models

Raw data is prepared using "models" saved as .sql files in ./models, which (occasionally) follow best practices outlined here.

Prerequisites

  • aria2 (optional; the fallback is wget) [docs]
  • DuckDB [docs]
  • make
  • wget
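
Before running make, it can help to confirm that the tools listed above are on the PATH. The following is a minimal sketch, not part of the repository: the function names `missing_tools` and `pick_downloader` are hypothetical, and it assumes aria2 is installed under its usual binary name, `aria2c`.

```shell
# Print the names of any tools (given as arguments) that are not on PATH.
missing_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Trim the leading space before printing.
  echo "${missing# }"
}

# Choose a DOWNLOADER value: aria2 if its binary (aria2c) is present, else wget.
pick_downloader() {
  if command -v aria2c >/dev/null 2>&1; then
    echo aria2
  else
    echo wget
  fi
}

echo "missing: $(missing_tools duckdb make wget)"
echo "downloader: $(pick_downloader)"
```

The result of `pick_downloader` could then be passed to make as `DOWNLOADER="$(pick_downloader)"`.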

Usage

make -C /path/to/public_cancer_db

With custom configuration:

make -C /path/to/public_cancer_db DIR="/path/to/large_data_storage" MEMORY_LIMIT=32GB NCORES=16 DOWNLOADER=aria2
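
The Makefile itself is not shown here, so how MEMORY_LIMIT and NCORES are consumed is an assumption. A plausible mapping is onto DuckDB's `memory_limit` and `threads` settings, which could be sketched as follows (the helper `duckdb_settings_sql` is hypothetical):

```shell
# Emit DuckDB SET statements for a memory cap and a thread count.
# Assumption: MEMORY_LIMIT and NCORES correspond to DuckDB's
# memory_limit and threads settings; the repository may wire them differently.
duckdb_settings_sql() {
  memory_limit=$1
  ncores=$2
  printf "SET memory_limit = '%s'; SET threads = %s;\n" "$memory_limit" "$ncores"
}

duckdb_settings_sql 32GB 16
```

Statements like these can be run at the start of a DuckDB session to cap memory use and thread count before loading large files.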

Adding GDC data to an existing database:

make gdc -C /path/to/public_cancer_db DB="/path/to/data.db"

Note on Concurrency

Quoting from the DuckDB docs on concurrency,

DuckDB has two configurable options for concurrency:

  1. One process can both read and write to the database.
  2. Multiple processes can read from the database, but no processes can write (access_mode = 'READ_ONLY').

When using option 1, DuckDB supports multiple writer threads ...
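
Once the database is built, option 2 lets several processes query it at the same time. A minimal sketch, assuming the DuckDB command-line client is installed: it uses the client's `-readonly` and `-c` flags. The wrapper name `query_readonly` and the `DUCKDB_BIN` override are hypothetical, added only to make the binary swappable.

```shell
# Run one SQL statement against a database opened in read-only mode.
# DUCKDB_BIN is a hypothetical override; it defaults to "duckdb" on PATH.
query_readonly() {
  db=$1
  sql=$2
  "${DUCKDB_BIN:-duckdb}" -readonly "$db" -c "$sql"
}

# Example (requires an existing database file):
#   query_readonly /path/to/data.db "SHOW TABLES;"
```

Because every such process opens the file read-only, none of them blocks the others. A build step that writes to the database must not run at the same time.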

To benefit from make parallelism, the database can be built in two steps using phony targets,

echo "Fetching data ..."
make fetch -C /path/to/public_cancer_db -j 8

echo "Building database ..."
make ingest -C /path/to/public_cancer_db -j 1

where NCORES, etc. can be configured separately to best utilize available resources for multi-threaded database writes.
