beyond-all-reason/data-processing

Data processing

Repository for ETL workflow that processes BAR data.

It periodically produces public dumps of match data, combining information from the teiserver and replay databases. See the Gallery section for examples of how the community uses this data.

Data access

Data dumps are available as Parquet under:

and as compressed CSV files under:

More documentation is available at https://beyond-all-reason.github.io/data-processing/.

Usage examples

It's easy to load the data into a Jupyter Notebook or Google Colab, for example to plot the number of matches over time using Polars.
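As a rough, dependency-free sketch of the aggregation behind such a plot, the snippet below counts matches per day from a gzip-compressed CSV. It uses only Python's standard library rather than Polars so that it runs anywhere. The `start_time` column name, the `match_id` column, and the sample rows are assumptions made up for illustration; check the documentation for the real schema of the dumps.

```python
import csv
import gzip
import io
from collections import Counter
from datetime import datetime


def matches_per_day(csv_gz_bytes: bytes, time_col: str = "start_time") -> dict[str, int]:
    """Count rows per calendar day in a gzip-compressed CSV.

    `time_col` is a hypothetical column name holding ISO 8601 timestamps.
    """
    counts = Counter()
    # gzip.open accepts a file object, so the bytes can come from disk or a download.
    with gzip.open(io.BytesIO(csv_gz_bytes), mode="rt", newline="") as f:
        for row in csv.DictReader(f):
            day = datetime.fromisoformat(row[time_col]).date().isoformat()
            counts[day] += 1
    # Sort by day so the result can be plotted directly as a time series.
    return dict(sorted(counts.items()))


# Tiny made-up sample standing in for a real dump.
sample = gzip.compress(
    b"match_id,start_time\n"
    b"1,2024-01-01T10:00:00\n"
    b"2,2024-01-01T18:30:00\n"
    b"3,2024-01-02T09:15:00\n"
)
print(matches_per_day(sample))
```

The returned mapping of day to match count can be handed to any plotting library. In a notebook with Polars installed, reading the Parquet dump and grouping by day would likely be shorter and faster than this loop.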

Because the datasets are available at a URL, you can even use one of the web UIs built on DuckDB-Wasm to run queries entirely in the browser, for example to compute the number of games per type per month.
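The shape of a "games per type per month" query can be sketched as follows. To keep the example runnable without extra dependencies, it uses Python's built-in `sqlite3` module as a stand-in for DuckDB, so the date function differs: in DuckDB the month bucket would more likely be written with `date_trunc('month', ...)`. The `matches` table, its `start_time` and `game_type` columns, and the sample values are hypothetical and do not reflect the real schema.

```python
import sqlite3

# Group matches by calendar month and game type, then count rows in each group.
QUERY = """
SELECT strftime('%Y-%m', start_time) AS month,
       game_type,
       COUNT(*) AS games
FROM matches
GROUP BY month, game_type
ORDER BY month, game_type
"""


def games_per_type_per_month(rows):
    """Load (match_id, start_time, game_type) rows into an in-memory table and run QUERY."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE matches (match_id INTEGER, start_time TEXT, game_type TEXT)"
    )
    con.executemany("INSERT INTO matches VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


# Made-up rows; the game type names are placeholders.
sample_rows = [
    (1, "2024-01-05 10:00:00", "Duel"),
    (2, "2024-01-20 12:00:00", "Team"),
    (3, "2024-01-21 13:00:00", "Team"),
    (4, "2024-02-01 09:00:00", "Duel"),
]
print(games_per_type_per_month(sample_rows))
```

Against the real dumps, the same `GROUP BY` structure would apply, with the table replaced by the Parquet or CSV source and the column names taken from the published documentation.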

Gallery

Below we want to link some cool examples of how people in the community use the data dumps. If you've created something, please share it with us on Discord or in this repository's issues!

Development

This project uses dbt to manage the SQL pipeline that transforms the data, with DuckDB as the query engine.

Initial

Setup:

python3 -m venv .venv
source .venv/bin/activate  # but I also recommend https://direnv.net/ that will load .envrc automatically
pip install -r requirements.txt

It's also recommended to install the pre-commit hooks, which check the style of the SQL code before each commit:

pre-commit install

Usage

data_source/dev contains a small sample of the full data sources used to generate the full dumps in prod. Basic development and testing should be possible purely on this sample.

To build the data marts from this sample data:

dbt run

To run tests on the generated data (e.g. validate that fields are not null, or custom queries return expected results):

dbt test

About

SQL pipelines for processing BAR data
