DataBiosphere/data-explorer

Data explorer

Overview

Data Explorer lets you explore a dataset. The code (in this repo and the data-explorer-indexers repo) is dataset-agnostic. All dataset configuration happens in config files.

Examples:

Quickstart

Run local Data Explorer with the 1000 Genomes dataset:

  • If ~/.config/gcloud/application_default_credentials.json doesn't exist, create it by running gcloud auth application-default login.
  • docker-compose up --build
  • Navigate to localhost:4400
  • To use the Save in Terra feature, complete the one-time setup for the Save in Terra feature described later in this document.
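
The credentials check in the first step can be scripted. This is a minimal sketch, assuming a POSIX shell; the helper names adc_path and needs_adc_login are hypothetical and not part of this repo:

```shell
# Sketch: decide whether `gcloud auth application-default login` is needed.
# adc_path and needs_adc_login are hypothetical helper names.
adc_path() {
  printf '%s\n' "${HOME}/.config/gcloud/application_default_credentials.json"
}

needs_adc_login() {
  # Succeeds (exit 0) when the credentials file is missing.
  [ ! -f "$(adc_path)" ]
}

if needs_adc_login; then
  echo "Run: gcloud auth application-default login"
else
  echo "Credentials found; run: docker-compose up --build"
fi
```

The sketch only prints the next command to run, so it is safe to try before starting any containers.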

Run local Data Explorer with a custom dataset

  • Index your dataset into Elasticsearch.
    Before you can run the servers in this repo to display a Data Explorer UI, your dataset must be indexed into Elasticsearch. Use an indexer from https://github.com/DataBiosphere/data-explorer-indexers.

  • Create dataset_config/<my dataset>

  • To use the Save in Terra feature, complete the one-time setup for the Save in Terra feature described later in this document.

  • If ~/.config/gcloud/application_default_credentials.json doesn't exist, create it by running gcloud auth application-default login.

  • DATASET_CONFIG_DIR=dataset_config/<my dataset> docker-compose up --build -t 0

    • The -t 0 makes Kibana stop more quickly after Ctrl-C
    • If you get an error like ui_1 | Module not found: Can't resolve 'superagent' in '/ui/src/api/src', add -V: DATASET_CONFIG_DIR=dataset_config/<my dataset> docker-compose up --build -t 0 -V. -V is only needed for the next invocation of docker-compose, not all future invocations.
    • If ES crashes due to OOM, you can increase heap size:
      ES_JAVA_OPTS="-Xms10g -Xmx10g" docker-compose up --build -t 0
      
  • Navigate to localhost:4400
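
The launch variants above differ only in their environment variables. The following sketch prints the command rather than running it. It assumes DATASET_CONFIG_DIR and ES_JAVA_OPTS can be set together in one invocation, which the steps above do not show explicitly; compose_cmd and my_dataset are hypothetical names:

```shell
# Sketch: print the launch command for a custom dataset without running it.
# compose_cmd is a hypothetical helper; the heap size argument is optional.
compose_cmd() {
  dataset="$1"   # directory name under dataset_config/
  heap="${2:-}"  # e.g. 10g; empty means keep the default heap
  cmd="DATASET_CONFIG_DIR=dataset_config/${dataset}"
  if [ -n "$heap" ]; then
    # Assumption: both variables may be combined in a single invocation.
    cmd="${cmd} ES_JAVA_OPTS=\"-Xms${heap} -Xmx${heap}\""
  fi
  printf '%s docker-compose up --build -t 0\n' "$cmd"
}

compose_cmd my_dataset 10g
```

Append -V to the printed command for one invocation if the 'superagent' module error appears.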

Architecture overview

The basic flow:

  1. Index dataset into Elasticsearch using an indexer from https://github.com/DataBiosphere/data-explorer-indexers
  2. Run the servers in this repo to display Data Explorer UI

GCP deployment:

[Figure: GCP deployment architecture]

For local development, an nginx reverse proxy is used to get around CORS:

[Figure: Local deployment architecture]

Want to try out Data Explorer for your dataset?

Here's one possible flow.

Sample file support

If your dataset includes sample files (VCF, BAM, etc), then Data Explorer will have:

  • A Samples Overview facet, which gives an overview of your sample files.

  • Sample file facets that display the number of sample files instead of the number of participants. For example, if your dataset has 100 participants and each participant has 5 files, the number in the upper right of a "Raw coverage" facet can range from 0 to 500 and represents how many sample files are in the current selection.

Time series support

If your dataset has longitudinal data, then Data Explorer will show time-series visualizations.

Development

Updating the API using swagger-codegen

We use swagger-codegen to generate code for the API defined in api/api.yaml, for both the API server and the UI. Whenever the API is updated, follow these steps to regenerate the code and update the servers:

  • Clear out existing generated models:
    rm ui/src/api/src/model/*
    rm api/data_explorer/models/*
    
  • Regenerate JavaScript and Python definitions.
    • From the .jar (Linux):
      java -jar ~/swagger-codegen-cli.jar generate -i api/api.yaml -l python-flask -o api -DsupportPython2=true,packageName=data_explorer
      java -jar ~/swagger-codegen-cli.jar generate -i api/api.yaml -l javascript -o ui/src/api -DuseES6=true
      yapf -ir . --exclude ui/node_modules --exclude api/.tox
      
    • From the global script (macOS or other):
      swagger-codegen generate -i api/api.yaml -l python-flask -o api -DsupportPython2=true,packageName=data_explorer
      swagger-codegen generate -i api/api.yaml -l javascript -o ui/src/api -DuseES6=true
      yapf -ir . --exclude ui/node_modules
      
  • Update API and UI servers.
  • Don't forget to fix JS warnings. (Otherwise CircleCI will fail.)
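
The choice between the .jar and the global script can be made automatically. This is a rough sketch, assuming the .jar lives at ~/swagger-codegen-cli.jar as in the one-time setup; codegen_launcher is a hypothetical helper name, and the script prints the generate commands instead of running them:

```shell
# Sketch: pick the swagger-codegen launcher described above.
# codegen_launcher is a hypothetical helper name.
codegen_launcher() {
  jar="${HOME}/swagger-codegen-cli.jar"
  if [ -f "$jar" ]; then
    printf 'java -jar %s\n' "$jar"   # Linux: downloaded .jar
  else
    printf 'swagger-codegen\n'       # macOS or other: global script
  fi
}

launcher="$(codegen_launcher)"
# Print the two generate commands from the steps above instead of running them.
echo "$launcher generate -i api/api.yaml -l python-flask -o api -DsupportPython2=true,packageName=data_explorer"
echo "$launcher generate -i api/api.yaml -l javascript -o ui/src/api -DuseES6=true"
```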

One-time setup

  • docker-compose should be at least 1.21.0. The data-explorer-indexers repo refers to the network created by docker-compose in this repo. Prior to 1.21.0, the network name was dataexplorer_default. Starting with 1.21.0, the network name is data-explorer_default.

  • Install swagger-codegen-cli.jar. This is only needed if you modify api.yaml

    # Linux
    wget https://repo1.maven.org/maven2/io/swagger/swagger-codegen-cli/2.3.1/swagger-codegen-cli-2.3.1.jar -O ~/swagger-codegen-cli.jar
    # macOS
    brew install swagger-codegen
    
  • In ui/ run npm install. This will install tools used during git pre-commit, such as formatting tools.

  • Set up git secrets.
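
The docker-compose version requirement above determines which network name the indexers will see. Below is a minimal sketch of that comparison, assuming `sort -V` (version sort) is available on your system; version_at_least is a hypothetical helper and the installed value is a sample:

```shell
# Sketch: succeed when version $1 is at least version $2.
# version_at_least is a hypothetical helper; it assumes `sort -V` is available.
version_at_least() {
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
  [ "$lowest" = "$2" ]
}

installed="1.25.4"  # sample value; substitute your docker-compose version
if version_at_least "$installed" "1.21.0"; then
  echo "network name: data-explorer_default"
else
  echo "network name: dataexplorer_default"
fi
```

In practice the installed value could come from `docker-compose version --short`.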

One-time setup for Save in Terra feature

The Save in Terra feature temporarily stores data in a GCS bucket.

  • If you haven't already, fill out deploy.json for your dataset.
    • Even if you don't plan on deploying Data Explorer to GCP, deploy.json will still need to be filled out. A temporary file will be written to a GCS bucket in the project in deploy.json, even for local deployment of Data Explorer. Choose a project where you have at least Project Editor permissions.
  • Create export bucket. This only needs to be done once per deploy project. Run deploy/create-export-url-bucket.sh DATASET from the root of the repo, where DATASET is the name of the directory in dataset_config.
  • The Save in Terra feature requires a service account private key. Follow these instructions to download a key. This needs to be done once per person per deploy project. If three people run Data Explorer with the same deploy project, then all three need to download a key for the deploy project.
    • Go to the Service Accounts page for your deploy project.
    • Click on the three-dot Actions menu for the App Engine default service account -> Create Key -> CREATE.
    • Move the downloaded file to dataset_config/DATASET/private-key.json
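
Before trying Save in Terra, it can help to confirm the files from the steps above are in place. This is a sketch, run from the root of the repo. It assumes deploy.json sits in dataset_config/DATASET next to private-key.json, which the steps above do not state; check_save_in_terra and my_dataset are hypothetical names:

```shell
# Sketch: report which Save in Terra files exist for a dataset.
# check_save_in_terra is a hypothetical helper; run it from the repo root.
check_save_in_terra() {
  dir="dataset_config/$1"
  missing=0
  # Assumption: deploy.json lives alongside private-key.json.
  for f in deploy.json private-key.json; do
    if [ -f "${dir}/${f}" ]; then
      echo "found ${dir}/${f}"
    else
      echo "missing ${dir}/${f}"
      missing=1
    fi
  done
  return "$missing"
}

check_save_in_terra my_dataset || echo "Save in Terra setup is incomplete"
```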

Testing

Every commit on a remote branch kicks off all tests on CircleCI.

API server unit tests use pytest and tox. To run locally:

virtualenv ~/virtualenv/tox
source ~/virtualenv/tox/bin/activate
pip install tox
cd api && tox -e py35

End-to-end tests use Puppeteer and jest-puppeteer. To run locally:

# Optional: ensure the elasticsearch index is clean
docker-compose up --build -d elasticsearch
curl -XDELETE localhost:9200/_all
# Start the rest of the services
docker-compose up --build
cd ui && npm test
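
Elasticsearch may not accept requests immediately after its container starts, so the curl call in the optional cleanup step can fail if run too early. One possible workaround is a small retry loop; retry is a hypothetical helper, not part of this repo:

```shell
# Sketch: retry a command until it succeeds or attempts run out.
# retry is a hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND...
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example with a command that always succeeds:
retry 3 0 true && echo "ready"
```

For the cleanup step, something like `retry 30 2 curl -sf localhost:9200` before the `curl -XDELETE` call would wait up to about a minute for Elasticsearch to respond; the attempt count and delay are arbitrary choices.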

Troubleshooting tips for end-to-end tests:

Formatting

ui/ is formatted with Prettier. husky is used to automatically format files upon commit. To fix formatting, in ui/ run npm run fix.

Python files are formatted with YAPF.