This repo hosts the code for the egress pipeline used by the Digital Health Data Repository (Sage Bionetworks) to enrich and summarize data for i2b2.

To run the pipeline, you will need:
- R >= 4.0.0
- Docker
- Synapse account with relevant access permissions
- Synapse authentication token
A Synapse authentication token (personal access token) is required to use the Synapse APIs (e.g. the synapser
package for R). For help with Synapse, the Synapse APIs, authentication tokens, etc., please refer to the Synapse documentation.
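As an example, once you have a token you can authenticate from R with synapser. This is a minimal sketch, assuming the token has been exported as the SYNAPSE_AUTH_TOKEN environment variable (set up in the next section); passing it explicitly via the authToken argument is one supported way to log in.

```r
# Minimal sketch: authenticate to Synapse from R with synapser using a
# personal access token read from the environment (see setup below).
library(synapser)

synLogin(authToken = Sys.getenv("SYNAPSE_AUTH_TOKEN"))

# Sanity check that the session is authenticated.
print(synGetUserProfile())
```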
There are two methods to run this pipeline: with Docker or manually.
Regardless of which method you use, your Synapse personal access token must be available in your environment as the SYNAPSE_AUTH_TOKEN variable. See the examples below.
- Option 1: For only the current shell session:
export SYNAPSE_AUTH_TOKEN=<your-token>
- Option 2: For all future shell sessions (modify your shell profile)
# Open the profile file
nano ~/.bash_profile
# Append the following
SYNAPSE_AUTH_TOKEN=<your-token>
export SYNAPSE_AUTH_TOKEN
# Save the file
source ~/.bash_profile
For the Docker method, a pre-built Docker image is published to the GitHub Container Registry at ghcr.io/sage-bionetworks/recover-pipeline-i2b2.
The main advantage of the Docker method is that the image published from this repo contains instructions to:
- Create a computing environment with the system dependencies needed to run the pipeline
- Install the packages needed in order to run the pipeline
- Run a script containing the instructions for the pipeline
If you do not want to use the pre-built Docker image, skip to the next section (Build the Docker image yourself).
- Pull the docker image
docker pull ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
- Run the docker container
docker run \
--name container-name \
-e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
For an explanation of the various config parameters used in the pipeline, please see Config Parameters.
- Clone this repo
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
- Build the docker image
# Option 1: From the directory containing the Dockerfile
cd /path/to/Dockerfile/
docker build <optional-arguments> -t image-name .
# Option 2: From anywhere (point -f at the Dockerfile and use the repo as the build context)
docker build <optional-arguments> -t image-name -f /path/to/repo/Dockerfile /path/to/repo
- Run the docker container
docker run \
--name container-name \
-e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
image-name
For an explanation of the various config parameters used in the pipeline, please see Config Parameters.
If you would like to run the pipeline manually, please follow the instructions in this section.
- Clone this repo
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
- Modify the parameters in the config as needed
- Run run-pipeline.R (see the sketch below)
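As a rough sketch of that last step, you can run the script from an R session started in the repository root (equivalently, invoke it with Rscript from your shell). This assumes SYNAPSE_AUTH_TOKEN is already exported and the config has been edited; the path below is a placeholder for your local clone.

```r
# Minimal sketch: run the pipeline manually from an R session.
# Assumes SYNAPSE_AUTH_TOKEN is exported and the config has been updated.
setwd("/path/to/recover-pipeline-i2b2")  # placeholder path to your clone
source("run-pipeline.R")
```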
This table contains all of the parameters needed to run the pipeline, along with their definitions and examples.
| Parameter | Definition | Example |
|---|---|---|
| `ontologyFileID` | A Synapse ID for the i2b2 concepts map ontology file stored in Synapse. | `syn12345678` |
| `parquetDirID` | A Synapse ID for a folder entity in Synapse where the input data is stored. This should be the folder housing the post-ETL parquet data. | `syn12345678` |
| `concept_replacements` | A named vector of strings and their replacements. The names must be valid values of the `concept_filter_col` column of the `concept_map` data frame. For RECOVER, `concept_map` is the ontology file data frame. | `c('mins' = 'minutes', 'avghr' = 'averageheartrate', 'spo2' = 'spo2_', 'hrv' = 'hrv_dailyrmssd', 'restinghr' = 'restingheartrate', 'sleepbrth' = 'sleepsummarybreath')` |
| `synFolderID` | A Synapse ID for a folder entity in Synapse where you want to store the final output files. | `syn12345678` |
| `s3bucket` | The name of the S3 bucket containing the input data. | `recover-bucket` |
| `s3basekey` | The base key of the S3 bucket containing the input data. | `main/archive/2024-.../` |
| `selectedVarsFileID` | A Synapse ID for the CSV file listing which datasets and variables have been selected for use in this pipeline. | `syn12345678` |
| `outputConceptsDir` | The location to save intermediate and final i2b2 summary files to. | `./output-concepts` |
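To make the parameter names and shapes concrete, here is an illustrative R sketch that collects the example values from the table above into a single list. This is not the repo's actual config file, whose format and location may differ; treat it only as a reference for what each parameter looks like.

```r
# Illustrative only: parameter names and example values are taken from the
# table above; the actual config format used by run-pipeline.R may differ.
params <- list(
  ontologyFileID       = "syn12345678",
  parquetDirID         = "syn12345678",
  concept_replacements = c(
    "mins"      = "minutes",
    "avghr"     = "averageheartrate",
    "spo2"      = "spo2_",
    "hrv"       = "hrv_dailyrmssd",
    "restinghr" = "restingheartrate",
    "sleepbrth" = "sleepsummarybreath"
  ),
  synFolderID          = "syn12345678",
  s3bucket             = "recover-bucket",
  s3basekey            = "main/archive/2024-.../",  # placeholder, as in the table
  selectedVarsFileID   = "syn12345678",
  outputConceptsDir    = "./output-concepts"
)
```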