This repo hosts the code for the egress pipeline used by the Digital Health Data Repository (Sage Bionetworks) to enrich and summarize data for i2b2.

To run the pipeline, you will need:
- R >= 4.0.0
- Docker
- Synapse account with relevant access permissions
- Synapse authentication token
A Synapse authentication token (personal access token) is required to use the Synapse APIs (e.g. the synapser
package for R). For help with Synapse, the Synapse APIs, authentication tokens, etc., please refer to the Synapse documentation.
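As an example, once you have a token you can authenticate from R with synapser. This is a minimal sketch, assuming the token has been exported as the SYNAPSE_AUTH_TOKEN environment variable (set up in the next section); passing it explicitly via the authToken argument is one supported way to log in.

```r
# Minimal sketch: authenticate to Synapse from R with synapser using a
# personal access token read from the environment (see setup below).
library(synapser)

synLogin(authToken = Sys.getenv("SYNAPSE_AUTH_TOKEN"))

# Sanity check that the session is authenticated.
print(synGetUserProfile())
```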
There are two methods to run this pipeline: with Docker or manually.
Regardless of which method you use, your Synapse personal access token must be available in your environment as the SYNAPSE_AUTH_TOKEN variable. See the examples below.
- Option 1: For only the current shell session:
export SYNAPSE_AUTH_TOKEN=<your-token>
- Option 2: For all future shell sessions (modify your shell profile)
# Open the profile file
nano ~/.bash_profile
# Append the following
SYNAPSE_AUTH_TOKEN=<your-token>
export SYNAPSE_AUTH_TOKEN
# Save the file
source ~/.bash_profile
For the Docker method, a pre-built Docker image is published to the GitHub Container Registry at ghcr.io/sage-bionetworks/recover-pipeline-i2b2.
The main advantage of the Docker method is that the image published from this repo contains instructions to:
- Create a computing environment with the system dependencies needed to run the pipeline
- Install the packages needed in order to run the pipeline
- Run a script containing the instructions for the pipeline
If you do not want to use the pre-built Docker image, skip to the next section (Build the Docker image yourself).
- Pull the docker image
docker pull ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
- Run the docker container
docker run \
--name container-name \
-e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
For an explanation of the various config parameters used in the pipeline, please see Config Parameters.
- Clone this repo
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
- Build the docker image
# Option 1: From the directory containing the Dockerfile
cd /path/to/Dockerfile/
docker build <optional-arguments> -t image-name .
# Option 2: From anywhere (point -f at the Dockerfile and use the repo as the build context)
docker build <optional-arguments> -t image-name -f /path/to/repo/Dockerfile /path/to/repo
- Run the docker container
docker run \
--name container-name \
-e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
image-name
For an explanation of the various config parameters used in the pipeline, please see Config Parameters.
If you would like to run the pipeline manually, please follow the instructions in this section.
- Clone this repo
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
- Modify the parameters in the config as needed
- Run run-pipeline.R (see the sketch below)
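As a rough sketch of that last step, you can run the script from an R session started in the repository root (equivalently, invoke it with Rscript from your shell). This assumes SYNAPSE_AUTH_TOKEN is already exported and the config has been edited; the path below is a placeholder for your local clone.

```r
# Minimal sketch: run the pipeline manually from an R session.
# Assumes SYNAPSE_AUTH_TOKEN is exported and the config has been updated.
setwd("/path/to/recover-pipeline-i2b2")  # placeholder path to your clone
source("run-pipeline.R")
```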
This table contains all of the parameters needed to run the pipeline, along with their definitions and examples.
| Parameter | Definition | Example |
|---|---|---|
| `ontologyFileID` | A Synapse ID for the i2b2 concepts map ontology file stored in Synapse. | `syn12345678` |
| `parquetDirID` | A Synapse ID for a folder entity in Synapse where the input data is stored. This should be the folder housing the post-ETL parquet data. | `syn12345678` |
| `concept_replacements` | A named vector of strings and their replacements. The names must be valid values of the `concept_filter_col` column of the `concept_map` data frame. For RECOVER, `concept_map` is the ontology file data frame. | `c('mins' = 'minutes', 'avghr' = 'averageheartrate', 'spo2' = 'spo2_', 'hrv' = 'hrv_dailyrmssd', 'restinghr' = 'restingheartrate', 'sleepbrth' = 'sleepsummarybreath')` |
| `synFolderID` | A Synapse ID for a folder entity in Synapse where you want to store the final output files. | `syn12345678` |
| `s3bucket` | The name of the S3 bucket containing the input data. | `recover-bucket` |
| `s3basekey` | The base key of the S3 bucket containing the input data. | `main/archive/2024-.../` |
| `selectedVarsFileID` | A Synapse ID for the CSV file listing which datasets and variables have been selected for use in this pipeline. | `syn12345678` |
| `outputConceptsDir` | The location to save intermediate and final i2b2 summary files to. | `./output-concepts` |
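To make the parameter names and shapes concrete, here is an illustrative R sketch that collects the example values from the table above into a single list. This is not the repo's actual config file, whose format and location may differ; treat it only as a reference for what each parameter looks like.

```r
# Illustrative only: parameter names and example values are taken from the
# table above; the actual config format used by run-pipeline.R may differ.
params <- list(
  ontologyFileID       = "syn12345678",
  parquetDirID         = "syn12345678",
  concept_replacements = c(
    "mins"      = "minutes",
    "avghr"     = "averageheartrate",
    "spo2"      = "spo2_",
    "hrv"       = "hrv_dailyrmssd",
    "restinghr" = "restingheartrate",
    "sleepbrth" = "sleepsummarybreath"
  ),
  synFolderID          = "syn12345678",
  s3bucket             = "recover-bucket",
  s3basekey            = "main/archive/2024-.../",  # placeholder, as in the table
  selectedVarsFileID   = "syn12345678",
  outputConceptsDir    = "./output-concepts"
)
```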