This project is a Python-based ETL (Extract, Transform, Load) tool designed to fetch data from a REDCap server and store it in a SQLite database. The ETL process involves the following steps:
- Extract: Data is extracted from the REDCap server via the REDCap API or, if it has already been extracted, read from the previously extracted files (a minimal sketch of the API call is shown below).
- Transform: The extracted data is transformed into a format suitable for insertion into the SQLite database and stored as SQL files.
- Load: The transformed data (one SQL file per patient) is loaded into the SQLite database.
This tool is particularly useful for users needing to transfer large amounts of data from REDCap to SQLite in a reliable and efficient manner. You can customize the data model and mapping tables to suit your specific requirements. The tool can be easily tested without having access to a REDCap project token by using the provided example data.
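For illustration, the Extract step boils down to a call against the REDCap record-export API. The following is a minimal sketch assuming the standard "Export Records" endpoint; the URL, token placeholder, and output path are illustrative only, and the project's own extraction code may differ.

```python
import requests  # HTTP client; assumed to be available in this project's environment

# Minimal sketch of the Extract step using the standard REDCap
# "Export Records" API call. All names below are placeholders.
REDCAP_URL = "https://redcap.example.org/api/"
payload = {
    "token": "YOUR_REDCAP_API_TOKEN",
    "content": "record",
    "format": "csv",
    "type": "flat",
}

response = requests.post(REDCAP_URL, data=payload, timeout=60)
response.raise_for_status()

# Store the raw export so that later runs can read it from disk
# instead of calling the API again.
with open("records_export.csv", "w", encoding="utf-8") as f:
    f.write(response.text)
```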
## Getting Started with Docker

- Ensure Docker and Docker Compose are installed on your system. You can verify this by running `docker --version` and `docker-compose --version` in your terminal.
- Clone this repository to your local machine.
- Navigate to the root directory of the cloned repository.
- Define the Data Model and the Mapping to Target Data Model as described in the "Data Model Definition" and "Mapping to Target Data Model" sections below.
- Create a `config_docker.json` file in the root directory. You can use the `config_docker_example.json` file as a template. Define the parameters as described in the "Config File Setup" section below.
- Run the docker-compose file by running `docker-compose up`.
- The ETL process will start automatically and repeat every 6 hours.
- To stop the ETL process, run `docker-compose down`.
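How the 6-hour repetition is implemented inside the container is not detailed here; the sketch below shows one straightforward way such a loop could look. It is an illustration only, not necessarily how the provided image works.

```python
import subprocess
import time

# Sketch of a "run every 6 hours" loop. The actual container may use a
# different scheduling mechanism; this is only an illustration.
SIX_HOURS = 6 * 60 * 60

while True:
    subprocess.run(["python", "workflow.py"], check=False)  # one full ETL run
    time.sleep(SIX_HOURS)
```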
## Getting Started without Docker

- Ensure Python 3.x is installed on your system. You can verify this by running `python --version` in your terminal.
- Clone this repository to your local machine.
- Navigate to the root directory of the cloned repository.
- It is recommended to create a virtual environment to isolate the project dependencies. You can do this by running `python -m venv venv`.
- Activate the virtual environment by running `source venv/bin/activate` (Linux/Mac) or `venv\Scripts\activate` (Windows).
- Install the required Python packages by running `pip install -r requirements.txt`.
- Define the Data Model and the Mapping to Target Data Model as described in the "Data Model Definition" and "Mapping to Target Data Model" sections below.
- Create a `config.json` file in the root directory. You can use the `config_example.json` file as a template. Define the parameters as described in the "Config File Setup" section below.
- Run the ETL process by running `python workflow.py`.
## Data Model Definition

- Define your data model in a separate SQL file. This file should contain the tables and fields that will be present in your SQLite database.
- Set the path to this file in the `config.json` file under the `db_schema` parameter.
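For illustration, the file referenced by `db_schema` is plain SQL that can be executed against a fresh SQLite database. Below is a minimal sketch with placeholder file names; `workflow.py` may handle database creation differently.

```python
import sqlite3

# Apply the data-model file to a new SQLite database.
# "data_model.sql" and "project.db" are placeholder names.
with open("data_model.sql", encoding="utf-8") as f:
    schema_sql = f.read()

conn = sqlite3.connect("project.db")
conn.executescript(schema_sql)  # runs every CREATE TABLE statement in the schema
conn.commit()
conn.close()
```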
## Mapping to Target Data Model

- Write mapping files in your mapping folder (defined in the `config.json` file) to map the extracted data to the target data model. These CSV files are used to transform the extracted REDCap data into the format of your SQLite data model (a minimal sketch of this transformation is shown below).
- The mapping files should have the following columns:
  - `Table`: The name of the table in the target data model.
  - `Attribute`: The name of the attribute in the target data model.
  - `NotNull`: NOT NULL if the attribute is required, empty otherwise.
  - `field_name`: The name of the field in the REDCap data.
- For further information, see the `Mapping.md` file and the example mapping files in the `mappingtables` folder of the ClassicDB example.
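For illustration, this is roughly how a mapping table with these four columns could be applied to a single extracted REDCap record. The file name and record contents are made up for the example, and the actual transformation in this project may differ.

```python
import csv

# Apply a mapping table (columns: Table, Attribute, NotNull, field_name)
# to one extracted REDCap record. All names below are illustrative.
redcap_record = {"record_id": "001", "pat_birthdate": "1980-01-01"}

rows_by_table = {}
with open("mappingtables/patient.csv", newline="", encoding="utf-8") as f:
    for mapping in csv.DictReader(f):
        value = redcap_record.get(mapping["field_name"])
        if value is None and mapping["NotNull"].strip() == "NOT NULL":
            raise ValueError(f"Required field {mapping['field_name']} is missing")
        rows_by_table.setdefault(mapping["Table"], {})[mapping["Attribute"]] = value

# rows_by_table now holds, per target table, the attribute/value pairs
# that can be turned into INSERT statements for the per-patient SQL file.
print(rows_by_table)
```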
## Config File Setup

- Create a `config.json` file in the root directory. This file stores the configuration parameters for the ETL process. You can use the `config_example.json` file as a template.
- Define the following parameters in the `config.json` file (a quick check for them is sketched after this list):
  - `repository_root`: The path to the root directory of the cloned repository.
  - `extract_redcap`: True if data should be extracted from REDCap, False otherwise.
  - `redcap_api_address`: The URL of the REDCap API.
  - `redcap_project`: The name of the REDCap project.
  - `redcap_api_token`: The API token for accessing the REDCap project.
  - `extraction_path`: The path where the extracted data should be stored.
  - `data_path`: The path where the data files are stored.
  - `mapping_path`: The path to the mapping tables.
  - `db_creation`: True if the database should be created, False otherwise.
  - `db_wipe`: True if the database should be wiped before loading data, False otherwise.
  - `db_path`: The path where the SQLite database should be stored.
  - `db_schema`: The path to the data model (SQL schema) file.
  - `db_load_data`: True if data should be loaded into the database, False otherwise.
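As a quick sanity check after editing the file, the parameters above can be verified with a few lines of Python. This is only a sketch and not part of the project; `workflow.py` may validate its configuration differently.

```python
import json

# Check that config.json contains the parameters listed above.
EXPECTED_KEYS = {
    "repository_root", "extract_redcap", "redcap_api_address",
    "redcap_project", "redcap_api_token", "extraction_path",
    "data_path", "mapping_path", "db_creation", "db_wipe",
    "db_path", "db_schema", "db_load_data",
}

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

missing = EXPECTED_KEYS - config.keys()
if missing:
    raise KeyError(f"config.json is missing parameters: {sorted(missing)}")
```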
## Example Data

- The `ClassicDB_example` folder contains example data, a data model (see figure), and corresponding mapping tables (one possible mapping of patient data) that can be used to test the ETL process without access to a REDCap project.
- The example data is stored in a CSV file and can be used to simulate the extraction process.
- The example data can also be used to test the transformation and loading processes; a sketch of a configuration for such a local test run is shown below.
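As an illustration, a configuration pointing at the example folder might look roughly like the following. Every path, file name, and value type here is an assumption; check `config_example.json` and the actual contents of `ClassicDB_example` for the real ones.

```python
import json

# Illustrative configuration for a local test run against the ClassicDB
# example, with extraction from REDCap disabled. All paths below are
# placeholders and should be checked against the repository layout.
example_config = {
    "repository_root": ".",
    "extract_redcap": False,  # use the local example files instead of the API
    "redcap_api_address": "",
    "redcap_project": "ClassicDB_example",
    "redcap_api_token": "",
    "extraction_path": "ClassicDB_example/",
    "data_path": "ClassicDB_example/",
    "mapping_path": "ClassicDB_example/mappingtables/",
    "db_creation": True,
    "db_wipe": True,
    "db_path": "classicdb_test.db",
    "db_schema": "ClassicDB_example/data_model.sql",
    "db_load_data": True,
}

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(example_config, f, indent=2)
```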
## Quick Start with Docker

- Copy and rename the `config_docker_example.json` file to `config_docker.json`.

  ```bash
  cp config_docker_example.json config_docker.json
  ```

- Run the docker-compose file.

  ```bash
  docker-compose up -d
  ```

- The ETL process will start automatically and repeat every 6 hours.

- (optional) To stop the ETL process, run the following command.

  ```bash
  docker-compose down
  ```
## Quick Start without Docker

- Rename the `config_example.json` file to `config.json`.

  ```bash
  mv config_example.json config.json
  ```

- Run the ETL process.

  ```bash
  python workflow.py
  ```
## License

This project is licensed under the MIT License. This means you are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, on the condition that you include the following:
- A copy of the original MIT License in any redistributed copies or substantial portions of the software.
- A clear acknowledgement of the original source of the software.
For more details, please see the LICENSE file in the project root.
Contributions, issues, and feature requests are welcome!
Enjoy transferring data from REDCap to SQLite efficiently and reliably!