chore: Clarify README #49

Merged 2 commits on Oct 26, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
.env
node_modules
.idea
54 changes: 27 additions & 27 deletions README.md
@@ -6,29 +6,47 @@ CHT Sync is a bundled solution consisting of [Logstash](https://www.elastic.co/l

**Note**: In order for `cht-sync` to run, it needs a link to [cht-pipeline](https://github.com/medic/cht-pipeline), which contains transformation models for DBT.

## Architecture

![Architecture Diagram](./architecture.png)

CHT Sync is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. It combines several technologies to achieve this seamless synchronization and provide an efficient workflow for data processing and visualization.

At the core of the CHT Sync are Logstash, PostgREST, and DBT. Logstash plays a key role in the data synchronization process, facilitating the extraction of data from CouchDB and transferring it to PostgREST, ensuring real-time updates in PostgreSQL. PostgREST, on the other hand, acts as a RESTful API layer, enabling convenient interactions with PostgreSQL for data storage and retrieval.

Once the data is synchronized and stored in PostgreSQL, it undergoes transformation using predefined DBT models from the [cht-pipeline](https://github.com/medic/cht-pipeline). DBT plays a crucial role in preparing the data in a format that is optimized for querying and analysis, ensuring the data is readily available for analytics purposes.

CHT Sync also leverages Superset, an analytics and dashboarding platform, to provide intuitive visualizations and interactive analytics on the synchronized data stored in PostgreSQL. Superset empowers users to explore and gain valuable insights from the data, enabling informed decision-making and data-driven actions.

The overall architecture of CHT-sync is driven by the seamless integration of these technologies. CouchDB serves as the source database, containing the original data to be synchronized. Logstash, PostgREST, and DBT facilitate the data flow from CouchDB to PostgreSQL, transforming it into a queriable format. PostgreSQL acts as the centralized repository for the synchronized and transformed data, while Superset provides the interface for users to explore and visualize the analytics.
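
The hand-off from Logstash to PostgREST described above can be pictured as a plain HTTP insert. The following is a minimal sketch, not the project's actual Logstash configuration: the table name `couchdb`, the `doc` column, and the base URL are hypothetical, and the function only builds the request without sending it.

```typescript
// Sketch only: "couchdb" (table), "doc" (column) and the base URL are
// illustrative assumptions, not values confirmed by the README.
interface CouchDoc {
  _id: string;
  _rev: string;
  [key: string]: unknown;
}

interface PostgrestRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) the request that would insert one CouchDB
// document as a row through PostgREST.
function buildPostgrestInsert(
  baseUrl: string,
  table: string,
  doc: CouchDoc
): PostgrestRequest {
  // Drop trailing slashes so the table path is joined exactly once.
  const trimmed = baseUrl.replace(/\/+$/, "");
  return {
    url: `${trimmed}/${encodeURIComponent(table)}`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Ask PostgREST not to echo the inserted row back.
      Prefer: "return=minimal",
    },
    // Wrap the document in the assumed JSON column named "doc".
    body: JSON.stringify({ doc }),
  };
}
```

Keeping request construction separate from sending makes the mapping from a CouchDB document to a PostgreSQL row easy to inspect. DBT models would then read from whatever table receives these rows.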

## Getting Started

-CHT sync has been specifically designed to work in both local development environments for testing models or workflows, gamma environment, as well as in production environments. Each setup accommodates the needs of different stages or environment.
+CHT Sync has been specifically designed to work in both local development environments for testing models or workflows, gamma environment, as well as in production environments. Each setup accommodates the needs of different stages or environments.

### Prerequisites

- `Docker`
- An `.env` file containing the environment variable placeholders from the `.env.template` file. The file should be located in the root directory of the project or set by the operating system. The variables should be customized accordingly for the specific deployment needs.
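
One way to check this prerequisite before starting any containers is to compare the variable names in the template against what is actually defined. This is a hedged sketch, assuming both files use simple `KEY=value` lines; the helper names `envKeys` and `missingVariables` are made up for illustration and are not part of CHT Sync.

```typescript
// Extracts variable names from dotenv-style text, ignoring comments,
// blank lines, and lines without an "=" sign.
function envKeys(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#") && line.includes("="))
    .map((line) => line.slice(0, line.indexOf("=")).trim());
}

// Returns the template variables that are defined neither in the .env text
// nor in the operating system environment passed as `osEnv`.
function missingVariables(
  templateText: string,
  envText: string,
  osEnv: Record<string, string | undefined> = {}
): string[] {
  const defined = new Set(envKeys(envText));
  return envKeys(templateText).filter(
    (key) => !defined.has(key) && osEnv[key] === undefined
  );
}
```

In a Node.js script, the two texts could be read with `fs.readFileSync` and `process.env` passed as `osEnv`; an empty result means every template variable has a value from one of the two sources.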

### Local Setup

-The local environment setup involves starting Logstash, PostgreSQL, PostgREST, DBT, and CouchDB. This configuration facilitates data synchronization, transformation, and storage for local development and testing. Fake data is generated for CouchDB. The required environment variables can be found in the `env.template` file, which should be customized accordingly for the specific deployment needs.
+The local environment setup involves starting Logstash, PostgreSQL, PostgREST, DBT, and CouchDB. This configuration facilitates data synchronization, transformation, and storage for local development and testing. Fake data is generated for CouchDB.

1. Provide the databases you want to sync in the `.env` file:

```
COUCHDB_DBS=<dbs-to-sync> # space separated list of databases you want to sync e.g "medic medic_sentinel"
```

2. Install the dependencies and run the Docker containers locally:

```sh
# starts: logstash, superset, postgres, postgrest, data-generator, couchdb and dbt
npm install
npm run local
```
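
To illustrate the space-separated format of `COUCHDB_DBS` from step 1, the sketch below splits such a value into individual database names. It is an assumption about how a consumer might read the variable, not a description of how the Logstash configuration actually parses it; `parseDbList` is a hypothetical helper.

```typescript
// Splits a space-separated COUCHDB_DBS value into database names.
// Surrounding quotes are stripped in case the raw .env text is passed in.
function parseDbList(value: string | undefined): string[] {
  return (value ?? "")
    .trim()
    .replace(/^["']|["']$/g, "")
    .split(/\s+/)
    .filter((name) => name.length > 0);
}
```

With the README's example value, `parseDbList("medic medic_sentinel")` yields two names, `medic` and `medic_sentinel`; an unset or empty variable yields an empty list.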

#### Run end-to-end test locally
1. Update the following environment variables in your `.env` file:

@@ -63,7 +81,7 @@ SUPERSET_PASSWORD=<your-superset-password>
SUPERSET_ADMIN_EMAIL=<your-superset-emaild>
```

-2. Install and run locally
+2. Install the dependencies and run the Docker containers locally:

```sh
# starts: logstash, superset, postgres, postgrest, data-generator, couchdb and dbt
@@ -72,7 +90,7 @@ npm run local
```

3. Wait for every container to be up and running.
-4. Run end-to-end test
+4. Run the end-to-end tests:

```sh
# runs tests/e2e-test.spec.ts
@@ -81,7 +99,7 @@ npm test

### Gamma Setup

-The gamma environment setup involves starting Logstash, PostgreSQL, PostgREST, and DBT. This configuration facilitates data synchronization, transformation, and storage for medic gamma hosting. The required environment variables can be found in the `env.template` file, which should be customized accordingly for the specific deployment needs.
+The gamma environment setup involves starting Logstash, PostgreSQL, PostgREST, and DBT. This configuration facilitates data synchronization, transformation, and storage for medic gamma hosting.

1. Update the following environment variables in your `.env` file:

@@ -96,7 +114,7 @@ COUCHDB_PASSWORD=<your-couchdb-password>
COUCHDB_SECURE=false
```

-1. Start docker containers
+1. Install the dependencies and start the Docker containers:
```sh
# starts: logstash, superset, postgres, postgrest, and dbt
npm install
@@ -105,7 +123,7 @@ npm run gamma

### Production Setup

-The production environment setup involves starting Logstash, PostgREST, and DBT. This configuration facilitates data synchronization, transformation, and storage for CHT production hosting. The required environment variables can be found in the `env.template` file, which should be customized accordingly for the specific deployment needs.
+The production environment setup involves starting Logstash, PostgREST, and DBT. This configuration facilitates data synchronization, transformation, and storage for CHT production hosting.

1. Update the following environment variables in your `.env` file:

@@ -135,32 +153,14 @@ COUCHDB_PORT=<your-couchdb-port>
COUCHDB_SECURE=false
```

-1. (Optional) Start local version of PostgreSQL
+1. (Optional) Start local version of PostgreSQL:
```
docker-compose -f docker-compose.postgres.yml -f docker-compose.yml up postgres
```

-1. Start docker containers
+1. Install the dependencies and start the Docker containers:
```sh
# starts: logstash, superset, postgrest and dbt
npm install
npm run prod
```
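
As a rough illustration of how the CouchDB variables from step 1 might fit together, the sketch below derives a base URL from them. It assumes that `COUCHDB_SECURE` selects between `http` and `https`, which the README does not state explicitly, and `COUCHDB_HOST` is an assumed variable name because the host setting is not visible in this excerpt.

```typescript
// Subset of the CouchDB settings needed to form a base URL.
interface CouchEnv {
  COUCHDB_HOST: string;   // assumed name; not shown in the excerpt above
  COUCHDB_PORT: string;
  COUCHDB_SECURE: string; // "true" or "false"
}

// Assumption: COUCHDB_SECURE=true means HTTPS, anything else means HTTP.
function couchBaseUrl(env: CouchEnv): string {
  const secure = env.COUCHDB_SECURE.trim().toLowerCase() === "true";
  const scheme = secure ? "https" : "http";
  return `${scheme}://${env.COUCHDB_HOST}:${env.COUCHDB_PORT}`;
}
```

Credentials are deliberately left out of the URL here; they would normally be supplied separately, for example through an `Authorization` header.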

## Environment Variables

The required environment variables and their corresponding environments can be found in the `env.template` file. These variables must be present either in a `.env` file located in the root directory of the project or set by the operating system.

## Architecture

![Architecture Diagram](./architecture.png)

CHT-sync is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. It combines several technologies to achieve this seamless synchronization and provide an efficient workflow for data processing and visualization.

At the core of the CHT-sync toolkit are Logstash, PostgREST, and DBT. Logstash plays a key role in the data synchronization process, facilitating the extraction of data from CouchDB and transferring it to PostgREST, ensuring real-time updates in PostgreSQL. PostgREST, on the other hand, acts as a RESTful API layer, enabling convenient interactions with PostgreSQL for data storage and retrieval.

Once the data is synchronized and stored in PostgreSQL, it undergoes transformation using predefined DBT models from the [cht-pipeline](https://github.com/medic/cht-pipeline). DBT plays a crucial role in preparing the data in a format that is optimized for querying and analysis, ensuring the data is readily available for analytics purposes.

CHT-sync also leverages Superset, an analytics and dashboarding platform, to provide intuitive visualizations and interactive analytics on the synchronized data stored in PostgreSQL. Superset empowers users to explore and gain valuable insights from the data, enabling informed decision-making and data-driven actions.

The overall architecture of CHT-sync is driven by the seamless integration of these technologies. CouchDB serves as the source database, containing the original data to be synchronized. Logstash, PostgREST, and DBT facilitate the data flow from CouchDB to PostgreSQL, transforming it into a queriable format. PostgreSQL acts as the centralized repository for the synchronized and transformed data, while Superset provides the interface for users to explore and visualize the analytics.
2 changes: 1 addition & 1 deletion env.template
@@ -1,7 +1,7 @@
# project wide: optional
COMPOSE_PROJECT_NAME=pipeline

-# postgrest and pogresql: required environment variables for 'gamma', prod and 'local'
+# postgrest and postgresql: required environment variables for 'gamma', prod and 'local'
POSTGRES_USER=root
POSTGRES_PASSWORD=supercoolpassword
POSTGRES_DB=data