Update README & add utility shell script #151

Merged on Aug 26, 2024 (3 commits)
196 changes: 126 additions & 70 deletions README.md
@@ -6,23 +6,21 @@

---


## Introduction to COOL

![ COOL](./assets/img/arch2.png)

Different groups of people often have different behaviors or trends. For example, the bones of older people are more porous than those of younger people. It is of great value to explore the behaviors and trends of different groups of people, especially in healthcare, because we could adopt appropriate measures in time to avoid tragedy. The easiest way to do this is **cohort analysis**.

But with a variety of big data accumulated over the years, **query efficiency** becomes one of the problems that OnLine Analytical Processing (OLAP) systems meet, especially for cohort analysis. Therefore, COOL is introduced to solve the problems.
However, as large volumes of data have accumulated over the years, **query efficiency** has become one of the main problems facing OnLine Analytical Processing (OLAP) systems, especially for cohort analysis. COOL is introduced to address this problem.

COOL is an online cohort analytical processing system that supports various types of data analytics, including cube query, iceberg query and cohort query.

With the support of several newly proposed operators on top of a sophisticated storage layer, COOL could provide high performance (near real-time) analytical response for emerging data warehouse domains.

With the support of several newly proposed operators on top of a sophisticated storage layer, COOL can provide high-performance (near real-time) analytical responses for emerging data warehouse domains.

## Key features of COOL

1. **Easy to use.** COOL is easy to deploy on local or on cloud via docker.
1. **Easy to use.** COOL is easy to deploy locally or on the cloud via Docker.
2. **Near Real-time Responses.** COOL is highly efficient and can therefore answer cohort queries in near real time.
3. **Specialized Storage Layout.** A specialized storage layout is designed for fast query processing and reduced space consumption.
4. **Self-designed Semantics.** COOL introduces novel, self-designed semantics for cohort queries that reduce their complexity and extend their functionality.
@@ -31,167 +29,225 @@ With the support of several newly proposed operators on top of a sophisticated s

## Quickstart

### BUILD
### Build package

Simply run `mvn clean package`
```bash
mvn clean package
```

### Required sources:
### Required sources

1. **dataset file**: a csv file with "," delimiter (normally dumped from a database table), and the table header is removed.
3. **dataset schema file**: a `table.yaml` file specifying the dataset's columns and their measure fields.
4. **query file**: a yaml file specify the parameters for running query server.
1. **dataset file**: a CSV file with "," delimiter (normally dumped from a database table) and the table header removed.
2. **dataset schema file**: a `table.yaml` file specifying the dataset's columns and their measure fields.
3. **query file**: a YAML file specifying the parameters for running the query server.
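
The dataset file must not contain a header row. If a database dump includes one, a small shell helper can strip it before loading. This is only a minimal sketch: the function name `strip_csv_header` and the file names are hypothetical and not part of COOL, and it assumes the header occupies exactly the first line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of COOL): write a copy of a CSV file
# without its first line, which is assumed to be the header row.
strip_csv_header() {
  local src="$1" dst="$2"
  # tail -n +2 prints everything starting from the second line.
  tail -n +2 "$src" > "$dst"
}

# Example usage with placeholder file names:
# strip_csv_header dump_with_header.csv data.csv
```

Writing to a separate output file keeps the original dump untouched, so the step can be repeated safely.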

### Load dataset

Before query processing, we need to load the dataset into the COOL native format. Sample code for loading a CSV dataset with the data loader can be found in [CsvLoader.java](cool-core/src/main/java/com/nus/cool/functionality/CsvLoader.java).

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CsvLoader path/to/your/source/directory path/to/your/.yaml path/to/your/datafile path/to/output/datasource/directory
./cool load \
dataset \
path/to/your/.yaml \
path/to/your/datafile \
path/to/output/datasource/directory
```

The four arguments in the command have the following meaning:
1. a unique dataset name given under the directory
2. the table.yaml (the third required source)

1. the dataset name
2. the `table.yaml` (the second required source)
3. the dataset file (the first required source)
4. the output directory for the compacted dataset
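
Because a wrong argument order only surfaces after the JVM has started, it can help to check the inputs first. The following is a sketch under stated assumptions: `check_load_inputs` is a hypothetical helper, not part of COOL, and it only verifies that the schema and dataset files exist; it does not validate their contents.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of COOL) for the four
# arguments of `./cool load`: name, table.yaml, datafile, output dir.
check_load_inputs() {
  if [ "$#" -ne 4 ]; then
    echo "usage: check_load_inputs <name> <table.yaml> <datafile> <output-dir>" >&2
    return 2
  fi
  local name="$1" schema="$2" data="$3" out_dir="$4"

  [ -n "$name" ] || { echo "dataset name must not be empty" >&2; return 1; }
  [ -f "$schema" ] || { echo "schema file not found: $schema" >&2; return 1; }
  [ -f "$data" ] || { echo "dataset file not found: $data" >&2; return 1; }
  [ -n "$out_dir" ] || { echo "output directory must not be empty" >&2; return 1; }
}

# Example usage, chaining the check with the real command:
# check_load_inputs sogamo datasets/sogamo/table.yaml datasets/sogamo/data.csv ./CubeRepo \
#   && ./cool load sogamo datasets/sogamo/table.yaml datasets/sogamo/data.csv ./CubeRepo
```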


### Execute queries

We provide an example for cohort query processing in [CohortAnalysis.java](cool-core/src/main/java/com/nus/cool/functionality/CohortAnalysis.java).

There are two types of queries in COOL. The first one includes two steps.

#### Cohort Query
<!-- There are two types of queries in COOL. The first one includes two steps. -->

- Select the specific users.
#### Cohort Selection

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortSelection path/to/output/datasource/directory path/to/your/queryfile
./cool cohortselection \
path/to/output/datasource/directory \
path/to/your/queryfile
```

- Executes cohort query users.
#### Cohort Query

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortAnalysis path/to/output/datasource/directory path/to/your/cohortqueryfile
./cool cohortquery \
path/to/output/datasource/directory \
path/to/your/cohortqueryfile
```

- Executes the funnel query.
#### Funnel Query

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.FunnelAnalysis path/to/output/datasource/directory path/to/your/funnelqueryfile
./cool funnelquery \
path/to/output/datasource/directory \
path/to/your/funnelqueryfile
```

#### OLAP Query

- Execute the following query in COOL.

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.IcebergLoader path/to/output/datasource/directory path/to/your/queryfile
./cool olapquery \
path/to/output/datasource/directory \
path/to/your/queryfile
```

### Example-Cohort Analysis
## Example: Cohort Analysis

#### Load dataset
### Load dataset

We have provided examples in `sogamo` directory and `health` directory. Now we take `sogamo` for example.
We provide examples in the `sogamo` and `health_raw` directories. Here we use `sogamo` as the example.

The COOL system supports the CSV data format by default; you can load the `sogamo` dataset with the following command.

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CsvLoader sogamo sogamo/table.yaml sogamo/data.csv CubeRepo
./cool load \
sogamo \
datasets/sogamo/table.yaml \
datasets/sogamo/data.csv \
./CubeRepo
```

In addition, you can run the following command to load dataset in other formats under the `sogamo` directory.
<!-- disabled as currently not working -->
<!--
In addition, you can run the following command to load the dataset in other formats under the `sogamo` directory.
- parquet format data
```bash
$ java -jar cool-extensions/parquet-extensions/target/parquet-extensions-0.1-SNAPSHOT.jar sogamo sogamo/table.yaml sogamo/data.parquet CubeRepo
java -jar cool-extensions/parquet-extensions/target/parquet-extensions-0.1-SNAPSHOT.jar \
sogamo \
datasets/sogamo/table.yaml \
datasets/sogamo/data.parquet \
./CubeRepo
```
- Arrow format data
```bash
$ java -jar cool-extensions/arrow-extensions/target/arrow-extensions-0.1-SNAPSHOT.jar sogamo sogamo/table.yaml sogamo/data.arrow CubeRepo
java -jar cool-extensions/arrow-extensions/target/arrow-extensions-0.1-SNAPSHOT.jar \
sogamo \
datasets/sogamo/table.yaml \
datasets/sogamo/data.arrow \
./CubeRepo
```
- Avro format data
```bash
$ java -jar cool-extensions/avro-extensions/target/avro-extensions-0.1-SNAPSHOT.jar sogamo sogamo/table.yaml sogamo/avro/test.avro CubeRepo sogamo/avro/schema.avsc
java -jar cool-extensions/avro-extensions/target/avro-extensions-0.1-SNAPSHOT.jar \
sogamo \
datasets/sogamo/table.yaml \
datasets/sogamo/avro/test.avro \
./CubeRepo \
datasets/sogamo/avro/schema.avsc
```
-->

Finally, there will be a cube generated under the `CubeRepo` directory, which is named `sogamo`.
This generates a cube named `sogamo` under the `./CubeRepo` directory.

#### Execute queries
Similarly, load the `health_raw` dataset with:

We use the `health` dataset for example to demonstrate the cohort ananlysis.
```bash
./cool load \
health_raw \
datasets/health_raw/table.yaml \
datasets/health_raw/data.csv \
./CubeRepo
```
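
After loading, a quick check can confirm that both cubes are present before any query is run. This sketch assumes that each cube appears as an entry named after the dataset directly under the repository directory, as the text describes; the helper name `cube_exists` is hypothetical and the exact on-disk layout is not verified here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of COOL): succeed if an entry named
# after the cube exists directly under the cube repository directory.
cube_exists() {
  local repo="$1" name="$2"
  [ -e "$repo/$name" ]
}

# Report the status of the two example cubes.
for cube in sogamo health_raw; do
  if cube_exists ./CubeRepo "$cube"; then
    echo "found cube: $cube"
  else
    echo "missing cube: $cube" >&2
  fi
done
```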

- Select the specific users.
### Execute queries

We use the `health_raw` dataset as an example to demonstrate cohort analysis.

#### Select the specific users

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortSelection CubeRepo health/query1-0.json
./cool cohortselection \
./CubeRepo \
datasets/health_raw/sample_query_selection/query.json
```

where the three arguments are as follows:
1. `CubeRepo`: the output directory for the compacted dataset
2. `health`: the cube name of the compacted dataset
3. `health/query1-0.json`: the json file for the cohort query
where the arguments are:

- Display the selected all records of the cohort in terminal for exploration
```
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortExploration CubeRepo health loyal
```
1. `./CubeRepo`: the output directory for the compacted dataset
2. `datasets/health_raw/sample_query_selection/query.json`: the cohort query (in JSON)

- Execute cohort query on the selected users.
<!--
- Display all selected records of the cohort in the terminal for exploration
```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortAnalysis CubeRepo health/query1-1.json
java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar \
com.nus.cool.functionality.CohortExploration \
./CubeRepo \
health_raw \
sample_query_selection
```
-->

- Execute cohort query on all the users.
#### Execute cohort query

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CohortAnalysis CubeRepo health/query2.json
./cool cohortquery \
./CubeRepo \
datasets/health_raw/sample_query_average/query.json
```
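
The `health_raw` steps can also be chained in one script so that a failure stops the sequence early. This is a sketch, not an official workflow: `COOL_CMD` and `run_health_example` are names introduced here for illustration, and running the selection before the cohort query is an assumption about the intended order.

```bash
#!/usr/bin/env bash
# COOL_CMD is a hypothetical variable introduced for this sketch so the
# launcher can be swapped out, for example with `echo` for a dry run.
COOL_CMD="${COOL_CMD:-./cool}"

# Load the health_raw dataset, then run cohort selection and the cohort
# query. Each step runs only if the previous one succeeded.
run_health_example() {
  local repo="${1:-./CubeRepo}"
  "$COOL_CMD" load health_raw \
      datasets/health_raw/table.yaml \
      datasets/health_raw/data.csv \
      "$repo" &&
    "$COOL_CMD" cohortselection "$repo" \
      datasets/health_raw/sample_query_selection/query.json &&
    "$COOL_CMD" cohortquery "$repo" \
      datasets/health_raw/sample_query_average/query.json
}

# Dry run that only prints the commands:
# COOL_CMD=echo run_health_example ./CubeRepo
```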

Partial results for the query `health/query2.json` on the `health` dataset are as at [result2.json](health/result2.json)
#### Funnel Analysis

We use the `sogamo` dataset as an example to demonstrate funnel analysis.

```bash
$ java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.FunnelAnalysis CubeRepo sogamo/query1.json
./cool funnelquery \
./CubeRepo \
datasets/sogamo/sample_funnel_analysis/query.json
```

### Example-OLAP Analysis
## Example: OLAP Analysis

#### Load dataset
### Load dataset

We provide an example in the `olap-tpch` directory.

The COOL system supports the CSV data format by default; you can load the `tpc-h` dataset with the following command.

```bash
java -cp ./cool-core/target/cool-core-0.1-SNAPSHOT.jar com.nus.cool.functionality.CsvLoader tpc-h-10g olap-tpch/table.yaml olap-tpch/scripts/data.csv CubeRepo
./cool load \
tpc-h-10g \
datasets/olap-tpch/table.yaml \
datasets/olap-tpch/scripts/data.csv \
./CubeRepo
```

Finally, there will be a cube generated under the `CubeRepo` directory, which is named `tpc-h-10g`.
Finally, a cube named `tpc-h-10g` is generated under the `./CubeRepo` directory.

### Execute queries

#### Execute queries
Run Server

1. put the `application.property file at the same level as the .jar file.
2. edit the server configuration in the `application.property file.
3. run the below commond line.
```
java -jar cool-queryserver/target/cool-queryserver-0.1-SNAPSHOT.jar
1. Put the `application.property` file at the same level as the .jar file.
2. Edit the server configuration in the `application.property` file.
3. Run the command below.

```bash
./cool server
```

## CONNECT TO EXTERNAL STORAGE SERVICES
COOL has an [StorageService](cool-core/src/main/java/com/nus/cool/storageservice/StorageService.java) interface, which will allow COOL standalone server/workers (coming soon) to handle data movement between local and an external storage service. A sample implementation for HDFS connection can be found under the [hdfs-extensions](cool-extensions/hdfs-extensions/).

COOL has a [StorageService](cool-core/src/main/java/com/nus/cool/storageservice/StorageService.java) interface, which will allow COOL standalone server/workers (coming soon) to handle data movement between local storage and an external storage service. A sample implementation for an HDFS connection can be found under [hdfs-extensions](cool-extensions/hdfs-extensions/).

## Publication
* Q. Cai, K. Zheng, H.V. Jagadish, B.C. Ooi, J.W.L. Yip. CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics, in Proceedings of the VLDB Endowment, 10(17), 2024.
* Z. Xie, H. Ying, C. Yue, M. Zhang, G. Chen, B. C. Ooi. [Cool: a COhort OnLine analytical processing system](https://www.comp.nus.edu.sg/~ooibc/icde20cool.pdf), in 2020 IEEE 36th International Conference on Data Engineering, pp.577-588, 2020.
* Q. Cai, Z. Xie, M. Zhang, G. Chen, H.V. Jagadish and B.C. Ooi. [Effective Temporal Dependence Discovery in Time Series Data](http://www.comp.nus.edu.sg/~ooibc/cohana18.pdf), in Proceedings of the VLDB Endowment, 11(8), pp.893-905, 2018.
* Z. Xie, Q. Cai, F. He, G.Y. Ooi, W. Huang, B.C. Ooi. [Cohort Analysis with Ease](https://dl.acm.org/doi/10.1145/3183713.3193540), in Proceedings of the 2018 International Conference on Management of Data, pp.1737-1740, 2018.
* D. Jiang, Q. Cai, G. Chen, H. V. Jagadish, B. C. Ooi, K.-L. Tan, and A. K. H. Tung. [Cohort Query Processing](http://www.vldb.org/pvldb/vol10/p1-ooi.pdf), in Proceedings of the VLDB Endowment, 10(1), 2016.

- Q. Cai, K. Zheng, H.V. Jagadish, B.C. Ooi, J.W.L. Yip. CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics, in Proceedings of the VLDB Endowment, 17(10), 2024.

- Z. Xie, H. Ying, C. Yue, M. Zhang, G. Chen, B. C. Ooi. [Cool: a COhort OnLine analytical processing system](https://www.comp.nus.edu.sg/~ooibc/icde20cool.pdf), in 2020 IEEE 36th International Conference on Data Engineering, pp.577-588, 2020.
- Q. Cai, Z. Xie, M. Zhang, G. Chen, H.V. Jagadish and B.C. Ooi. [Effective Temporal Dependence Discovery in Time Series Data](http://www.comp.nus.edu.sg/~ooibc/cohana18.pdf), in Proceedings of the VLDB Endowment, 11(8), pp.893-905, 2018.
- Z. Xie, Q. Cai, F. He, G.Y. Ooi, W. Huang, B.C. Ooi. [Cohort Analysis with Ease](https://dl.acm.org/doi/10.1145/3183713.3193540), in Proceedings of the 2018 International Conference on Management of Data, pp.1737-1740, 2018.
- D. Jiang, Q. Cai, G. Chen, H. V. Jagadish, B. C. Ooi, K.-L. Tan, and A. K. H. Tung. [Cohort Query Processing](http://www.vldb.org/pvldb/vol10/p1-ooi.pdf), in Proceedings of the VLDB Endowment, 10(1), 2016.
65 changes: 65 additions & 0 deletions cool
@@ -0,0 +1,65 @@
#!/usr/bin/env bash

COOL_CORE_PATH="${COOL_CORE_JAR_PATH:-./cool-core/target/cool-core-0.1-SNAPSHOT.jar}"
COOL_QUERY_SERVER_PATH="${COOL_QUERY_SERVER_PATH:-./cool-queryserver/target/cool-queryserver-0.1-SNAPSHOT.jar}"


# Print the usage text and exit with the given status (default 0).
main_help() {
  COOL_HELP_LEFT_ALIGN="%-17s"

  echo "Usage: $0 <command> [<args>]"
  echo
  echo "Commands"
  printf " $COOL_HELP_LEFT_ALIGN %s\n" "help" "Show this help menu"
  printf " $COOL_HELP_LEFT_ALIGN %s\n" "load" "Load dataset"
  printf " $COOL_HELP_LEFT_ALIGN %s\n" "cohortselection" "Perform cohort selection"
  printf " $COOL_HELP_LEFT_ALIGN %s\n" "cohortquery" "Perform cohort query"
  printf " $COOL_HELP_LEFT_ALIGN %s\n" "funnelquery" "Perform funnel query"
  printf " $COOL_HELP_LEFT_ALIGN %s\n" "olapquery" "Perform OLAP query"
  printf " $COOL_HELP_LEFT_ALIGN %s\n" "server" "Start query server"

  exit "${1:-0}"
}

# Each subcommand forwards its arguments to the matching Java entry point.
# The jar paths are quoted so that paths containing spaces still work.
main_load() {
  java -cp "$COOL_CORE_PATH" com.nus.cool.functionality.CsvLoader "$@"
}

main_cohortselection() {
  java -cp "$COOL_CORE_PATH" com.nus.cool.functionality.CohortSelection "$@"
}

main_cohortquery() {
  java -cp "$COOL_CORE_PATH" com.nus.cool.functionality.CohortAnalysis "$@"
}

main_funnelquery() {
  java -cp "$COOL_CORE_PATH" com.nus.cool.functionality.FunnelAnalysis "$@"
}

main_olapquery() {
  java -cp "$COOL_CORE_PATH" com.nus.cool.functionality.IcebergLoader "$@"
}

main_server() {
  java -jar "$COOL_QUERY_SERVER_PATH"
}

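# Dispatch the first argument to the main_<command> function of the same name.
main() {
  if (($# == 0)); then
    main_help 0
  fi

  case "$1" in
    help | load | cohortselection | cohortquery | funnelquery | olapquery | server)
      "main_$1" "${@:2}"
      ;;
    *)
      echo "unknown command: $1" >&2
      # main_help exits with the status passed to it.
      main_help 1
      ;;
  esac
}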

main "$@"
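
The script resolves both jar locations with the `${VAR:-default}` expansion, so they can be overridden from the environment without editing the file. The sketch below isolates that pattern for the core jar; the function name `resolve_core_jar` and the path `/opt/cool/cool-core.jar` are hypothetical and used only for illustration.

```bash
#!/usr/bin/env bash
# Same expansion as in the script: use COOL_CORE_JAR_PATH when it is set
# and non-empty, otherwise fall back to the Maven build output.
resolve_core_jar() {
  echo "${COOL_CORE_JAR_PATH:-./cool-core/target/cool-core-0.1-SNAPSHOT.jar}"
}

# Example usage with a hypothetical jar location:
# COOL_CORE_JAR_PATH=/opt/cool/cool-core.jar ./cool load ...
```

Note that the override variable for the core jar is `COOL_CORE_JAR_PATH`, while the query server jar is overridden through `COOL_QUERY_SERVER_PATH`.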