Add Azure Data Explorer #10

Status: Open. Wants to merge 27 commits into base: main.

27 commits:
018c723
Add initial setup and test connection to ADX with dummy
DavidOno Feb 28, 2023
3a708df
Small clean up
DavidOno Feb 28, 2023
5e41436
Change ingestion from file to dataframe and start with init()
DavidOno Mar 1, 2023
b3954eb
Make Dockerfile and config runnable for adx
DavidOno Mar 1, 2023
576e970
Add remaining azure data explorer logic.
DavidOno Mar 8, 2023
3cca2dc
Improve error handling and add first main.tf
DavidOno Mar 14, 2023
684fed6
Improve terraform
DavidOno Mar 18, 2023
28ac758
Improve stream ingestion (still not working)
DavidOno Mar 18, 2023
ac13e7a
Improve stream ingestion (still not very efficient)
DavidOno Mar 19, 2023
ab6e368
Make a single event ingestable by stream
DavidOno Mar 21, 2023
586e34d
Fix ingestion via stream
DavidOno Mar 21, 2023
48e9982
Cleanup code
DavidOno Mar 21, 2023
f71c48f
Refactor code
DavidOno Mar 21, 2023
b03e703
Refactor code
DavidOno Mar 22, 2023
c0562d7
Clean up
DavidOno Mar 28, 2023
8ec6824
Final Clean up
DavidOno Mar 28, 2023
81d1404
Final Clean up, Part 2
DavidOno Mar 29, 2023
abecb5e
Final Clean up, Part 3
DavidOno Mar 29, 2023
8f26e9a
Improve readme
DavidOno Mar 30, 2023
1b05a32
Change back to mw-docker image
DavidOno Mar 30, 2023
3b7d1ff
Document results for storage optimized
DavidOno Apr 1, 2023
684a17a
Document results for compute optimized
DavidOno Apr 8, 2023
f420062
Fix command in README
DavidOno Apr 8, 2023
92c851a
Small refactoring
DavidOno Apr 8, 2023
cdb32e2
Small refactoring
DavidOno Apr 8, 2023
f589540
Improve README
DavidOno Apr 8, 2023
0488cdd
Improve README
DavidOno Apr 11, 2023
103 changes: 81 additions & 22 deletions README.md
@@ -17,6 +17,7 @@ This readme contains three sections:
## Benchmark

Using the test setup from this repository we ran tests against PostgreSQL, CockroachDB, YugabyteDB, ArangoDB, Cassandra and InfluxDB with (aside from PostgreSQL and InfluxDB) 3 nodes and 16 parallel workers that insert data (more concurrent workers might speed up the process but we didn't test that as this test was only to establish a baseline). All tests were done on a k3s cluster consisting of 8 m5ad.2xlarge (8vCPU, 32GB memory) and 3 i3.8xlarge (32vCPU, 244GB memory) EC2 instances on AWS. The database pods were run on the i3 instances with a resource limit of 8 cores and 10 GB memory per node, the client pods of the benchmark on the m5ad instances. In the text below all mentions of nodes refer to database nodes (pods) and not VMs or kubernetes nodes.
For Azure Data Explorer the test setup was adapted as follows: an AKS cluster with 3 Standard_D8_v5 nodes.

All tests were run on an empty database.

@@ -31,23 +32,31 @@ For generating primary keys for the events we have two modes: Calculating it on
For the insert testcase we use single inserts (for PostgreSQL with autocommit) to simulate an ingest where each message needs to be persisted as soon as possible so no batching of messages is possible. Depending on the architecture and implementation buffering and batching of messages is possible. To see what effect this has on the performance we have implemented a batch mode for the test.
For ArangoDB batch mode is implemented using the document batch API (`/_api/document`), the older generic batch API (`/_api/batch`) will be deprecated and produces worse performance so we did not use it. For PostgreSQL we implemented batch mode by doing a manual commit every x inserts and using COPY instead of INSERT. Another way to implement batch inserts is to use values lists (one insert statement with a list of values tuples) but this is not as fast as COPY. This mode can however be activated by passing `--extra-option use_values_lists=true`.
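As an illustration of the PostgreSQL batch mode described above, the sketch below groups the event stream into batches and sends each one via COPY. Table and column names are hypothetical and the real implementation lives in the simulator modules; `cursor` is assumed to be a psycopg2 cursor (whose `copy_from` reads tab-separated rows from a file-like object):

```python
import io

def batched(rows, size=1000):
    """Group the event stream into batches of `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def copy_batch(cursor, rows, table="events", columns=("device_id", "ts", "temperature")):
    """Write one batch via COPY, which outperforms values lists."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(v) for v in row) + "\n")
    buf.seek(0)
    cursor.copy_from(buf, table, columns=columns)
```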

Azure Data Explorer offers compute-optimized and storage-optimized SKU types, as well as batch and stream ingestion. We tested the storage-optimized variant with Standard_L8s_v2 and the compute-optimized variant with Standard_E8a_v4, in both cases with a SKU capacity of 2.

### Insert performance

The table below shows the best results for the databases for a 3 node cluster and a resource limit of 8 cores and 10 GB memory per node. The exceptions are PostgreSQL, InfluxDB and TimescaleDB which were launched as only a single instance. Influx provides a clustered variant only with their Enterprise product and for TimescaleDB there is no official and automated way to create a cluster with a distributed hypertable. All tests were run with the newest available version of the databases at the time of testing and using the opensource or free versions.
Inserts were done with 16 parallel workers, and each test was run 3 times with the result being the average of these runs. For each run the inserts per second were calculated as the number of inserts divided by the summed-up duration of the workers.
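The metric just described can be written down as a small helper. This is a sketch of the calculation only, not the repository's actual implementation; `worker_durations` is a hypothetical list of per-worker wall-clock times in seconds:

```python
def inserts_per_second(num_inserts: int, worker_durations: list[float]) -> float:
    """Inserts/s for one run: total inserts divided by the summed-up worker durations."""
    return num_inserts / sum(worker_durations)

def average_over_runs(run_results: list[float]) -> float:
    """Each test is repeated several times; the reported value is the average."""
    return sum(run_results) / len(run_results)

# Example: 16 workers doing 10000 inserts each, every worker busy for 2 seconds
rate = inserts_per_second(16 * 10000, [2.0] * 16)  # 160000 / 32 = 5000.0
```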

| Database (Version tested) | Inserts/s | Insert mode | Primary-key mode |
|-----------------------------------------|-----------|--------------------------------------|------------------|
| PostgreSQL | 428000 | copy, size 1000 | sql |
| CockroachDB (22.1.3) | 91000 | values lists, size 1000 | db |
| YugabyteDB YSQL (2.15.0) | 295000 | copy, size 1000 | sql |
| YugabyteDB YCQL (2.15.0) | 288000 | batch, size 1000 | - |
| ArangoDB                                | 137000    | batch, size 1000                     | db               |
| Cassandra sync inserts | 389000 | batch, size 1000, max_sync_calls 1 | - |
| Cassandra async inserts | 410000 | batch, size 1000, max_sync_calls 120 | - |
| InfluxDB | 460000 | batch, size 1000 | - |
| TimescaleDB | 600000 | copy, size 1000 | - |
| Elasticsearch | 170000 | batch, size 10000 | db |
| Azure Data Explorer (Storage optimized) | 36000<sup>*</sup> | batch, size 1000 | - |
| Azure Data Explorer (Storage optimized) | 30000<sup>*</sup> | stream, size 1000 | - |
| Azure Data Explorer (Compute optimized) | 38000<sup>*</sup> | batch, size 1000 | - |
| Azure Data Explorer (Compute optimized) | 53000<sup>*</sup> | stream, size 1000 | - |

<sup>*Inserts are not written to the database immediately but only queued</sup>

You can find additional results from older runs in [old-results.md](old-results.md) but be aware that comparing them with the current ones is not always possible due to different conditions during the runs.

@@ -64,16 +73,21 @@ Although the results of our benchmarks show a drastic improvement, batching in m

For TimescaleDB the insert performance depends a lot on the number and size of chunks that are written to. In a fill-level test with 50 million inserts per step where in each step the timestamps started again (so the same chunks were written to as in the last step) performance degraded rapidly. But in the more realistic case of ever increasing timestamps (so new chunks being added) performance stayed relatively constant.

A notable property of Azure Data Explorer is that inserts are not written to the database immediately; instead they are queued. The larger the batch/stream, the longer it takes until the queue starts to work it off, but once started it keeps the pace: a couple of rows take 20 - 30 seconds to appear in the database, 1000 rows take 5 - 6 minutes, but 12.5 million rows also require only 5 - 6 minutes.
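This lag can be measured by polling the table's row count until the queued data becomes visible. The sketch below takes the count function as an argument so it works against any client; with azure-kusto-data this could be a thin wrapper around `KustoClient.execute` running a `count` query, but that wiring is an assumption, not the repository's code:

```python
import time

def wait_for_ingestion(count_rows, expected, timeout=600, interval=5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `count_rows()` until at least `expected` rows are visible.

    Returns the elapsed seconds, i.e. the ingestion lag described above.
    `sleep` and `clock` are injectable for testing.
    """
    start = clock()
    while clock() - start < timeout:
        if count_rows() >= expected:
            return clock() - start
        sleep(interval)
    raise TimeoutError(f"only {count_rows()} of {expected} rows visible after {timeout}s")
```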

### Query performance

| Database / Query | count-events | temperature-min-max | temperature-stats | temperature-stats-per-device | newest-per-device |
|-----------------------------------------|--------------|---------------------|-------------------|------------------------------|-------------------|
| PostgreSQL | 39 | 0.01 | 66 | 119 | 92 |
| CockroachDB | 123 | 153 | 153 | 153 | 150 |
| InfluxDB | 10 | 48 | 70 | 71 | 0.1 |
| TimescaleDB | 30 | 0.17 | 34 | 42 | 38 |
| Elasticsearch | 0.04 | 0.03 | 5.3 | 11 | 13 |
| Yugabyte (YSQL) | 160 | 0.03 | 220 | 1700 | failure |
| Azure Data Explorer (Storage optimized) | 0.33 | 0.76 | 1.0 | 3.2 | 8.6 |
| Azure Data Explorer (Compute optimized) | 0.29 | 0.55 | 0.69 | 2.2 | failure |

The table gives the average query durations in seconds.

@@ -101,6 +115,10 @@ For TimescaleDB query performance is also very dependent on the number and size

Elasticsearch seems to cache query results, as such running the queries several times will yield millisecond response times for all queries. The times noted in the table above are against a freshly started elasticsearch cluster.

For the newest-per-device query (compute optimized) Azure Data Explorer did not succeed but terminated with "partition operator exceed amount of maximum partitions allowed (64)".
When a query is run against Azure Data Explorer, the query engine tries to optimize it by breaking it down into smaller, parallelizable tasks that can be executed across multiple partitions.
If the query requires more partitions than the maximum allowed limit, it fails with the error message above.
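A benchmark harness has to treat this as a result rather than a crash. The sketch below shows one way to map the partition-limit error to the "failure" entries in the query table; `execute` stands in for a real client call (e.g. `KustoClient.execute` from azure-kusto-data), and matching the error by message text is an assumption:

```python
def run_query(execute, query):
    """Run a query, converting the ADX partition-limit error into a
    'failure' result instead of aborting the whole benchmark run."""
    try:
        return execute(query)
    except Exception as exc:  # azure-kusto-data would raise KustoServiceError here
        if "maximum partitions allowed" in str(exc):
            return "failure"
        raise  # unknown errors should still surface
```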

For all databases there seems to be a rough linear correlation between query times and database size. So when running the tests with only 50 million rows the query times were about 10 times as fast.

## Running the tests
@@ -136,9 +154,9 @@ To run the test use `python run.py insert`. You can use the following options:
* `--workers`: Set of worker counts to try, default is `1,4,8,12,16` meaning the test will try with 1 concurrent worker, then with 4, then 8, then 12 and finally 16
* `--runs`: How often should the test be repeated for each worker count, default is `3`
* `--primary-key`: Defines how the primary key should be generated, see below for choices. Defaults to `db`
* `--tables`: Simulates how the databases behave if inserts are done to several tables. Change this option from `single` to `multiple` to have the test write into four tables instead of just one
* `--num-inserts`: The number of inserts each worker should do, by default 10000 to get a quick result. Increase this to see how the databases behave under constant load. Also increase the timeout option accordingly
* `--timeout`: How long should the script wait for the insert test to complete in seconds. Default is `0`. Increase accordingly if you increase the number of inserts or disable by setting to `0`
* `--batch`: Switch to batch mode (for postgres this means manual commits, for arangodb using the [batch api](https://docs.python-arango.com/en/main/batch.html)). Specify the number of inserts per transaction/batch
* `--extra-option`: Extra options to supply to the test scripts, can be used multiple times. Currently only used for ArangoDB (see below)
* `--clean` / `--no-clean`: By default the simulator will clean and recreate tables to always have the same basis for the runs. Can be disabled
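The repeated `--extra-option key=value` flags described above could be collected into a dict like this. This is a hypothetical sketch of the parsing (the real logic lives in `run.py` and the test scripts); the boolean coercion mirrors values such as `use_values_lists=true`:

```python
def parse_extra_options(pairs):
    """Turn repeated `--extra-option key=value` strings into a dict,
    converting the literals 'true'/'false' into Python booleans."""
    options = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        if value.lower() in ("true", "false"):
            value = value.lower() == "true"
        options[key] = value
    return options

# e.g. run.py insert --extra-option use_values_lists=true --extra-option size=1000
opts = parse_extra_options(["use_values_lists=true", "size=1000"])
```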
@@ -312,6 +330,47 @@ kubectl apply -f dbinstall/elastic-deployment.yaml

```

### Azure Data Explorer

Azure Data Explorer (ADX) is a fully managed, high-performance big data analytics platform that ingests, processes, and stores structured and semi-structured data. You can use Azure Data Explorer for near real-time queries and advanced analytics.

To deploy an ADX cluster, change the service principal (AAD) to your own service principal inside `dbinstall/azure_data_explorer/main.tf`:
```
data "azuread_service_principal" "service-principle" {
display_name = "mw_iot_ADX-DB-Comparison"
}
```

Additionally, adjust the following fields depending on your performance needs:

```
resource "azurerm_kusto_cluster" "adxcompare" {
  ...

  sku {
    name     = "Dev(No SLA)_Standard_E2a_v4"
    capacity = 1
  }
}
```

To apply the infrastructure, run:
```bash
az login
terraform apply
```

Finally, create the kubernetes secret:

```bash
kubectl create secret generic adx-secret \
  --from-literal=adx_aad_app_id=<your aad app id> \
  --from-literal=adx_app_key=<your app key> \
  --from-literal=adx_authority_id=<your authority id> \
  -n default
```
To enable streaming, set the `batch` parameter of the config to `false`. Note that this does not apply to existing tables.
Streaming ingestion at the cluster level is enabled by default by the Terraform cluster config.
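The `batch` flag thus selects one of two ingestion paths. A minimal sketch of that selection is shown below; the mapping to azure-kusto-ingest client classes (`QueuedIngestClient` for batch, `ManagedStreamingIngestClient` for streaming) is noted as an assumption in the comments, and the config shape mirrors the target entry in `config.yaml`:

```python
def ingest_mode(target_config: dict) -> str:
    """Pick the ingestion path from the target config.

    Queued (batch) ingestion is the default; streaming is used only when
    `batch` is explicitly set to false. With azure-kusto-ingest this would
    presumably map to QueuedIngestClient vs. ManagedStreamingIngestClient.
    """
    return "queued" if target_config.get("batch", True) else "streaming"

# e.g. the azure_data_explorer target from config.yaml, with streaming enabled
mode = ingest_mode({"module": "azure_data_explorer", "batch": False})
```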


Although an emulator for ADX exists that can run locally, it is not recommended to use it for any kind of benchmark test, since its performance profile is very different. Moreover, benchmarking against the emulator is even prohibited by its license terms.


## Remarks on the databases

This is a collection of problems / observations collected over the course of the tests.
This is a collection of problems / observations collected over the course of the tests.
5 changes: 5 additions & 0 deletions config.yaml
@@ -56,4 +56,9 @@ targets:
  elasticsearch:
    module: elasticsearch
    endpoint: https://elastic-es-http.default.svc.cluster.local:9200
  azure_data_explorer:
    module: azure_data_explorer
    kusto_uri: https://adxcompare.westeurope.kusto.windows.net
    kusto_ingest_uri: https://ingest-adxcompare.westeurope.kusto.windows.net
    kusto_db: SampleDB
namespace: default
73 changes: 73 additions & 0 deletions dbinstall/azure_data_explorer/main.tf
@@ -0,0 +1,73 @@
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0.2"
    }
  }

  required_version = ">= 1.1.0"
}

provider "azurerm" {
  features {}
}


resource "azurerm_resource_group" "rg-compare" {
  name     = "db-performance-comparison"
  location = "West Europe"

  tags = {
    "cost center" = "IOT",
    environment   = "iot-lab"
  }
}

resource "azurerm_kusto_cluster" "adxcompare" {
  name                        = "adxcompare"
  location                    = azurerm_resource_group.rg-compare.location
  resource_group_name         = azurerm_resource_group.rg-compare.name
  streaming_ingestion_enabled = true

  # Change depending on your performance needs
  sku {
    name     = "Dev(No SLA)_Standard_E2a_v4"
    capacity = 1
  }

  tags = {
    "cost center" = "IOT",
    environment   = "iot-lab"
  }
}

resource "azurerm_kusto_database" "sample-db" {
  name                = "SampleDB"
  resource_group_name = azurerm_resource_group.rg-compare.name
  location            = azurerm_resource_group.rg-compare.location
  cluster_name        = azurerm_kusto_cluster.adxcompare.name

  hot_cache_period   = "P1D"
  soft_delete_period = "P1D"
}

data "azurerm_client_config" "current" {
}

# Change to your own service principal
data "azuread_service_principal" "service-principle" {
  display_name = "mw_iot_ADX-DB-Comparison"
}

resource "azurerm_kusto_database_principal_assignment" "ad-permission" {
  name                = "AD-Permission"
  resource_group_name = azurerm_resource_group.rg-compare.name
  cluster_name        = azurerm_kusto_cluster.adxcompare.name
  database_name       = azurerm_kusto_database.sample-db.name

  tenant_id      = data.azurerm_client_config.current.tenant_id
  principal_id   = data.azuread_service_principal.service-principle.id
  principal_type = "App"
  role           = "Admin"
}
4 changes: 4 additions & 0 deletions deployment/templates/collector.yaml
@@ -26,6 +26,10 @@ spec:
          value: "{{ .Values.target_module }}"
        - name: RUN_CONFIG
          value: "{{ .Values.run_config }}"
        envFrom:
        - secretRef:
            name: adx-secret

      terminationGracePeriodSeconds: 2
      nodeSelector:
{{ .Values.nodeSelector | toYaml | indent 8 }}
3 changes: 3 additions & 0 deletions deployment/templates/worker.yaml
@@ -37,6 +37,9 @@ spec:
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        envFrom:
        - secretRef:
            name: adx-secret
      restartPolicy: Never
      terminationGracePeriodSeconds: 2
      nodeSelector:
6 changes: 3 additions & 3 deletions simulator/Dockerfile
@@ -2,8 +2,8 @@ FROM python:3.8-alpine as base

FROM base as builder

RUN mkdir /install
RUN apk update && apk add postgresql-dev gcc python3-dev musl-dev libc-dev make git libffi-dev openssl-dev libxml2-dev libxslt-dev automake g++
WORKDIR /install
COPY requirements.txt /requirements.txt
RUN pip install --prefix=/install -r /requirements.txt
@@ -13,7 +13,7 @@ FROM base
COPY --from=builder /install /usr/local
COPY requirements.txt /
RUN pip install -r requirements.txt
RUN apk --no-cache add libpq libstdc++
ADD . /simulator
WORKDIR /simulator
CMD ["python", "main.py"]
3 changes: 3 additions & 0 deletions simulator/modules/__init__.py
@@ -20,5 +20,8 @@ def select_module():
    elif mod == "elasticsearch":
        from . import elasticsearch
        return elasticsearch
    elif mod == "azure_data_explorer":
        from . import azure_data_explorer
        return azure_data_explorer
    else:
        raise Exception(f"Unknown module: {mod}")