diff --git a/docs/backlog.md b/docs/backlog.md index d094264..0aa8f7b 100644 --- a/docs/backlog.md +++ b/docs/backlog.md @@ -4,6 +4,11 @@ - [o] Documentation: Convert documents to Markdown, and publish to RTD - [o] Improve OCI image building, using modern recipe - [o] Bring version numbers up to speed (in docs, for OCI images) +- [o] Query Timer's documentation says MongoDB adapter needs a patch!? + +## Iteration +1.5 +- [o] Fix typo `stdev`? +- [o] Active voice ## Iteration +2 - [o] Verify functionality on all cloud offerings diff --git a/docs/data-generator.md b/docs/data-generator.md index ecc6fb8..98256a7 100644 --- a/docs/data-generator.md +++ b/docs/data-generator.md @@ -1,152 +1,77 @@ # Data Generator -The Data Generator evolved from a solution for a use case to a standalone tool. This file describes how the -Data Generator can be setup and the functionality as well as explain different example use cases. - -## Table of Contents - -- [Data Generator](#data-generator) - * [General Information](#general-information) - + [About](#about) - + [How To](#how-to) - - [Pip install](#pip-install) - - [Docker Image](#docker-image) - + [Supported Databases](#supported-databases) - - [CrateDB](#cratedb) - - [InfluxDB](#influxdb) - - [TimescaleDB](#timescaledb) - - [MongoDB](#mongodb) - - [PostgreSQL](#postgresql) - - [AWS Timestream](#aws-timestream) - - [Microsoft SQL Server](#microsoft-sql-server) - * [Data Generator Configuration](#data-generator-configuration) - + [Environment variables configuring the behaviour of the Data Generator](#environment-variables-configuring-the-behaviour-of-the-data-generator) - - [CONCURRENCY](#concurrency) - - [ID_START](#id_start) - - [ID_END](#id_end) - - [INGEST_MODE](#ingest_mode) - * [INGEST_MODE False](#ingest_mode-false) - * [INGEST_MODE True](#ingest_mode-true) - - [INGEST_SIZE](#ingest_size) - - [TIMESTAMP_START](#timestamp_start) - - [TIMESTAMP_DELTA](#timestamp_delta) - - [SCHEMA](#schema) - - [BATCH_SIZE](#batch_size) - - [ADAPTER](#adapter) - - [STATISTICS_INTERVAL](#statistics_interval) - - [PROMETHEUS_PORT](#prometheus_port) - + [Environment variables used to configure different databases](#environment-variables-used-to-configure-different-databases) - - [ADDRESS](#address) - - [USERNAME](#username) - - [PASSWORD](#password) - - [DATABASE](#database) - - [TABLE](#table) - - [PARTITION](#partition) - + [Environment variables used to configure CrateDB](#environment-variables-used-to-configure-cratedb) - - [SHARDS](#shards) - - [REPLICAS](#replicas) - + [Environment variables used to configure TimescaleDB](#environment-variables-used-to-configure-timescaledb) - - [TIMESCALE_COPY](#timescale_copy) - - [TIMESCALE_DISTRIBUTED](#timescale_distributed) - + [Environment variables used to configure InfluxDB](#environment-variables-used-to-configure-influxdb) - - [TOKEN](#token) - - [ORG](#org) - + [Environment variables used to configure AWS Timestream](#environment-variables-used-to-configure-aws-timestream) - - [AWS_ACCESS_KEY_ID](#aws_access_key_id) - - [AWS_SECRET_ACCESS_KEY](#aws_secret_access_key) - - [AWS_REGION_NAME](#aws_region_name) - * [Data Generator Schemas](#data-generator-schemas) - + [Structure](#structure) - + [Complex Schema Example](#complex-schema-example) - + [Sensor Types](#sensor-types) - - [Float Sensor](#float-sensor) - - [Bool Sensor](#bool-sensor) - * [Batch-Size-Automator](#batch-size-automator) - * [Prometheus Metrics](#prometheus-metrics) - * [Example Use Cases](#example-use-cases) - + [Single channel](#single-channel) - + 
[Multiple channels](#multiple-channels) - * [Alternative data generators](#alternative-data-generators) - + [Why use this data generator over the alternatives?](#why-use-this-data-generator-over-the-alternatives) - + [cr8 + mkjson](#cr8--mkjson) - + [tsbs data generator](#tsbs-data-generator) - * [Glossary](#glossary) - -## General Information - -This chapter covers general information about the Data Generator, e.g. the supported databases and the basic workflow. - -### About - The Data Generator is a tool to generate timeseries data which adheres to a statistical model described by a [schema](#data-generator-schemas). -It can be used for both [populating a database](#ingest_mode) as well as -having a way to [continuously insert](#ingest_mode) timeseries data. - +It can be used for both [populating a database](#ingest-mode) as well as +having a way to [continuously insert](#ingest-mode) timeseries data. -### Install +## Installation -#### PyPI package - -The *Time Series Data Generator* `tsperf` package and can be installed using `pip install tsperf`. +:::{rubric} PyPI package +::: +The *Time Series Data Generator* `tsperf` package and can be installed using `pip`. +```shell +pip install tsperf +``` -#### Docker image +:::{rubric} OCI image +::: -Another way to use the Data Generator is to build the Docker image `tsperf`. +Another way to use the Data Generator is to build the OCI image `tsperf`. -+ navigate to root directory of this repository -+ build docker image with `docker build -t tsperf -f Dockerfile .` -+ Adapt one of the example docker-compose files in the [example folder](examples) -+ start (e.g. crate example) with `docker-compose -f examples/basic-cratedb.yml up` ++ Navigate to root directory of this repository. ++ Build docker image with `docker build -t tsperf -f Dockerfile .`. ++ Adapt one of the example docker-compose files in the [example folder]. ++ Start (e.g. CrateDB example) with `docker-compose -f examples/basic-cratedb.yml up`. -For an explanation on how to set the environment variables see [Environment variables](#data-generator-configuration). -For example use cases see [Example use cases](#example-use-cases) +Configure TSDG using [environment variables](#data-generator-configuration). +See also [example use cases](#example-use-cases). -### Usage +## Usage + Look at the default configuration of the Data Generator by executing `tsperf write --help` in a terminal. + Run the Data Generator with the desired configuration values by executing `tsperf write` in a terminal. -To look at example configurations navigate to the [example folder](../../examples). Each environment variable can be +To look at example configurations navigate to the [example folder]. Each environment variable can be overwritten by using the corresponding command line argument. ### Supported Databases -Currently 7 Databases are +Currently, 7 databases are supported. + + [CrateDB](https://crate.io/) + [InfluxDB V2](https://www.influxdata.com/) -+ [TimescaleDB](https://www.timescale.com/) ++ [Microsoft SQL Server](https://www.microsoft.com/de-de/sql-server) + [MongoDB](https://www.mongodb.com/) + [PostgreSQL](https://www.postgresql.org/) -+ [AWS Timestream](https://aws.amazon.com/timestream/) -+ [Microsoft SQL Server](https://www.microsoft.com/de-de/sql-server) ++ [TimescaleDB](https://www.timescale.com/) ++ [Timestream](https://aws.amazon.com/timestream/) Databases can be run either local or in the Cloud as both use cases are supported. Support for additional databases depends on the demand for it. 
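+As a quick orientation, a minimal invocation against a local CrateDB instance might
+look like the following sketch. The schema path is taken from the examples shipped
+with this repository; all settings are explained in the configuration chapter below.
+
+```shell
+# Sketch: write generated data into CrateDB running on localhost.
+export ADAPTER=cratedb
+export ADDRESS=localhost:4200
+export SCHEMA=tsperf.schema.factory.simple:machine.json
+tsperf write
+```
+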
The following chapters give an overview over the specific implementation for the different databases. +(dg-cratedb)= #### CrateDB -##### Client Library - -For CrateDB the [crate](https://pypi.org/project/crate/) library is used. To connect to CrateDB the following -environment variables must be set: +For CrateDB the [crate](https://pypi.org/project/crate/) library is used. In +order to connect to CrateDB, the following environment variables must be set: -+ [ADDRESS](#address): hostname including port e.g. `localhost:4200` -+ [USERNAME](#username): CrateDB username. -+ [PASSWORD](#password): password for CrateDB user. ++ [ADDRESS](#setting-dg-address): hostname including port e.g. `localhost:4200` ++ [USERNAME](#setting-dg-username): CrateDB username. ++ [PASSWORD](#setting-dg-password): password for CrateDB user. ##### Table Setup A table gets it's name either from the provided [schema](#data-generator-schemas) or from the environment variable -[TABLE](#table) +[TABLE](#setting-dg-table) A table for CrateDB has three columns: + `ts`: column containing a timestamp (occurrence of the payload) -+ `g_ts_'interval'`: column containing the `ts` value truncated to the value set with [PARTITION](#partition). It is ++ `g_ts_'interval'`: column containing the `ts` value truncated to the value set with [PARTITION](#setting-dg-partition). It is used to partition the table and generated by the DB. + `payload`: column containing the the values, is of type [OBJECT](https://crate.io/docs/crate/reference/en/latest/general/ddl/data-types.html#object) Dynamic. The concrete @@ -154,15 +79,15 @@ A table for CrateDB has three columns: Additional table configuration: -+ with [SHARDS](#shards) the amount of shards for the table can be configured -+ with [REPLICAS](#replicas) the amount of replicas for the table can be configured ++ with [SHARDS](#setting-dg-shards) the amount of shards for the table can be configured ++ with [REPLICAS](#setting-dg-replicas) the amount of replicas for the table can be configured ##### Insert Insert is done using the [unnest](https://crate.io/docs/crate/reference/en/latest/general/builtins/table-functions.html?#unnest-array-array) function of CrateDB. -##### Specifics +##### Notes + All columns and sub-columns are automatically indexed + Using an object column makes it possible to insert the values for multiple schemas into a single table (similar to a @@ -170,21 +95,20 @@ function of CrateDB. + Using `unnest` for the insert makes it possible to take the generated values without modification and insert them directly into the table. +(dg-influxdb)= #### InfluxDB -##### Client Library - For InfluxDB the [influx-client](https://pypi.org/project/influxdb-client/) library is used as the Data Generator only -supports InfluxDB V2. To connect to InfluxDB the following environment variables must be set: +supports InfluxDB V2. To connect to InfluxDB, the following environment variables must be set: -+ [ADDRESS](#address): Database address. Either a DSN URI, or `hostname:port`. -+ [TOKEN](#token): InfluxDB Read/Write token -+ [ORG](#org): InfluxDB organization ++ [ADDRESS](#setting-dg-address): Database address. Either a DSN URI, or `hostname:port`. 
++ [TOKEN](#setting-dg-token): InfluxDB Read/Write token ++ [ORG](#setting-dg-org): InfluxDB organization ##### Bucket Setup A bucket gets it's name either from the provided [schema](#data-generator-schemas) or from the environment variable -[TABLE](#table) +[TABLE](#setting-dg-table) If a bucket with the same name already exists on the given host this bucket is used to insert data otherwise a new bucket is created without retention rules (data is saved indefinitely) @@ -195,76 +119,57 @@ Insert into InfluxDB is done using the `Point` type from the `influxdb_client.cl [schema](#data-generator-schemas) are added to `Point.tag` (InfluxDB creates indices for tags). Measurements are saved to `Point.field`. The timestamp is added to `Point.time`. Multiple Points are then inserted in a batch. -##### Specifics +##### Notes + All tags are automatically indexed + Insert of multiple schemas into a single bucket is possible due to InfluxDB being a NoSQL database. + When using InfluxDB V2 with a usage-based plan insert is limited to 300MB/5m, this is about 15.000 data points per second. Excess data is dropped by InfluxDB and the client is not informed. -#### TimescaleDB - -##### Client Library -For TimescaleDB the [psycopg2](https://pypi.org/project/psycopg2/) library is used. As psycopg2 does not have the best -insert performance for TimescaleDB (see [here](https://docs.timescale.com/latest/tutorials/quickstart-python#insert_rows)), -to insert a lot of data points, it is advised to split [IDs](#id_start) over multiple data-generator instances. +(dg-mssql)= +#### Microsoft SQL Server -Note: starting with version `0.1.3` TimescaleDB uses the [pgcopy](https://pypi.org/project/pgcopy/) library by default -to enhance insert performance for single clients. To override the default setting you can set -[TIMESCALE_COPY](#timescale_copy) to `False`. +For Microsoft SQL Server the [pyodcb](https://github.com/mkleehammer/pyodbc) library is used. +If the Data Generator is run via `pip install` please ensure that `pyodbc` is properly installed on your system. -To connect with TimescaleDB the following environment variables must be set: +To connect to Microsoft SQL Server, the following environment variables must be set: -+ [ADDRESS](#address): Database address -+ [USERNAME](#username): username of TimescaleDB user -+ [PASSWORD](#password): password of TimescaleDB user -+ [DATABASE](#database): the database name with which to connect ++ [ADDRESS](#setting-dg-address): the host where Microsoft SQL Server is running in this [format](https://www.connectionstrings.com/azure-sql-database/) ++ [USERNAME](#setting-dg-username): Database user ++ [PASSWORD](#setting-dg-password): Password of the database user ++ [DATABASE](#setting-dg-database): the database name to connect to or create ##### Table Setup -A table gets it's name either from the provided [schema](#data-generator-schemas) or from the environment variable -[TABLE](#table) +A table gets it's name from the provided [schema](#data-generator-schemas) -A table for TimescaleDB consists of the following columns: +A table for Microsoft SQL Server consists of the following columns: + `ts`: column containing a timestamp (occurrence of the payload) -+ `ts_'interval'`: column containing the `ts` value truncated to the value set with [PARTITION](#partition). + a column for each entry in `tags` and `fields`. 
+ `tags` are of type `INTEGER` when using numbers and of type `TEXT` when using list notation
  + `fields` are of the type defined in the [schema](#data-generator-schemas)

-**If a table with the same name already exists which doesn't have the expected structure the data-generator will fail
-when inserting values.**
-
-Using this table a TimescaleDB Hypertable is created partitioned by the `ts` and `ts_'interval'` column
+**If a table or database with the same name already exists, it will be used by the data generator.**

##### Insert

-Insert is done in batches.
-
-##### Specifics
+The insert is done using the `executemany` function.

-+ No index is created, to query data indices must be created manually.
-+ Insert of multiple schemas into a single table is not possible as the table schema is only created once.
-+ psycopg2 does not have the best insert performance for TimescaleDB (see [here](https://docs.timescale.com/latest/tutorials/quickstart-python#insert_rows))
-  to insert a lot of data points, it is advised to split [IDs](#id_start) over multiple data-generator instances.
-+ TimescaleDB can be used with distributed hypertables. To test the data generator on hypertables, the
-  `TIMESCALE_DISTRIBUTED` environment variable must be set to `True`.

+(dg-mongodb)=
#### MongoDB

-##### Client Library
-
For MongoDB the [MongoClient](https://mongodb.github.io/node-mongodb-native/api-generated/mongoclient.html) library is
used.

-To connect with MongoDB the following environment variables must be set:
+To connect to MongoDB, the following environment variables must be set:

-+ [ADDRESS](#address): hostname (can include port if not standard MongoDB port is used)
-+ [USERNAME](#username): username of TimescaleDB user
-+ [PASSWORD](#password): password of TimescaleDB user
-+ [DATABASE](#database): The name of the MongoDB database that will be used
++ [ADDRESS](#setting-dg-address): hostname (can include a port if the standard MongoDB port is not used)
++ [USERNAME](#setting-dg-username): username of the MongoDB user
++ [PASSWORD](#setting-dg-password): password of the MongoDB user
++ [DATABASE](#setting-dg-database): The name of the MongoDB database that will be used

##### Collection Setup

@@ -281,34 +186,33 @@ A document in the collection consists of the following elements:

Insert is done using the `insert_many` function of the collection to insert documents in batches.

-##### Specifics
+##### Notes

+ MongoDB only creates a default index; other indices have to be created manually.
+ Insert of multiple schemas into a single collection is possible but it's advised to use different collections for
  each schema (same database is fine).

+(dg-postgresql)=
#### PostgreSQL

-##### Client Library
-
For PostgreSQL the [psycopg2](https://pypi.org/project/psycopg2/) library is used.
-To connect with PostgreSQL the following environment variables must be set:
+To connect to PostgreSQL, the following environment variables must be set:

-+ [ADDRESS](#address): hostname
-+ [USERNAME](#username): username of TimescaleDB user
-+ [PASSWORD](#password): password of TimescaleDB user
-+ [DATABASE](#database): the database name with which to connect
++ [ADDRESS](#setting-dg-address): hostname
++ [USERNAME](#setting-dg-username): username of the PostgreSQL user
++ [PASSWORD](#setting-dg-password): password of the PostgreSQL user
++ [DATABASE](#setting-dg-database): the database name with which to connect

##### Table Setup

A table gets its name either from the provided [schema](#data-generator-schemas) or from the environment variable
-[TABLE](#table).
+[TABLE](#setting-dg-table).

A table for PostgreSQL consists of the following columns:

+ `ts`: column containing a timestamp (occurrence of the payload)
-+ `ts_'interval'`: column containing the `ts` value truncated to the value set with [PARTITION](#partition).
++ `ts_'interval'`: column containing the `ts` value truncated to the value set with [PARTITION](#setting-dg-partition).
+ a column for each entry in `tags` and `fields`.
  + `tags` are of type `INTEGER` when using numbers and of type `TEXT` when using list notation
  + `fields` are of the type defined in the [schema](#data-generator-schemas)

@@ -320,131 +224,149 @@ when inserting values.**

Insert is done in batches.

-##### Specifics
+##### Notes

+ No index is created; to query data, indices must be created manually.
+ Insert of multiple schemas into a single table is not possible as the table schema is only created once.

-#### AWS Timestream
+(dg-timescaledb)=
+#### TimescaleDB

-##### Client Library
+For TimescaleDB the [psycopg2](https://pypi.org/project/psycopg2/) library is used. As psycopg2 does not have the best
+insert performance for TimescaleDB (see [here](https://docs.timescale.com/latest/tutorials/quickstart-python#insert-rows-of-data)),
+it is advised to split [IDs](#setting-dg-id-start) over multiple data-generator instances when inserting a lot of data points.

-For AWS Timestream the [boto3](https://github.com/boto/boto3) library is used.

+Note: starting with version `0.1.3`, the TimescaleDB adapter uses the [pgcopy](https://pypi.org/project/pgcopy/) library by default
+to enhance insert performance for single clients. To override the default setting, you can set
+[TIMESCALE_COPY](#setting-dg-timescale-copy) to `False`.
-To connect with AWS Timestream the following environment variables must be set: +To connect to TimescaleDB, the following environment variables must be set: -+ [AWS_ACCESS_KEY_ID](#aws_access_key_id): AWS Access Key ID -+ [AWS_SECRET_ACCESS_KEY](#aws_secret_access_key): AWS Secret Access Key -+ [AWS_REGION_NAME](#aws_region_name): AWS Region -+ [DATABASE](#database): the database name to connect to or create ++ [ADDRESS](#setting-dg-address): Database address ++ [USERNAME](#setting-dg-username): username of TimescaleDB user ++ [PASSWORD](#setting-dg-password): password of TimescaleDB user ++ [DATABASE](#setting-dg-database): the database name with which to connect ##### Table Setup -A table gets it's name from the provided [schema](#data-generator-schemas) +A table gets it's name either from the provided [schema](#data-generator-schemas) or from the environment variable +[TABLE](#setting-dg-table) -A table for AWS Timestream consists of the following columns: +A table for TimescaleDB consists of the following columns: -+ A column for each `tag` in the provided schema -+ All columns necessary for the AWS Timestream - [datamodel](https://docs.aws.amazon.com/timestream/latest/developerguide/getting-started.python.code-samples.write-data.html): ++ `ts`: column containing a timestamp (occurrence of the payload) ++ `ts_'interval'`: column containing the `ts` value truncated to the value set with [PARTITION](#setting-dg-partition). ++ a column for each entry in `tags` and `fields`. + + `tags` are of type `INTEGER` when using numbers and of type `TEXT` when using list notation + + `fields` are of the type defined in the [schema](#data-generator-schemas) -**If a table or database with the same name already exists it will be used by the data generator** +**If a table with the same name already exists which doesn't have the expected structure the data-generator will fail +when inserting values.** + +Using this table a TimescaleDB Hypertable is created partitioned by the `ts` and `ts_'interval'` column ##### Insert -The insert is done according to the optimized write [documentation](https://docs.aws.amazon.com/timestream/latest/developerguide/getting-started.python.code-samples.write-data-optimized.html). Values are grouped by their tags and inserted in batches. As AWS Timestream has a default limit of 100 values per batch, the batch is limited to have a maximum size of 100. +Insert is done in batches. -##### Specifics +##### Notes -Tests show that about 600 values per second can be inserted by a single data generator instance. So the data schema has -to be adjusted accordingly to not be slower than the settings require. - -For example, having 600 sensors with each 2 fields, and each should write a single value each second, would require -1200 values/s of insert speed. For satisfying this scenario, at least 2 data generator instance would be needed. - -#### Microsoft SQL Server ++ No index is created, to query data indices must be created manually. ++ Insert of multiple schemas into a single table is not possible as the table schema is only created once. ++ psycopg2 does not have the best insert performance for TimescaleDB (see [here](https://docs.timescale.com/latest/tutorials/quickstart-python#insert-rows-of-data)) + to insert a lot of data points, it is advised to split [IDs](#setting-dg-id-start) over multiple data-generator instances. ++ TimescaleDB can be used with distributed hypertables. 
To test the data generator on hypertables, the + `TIMESCALE_DISTRIBUTED` environment variable must be set to `True`. -##### Client Library +(dg-timestream)= +#### Timestream -For Microsoft SQL Server the [pyodcb](https://github.com/mkleehammer/pyodbc) library is used. -If the Data Generator is run via `pip install` please ensure that `pyodbc` is properly installed on your system. +For AWS Timestream the [boto3](https://github.com/boto/boto3) library is used. -To connect with Microsoft SQL Server the following environment variables must be set: +To connect to AWS Timestream, the following environment variables must be set: -+ [ADDRESS](#address): the host where Microsoft SQL Server is running in this [format](https://www.connectionstrings.com/azure-sql-database/) -+ [USERNAME](#username): Database user -+ [PASSWORD](#password): Password of the database user -+ [DATABASE](#database): the database name to connect to or create ++ [AWS_ACCESS_KEY_ID](#setting-dg-aws_access_key_id): AWS Access Key ID ++ [AWS_SECRET_ACCESS_KEY](#setting-dg-aws_secret_access_key): AWS Secret Access Key ++ [AWS_REGION_NAME](#setting-dg-aws_region_name): AWS Region ++ [DATABASE](#setting-dg-database): the database name to connect to or create ##### Table Setup A table gets it's name from the provided [schema](#data-generator-schemas) -A table for Microsoft SQL Server consists of the following columns: +A table for AWS Timestream consists of the following columns: -+ `ts`: column containing a timestamp (occurrence of the payload) -+ a column for each entry in `tags` and `fields`. - + `tags` are of type `INTEGER` when using numbers and of type `TEXT` when using list notation - + `fields` are of the type defined in the [schema](#data-generator-schemas) ++ A column for each `tag` in the provided schema ++ All columns necessary for the AWS Timestream + [datamodel](https://docs.aws.amazon.com/timestream/latest/developerguide/getting-started.python.code-samples.write-data.html): **If a table or database with the same name already exists it will be used by the data generator** ##### Insert -The insert is done using the `executemany` function +The insert is done according to the optimized write [documentation](https://docs.aws.amazon.com/timestream/latest/developerguide/getting-started.python.code-samples.write-data-optimized.html). Values are grouped by their tags and inserted in batches. As AWS Timestream has a default limit of 100 values per batch, the batch is limited to have a maximum size of 100. -## Data Generator Configuration +##### Notes -The Data Generator is mostly configured by setting Environment Variables (or command line arguments start with `-h` for -more information). This chapter lists all available Environment Variables and explains their use in the Data Generator. +Tests show that about 600 values per second can be inserted by a single data generator instance. So the data schema has +to be adjusted accordingly to not be slower than the settings require. -### Environment variables configuring the behaviour of the Data Generator +For example, having 600 sensors with each 2 fields, and each should write a single value each second, would require +1200 values/s of insert speed. For satisfying this scenario, at least 2 data generator instance would be needed. + -The environment variables in this chapter are used to configure the behaviour of the data generator. 
-#### CONCURRENCY +(data-generator-configuration)= +## Configuration -Type: Integer +The Data Generator is mostly configured by setting environment variables or +command line arguments. Start with `-h` for more information. This chapter +lists all available environment variables, and explains their use in the +Data Generator. -Value: A positive number. +### Environment Variables -Default: 1 +Configure the behaviour of the data generator using environment variables. -The Data Generator will split the insert into as many threads as this variable indicates. It will use `threading.Threads` -as +#### CONCURRENCY -#### ID_START +:Type: Integer +:Value: A positive number. +:Default: 1 -Type: Integer +The Data Generator will split the insert into as many threads as this variable +indicates. -Value: A positive number. Must be smaller than [ID_END](#id_end). +(setting-dg-id-start)= +#### ID_START -Default: 1 +:Type: Integer +:Value: A positive number. Must be smaller than [ID_END](#setting-dg-id-end). +:Default: 1 The Data Generator will create `(ID_END + 1) - ID_START` channels. +(setting-dg-id-end)= #### ID_END -Type: Integer - -Value: A positive number. Must be greater than [ID_START](#id_start). - -Default: 500 +:Type: Integer +:Value: A positive number. Must be greater than [ID_START](#setting-dg-id-start). +:Default: 500 The Data Generator will create `(ID_END + 1) - ID_START` channels. +(ingest-mode)= #### INGEST_MODE -Type: Boolean - -Values: False or True - -Default: True +:Type: Boolean +:Value: False or True +:Default: True +(ingest-mode-false)= ##### INGEST_MODE False When `INGEST_MODE` is set to `False` the Data Generator goes into "steady load"-mode. This means for all channels -controlled by the Data Generator an insert is performed each [TIMESTAMP_DELTA](#timestamp_delta) seconds. +controlled by the Data Generator an insert is performed each [TIMESTAMP_DELTA](#setting-dg-timestamp-delta) seconds. **Note: If too many channels are controlled by one Data Generator instance so an insert cannot be performed in the timeframe set by `TIMESTAMP_DELTA` it is advised to split half the IDs to a separate Data Generator instance. @@ -454,22 +376,24 @@ With this configuration the [Batch Size Automator](#batch-size-automator) is dis [Prometheus metrics](#prometheus-metrics) g_insert_time, g_rows_per_second, g_best_batch_size, g_best_batch_rps, will stay at 0. +(ingest-mode-true)= ##### INGEST_MODE True When `INGEST_MODE` is set to `True` the Data Generator goes into "burst insert"-mode. This means it tries to insert as many values as possible. This mode is used populate a database and can be used to measure insert performance. Using this mode results in values inserted into the database at a faster rate than defined by -[TIMESTAMP_DELTA](#timestamp_delta) but the timestamp values will still adhere to this value and be in the defined time -interval. This means that if [TIMESTAMP_START](#timestamp_start) is not set to a specific value timestamps will point to the future. -By adjusting `TIMESTAMP_START` to a timestamp in the past in combination with a limited [INGEST_SIZE](#ingest_size) the rang of +[TIMESTAMP_DELTA](#setting-dg-timestamp-delta) but the timestamp values will still adhere to this value and be in the defined time +interval. This means that if [TIMESTAMP_START](#setting-dg-timestamp-start) is not set to a specific value timestamps will point to the future. 
+By adjusting `TIMESTAMP_START` to a timestamp in the past in combination with a limited [INGEST_SIZE](#setting-dg-ingest-size) the rang of timestamps can be controlled. -When [BATCH_SIZE](#batch_size) is set to a value smaller or equal to 0 the [Batch Size Automator](#batch-size-automator) +When [BATCH_SIZE](#setting-dg-batch-size) is set to a value smaller or equal to 0 the [Batch Size Automator](#batch-size-automator) is activated. This means that the insert performance is supervised and the batch size adjusted to have a fast insert speed. If the value is greater than 0 the batch size will be fixed at this value and the Batch Size Automator will be disabled. +(setting-dg-ingest-size)= #### INGEST_SIZE Type: Integer @@ -492,6 +416,7 @@ We have 500 channels and for each channel 2000 values are generated, therefore w **Note: a value contains all the information for a single channel, including the defined `tags` and `fields`. See [Data Generator Schemas](#data-generator-schemas) for more information about tags and fields.** +(setting-dg-timestamp-start)= #### TIMESTAMP_START Type: Integer @@ -500,14 +425,20 @@ Values: A valid UNIX timestamp Default: timestamp at the time the Data Generator was started -This variable defines the first timestamp used for the generated values. When using -[INGEST_MODE True](#ingest_mode-true) all following timestamps have an interval to the previous timestamp by the value -of [TIMESTAMP_DELTA](#timestamp_delta). When using [INGEST_MODE False](#ingest_mode-false) the second insert happens when -`TIMESTAMP_START + TIMESTAMP_DELTA` is equal or bigger than the current timestamp (real life). +This variable defines the first timestamp used for the generated values. + +When using [INGEST_MODE True](#ingest-mode-true), all following timestamps have +an interval to the previous timestamp by the value of +[TIMESTAMP_DELTA](#setting-dg-timestamp-delta). -This means that if `TIMESTAMP_START` is set to the future no inserts will happen until the -`TIMESTAMP_START + TIMESTAMP_DELTA` timestamp is reached. +When using [INGEST_MODE False](#ingest-mode-false), the second insert happens +when `TIMESTAMP_START + TIMESTAMP_DELTA` is equal or bigger than the current +timestamp (real life). +This means that if `TIMESTAMP_START` is set to the future, no inserts will +happen until the `TIMESTAMP_START + TIMESTAMP_DELTA` timestamp is reached. + +(setting-dg-timestamp-delta)= #### TIMESTAMP_DELTA Type: Float @@ -518,36 +449,34 @@ Default: 0.5 The value of `TIMESTAMP_DELTA` defines the interval between timestamps of the generated values. +(setting-dg-schema)= #### SCHEMA -Type: String - -Values: Either relative or absolute path to a schema in JSON format (see - [Data Generator Schemas](#data-generator-schemas) for more information on schemas). +:Type: String +:Value: Either relative or absolute path to a schema in JSON format. See + [Data Generator Schemas](#data-generator-schemas) for more information on schemas. +:Default: empty string -Default: empty string - -When using a relative path with the docker image be sure to checkout the [Dockerfile](Dockerfile) to be sure to use the -correct path. +When using a relative path with the docker image, be sure to check out the +`Dockerfile` to make sure to use the correct path. +(setting-dg-batch-size)= #### BATCH_SIZE -Type: Integer - -Values: Any number. - -Default: -1 +:Type: Integer +:Value: Any number. +:Default: -1 -`BATCH_SIZE` is only taken into account when [INGEST MODE](#ingest_mode) is set to `True`. 
The value of `BATCH_SIZE` +`BATCH_SIZE` is only taken into account when [INGEST MODE](#ingest-mode) is set to `True`. The value of `BATCH_SIZE` defines how many rows will be inserted with one insert statement. If the value is smaller or equal to `0` the [Batch Size Automator](#batch-size-automator) will take control over the batch size and dynamically adjusts the batch size to get the best insert performance. +(setting-dg-adapter)= #### ADAPTER -Type: String - -Values: cratedb|timescaledb|influxdb|mongodb|postgresql|timestream|mssql +:Type: String +:Value: `cratedb|timescaledb|influxdb|mongodb|postgresql|timestream|mssql The value will define which database adapter to use: + Amazon Timestream @@ -560,89 +489,69 @@ The value will define which database adapter to use: #### STATISTICS_INTERVAL -Type: Integer +Print statistics of average function execution time every `STATISTICS_INTERVAL` seconds. -Values: A positive number - -Default: 30 - -Prints statistics of average function execution time every `STATISTICS_INTERVAL` seconds. +:Type: Integer +:Value: A positive number +:Default: 30 #### PROMETHEUS_PORT -Type: Integer - -Values: 1 to 65535 - -Default: 8000 - The port that is used to publish Prometheus metrics. -### Environment variables used to configure different databases +:Type: Integer +:Value: 1 to 65535 +:Default: 8000 -The environment variables in this chapter are used by different databases to connect and configure them. Each entry will -contain the databases for which they are used and example values. For collected information for a single database see -the chapters for the database: -+ [CrateDB](#cratedb) -+ [InfluxDB](#influxdb) -+ [TimescaleDB](#timescaledb) -+ [MongoDB](#mongodb) -+ [AWS Timestream](#aws-timestream) -+ [Postgresql](#postgresql) -+ [MS SQL Server](#microsoft-sql-server) -#### ADDRESS +### Database Settings -Type: String +Environment variables to configure database connectivity. -Values: Database address (DSN URI, hostname:port) according to the database client requirements - -**CrateDB:** - -Host must include port, e.g.: `"localhost:4200"` +(setting-dg-address)= +#### ADDRESS -**TimescaleDB, Postgresql and InfluxDB:** +:Type: String +:Value: Database address (DSN URI, hostname:port) according to the database client requirements -Host must be hostname excluding port, e.g.: `"localhost"` +:::{note} +**CrateDB:** Host must include port, e.g.: `"localhost:4200"` -**MongoDB:** +**TimescaleDB, Postgresql and InfluxDB:** Host must be hostname excluding port, e.g.: `"localhost"` -Host can be either without port (e.g. `"localhost"`) or with port (e.g. `"localhost:27017"`) +**MongoDB:** Host can be either without port (e.g. `"localhost"`) or with port (e.g. `"localhost:27017"`) -**MSSQL:** - -Host must start with `tcp:` +**MSSQL:** Host must start with `tcp:` +::: +(setting-dg-username)= #### USERNAME -Type: String - -Values: username of user used for authentication against the database - -Default: None +:Type: String +:Value: Username to authenticate against the database +:Default: None -used with CrateDB, TimescaleDB, MongoDB, Postgresql, MSSQL. +Used with CrateDB, TimescaleDB, MongoDB, Postgresql, MSSQL. +(setting-dg-password)= #### PASSWORD -Type: String +:Type: String +:Value: Password to authenticate against the database +:Default: None -Values: password of user used for authentication against the database - -Default: None - -used with CrateDB, TimescaleDB, MongoDB, Postgresql, MSSQL. +Used with CrateDB, TimescaleDB, MongoDB, Postgresql, MSSQL. 
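+For example, address and credential settings might be combined like this (purely
+illustrative host names and credentials; note the different ADDRESS formats
+described above):
+
+```shell
+# CrateDB expects the port as part of ADDRESS.
+ADAPTER=cratedb ADDRESS=localhost:4200 USERNAME=crate PASSWORD=secret tsperf write
+# TimescaleDB and PostgreSQL expect the hostname only.
+ADAPTER=timescaledb ADDRESS=localhost USERNAME=tsdb PASSWORD=secret tsperf write
+# Microsoft SQL Server addresses must start with "tcp:".
+ADAPTER=mssql ADDRESS=tcp:localhost USERNAME=sa PASSWORD=secret tsperf write
+```
+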
+(setting-dg-database)= #### DATABASE -Type: String - -Values: Name of the database where table will be created - -Default: empty string +:Type: String +:Value: Name of the database where table will be created +:Default: empty string -used with InfluxDB, TimescaleDB, MongoDB, AWS Timestream, Postgresql, MSSQL. +Used with InfluxDB, TimescaleDB, MongoDB, AWS Timestream, Postgresql, MSSQL. +:::{note} **InfluxDB:** This is an optional parameter for InfluxDB. In case it is set the Bucket where the values are inserted will use the value of `DATABASE` as name. If `DATABASE` is empty string than the name of the schema (see @@ -657,132 +566,131 @@ The value of `DATABASE` is used as the database parameter of MongoDB. **AWS Timestream:** The value of `DATABASE` is used as the database parameter of AWS Timestream. +::: +(setting-dg-table)= #### TABLE -Type: String +:Type: String +:Value: Name of the table where values are stored +:Default: empty string -Values: Name of the table where values are stored - -Default: empty string - -used with CrateDB, Postgresql, MSSQL and TimescaleDB. It is an optional parameter to overwrite the default table name -defined in the schema (see [Data Generator Schemas](#data-generator-schemas)). +Used with CrateDB, PostgreSQL, MSSQL, and TimescaleDB. It is an optional parameter +to overwrite the default table name defined in the schema. See also +[Data Generator Schemas](#data-generator-schemas). +(setting-dg-partition)= #### PARTITION -Type: String - -Values: second, minute, hour, day, week, month, quarter, year - -Default: week +:Type: String +:Value: second, minute, hour, day, week, month, quarter, year +:Default: week -used with CrateDB, Postgresql and TimescaleDB. Is used to define an additional Column to partition the table. For +Used with CrateDB, Postgresql and TimescaleDB. Is used to define an additional Column to partition the table. For example, when using `week` an additional column is created (Crate: `g_ts_week`, Timescale/Postgres `ts_week`) and the value from the `ts` column is truncated to its week value. -### Environment variables used to configure CrateDB -The environment variables in this chapter are only used to configure CrateDB +(cratedb-settings)= +### CrateDB Settings -#### SHARDS - -Type: Integer +The environment variables in this chapter are only used to configure CrateDB. -Values: positive number +(setting-dg-shards)= +#### SHARDS -Default: 4 +:Type: Integer +:Value: positive number +:Default: 4 Defines how many [shards](https://crate.io/docs/crate/reference/en/latest/general/ddl/sharding.html) will be used. +(setting-dg-replicas)= #### REPLICAS -Type: Integer - -Values: positive number - -Default: 0 +:Type: Integer +:Value: positive number +:Default: 0 Defines how many [replicas](https://crate.io/docs/crate/reference/en/latest/general/ddl/replication.html) for the table will be created. -### Environment variables used to configure TimescaleDB - -The environment variables in this chapter are only used to configure TimescaleDB -#### TIMESCALE_COPY - -Type: Boolean - -Values: True or False - -Default: True +(influxdb-settings)= +### InfluxDB Settings -Defines if Timescale insert uses `pgcopy` or not. - -#### TIMESCALE_DISTRIBUTED +The environment variables in this chapter are only used to configure InfluxDB. 
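+A complete InfluxDB configuration might look like the following sketch (hypothetical
+token and organization values; the individual settings are described below):
+
+```shell
+# Sketch: write into a local InfluxDB V2 instance.
+export ADAPTER=influxdb
+export ADDRESS=localhost
+export TOKEN=my-read-write-token   # hypothetical value
+export ORG=my-org                  # hypothetical value
+tsperf write
+```
+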
-Type: Boolean +(setting-dg-token)= +#### TOKEN -Values: True or False +:Type: String +:Value: token gotten from InfluxDB V2 +:Default: empty string -Default: False +Influx V2 uses [token](https://v2.docs.influxdata.com/v2.0/security/tokens/view-tokens/) based authentication. -Defines if Timescale is used with distributed hypertables or not. +(setting-dg-org)= +#### ORG -### Environment variables used to configure InfluxDB +:Type: String +:Value: org_id gotten from InfluxDB V2 +:Default: empty string -The environment variables in this chapter are only used to configure InfluxDB +Influx V2 uses [organizations](https://v2.docs.influxdata.com/v2.0/organizations/) to manage buckets. -#### TOKEN -Type: String +(timescaledb-settings)= +### TimescaleDB Settings -Values: token gotten from InfluxDB V2 +The environment variables in this chapter are only used to configure TimescaleDB. -Default: empty string +(setting-dg-timescale-copy)= +#### TIMESCALE_COPY -Influx V2 uses [token](https://v2.docs.influxdata.com/v2.0/security/tokens/view-tokens/) based authentication. +:Type: Boolean +:Value: True or False +:Default: True -#### ORG +Defines if Timescale insert uses `pgcopy` or not. -Type: String +#### TIMESCALE_DISTRIBUTED -Values: org_id gotten from InfluxDB V2 +:Type: Boolean +:Value: True or False +:Default: False -Default: empty string +Defines if Timescale is used with distributed hypertables or not. -Influx V2 uses [organizations](https://v2.docs.influxdata.com/v2.0/organizations/) to manage buckets. -### Environment variables used to configure AWS Timestream +(timestream-settings)= +### Timestream Settings -The environment variables in this chapter are only used to configure AWS Timestream +The environment variables in this chapter are only used to configure AWS Timestream. +(setting-dg-aws_access_key_id)= #### AWS_ACCESS_KEY_ID -Type: String - -Values: AWS Access Key ID - -Default: empty string +:Type: String +:Value: AWS Access Key ID +:Default: empty string +(setting-dg-aws_secret_access_key)= #### AWS_SECRET_ACCESS_KEY -Type: String - -Values: AWS Secret Access Key - -Default: empty string +:Type: String +:Value: AWS Secret Access Key +:Default: empty string +(setting-dg-aws_region_name)= #### AWS_REGION_NAME -Type: String +:Type: String +:Value: AWS region name +:Default: empty string -Values: AWS region name -Default: empty string - -## Data Generator Schemas +## Schemas The Data Generator uses schemas to determine what kind of values to generate. These schemas are described in JSON files. This chapter explains how to write @@ -790,13 +698,14 @@ schemas based on examples. ### Structure -A Data Generator Schema is a JSON-file which must contain one object (a second key `description` can be used to explain -the schema). The key for this object will be the default value of [TABLE](#table). For example the following -JSON Object would contain a schema called `button_sensor`. +A Data Generator Schema is a file in JSON format, which must contain one object. +A second key `description` can be used to explain the schema. The key for this +object will be the default value of [TABLE](#setting-dg-table). For example, the +following JSON object would contain a schema called `button_sensor`. -```JSON +```json { - "button_sensor": {"..."} + "button_sensor": {"..."} } ``` @@ -805,16 +714,16 @@ thing where the measured values come from. `fields` describe all measured values `button_sensor` schema. A sensor is identified by a `sensor_id` (we keep it simple for the first example). 
And has a single metric `button_press`: -```JSON +```json { - "button_sensor": { - "tags": { - "sensor_id": "id" - }, - "fields": { - "button_press": {"..."} - } + "button_sensor": { + "tags": { + "sensor_id": "id" + }, + "fields": { + "button_press": {"..."} } + } } ``` @@ -822,33 +731,32 @@ As you can see the `sensor_id` tag gets the value `"id"` this means it will be t with the values between `ID_START` and `ID_END`. The `button_sensor` metric is another object describing how the value of this object should be calculated and saved to the database. -```JSON +```json { - "button_sensor": { - "tags": { - "sensor_id": "id" + "button_sensor": { + "tags": { + "sensor_id": "id" + }, + "fields": { + "button_press": { + "key": { + "value": "button_pressed" + }, + "type": { + "value": "BOOL" }, - "fields": { - "button_press": { - "key": { - "value": "button_pressed" - }, - "type": { - "value": "BOOL" - }, - "true_ratio": { - "value": 0.001 - } - } + "true_ratio": { + "value": 0.001 } + } } + } } ``` The `button_press` metric is of type `BOOL` and has a true_ratio of `0.001` which means it is true in 1 out of 1000 cases. Go to [Sensor Types](#sensor-types) to get a more detailed overview over the different Sensor types. Or look at -[motor.json](../schema/basic/motor.json) or [environment.json](../schema/basic/environment.json) for examples -containing schema descriptions. +[motor.json] or [environment.json] for examples containing schema descriptions. This is the basic structure of a Data Generator Schema. It can contain any amount of tags and fields, but row/document size increases with each add value, as well as calculation time with each metric. @@ -859,113 +767,68 @@ size increases with each add value, as well as calculation time with each metric `tags` can also be defined as a list of values, e.g. `["AT", "BE", "CH", "DE", "ES", "FR", "GB", "HU", "IT", "JP"]`. This then uses the values in the array to setup the tags. -### Complex Schema Example - -**Use Case:** - -We want to describe channels for: - -+ We have **50 plants** -+ Each plant contains of **5 lines** -+ Each line consists of **10 machines** -+ Each machine has **5 different sensors**: - + voltage - + current - + temperature - + power - + vibration - -**Solution:** - -The first thing we need to identify how many channels we have in total: `50 plants * 5 lines * 10 machines = 2500 channels total`. -Now we know that we have to use 2500 IDs so we set `ID_START=1` and `ID_END=2500`. Then we create our schema: - -```JSON -{ - "example": { - "tags": { - "plant": 50, - "line": 5, - "machine": "id" - }, - "fields": { - "voltage": {"..."}, - "current": {"..."}, - "temperature": {"..."}, - "power": {"..."}, - "vibration": {"..."} - } - } -} -``` - -As you see even a complex schema isn't that complicated to write. The main limitation of the Data Generator is that one -instance can only take a single schema. But when combining multiple instances it is easily possible to simulate complex -setups spanning multiple factories with multiple machinery. - ### Sensor Types This chapter describes the available Sensor types, what values they use and the projected output. #### Float Sensor -To generate real world resembling float values the *floating value simulator* library is used. +To generate real world resembling float values, the [](#float-value-simulator) library is used. -##### Schema +We describe all the keys and the corresponding values a schema for a Float Sensor must contain. 
+All elements are associated with the key under the `fields` object (see [here](#structure)). +So for example we already have this schema: -We describe all the keys and the corresponding values a schema for a Float Sensor must contain. All elements are -associated with the key under the `fields` object (see [here](#structure)). So for example we already have this schema: - -```JSON +```json { - "example": { - "tags": { - "plant": 50, - "line": 5, - "machine": "id" - }, - "fields": { - "voltage": {"..."}, - "current": {"..."}, - "temperature": {"..."}, - "power": {"..."}, - "vibration": {"..."} - } + "example": { + "tags": { + "plant": 50, + "line": 5, + "machine": "id" + }, + "fields": { + "voltage": {"..."}, + "current": {"..."}, + "temperature": {"..."}, + "power": {"..."}, + "vibration": {"..."} } + } } ``` Now we decide that `voltage` is a Float Sensor and describe it in the schema like this: -```JSON +```json { - "key": { - "value": "voltage" - }, - "type": { - "value": "FLOAT" - }, - "min": { - "value": 200 - }, - "max": { - "value": 260 - }, - "mean": { - "value": 230 - }, - "stdev": { - "value": 10 - }, - "variance": { - "value": 0.03 - }, - "error_rate": { - "value": 0.001 - }, - "error_length": { - "value": 2.5 - } + "key": { + "value": "voltage" + }, + "type": { + "value": "FLOAT" + }, + "min": { + "value": 200 + }, + "max": { + "value": 260 + }, + "mean": { + "value": 230 + }, + "stdev": { + "value": 10 + }, + "variance": { + "value": 0.03 + }, + "error_rate": { + "value": 0.001 + }, + "error_length": { + "value": 2.5 + } } ``` @@ -980,74 +843,123 @@ Let's explain what this means. We take the normal voltage output of a european s + error_rate: The sensor has a 1:1000 chance to malfunction and report wrong values + error_length: When the sensor malfunctions and reports a wrong value on average the next 2.5 values are also wrong -Using this schema to generate 100.000 values the curve will look something like this (the spikes are the errors): - -![Image of docu_example schema curve](https://user-images.githubusercontent.com/453543/118527328-76dec400-b741-11eb-99bc-9852e244996f.png) +Using this schema to generate 100.000 values, the curve will look something +like this (the spikes are the errors): -And the value distribution of this curve will look something like this: +![Image of docu_example schema curve](https://user-images.githubusercontent.com/453543/118527328-76dec400-b741-11eb-99bc-9852e244996f.png){width=480} -![Image of docu_example schema value distribution](https://user-images.githubusercontent.com/453543/118527891-03898200-b742-11eb-885f-af5a094c23f5.png) +The value distribution of this curve will look something like this: -#### Bool Sensor +![Image of docu_example schema value distribution](https://user-images.githubusercontent.com/453543/118527891-03898200-b742-11eb-885f-af5a094c23f5.png){width=480} -The Bool Sensor produces boolean values according to a given ration. +#### Boolean Sensor -##### Schema - -The schema for a Bool Sensor is pretty simple, we look at the example we created when we described the +The boolean sensor produces boolean values according to a given ratio. The +schema for a boolean sensor is pretty simple, see also [Data Generator Schema](#structure). 
-```JSON -"button_press": { +```json +{ + "button_press": { "key": { - "value": "button_pressed" + "value": "button_pressed" }, "type": { - "value": "BOOL" + "value": "BOOL" }, "true_ratio": { - "value": 0.001 + "value": 0.001 } + } } ``` -+ key: the column/sub-column/field name where the value will be written -+ type: the type of sensor to use (currently `FLOAT` and `BOOL` are supported) -+ true_ratio: the ratio how many time the Bool Sensor will generate the value `True`. E.g. if value is `1` the sensor - will output `True` every time. If the value is `0.5` the output will be 50% `True` and 50% `False`. +:key: + The column/sub-column/field name where the value will be written. +:type: + The type of sensor to use. Currently, `FLOAT` and `BOOL` are supported. +:true_ratio: + The ratio how many time the Bool Sensor will generate the value `True`. + E.g. if value is `1`, the sensor will output `True` every time. If the value + is `0.5`, the output will be 50% `True` and 50% `False`. -## Batch-Size-Automator +### Complex Schema Example -To optimize ingest performance the [BSA](https://pypi.org/project/batch-size-automator/) library is used +:::{rubric} Use Case +::: +We want to describe channels for: -### Setup ++ We have **50 plants** ++ Each plant contains of **5 lines** ++ Each line consists of **10 machines** ++ Each machine has **5 different sensors**: + + voltage + + current + + temperature + + power + + vibration -The BSA is only active when [INGEST_MODE](#ingest_mode) is set to `1` and [BATCH_SIZE](#batch_size) has a value smaller -or equal `0`. When activated everything else is done automatically. +:::{rubric} Solution +::: +The first thing we need to identify how many channels we have in total: `50 plants * 5 lines * 10 machines = 2500 channels total`. +Now we know that we have to use 2500 IDs so we set `ID_START=1` and `ID_END=2500`. Then we create our schema: + +```json +{ + "example": { + "tags": { + "plant": 50, + "line": 5, + "machine": "id" + }, + "fields": { + "voltage": {"..."}, + "current": {"..."}, + "temperature": {"..."}, + "power": {"..."}, + "vibration": {"..."} + } + } +} +``` + +As you see even a complex schema isn't that complicated to write. The main limitation of the Data Generator is that one +instance can only take a single schema. But when combining multiple instances it is easily possible to simulate complex +setups spanning multiple factories with multiple machinery. + +## Batch Size Automator + +To optimize ingest performance, the [](#bsa) utility library is used. +The BSA is only active when [INGEST_MODE](#ingest-mode) is set to `1`, +and [BATCH_SIZE](#setting-dg-batch-size) has a value smaller or equal `0`. +When activated, everything else is configured automatically. 
## Prometheus Metrics -This chapter gives an overview over the available Prometheus metrics and what they represent - -+ tsperf_generated_values: how many values have been generated -+ tsperf_inserted_values: how many values have been inserted -+ tsperf_insert_percentage: [INGEST_SIZE](#ingest_size) times number of IDs divided by inserted_values -+ tsperf_batch_size: The currently used batch size (only available with [BSA](#batch-size-automator)) -+ tsperf_insert_time: The average time it took to insert the current batch into the database (only available with [BSA](#batch-size-automator)) -+ tsperf_rows_per_second: The average number of rows per second with the latest batch_size (only available with [BSA](#batch-size-automator)) -+ tsperf_best_batch_size: The up to now best batch size found by the batch_size_automator (only available with [BSA](#batch-size-automator)) -+ tsperf_best_batch_rps: The rows per second for the up to now best batch size (only available with [BSA](#batch-size-automator)) -+ tsperf_values_queue_was_empty: How many times the internal queue was empty when the insert threads requested values - (indicates data generation lacks behind data insertion) -+ tsperf_inserts_failed: How many times the insert operation has failed -+ tsperf_inserts_performed_success: How many time the insert operation was performed successfully. - For Databases where a single insert operation has to be split in to multiple - (AWS Timestream) still only one is counted. +An overview over the available Prometheus metrics and what they represent. + +:::{csv-table} Query Timer Statistics Arguments +"Metric Name", "Description" + +tsperf_generated_values, How many values have been generated +tsperf_inserted_values, How many values have been inserted +tsperf_insert_percentage, [INGEST_SIZE](#setting-dg-ingest-size) times number of IDs divided by number of inserted values +tsperf_batch_size, The currently used batch size [^bsa-only] +tsperf_insert_time, The average time it took to insert the current batch into the database [^bsa-only] +tsperf_rows_per_second, The average number of rows per second with the latest batch size [^bsa-only] +tsperf_best_batch_size, The best batch size found by the batch size automator up to now [^bsa-only] +tsperf_best_batch_rps, The rows per second number for the best batch size up to now [^bsa-only] +tsperf_values_queue_was_empty, How many times the internal queue was empty when the insert threads requested values. This can indicate whether data generation lacks behind data insertion. +tsperf_inserts_failed, How many times the insert operation has failed +tsperf_inserts_performed_success, "How many times the insert operation was performed successfully. For databases where a single insert operation has to be split into multiple ones. For AWS Timestream, still only one is counted." +::: + +[^bsa-only]: Only available with [](#bsa). ## Example Use Cases This chapter gives examples on how the Data Generator can be used. The -respective files for these examples can be found [here](../schema). +respective files for the examples can be explored in the schema folder. ### Single channel @@ -1062,7 +974,7 @@ Every 5 seconds each sensor reports a value and we want our simulation to run fo #### Setup -The resulting JSON schema could look like [`machine.json`](../schema/factory/simple/machine.json). +The resulting JSON schema could look like [machine.json]. 
As we have five sensors on ten lines in two factories we have 100 sensors in total, so for our docker-compose file we set the following environment variables: @@ -1075,17 +987,19 @@ As we want to use CrateDB running on localhost we set the following environment + USERNAME: "aValidUsername" + PASSWORD: "PasswordForTheValidUsername" -As we want to have a consistent insert every 5 seconds for one hour we set the following environment variables: +As we want to have a consistent insert every 5 seconds for one hour we set the +following environment variables: + INGEST_MODE: 0 + INGEST_SIZE: 720 (an hour has 3600 seconds divided by 5 seconds) + TIMESTAMP_DELTA: 5 + And finally we want to signal using the appropriate schema: + SCHEMA: "tsperf.schema.factory.simple:machine.json" The resulting yml file could look like this: -```YML +```yaml version: "2.3" services: datagen: @@ -1107,15 +1021,16 @@ services: SCHEMA: "tsperf.schema.factory.simple:machine.json" ``` -#### Running the example +#### Usage To run this example follow the following steps: + navigate to root directory of this repository + build docker image with `docker build -t tsperf -f Dockerfile .` + start an instance of CrateDB on localhost with `docker run -p "4200:4200" crate` -+ Enter USERNAME and PASSWORD in the [docker-compose file](../../examples/factory-simple-machine.yml) - + If no user was created you can just delete both environment variables (crate will use a default user) ++ Enter USERNAME and PASSWORD in the [simple factory compose file] + + If no user was created, you can just delete both environment variables. + CrateDB will use a default user. + start the docker-compose file with `docker-compose -f examples/factory-simple-machine.yml up` You can now navigate to localhost:4200 to look at CrateDB or to localhost:8000 to look at the raw data of the Data Generator. @@ -1139,22 +1054,23 @@ combinations have multiple sensors reporting at different time intervals: #### Setup As we actually use four different schemas (temperature metric is different for upper and lower lines) we also have four -different schema files. You can find them [here](../../examples/MultiType). +different schema files. You can find them in the [complex factory schemas] folders. -To run this use-case we have to write a more complex docker-compose [file](../../examples/factory-complex-scenario.yml). +To run this use-case we have to write a more complex docker-compose [complex factory compose file]. **Note we use `INGEST_MODE: 1` to insert data fast. To keep data-size small we only insert 1000 seconds worth of data, this can obviously be adjusted to create a bigger dataset.** -#### Running the example +#### Usage To run this example follow the following steps: + navigate to root directory of this repository + build docker image with `docker build -t tsperf -f Dockerfile .` + start an instance of CrateDB on localhost with `docker run -p "4200:4200" crate` -+ Add USERNAME and PASSWORD in the [docker-compose file](../../examples/factory-complex-scenario.yml) - + If no user was created you can just ignore both environment variables (crate will use a default user) ++ Adjust USERNAME and PASSWORD within the docker-compose file + + If no user was created, you can just ignore both environment variables. + CrateDB will use a default user. + start the docker-compose file with `docker-compose -f examples/factory-complex-scenario.yml up` You can now navigate to localhost:4200 to look at CrateDB or to localhost:8000 to look at the raw data of the Data Generator. 
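+Collected into a single copy-and-paste sketch (assuming Docker and Docker Compose
+are installed, run from the repository root):
+
+```shell
+# Build the tsperf image.
+docker build -t tsperf -f Dockerfile .
+
+# Start a local CrateDB instance (in a separate terminal, or add -d to detach).
+docker run -p "4200:4200" crate
+
+# Start the complex factory scenario; adjust USERNAME/PASSWORD in the file first,
+# or remove them to use CrateDB's default user.
+docker-compose -f examples/factory-complex-scenario.yml up
+```
+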
@@ -1168,3 +1084,13 @@ You can now navigate to localhost:4200 to look at CrateDB or to localhost:8000 t - Plant: - Line: - Machine: + + +[complex factory compose file]: https://github.com/crate/tsperf/blob/main/examples/factory-complex-scenario.yml +[complex factory schemas]: https://github.com/crate/tsperf/tree/main/tsperf/schema/factory/complex +[environment.json]: https://github.com/crate/tsperf/blob/main/tsperf/schema/basic/environment.json +[example folder]: https://github.com/crate/tsperf/tree/main/examples +[machine.json]: https://github.com/crate/tsperf/blob/main/tsperf/schema/factory/simple/machine.json +[motor.json]: https://github.com/crate/tsperf/blob/main/tsperf/schema/basic/motor.json +[schema folder]: https://github.com/crate/tsperf/tree/main/tsperf/schema +[simple factory compose file]: https://github.com/crate/tsperf/blob/main/examples/factory-simple-machine.yml diff --git a/docs/query-timer.md b/docs/query-timer.md index 1ab2ed4..93fd9cb 100644 --- a/docs/query-timer.md +++ b/docs/query-timer.md @@ -1,185 +1,196 @@ # Query Timer -## General Information - -This chapter cover general information about the Query Timer, e.g. supported databases and the basic workflow. - -### About - -The Query Timer is a tool to invoke queries against databases and measure its responsiveness. - -### How To - -#### Pip install - -The *Time Series Query Timer* is part of the `tsperf` package and can be installed using `pip install tsperf`. - -By calling `tsperf read --help` the possible configurations are listed. For further details see -[Query Timer Configuration](#query-timer-configuration). All configurations can be done with either command line -arguments or environment variables but when both are set then command line arguments will be used. - -When calling `tsperf read` with the desired arguments the Query Timer outputs live updated statistics on the query execution. -This includes: - -+ concurrency: how many threads are running, defined by [CONCURRENCY](#concurrency) -+ iterations: how many queries will be done in each thread, defined by [ITERATIONS](#iterations) -+ Progress: Percent of queries done and duration in seconds -+ time left: how much time is approximately left -+ rate: how many queries are executed each second on average -+ mean: the average query duration -+ stdev: the standard deviation of query execution time from the mean -+ min: the minimal query duration -+ max: the maximum query duration -+ success: how many queries were executed successfully -+ failure: how many queries were not executed successfully -+ percentiles: prints all chosen percentiles from the query execution times, defined by [QUANTILES](#quantiles) - -**NOTE: the QueryTimer measures roundtrip times so the actual time spent in the database could be less.** +The Query Timer is a tool to invoke queries against databases and measure +its responsiveness. + +## Installation + +:::{rubric} PyPI package +::: +The *Time Series Query Timer* is part of the `tsperf` package and can be +installed using `pip`. +```shell +pip install tsperf +``` + +## Usage + +By calling `tsperf read --help`, the possible configurations are listed. For +further details, see [Query Timer Configuration](#configuration). +All configurations can be done with either command line arguments or environment +variables, the former are taking precedence. + +When calling `tsperf read` with the desired arguments, the Query Timer outputs +live updated statistics on the query execution. 
This includes: + +:::{csv-table} Query Timer Statistics Arguments +"Argument", "Description", "Setting" + +concurrency, How many threads are running, [CONCURRENCY](#setting-qt-concurrency) +iterations, How many queries will be done in each thread, [ITERATIONS](#setting-qt-iterations) +progress, Percent of queries done and duration in seconds +time left, How much time is approximately left +rate, How many queries are executed each second on average +mean, The average query duration +stdev, The standard deviation of query execution time from the mean +min, The minimal query duration +max, The maximum query duration +success, How many queries were executed successfully +failure, How many queries were not executed successfully +percentiles, Chosen percentiles from the query execution times, [QUANTILES](#setting-qt-quantiles) +::: + +:::{note} +The QueryTimer measures roundtrip times, so the actual +query execution time spent within the database could be less. +::: ### Supported Databases -Currently 7 Databases are +Currently, 7 databases are supported. + + [CrateDB](https://crate.io/) + [InfluxDB V2](https://www.influxdata.com/) + [TimescaleDB](https://www.timescale.com/) -+ [MongoDB](https://www.mongodb.com/) with limitations see [Using MongoDB](#using-mongodb) ++ [MongoDB](https://www.mongodb.com/) + [PostgreSQL](https://www.postgresql.org/) + [AWS Timestream](https://aws.amazon.com/timestream/) + [Microsoft SQL Server](https://www.microsoft.com/de-de/sql-server) + #### CrateDB -##### Client Library +For CrateDB the [crate](https://pypi.org/project/crate/) library is used. +To connect to CrateDB, the following environment variables must be set: -For CrateDB the [crate](https://pypi.org/project/crate/) library is used. To connect to CrateDB the following -environment variables must be set: ++ [ADDRESS](#setting-qt-address): hostname including port e.g. `localhost:4200` ++ [USERNAME](#setting-qt-username): CrateDB username. ++ [PASSWORD](#setting-qt-password): password for CrateDB user. -+ [ADDRESS](#address): hostname including port e.g. `localhost:4200` -+ [USERNAME](#username): CrateDB username. -+ [PASSWORD](#password): password for CrateDB user. #### InfluxDB -##### Client Library - -For InfluxDB the [influx-client](https://pypi.org/project/influxdb-client/) library is used as the Data Generator only -supports InfluxDB V2. To connect to InfluxDB the following environment variables must be set: +For InfluxDB, the [influx-client](https://pypi.org/project/influxdb-client/) library is used. +To connect to InfluxDB, the following environment variables must be set: -+ [ADDRESS](#address): hostname -+ [TOKEN](#token): InfluxDB Read/Write token -+ [ORG](#org): InfluxDB organization ++ [ADDRESS](#setting-qt-address): hostname ++ [TOKEN](#setting-qt-token): InfluxDB Read/Write token ++ [ORG](#setting-qt-org): InfluxDB organization -##### Specifics +:::{note} +As only InfluxDB V2 is currently supported, queries have to be written in the Flux Query Language. +::: -+ As only InfluxDB V2 is currently supported queries have to be written in the Flux Query Language. -#### TimescaleDB +#### Microsoft SQL Server -##### Client Library +For Microsoft SQL Server the [pyodcb](https://github.com/mkleehammer/pyodbc) library is used. +If the Data Generator is run via `pip install` please ensure that `pyodbc` is properly installed on your system. -For TimescaleDB the [psycopg2](https://pypi.org/project/psycopg2/) library is used. 
+To connect with Microsoft SQL Server the following environment variables must be set: -To connect with TimescaleDB the following environment variables must be set: ++ [ADDRESS](#setting-qt-address): the host where Microsoft SQL Server is running in this [format](https://www.connectionstrings.com/azure-sql-database/) ++ [USERNAME](#setting-qt-username): Database user ++ [PASSWORD](#setting-qt-password): Password of the database user ++ [DATABASE](#setting-qt-database): the database name to connect to or create -+ [ADDRESS](#address): hostname -+ [USERNAME](#username): username of TimescaleDB user -+ [PASSWORD](#password): password of TimescaleDB user -+ [DATABASE](#database): the database name with which to connect #### MongoDB -##### Client Library - -For MongoDB the [MongoClient](https://mongodb.github.io/node-mongodb-native/api-generated/mongoclient.html) library is +For MongoDB, the [MongoClient](https://mongodb.github.io/node-mongodb-native/api-generated/mongoclient.html) library is used. To connect with MongoDB the following environment variables must be set: -+ [ADDRESS](#address): hostname (can include port if not standard MongoDB port is used) -+ [USERNAME](#username): username of TimescaleDB user -+ [PASSWORD](#password): password of TimescaleDB user -+ [DATABASE](#database): The name of the MongoDB database that will be used ++ [ADDRESS](#setting-qt-address): hostname (can include port if not standard MongoDB port is used) ++ [USERNAME](#setting-qt-username): username of TimescaleDB user ++ [PASSWORD](#setting-qt-password): password of TimescaleDB user ++ [DATABASE](#setting-qt-database): The name of the MongoDB database that will be used -##### Specifics +:::{note} +Because `pymongo` does not support queries as string, support for MongoDB is +turned off in the binary. To still use the Query Timer with MongoDB, have a +look at the next documentation section. +::: -Because `pymongo` does not support queries as string, Support for MongoDB is turned of in the binary. To still use the -Query Timer with Mongo DB have a look at the [Using MongoDB](#using-mongodb) section of this documentation. +:::{attention} +To use the Query Timer with MongoDB, the code needs to be changed. Therefore, +check out the [repository](https://www.github.com/crate/tsperf). -#### PostgreSQL ++ In the file `core.py`, uncomment the import statement of the `MongoDBAdapter`. ++ Also uncomment the instantiation of the `adapter` in the `get_database_adapter` function. ++ Comment the `ValueError` in the line above. -##### Client Library +This should let you start the Query Timer using `ADAPTER` set to MongoDB. -For PostgreSQL the [psycopg2](https://pypi.org/project/psycopg2/) library is used. +To add the query you want to measure add a variable containing your query to the script and pass this variable to +`adapter.execute_query()` in the `start_query_run` function, instead of `config.query`. -To connect with PostgreSQL the following environment variables must be set: +Now, the Query Timer is able to measure query execution times for MongoDB. +::: -+ [ADDRESS](#address): hostname -+ [USERNAME](#username): username of TimescaleDB user -+ [PASSWORD](#password): password of TimescaleDB user -+ [DATABASE](#database): the database name with which to connect +:::{todo} +Why make the user need to change the code? Why not just implement the facts above? +::: -#### AWS Timestream -##### Client Library +#### PostgreSQL -For AWS Timestream the [boto3](https://github.com/boto/boto3) library is used. 
+For PostgreSQL the [psycopg2](https://pypi.org/project/psycopg2/) library is used.
-To connect with AWS Timestream the following environment variables must be set:
+To connect with PostgreSQL the following environment variables must be set:
-+ [AWS_ACCESS_KEY_ID](#aws_access_key_id): AWS Access Key ID
-+ [AWS_SECRET_ACCESS_KEY](#aws_secret_access_key): AWS Secret Access Key
-+ [AWS_REGION_NAME](#aws_region_name): AWS Region
-+ [DATABASE](#database): the database name to connect to or create
++ [ADDRESS](#setting-qt-address): hostname
++ [USERNAME](#setting-qt-username): username of the PostgreSQL user
++ [PASSWORD](#setting-qt-password): password of the PostgreSQL user
++ [DATABASE](#setting-qt-database): the database name with which to connect
-##### Specifics
-+ Tests have shown that queries often fail due to server errors. To accommodate this an automatic retry is implemented
-  that tries to execute the query a second time. If it fails again the query is marked as failure.
-
-#### Microsoft SQL Server
+#### TimescaleDB
-##### Client Library
+For TimescaleDB the [psycopg2](https://pypi.org/project/psycopg2/) library is used.
-For Microsoft SQL Server the [pyodcb](https://github.com/mkleehammer/pyodbc) library is used.
-If the Data Generator is run via `pip install` please ensure that `pyodbc` is properly installed on your system.
+To connect with TimescaleDB the following environment variables must be set:
-To connect with Microsoft SQL Server the following environment variables must be set:
++ [ADDRESS](#setting-qt-address): hostname
++ [USERNAME](#setting-qt-username): username of the TimescaleDB user
++ [PASSWORD](#setting-qt-password): password of the TimescaleDB user
++ [DATABASE](#setting-qt-database): the database name with which to connect
-+ [ADDRESS](#address): the host where Microsoft SQL Server is running in this [format](https://www.connectionstrings.com/azure-sql-database/)
-+ [USERNAME](#username): Database user
-+ [PASSWORD](#password): Password of the database user
-+ [DATABASE](#database): the database name to connect to or create
-### Using MongoDB
+#### Timestream
-To use the Query Timer with MongoDB the code of the Query Timer needs to be changed. Therefore checkout the
-[repository](https://www.github.com/crate/tsperf).
+For AWS Timestream the [boto3](https://github.com/boto/boto3) library is used.
-+ In the file [core.py](core.py), uncomment the import statement of the `MongoDBAdapter`
-+ Also uncomment the instantiation of the `adapter` in the `get_database_adapter` function
-+ Comment the `ValueError` in the line above
+To connect with AWS Timestream the following environment variables must be set:
-This should let you start the Query Timer using `ADAPTER` set to MongoDB.
++ [AWS_ACCESS_KEY_ID](#setting-qt-aws_access_key_id): AWS Access Key ID
++ [AWS_SECRET_ACCESS_KEY](#setting-qt-aws_secret_access_key): AWS Secret Access Key
++ [AWS_REGION_NAME](#setting-qt-aws_region_name): AWS Region
++ [DATABASE](#setting-qt-database): the database name to connect to or create
-To add the query you want to measure add a variable containing your query to the script and pass this variable to
-`adapter.execute_query()` in the `start_query_run` function, instead of `config.query`.
-
-Now the Query Timer is able to measure query execution times for MongoDB.
+:::{note}
+Tests have shown that queries often fail due to server errors. To accommodate this,
+an automatic retry is implemented that tries to execute the query a second time.
+If it fails again, the query is marked as a failure.
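+
+A minimal sketch of this behaviour, for illustration only (this is not the
+actual tsperf implementation; `execute_query` stands in for the adapter's
+query call):
+
+```python
+def execute_with_single_retry(execute_query, query):
+    """Run a query, retrying exactly once before counting it as a failure."""
+    try:
+        return execute_query(query), True
+    except Exception:
+        try:
+            # One automatic retry, as described above.
+            return execute_query(query), True
+        except Exception:
+            # The second error marks the query as a failure.
+            return None, False
+```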
+::: + -## Query Timer Configuration +(configuration)= +## Configuration The Query Timer is mostly configured by setting Environment Variables (or command line arguments start with `-h` for more information). This chapter lists all available Environment Variables and explains their use in the Query Time. -### Environment variables configuring the behaviour of the Query Time +### Database Settings -The environment variables in this chapter are used to configure the behaviour of the Query Timer +The environment variables in this chapter are used to configure the behaviour of the Query Timer. +(setting-qt-adapter)= #### ADAPTER -Type: String - -Values: cratedb|timescaledb|influxdb1|influxdb2|mongodb|postgresql|timestream|mssql +:Type: String +:Value: `cratedb|timescaledb|influxdb1|influxdb2|mongodb|postgresql|timestream|mssql` The value will define which database adapter to use: + CrateDB @@ -190,54 +201,50 @@ The value will define which database adapter to use: + Timestream + Microsoft SQL Server +(setting-qt-concurrency)= #### CONCURRENCY How many threads are used in parallel to execute queries -Type: Integer - -Values: Integer bigger 0 - -Default: 10 +:Type: Integer +:Values: Integer bigger 0 +:Default: 10 +(setting-qt-iterations)= #### ITERATIONS How many iterations each thread is doing. -Type: Integer - -Values: Integer bigger 0 - -Default: 100 +:Type: Integer +:Value: Integer bigger 0 +:Default: 100 +(setting-qt-quantiles)= #### QUANTILES List of quantiles that will be written to the ouput after the Query Timer finishes -Type: String - -Values: list of Floats between 0 and 100 split by `,` - -Default: "50,60,75,90,99" +:Type: String +:Value: list of Floats between 0 and 100 split by `,` +:Default: "50,60,75,90,99" +(setting-qt-refresh-interval)= #### REFRESH_INTERVAL The time in seconds between updates of the output -Type: Float - -Values: Any positive float - -Default: 0.1 +:Type: Float +:Value: Any positive float +:Default: 0.1 +(setting-qt-query)= #### QUERY -Type: String - -Values: A valid Query as string - -Default: "" +:Type: String +:Value: A valid Query as string +:Default: "" +(setting-qt-address)= #### ADDRESS Type: String @@ -260,33 +267,30 @@ Host can be either without port (e.g. `"localhost"`) or with port (e.g. `"localh host must start with `tcp:` +(setting-qt-username)= #### USERNAME -Type: String - -Values: username of user used for authentication against the database - -Default: None +:Type: String +:Value: username of user used for authentication against the database +:Default: None used with CrateDB, TimescaleDB, MongoDB, Postgresql, MSSQL. +(setting-qt-password)= #### PASSWORD -Type: String - -Values: password of user used for authentication against the database - -Default: None +:Type: String +:Value: password of user used for authentication against the database +:Default: None used with CrateDB, TimescaleDB, MongoDB, Postgresql, MSSQL. +(setting-qt-database)= #### DATABASE -Type: String - -Values: Name of the database where table will be created - -Default: empty string +:Type: String +:Value: Name of the database where table will be created +:Default: empty string used with TimescaleDB, MongoDB, AWS Timestream, Postgresql, MSSQL. @@ -300,90 +304,91 @@ The value of `DATABASE` is used as the database parameter of MongoDB. **AWS Timestream:** The value of `DATABASE` is used as the database parameter of AWS Timestream. 
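+
+To tie the settings above together, here is an illustrative sketch that drives
+`tsperf read` from Python purely via environment variables. All values
+(adapter, address, query) are placeholders for your own setup; the equivalent
+command line arguments would take precedence over the environment.
+
+```python
+import os
+import subprocess
+
+# Time a placeholder query against a local CrateDB: 4 threads x 50 iterations each.
+env = dict(
+    os.environ,
+    ADAPTER="cratedb",
+    ADDRESS="localhost:4200",
+    CONCURRENCY="4",
+    ITERATIONS="50",
+    QUANTILES="50,90,99",
+    QUERY="SELECT COUNT(*) FROM doc.my_table",
+)
+subprocess.run(["tsperf", "read"], env=env, check=True)
+```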
-### Environment variables used to configure InfluxDB +### InfluxDB Settings The environment variables in this chapter are only used to configure InfluxDB +(setting-qt-token)= #### TOKEN -Type: String - -Values: token gotten from InfluxDB V2 - -Default: empty string +:Type: String +:Value: token gotten from InfluxDB V2 +:Default: empty string Influx V2 uses [token](https://v2.docs.influxdata.com/v2.0/security/tokens/view-tokens/) based authentication. +(setting-qt-org)= #### ORG -Type: String - -Values: org_id gotten from InfluxDB V2 - -Default: empty string +:Type: String +:Value: org_id gotten from InfluxDB V2 +:Default: empty string Influx V2 uses [organizations](https://v2.docs.influxdata.com/v2.0/organizations/) to manage buckets. -### Environment variables used to configure AWS Timestream +### Timestream Settings The environment variables in this chapter are only used to configure AWS Timestream +(setting-qt-aws_access_key_id)= #### AWS_ACCESS_KEY_ID -Type: String - -Values: AWS Access Key ID - -Default: empty string +:Type: String +:Value: AWS Access Key ID +:Default: empty string +(setting-qt-aws_secret_access_key)= #### AWS_SECRET_ACCESS_KEY -Type: String - -Values: AWS Secret Access Key - -Default: empty string +:Type: String +:Value: AWS Secret Access Key +:Default: empty string +(setting-qt-aws_region_name)= #### AWS_REGION_NAME -Type: String - -Values: AWS region name - -Default: empty string +:Type: String +:Value: AWS region name +:Default: empty string ## Alternative Query Timers -As the Query Timer is just a by-product of the Data Generator there are other alternatives that offer more features and -ways to time queries. The main advantage of the Query Timer is that it supports all Databases that are also supported by -the Data Generator and is easy and fast to use. +The Query Timer is just a by-product of the Data Generator. There are other +alternatives that offer more features and ways to measure the timing of queries. +The main advantage of the Query Timer is that it supports all Databases that are +also supported by the Data Generator and that it is easy and quick to use. ### cr8 -[cr8](https://github.com/mfussenegger/cr8) is a highly sophisticated tool that offers the possibility to measure query +[cr8] is a highly sophisticated tool that offers the possibility to measure query execution times for CrateDB and other databases using the PostgreSQL protocol. -Pros: - -+ Offers support for .toml files to configure more complex scenarios. -+ Offers saving results to CrateDB directly -+ For CrateDB only the real DB-time is measured (when not using the postgres port) - -Cons: +:::{rubric} Pros +::: ++ **Tracks:** Supports configuring more complex scenarios using .toml files. ++ **Persistence:** Supports saving results to CrateDB directly. ++ **Effective:** With the CrateDB HTTP protocol, the real timings spent within + the database are measured, not only round-trip times. -+ No support for databases not using PostgreSQL protocol +:::{rubric} Cons +::: ++ No support for databases not using PostgreSQL protocol. ### JMeter -[Jmeter](https://jmeter.apache.org/) is a well known and great tool that offers the possibility to measure query +[JMeter] is a well known and great tool that offers the possibility to measure query execution times for Databases using JDBC. -Pros: +:::{rubric} Pros +::: ++ Industry standard for these kinds of tests. ++ Supports export of results to Prometheus. ++ Provides sophisticated settings and configurations to support more complex use cases. 
-+ Industry standard for these kinds of tests -+ Offers Prometheus export of results -+ Offers more sophisticated settings and configurations to support more complicated use cases +:::{rubric} Cons +::: ++ More complex to set up for simple use cases. -Cons: -+ More complex to setup for simple use cases +[cr8]: https://github.com/mfussenegger/cr8 +[JMeter]: https://jmeter.apache.org/ diff --git a/docs/utility/batch-size-automator.md b/docs/utility/batch-size-automator.md index 82cd14e..9cede96 100644 --- a/docs/utility/batch-size-automator.md +++ b/docs/utility/batch-size-automator.md @@ -1,13 +1,42 @@ -# Batch size automator +(bsa)= +(batch-size-automator)= +# Batch Size Automator -`batch_size_automator` is a Python library that allows to automatically detect the best -`batch_size` for optimized batch data operations (e.g. database ingest). +A utility to automatically detect the best batch size for optimized data insert operations. -## Why use batch_size_automator instead of other libraries +## Features +The BSA utility provides two modes: -What other libraries? ++ Finding best batch size ++ Surveillance -## Using batch_size_automator +### Finding best batch size + +1. The BSA calculates how many rows were inserted per second during the last + test cycle (`test_size` inserts). +2. The BSA compares if the current result is better than the best. + + a. If current was better, the batch size is adjusted by the step size. + + b. If current was worse, the batch size is adjusted in the opposite direction + of the last adjustment and the step size is reduced. + +3. Repeat steps 1 to 2 until step size is below a threshold. This means that we + entered 2.b. often and should have found our optimum batch size. +4. Change to surveillance mode. + +### Surveillance + +1. The BSA increases the length of the test cycle to 1000 inserts. +2. After 1000 inserts, the BSA calculates if performance got worse. + + a. if performance is worse test cycle length is set to 20, and we switch to + finding best batch size mode. + + b. If performance is the same or better repeat steps 1 to 2. + + +## Usage The most basic version on how to use the BSA. This will take care of your batch size and over time optimize it for maximum performance. @@ -118,29 +147,3 @@ while True: duration = time.monotonic() - start bsa.insert_batch_time(duration) ``` - -## Simplified functionality - -This chapter show how the BSA works in a simplified explanation. - -### Modes - -The BSA consists of two modes: -+ finding best batch size -+ surveillance - -#### Finding best batch size - -1. The BSA calculates how many rows where inserted per second during the last test cycle (`test_size` inserts) -2. The BSA compares if the current result is better than the best - 2.a. If current was better the batch size is adjusted by the step size - 2.b. If current was worse the batch size is adjusted in the opposite direction of the last adjustment and the step size is reduced -3. Repeat steps 1 to 2 until step size is below a threshold (this means that we entered 2.b. often and should have found our optimum batch size) -4. Change to surveillance mode - -#### Surveillance mode - -1. The BSA increases the length of the test cycle to 1000 inserts -2. After 1000 inserts the BSA calculates if performance has gotten worse - 2.a. if performance is worse test cycle length is set to 20 and we switch to finding best batch size mode - 2.b. 
If performance is the same or better repeat steps 1 to 2 diff --git a/docs/utility/float-simulator.md b/docs/utility/float-simulator.md index 5878414..e28bcba 100644 --- a/docs/utility/float-simulator.md +++ b/docs/utility/float-simulator.md @@ -1,31 +1,44 @@ -# Float value simulator +(float-value-simulator)= +# Float Value Simulator -`float_simulator` is a Python library that allows to generate float values that look like real world data. The data is -modeled after provided arguments and will have normal distribution. +A utility to generate floating point values that look like real world data. The +data is modeled after provided arguments and will have normal distribution. -## Why use float_simulator instead of other libraries +## Why not use Numpy? -As each generated value is based on the previous value the result set will look and measure like real world sensor data. -Libraries like `numpy` offer also normal distributed values but these values are unordered. E.g. we have a temperature -sensor with an average value of 6.4° and a standard deviation of 0.2. The order of values are completely different when -simulating those values with the `float_simulator` and `numpy`. +Each generated value is based on the previous value, so the result set will +look and measure like real world sensor data. + +Libraries like `numpy` also offer normal distributed values, but these values are +unordered. E.g. we have a temperature sensor with an average value of 6.4° and a +standard deviation of 0.2. The order of values are completely different when +simulating those values with the `float_simulator` vs. `numpy`. ### Order of values -`float_simulator` 10.000 values, with `mean=6.4`, `stdev=0.2`: +:::{rubric} Float Simulator +::: +10.000 values, with `mean=6.4`, `stdev=0.2`: -![float_simulator_values](https://user-images.githubusercontent.com/453543/118516727-e0a5a080-b736-11eb-800f-be3caf77b195.png) +![float_simulator_values](https://user-images.githubusercontent.com/453543/118516727-e0a5a080-b736-11eb-800f-be3caf77b195.png){width=480} -`numpy` 10.000 values, with `loc=6.4`, `scale=0.2`: +:::{rubric} Numpy +::: +10.000 values, with `loc=6.4`, `scale=0.2`: -![numpy_values](https://user-images.githubusercontent.com/453543/118516831-f7e48e00-b736-11eb-8a5c-047590767f7f.png) +![numpy_values](https://user-images.githubusercontent.com/453543/118516831-f7e48e00-b736-11eb-8a5c-047590767f7f.png){width=480} ### Distribution -![float_simulator_distribution](https://user-images.githubusercontent.com/453543/118516654-cf5c9400-b736-11eb-8069-3ef85f22d5f4.png) -![numpy_distribution](https://user-images.githubusercontent.com/453543/118516782-ed29f900-b736-11eb-8c69-47db9c5ab6a0.png) +:::{rubric} Float Simulator +::: +![float_simulator_distribution](https://user-images.githubusercontent.com/453543/118516654-cf5c9400-b736-11eb-8069-3ef85f22d5f4.png){width=480} + +:::{rubric} Numpy +::: +![numpy_distribution](https://user-images.githubusercontent.com/453543/118516782-ed29f900-b736-11eb-8c69-47db9c5ab6a0.png){width=480} -## Using float_simulator +## Usage Instantiate a `FloatSimulator` object and calculate a number of values for it: diff --git a/docs/utility/tictrac.md b/docs/utility/tictrac.md index 6601882..32e86c9 100644 --- a/docs/utility/tictrac.md +++ b/docs/utility/tictrac.md @@ -1,13 +1,16 @@ +(tictrack)= # tictrack -`tictrack` is a Python library to measure function execution times and apply statistical functions on the results. 
+A utility to measure function execution times, and apply statistical functions on the results. -## Why using tictrack instead of other libraries - -Other libraries that measure function execution times require the same repetitive code for each time you want to use it. -This reduces readability and code needs to be changed when execution times no longer want to be tracked. Also, if an -average (or other statistical value) execution time needs to be calculated time keeping needs to be implemented again. +## Why? +Other libraries that measure function execution times require the same +repetitive code for each time you want to use it. This reduces readability +and code needs to be changed when execution times no longer want to be +tracked. Also, if an average (or other statistical value) execution time +needs to be calculated time keeping needs to be implemented again. +## Features `tictrack` solves this with the following features: + [decorator](#decorator) for function to automatically track the execution time of each function call + [wrapper](#wrapper) function that can be put around each function call that should be tracked @@ -17,9 +20,9 @@ average (or other statistical value) execution time needs to be calculated time + [function](#consolidating-the-result) to consolidate large result sets + additional [`delta`](#delta) time tracking so two results can be kept at the same time -## Using tictrack +## Usage -There are two ways to use `tictrack`, the optimal one depending on your specific use case. +There are two ways to use `tictrack`, the optimal one depends on your specific use case. - If you want to track every execution of one or more function, using the decorator is the easiest solution. - If you only want to track certain executions of one or more functions, the wrapper function is the better solution. @@ -127,6 +130,20 @@ could set `do_print=True` and `save_result=False`. **Note:** these arguments must be passed before any function `**kwargs` as keyword arguments to `tictrack.execute_timed_function` as described in the function documentation. +### Disabling tictrack + +To disable tictrack on a global scale this line needs to be added to you code before any measurements happen: + +```python +tictrack.enabled = False +``` + +This will reduce the influence of `tictrack` on the runtime to a minimum with no additional code changes necessary. This +makes it easy to switch `tictrack` on and off without searching the whole code base where it is used. + + +## Result Processing + ### Delta `tictrack` offers a second set of results which can be used to only analyze a subset of values. This option is enabled @@ -252,14 +269,3 @@ from tsperf.util import tictrack tictrack.reset("foo") # "foo" in tictrack.tic_toc is False ``` - -### Disabling tictrack - -To disable tictrack on a global scale this line needs to be added to you code before any measurements happen: - -```python -tictrack.enabled = False -``` - -This will reduce the influence of `tictrack` on the runtime to a minimum with no additional code changes necessary. This -makes it easy to switch `tictrack` on and off without searching the whole code base where it is used.
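+
+The global switch described above can also be derived from an environment
+variable at program start, so measurements can be toggled without touching any
+calling code. The variable name `TICTRACK_ENABLED` in this sketch is purely
+hypothetical, not an official setting.
+
+```python
+import os
+
+from tsperf.util import tictrack
+
+# Keep tictrack disabled unless explicitly switched on via the environment.
+tictrack.enabled = os.environ.get("TICTRACK_ENABLED", "0") == "1"
+```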