TimescaleDB is a database built on top of PostgreSQL, utilizing a SQL
interface and RDBMS characteristics. This supplemental guide explains how
the data generated for TSBS is stored, additional flags available when
using the data importer (tsbs_load_timescaledb
), and additional flags
available for the query runner (tsbs_run_queries_timescaledb
). This
should be read after the main README.
Data generated by tsbs_generate_data
for TimescaleDB is serialized in a
"pseudo-CSV" format, along with a custom header at the beginning. The
header is several lines long:
- one line composed of a comma-separated list of tag labels, with the literal string
tags
as the first value in the list - one or more lines composed of a comma-separated list of field labels, with the hypertable name as the first value in the list
- a blank line
An example for the cpu-only
use case:
tags,hostname,region,datacenter,rack,os,arch,team,service,service_version,service_environment
cpu,usage_user,usage_system,usage_idle,usage_nice,usage_iowait,usage_irq,usage_softirq,usage_steal,usage_guest,usage_guest_nice
Following this, each reading is composed of two rows:
- a comma-separated list of tag values for the reading, with the literal string
tags
as the first value in the list - a comma-separated list of field values for the reading, with the hypertable the reading belongs to being the first value and the timestamp as the second value
An example for the cpu-only
use case:
tags,host_0,eu-central-1,eu-central-1b,21,Ubuntu15.10,x86,SF,6,0,test
cpu,1451606400000000000,58.1317132304976170,2.6224297271376256,24.9969495069947882,61.5854484633778867,22.9481393231639395,63.6499207106198313,6.4098777048301052,44.8799140503027445,80.5028770761136201,38.2431182911542820
Examples of TimescaleDB YAML configuration for loading data can be found in the sample-configs directory. The main difference in benchmarking TimescaleDB regards whether you are attempting to benchmark a regular, single-node hypertable or a distributed multi-node hypertable. There are sample configuration files for both.
The settings that determines whether TSBS will attempt to create a distributed hypertable or a regular hypertable are below. All other settings impact the overall test, regardless of which type of hypertable setup you are testing.
Setting | Hypertable | Distributed Hypertable |
---|---|---|
in-table-partition-tag | false | true |
replication-factor | omit/0 | 1 |
Hostname of the PostgreSQL server.
Specifies any connection parameters to pass along as the client
connects to the PostgreSQL server. Values for dbname
, host
, and user
in the connection string will be overridden with the values from the flags
-db-name
, -host
, and -user
, respectively. See the
PostgreSQL documentation for more details.
Whether to actually use TimescaleDB's hypertable for storing data. Set to
false
to measure the insert/write performance of plain PostgreSQL.
User to use to connect to the PostgreSQL server.
Database to connect to in order to create additional benchmark databases.
By default this is the same as the user
(i.e., postgres
if neither is set),
but sometimes a user does not have its own database.
Whether to store the primary key tag as a column in the hypertable. By default the hypertable just stores the primary key (an int) of the tag set associated with the reading in the table. However this option can be more costly for some queries because of the need to do a JOIN. This option will store the primary tag (first tag in the list of tags in the data format) in the hypertable so that if most queries are using that as a filter, query performance can be improved.
Please note,
-in-table-partition-tag
is required for multi-node TimescaleDB benchmark testing. Currently, the multi-node query engine does not support push-down of JOINs and so the most efficient queries will filter on the primary tag name.
Whether to store the tags as a JSONB element in the tags table. By default
tags are stored in separate columns in a metadata table named tags
, where
each unique set of tags is stored in a separate row. Using this option will
store the tags a JSONB element in tags
instead. Write performance does not
seem to be dramatically affected by this option, but query performance is
typically better with non-JSONB tags so this defaults to false
.
Size of each time partition in terms of time. It is expressed as a Golang
time.Duration string, meaning a number followed by a unit abbreviation
(s = seconds, m = minutes, h = hours), e.g., the default 12h
is 12 hours.
This should be adjusted based on the dataset size.
Number of space partitions for the primary tag. Setting this to 0
will
disable space partitioning.
Multi-node Benchmarking: When testing a multi-node installation, this value should be set to zero or omitted (which will default to zero.) In doing so, the distributed hypertable will be created across all connected data nodes, which is the preferred setup.
This setting is specifically required to create a distributed hypertable. This is only supported on multi-node installations. Setting it to a value of 1 is recommended for most multi-node testing.
If this value is set >=1 on a single-node TimescaleDB instance, tsbs_load
will
error.
The format for (any) field indexes, which are additional secondary indexes
on fields in a hypertable. These are used for more performant threshold
queries when the threshold is on a field rather than time, e.g.,
cpu.usage_user > 90
. The two valid options are:
VALUE-TIME
which creates a compound index on(<field>, time DESC)
TIME-VALUE
which creates a compound index on(time DESC, <field>)
(<field>
is replaced with the actual field name)
Number of secondary indexes to create on measurement fields, with -1
signifying to create indexes on all fields. While secondary indexes can
increase query performance, they will also increase disk usage and reduce
write performance.
Whether to create a compound index on the primary tag and time dimension
(i.e., an index on (tags_id, time DESC)
). Removing this index is likely
to significantly reduce query performance.
Whether to create an index on the time dimension. For datasets with smaller
number of devices (i.e., <100k), this is usually recommended. For a larger
number of devices, -time-partition-index
is recommended instead.
Whether to create a compound index on the time dimension and the primary
tag (i.e., an index on (time DESC, tags_id)
).
For datasets with a larger number of devices (i.e., >100k), this is
usually recommended because it increases write performance by
reducing lock contention on nodes in the
B-tree since they are additionally partitioned by tags_id
.
Whether to consistently hash data across the multiple insert workers by the value of the primary (first) tag. For datasets with larger numbers of devices, this option helps improve data locality on disk which can lead to better query performance. For datasets with smaller numbers of devices, it is typically not necessary.
File to output periodic CPU and memory statistics. Useful for understanding system performance while writing data to the database.
File to output replication statistics. Useful for understanding how long it takes for data to be written in a replicated setup.
Comma separated list of hostnames for the PostgreSQL servers. Each server should contain a full copy/replica of the dataset. Workers are connected to a server in a round-robin fashion.
Specifies any connection parameters to pass along as the client
connects to the PostgreSQL server. Values for dbname
, host
, and user
in the connection string will be overridden with the values from the flags
-db-name
, -hosts
, and -user
, respectively. See the
PostgreSQL documentation for more details.
Whether to print out a sample EXPLAIN ANALYZE
output for the first query
in the set of queries. This will be the only query run and is useful for
understanding the query plan that is being generated for a particular
query type.
User to use to connect to the PostgreSQL server(s).