-
Notifications
You must be signed in to change notification settings - Fork 0
Performance benchmarking
To benchmark the performance of the database when the number of measurement-related models is reduced from around 30 to just 1, several scenarios with synthetic data have been created. All scenarios have the same number of total records, around 24 millions.
This scenario creates synthetic data for a single station and a set of variables (10) for 23 years. It results in a database structure where the number of records is spread evenly across the years and variables. As the default chunk time interval for the TimescaleDB is 1 day, this scenario results in many chunks (>8000) with just a few records each (~3000).
This scenario creates synthetic data for a set of stations (23) and variables (10) for a single year. It results in a database structure where the number of records is spread evenly across the years and variables, as before, but compressed in a single day. As the default chunk time interval for the TimescaleDB is 1 day, this scenario results in not that many chunks (365) and a higher number of records per chunk that the previous scenario (~66000), although the total number of records is the same. Even per chunk, the number of records is still very small for a Postgres DB designed to be fluid with millions of entries per table, suggesting that the performance will not be that different.
The following chart shows the time it takes to create 100k records at a time as a function of the number of records already present for two scenarios. As it can be seen, not only the performance is pretty good but also it is rather independent on the total number of records present or how they are spread over time (it could have been different as the table is internally partitioned in the time dimension). As a reference, the current live system they have has 16.5 million records as described in this link, so we are going a step further with this, although we have no information on data injection times. Injecting data is typically done asynchronously, so performance is less of an issue in this case, anyway.
The query time has also been tested in both scenarios, including in this time the database filtering and the extraction of the data as a pandas.Dataframe
that can be latter manipulated further:
records = Measurement.objects.filter(
station=station, variable=variable, time__range=(start_date, end_date)
)
records_df = pd.DataFrame.from_records(records.values())
A random date range has been used in each case. The results are shown in the following plot.
As it can be seen, the larger the query (the longer the time range), the more records need to be retrieved and therefore the longer it takes to complete the process. However,
- there seems to be no difference in performance based on the density of records per chunk (scenario 1 vs scenario 2 in the region they overlap);
- queries less than 2-years long are mostly in the sub-second range, which should result in an acceptable end-user experience.
It is unclear why there seems to be some branching in the plots (for both cases) when making small queries, but still the query time is pretty small and might be related with some internal database querying process, so it should not have a bit impact.
The analysis for Scenario 1 has been repeated limiting the resources available to up to 1 CPU and 4 Gb of memory to closely match what is likely to be available in the real web server. The results, overlayed with the previous data is shown in the next two plots.
As it can be seen, no obvious difference can be observed, suggesting that the performance is not likely to be limited by running Paricia with few resources.
A final scenario is included in which, in addition to having low resources (lower, actually: 1 CPU and 2 Gb of RAM), the file is exported as CSV and saved to disk, to have a better comparison with the query benchmarking provided in this issue. In particular, the code timed is:
records = Measurement.objects.filter(
station=station, variable=variable, time__range=(start_date, end_date)
)
records_df = pd.DataFrame.from_records(records.values())
records_df.to_csv(f"scratch/temp/{start_date}_{end_date}_{station}_{variable}.csv")
The result are included in the following plot.
As it can be seen, the time is now significantly higher, around twice the previous results, but still below the marks of the current system for a similar size query. For example, retrieving and saving as CSV 400k records (~4 years for 1 record every 5 min) takes around 4s, compared to the 7s in the current system.
We need to keep in mind that this comparison might be strongly influenced by the disk writing speed and the network speed, so the comparison must be taken with care.
Based on these observations, it seems that reducing the number of measurement models (i.e. database tables) do not have a detrimental effect in the performance of the database as such performance is dominated by the number of records per chunk, which is very small for a PostgreSQL DB in the two scenarios considered.