
Change Measurement table to store iteration values in an array instead of in rows #100

Merged: 33 commits merged into smarr:master on Mar 23, 2024

Conversation

@HumphreyHCB (Contributor) commented on Jul 21, 2022:

Changed the Measurement table to no longer store each iteration as an individual row. Instead, all iterations are stored in an array in the database, associated with a key built from runId, trialId, criterion, and invocation.

This has reduced a snapshot of the database from 7,140 MB to 231 MB, and the row count from ~41,000,000 rows down to ~1,100,000 rows.
The full results can be seen here:
[screenshot: full results]

To migrate the old schema of the Measurement table, you need to execute the migration script src/backend/db/schema-updates/migration.012.sql.
To run the file against a PostgreSQL server from the command line, use the following command:
psql -U <username> -d <database-name> -a -f <path-to-migration-script>

The script aggregates all iterations with the same ID into a single row, with an array column containing all values of the iterations, in the order in which they occurred.
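
As a rough illustration, the aggregation corresponds to something like the following (a sketch only: the old layout and the iteration column name are assumed here, and the migration script itself is authoritative):

-- Collapse the per-iteration rows of the old layout (one value per row)
-- into one row per key, with the values gathered into an array
-- ordered by iteration.
SELECT runId, trialId, criterion, invocation,
       array_agg(value ORDER BY iteration) AS value
FROM Measurement_old
GROUP BY runId, trialId, criterion, invocation;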


Updates Before Merge 2024-03-23

By changing the database layout, we significantly reduce the number of records in the Measurement table: it goes from 70,184,010 rows to 3,070,654 rows, and from a total size of 5073 MB to 504 MB.

The performance increased significantly for storing results (up to 15x faster) and rendering reports (up to 4x faster). There is, however, a regression for computing the timeline.

The database size before migration:

    table_name      | row_estimate |   total    |   index    |   toast    |   table    | total_size_share
--------------------+--------------+------------+------------+------------+------------+------------------
measurement         | 7.018401e+07 | 5073 MB    | 2109 MB    |            | 2964 MB    |   79.38%
profiledata         |       101783 | 1260 MB    | 2248 kB    | 1251 MB    | 6384 kB    |   19.72%
timeline            |       496331 | 43 MB      | 11 MB      |            | 32 MB      |     .67%
run                 |        13549 | 7928 kB    | 3320 kB    | 8192 bytes | 4600 kB    |     .12%
trial               |        18713 | 5232 kB    | 1344 kB    | 8192 bytes | 3880 kB    |     .08%
source              |         3500 | 1376 kB    | 344 kB     | 8192 bytes | 1024 kB    |     .02%
experiment          |         3563 | 432 kB     | 232 kB     | 8192 bytes | 192 kB     |     .01%
project             |           17 | 64 kB      | 48 kB      | 8192 bytes | 8192 bytes |     .00%
criterion           |           10 | 48 kB      | 32 kB      | 8192 bytes | 8192 bytes |     .00%
environment         |           22 | 48 kB      | 32 kB      | 8192 bytes | 8192 bytes |     .00%
schemaversion       |            1 | 24 kB      | 16 kB      |            | 8192 bytes |     .00%
softwareversioninfo |            0 | 24 kB      | 16 kB      | 8192 bytes | 0 bytes    |     .00%
softwareuse         |            0 | 8192 bytes | 8192 bytes |            | 0 bytes    |     .00%

After migration:

    table_name      | row_estimate |   total    |   index    |   toast    |   table    | total_size_share
--------------------+--------------+------------+------------+------------+------------+------------------
profiledata         |       101783 | 1260 MB    | 2248 kB    | 1251 MB    | 6384 kB    |   69.16%
measurement         | 3.070654e+06 | 504 MB     | 66 MB      | 129 MB     | 309 MB     |   27.67%
timeline            |       496331 | 43 MB      | 11 MB      |            | 32 MB      |    2.36%
run                 |        13549 | 7928 kB    | 3320 kB    | 8192 bytes | 4600 kB    |     .43%
trial               |        18713 | 5232 kB    | 1344 kB    | 8192 bytes | 3880 kB    |     .28%
source              |         3500 | 1376 kB    | 344 kB     | 8192 bytes | 1024 kB    |     .07%
experiment          |         3563 | 432 kB     | 232 kB     | 8192 bytes | 192 kB     |     .02%
project             |           17 | 64 kB      | 48 kB      | 8192 bytes | 8192 bytes |     .00%
criterion           |           10 | 48 kB      | 32 kB      | 8192 bytes | 8192 bytes |     .00%
environment         |           22 | 48 kB      | 32 kB      | 8192 bytes | 8192 bytes |     .00%
softwareversioninfo |            0 | 24 kB      | 16 kB      | 8192 bytes | 0 bytes    |     .00%
schemaversion       |            1 | 24 kB      | 16 kB      |            | 8192 bytes |     .00%
softwareuse         |            0 | 8192 bytes | 8192 bytes |            | 0 bytes    |     .00%
SQL to get the table sizes:
WITH RECURSIVE pg_inherit(inhrelid, inhparent) AS
    (select inhrelid, inhparent
    FROM pg_inherits
    UNION
    SELECT child.inhrelid, parent.inhparent
    FROM pg_inherit child, pg_inherits parent
    WHERE child.inhparent = parent.inhrelid),
pg_inherit_short AS (SELECT * FROM pg_inherit WHERE inhparent NOT IN (SELECT inhrelid FROM pg_inherit))
SELECT table_schema
    , TABLE_NAME
    , row_estimate
    , pg_size_pretty(total_bytes) AS total
    , pg_size_pretty(index_bytes) AS INDEX
    , pg_size_pretty(toast_bytes) AS toast
    , pg_size_pretty(table_bytes) AS TABLE
    , to_char(total_bytes::float8 / sum(total_bytes) OVER () * 100,'999D99%') AS total_size_share
  FROM (
    SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes
    FROM (
         SELECT c.oid
              , nspname AS table_schema
              , relname AS TABLE_NAME
              , SUM(c.reltuples) OVER (partition BY parent) AS row_estimate
              , SUM(pg_total_relation_size(c.oid)) OVER (partition BY parent) AS total_bytes
              , SUM(pg_indexes_size(c.oid)) OVER (partition BY parent) AS index_bytes
              , SUM(pg_total_relation_size(reltoastrelid)) OVER (partition BY parent) AS toast_bytes
              , parent
          FROM (
                SELECT pg_class.oid
                    , reltuples
                    , relname
                    , relnamespace
                    , pg_class.reltoastrelid
                    , COALESCE(inhparent, pg_class.oid) parent
                FROM pg_class
                    LEFT JOIN pg_inherit_short ON inhrelid = oid
                WHERE relkind IN ('r', 'p')
             ) c
             LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
  ) a
  WHERE oid = parent
) a
WHERE table_schema = 'public'
ORDER BY total_bytes DESC;

Database Access Rights

The migration may need some manual work afterwards, though, at least when it was done with the wrong database user, to make sure the right user has access to the new Measurement table.
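
For example, something along these lines may be needed (a sketch; the user name is a placeholder for whatever user the ReBenchDB application connects as):

-- Hand the recreated table over to the application user
-- and restore its usual privileges.
ALTER TABLE Measurement OWNER TO rebenchdb_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE Measurement TO rebenchdb_user;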

The database user will also need extra permissions to support exporting data as CSV files:

GRANT pg_execute_server_program TO user_name;

Limitations

This PR introduces a new API for clients to match the new data handling, but it also drops the old API. This is mostly to make sure I migrate all my systems. Technically, the conversion code is there, and there is even an API compatibility check, but the check does not trigger the conversion. If it turns out that's needed, it can be added easily.

When ReBench is used as a client, at least version 1.2.1.dev1 is needed, as per this PR: smarr/ReBench#236

New API

New DataPoint Format

The new definition of DataPoint is as follows:

export interface DataPoint {
  /** Invocation */
  in: number;

  /**
   * An array of criteria with values ordered by iteration.
   * - some iterations may not yield data (ValuesPossiblyMissing)
   * - some criteria may not have data (CriterionWithoutData)
   */
  m: (ValuesPossiblyMissing | CriterionWithoutData)[];
}

export type ValuesPossiblyMissing = (number | null)[];
export type CriterionWithoutData = null;

This means a DataPoint can, for instance, look like this:

const dataPoint = {
  in: 15,
  m: [
    [1, 2, 3, 4, null, 6, 7],
    null,
    [null, null, 3]
  ]
};

This data point represents data for invocation 15 of an experiment and has data for two out of three criteria; the second criterion is missing. The arrays encode iteration data in the order of the iterations, with missing data indicated by null.
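
To make the encoding concrete, here is a small helper one might write on the client side (illustrative only; it is not part of the API):

// Extract the values for one criterion from a DataPoint, returning an
// empty array when the criterion has no data, and dropping iterations
// that did not yield a value.
function valuesForCriterion(dp: DataPoint, criterionIdx: number): number[] {
  const perIteration = dp.m[criterionIdx];
  if (perIteration === null) {
    return []; // CriterionWithoutData
  }
  return perIteration.filter((v): v is number => v !== null);
}

For the dataPoint above, valuesForCriterion(dataPoint, 0) yields [1, 2, 3, 4, 6, 7], and valuesForCriterion(dataPoint, 1) yields [].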

API Version Check

OPTIONS /rebenchdb/results now provides version details of the supported API.

It currently responds with:

X-ReBenchDB-Result-API-Version: 2.0.0
Allow: PUT

It can be tested with curl -X OPTIONS https://rebench.dev/rebenchdb/results -i.
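
A client can do the same check programmatically; a minimal sketch using fetch (the header name is taken from the response shown above; the function itself is hypothetical):

// Query the OPTIONS endpoint and read the advertised result API version.
async function resultApiVersion(baseUrl: string): Promise<string | null> {
  const response = await fetch(`${baseUrl}/rebenchdb/results`, {
    method: 'OPTIONS'
  });
  return response.headers.get('X-ReBenchDB-Result-API-Version');
}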

Other Changes

  • the self-tracking of performance has changed and may have less overhead now: instead of going through the same full code path, it now simply appends to a previously initialized array, avoiding the repeated checks for the initialization of the experiment, etc.
  • awaiting quiescence is hopefully more correct now, since it checks for new requests and awaits those, too
  • the timeline computation is now sending a batch of requests to the web worker instead of sending them one by one
  • the TypeScript compilation now targets ES2022, which may or may not cause issues in older browsers

@smarr (Owner) left a comment:

Some initial comments.

(7 resolved review threads on src/db.ts, marked outdated)
@smarr (Owner) commented on Jul 21, 2022:

So, CI seems to fail with the error you mentioned, too.

https://github.com/smarr/ReBenchDB/runs/7453601666?check_suite_focus=true#step:10:29

error: column "value" is of type real[] but expression is of type text[]

This looks to me like the SQL that you generate may not be ideal and might have a mix-up of notation.

What's the SQL query that you actually execute?
Perhaps one way of going about debugging this is to add tests specifically for the SQL queries that you changed?
Just to check that they execute as you expect them to?

@smarr
Copy link
Owner

smarr commented Jul 21, 2022

I am also wondering whether it's possible to simplify much of this by using arrays directly as parameters.
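
For illustration, passing an array directly with node-postgres might look like this (a sketch with assumed table and column names, not the PR's actual code; the explicit ::real[] cast is one way to avoid the text[] vs. real[] mismatch from the CI failure above):

import pg from 'pg';

const pool = new pg.Pool();

// node-postgres serializes a JS array parameter into a Postgres array;
// the explicit cast tells the server to treat it as real[] rather than
// letting it default to text[].
async function insertValues(
  runId: number,
  trialId: number,
  criterion: number,
  invocation: number,
  values: (number | null)[]
): Promise<void> {
  await pool.query(
    `INSERT INTO Measurement (runId, trialId, criterion, invocation, value)
     VALUES ($1, $2, $3, $4, $5::real[])`,
    [runId, trialId, criterion, invocation, values]
  );
}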

(1 resolved review thread on tests/db-tests.test.ts, marked outdated)
@HumphreyHCB force-pushed the Performance-Refactor branch 4 times, most recently from ed2a3fb to 98d9e6a, on July 22, 2022.
@smarr (Owner) commented on Jul 22, 2022:

Random thought when I was walking home: I don't see any changes to somns.R, which is the other consumer of the value column.
It's likely that we are missing a test that would show it failing, or not generating the expected results.
I somehow fear it doesn't just work :)

@smarr (Owner) commented on Jul 22, 2022:

I think I'll also look into whether we can have concrete benchmarks to more directly measure the tradeoffs of the different optimization approaches.

@HumphreyHCB (Contributor, Author) commented:

You are right, I forgot about somns.R. I will need some help with that, mostly with identifying what handles the value in the R file.

Also, I am not sure, but would I need a new database import, as the old one will have the older version of the table? Of course, there won't be a new database import yet; I might need some help dealing with that. I suspect the answer will be more unit tests :)

If you don't mind, can I leave this till Monday to sort out? I suspect it's going to be a big job.

@smarr (Owner) commented on Jul 22, 2022:

Of course, leave it to Monday!

Just some notes in the meantime.

Ideally, we'd refactor the code so that there are no value-related changes in timeline.R or other R files, and all changes (except for the refactoring) would be located in rebenchdb.R.

Here is the bit of the query that gets the value:
https://github.com/smarr/ReBenchDB/blob/master/src/views/rebenchdb.R#L49

I suspect we may want to do the processing and unnesting of data for instance in the "factorize" functions, e.g., https://github.com/smarr/ReBenchDB/blob/master/src/views/rebenchdb.R#L136

When you say database import, I suspect you mean migrating the database from the old to the new schema? I'd say this is a separate problem, which we probably want to postpone to the point in time when we are convinced that the current or another solution is the way we want to go.

@HumphreyHCB changed the title from "Change Measurements to no longer store all iterations as a separate Records (Currently Broken)" to "Change Measurements to no longer store all iterations as a separate Records" on Jul 25, 2022.
@HumphreyHCB (Contributor, Author) commented:

I think I might have fixed it, probably not in the neatest way.

But I know one of the CI tests will fail:

Error in structure(in_domain, pos = match(in_domain, breaks)) : 
      (converted from warning) Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.
      Consider 'structure(list(), *)' instead.

Not sure whether I broke it or not. I'll have a look tomorrow.

(1 resolved review thread on src/views/rebenchdb.R, marked outdated)
@HumphreyHCB (Contributor, Author) commented:

> Any specific reason to change the type from float4 to real?

The previous version of the Measurement table has the data type of the value column as real:

[screenshot: previous table definition]

@smarr (Owner) commented on Jul 27, 2022:

All values are defined as float4 (which happens to be the same as real), but real is not consistent with the existing code and will lead to confusion:

https://github.com/smarr/ReBenchDB/blob/master/src/db/db.sql#L187
https://github.com/smarr/ReBenchDB/blob/master/src/db/db.sql#L216-L224
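
A quick way to see the equivalence (the table and column names here are made up):

-- float4 and real both name PostgreSQL's 4-byte floating-point type;
-- \d example reports both columns as "real".
CREATE TABLE example (a float4, b real);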

@HumphreyHCB force-pushed the Performance-Refactor branch 2 times, most recently from 9b18d18 to 31511c5, on July 27, 2022.
@smarr (Owner) left a comment:

This is a lot of nitpicking. Sorry.

Generally, I think we also need:

  • tests, to test all these changes
  • benchmarks, to assess the benefits/drawbacks
    • time to insert a data set
    • time to query db+restore the iteration column

(1 resolved review thread on src/dashboard.ts, marked outdated)
src/db.ts (outdated), on a changed GROUP BY clause:

  FROM Measurement
  WHERE trialId = $1
- GROUP BY runId, criterion, invocation
+ GROUP BY runId, criterion, invocation, ite
@smarr (Owner):

hm, how come this is now grouped by ite?

@smarr (Owner):

@HumphreyHCB sorry, I think we briefly discussed this, but I forgot: what was the explanation for this? Was it needed?

@HumphreyHCB (Contributor, Author):

It's a yes and no. It's needed more as a placeholder for a method that calls it, but we never use the result of the ite column; I think the record-measurement method now works out what the iteration should be.
If you remove it, something will break. I am sure there is a way around this.

src/db.ts (outdated), comment on lines 923 to 929:
run.id.toString() +
' ' +
trial.id.toString() +
' ' +
d.in.toString() +
' ' +
criteria.get(m.c).id.toString()
@smarr (Owner):

This feels like a place where template strings would help, e.g. `${run.id} ${trial.id} ${d.in}`.

Though, I don't really understand what the iterationMap is for.
What does it do? Why do you need to construct these "ids" and then use this elaborate way to concatenate things?

Perhaps a comment and naming could help to clear things up?

@HumphreyHCB (Contributor, Author):

Changed to:

iMID = `${run.id} ${trial.id} ${d.in} ${criteria.get(m.c).id}`;

So, what is iterationMap? Before my change, once we had processed 50 data points from the payload, we did an immediate batched insert at that point.

The problem with that is that, with the new array value, I want to avoid updating the array of a given ID to append one extra value. After processing 50 data points, I am not guaranteed that there won't be another value to append to the end of the array of an ID we have already seen (an ID being the primary key of the Measurement table, which is `${run.id} ${trial.id} ${d.in} ${criteria.get(m.c).id}`).

So, what I do is keep a map/dictionary of all IDs and the array of values that goes with each; this is iterationMap. Once there are no more data points to process, I know I have all of the array values for all IDs. Then I can start batch inserting, safely knowing there will be no need to update an ID to append another value.

I will add a comment to make this clearer.
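
A sketch of the accumulation scheme described above (names and details are illustrative, not the PR's actual code):

// Map from the composite measurement id to the values seen so far.
const iterationMap = new Map<string, (number | null)[]>();

function recordValue(
  iMID: string, iteration: number, value: number | null
): void {
  let values = iterationMap.get(iMID);
  if (values === undefined) {
    values = [];
    iterationMap.set(iMID, values);
  }
  // Iterations are assumed 1-based; out-of-order data lands in the
  // right slot, and never-seen iterations remain empty.
  values[iteration - 1] = value;
}

// Only once all data points are processed is each id's array complete,
// so every row can be batch-inserted exactly once, without updates.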

@smarr (Owner):

Hm, this method gets really large and hard to grasp.

Is this something you could refactor out into its own method, and do it as a conversion pass over the data before preparing the batched insertion?

This would mean two loops over the data, but that should be fine, and hopefully makes things more understandable.

(Resolved review threads on src/db.ts and src/db/migration.003.sql, marked outdated)
Comment on lines 42 to 57
# Strip the surrounding {} of a PostgreSQL array literal and split on commas.
convert_array <- function(x) {
  x <- gsub("(^\\{|\\}$)", "", x)
  strsplit(x, split = ",")
}

# Convert each parsed array of strings into doubles.
convert_double_array <- function(x) {
  sapply(convert_array(x), as.double)
}

result <-
  result %>%
  collect() %>%
  mutate(value = convert_double_array(value))

# Expand the array column back into one row per value.
result <- unnest(result, cols = c(value))
@smarr (Owner):

Please do all necessary transformations in the rebenchdb.R file, which I assume contains duplicated code from this, or at least a close variant.

It would also be good to know which version is faster, the one doing it in R or the one doing it in the database.

(2 resolved review threads on tests/db-setup.test.ts, marked outdated)
@smarr (Owner) commented on Jul 27, 2022:

It would also be great if you could expand the initial PR description with a more detailed breakdown of what changed, what the impact is, and what the consequences are, as well as a few words on how to migrate an existing database.

Thanks

@HumphreyHCB (Contributor, Author) commented:

Thanks for the comments, I will address them.

@HumphreyHCB (Contributor, Author) commented:

This commit, b7dddea, has reduced the SQL query time from ~7255 ms down to ~639 ms.

@HumphreyHCB (Contributor, Author) commented:

> So, the performance improvement here comes from only grabbing the things that you really need for the orderedMeasurement table?

Yes. You could say I fixed a mistake I made.

(1 resolved review thread on src/db/migration.003.sql, marked outdated)
@HumphreyHCB (Contributor, Author) commented on Jul 28, 2022:

Also, I have just tested this version of the SQL, https://github.com/smarr/ReBenchDB/blob/master/src/dashboard.ts#L31-L44, on the older schema of the database. I am getting a query time of ~1300 ms to ~1422 ms.

So storing iterations of measurements as an array has also improved the query time for the dashboard.

(Resolved review threads on src/stats/timeline.R, src/views/rebenchdb.R, and tests/db-setup.test.ts, marked outdated)
@smarr (Owner) commented on Jul 28, 2022:

> So, the performance improvement here comes from only grabbing the things that you really need for the orderedMeasurement table?
>
> Yes. You could say I fixed a mistake I made.

Just to explain my question: I was merely trying to understand what the change was.
I still find the SQL hard to parse, and the commit message didn't give me much of a hint. But now I think I understand.

@smarr (Owner) commented on Jul 28, 2022:

> Also, I have just tested this version of the SQL, https://github.com/smarr/ReBenchDB/blob/master/src/dashboard.ts#L31-L44, on the older schema of the database. I am getting a query time of ~1300 ms to ~1422 ms. So storing iterations of measurements as an array has also improved the query time for the dashboard.

Are you referring here to the "down to ~639 ms" bit from before?

@HumphreyHCB (Contributor, Author) commented on Jul 28, 2022:

So, my current branch with the new table does the query in ~639 ms (b7dddea).

The current version that is on the web, which I think is your master branch and uses the old schema for the Measurement table, executes this query, https://github.com/smarr/ReBenchDB/blob/master/src/dashboard.ts#L31-L44, in ~1300 ms to ~1422 ms.

So yes, I think the new SQL and the new database design have brought a query time of ~1300 ms to ~1422 ms down to ~639 ms.

smarr added 18 commits on March 21, 2024. Commit message excerpts:

  • load old format, do sizing as before, and then convert
  • This is mostly to have them covered…
  • … optimization: rename methods for benchmarking (also used for convenience in tests); avoid creating empty jobs
  • …ion ids being used: we didn't increment the criterionId when no measurements were available, which led to a wrong mapping of the ids used in the input data to the ids used in the database
  • account for new data format with ValuesPossiblyMissing
  • The functions are in the next commit.
  • …the ReBench data file format: merge redundant type declaration; filter out NULL values in the database
  • The corresponding PR for the new API support is now merged. And since I didn't update the docker file, the merge there is actually needed to see whether the benchmarking of ReBenchDB with ReBench itself works.
@smarr (Owner) commented on Mar 23, 2024:

I added various details to the PR description and this is now ready to be merged 🥳 🎉

Latest performance results: https://rebench.dev/ReBenchDB/compare/0d0176d6619b4a7ad39fd445d3aac9884d26a69b..e8d018f5361c08cda1fa16097d333f3a4ada112b

@smarr merged commit b0b149a into smarr:master on Mar 23, 2024. 1 check passed.
@smarr changed the title from "Change Measurements to no longer store all iterations as a separate Records" to "Change Measurement table to store iteration values in an array instead of in rows" on Mar 23, 2024.