Change Measurement table to store iteration values in an array instead of in rows #100
Conversation
Some initial comments.
So, CI seems to fail with the error you mentioned, too. https://github.com/smarr/ReBenchDB/runs/7453601666?check_suite_focus=true#step:10:29
This looks to me like the SQL you generate may not be ideal, and might have a mix-up of notation. What's the SQL query that you actually execute?
I am also wondering whether it's possible to simplify much of this by using arrays directly as parameters.
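One possible shape for that simplification, sketched here with assumed table and column names (this is not the code under review): with an array-valued column, the whole iteration series could be bound as a single query parameter.

```sql
-- Sketch only; table/column names are assumptions based on this discussion.
-- Instead of interpolating one row per iteration into the SQL text, bind
-- the whole iteration series as a single float8[] parameter.
INSERT INTO Measurement (runId, trialId, criterion, invocation, value)
VALUES ($1, $2, $3, $4, $5::float8[]);
```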
Random thought when I was walking home: I don't see any changes to
I think I'll also look into whether we can have concrete benchmarks to more directly measure the tradeoffs of the different optimization approaches.
You are right, I forgot about somns.R. I will need some help with that, mostly with identifying what handles the Value in the R file. Also, I'm not sure, but would I need a new database import, since the old one will have the older version of the table? But of course there won't be a new database import yet; I might need some help dealing with that. I suspect the answer will be more unit tests :) If you don't mind, can I leave this till Monday to sort out? I suspect it's going to be a big job.
Of course, leave it to Monday! Just some notes in the meantime.
Ideally, we'd refactor the code so that there are no
Here is the bit of the query that gets the
I suspect we may want to do the processing and unnesting of the data, for instance, in the "factorize" functions, e.g., https://github.com/smarr/ReBenchDB/blob/master/src/views/rebenchdb.R#L136
When you say database import, I suspect you mean migrating the database from the old to the new schema? I'd say this is a separate problem, which we probably want to postpone until we are convinced that the current or another solution is the way we want to go.
I think I might have fixed it, probably not the neatest way, but I know one of the CI tests will fail.
Not sure if I broke it or not. I'll have a look tomorrow.
All values are defined as https://github.com/smarr/ReBenchDB/blob/master/src/db/db.sql#L187
This is a lot of nitpicking. Sorry.
Generally, I think we also need:
- tests, to test all these changes
- benchmarks, to assess the benefits/drawbacks
- time to insert a data set
- time to query db+restore the iteration column
src/db.ts (outdated diff)
FROM Measurement
WHERE trialId = $1
GROUP BY runId, criterion, invocation
GROUP BY runId, criterion, invocation, ite
Hm, how come this is now grouped by ite?
@HumphreyHCB sorry, I think we briefly discussed this, but I forgot: what was the explanation for this? Was this needed?
It's a yes and no. It's needed as more of a placeholder for a method that calls it, but we never use the result of the ite column. I think in the record-measurement method we now work out what the iteration should be.
If you remove it, something will break; I am sure there is a way around this.
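For context, once values are aggregated into an array per key, the iteration column no longer needs to appear in the GROUP BY at all; ordering can be preserved inside the aggregate instead. A hedged sketch (column names are assumptions based on the diff above):

```sql
-- Sketch: array_agg with ORDER BY keeps the iteration order without
-- grouping by the iteration column itself.
SELECT runId, criterion, invocation,
       array_agg(value ORDER BY iteration) AS values
FROM Measurement
WHERE trialId = $1
GROUP BY runId, criterion, invocation;
```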
src/db.ts (outdated diff)
run.id.toString() +
' ' +
trial.id.toString() +
' ' +
d.in.toString() +
' ' +
criteria.get(m.c).id.toString()
This feels like something where using format strings, e.g. `${run.id} ${trial.id} ${d.in}`, would help.
Though, I don't really understand what the iterationMap is for.
What does it do? Why do you need to construct these "ids" and then use this elaborate way to concatenate things?
Perhaps a comment and naming could help to clear things up?
Changed to
iMID = `${run.id} ${trial.id} ${d.in} ${criteria.get(m.c).id}`;
So, what is iterationMap? Before my change, once we had processed 50 data points from the payload, we did an immediate batched insert at that point.
The problem with that is that, with the new ARRAY value, I want to avoid updating the array for a given ID just to append one extra value.
Once I have processed 50 data points, I am not guaranteed that there won't be another value to be appended to the end of the array for an ID we have already seen (an ID being the primary key of the Measurement table, which is `${run.id} ${trial.id} ${d.in} ${criteria.get(m.c).id}`).
So, what I do is keep a map/dictionary of all IDs and the array of values that goes with each; this is iterationMap. Once there are no more data points to process, I know I have all of the array values for every ID. Then I can start batch inserting, safely knowing there will be no need to update an ID to append another value.
I will add a comment to make this clearer.
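The batching scheme described above can be sketched like this (a simplified model of the idea, not the PR's actual code; the data-point shape and names are assumptions based on this thread):

```typescript
// Sketch of the iterationMap idea: first collect all values per measurement
// id, then insert each id exactly once, so no array ever needs updating.
interface RawPoint {
  runId: number;
  trialId: number;
  invocation: number;
  criterionId: number;
  value: number;
}

function buildIterationMap(points: RawPoint[]): Map<string, number[]> {
  const iterationMap = new Map<string, number[]>();
  for (const p of points) {
    // key mirrors the table's primary key: runId, trialId, invocation, criterionId
    const id = `${p.runId} ${p.trialId} ${p.invocation} ${p.criterionId}`;
    const values = iterationMap.get(id);
    if (values) {
      values.push(p.value); // append in iteration order
    } else {
      iterationMap.set(id, [p.value]);
    }
  }
  return iterationMap; // insert one row per entry after all points are seen
}
```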
Hm, this method gets really large and hard to grasp.
Is this something you could refactor out into its own method, and do as a conversion pass over the data before preparing the batched insertion?
This would mean two loops over the data, but that should be fine, and it hopefully makes things more understandable.
src/stats/timeline.R (outdated diff)
convert_array <- function(x) {
  x <- gsub("(^\\{|\\}$)", "", x)
  strsplit(x, split = ",")
}

convert_double_array <- function(x) {
  sapply(convert_array(x), as.double)
}

result <-
  result %>%
  collect() %>%
  mutate(value = convert_double_array(value))

result <- unnest(result, cols = c(value))
Please do all necessary transformations in the rebenchdb.R file, which I assume contains duplicated code from this, or at least a close variant.
It would also be good to know which version is faster: the one doing it in R or the one doing it in the database.
It would also be great if you could expand the initial PR description with a more detailed breakdown of what changed, what the impact is, and what the consequences are. Thanks!
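For comparison, the transformation the R helpers above perform could be sketched in TypeScript like this (an illustration only, not code from the PR; the NULL handling is an assumption):

```typescript
// Parse a Postgres array literal such as "{1.5,2,NULL}" into numbers,
// mirroring the R convert_array/convert_double_array helpers.
function convertDoubleArray(x: string): (number | null)[] {
  // strip the surrounding braces, like gsub("(^\\{|\\}$)", "", x) in R
  const inner = x.replace(/^\{/, "").replace(/\}$/, "");
  if (inner === "") return [];
  return inner.split(",").map((s) =>
    s === "NULL" ? null : Number.parseFloat(s)
  );
}
```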
Thanks for the comments.
This commit b7dddea
Yes, you could say I fixed a mistake I made.
Also, I have just tested this version of the SQL: https://github.com/smarr/ReBenchDB/blob/master/src/dashboard.ts#L31-L44. So storing iterations of measurements as an array has also improved the query time for the dashboard.
Just to explain my question: I was merely trying to understand what the change was.
Are you referring here to the
So, my current branch with the new table does the query in ~639ms. b7dddea is the current version on the web, which I think is your master branch. So yes, I think the new SQL and the new database design have brought a ~1300ms to ~1422ms query time down to ~639ms.
- load old format, do sizing as before, and then convert Signed-off-by: Stefan Marr <[email protected]>
This is mostly to have them covered…
… optimization
- rename methods for benchmarking (also used for convenience in test)
- avoid creating empty jobs
…ion ids being used. We didn't increment the criterionId when no measurements were available, which led to a wrong mapping of the ids used in the input data to the ids used in the database.
- account for new data format with ValuesPossiblyMissing
The functions are in the next commit.
…the ReBench data file format
- merge redundant type declaration
- filter out NULL value in the database
The corresponding PR for the new API support is now merged. And since I didn't update the docker file, the merge there is actually needed to see whether the benchmarking of ReBenchDB with ReBench itself works.
I added various details to the PR description and this is now ready to be merged 🥳 🎉 Latest performance results: https://rebench.dev/ReBenchDB/compare/0d0176d6619b4a7ad39fd445d3aac9884d26a69b..e8d018f5361c08cda1fa16097d333f3a4ada112b
Changed the Measurement table to no longer store all iterations as individual rows; instead, each iteration is stored in an array in the database, associated with a key built from runId, trialId, criterion, and invocation.
This has reduced a snapshot of the database from 7,140 MB to 231 MB, and the row count from ~41,000,000 rows down to ~1,100,000 rows.
The full results can be seen here:
To migrate the old schema of the Measurement table, you will need to execute the migration script src/backend/db/schema-updates/migration.012.sql. To run the file from the command line against a Postgres server, use the following command:
psql -U *username* -d *database-name* -a -f *path-to-migration-script*
The script will aggregate all iterations with the same ID into a single row with an array column containing all iteration values, in the order in which they occurred.
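The core of such a row-to-array migration could look roughly like the following. This is a hedged sketch based on the description above, not the contents of migration.012.sql; table and column names are assumptions.

```sql
-- Sketch only: aggregate per-iteration rows into one row per key,
-- preserving iteration order inside the array.
CREATE TABLE MeasurementNew (
  runId      int NOT NULL,
  trialId    int NOT NULL,
  criterion  int NOT NULL,
  invocation int NOT NULL,
  value      float8[] NOT NULL,
  PRIMARY KEY (runId, trialId, criterion, invocation)
);

INSERT INTO MeasurementNew (runId, trialId, criterion, invocation, value)
SELECT runId, trialId, criterion, invocation,
       array_agg(value ORDER BY iteration)
FROM Measurement
GROUP BY runId, trialId, criterion, invocation;
```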
Updates Before Merge 2024-03-23
By changing the database layout, we significantly reduce the number of records in the Measurement table: it goes from 70,184,010 rows to 3,070,654 rows, and from a total size of 5073 MB to 504 MB.
The performance increased significantly for storing results (up to 15x faster) and rendering reports (up to 4x faster). There's a regression for computing the timeline though.
The database size before migration:
After migration:
SQL to get table sizes
Database Access Rights
The migration may need some manual work afterwards though, at least when done with the wrong database user, to make sure the right user has access to the new Measurement table.
The database user will also need extra permissions to support exporting data as CSV files:
GRANT pg_execute_server_program TO user_name;
Limitations
This PR introduces a new API for clients to match the new data handling, but also drops the old API. This is mostly to make sure I migrate all my systems. Technically the conversion code is there and even an API compatibility check, but it's not doing the conversion. If it turns out that's needed, it can be added easily.
When ReBench is used as a client, at least version 1.2.1.dev1 is needed, as per this PR: smarr/ReBench#236
New API
New DataPoint Format
The new definition of DataPoint is as follows:
This means, a DataPoint can for instance look like this:
The data point represents data for invocation 15 of an experiment and has data for two out of three criteria; the second one is missing. The arrays encode iteration data in order of iterations, where missing data is identified by null.
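A hypothetical sketch of what such a data point might look like in TypeScript; the field names (in for invocation, m for the per-criterion arrays) are guesses based on identifiers seen elsewhere in this thread, not the authoritative definition:

```typescript
// Hypothetical shape of the new DataPoint: one invocation, per-criterion
// iteration arrays. null inside an array marks a missing iteration; a
// wholly missing criterion is modeled here as a null array.
interface DataPoint {
  in: number; // invocation
  m: ((number | null)[] | null)[]; // one entry per criterion
}

const dp: DataPoint = {
  in: 15,
  m: [
    [4.2, 4.1, null, 4.0], // criterion 0: iteration 3 missing
    null,                  // criterion 1: no data at all
    [77, 78, 79],          // criterion 2: complete
  ],
};
```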
API Version Check
OPTIONS /rebenchdb/results
now provides version details of the supported API. It currently responds with:
It can be tested with
curl -X OPTIONS https://rebench.dev/rebenchdb/results -i
Other Changes