array facet: don't materialize unnecessary columns #2008

Open · wants to merge 2 commits into main
Conversation

cldellow (Contributor) commented Jan 28, 2023

The presence of `inner.*` causes SQLite to materialize a row with all the columns. Those columns will be discarded later.

Instead, we can select only the column we'll use. This lets SQLite's optimizer realize that the other columns in the CTE definition aren't needed.

On a test table with 278K rows, 98K of which had an array, this speeds up the facet calculation from 4 sec to 1 sec.


📚 Documentation preview 📚: https://datasette--2008.org.readthedocs.build/en/2008/
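The effect of the change can be illustrated with a small sketch (a hypothetical `posts` table and a simplified version of the facet query shape, not Datasette's exact SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT, tags TEXT);
    INSERT INTO posts (title, body, tags) VALUES
        ('first',  'long body ...', '["python", "sqlite"]'),
        ('second', 'long body ...', '["sqlite"]'),
        ('third',  'long body ...', NULL);
""")

# Before: `select inner.*` carries every column (title, body, ...) through
# the query, even though only `tags` is ever consumed by json_each().
# After: selecting just the faceted column lets SQLite's optimizer avoid
# materializing the unused columns.
sql = """
    with facet_rows as (select * from posts)
    select j.value as value, count(*) as count
    from (select tags from facet_rows where tags is not null) narrowed,
         json_each(narrowed.tags) j
    group by j.value
    order by count desc
"""
print(conn.execute(sql).fetchall())  # [('sqlite', 2), ('python', 1)]
```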

cldellow (Contributor, Author) commented

I don't know how/if you do automated tests for performance, so I haven't changed any of the tests.

codecov bot commented Jan 28, 2023

Codecov Report

Base: 92.11% // Head: 91.78% // Project coverage decreased by 0.34% ⚠️

Coverage data is based on head (f529a30) compared to base (e4ebef0).
Patch has no changes to coverable lines.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2008      +/-   ##
==========================================
- Coverage   92.11%   91.78%   -0.34%     
==========================================
  Files          38       39       +1     
  Lines        5555     5599      +44     
==========================================
+ Hits         5117     5139      +22     
- Misses        438      460      +22     
Impacted Files Coverage Δ
datasette/facets.py 91.84% <ø> (ø)
datasette/views/row.py 87.82% <0.00%> (ø)
datasette/views/table.py 92.57% <0.00%> (ø)
datasette/views/database.py 96.61% <0.00%> (ø)
datasette/utils/shutil_backport.py 9.09% <0.00%> (ø)
datasette/cli.py 82.40% <0.00%> (+2.77%) ⬆️
datasette/plugins.py 85.29% <0.00%> (+2.94%) ⬆️
datasette/utils/asgi.py 93.12% <0.00%> (+3.05%) ⬆️


☔ View full report at Codecov.

cldellow (Contributor, Author) commented Jan 29, 2023

Ack, this PR is broken. I see now that the `inner.*` is necessary for ensuring the correct count in the face of rows having duplicate values in views.

That fixes the overcounting, but I think it can undercount when the rows have the same data, e.g. a view like:

SELECT '["bar"]' tags UNION ALL SELECT '["bar"]'

will produce a count of {"bar": 1}, when it should be {"bar": 2}. In fact, this could apply to tables without primary keys, too.

If inner came from a base table that had a primary key or a rowid, we could use those column(s) to solve that case.

I guess a general solution would be to compute a window function so we have a distinct ID for each row. Will fiddle to see if I can get that working.
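Both the undercount and the window-function fix can be reproduced directly (a sketch against the example view above, using a simplified query shape rather than Datasette's actual facet SQL; requires SQLite >= 3.25.0 for `row_number()`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two identical rows and no primary key to tell them apart:
conn.execute("""create view v as select '["bar"]' as tags union all select '["bar"]'""")

# Deduplicating on the narrowed rows collapses the duplicates: count is 1.
naive = conn.execute("""
    select j.value, count(*) from
      (select distinct tags from v) i, json_each(i.tags) j
    group by j.value
""").fetchall()
print(naive)  # [('bar', 1)] -- undercounted

# row_number() over () gives every row a distinct ID, so the dedup
# no longer merges identical rows.
fixed = conn.execute("""
    select j.value, count(*) from
      (select distinct tags, row_number() over () as rn from v) i, json_each(i.tags) j
    group by j.value
""").fetchall()
print(fixed)  # [('bar', 2)]
```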

Use a window function to generate unique IDs for each row to avoid
having to do a DISTINCT on arbitrarily wide rows.

Add a test to capture the case where multiple rows are the same.
cldellow (Contributor, Author) commented

I pushed a revised version which ends up being faster -- the example which currently takes 4 seconds now runs in 500ms.

cldellow added a commit to cldellow/datasette-ui-extras that referenced this pull request Jan 29, 2023
and more correct. progress on #34, see also simonw/datasette#2008
simonw (Owner) commented Jan 29, 2023

I absolutely love this performance boost - really nice find.

One concern: this will be the first time Datasette ships a core feature that uses window functions.

Window functions were added to SQLite in version 3.25.0 on 2018-09-15 - which means it's still very common for Datasette to run on versions that don't yet support them.

So I see two options:

  • Detect window function support and switch between the old implementation and this better, new one
  • Detect window functions and disable the facet-by-JSON feature entirely if they are missing

I like the first option a bit better.
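One way the detection could look (a hedged sketch that probes for support at runtime, not necessarily how Datasette would implement it):

```python
import sqlite3

def supports_window_functions(conn: sqlite3.Connection) -> bool:
    """Probe for window function support by running a trivial query."""
    try:
        conn.execute("select row_number() over () from (select 1)")
        return True
    except sqlite3.OperationalError:
        return False

conn = sqlite3.connect(":memory:")
# Window functions landed in SQLite 3.25.0 (2018-09-15), so a version
# check should agree with the probe:
print(supports_window_functions(conn),
      sqlite3.sqlite_version_info >= (3, 25, 0))
```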

This also leads to a tricky CI challenge: Datasette needs to be able to run its test suite against more than one SQLite version to confidently test this feature going forward.

I don't yet have a good GitHub Actions recipe for this, but I really need one - for sqlite-utils too.

Might be able to use this trick for that: https://til.simonwillison.net/sqlite/ld-preload

simonw (Owner) commented Jan 29, 2023

> I don't know how/if you do automated tests for performance, so I haven't changed any of the tests.

We don't have any performance tests yet - would be a useful thing to add. I've not built anything like that before (at least not in CI; I've always done ad-hoc performance testing using something like Locust), so I don't have a great feel for how it could work.

I see not having to change the tests at all for this change as a really positive sign. If you find any behaviour differences between this and the previous implementation, that's a sign we should add another test or two specifying the behaviour we want.

cldellow (Contributor, Author) commented

Performance tests - I think most places don't have them as a formal gate enforced by CI. TypeScript and scalac seem to have tests that run to capture timings. The timings are included by a bot as a comment or build check, and also stored in a database so you can graph changes over time to spot regressions. Probably overkill for Datasette!

Window functions - oh, good point. Looks like Ubuntu shipped JSON1 support as far back as SQLite 3.11. I'll let this PR linger until there's a way to run against different SQLite versions. For now, I'm shipping this with datasette-ui-extras, since I think it's OK for a plugin to enforce a higher minimum requirement.

Tests - there actually did end up being test changes to capture the undercount bug of the current implementation, so the current implementation would fail against the new tests.

Perhaps a non-window function version could be written that uses random() instead of row_number() over () in order to get a unique key. It's technically not unique, but in practice, I imagine it'll work well.
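A sketch of that idea (hypothetical; `random()` returns a 64-bit integer per row, so a collision between identical rows is astronomically unlikely but not impossible):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create view v as select '["bar"]' as tags union all select '["bar"]'""")

# random() is evaluated once per row, so the two identical rows get
# (almost certainly) different keys and distinct no longer merges them.
rows = conn.execute("""
    select j.value, count(*) from
      (select distinct tags, random() as rk from v) i, json_each(i.tags) j
    group by j.value
""").fetchall()
print(rows)  # [('bar', 2)] with overwhelming probability
```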

simonw (Owner) commented Jan 29, 2023

> We don't have any performance tests yet - would be a useful thing to add. I've not built anything like that before (at least not in CI; I've always done ad-hoc performance testing using something like Locust), so I don't have a great feel for how it could work.

Had an interesting conversation about this just now: https://fedi.simonwillison.net/@simon/109773800944614366

There's a risk that different runs will return different results due to the shared resource nature of GitHub Actions runners, but a good fix for that is to run comparative tests where you run the benchmark against e.g. both main and the incoming PR branch and report back on any differences.
