array facet: don't materialize unnecessary columns #2008

Open · wants to merge 2 commits into main
Conversation

cldellow (Contributor) commented Jan 28, 2023

The presence of `inner.*` causes SQLite to materialize a row with all the columns. Those columns will be discarded later.

Instead, we can select only the column we'll use. This lets SQLite's optimizer realize that the other columns in the CTE definition aren't needed.

On a test table with 278K rows, 98K of which had an array, this speeds up the facet calculation from 4 sec to 1 sec.


📚 Documentation preview 📚: https://datasette--2008.org.readthedocs.build/en/2008/
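The effect of the change can be illustrated with a small sketch (a hypothetical `posts` table and a simplified version of the facet query shape, not Datasette's exact SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT, tags TEXT);
    INSERT INTO posts (title, body, tags) VALUES
        ('first',  'long body ...', '["python", "sqlite"]'),
        ('second', 'long body ...', '["sqlite"]'),
        ('third',  'long body ...', NULL);
""")

# Before: `select inner.*` carries every column (title, body, ...) through
# the query, even though only `tags` is ever consumed by json_each().
# After: selecting just the faceted column lets SQLite's optimizer avoid
# materializing the unused columns.
sql = """
    with facet_rows as (select * from posts)
    select j.value as value, count(*) as count
    from (select tags from facet_rows where tags is not null) narrowed,
         json_each(narrowed.tags) j
    group by j.value
    order by count desc
"""
print(conn.execute(sql).fetchall())  # [('sqlite', 2), ('python', 1)]
```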

cldellow (Contributor, Author) commented

I don't know how/if you do automated tests for performance, so I haven't changed any of the tests.

codecov bot commented Jan 28, 2023

Codecov Report

Base: 92.11% // Head: 91.78% // Project coverage decreased by 0.34% ⚠️

Coverage data is based on head (f529a30) compared to base (e4ebef0).
Patch has no changes to coverable lines.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2008      +/-   ##
==========================================
- Coverage   92.11%   91.78%   -0.34%     
==========================================
  Files          38       39       +1     
  Lines        5555     5599      +44     
==========================================
+ Hits         5117     5139      +22     
- Misses        438      460      +22     
Impacted Files Coverage Δ
datasette/facets.py 91.84% <ø> (ø)
datasette/views/row.py 87.82% <0.00%> (ø)
datasette/views/table.py 92.57% <0.00%> (ø)
datasette/views/database.py 96.61% <0.00%> (ø)
datasette/utils/shutil_backport.py 9.09% <0.00%> (ø)
datasette/cli.py 82.40% <0.00%> (+2.77%) ⬆️
datasette/plugins.py 85.29% <0.00%> (+2.94%) ⬆️
datasette/utils/asgi.py 93.12% <0.00%> (+3.05%) ⬆️


☔ View full report at Codecov.

cldellow (Contributor, Author) commented Jan 29, 2023

Ack, this PR is broken. I see now that the `inner.*` is necessary for ensuring the correct count in the face of rows having duplicate values in views.

That fixes the overcounting, but I think it can undercount when the rows have the same data, e.g. a view like:

SELECT '["bar"]' tags UNION ALL SELECT '["bar"]'

will produce a count of {"bar": 1}, when it should be {"bar": 2}. In fact, this could apply to tables without primary keys, too.

If inner came from a base table that had a primary key or a rowid, we could use those column(s) to solve that case.

I guess a general solution would be to compute a window function so we have a distinct ID for each row. Will fiddle to see if I can get that working.
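Both the undercount and the window-function fix can be reproduced directly (a sketch against the example view above, using a simplified query shape rather than Datasette's actual facet SQL; requires SQLite >= 3.25.0 for `row_number()`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two identical rows and no primary key to tell them apart:
conn.execute("""create view v as select '["bar"]' as tags union all select '["bar"]'""")

# Deduplicating on the narrowed rows collapses the duplicates: count is 1.
naive = conn.execute("""
    select j.value, count(*) from
      (select distinct tags from v) i, json_each(i.tags) j
    group by j.value
""").fetchall()
print(naive)  # [('bar', 1)] -- undercounted

# row_number() over () gives every row a distinct ID, so the dedup
# no longer merges identical rows.
fixed = conn.execute("""
    select j.value, count(*) from
      (select distinct tags, row_number() over () as rn from v) i, json_each(i.tags) j
    group by j.value
""").fetchall()
print(fixed)  # [('bar', 2)]
```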

Use a window function to generate unique IDs for each row to avoid
having to do a DISTINCT on arbitrarily wide rows.

Add a test to capture the case where multiple rows are the same.
cldellow (Contributor, Author) commented

I pushed a revised version which ends up being faster -- the example which currently takes 4 seconds now runs in 500ms.

cldellow added a commit to cldellow/datasette-ui-extras that referenced this pull request Jan 29, 2023
and more correct. progress on #34, see also simonw/datasette#2008
simonw (Owner) commented Jan 29, 2023

I absolutely love this performance boost - really nice find.

One concern: this will be the first time Datasette ships a core feature that uses window functions.

Window functions were added to SQLite in version 3.25.0 on 2018-09-15 - which means it's still very common for Datasette to run on versions that don't yet support them.

So I see two options:

  • Detect window function support and switch between the old implementation and this better, new one
  • Detect window functions and disable the facet-by-JSON feature entirely if they are missing

I like the first option a bit better.
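One way the detection could look (a hedged sketch that probes for support at runtime, not necessarily how Datasette would implement it):

```python
import sqlite3

def supports_window_functions(conn: sqlite3.Connection) -> bool:
    """Probe for window function support by running a trivial query."""
    try:
        conn.execute("select row_number() over () from (select 1)")
        return True
    except sqlite3.OperationalError:
        return False

conn = sqlite3.connect(":memory:")
# Window functions landed in SQLite 3.25.0 (2018-09-15), so a version
# check should agree with the probe:
print(supports_window_functions(conn),
      sqlite3.sqlite_version_info >= (3, 25, 0))
```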

This also leads to a tricky CI challenge: Datasette needs to be able to run its test suite against more than one SQLite version to confidently test this feature going forward.

I don't yet have a good GitHub Actions recipe for this, but I really need one - for sqlite-utils too.

Might be able to use this trick for that: https://til.simonwillison.net/sqlite/ld-preload

simonw (Owner) commented Jan 29, 2023

> I don't know how/if you do automated tests for performance, so I haven't changed any of the tests.

We don't have any performance tests yet - would be a useful thing to add. I've not built anything like that before (at least not in CI; I've always done ad-hoc performance testing using something like Locust), so I don't have a great feel for how it could work.

I see not having to change the tests at all for this change as a really positive sign. If you find any behaviour differences between this and the previous implementation, that's a sign we should add another test or two specifying the behaviour we want.

cldellow (Contributor, Author) commented

Performance tests - I think most places don't have them as a formal gate enforced by CI. TypeScript and scalac seem to have tests that run to capture timings. The timings are included by a bot as a comment or build check, and also stored in a database so you can graph changes over time to spot regressions. Probably overkill for Datasette!

Window functions - oh, good point. Looks like Ubuntu shipped JSON1 support as far back as SQLite 3.11. I'll let this PR linger until there's a way to run against different SQLite versions. For now, I'm shipping this with datasette-ui-extras, since I think it's OK for a plugin to enforce a higher minimum requirement.

Tests - there actually did end up being test changes to capture the undercount bug of the current implementation, so the current implementation would fail against the new tests.

Perhaps a non-window function version could be written that uses random() instead of row_number() over () in order to get a unique key. It's technically not unique, but in practice, I imagine it'll work well.
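A sketch of that idea (hypothetical; `random()` returns a 64-bit integer per row, so a collision between identical rows is astronomically unlikely but not impossible):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create view v as select '["bar"]' as tags union all select '["bar"]'""")

# random() is evaluated once per row, so the two identical rows get
# (almost certainly) different keys and distinct no longer merges them.
rows = conn.execute("""
    select j.value, count(*) from
      (select distinct tags, random() as rk from v) i, json_each(i.tags) j
    group by j.value
""").fetchall()
print(rows)  # [('bar', 2)] with overwhelming probability
```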

simonw (Owner) commented Jan 29, 2023

> We don't have any performance tests yet - would be a useful thing to add. I've not built anything like that before (at least not in CI; I've always done ad-hoc performance testing using something like Locust), so I don't have a great feel for how it could work.

Had an interesting conversation about this just now: https://fedi.simonwillison.net/@simon/109773800944614366

There's a risk that different runs will return different results due to the shared resource nature of GitHub Actions runners, but a good fix for that is to run comparative tests where you run the benchmark against e.g. both main and the incoming PR branch and report back on any differences.
