[Issue #1167] Add filtering to the search endpoint #1468
Conversation
"link_opportunity_summary_funding_category", | ||
["funding_category_id"], | ||
unique=False, | ||
) |
Would any of the new indexes cause performance issues on write? Or I guess the opportunities are relatively stable and don't need bulk updates too often?
I wouldn't expect to see any issues, and adjusting the indexes is pretty low-effort. From past experience, testing against a local DB versus one in AWS is a bit different in terms of performance, so we can evaluate more once we have the data in the dev DB.
Testing locally with ~10k records, adding/removing the indexes is basically instant. Prod data will be about 80k records, so roughly 10x. I don't think the scale of our DB is anywhere near "big data" territory where we'd need to super-optimize this, but if we do, we can easily add/remove indexes uneventfully.
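For reference, the index from the migration snippet above can also be sketched declaratively with SQLAlchemy Core; the index name here is illustrative, not necessarily what Alembic generated:

```python
# Hedged sketch: the non-unique index from the migration, declared with
# SQLAlchemy Core instead of an Alembic op. The index name is an assumption.
from sqlalchemy import Column, Index, Integer, MetaData, Table

metadata = MetaData()
link_table = Table(
    "link_opportunity_summary_funding_category",
    metadata,
    Column("funding_category_id", Integer),
)

funding_category_idx = Index(
    "link_opportunity_summary_funding_category_idx",  # illustrative name
    link_table.c.funding_category_id,
    unique=False,
)
```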
# ..WHERE ((opportunity.agency ILIKE 'US-ABC%') OR (opportunity.agency ILIKE 'HHS%'))
stmt = stmt.where(
    or_(*[Opportunity.agency.istartswith(agency) for agency in one_of_agencies])
)
It looks like you have an index on agency - would this do an index scan but not a direct index lookup? Would a direct index lookup on agency with `==` instead of `ilike` be faster?
Also, I was reading that `ilike` does a case-insensitive `lower(column_name)` search - is the agency index set up with lowercase to accommodate that?
Did some testing (see below). No difference between `like` and `ilike` from what I see. Changing to an exact match was a lot better for performance (20x on execution): `ilike` does a sequential scan, while `=` does a bitmap index scan.
If we don't need to support agency nesting yet (e.g. `HHS` also matching `HHS-XYZ`), then we can make it an exact match, although I'm not too concerned about the performance.
A quick check, using the query in the summary but with only the is_draft, current opportunity_summary, and agency filters (e.g. the agency bit like `AND ((opportunity.agency ILIKE 'US-MMM%') OR (opportunity.agency ILIKE 'HHS%'))`), gives:
Limit (cost=357.66..357.67 rows=1 width=207) (actual time=9.897..9.899 rows=0 loops=1)
-> Sort (cost=357.65..357.66 rows=4 width=207) (actual time=9.896..9.897 rows=0 loops=1)
Sort Key: opportunity.opportunity_id DESC
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=1.71..357.61 rows=4 width=207) (actual time=9.855..9.857 rows=0 loops=1)
Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id)
-> Nested Loop (cost=1.43..356.55 rows=3 width=239) (actual time=9.855..9.856 rows=0 loops=1)
Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id)
-> Nested Loop (cost=1.14..355.84 rows=2 width=231) (actual time=9.854..9.855 rows=0 loops=1)
-> Nested Loop Semi Join (cost=0.86..355.48 rows=1 width=223) (actual time=9.854..9.855 rows=0 loops=1)
Join Filter: (opportunity.opportunity_id = current_opportunity_summary_1.opportunity_id)
-> Nested Loop (cost=0.57..355.15 rows=1 width=231) (actual time=9.854..9.854 rows=0 loops=1)
-> Nested Loop (cost=0.29..354.81 rows=1 width=223) (actual time=9.853..9.854 rows=0 loops=1)
-> Seq Scan on opportunity (cost=0.00..346.50 rows=1 width=207) (actual time=9.853..9.853 rows=0 loops=1)
" Filter: ((is_draft IS FALSE) AND ((agency ~~* 'US-MMM%'::text) OR (agency ~~* 'HHS%'::text)))"
Rows Removed by Filter: 10500
-> Index Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary (cost=0.29..8.30 rows=1 width=16) (never executed)
Index Cond: (opportunity_id = opportunity.opportunity_id)
-> Index Only Scan using opportunity_summary_pkey on opportunity_summary (cost=0.29..0.34 rows=1 width=8) (never executed)
Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
Heap Fetches: 0
-> Index Only Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary current_opportunity_summary_1 (cost=0.29..0.32 rows=1 width=8) (never executed)
Index Cond: (opportunity_id = current_opportunity_summary.opportunity_id)
Heap Fetches: 0
-> Index Only Scan using link_opportunity_summary_applicant_type_opportunity_sum_fee2 on link_opportunity_summary_applicant_type (cost=0.29..0.34 rows=2 width=8) (never executed)
Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
Heap Fetches: 0
-> Index Only Scan using link_opportunity_summary_funding_instrument_opportunity_4597 on link_opportunity_summary_funding_instrument (cost=0.29..0.33 rows=2 width=8) (never executed)
Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
Heap Fetches: 0
-> Index Only Scan using link_opportunity_summary_funding_category_opportunity_s_9f79 on link_opportunity_summary_funding_category (cost=0.29..0.33 rows=2 width=8) (never executed)
Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
Heap Fetches: 0
Planning Time: 27.059 ms
Execution Time: 10.096 ms
Changing those queries to `=` rather than `ILIKE` (e.g. `AND ((opportunity.agency = 'US-MMM') OR (opportunity.agency = 'HHS'))`) gives:
Limit (cost=23.76..23.77 rows=1 width=207) (actual time=0.346..0.348 rows=0 loops=1)
-> Sort (cost=23.75..23.76 rows=4 width=207) (actual time=0.345..0.346 rows=0 loops=1)
Sort Key: opportunity.opportunity_id DESC
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=10.30..23.71 rows=4 width=207) (actual time=0.321..0.322 rows=0 loops=1)
Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id)
-> Nested Loop (cost=10.01..22.65 rows=3 width=239) (actual time=0.321..0.322 rows=0 loops=1)
Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id)
-> Nested Loop (cost=9.73..21.94 rows=2 width=231) (actual time=0.321..0.321 rows=0 loops=1)
-> Nested Loop Semi Join (cost=9.44..21.58 rows=1 width=223) (actual time=0.320..0.321 rows=0 loops=1)
Join Filter: (opportunity.opportunity_id = current_opportunity_summary_1.opportunity_id)
-> Nested Loop (cost=9.16..21.25 rows=1 width=231) (actual time=0.320..0.321 rows=0 loops=1)
-> Nested Loop (cost=8.87..20.91 rows=1 width=223) (actual time=0.320..0.321 rows=0 loops=1)
-> Bitmap Heap Scan on opportunity (cost=8.59..12.60 rows=1 width=207) (actual time=0.320..0.320 rows=0 loops=1)
" Recheck Cond: ((agency = 'US-MMM'::text) OR (agency = 'HHS'::text))"
Filter: (is_draft IS FALSE)
-> BitmapOr (cost=8.59..8.59 rows=1 width=0) (actual time=0.319..0.319 rows=0 loops=1)
-> Bitmap Index Scan on opportunity_agency_idx (cost=0.00..4.29 rows=1 width=0) (actual time=0.301..0.301 rows=0 loops=1)
" Index Cond: (agency = 'US-MMM'::text)"
-> Bitmap Index Scan on opportunity_agency_idx (cost=0.00..4.29 rows=1 width=0) (actual time=0.017..0.017 rows=0 loops=1)
" Index Cond: (agency = 'HHS'::text)"
-> Index Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary (cost=0.29..8.30 rows=1 width=16) (never executed)
Index Cond: (opportunity_id = opportunity.opportunity_id)
-> Index Only Scan using opportunity_summary_pkey on opportunity_summary (cost=0.29..0.34 rows=1 width=8) (never executed)
Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
Heap Fetches: 0
-> Index Only Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary current_opportunity_summary_1 (cost=0.29..0.32 rows=1 width=8) (never executed)
Index Cond: (opportunity_id = current_opportunity_summary.opportunity_id)
Heap Fetches: 0
-> Index Only Scan using link_opportunity_summary_applicant_type_opportunity_sum_fee2 on link_opportunity_summary_applicant_type (cost=0.29..0.34 rows=2 width=8) (never executed)
Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
Heap Fetches: 0
-> Index Only Scan using link_opportunity_summary_funding_instrument_opportunity_4597 on link_opportunity_summary_funding_instrument (cost=0.29..0.33 rows=2 width=8) (never executed)
Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
Heap Fetches: 0
-> Index Only Scan using link_opportunity_summary_funding_category_opportunity_s_9f79 on link_opportunity_summary_funding_category (cost=0.29..0.33 rows=2 width=8) (never executed)
Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
Heap Fetches: 0
Planning Time: 25.776 ms
Execution Time: 0.652 ms
Figured I would try my best to help out with a review! I have some questions for my own understanding. Feel free to wait on someone with more context for a "real" approval!
# We assume it's just a list of strings
if allowed_values is None:
    params: dict = {"metadata": metadata}
    if minimum_length is not None:
        params["validate"] = [validators.Length(min=2)]
So we'll need code changes to support search filtering on integer fields, right?
Yes, any other types would require more changes; it just happens that all we want to support right now are strings in the same `one_of` pattern.
When we switch to using a search index this might be adjusted a bit, but I think this is closer to what we'd want there.
one_of_opportunity_statuses = filters.opportunity_status.get("one_of")

if one_of_opportunity_statuses:
    stmt = stmt.where(
I would be expecting to see `stmt += stmt.where(` rather than `stmt = stmt.where(` here, to support multiple WHERE filters being added. Can you help me understand how compound filters would work with this code?
The filters / configs / etc. attached to a statement don't actually modify the statement; each call returns a new one.
You can't just do:
stmt = select(Opportunity)
stmt.where(Opportunity.opportunity_id == 5)
# stmt will still just represent `select * from opportunity`
as that won't modify the statement in place. Even looking at some of the internals I'm not entirely sure why, but it looks like whenever you call one of these methods it effectively makes a new select object (we take advantage of this when doing the count query in pagination by modifying the query for counting).
And we can't do `+=` because they didn't implement the method to do so.
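A minimal runnable sketch of that immutability, using a toy table rather than the real model:

```python
# Demonstrates that Select.where() returns a new statement rather than
# mutating the original. Table/column names are illustrative.
from sqlalchemy import Column, Integer, MetaData, Table, select

metadata = MetaData()
opportunity = Table(
    "opportunity", metadata, Column("opportunity_id", Integer, primary_key=True)
)

stmt = select(opportunity)
# .where() does not mutate stmt; it returns a brand-new Select object,
# which is why the code assigns the result back: stmt = stmt.where(...)
filtered = stmt.where(opportunity.c.opportunity_id == 5)

assert filtered is not stmt
assert "WHERE" not in str(stmt)   # original statement unchanged
assert "WHERE" in str(filtered)   # new statement carries the clause
```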
self.schema_fields: dict[str, fields.MixinField] = {}
self.schema_class_name = schema_class_name

def with_one_of(
Is the plan to add `with_all_of` as a filter type as well?
We could add that if needed, although I have no idea how the query to the DB would actually work for that.
Looks good. I didn't review all the tests in detail.
)

if filters.funding_instrument is not None:
    stmt = stmt.join(LinkOpportunitySummaryFundingInstrument)
Should this go inside the `if` below too, so that we only `join` if it's going to have a `where`?
The idea is that if we have multiple filters for a field, the join is added for them collectively; we just only have one filter per field right now.
I should, however, make the filter model require at least one of these to be set, so that we never add a join for no reason. I'll make a follow-up ticket to add that.
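That pattern (join the link table once per filtered field, then add each sub-filter) can be sketched like this; the table shapes and join key are simplified assumptions, not the PR's actual schema:

```python
# Sketch of the conditional-join pattern: the join is only added when at
# least one filter for the field is present. Names are illustrative.
from sqlalchemy import Column, Integer, MetaData, Table, select

metadata = MetaData()
opportunity = Table(
    "opportunity", metadata, Column("opportunity_id", Integer, primary_key=True)
)
link_fi = Table(
    "link_opportunity_summary_funding_instrument",
    metadata,
    Column("opportunity_id", Integer),  # simplified join key for the sketch
    Column("funding_instrument_id", Integer),
)

def apply_funding_instrument_filter(stmt, filters: dict):
    """Join once for the field, then attach each of its sub-filters."""
    field = filters.get("funding_instrument")
    if field is None:
        return stmt  # no filter for this field -> no join at all
    stmt = stmt.join(
        link_fi, link_fi.c.opportunity_id == opportunity.c.opportunity_id
    )
    one_of = field.get("one_of")
    if one_of:
        stmt = stmt.where(link_fi.c.funding_instrument_id.in_(one_of))
    return stmt

base = select(opportunity)
unfiltered = apply_funding_instrument_filter(base, {})
filtered = apply_funding_instrument_filter(
    base, {"funding_instrument": {"one_of": [1, 2]}}
)
```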
Summary
Fixes #1167
Time to review: 15 mins
Changes proposed
Added filtering to the search endpoint; includes all but the query box parameter, which has its own follow-up ticket
Added utilities to help generate the search filter schema
Added indexes to improve the performance of search (see additional info below for details)
Extensive additions to the tests
Added the ability to choose examples on the OpenAPI docs (included an example with no filters, and one with many)
Fixed a bug in the Paginator for handling counts (will follow-up and fix in the template repo)
Context for reviewers
This change has been extensively tested, both manually and through an enormous number of new unit tests. As the change was already getting quite large, a few things will be dealt with in follow-up tickets.

All of the filters are `one_of` filters, which means that only one of the supplied values needs to match for a row to pass the where clause (literally, the generated where clauses look like `where table.column in (1, 2, 3)`). You can see an example query below.

The agency filter is a bit odd, as I made it a `startswith`-style filter instead, to handle the way agency codes get nested. We may want to adjust this further in the future, but it will at least technically handle hierarchies of agencies right now.

Additional information
I extensively tested the performance of the queries we run. I locally loaded in ~11k records using our factories (ran the `seed-local-db` script 300 times). With the API functioning, I made SQLAlchemy output the queries it ran and did an `EXPLAIN ANALYZE ...` on the big ones. I then added several indexes, which improved the performance.

The primary query of the API looks like this:

Without any of the new indexes, `EXPLAIN ANALYZE` gives this a cost of ~1100 (non-specific unit). With the new indexes it becomes ~800. The actual runtime of these queries is in the 5-10ms range with or without the indexes, so it's minor either way. Note that querying the API locally gives response times of 50-150ms (slower initially, before caching likely takes hold). Also, if we're just filtering by something like opportunity status, the costs are around 10-15.

See: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE