
[Issue #1167] Add filtering to the search endpoint #1468

Merged

chouinar merged 24 commits into main from chouinar/1167-add-filtering on Mar 18, 2024

Conversation

chouinar
Collaborator

Summary

Fixes #1167

Time to review: 15 mins

Changes proposed

Added filtering to the search endpoint; this includes all filters except the query box parameter, which has its own follow-up ticket

Added utilities to help generate the search filter schema

Added indexes to improve the performance of search (see additional info below for details)

Extensive additions to the tests

Added the ability to choose examples on the OpenAPI docs (included an example with no filters, and one with many)

Fixed a bug in the Paginator for handling counts (will follow up and fix in the template repo)

Context for reviewers

This change has been tested extensively, both manually and through a large number of new unit tests. As the change was already getting quite large, a few things will be dealt with in follow-up tickets:

The filters are all one_of filters, which means that only one of the supplied values needs to match for a row to pass the where clause (the where clauses literally generate as where table.column in (1, 2, 3)). You can see an example query below.

The agency filter is a bit odd, as I made it a startswith-style filter instead to handle the way agency codes get nested. We may want to adjust this further in the future, but it at least handles hierarchies of agencies for now.

Additional information

I extensively tested the performance of the queries we run. I locally loaded in ~11k records using our factories (ran the seed-local-db script 300 times). With the API running, I had SQLAlchemy output the queries it ran and did an EXPLAIN ANALYZE ... on the big ones. I then added several indexes, which improved the performance.

The primary query of the API looks like this:

SELECT
	opportunity.opportunity_id,
	opportunity.opportunity_number,
	opportunity.opportunity_title,
	opportunity.agency,
	opportunity.opportunity_category_id,
	opportunity.category_explanation,
	opportunity.is_draft,
	opportunity.revision_number,
	opportunity.modified_comments,
	opportunity.publisher_user_id,
	opportunity.publisher_profile_id,
	opportunity.created_at,
	opportunity.updated_at
FROM
	opportunity
	JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
	JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id
	JOIN link_opportunity_summary_funding_instrument ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id
	JOIN link_opportunity_summary_funding_category ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id
	JOIN link_opportunity_summary_applicant_type ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id
WHERE
	opportunity.is_draft IS FALSE
	AND(EXISTS (
			SELECT
				1 FROM current_opportunity_summary
			WHERE
				opportunity.opportunity_id = current_opportunity_summary.opportunity_id))
	AND current_opportunity_summary.opportunity_status_id IN(1,2)
	AND link_opportunity_summary_funding_instrument.funding_instrument_id IN(1,2)
	AND link_opportunity_summary_funding_category.funding_category_id IN(1,3,20)
	AND link_opportunity_summary_applicant_type.applicant_type_id IN(1, 2, 13)
	AND((opportunity.agency ILIKE 'US-ABC%')
	OR(opportunity.agency ILIKE 'HHS%'))
ORDER BY
	opportunity.opportunity_id DESC
LIMIT 25 OFFSET 25

Without any of the new indexes, EXPLAIN ANALYZE gives this a cost of ~1100 (a unitless estimate). With the new indexes it drops to ~800. The actual runtime of these queries is in the 5-10ms range with or without the indexes, so the impact is minor either way. Note that when querying the API locally, response times are 50-150ms (slower initially, before caching likely takes hold). Also, if we're only filtering by something like opportunity status, the costs are around 10-15.
See: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE
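As a rough sketch (not part of the PR; the connection URL and query here are placeholders), the measurement above can be reproduced by turning on SQLAlchemy's statement logging and wrapping the captured SQL in EXPLAIN ANALYZE:

from sqlalchemy import create_engine, text

# echo=True makes SQLAlchemy log every SQL statement it emits, which is how
# the generated query above was captured.
engine = create_engine("postgresql://user:pass@localhost:5432/app", echo=True)

with engine.connect() as conn:
    # Wrap the captured query in EXPLAIN ANALYZE to see the planner's cost
    # estimates alongside the actual execution time.
    result = conn.execute(
        text("EXPLAIN ANALYZE SELECT * FROM opportunity WHERE opportunity.is_draft IS FALSE")
    )
    for row in result:
        print(row[0])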

"link_opportunity_summary_funding_category",
["funding_category_id"],
unique=False,
)
Contributor

Would any of the new indexes cause performance issues on writes? Or I guess the opportunities are relatively stable and don't need bulk updates too often.

Collaborator Author

I wouldn't expect to see any issues, and adjusting the indexes later is low effort. From past experience, testing against a local DB versus one in AWS is a bit different in terms of performance. We can evaluate more once we have the data in the dev DB.

Testing locally with ~10k records, adding/removing the indexes is basically instant. Prod data will be about 80k records, so roughly 10x that. I don't think our DB is anywhere near "big data" scale where we'd need to heavily optimize this, but if we do, we can easily add/remove indexes uneventfully.
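For illustration only (a hypothetical migration, not the one in this PR; the revision identifiers are placeholders), adding or dropping one of these indexes is a small, reversible Alembic operation:

from alembic import op

# Placeholder revision identifiers for this sketch.
revision = "0123abcd"
down_revision = "4567efgh"


def upgrade() -> None:
    # Creating the index is a single DDL statement; at ~10k-80k rows it
    # completes almost instantly.
    op.create_index("opportunity_agency_idx", "opportunity", ["agency"], unique=False)


def downgrade() -> None:
    # Dropping it is equally cheap if the write overhead ever becomes a concern.
    op.drop_index("opportunity_agency_idx", table_name="opportunity")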

# ..WHERE ((opportunity.agency ILIKE 'US-ABC%') OR (opportunity.agency ILIKE 'HHS%'))
stmt = stmt.where(
    or_(*[Opportunity.agency.istartswith(agency) for agency in one_of_agencies])
)
Contributor

@rylew1 rylew1 Mar 13, 2024

It looks like you have an index on agency. Would this do an index scan but not a direct index lookup? Would a direct index lookup on agency with == instead of ilike be faster?

Also, I was reading that ilike does a case-insensitive lower(column_name) search. Is the agency index set up with lowercase to accommodate that?

Collaborator Author

@chouinar chouinar Mar 14, 2024

Did some testing (see below). No difference between like and ilike from what I can see. Changing to an exact match was much better for performance (about 20x on execution time): ilike does a sequential scan, while = does a bitmap index scan.

If we don't need to support agency nesting yet (e.g. a filter of HHS not also matching HHS-XYZ), then we can make it an exact match, although I'm not too concerned about the performance.


A quick check, using the query in the summary but with only the is_draft, current_opportunity_summary, and agency filters (e.g. the agency clause is AND((opportunity.agency ILIKE 'US-MMM%') OR (opportunity.agency ILIKE 'HHS%'))), gives:

Limit  (cost=357.66..357.67 rows=1 width=207) (actual time=9.897..9.899 rows=0 loops=1)
  ->  Sort  (cost=357.65..357.66 rows=4 width=207) (actual time=9.896..9.897 rows=0 loops=1)
        Sort Key: opportunity.opportunity_id DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Nested Loop  (cost=1.71..357.61 rows=4 width=207) (actual time=9.855..9.857 rows=0 loops=1)
              Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id)
              ->  Nested Loop  (cost=1.43..356.55 rows=3 width=239) (actual time=9.855..9.856 rows=0 loops=1)
                    Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id)
                    ->  Nested Loop  (cost=1.14..355.84 rows=2 width=231) (actual time=9.854..9.855 rows=0 loops=1)
                          ->  Nested Loop Semi Join  (cost=0.86..355.48 rows=1 width=223) (actual time=9.854..9.855 rows=0 loops=1)
                                Join Filter: (opportunity.opportunity_id = current_opportunity_summary_1.opportunity_id)
                                ->  Nested Loop  (cost=0.57..355.15 rows=1 width=231) (actual time=9.854..9.854 rows=0 loops=1)
                                      ->  Nested Loop  (cost=0.29..354.81 rows=1 width=223) (actual time=9.853..9.854 rows=0 loops=1)
                                            ->  Seq Scan on opportunity  (cost=0.00..346.50 rows=1 width=207) (actual time=9.853..9.853 rows=0 loops=1)
                                                  Filter: ((is_draft IS FALSE) AND ((agency ~~* 'US-MMM%'::text) OR (agency ~~* 'HHS%'::text)))
                                                  Rows Removed by Filter: 10500
                                            ->  Index Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary  (cost=0.29..8.30 rows=1 width=16) (never executed)
                                                  Index Cond: (opportunity_id = opportunity.opportunity_id)
                                      ->  Index Only Scan using opportunity_summary_pkey on opportunity_summary  (cost=0.29..0.34 rows=1 width=8) (never executed)
                                            Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
                                            Heap Fetches: 0
                                ->  Index Only Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary current_opportunity_summary_1  (cost=0.29..0.32 rows=1 width=8) (never executed)
                                      Index Cond: (opportunity_id = current_opportunity_summary.opportunity_id)
                                      Heap Fetches: 0
                          ->  Index Only Scan using link_opportunity_summary_applicant_type_opportunity_sum_fee2 on link_opportunity_summary_applicant_type  (cost=0.29..0.34 rows=2 width=8) (never executed)
                                Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
                                Heap Fetches: 0
                    ->  Index Only Scan using link_opportunity_summary_funding_instrument_opportunity_4597 on link_opportunity_summary_funding_instrument  (cost=0.29..0.33 rows=2 width=8) (never executed)
                          Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
                          Heap Fetches: 0
              ->  Index Only Scan using link_opportunity_summary_funding_category_opportunity_s_9f79 on link_opportunity_summary_funding_category  (cost=0.29..0.33 rows=2 width=8) (never executed)
                    Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
                    Heap Fetches: 0
Planning Time: 27.059 ms
Execution Time: 10.096 ms

Changing those queries to = rather than ILIKE (e.g. AND((opportunity.agency = 'US-MMM') OR (opportunity.agency = 'HHS'))) gives:

Limit  (cost=23.76..23.77 rows=1 width=207) (actual time=0.346..0.348 rows=0 loops=1)
  ->  Sort  (cost=23.75..23.76 rows=4 width=207) (actual time=0.345..0.346 rows=0 loops=1)
        Sort Key: opportunity.opportunity_id DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Nested Loop  (cost=10.30..23.71 rows=4 width=207) (actual time=0.321..0.322 rows=0 loops=1)
              Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id)
              ->  Nested Loop  (cost=10.01..22.65 rows=3 width=239) (actual time=0.321..0.322 rows=0 loops=1)
                    Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id)
                    ->  Nested Loop  (cost=9.73..21.94 rows=2 width=231) (actual time=0.321..0.321 rows=0 loops=1)
                          ->  Nested Loop Semi Join  (cost=9.44..21.58 rows=1 width=223) (actual time=0.320..0.321 rows=0 loops=1)
                                Join Filter: (opportunity.opportunity_id = current_opportunity_summary_1.opportunity_id)
                                ->  Nested Loop  (cost=9.16..21.25 rows=1 width=231) (actual time=0.320..0.321 rows=0 loops=1)
                                      ->  Nested Loop  (cost=8.87..20.91 rows=1 width=223) (actual time=0.320..0.321 rows=0 loops=1)
                                            ->  Bitmap Heap Scan on opportunity  (cost=8.59..12.60 rows=1 width=207) (actual time=0.320..0.320 rows=0 loops=1)
                                                  Recheck Cond: ((agency = 'US-MMM'::text) OR (agency = 'HHS'::text))
                                                  Filter: (is_draft IS FALSE)
                                                  ->  BitmapOr  (cost=8.59..8.59 rows=1 width=0) (actual time=0.319..0.319 rows=0 loops=1)
                                                        ->  Bitmap Index Scan on opportunity_agency_idx  (cost=0.00..4.29 rows=1 width=0) (actual time=0.301..0.301 rows=0 loops=1)
                                                              Index Cond: (agency = 'US-MMM'::text)
                                                        ->  Bitmap Index Scan on opportunity_agency_idx  (cost=0.00..4.29 rows=1 width=0) (actual time=0.017..0.017 rows=0 loops=1)
                                                              Index Cond: (agency = 'HHS'::text)
                                            ->  Index Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary  (cost=0.29..8.30 rows=1 width=16) (never executed)
                                                  Index Cond: (opportunity_id = opportunity.opportunity_id)
                                      ->  Index Only Scan using opportunity_summary_pkey on opportunity_summary  (cost=0.29..0.34 rows=1 width=8) (never executed)
                                            Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
                                            Heap Fetches: 0
                                ->  Index Only Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary current_opportunity_summary_1  (cost=0.29..0.32 rows=1 width=8) (never executed)
                                      Index Cond: (opportunity_id = current_opportunity_summary.opportunity_id)
                                      Heap Fetches: 0
                          ->  Index Only Scan using link_opportunity_summary_applicant_type_opportunity_sum_fee2 on link_opportunity_summary_applicant_type  (cost=0.29..0.34 rows=2 width=8) (never executed)
                                Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
                                Heap Fetches: 0
                    ->  Index Only Scan using link_opportunity_summary_funding_instrument_opportunity_4597 on link_opportunity_summary_funding_instrument  (cost=0.29..0.33 rows=2 width=8) (never executed)
                          Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
                          Heap Fetches: 0
              ->  Index Only Scan using link_opportunity_summary_funding_category_opportunity_s_9f79 on link_opportunity_summary_funding_category  (cost=0.29..0.33 rows=2 width=8) (never executed)
                    Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
                    Heap Fetches: 0
Planning Time: 25.776 ms
Execution Time: 0.652 ms

Base automatically changed from chouinar/1364-update-api-schema to main March 14, 2024 16:19
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Mar 14, 2024
Collaborator

@coilysiren coilysiren left a comment

Figured I would try my best to help out with a review! I have some questions for my own understanding. Feel free to wait on someone with more context for a "real" approval!

Comment on lines +58 to +62
# We assume it's just a list of strings
if allowed_values is None:
    params: dict = {"metadata": metadata}
    if minimum_length is not None:
        params["validate"] = [validators.Length(min=2)]
Collaborator

So we'll need code changes to support search filtering on integer fields, right?

Collaborator Author

Yes, any other types would require more changes. Just happens that all we want to support now are strings in the same one_of pattern.

When we switch to using a search index, this might be adjusted a bit, but I think this is closer to what we'd want there.

one_of_opportunity_statuses = filters.opportunity_status.get("one_of")

if one_of_opportunity_statuses:
    stmt = stmt.where(
Collaborator

I would be expecting to see...

Suggested change
stmt = stmt.where(
stmt += stmt.where(

...here, to support multiple WHERE filters being added. Can you help me understand how compound filters would work with this code?

Collaborator Author

Attaching any of the filters / configs / etc. to a statement doesn't actually modify the statement; it returns a new one instead.

You can't just do:

stmt = select(Opportunity)
stmt.where(Opportunity.opportunity_id == 5)  # the returned (new) statement is discarded

# stmt will still just represent `select * from opportunity`

as that won't modify the statement in place. Even looking at some of the internals, I'm not entirely sure why, but it looks like whenever you call one of these methods it effectively makes a new select object (we take advantage of this when doing the count query in pagination by modifying the query for counting).

And we can't do += because they didn't implement the method to do so.
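A minimal sketch of the working pattern (assumed filter values; not copied from the PR): reassign the statement on every call, and successive .where() calls get ANDed together.

from sqlalchemy import select, or_

stmt = select(Opportunity)

# Each .where() returns a brand new Select, so the result has to be reassigned.
stmt = stmt.where(Opportunity.is_draft.is_(False))

# Successive .where() calls are combined with AND, which is how the compound
# filters on this endpoint build up.
stmt = stmt.where(
    or_(Opportunity.agency.istartswith("US-ABC"), Opportunity.agency.istartswith("HHS"))
)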

        self.schema_fields: dict[str, fields.MixinField] = {}
        self.schema_class_name = schema_class_name

    def with_one_of(
Collaborator

Is the plan to add with_all_of as a filter type as well?

Collaborator Author

We could add that if needed, although I have no idea how the query to the DB would actually work for that.

Collaborator

@jamesbursa jamesbursa left a comment

Looks good. I didn't review all the tests in detail.

)

if filters.funding_instrument is not None:
    stmt = stmt.join(LinkOpportunitySummaryFundingInstrument)
Collaborator

Should this go inside the if below too? So that we only join if it's going to have a where.

Collaborator Author

The idea is that if we have multiple filters for a field, the join gets added once for all of them collectively; we just only have one filter per field right now.

I should, however, make the filter model require at least one of these to be set, so that we never add a join for no reason. I'll make a follow-up ticket to add that.
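A sketch of the structure being described (assumed names, not copied verbatim from the PR): the join is added once when the funding_instrument filter is present, and each sub-filter then contributes its own where clause.

if filters.funding_instrument is not None:
    # One join covers every funding_instrument sub-filter below.
    stmt = stmt.join(LinkOpportunitySummaryFundingInstrument)

    one_of_funding_instruments = filters.funding_instrument.get("one_of")
    if one_of_funding_instruments:
        # WHERE link_opportunity_summary_funding_instrument.funding_instrument_id IN (...)
        stmt = stmt.where(
            LinkOpportunitySummaryFundingInstrument.funding_instrument_id.in_(
                one_of_funding_instruments
            )
        )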

api/tests/conftest.py (outdated review thread, resolved)
@chouinar chouinar merged commit 15f9d59 into main Mar 18, 2024
8 checks passed
@chouinar chouinar deleted the chouinar/1167-add-filtering branch March 18, 2024 15:26
Labels
api, database, documentation (Improvements or additions to documentation), python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task]: Add filtering and update the request model of the search endpoint
5 participants