
[Issue #1167] Add filtering to the search endpoint #1468

Merged

chouinar merged 24 commits into main from chouinar/1167-add-filtering on Mar 18, 2024

Conversation

chouinar
Collaborator

Summary

Fixes #1167

Time to review: 15 mins

Changes proposed

Added filtering to the search endpoint; this includes all filters except the query box parameter, which has its own follow-up ticket

Added utilities to help generate the search filter schema

Added indexes to improve the performance of search (see additional info below for details)

Extensive additions to the tests

Added the ability to choose examples on the OpenAPI docs (included an example with no filters, and one with many)

Fixed a bug in the Paginator for handling counts (will follow up and fix in the template repo)

Context for reviewers

This change has been tested extensively, both manually and through a large number of new unit tests. As the change was already getting quite large, a few things will be dealt with in follow-up tickets:

The filters are all one_of filters, which means that only one of the supplied values needs to match for a row to pass the where clause (the where clauses literally generate as where table.column in (1, 2, 3)). You can see an example query below.

The agency filter is a bit odd, as I made it a startswith-style filter instead to handle the way agency codes get nested. We may want to adjust this further in the future, but it at least handles hierarchies of agencies for now.

Additional information

I extensively tested the performance of the queries we run. I locally loaded in ~11k records using our factories (ran the seed-local-db script 300 times). With the API running, I had SQLAlchemy output the queries it ran and did an EXPLAIN ANALYZE ... on the big ones. I then added several indexes, which improved the performance.

The primary query of the API looks like this:

SELECT
	opportunity.opportunity_id,
	opportunity.opportunity_number,
	opportunity.opportunity_title,
	opportunity.agency,
	opportunity.opportunity_category_id,
	opportunity.category_explanation,
	opportunity.is_draft,
	opportunity.revision_number,
	opportunity.modified_comments,
	opportunity.publisher_user_id,
	opportunity.publisher_profile_id,
	opportunity.created_at,
	opportunity.updated_at
FROM
	opportunity
	JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
	JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id
	JOIN link_opportunity_summary_funding_instrument ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id
	JOIN link_opportunity_summary_funding_category ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id
	JOIN link_opportunity_summary_applicant_type ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id
WHERE
	opportunity.is_draft IS FALSE
	AND(EXISTS (
			SELECT
				1 FROM current_opportunity_summary
			WHERE
				opportunity.opportunity_id = current_opportunity_summary.opportunity_id))
	AND current_opportunity_summary.opportunity_status_id IN(1,2)
	AND link_opportunity_summary_funding_instrument.funding_instrument_id IN(1,2)
	AND link_opportunity_summary_funding_category.funding_category_id IN(1,3,20)
	AND link_opportunity_summary_applicant_type.applicant_type_id IN(1, 2, 13)
	AND((opportunity.agency ILIKE 'US-ABC%')
	OR(opportunity.agency ILIKE 'HHS%'))
ORDER BY
	opportunity.opportunity_id DESC
LIMIT 25 OFFSET 25

Without any of the new indexes, EXPLAIN ANALYZE gives this a cost of ~1100 (a unitless estimate). With the new indexes it drops to ~800. The actual runtime of these queries is in the 5-10ms range with or without the indexes, so the impact is minor either way. Note that when querying the API locally, response times are 50-150ms (slower initially, before caching likely takes hold). Also, if we're only filtering by something like opportunity status, the costs are around 10-15.
See: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE
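As a rough sketch (not part of the PR; the connection URL and query here are placeholders), the measurement above can be reproduced by turning on SQLAlchemy's statement logging and wrapping the captured SQL in EXPLAIN ANALYZE:

from sqlalchemy import create_engine, text

# echo=True makes SQLAlchemy log every SQL statement it emits, which is how
# the generated query above was captured.
engine = create_engine("postgresql://user:pass@localhost:5432/app", echo=True)

with engine.connect() as conn:
    # Wrap the captured query in EXPLAIN ANALYZE to see the planner's cost
    # estimates alongside the actual execution time.
    result = conn.execute(
        text("EXPLAIN ANALYZE SELECT * FROM opportunity WHERE opportunity.is_draft IS FALSE")
    )
    for row in result:
        print(row[0])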

"link_opportunity_summary_funding_category",
["funding_category_id"],
unique=False,
)
Contributor

Would any of the new indexes cause performance issues on writes? Or I guess the opportunities are relatively stable and don't need bulk updates too often.

Collaborator Author

I wouldn't expect to see any issues, and adjusting the indexes later is low effort. From past experience, testing against a local DB versus one in AWS is a bit different in terms of performance. We can evaluate more once we have the data in the dev DB.

Testing locally with ~10k records, adding/removing the indexes is basically instant. Prod data will be about 80k records, so roughly 10x that. I don't think our DB is anywhere near "big data" scale where we'd need to heavily optimize this, but if we do, we can easily add/remove indexes uneventfully.
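For illustration only (a hypothetical migration, not the one in this PR; the revision identifiers are placeholders), adding or dropping one of these indexes is a small, reversible Alembic operation:

from alembic import op

# Placeholder revision identifiers for this sketch.
revision = "0123abcd"
down_revision = "4567efgh"


def upgrade() -> None:
    # Creating the index is a single DDL statement; at ~10k-80k rows it
    # completes almost instantly.
    op.create_index("opportunity_agency_idx", "opportunity", ["agency"], unique=False)


def downgrade() -> None:
    # Dropping it is equally cheap if the write overhead ever becomes a concern.
    op.drop_index("opportunity_agency_idx", table_name="opportunity")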

# ..WHERE ((opportunity.agency ILIKE 'US-ABC%') OR (opportunity.agency ILIKE 'HHS%'))
stmt = stmt.where(
    or_(*[Opportunity.agency.istartswith(agency) for agency in one_of_agencies])
)
Contributor

@rylew1 rylew1 Mar 13, 2024

It looks like you have an index on agency. Would this do an index scan but not a direct index lookup? Would a direct index lookup on agency with == instead of ilike be faster?

Also, I was reading that ilike does a case-insensitive lower(column_name) search. Is the agency index set up with lowercase to accommodate that?

Collaborator Author

@chouinar chouinar Mar 14, 2024

Did some testing (see below). No difference between like and ilike from what I can see. Changing to an exact match was much better for performance (about 20x on execution time): ilike does a sequential scan, while = does a bitmap index scan.

If we don't need to support agency nesting yet (e.g. a filter of HHS not also matching HHS-XYZ), then we can make it an exact match, although I'm not too concerned about the performance.


A quick check, using the query in the summary but with only the is_draft, current_opportunity_summary, and agency filters (e.g. the agency clause is AND((opportunity.agency ILIKE 'US-MMM%') OR (opportunity.agency ILIKE 'HHS%'))), gives:

Limit  (cost=357.66..357.67 rows=1 width=207) (actual time=9.897..9.899 rows=0 loops=1)
  ->  Sort  (cost=357.65..357.66 rows=4 width=207) (actual time=9.896..9.897 rows=0 loops=1)
        Sort Key: opportunity.opportunity_id DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Nested Loop  (cost=1.71..357.61 rows=4 width=207) (actual time=9.855..9.857 rows=0 loops=1)
              Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id)
              ->  Nested Loop  (cost=1.43..356.55 rows=3 width=239) (actual time=9.855..9.856 rows=0 loops=1)
                    Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id)
                    ->  Nested Loop  (cost=1.14..355.84 rows=2 width=231) (actual time=9.854..9.855 rows=0 loops=1)
                          ->  Nested Loop Semi Join  (cost=0.86..355.48 rows=1 width=223) (actual time=9.854..9.855 rows=0 loops=1)
                                Join Filter: (opportunity.opportunity_id = current_opportunity_summary_1.opportunity_id)
                                ->  Nested Loop  (cost=0.57..355.15 rows=1 width=231) (actual time=9.854..9.854 rows=0 loops=1)
                                      ->  Nested Loop  (cost=0.29..354.81 rows=1 width=223) (actual time=9.853..9.854 rows=0 loops=1)
                                            ->  Seq Scan on opportunity  (cost=0.00..346.50 rows=1 width=207) (actual time=9.853..9.853 rows=0 loops=1)
                                                  Filter: ((is_draft IS FALSE) AND ((agency ~~* 'US-MMM%'::text) OR (agency ~~* 'HHS%'::text)))
                                                  Rows Removed by Filter: 10500
                                            ->  Index Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary  (cost=0.29..8.30 rows=1 width=16) (never executed)
                                                  Index Cond: (opportunity_id = opportunity.opportunity_id)
                                      ->  Index Only Scan using opportunity_summary_pkey on opportunity_summary  (cost=0.29..0.34 rows=1 width=8) (never executed)
                                            Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
                                            Heap Fetches: 0
                                ->  Index Only Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary current_opportunity_summary_1  (cost=0.29..0.32 rows=1 width=8) (never executed)
                                      Index Cond: (opportunity_id = current_opportunity_summary.opportunity_id)
                                      Heap Fetches: 0
                          ->  Index Only Scan using link_opportunity_summary_applicant_type_opportunity_sum_fee2 on link_opportunity_summary_applicant_type  (cost=0.29..0.34 rows=2 width=8) (never executed)
                                Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
                                Heap Fetches: 0
                    ->  Index Only Scan using link_opportunity_summary_funding_instrument_opportunity_4597 on link_opportunity_summary_funding_instrument  (cost=0.29..0.33 rows=2 width=8) (never executed)
                          Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
                          Heap Fetches: 0
              ->  Index Only Scan using link_opportunity_summary_funding_category_opportunity_s_9f79 on link_opportunity_summary_funding_category  (cost=0.29..0.33 rows=2 width=8) (never executed)
                    Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
                    Heap Fetches: 0
Planning Time: 27.059 ms
Execution Time: 10.096 ms

Changing those queries to = rather than ILIKE (e.g. AND((opportunity.agency = 'US-MMM') OR (opportunity.agency = 'HHS'))) gives:

Limit  (cost=23.76..23.77 rows=1 width=207) (actual time=0.346..0.348 rows=0 loops=1)
  ->  Sort  (cost=23.75..23.76 rows=4 width=207) (actual time=0.345..0.346 rows=0 loops=1)
        Sort Key: opportunity.opportunity_id DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Nested Loop  (cost=10.30..23.71 rows=4 width=207) (actual time=0.321..0.322 rows=0 loops=1)
              Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id)
              ->  Nested Loop  (cost=10.01..22.65 rows=3 width=239) (actual time=0.321..0.322 rows=0 loops=1)
                    Join Filter: (current_opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id)
                    ->  Nested Loop  (cost=9.73..21.94 rows=2 width=231) (actual time=0.321..0.321 rows=0 loops=1)
                          ->  Nested Loop Semi Join  (cost=9.44..21.58 rows=1 width=223) (actual time=0.320..0.321 rows=0 loops=1)
                                Join Filter: (opportunity.opportunity_id = current_opportunity_summary_1.opportunity_id)
                                ->  Nested Loop  (cost=9.16..21.25 rows=1 width=231) (actual time=0.320..0.321 rows=0 loops=1)
                                      ->  Nested Loop  (cost=8.87..20.91 rows=1 width=223) (actual time=0.320..0.321 rows=0 loops=1)
                                            ->  Bitmap Heap Scan on opportunity  (cost=8.59..12.60 rows=1 width=207) (actual time=0.320..0.320 rows=0 loops=1)
                                                  Recheck Cond: ((agency = 'US-MMM'::text) OR (agency = 'HHS'::text))
                                                  Filter: (is_draft IS FALSE)
                                                  ->  BitmapOr  (cost=8.59..8.59 rows=1 width=0) (actual time=0.319..0.319 rows=0 loops=1)
                                                        ->  Bitmap Index Scan on opportunity_agency_idx  (cost=0.00..4.29 rows=1 width=0) (actual time=0.301..0.301 rows=0 loops=1)
                                                              Index Cond: (agency = 'US-MMM'::text)
                                                        ->  Bitmap Index Scan on opportunity_agency_idx  (cost=0.00..4.29 rows=1 width=0) (actual time=0.017..0.017 rows=0 loops=1)
                                                              Index Cond: (agency = 'HHS'::text)
                                            ->  Index Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary  (cost=0.29..8.30 rows=1 width=16) (never executed)
                                                  Index Cond: (opportunity_id = opportunity.opportunity_id)
                                      ->  Index Only Scan using opportunity_summary_pkey on opportunity_summary  (cost=0.29..0.34 rows=1 width=8) (never executed)
                                            Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
                                            Heap Fetches: 0
                                ->  Index Only Scan using current_opportunity_summary_opportunity_id_idx on current_opportunity_summary current_opportunity_summary_1  (cost=0.29..0.32 rows=1 width=8) (never executed)
                                      Index Cond: (opportunity_id = current_opportunity_summary.opportunity_id)
                                      Heap Fetches: 0
                          ->  Index Only Scan using link_opportunity_summary_applicant_type_opportunity_sum_fee2 on link_opportunity_summary_applicant_type  (cost=0.29..0.34 rows=2 width=8) (never executed)
                                Index Cond: (opportunity_summary_id = current_opportunity_summary.opportunity_summary_id)
                                Heap Fetches: 0
                    ->  Index Only Scan using link_opportunity_summary_funding_instrument_opportunity_4597 on link_opportunity_summary_funding_instrument  (cost=0.29..0.33 rows=2 width=8) (never executed)
                          Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
                          Heap Fetches: 0
              ->  Index Only Scan using link_opportunity_summary_funding_category_opportunity_s_9f79 on link_opportunity_summary_funding_category  (cost=0.29..0.33 rows=2 width=8) (never executed)
                    Index Cond: (opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id)
                    Heap Fetches: 0
Planning Time: 25.776 ms
Execution Time: 0.652 ms

Base automatically changed from chouinar/1364-update-api-schema to main March 14, 2024 16:19
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Mar 14, 2024
Collaborator

@coilysiren coilysiren left a comment

Figured I would try my best to help out with a review! I have some questions for my own understanding. Feel free to wait on someone with more context for a "real" approval!

Comment on lines +58 to +62
# We assume it's just a list of strings
if allowed_values is None:
    params: dict = {"metadata": metadata}
    if minimum_length is not None:
        params["validate"] = [validators.Length(min=2)]
Collaborator

So we'll need code changes to support search filtering on integer fields, right?

Collaborator Author

Yes, any other types would require more changes. Just happens that all we want to support now are strings in the same one_of pattern.

When we switch to using a search index, this might be adjusted a bit, but I think this is closer to what we'd want there.

one_of_opportunity_statuses = filters.opportunity_status.get("one_of")

if one_of_opportunity_statuses:
    stmt = stmt.where(
Collaborator

I would be expecting to see...

Suggested change
stmt = stmt.where(
stmt += stmt.where(

...here, to support multiple WHERE filters being added. Can you help me understand how compound filters would work with this code?

Collaborator Author

Attaching any of the filters / configs / etc. to a statement doesn't actually modify the statement; it returns a new one instead.

You can't just do:

stmt = select(Opportunity)
stmt.where(Opportunity.opportunity_id == 5)  # the returned (new) statement is discarded

# stmt will still just represent `select * from opportunity`

as that won't modify the statement in place. Even looking at some of the internals, I'm not entirely sure why, but it looks like whenever you call one of these methods it effectively makes a new select object (we take advantage of this when doing the count query in pagination by modifying the query for counting).

And we can't do += because they didn't implement the method to do so.
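A minimal sketch of the working pattern (assumed filter values; not copied from the PR): reassign the statement on every call, and successive .where() calls get ANDed together.

from sqlalchemy import select, or_

stmt = select(Opportunity)

# Each .where() returns a brand new Select, so the result has to be reassigned.
stmt = stmt.where(Opportunity.is_draft.is_(False))

# Successive .where() calls are combined with AND, which is how the compound
# filters on this endpoint build up.
stmt = stmt.where(
    or_(Opportunity.agency.istartswith("US-ABC"), Opportunity.agency.istartswith("HHS"))
)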

        self.schema_fields: dict[str, fields.MixinField] = {}
        self.schema_class_name = schema_class_name

    def with_one_of(
Collaborator

Is the plan to add with_all_of as a filter type as well?

Collaborator Author

We could add that if needed, although I have no idea how the query to the DB would actually work for that.

Collaborator

@jamesbursa jamesbursa left a comment

Looks good. I didn't review all the tests in detail.

)

if filters.funding_instrument is not None:
    stmt = stmt.join(LinkOpportunitySummaryFundingInstrument)
Collaborator

Should this go inside the if below too? So that we only join if it's going to have a where.

Collaborator Author

The idea is that if we have multiple filters for a field, the join gets added once for all of them collectively; we just only have one filter per field right now.

I should, however, make the filter model require at least one of these to be set, so that we never add a join for no reason. I'll make a follow-up ticket to add that.
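A sketch of the structure being described (assumed names, not copied verbatim from the PR): the join is added once when the funding_instrument filter is present, and each sub-filter then contributes its own where clause.

if filters.funding_instrument is not None:
    # One join covers every funding_instrument sub-filter below.
    stmt = stmt.join(LinkOpportunitySummaryFundingInstrument)

    one_of_funding_instruments = filters.funding_instrument.get("one_of")
    if one_of_funding_instruments:
        # WHERE link_opportunity_summary_funding_instrument.funding_instrument_id IN (...)
        stmt = stmt.where(
            LinkOpportunitySummaryFundingInstrument.funding_instrument_id.in_(
                one_of_funding_instruments
            )
        )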

api/tests/conftest.py (outdated review thread, resolved)
@chouinar chouinar merged commit 15f9d59 into main Mar 18, 2024
8 checks passed
@chouinar chouinar deleted the chouinar/1167-add-filtering branch March 18, 2024 15:26
Labels
api, database, documentation (Improvements or additions to documentation), python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task]: Add filtering and update the request model of the search endpoint
5 participants