Running all tests together is sloooooow #1119
I'm suspicious of this test here:
Locally on my machine it causes Postgres to use 100% CPU. However, the crunch seems to be worse when running it as part of the wider test suite. If I run it standalone, Postgres still uses 100% of CPU but the test eventually finishes. That said, running it standalone gets slower each time I run it. |
I think I've narrowed it down to actually calculating the KPI aggregations. The 100% CPU usage tends to coincide with this debug logging:
Frustratingly I can't reliably reproduce the 100% CPU usage. So far it's only happened after running other tests but not every time. I've tried running just the test_update_kpiaggregation_model tests repeatedly and it doesn't happen. |
I wonder if during the database seeding for the tests |
Yes, it is called, but I don't think that's what's slow. I've added some debugging printlns to this run (GitHub Actions) and I think I've narrowed it down to the organisation level tests:
The calls to The extreme slowness does seem isolated to those Organisation tests. The same test at higher levels (trust etc) takes just a few seconds. |
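For anyone wanting to repeat the timing approach, here is a minimal sketch of the kind of debug timing described above. The `timed` helper and its `sink` parameter are illustrative names, not something that exists in the repo; it only assumes the standard library.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Report how long the wrapped block took, even if it raises.

    `label` and `sink` are hypothetical names for this sketch; `sink`
    defaults to print so the timings show up in the CI log.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        sink(f"{label}: {elapsed:.3f}s")
```

Wrapping each abstraction level in turn (for example `with timed("organisation level"): ...`) should make it clear from the log which level is the slow one.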
This run shows the slowness is evaluating the
https://github.com/rcpch/rcpch-audit-engine/actions/runs/12264982554/job/34220762735
It's a big SQL query:
SELECT
"epilepsy12_organisation"."ods_code",
COUNT(CASE WHEN "epilepsy12_kpi"."ecg" = 1 THEN 1 ELSE NULL END) AS "ecg_passed",
COUNT(CASE WHEN ("epilepsy12_kpi"."ecg" = 1 OR "epilepsy12_kpi"."ecg" = 0) THEN 1 ELSE NULL END) AS "ecg_total_eligible",
COUNT(CASE WHEN "epilepsy12_kpi"."ecg" = 2 THEN 1 ELSE NULL END) AS "ecg_ineligible",
COUNT(CASE WHEN "epilepsy12_kpi"."ecg" IS NULL THEN 1 ELSE NULL END) AS "ecg_incomplete",
COUNT(CASE WHEN "epilepsy12_kpi"."mental_health_support" = 1 THEN 1 ELSE NULL END) AS "mental_health_support_passed",
COUNT(CASE WHEN ("epilepsy12_kpi"."mental_health_support" = 1 OR "epilepsy12_kpi"."mental_health_support" = 0) THEN 1 ELSE NULL END) AS "mental_health_support_total_eligible",
COUNT(CASE WHEN "epilepsy12_kpi"."mental_health_support" = 2 THEN 1 ELSE NULL END) AS "mental_health_support_ineligible",
COUNT(CASE WHEN "epilepsy12_kpi"."mental_health_support" IS NULL THEN 1 ELSE NULL END) AS "mental_health_support_incomplete"
FROM "epilepsy12_kpi"
INNER JOIN "epilepsy12_registration"
ON ("epilepsy12_kpi"."id" = "epilepsy12_registration"."kpi_id")
INNER JOIN "epilepsy12_organisation"
ON ("epilepsy12_kpi"."organisation_id" = "epilepsy12_organisation"."id")
WHERE "epilepsy12_registration"."id" IN (
SELECT U3."id" FROM "epilepsy12_case" U0
INNER JOIN "epilepsy12_site" U1
ON (U0."id" = U1."case_id")
INNER JOIN "epilepsy12_organisation" U2
ON (U1."organisation_id" = U2."id")
INNER JOIN "epilepsy12_registration" U3
ON (U0."id" = U3."case_id")
INNER JOIN "epilepsy12_auditprogress" U4
ON (U3."audit_progress_id" = U4."id")
WHERE (U2."ods_code" = 'RQM01' AND U4."assessment_complete"
AND U4."epilepsy_context_complete"
AND U4."first_paediatric_assessment_complete"
AND U4."investigations_complete"
AND U4."management_complete"
AND U4."multiaxial_diagnosis_complete"
AND U4."registration_complete"
AND U3."cohort" = 6
AND U3."completed_first_year_of_care_date" <= 2024-12-10
AND U1."site_is_actively_involved_in_epilepsy_care"
AND U1."site_is_primary_centre_of_epilepsy_care"))
GROUP BY "epilepsy12_organisation"."ods_code"
ORDER BY "epilepsy12_organisation"."ods_code" ASC
Here's the
|
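To make the query above easier to read: each KPI column is counted four ways with `COUNT(CASE WHEN ...)`. The sketch below mirrors that logic in plain Python for a single KPI. The score codes (1, 0, 2, NULL) are read straight off the CASE expressions; the function name and the returned keys are made up for illustration and are not the project's API.

```python
def summarise_kpi(scores):
    """Mirror the four COUNT(CASE WHEN ...) columns for one KPI.

    Codes as they appear in the query: 1 = passed, 0 = not passed but
    eligible, 2 = ineligible, None (SQL NULL) = not yet scored.
    """
    passed = sum(1 for s in scores if s == 1)
    total_eligible = sum(1 for s in scores if s in (0, 1))
    ineligible = sum(1 for s in scores if s == 2)
    incomplete = sum(1 for s in scores if s is None)
    return {
        "passed": passed,
        "total_eligible": total_eligible,
        "ineligible": ineligible,
        "incomplete": incomplete,
    }
```

So the aggregation itself is cheap per row; the cost in the slow runs is more likely in how the planner joins the tables feeding it.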
I've used an online visualiser for the explain output of a slow run (KPI for an organisation) and a faster one (KPI for a trust):
Strangely, an Index Scan on Site seems to pop out, which I didn't see when eyeballing the text output: There is the same for the trust run but it takes a lot less time and hits an order of magnitude fewer rows: It might be useful to do a test with fewer sites just to see if that makes a difference (obviously not a long-term solution, but useful to know if that's what makes it run away). |
Here is the query plan from running the test individually: https://explain.depesz.com/s/e17x#html. It's a totally different plan, which makes me suspect that when there's high CPU usage Postgres is picking a different plan.
I'm still not quite sure how I managed to reproduce the high CPU usage on my local machine but I did! I found that running the test individually was always fast, but running all the tests slowed down at that test and Postgres used 100% of CPU. I could reproduce this even when stopping the docker compose stack and starting it again. However, I then ran
Update: adding an analyze call to the slow test doesn't seem to speed it up (see this run https://github.com/rcpch/rcpch-audit-engine/actions/runs/12321365885/job/34392606068). I'm still not sure how to push it on to the happier query plan. Perhaps it's worth spinning up the app (including seeding) locally and then running the tests? In CI at the moment we spin up the docker compose stack in the background, which means the seeding is done at the start of the tests. |
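For comparing the fast and slow plans, a small helper that captures `EXPLAIN (ANALYZE, BUFFERS)` output as text may be handy. This is only a sketch: `explain_analyze` is a hypothetical name, and it assumes a DB-API style cursor (something with `execute` and `fetchall`, as psycopg and Django cursors provide). Note that `EXPLAIN ANALYZE` really executes the query, so the slow case will still be slow.

```python
def explain_analyze(cursor, sql, params=None):
    """Run EXPLAIN (ANALYZE, BUFFERS) on `sql` and return the plan as text.

    `cursor` is assumed to be a DB-API style cursor. Postgres returns the
    plan as one row per line, each row holding a single text column.
    """
    cursor.execute("EXPLAIN (ANALYZE, BUFFERS) " + sql, params)
    return "\n".join(row[0] for row in cursor.fetchall())
```

The returned text can be pasted straight into a plan visualiser such as explain.depesz.com.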
This is possibly the most tenacious and brilliant piece of debugging I think I have seen, @mbarton. |
Thank you @eatyourpeas I'm not sure about brilliant but as far as I'm concerned the computer doesn't get to spend 20 minutes burning CPU just because it wants to 😆 I've managed to narrow it down further to the interaction between |
Just for completeness, I've narrowed it down further to the call to rcpch-audit-engine/epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/helpers.py Line 92 in 909ebbe
This run doesn't call that for https://github.com/rcpch/rcpch-audit-engine/actions/runs/12373107174/job/34532760889 It runs fast (well, faster than it normally does!). Strangely all the tests in |
Maybe something in the filter? National does not filter against |
Disabling nested loops in the query planner seems to always make it run at a constant speed ( https://www.postgresql.org/docs/17/runtime-config-query.html#GUC-ENABLE-NESTLOOP https://github.com/rcpch/rcpch-audit-engine/actions/runs/12375031581/job/34554140796 However |
OK so I'm still not sure what is causing those nested loops to run out of control but disabling them just for
Incidentally, upgrading to Postgres 16 has a similar effect without any customisation (I can't believe I didn't think of this sooner!): https://github.com/rcpch/rcpch-audit-engine/actions/runs/12381126831/job/34558979428 From a scan of the release notes I'm not sure what's made the difference. |
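For reference, here is one way the nested-loop workaround could be scoped to a single block of code rather than the whole database. This is a sketch under assumptions, not what the linked runs actually do: `nested_loops_disabled` is a hypothetical helper, and it assumes a DB-API style cursor on a Postgres connection. `SET enable_nestloop = off` and `RESET enable_nestloop` are standard Postgres statements.

```python
from contextlib import contextmanager


@contextmanager
def nested_loops_disabled(cursor):
    """Discourage nested-loop joins for queries run inside the block.

    Hypothetical helper: `cursor` is assumed to be a DB-API style cursor.
    The setting is session-wide, so it is reset on the way out. If the
    transaction has already failed, the RESET itself will raise.
    """
    cursor.execute("SET enable_nestloop = off")
    try:
        yield cursor
    finally:
        cursor.execute("RESET enable_nestloop")
```

Used as `with nested_loops_disabled(cursor): ...` around the slow aggregation, this keeps the planner override away from every other query in the test run.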
Our CI run is about half an hour, of which most time is spent crunching on some unit tests:
https://github.com/rcpch/rcpch-audit-engine/actions/runs/12121037522/job/33791091653
We should profile what it's doing and see if we can speed it up. CI runs should be under 15 minutes to ensure hotfixes can be deployed quickly.