Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running all tests together is sloooooow #1119

Open
mbarton opened this issue Dec 2, 2024 · 13 comments
Open

Running all tests together is sloooooow #1119

mbarton opened this issue Dec 2, 2024 · 13 comments

Comments

@mbarton
Copy link
Member

mbarton commented Dec 2, 2024

Our CI run is about half an hour, of which most time is spent crunching on some unit tests:

 epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_get_filtered_cases_queryset_for.py::test_get_filtered_cases_queryset_for_returns_correct_count[EnumAbstractionLevel.NHS_ENGLAND_REGION-ODSCodes4-expected_count4] PASSED [  5%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_get_filtered_cases_queryset_for.py::test_get_filtered_cases_queryset_for_returns_correct_count[EnumAbstractionLevel.OPEN_UK-ODSCodes5-expected_count5] PASSED [  5%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_get_filtered_cases_queryset_for.py::test_get_filtered_cases_queryset_for_returns_correct_count[EnumAbstractionLevel.COUNTRY-ODSCodes6-expected_count6] PASSED [  5%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_get_filtered_cases_queryset_for.py::test_get_filtered_cases_queryset_for_returns_correct_count[EnumAbstractionLevel.NATIONAL-ODSCodes7-expected_count7] PASSED [  5%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_get_filtered_cases_queryset_for.py::test_get_filtered_cases_queryset_includes_only_specified_cohort PASSED [  6%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_get_filtered_cases_queryset_for.py::test_get_filtered_cases_queryset_includes_only_specified_cohort_that_have_completed_one_full_year PASSED [  6%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_get_filtered_cases_queryset_for.py::test_get_filtered_cases_queryset_for_orgs_with_null_ICB PASSED [  6%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.ORGANISATION-abstraction_codes0-ods_codes0] PASSED [  7%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.TRUST-abstraction_codes1-ods_codes1] PASSED [  7%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.LOCAL_HEALTH_BOARD-abstraction_codes2-ods_codes2] PASSED [  7%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.ICB-abstraction_codes3-ods_codes3] PASSED [  8%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.NHS_ENGLAND_REGION-abstraction_codes4-ods_codes4] PASSED [  8%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.OPEN_UK-abstraction_codes5-ods_codes5] PASSED [  8%]

STOPS FOR AGES...

epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.COUNTRY-abstraction_codes6-ods_codes6] PASSED [  9%]
epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.NATIONAL-abstraction_codes7-ods_codes7] PASSED [  9%]
epilepsy12/tests/common_view_functions_tests/calculate_kpi_tests/test_measure_1.py::test_measure_1_should_pass_seen_paediatrician PASSED [  9%]
epilepsy12/tests/common_view_functions_tests/calculate_kpi_tests/test_measure_1.py::test_measure_1_should_pass_seen_neurologist PASSED [ 10%]
epilepsy12/tests/common_view_functions_tests/calculate_kpi_tests/test_measure_1.py::test_measure_1_should_fail_not_seen_14_days_after_referral PASSED [ 10%]
epilepsy12/tests/common_view_functions_tests/calculate_kpi_tests/test_measure_1.py::test_measure_1_should_fail_no_doctor_involved PASSED [ 10%]
epilepsy12/tests/common_view_functions_tests/calculate_kpi_tests/test_measure_10.py::test_measure_10_school_individual_healthcare_plan[age0-True-1] PASSED [ 11%]

https://github.com/rcpch/rcpch-audit-engine/actions/runs/12121037522/job/33791091653

We should profile what it's doing and see if we can speed it up. CI runs should be under 15 minutes to ensure hotfixes can be deployed quickly.

@mbarton
Copy link
Member Author

mbarton commented Dec 6, 2024

I'm suspicious of this test here:

epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.ORGANISATION-abstraction_codes0-ods_codes0]

Locally on my machine it causes Postgres to use 100% CPU. However the crunch seems to be worse running it as part of the wider test suite. If I run it standalone Postgres does still use 100% of CPU but the test eventually finishes.

That said running it standalone gets slower each time I run it.

@mbarton
Copy link
Member Author

mbarton commented Dec 6, 2024

I think I've narrowed it down to actually calculating the KPI aggregations. The 100% CPU usage tends to coincide with this debug logging:

DEBUG [epilepsy12.common_view_functions.aggregate_by:462] updating/saving: Aneurin Bevan University Health Board                                                                                                   
DEBUG [epilepsy12.common_view_functions.aggregate_by:470] created LocalHealthBoardKPIAggregation (LocalHealthBoard=Aneurin Bevan University Health Board, Cohort 6)                                                
DEBUG [epilepsy12.common_view_functions.aggregate_by:475] 1 scored LOCAL_HEALTH_BOARD instances updated with aggregated scores of a total 1 LOCAL_HEALTH_BOARDs                                                    
DEBUG [epilepsy12.common_view_functions.aggregate_by:462] updating/saving: Aneurin Bevan University Health Board
DEBUG [epilepsy12.common_view_functions.aggregate_by:472] updated LocalHealthBoardKPIAggregation (LocalHealthBoard=Aneurin Bevan University Health Board, Cohort 6)                                                
DEBUG [epilepsy12.common_view_functions.aggregate_by:475] 1 scored LOCAL_HEALTH_BOARD instances updated with aggregated scores of a total 1 LOCAL_HEALTH_BOARDs                                                    
DEBUG [epilepsy12.common_view_functions.aggregate_by:462] updating/saving: Swansea Bay University Health Board                                                                                                     
DEBUG [epilepsy12.common_view_functions.aggregate_by:470] created LocalHealthBoardKPIAggregation (LocalHealthBoard=Swansea Bay University Health Board, Cohort 6)                                                  
DEBUG [epilepsy12.common_view_functions.aggregate_by:475] 1 scored LOCAL_HEALTH_BOARD instances updated with aggregated scores of a total 2 LOCAL_HEALTH_BOARDs                                                    
DEBUG [epilepsy12.common_view_functions.aggregate_by:480] Not updated: [('Aneurin Bevan University Health Board',)])                                                                                               
DEBUG [epilepsy12.common_view_functions.aggregate_by:462] updating/saving: Swansea Bay University Health Board                                                                                                     
DEBUG [epilepsy12.common_view_functions.aggregate_by:472] updated LocalHealthBoardKPIAggregation (LocalHealthBoard=Swansea Bay University Health Board, Cohort 6)                                                  
DEBUG [epilepsy12.common_view_functions.aggregate_by:475] 1 scored LOCAL_HEALTH_BOARD instances updated with aggregated scores of a total 2 LOCAL_HEALTH_BOARDs                                                    
DEBUG [epilepsy12.common_view_functions.aggregate_by:480] Not updated: [('Aneurin Bevan University Health Board',)])         

Frustratingly I can't reliably reproduce the 100% CPU usage. So far it's only happened after running other tests but not every time. I've tried running just the test_update_kpiaggregation_model tests repeatedly and it doesn't happen.

@eatyourpeas
Copy link
Member

I wonder if during the database seeding for the tests _seed_all_aggregation_models is called. This would create instances of the KPIAggregation models, one for each cohort and aggregation level.

@mbarton
Copy link
Member Author

mbarton commented Dec 10, 2024

Yes it is called but I don't think that's what's slow. I've added some debugging printlns to this run (Github Actions) and I think I've narrowed it down to the organisation level tests:

epilepsy12/tests/common_view_functions_tests/aggregate_by_tests/test_update_kpiaggregation_model.py::test_update_kpi_aggregation_model_all_levels[EnumAbstractionLevel.ORGANISATION-abstraction_codes0-ods_codes0] !! 13:42:21 START TEST: EnumAbstractionLevel.ORGANISATION !!
!! 13:42:21 Before _register_kpi_scored_cases !!
!! 13:42:24 After _register_kpi_scored_cases !!
!! 13:42:24 Before get_filtered_cases_queryset_for(organisation=ADDENBROOKE'S HOSPITAL, abstraction_level=EnumAbstractionLevel.ORGANISATION) !!
!! 13:42:24 After get_filtered_cases_queryset_for
!! 13:42:24 Before calculate_kpi_value_counts_queryset(organisation=ADDENBROOKE'S HOSPITAL, abstraction_level=EnumAbstractionLevel.ORGANISATION) !!
!! 13:42:24 After calculate_kpi_value_counts_queryset
!! 13:42:24 Before update_kpi_aggregation_model(organisation=ADDENBROOKE'S HOSPITAL, abstraction_level=EnumAbstractionLevel.ORGANISATION) !!

[BIG GAP]

!! 13:49:37 BEFORE update_or_create ADDENBROOKE'S HOSPITAL !!
!! 13:49:37 AFTER update_or_create ADDENBROOKE'S HOSPITAL !!
DEBUG 2024-12-10 13:49:37,122 [epilepsy12.common_view_functions.aggregate_by:466] updating/saving: ADDENBROOKE'S HOSPITAL
DEBUG 2024-12-10 13:49:37,122 [epilepsy12.common_view_functions.aggregate_by:474] created OrganisationKPIAggregation (ods_code=RGT01, name=ADDENBROOKE'S HOSPITAL, Cohort 6) KPIAggregations
!! 13:49:37 After update_kpi_aggregation_model
!! 13:49:37 Before get_filtered_cases_queryset_for(organisation=CHELSEA & WESTMINSTER HOSPITAL, abstraction_level=EnumAbstractionLevel.ORGANISATION) !!
!! 13:49:37 After get_filtered_cases_queryset_for
!! 13:49:37 Before calculate_kpi_value_counts_queryset(organisation=CHELSEA & WESTMINSTER HOSPITAL, abstraction_level=EnumAbstractionLevel.ORGANISATION) !!
!! 13:49:37 After calculate_kpi_value_counts_queryset
!! 13:49:37 Before update_kpi_aggregation_model(organisation=CHELSEA & WESTMINSTER HOSPITAL, abstraction_level=EnumAbstractionLevel.ORGANISATION) !!

[BIG GAP]

!! 13:58:57 BEFORE update_or_create CHELSEA & WESTMINSTER HOSPITAL !!
DEBUG 2024-12-10 13:58:57,296 [epilepsy12.common_view_functions.aggregate_by:466] updating/saving: CHELSEA & WESTMINSTER HOSPITAL
!! 13:58:57 AFTER update_or_create CHELSEA & WESTMINSTER HOSPITAL !!
DEBUG 2024-12-10 13:58:57,296 [epilepsy12.common_view_functions.aggregate_by:474] created OrganisationKPIAggregation (ods_code=RQM01, name=CHELSEA & WESTMINSTER HOSPITAL, Cohort 6) KPIAggregations
!! 13:58:57 After update_kpi_aggregation_model
!! 13:58:57 END TEST: EnumAbstractionLevel.ORGANISATION !!

The calls to update_kpi_aggregation_model seem to take many minutes. It's not the actual DB query creating the aggregation so I'll add more debugging to the other lines to try and narrow it down.

The extreme slowness does seem isolated to those Organisation tests. The same test at higher levels (trust etc) takes just a few seconds.

@mbarton
Copy link
Member Author

mbarton commented Dec 10, 2024

This run shows the slowness is evaluating the kpi_value_counts queryset:

for value_count in kpi_value_counts:

!! 21:46:31 BEFORE apps.get_model(OrganisationKPIAggregation) !!
!! 21:46:31 AFTER apps.get_model(OrganisationKPIAggregation) !!
!! 21:46:31 BEFORE for value_count in kpi_value_counts:<class 'django.db.models.query.QuerySet'> !!
!! 21:53:54 BEFORE value_count.pop(organisation__ods_code)) !!
!! 21:53:54 AFTER value_count.pop(organisation__ods_code)) !!

https://github.com/rcpch/rcpch-audit-engine/actions/runs/12264982554/job/34220762735

It's a big SQL query:

SELECT
    "epilepsy12_organisation"."ods_code",
    COUNT(CASE WHEN "epilepsy12_kpi"."ecg" = 1 THEN 1 ELSE NULL END) AS "ecg_passed",
    COUNT(CASE WHEN ("epilepsy12_kpi"."ecg" = 1 OR "epilepsy12_kpi"."ecg" = 0) THEN 1 ELSE NULL END) AS "ecg_total_eligible",
    COUNT(CASE WHEN "epilepsy12_kpi"."ecg" = 2 THEN 1 ELSE NULL END) AS "ecg_ineligible",
    COUNT(CASE WHEN "epilepsy12_kpi"."ecg" IS NULL THEN 1 ELSE NULL END) AS "ecg_incomplete",
    COUNT(CASE WHEN "epilepsy12_kpi"."mental_health_support" = 1 THEN 1 ELSE NULL END) AS "mental_health_support_passed",
    COUNT(CASE WHEN ("epilepsy12_kpi"."mental_health_support" = 1 OR "epilepsy12_kpi"."mental_health_support" = 0) THEN 1 ELSE NULL END) AS "mental_health_support_total_eligible",
    COUNT(CASE WHEN "epilepsy12_kpi"."mental_health_support" = 2 THEN 1 ELSE NULL END) AS "mental_health_support_ineligible",
    COUNT(CASE WHEN "epilepsy12_kpi"."mental_health_support" IS NULL THEN 1 ELSE NULL END) AS "mental_health_support_incomplete"
FROM "epilepsy12_kpi"
    INNER JOIN "epilepsy12_registration"
    ON ("epilepsy12_kpi"."id" = "epilepsy12_registration"."kpi_id")
        INNER JOIN "epilepsy12_organisation"
        ON ("epilepsy12_kpi"."organisation_id" = "epilepsy12_organisation"."id")
        WHERE "epilepsy12_registration"."id" IN (
            SELECT U3."id" FROM "epilepsy12_case" U0
                INNER JOIN "epilepsy12_site" U1
                ON (U0."id" = U1."case_id")
                    INNER JOIN "epilepsy12_organisation" U2
                    ON (U1."organisation_id" = U2."id")
                        INNER JOIN "epilepsy12_registration" U3
                        ON (U0."id" = U3."case_id")
                            INNER JOIN "epilepsy12_auditprogress" U4
                            ON (U3."audit_progress_id" = U4."id")
                            WHERE (U2."ods_code" = 'RQM01' AND U4."assessment_complete"
                                AND U4."epilepsy_context_complete"
                                AND U4."first_paediatric_assessment_complete"
                                AND U4."investigations_complete"
                                AND U4."management_complete"
                                AND U4."multiaxial_diagnosis_complete"
                                AND U4."registration_complete"
                                AND U3."cohort" = 6
                                AND U3."completed_first_year_of_care_date" <= 2024-12-10
                                AND U1."site_is_actively_involved_in_epilepsy_care"
                                AND U1."site_is_primary_centre_of_epilepsy_care"))
GROUP BY "epilepsy12_organisation"."ods_code"
ORDER BY "epilepsy12_organisation"."ods_code" ASC

Here's the EXPLAIN ANALYZE output for that query:

GroupAggregate  (cost=65.56..65.61 rows=1 width=70) (actual time=506140.478..506140.482 rows=1 loops=1)
  Output: epilepsy12_organisation.ods_code, count(CASE WHEN (epilepsy12_kpi.ecg = 1) THEN 1 ELSE NULL::integer END), count(CASE WHEN ((epilepsy12_kpi.ecg = 1) OR (epilepsy12_kpi.ecg = 0)) THEN 1 ELSE NULL::integer END), count(CASE WHEN (epilepsy12_kpi.ecg = 2) THEN 1 ELSE NULL::integer END), count(CASE WHEN (epilepsy12_kpi.ecg IS NULL) THEN 1 ELSE NULL::integer END), count(CASE WHEN (epilepsy12_kpi.mental_health_support = 1) THEN 1 ELSE NULL::integer END), count(CASE WHEN ((epilepsy12_kpi.mental_health_support = 1) OR (epilepsy12_kpi.mental_health_support = 0)) THEN 1 ELSE NULL::integer END), count(CASE WHEN (epilepsy12_kpi.mental_health_support = 2) THEN 1 ELSE NULL::integer END), count(CASE WHEN (epilepsy12_kpi.mental_health_support IS NULL) THEN 1 ELSE NULL::integer END)
  Group Key: epilepsy12_organisation.ods_code
  ->  Sort  (cost=65.56..65.56 rows=1 width=14) (actual time=506140.457..506140.462 rows=40 loops=1)
        Output: epilepsy12_organisation.ods_code, epilepsy12_kpi.ecg, epilepsy12_kpi.mental_health_support
        Sort Key: epilepsy12_organisation.ods_code
        Sort Method: quicksort  Memory: 27kB
        ->  Nested Loop  (cost=1.29..65.55 rows=1 width=14) (actual time=751.565..506140.349 rows=40 loops=1)
              Output: epilepsy12_organisation.ods_code, epilepsy12_kpi.ecg, epilepsy12_kpi.mental_health_support
              Inner Unique: true
              ->  Nested Loop  (cost=1.02..57.21 rows=1 width=16) (actual time=751.552..506140.003 rows=40 loops=1)
                    Output: epilepsy12_kpi.ecg, epilepsy12_kpi.mental_health_support, epilepsy12_kpi.organisation_id
                    Join Filter: (epilepsy12_registration.kpi_id = epilepsy12_kpi.id)
                    Rows Removed by Join Filter: 3160
                    ->  Index Scan using epilepsy12_kpi_organisation_id_583ca7cb on public.epilepsy12_kpi  (cost=0.12..8.14 rows=1 width=24) (actual time=0.018..0.155 rows=80 loops=1)
                          Output: epilepsy12_kpi.id, epilepsy12_kpi.paediatrician_with_expertise_in_epilepsies, epilepsy12_kpi.epilepsy_specialist_nurse, epilepsy12_kpi.tertiary_input, epilepsy12_kpi.epilepsy_surgery_referral, epilepsy12_kpi.ecg, epilepsy12_kpi.mri, epilepsy12_kpi.assessment_of_mental_health_issues, epilepsy12_kpi.mental_health_support, epilepsy12_kpi.sodium_valproate, epilepsy12_kpi.comprehensive_care_planning_agreement, epilepsy12_kpi.patient_held_individualised_epilepsy_document, epilepsy12_kpi.patient_carer_parent_agreement_to_the_care_planning, epilepsy12_kpi.care_planning_has_been_updated_when_necessary, epilepsy12_kpi.comprehensive_care_planning_content, epilepsy12_kpi.parental_prolonged_seizures_care_plan, epilepsy12_kpi.water_safety, epilepsy12_kpi.first_aid, epilepsy12_kpi.general_participation_and_risk, epilepsy12_kpi.sudep, epilepsy12_kpi.service_contact_details, epilepsy12_kpi.school_individual_healthcare_plan, epilepsy12_kpi.organisation_id
                    ->  Nested Loop Semi Join  (cost=0.90..49.06 rows=1 width=8) (actual time=0.020..6326.733 rows=40 loops=80)
                          Output: epilepsy12_registration.kpi_id
                          Join Filter: (epilepsy12_registration.id = u3.id)
                          Rows Removed by Join Filter: 2380
                          ->  Index Scan using epilepsy12_registration_kpi_id_key on public.epilepsy12_registration  (cost=0.25..8.26 rows=1 width=16) (actual time=0.006..0.076 rows=80 loops=80)
                                Output: epilepsy12_registration.kpi_id, epilepsy12_registration.id
                          ->  Nested Loop  (cost=0.65..40.78 rows=1 width=8) (actual time=0.012..79.080 rows=30 loops=6400)
                                Output: u3.id
                                Inner Unique: true
                                Join Filter: (u3.audit_progress_id = u4.id)
                                Rows Removed by Join Filter: 523
                                ->  Nested Loop  (cost=0.52..32.63 rows=1 width=16) (actual time=0.009..78.923 rows=30 loops=6400)
                                      Output: u3.audit_progress_id, u3.id
                                      Join Filter: (u0.id = u3.case_id)
                                      Rows Removed by Join Filter: [1970](https://github.com/rcpch/rcpch-audit-engine/actions/runs/12276466365/job/34253731534#step:7:1970)
                                      ->  Index Scan using epilepsy12_registration_audit_progress_id_key on public.epilepsy12_registration u3  (cost=0.12..8.14 rows=1 width=24) (actual time=0.003..0.021 rows=50 loops=6400)
                                            Output: u3.case_id, u3.audit_progress_id, u3.id
                                            Filter: ((u3.completed_first_year_of_care_date <= '2024-12-11'::date) AND (u3.cohort = 6))
                                      ->  Nested Loop  (cost=0.40..24.47 rows=1 width=16) (actual time=0.006..1.568 rows=40 loops=321600)
                                            Output: u0.id, u1.case_id
                                            Inner Unique: true
                                            Join Filter: (u1.organisation_id = u2.id)
                                            Rows Removed by Join Filter: 40
                                            ->  Nested Loop  (cost=0.25..16.29 rows=1 width=24) (actual time=0.005..1.496 rows=79 loops=321600)
                                                  Output: u0.id, u1.case_id, u1.organisation_id
                                                  Join Filter: (u0.id = u1.case_id)
                                                  Rows Removed by Join Filter: 6273
                                                  ->  Index Only Scan using epilepsy12_case_pkey on public.epilepsy12_case u0  (cost=0.12..8.14 rows=1 width=8) (actual time=0.002..0.014 rows=79 loops=321600)
                                                        Output: u0.id
                                                        Heap Fetches: 25537600
                                                  ->  Index Scan using epilepsy12_site_organisation_id_43260192 on public.epilepsy12_site u1  (cost=0.12..8.14 rows=1 width=16) (actual time=0.001..0.015 rows=80 loops=25537600)
                                                        Output: u1.case_id, u1.organisation_id
                                                        Filter: (u1.site_is_actively_involved_in_epilepsy_care AND u1.site_is_primary_centre_of_epilepsy_care)
                                            ->  Index Scan using epilepsy12_organisation_ods_code_a9b6ab63_like on public.epilepsy12_organisation u2  (cost=0.15..8.17 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=25537600)
                                                  Output: u2.id, u2.created_at, u2.updated_at, u2.ods_code, u2.name, u2.website, u2.address1, u2.address2, u2.address3, u2.telephone, u2.city, u2.county, u2.latitude, u2.longitude, u2.postcode, u2.geocode_coordinates, u2.active, u2.published_at, u2.country_id, u2.integrated_care_board_id, u2.local_health_board_id, u2.london_borough_id, u2.nhs_england_region_id, u2.openuk_network_id, u2.trust_id, u2.email
                                                  Index Cond: ((u2.ods_code)::text = 'RGT01'::text)
                                ->  Index Scan using epilepsy12_auditprogress_pkey on public.epilepsy12_auditprogress u4  (cost=0.12..8.14 rows=1 width=8) (actual time=0.001..0.004 rows=18 loops=193600)
                                      Output: u4.id, u4.registration_complete, u4.registration_total_expected_fields, u4.registration_total_completed_fields, u4.first_paediatric_assessment_complete, u4.first_paediatric_assessment_total_expected_fields, u4.first_paediatric_assessment_total_completed_fields, u4.assessment_complete, u4.assessment_total_expected_fields, u4.assessment_total_completed_fields, u4.epilepsy_context_complete, u4.epilepsy_context_total_expected_fields, u4.epilepsy_context_total_completed_fields, u4.multiaxial_diagnosis_complete, u4.multiaxial_diagnosis_total_expected_fields, u4.multiaxial_diagnosis_total_completed_fields, u4.investigations_complete, u4.investigations_total_expected_fields, u4.investigations_total_completed_fields, u4.management_complete, u4.management_total_expected_fields, u4.management_total_completed_fields, u4.consent_patient_confirmed, u4.details_patient_confirmed
                                      Filter: (u4.assessment_complete AND u4.epilepsy_context_complete AND u4.first_paediatric_assessment_complete AND u4.investigations_complete AND u4.management_complete AND u4.multiaxial_diagnosis_complete AND u4.registration_complete)
              ->  Index Scan using epilepsy12_organisation_pkey on public.epilepsy12_organisation  (cost=0.27..8.29 rows=1 width=14) (actual time=0.005..0.005 rows=1 loops=40)
                    Output: epilepsy12_organisation.id, epilepsy12_organisation.created_at, epilepsy12_organisation.updated_at, epilepsy12_organisation.ods_code, epilepsy12_organisation.name, epilepsy12_organisation.website, epilepsy12_organisation.address1, epilepsy12_organisation.address2, epilepsy12_organisation.address3, epilepsy12_organisation.telephone, epilepsy12_organisation.city, epilepsy12_organisation.county, epilepsy12_organisation.latitude, epilepsy12_organisation.longitude, epilepsy12_organisation.postcode, epilepsy12_organisation.geocode_coordinates, epilepsy12_organisation.active, epilepsy12_organisation.published_at, epilepsy12_organisation.country_id, epilepsy12_organisation.integrated_care_board_id, epilepsy12_organisation.local_health_board_id, epilepsy12_organisation.london_borough_id, epilepsy12_organisation.nhs_england_region_id, epilepsy12_organisation.openuk_network_id, epilepsy12_organisation.trust_id, epilepsy12_organisation.email
                    Index Cond: (epilepsy12_organisation.id = epilepsy12_kpi.organisation_id)
Planning Time: 0.714 ms
Execution Time: 506140.555 ms

@mbarton
Copy link
Member Author

mbarton commented Dec 11, 2024

I've used an online visualiser for the explain output of a slow run (KPI for an organisation) and a faster one (KPI for a trust):

Strangely an Index Scan on Site seems to pop out which I didn't see eyeballing the text output:

Screenshot 2024-12-11 16 39 17

There is the same for the trust run but it takes a lot less time and hits an order of magnitude less rows:

Screenshot 2024-12-11 16 40 44

It might be useful to do a test with fewer sites just to see if that makes a difference (obviously not a long term solution but useful to know if that's what makes it run away)

@mbarton
Copy link
Member Author

mbarton commented Dec 13, 2024

Here is the query plain from running the test individually: https://explain.depesz.com/s/e17x#html. It's a totally different plan which makes me suspect that when there's high CPU usage Postgres is picking a different plan.

I'm still not quite sure how I managed to reproduce the high CPU usage on my local machine but I did!

I found that running the test individually was always fast but running all the tests slowed down at that test and Postgres used 100% of CPU. I could reproduce this even when stopping the docker compose stack and starting it again.

However, I then ran ANALYZE manually in psql. After that, the tests ran fast again. So I suspect once it has up to date stats about the tables after seeding with test data, it's able to pick a better query plan. That may explain why this problem seems limited to fresh CI runs.

Update: adding an analyze call to the slow test doesn't seem to speed it up (see this run https://github.com/rcpch/rcpch-audit-engine/actions/runs/12321365885/job/34392606068). I'm still not sure how to push it on to the happier query plan.

Perhaps it's worth spinning up the app (including seeding) locally and then running the tests? In CI at the moment we spin up the docker compose stack in the background, which means the seeding is done at the start of the tests.

@eatyourpeas
Copy link
Member

This is possibly the most tenacious and brilliant piece of debugging I think I have seen @mbarton .

@mbarton
Copy link
Member Author

mbarton commented Dec 17, 2024

Thank you @eatyourpeas I'm not sure about brilliant but as far as I'm concerned the computer doesn't get to spend 20 minutes burning CPU just because it wants to 😆

I've managed to narrow it down further to the interaction between test_calculate_kpi_value_counts_queryset and test_update_kpiaggregation_model. This run removes the former and the latter runs quickly.

@mbarton
Copy link
Member Author

mbarton commented Dec 17, 2024

Just for completeness, I've narrowed it down further to the call to e12_case_factory.create_batch

This run doesn't call that for test_calculate_kpi_value_counts_queryset but does for test_update_kpiaggregation_model:

https://github.com/rcpch/rcpch-audit-engine/actions/runs/12373107174/job/34532760889

It runs fast (well faster than how it does normally!). Strangely all the tests in test_calculate_kpi_value_counts_queryset still pass apart from the national one. That's odd, at first glance I'd expect them all to fail.

@eatyourpeas
Copy link
Member

Maybe something in the filter? National does not filter against abstraction_level whereas all the others do

@mbarton
Copy link
Member Author

mbarton commented Dec 17, 2024

Disabling nested loops in the query planner seems to always make it run at a constant speed (enable_nestloop=off).

https://www.postgresql.org/docs/17/runtime-config-query.html#GUC-ENABLE-NESTLOOP

https://github.com/rcpch/rcpch-audit-engine/actions/runs/12375031581/job/34554140796

However test_calculate_kpi_value_counts_queryset also run slower so not an ideal solution

@mbarton
Copy link
Member Author

mbarton commented Dec 17, 2024

OK so I'm still not sure what is causing those nested loops to run out of control but disabling them just for test_update_kpiaggregation_model causes a nice 10 minute run: https://github.com/rcpch/rcpch-audit-engine/actions/runs/12380127886/job/34555791316

Incidentally, upgrading to Postgres 16 has a similar effect without any customisation (I can't believe I didn't think of this sooner!): https://github.com/rcpch/rcpch-audit-engine/actions/runs/12381126831/job/34558979428

From a scan of the release notes I'm not sure what's made the difference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants