Condense simple solr queries #2975

Open
CGillen opened this issue Nov 13, 2023 · 0 comments

CGillen (Contributor) commented Nov 13, 2023

Descriptive summary

General catalog searches produce massive Solr queries made up mostly of empty or irrelevant params. See this example of a simple search for "water":

Solr query: post select {
  "qt"=>"search",
  "facet.field"=>["copyright_combined_label_sim", "file_format_sim", "resource_type_label_sim", "topic_combined_label_sim", "scientific_combined_label_sim", "creator_combined_label_sim", "date_combined_year_label_ssim", "date_combined_decade_label_ssim", "location_combined_label_sim", "workType_label_sim", "language_label_sim", "non_user_collections_ssim", "local_collection_name_label_sim", "institution_label_sim", "cultural_context_label_sim", "former_owner_sim", "mode_of_issuance_sim", "box_number_sim", "folder_name_sim", "folder_number_sim", "has_number_sim", "is_volume_sim", "series_name_sim", "series_number_sim", "exhibit_sim", "creator_label_sim", "contributor_label_sim", "arranger_label_sim", "artist_label_sim", "author_label_sim", "cartographer_label_sim", "collector_label_sim", "composer_label_sim", "dedicatee_label_sim", "designer_label_sim", "donor_label_sim", "editor_label_sim", "illustrator_label_sim", "interviewee_label_sim", "interviewer_label_sim", "landscape_architect_label_sim", "lyricist_label_sim", "owner_label_sim", "patron_label_sim", "photographer_label_sim", "print_maker_label_sim", "recipient_label_sim", "transcriber_label_sim", "translator_label_sim", "form_of_work_label_sim", "military_branch_label_sim", "subject_label_sim", "keyword_sim", "ethnographic_term_label_sim", "style_or_period_label_sim", "phylum_or_division_label_sim", "taxon_class_label_sim", "order_label_sim", "family_label_sim", "genus_label_sim", "species_label_sim", "location_label_sim", "tgn_label_sim", "ranger_district_label_sim", "water_basin_label_sim", "rights_statement_label_sim", "license_label_sim", "repository_label_sim", "publisher_label_sim", "full_size_download_allowed_label_sim"],
  "facet.query"=>["license_sim:(\\n            https\\\\://creativecommons.org/licenses/by/4.0/ OR\\n            https\\\\://creativecommons.org/licenses/by-sa/4.0/ OR\\n            https\\\\://creativecommons.org/licenses/by-nd/4.0/ OR\\n            https\\\\://creativecommons.org/licenses/by-nc/4.0/ OR\\n            https\\\\://creativecommons.org/licenses/by-nc-nd/4.0/ OR\\n            https\\\\://creativecommons.org/licenses/by-nc-sa/4.0/ OR\\n            http\\\\://creativecommons.org/publicdomain/zero/1.0/ OR\\n            http\\\\://creativecommons.org/publicdomain/mark/1.0/)", "full_size_download_allowed_sim:(1)\\n            OR (\\n              (\\n                visibility_ssi:(open admin osu_user registered osu restricted private restricted private)\\n                OR read_access_group_ssim:(public admin osu_user)\\n                OR read_access_person_ssim:([email protected])\\n              )\\n              AND *:* -primarySet_ssim:(uo-scua uo-jsma)\\n              AND *:* -full_size_download_allowed_sim:(0)\\n            )"],
  "facet.pivot"=>[],
  "fq"=>["", "{!terms f=has_model_ssim}Generic,Image,Video,Document,Audio,Collection", "-suppressed_bsi:true", "!(_query_:\"{!raw f=collection_type_gid_ssim}gid://od2/Hyrax::CollectionType/1\" OR _query_:\"{!raw f=collection_type_gid_ssim}gid://od2/Hyrax::CollectionType/4\")"],
  "hl.fl"=>["title_tesim", "description_tesim", "all_text_tsimv", "hocr_text_tsimv"],
  "rows"=>20,
  "qf"=>"first_line_tesim first_line_chorus_tesim instrumentation_tesim table_of_contents_tesim contained_in_journal_tesim alternative_tesim tribal_title_tesim creator_display_tesim description_tesim abstract_tesim biographical_information_tesim coverage_tesim designer_inscription_tesim former_owner_tesim inscription_tesim military_highest_rank_tesim military_occupation_tesim military_service_location_tesim motif_tesim tribal_notes_tesim award_tesim event_tesim keyword_tesim legal_name_tesim sports_team_tesim tribal_classes_tesim tribal_terms_tesim accepted_name_usage_tesim original_name_usage_tesim scientific_name_authorship_tesim street_address_tesim date_tesim acquisition_date_tesim award_date_tesim collected_date_tesim date_created_tesim issued_tesim view_date_tesim accession_number_tesim barcode_tesim identifier_tesim item_locator_tesim copyright_claimant_tesim rights_holder_tesim rights_note_tesim copy_location_tesim location_copyshelf_location_tesim box_number_tesim current_repository_id_tesim folder_name_tesim folder_number_tesim local_collection_id_tesim provenance_tesim series_name_tesim series_number_tesim source_tesim has_finding_aid_tesim has_version_tesim is_part_of_tesim relation_tesim material_tesim technique_tesim exhibit_tesim bulkrax_identifier_tesim owner_label_tesim creator_label_tesim photographer_label_tesim arranger_label_tesim artist_label_tesim author_label_tesim cartographer_label_tesim collector_label_tesim composer_label_tesim contributor_label_tesim dedicatee_label_tesim designer_label_tesim donor_label_tesim editor_label_tesim illustrator_label_tesim interviewee_label_tesim interviewer_label_tesim landscape_architect_label_tesim lyricist_label_tesim patron_label_tesim print_maker_label_tesim recipient_label_tesim transcriber_label_tesim translator_label_tesim form_of_work_label_tesim subject_label_tesim cultural_context_label_tesim ethnographic_term_label_tesim military_branch_label_tesim style_or_period_label_tesim 
phylum_or_division_label_tesim taxon_class_label_tesim order_label_tesim family_label_tesim genus_label_tesim species_label_tesim common_name_label_tesim location_label_tesim ranger_district_label_tesim tgn_label_tesim water_basin_label_tesim access_restrictions_label_tesim repository_label_tesim local_collection_name_label_tesim publisher_label_tesim place_of_production_label_tesim publication_place_label_tesim workType_label_tesim institution_label_tesim license_label_tesim resource_type_label_tesim language_label_tesim non_user_collections_tesim title_tesim license_label_tesim file_format_sim all_text_tsimv hocr_text_tsimv",
  "pf"=>"title_tesim",
  "q"=>"water",
  "facet"=>true,
  "f.open_access.facet.limit"=>6,
  "f.copyright_combined_label_sim.facet.limit"=>6,
  "f.file_format_sim.facet.limit"=>6,
  "f.resource_type_label_sim.facet.limit"=>6,
  "f.topic_combined_label_sim.facet.limit"=>6,
  "f.scientific_combined_label_sim.facet.limit"=>6,
  "f.creator_combined_label_sim.facet.limit"=>6,
  "f.date_combined_year_label_ssim.facet.limit"=>6,
  "f.date_combined_decade_label_ssim.facet.limit"=>6,
  "f.location_combined_label_sim.facet.limit"=>6,
  "f.workType_label_sim.facet.limit"=>6,
  "f.language_label_sim.facet.limit"=>6,
  "f.non_user_collections_ssim.facet.limit"=>6,
  "f.local_collection_name_label_sim.facet.limit"=>6,
  "f.institution_label_sim.facet.limit"=>6,
  "f.cultural_context_label_sim.facet.limit"=>6,
  "f.former_owner_sim.facet.limit"=>6,
  "f.mode_of_issuance_sim.facet.limit"=>6,
  "f.box_number_sim.facet.limit"=>6,
  "f.folder_name_sim.facet.limit"=>6,
  "f.folder_number_sim.facet.limit"=>6,
  "f.has_number_sim.facet.limit"=>6,
  "f.is_volume_sim.facet.limit"=>6,
  "f.series_name_sim.facet.limit"=>6,
  "f.series_number_sim.facet.limit"=>6,
  "f.exhibit_sim.facet.limit"=>6,
  "hl"=>true,
  "sort"=>"score desc, system_create_dtsi desc",
  "stats"=>"true",
  "stats.field"=>["date_combined_year_label_ssim"]
  }

The whole search took 2.3 sec, of which 1.3 sec was this query.

I'm not sure how each param can be tuned, but here are some I know:
qf - We're not currently using any boost fields, so this list can likely be trimmed or dropped. Note that without qf Solr falls back to the default field (df), so check that search coverage doesn't change: https://solr.apache.org/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqf_QueryFields_Parameter
f.*.facet.limit - Since every one of these is 6, we can condense them down to a single facet.limit=>6, and that might help. We might need a condenser function, or we can set a default and use individual params like these only to override it.
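A condenser function for the f.*.facet.limit params could look roughly like the sketch below. This is only an illustration: `condense_facet_limits` is a hypothetical helper, not something Blacklight or Hyrax provides, and it assumes the Solr params are available as a plain Ruby hash with string keys (as in the log above) and that Ruby 2.7+ is available for `tally`. One thing to verify before using anything like this: a global facet.limit also applies to facet fields that currently have no per-field limit, which would otherwise use Solr's own default.

```ruby
# Hypothetical helper (not part of Blacklight or Hyrax): collapse repeated
# "f.<field>.facet.limit" params into one global "facet.limit", keeping
# per-field params only where they differ from that global value.
FACET_LIMIT_KEY = /\Af\.(.+)\.facet\.limit\z/

def condense_facet_limits(params)
  per_field = params.select { |key, _| key.to_s.match?(FACET_LIMIT_KEY) }
  return params.dup if per_field.empty?

  # The most common per-field limit becomes the global default.
  default_limit = per_field.values.tally.max_by { |_, count| count }.first

  # Drop per-field params that just repeat the default; keep real overrides.
  condensed = params.reject do |key, value|
    per_field.key?(key) && value == default_limit
  end
  condensed.merge('facet.limit' => default_limit)
end

# Example input; the limit of 11 is made up to show an override surviving.
example = {
  'q' => 'water',
  'f.file_format_sim.facet.limit' => 6,
  'f.language_label_sim.facet.limit' => 6,
  'f.exhibit_sim.facet.limit' => 11
}
condensed_example = condense_facet_limits(example)
```

With the example input, the two params set to 6 are replaced by a single facet.limit of 6, and the hypothetical f.exhibit_sim.facet.limit override of 11 is kept. The input hash is not modified.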

This COULD allow us to go back to GET Solr queries. We currently use POST because the queries are too long, which could also be another source of slowdown.

I honestly don't know if this will speed anything up, so it's worth doing a little testing first.
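For the testing, a small timing helper built on Ruby's standard `Benchmark` module may be enough to compare the current params against a trimmed set. This is a sketch: `median_seconds` is a hypothetical helper, and `run_search` in the usage comment stands in for however the app actually sends params to Solr.

```ruby
require 'benchmark'

# Hypothetical helper: run the block several times and return the median
# wall-clock time in seconds (the upper median when runs is even), so one
# slow outlier does not dominate the comparison.
def median_seconds(runs: 5)
  times = Array.new(runs) { Benchmark.realtime { yield } }.sort
  times[times.length / 2]
end

# Usage sketch; run_search, full_params and trimmed_params are placeholders:
#   before = median_seconds { run_search(full_params) }
#   after  = median_seconds { run_search(trimmed_params) }
#   puts format('before: %.3fs, after: %.3fs', before, after)
```

Solr caches results, so repeated identical queries may look faster than a first request would. It is worth comparing cold and warm runs separately.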

Expected behavior

Extraneous Solr params are cleaned up.

Related work

#2862 - This may be the correct solution

Accessibility Concerns
