Solr is outputting odd/binary-ish characters #206

Open
DonRichards opened this issue Jan 10, 2022 · 1 comment

@DonRichards
Member

Take ss_field_alternative_title as an example. Once the data is indexed, Solr outputs two fields for it: ss_field_alternative_title and sort_X3b_en_field_alternative_title. You'll notice the odd text in the sort_X3b_en_ field. These fields may be needed for something (I don't know what), but the Solr community suggested simply setting useDocValuesAsStored to false.

What seemed to work for me was changing this line in the schema_extra_types.xml file:

<!-- FROM -->
<fieldType name="collated_en" class="solr.ICUCollationField" locale="en" strength="primary" caseLevel="false"/>
<!-- TO -->
<fieldType name="collated_en" class="solr.ICUCollationField" locale="en" useDocValuesAsStored="false" strength="primary" caseLevel="false"/>

Sample output on single Islandora object

{
  "id":"q5isd3-default_solr_index-entity:node/15:en",
  "sm_context_tags":["search_api_X2f_index_X3a_default_solr_index",
    "search_api_solr_X2f_site_hash_X3a_q5isd3",
    "drupal_X2f_langcode_X3a_en"],
  "tm_X3b_en_field_description":["Close-up Photo of brown Labrador Retriever puppy with its tongue out."],
  "tm_X3b_en_title":["Puppy with Tongue Out"],
  "ss_field_alternative_title":"xqxxyk",
  "ss_author":"admin",
  "sm_field_extent":["1 item"],
  "sort_X3b_en_field_extent":"\u0014\u00049O1A\u0000",
  "sm_name":["Dogs"],
  "ds_changed":"2022-01-07T22:04:23Z",
  "bs_sticky":false,
  "ss_search_api_id":"entity:node/15:en",
  "sort_X3b_en_author":")/A9C\u0000",
  "sort_X3b_en_name_1":"9A)51\u0000",
  "its_uid":1,
  "sort_X3b_en_name":"/E5M\u0000",
  "index_id":"default_solr_index",
  "timestamp":"2022-01-10T15:25:51Z",
  "sm_node_grants":["node_access_all:0"],
  "sort_X3b_en_title":"GQGGY\u0004U9O7\u0004OEC5Q1\u0004EQO\u0000",
  "ss_type":"islandora_object",
  "sort_X3b_en_type":"9M?)C/EK)\u0005\nE+;1-O\u0000",
  "sort_X3b_en_field_alternative_title":"WIWWY=\u0000",
  "ss_name_1":"Image",
  "ds_changed_1":"2022-01-07T22:04:23Z",
  "site":"https://islandora.traefik.me/",
  "boost_document":1.0,
  "ds_created":"2019-06-04T17:17:50Z",
  "_version_":1721581805715324928,
  "sort_X3b_en_field_description":"-?EM1\u0005\u000eQG\u0004G7EOE\u0004E3\u0004+KEUC\u0004?)+K)/EK\u0004K1OK91S1K\u0004GQGGY\u0004U9O7\u00049OM\u0004OEC5Q1\u0004EQO\b\u0000",
  "sort_X3b_en_node_grants":"CE/1\u0005\n)--1MM\u0005\n)??\u0007,\u0012\u0000",
  "bs_status":true,
  "ss_search_api_datasource":"entity:node",
  "hash":"q5isd3",
  "ss_search_api_language":"en"
}

The sort_X3b_en_ fields in the Solr document hold raw binary collation keys, and Solr returns those keys as if they were UTF-8 strings (which they aren't). So I was thinking of no longer returning these fields as "stored" to clean up the search results. According to the docs:
When useDocValuesAsStored="false", non-stored DocValues fields can still be explicitly requested by name in the fl param, but will not match glob patterns ("*"). - SOLR
I can give you examples of what it looks like before and after if that helps.
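
For reference, the documentation quoted above implies that a hidden field can still be fetched by naming it in fl alongside "*". Below is a minimal Python sketch of building such a request URL. The base URL, the core name and the helper name are placeholders for illustration, not values from this setup.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, doc_id, extra_fields=()):
    """Build a Solr /select URL that asks for "*" plus explicitly named fields.

    base_url and core are hypothetical; substitute the values of your own install.
    """
    # "*" no longer matches the collation fields, so they are listed by name.
    fl = ",".join(["*", *extra_fields])
    params = {"q": f'id:"{doc_id}"', "fl": fl, "wt": "json"}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr",  # assumed Solr address
    "default",                     # assumed core name
    "q5isd3-default_solr_index-entity:node/15:en",
    ["sort_X3b_en_title"],
)
print(url)
```

Leaving extra_fields empty gives the plain "*" request, which is the case where the sort_X3b_en_ fields should no longer show up.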

Sample after fix is applied

{
  "id":"q5isd3-default_solr_index-entity:node/15:en",
  "sm_context_tags":["search_api_X2f_index_X3a_default_solr_index",
    "search_api_solr_X2f_site_hash_X3a_q5isd3",
    "drupal_X2f_langcode_X3a_en"],
  "tm_X3b_en_field_description":["Close-up Photo of brown Labrador Retriever puppy with its tongue out."],
  "tm_X3b_en_title":["Puppy with Tongue Out"],
  "sm_node_grants":["node_access_all:0"],
  "ss_type":"islandora_object",
  "ss_field_alternative_title":"xqxxyk",
  "ss_author":"admin",
  "sm_field_extent":["1 item"],
  "ss_name_1":"Image",
  "ds_changed_1":"2022-01-07T22:04:23Z",
  "sm_name":["Dogs"],
  "ds_changed":"2022-01-07T22:04:23Z",
  "bs_sticky":false,
  "ss_search_api_id":"entity:node/15:en",
  "site":"https://islandora.traefik.me/",
  "boost_document":1.0,
  "ds_created":"2019-06-04T17:17:50Z",
  "_version_":1721581805715324928,
  "bs_status":true,
  "its_uid":1,
  "ss_search_api_datasource":"entity:node",
  "index_id":"default_solr_index",
  "hash":"q5isd3",
  "timestamp":"2022-01-10T15:25:51Z",
  "ss_search_api_language":"en"
}
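
One way to check which fields the change hides is to diff the key sets of the two responses. A minimal Python sketch, using abbreviated copies of the two documents above (the helper name is illustrative only):

```python
def dropped_fields(before, after):
    """Return field names present in `before` but missing from `after`, sorted."""
    return sorted(set(before) - set(after))


# Abbreviated versions of the "before" and "after" documents shown above.
before = {
    "ss_author": "admin",
    "sort_X3b_en_author": ")/A9C\u0000",
    "ss_name_1": "Image",
    "sort_X3b_en_name_1": "9A)51\u0000",
}
after = {
    "ss_author": "admin",
    "ss_name_1": "Image",
}

missing = dropped_fields(before, after)
print(missing)  # ['sort_X3b_en_author', 'sort_X3b_en_name_1']
```

Running the same comparison on the full responses would show whether anything other than the sort_X3b_en_ fields went missing.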

And if I'm reading the documentation correctly, this shouldn't impact anything that depends on those values, but it should keep them from being returned when "*" searches are carried out. Would love some feedback. I'd be happy to submit this as a pull request if I knew it was a good decision and where the file should be modified.

@jasonhildebrand
Contributor

I saw these fields recently when working on Islandora and was confused by them. I can't imagine that anything in Islandora would rely on the presence of these fields, which don't contain usable data. I don't see any downside, really. In the unlikely case that setting useDocValuesAsStored="false" does have an impact, this change can be easily reverted.