Keep length normalization for fulltext fields after all. #2435

Merged Nov 21, 2023 (1 commit)
32 changes: 19 additions & 13 deletions solr/config/schema.xml
@@ -257,24 +257,30 @@
<field name="latest_date" type="date_sortmissinglast" stored="true" indexed="true" multiValued="false"/>
<field name="earliest_date" type="date_sortmissinglast" stored="true" indexed="true" multiValued="false"/>

+ <!-- Three full text fields (containing oral history transcripts, transcriptions, translations, OCR text, and the like).

- <!-- create a field for fulltext (eg oral history transcripts), let's try with omitNorms false.
- stored=true is necessary for highlighting.
+ stored=true is necessary for highlighting.

+ We've gone back and forth about whether to use length normalization for these fields.
+ (see omitNorms at https://solr.apache.org/guide/solr/latest/indexing-guide/fields.html).
+ The consensus is it doesn't make that big of a difference in practice, so we're going
+ with the standard for fulltext fields (omitNorms="false").
+ See https://github.com/sciencehistory/scihist_digicoll/issues/2013 for the discussion.

- storeOffsetsWithPositions gets us faster highlighting for very large fields, in return for
- somewhat larger index size.
+ storeOffsetsWithPositions gets us faster highlighting for very large fields, in return for
+ somewhat larger index size.
- https://lucene.apache.org/solr/guide/8_0/highlighting.html#Highlighting-SchemaOptionsandPerformanceConsiderations

+ https://lucene.apache.org/solr/guide/8_0/highlighting.html#Highlighting-SchemaOptionsandPerformanceConsiderations

- decided to keep omitNorms=true for now https://github.com/sciencehistory/scihist_digicoll/issues/2013
- -->
- <field name="searchable_fulltext_en" type="text_en" stored="true" indexed="true" multiValued="true" omitNorms="true" storeOffsetsWithPositions="true"/>
+ -->

+ <!-- Full text search for works entirely in English -->
+ <field name="searchable_fulltext_en" type="text_en" stored="true" indexed="true" multiValued="true" omitNorms="false" storeOffsetsWithPositions="true"/>

- <!-- Full text search for works in German -->
- <field name="searchable_fulltext_de" type="text_de" stored="true" indexed="true" multiValued="true" omitNorms="true" storeOffsetsWithPositions="true"/>
+ <!-- Full text search for works entirely in German -->
+ <field name="searchable_fulltext_de" type="text_de" stored="true" indexed="true" multiValued="true" omitNorms="false" storeOffsetsWithPositions="true"/>

- <!-- Full text search for works in neither English nor German -->
- <field name="searchable_fulltext_language_agnostic" type="text" stored="true" indexed="true" multiValued="true" omitNorms="true" storeOffsetsWithPositions="true"/>
+ <!-- Full text search for works that are neither entirely in English, nor entirely in German -->
+ <field name="searchable_fulltext_language_agnostic" type="text" stored="true" indexed="true" multiValued="true" omitNorms="false" storeOffsetsWithPositions="true"/>

<!-- added by Science History Institute, a dynamic field that's good for string facets, using docValues fields -->
<dynamicField name="*_facet" type="string_dv" multiValued="true"/>
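
Not part of this change: a minimal sketch of how the three fulltext fields might be wired into highlighting, since the schema comment says stored="true" and storeOffsetsWithPositions="true" exist to support it. The handler name, its placement in solrconfig.xml, and the choice of the unified highlighter are assumptions for illustration, not taken from this repository. `hl`, `hl.method`, and `hl.fl` are standard Solr highlighting parameters; the unified highlighter is the one that can read offsets from postings when a field sets storeOffsetsWithPositions="true".

```xml
<!-- Hypothetical solrconfig.xml fragment; handler name and defaults are assumed, not from this PR. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Turn highlighting on by default. -->
    <str name="hl">true</str>
    <!-- The unified highlighter can use offsets stored in postings
         (storeOffsetsWithPositions="true") instead of re-analyzing large stored text. -->
    <str name="hl.method">unified</str>
    <!-- Highlight in the three fulltext fields defined in schema.xml. -->
    <str name="hl.fl">searchable_fulltext_en searchable_fulltext_de searchable_fulltext_language_agnostic</str>
  </lst>
</requestHandler>
```

Changing omitNorms on an existing field only takes effect for documents indexed after the change, so a full reindex would normally accompany a schema edit like this one.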