Assemble list of quirks with our full text search engine (esp. punctuation) #255

Open
joewiz opened this issue Oct 26, 2016 · 9 comments

@joewiz
Member

joewiz commented Oct 26, 2016

@HistoryAtState/editors: Please post any more examples you know of, and we'll work with @HistoryAtState/existsolutions to try out different Lucene analyzers and/or add advanced search form controls to see what combination can produce our expected results.

@joewiz joewiz modified the milestone: 1.2 - (New features) Feb 22, 2017
@joewiz
Member Author

joewiz commented Jul 26, 2017

See HistoryAtState/frus@fc3f8b3 for an experiment with stopwords.

@joewiz
Member Author

joewiz commented Aug 3, 2017

For more on stopwords, see:

@joewiz
Member Author

joewiz commented Oct 2, 2018

Search scope, index configuration, and plumbing issues

  • administrative-timeline: needs to be added to $search:SECTIONS
  • frus-history/articles: needs index on tei:div, needs to be added to $search:SECTIONS
  • frus-history/documents: needs index on tei:body, needs to be added to $search:SECTIONS
  • other-publications/serial-set: needs to be added to $search:SECTIONS
  • other-publications/vietnam-guide: needs to be added to $search:SECTIONS
  • wwdai: needs index on tei:div

Search omissions

  • carousel and frus-history/events: have never been included in search, could be useful
  • frus: volume titles not searched, only document titles
  • pocom: searched only on persName, not on position title, org name, or country
  • hsg-shell/pages: contains substantive text in some HTML
  • tags: not searched or leveraged

@joewiz
Member Author

joewiz commented Oct 24, 2018

From Michael McCoyer:

Sending you another search “quirk” I just encountered (I hope when you encouraged me to send these along that you really meant it!) –

So, our common terminology in FRUS headers for the National Security Advisor is “President’s Assistant for National Security Affairs.” When I searched that syntax as a phrase in Kristin’s Human Rights volume (https://history.state.gov/historicaldocuments/frus1977-80v02), however, I received no hits. The phrase does, however, appear in a number of headers in the volume – e.g., Docs 4 and 16.

Is it possible that the search queries don’t search on headers, or that they search them differently? I will leave this puzzle in your capable hands…

The exact URL for his search that returned zero hits is:

https://history.state.gov/search?q=%22president%27s+assistant+for+national+security+affairs%22&volume-id=frus1977-80v02

However, I found that if I changed the apostrophe in "president's" from straight (') to curly (’), the expected hits suddenly appeared:

https://history.state.gov/search?q=%22president%E2%80%99s+assistant+for+national+security+affairs%22&volume-id=frus1977-80v02&within=documents&sort-by=relevance

Thus, it appears that our search engine treats the curly quote as a literal character - like a letter in a word - rather than as punctuation that should be dropped. We need to get the search engine to treat curly quotes as straight quotes.
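
To make the proposed fix concrete, here is a minimal sketch of the quote folding in Python. It only illustrates the character mapping; `normalize_quotes` and `CURLY_TO_STRAIGHT` are hypothetical names, not part of our codebase, and a real fix would presumably live in the Lucene analyzer so that the same folding is applied to both the indexed text and the query.

```python
# Hypothetical helper: fold typographic quotes to their ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (curly apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(CURLY_TO_STRAIGHT)


# Both spellings of the query fold to the same string.
curly = "president\u2019s assistant for national security affairs"
straight = "president's assistant for national security affairs"
print(normalize_quotes(curly) == normalize_quotes(straight))
```

Folding on only one side (query or index) would just move the mismatch, which is why the sketch assumes both sides pass through the same function.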

@joewiz
Member Author

joewiz commented Mar 27, 2019

From @joshbotts via the mailbox:

Wanted to flag this mailbox inquiry as an instance where a case-sensitive search capability would come in handy. Searching for "goa" returns several hundred hits, but most of them are for abbreviations for "Government of ..." rather than the geographical entity in South Asia.

--> User story: I want to search for "Goa" and exclude hits that are upper case ("GOA")
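
One possible way to approximate this user story without changing the index is to post-filter hits with a case-sensitive match. The sketch below assumes hits are available as plain-text snippets; `filter_case_sensitive` is a hypothetical helper and the sample snippets are made up for illustration.

```python
import re


def filter_case_sensitive(snippets, term):
    """Keep only snippets that contain `term` as a whole word with exact case."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [s for s in snippets if pattern.search(s)]


# Made-up snippets standing in for case-insensitive search hits on "goa".
hits = [
    "The GOA has informed the Embassy",
    "Portuguese forces in Goa",
    "views of the GOA on this matter",
]
print(filter_case_sensitive(hits, "Goa"))
```

This only narrows results that the case-insensitive index already returned; a proper solution would more likely need a separate case-preserving index field.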

@plutonik-a
Contributor

This comment (#255 (comment)) has already been filed as its own issue: #289

@plutonik-a
Contributor

This issue has been split into several existing and new issues (each with a backlink to this one):

Therefore closing this parent issue.

@marmoure marmoure reopened this Nov 29, 2022
@marmoure
Contributor

@joewiz To search for the term s/s, we can escape the forward slash with a backslash (s\/s); the query then passes without the nasty Lucene error.
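
As a rough sketch of doing this escaping automatically, the Python below backslash-escapes the characters that Lucene's classic query parser syntax treats as special, including the forward slash. `escape_lucene` is a hypothetical helper, not an existing function in our code, and it assumes the input is a single literal term, since it would also escape the double quotes of a phrase query.

```python
import re

# Characters and operators that are special in Lucene's classic query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(term: str) -> str:
    """Prefix each special character or operator in `term` with a backslash."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


print(escape_lucene("s/s"))
```

Applying it to the whole query string would break intentional operators such as wildcards and phrase quotes, so it would only make sense for terms the user means literally.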
