FSTOrdPostingsFormat could enable faster Tagger #79

Open
dsmiley opened this issue May 1, 2018 · 1 comment

Comments

@dsmiley
Member

dsmiley commented May 1, 2018

The Lucene FSTOrdPostingsFormat (Solr schema postingsFormat="FSTOrd50") is like FSTPostingsFormat but has "ordinals", i.e. term ordinals. Most postings formats don't support ordinals, but this one does. In TermPrefixCursor.java I left a comment that it could be more efficient if we could use ordinals. I think this might be true. Instead of eagerly reading & caching the postings (the list of docIDs), we could just capture the ordinal (an int). This would replace some of the "IntsRef" with that integer ordinal, and TPC wouldn't need docIdsCache either. Later, when we resolve it in getDocIds(), that's when we'd do the actual work, which is perhaps not expensive. Sometimes we're never even asked to do that, which saves some time: the tag may have been eliminated due to overlapping, or it may have effectively been cached at a higher level (TaggerRequestHandler transforms to the uniqueKey values and then caches that).
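To make the idea concrete, here is a minimal, self-contained sketch of deferring the postings read until getDocIds() is called. It does not use Lucene at all: `OrdTermDict`, `LazyTag`, `LazyOrdDemo`, `ordOf`, `docIdsForOrd` and `postingsReads` are hypothetical names invented for illustration, and the sorted array is only a stand-in for a term dictionary that can hand out ordinals. How this would map onto TermPrefixCursor is an assumption, not something verified against the code.

```java
// Hypothetical stand-in for a term dictionary that exposes ordinals (not a Lucene class).
final class OrdTermDict {
    private final String[] sortedTerms; // the term at index i has ordinal i
    private final int[][] postings;     // postings[i] holds the docIDs for ordinal i
    int postingsReads = 0;              // how many times postings were actually read

    OrdTermDict(String[] sortedTerms, int[][] postings) {
        this.sortedTerms = sortedTerms;
        this.postings = postings;
    }

    // Cheap lookup: returns the term's ordinal, or -1 if absent. No postings are read.
    int ordOf(String term) {
        int idx = java.util.Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? idx : -1;
    }

    // Deferred work: reads the postings for an ordinal.
    int[] docIdsForOrd(int ord) {
        postingsReads++;
        return postings[ord].clone();
    }
}

// A candidate tag that stores only the ordinal until docIDs are requested.
final class LazyTag {
    final int startOffset;
    final int endOffset;
    private final int ord;
    private final OrdTermDict dict;
    private int[] docIds; // null until first resolved

    LazyTag(int startOffset, int endOffset, int ord, OrdTermDict dict) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.ord = ord;
        this.dict = dict;
    }

    // Resolves the ordinal to docIDs on first use and remembers the result.
    int[] getDocIds() {
        if (docIds == null) {
            docIds = dict.docIdsForOrd(ord);
        }
        return docIds;
    }
}

final class LazyOrdDemo {
    public static void main(String[] args) {
        OrdTermDict dict = new OrdTermDict(
            new String[] {"boston", "new york", "new york city"},
            new int[][] {{1}, {2, 5}, {7}});

        // Two overlapping candidates over the text "new york city".
        LazyTag shorter = new LazyTag(0, 8, dict.ordOf("new york"), dict);
        LazyTag longer = new LazyTag(0, 13, dict.ordOf("new york city"), dict);

        // Overlap elimination keeps the longer tag; the shorter one is never resolved.
        int longerLen = longer.endOffset - longer.startOffset;
        int shorterLen = shorter.endOffset - shorter.startOffset;
        LazyTag kept = longerLen >= shorterLen ? longer : shorter;

        System.out.println(java.util.Arrays.toString(kept.getDocIds()));
        System.out.println("postings reads: " + dict.postingsReads);
    }
}
```

In this toy run only the surviving tag triggers a postings read; the eliminated one costs a single int. Whether the same saving shows up in the real Tagger depends on how expensive the deferred read turns out to be.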

I'm not sure how much benefit this would bring; it could even be a net loss. It's hard to be sure.

The downside is we'd basically be limited to this PostingsFormat. At least the PostingsWriterBase aspect of this one is pluggable (kinda), should we want some future improvements to allow a totally in-memory option. To ameliorate this downside, we could support any PF by grabbing the "TermState" instead, and presumably the TermState of FSTOrdPostingsFormat is effectively the ordinal.

@dsmiley
Member Author

dsmiley commented May 1, 2018

Upon further inspection of FSTOrdPostingsFormat (actually FSTOrdTermsReader), it has TODOs for ord(), which is bizarre. Why does this postingsFormat even exist if it doesn't yet support ords?
I filed an issue: https://issues.apache.org/jira/browse/LUCENE-8285
