FTS5 auxiliary and tokenization function support #473

Open · 49 of 67 tasks

rogerbinns opened this issue Aug 17, 2023 · 0 comments

rogerbinns (Owner) commented Aug 17, 2023

Make all this possible. Especially useful for ranking functions and synonyms.

By far the biggest difficulty is dealing with utf8 byte offsets in the tokenizer instead of codepoints.
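
A minimal illustration of that mapping (not APSW's implementation, just the underlying arithmetic): Python string operations work in codepoints, while FTS5 wants offsets into the utf8 encoded bytes, so every offset a tokenizer reports has to be translated.

```python
def codepoint_to_utf8_offset(text: str, offset: int) -> int:
    "Return the utf8 byte offset corresponding to a codepoint offset"
    return len(text[:offset].encode("utf8"))

text = "héllo wörld"
start, end = 6, 11  # codepoint offsets of "wörld"
utf8 = text.encode("utf8")
bstart = codepoint_to_utf8_offset(text, start)
bend = codepoint_to_utf8_offset(text, end)
# the multi-byte é and ö shift the byte offsets to 7 and 13
assert utf8[bstart:bend].decode("utf8") == "wörld"
```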

  • Wrap calling existing tokenizers
  • Add an exception type for stale pointers that occur in other places in the code too
  • Implement own tokenizers
  • apsw.fts unicode tokenizer
  • apsw.fts.regex tokenizer
  • apsw.fts tokenizer that tries to work out utf8 offsets, see code in regex. Works provided the text was not changed
  • apsw.fts stopwords tokenizer
  • apsw.fts synonyms tokenizer (see the synonym filter sketch after this list)
  • apsw.fts test strings
  • apsw.fts faceted searching
  • fossil tokenizers? sqlite site search uses tokenize='html stoken unicode61 tokenchars _'
  • aux function to return which columns matched
  • aux function to score matches higher the earlier in a column they occur
  • aux function to give different weighting to different columns
  • bm25 in python to show how to do it (see the BM25 sketch after this list)
  • apsw.fts query builder? Useful for query expansion etc
  • shell dot command .ftsq
  • JSON tokenizer
  • Emoji names synonyms (not doing because they can be multiple words)
  • Ngram tokenizer to use units of grapheme clusters not codepoints
  • Tokenizer filter that allows injecting tokens directly (in TOKENIZE_QUERY mode recognise in-band signalling and directly return, otherwise pass upstream. Check in fts5table if token injection is supported)
  • Check .pyi file has tokenize constants
  • Wrap auxiliary function Fts5ExtensionApi
  • Implement own aux function
  • Rename various things in apsw.fts to be better
  • like shlex.split but using sqlite quoting rules; useful for shell and other contexts
  • key terms
  • IDF in C if Python too slow
  • grapheme aware highlight
  • sentence aware snippet
  • emoji
  • subsequence matching (eg the input characters must appear in the document in order, with any number of characters allowed between each)
  • doc -m apsw.fts
  • consider -m apsw.ftstool for all the import tools
  • allow str or bytes everywhere utf8 is a parameter
  • Query expansion like in whoosh
  • "Did you mean?" replacements for mis-typed words like in whoosh
  • Autocomplete example using ngram and subsequence
  • Better ngram than builtin?
  • Performance profile html tokenizer - .ftsq search zeromalloc on sqlite doc takes one second to show results, which is way too slow. RESULT: snippet calls the tokenizer twice (once to score sentences, once to highlight). Most time is spent in the stdlib html parser parse_start/end_tag, goahead methods and all the regular expression stuff they do. Our code is less than 5% of execution time.
  • Update example code
  • Check tokendata works (embedded null in the token) and perhaps advise its use?
  • Type stubs need overload generation to fixup different returns based on parameters for at least tokenizer call
  • Implement Unicode TR-29 and TR-14 #509
  • Helper to figure out that eg play station as two tokens could be playstation as one token, or vice versa
  • Convenience wrapper around fts5vocab
  • Convenience wrapper around commands
  • Updates for extension api new functions https://www.sqlite.org/src/timeline?r=fts5-token-data
  • Anything else useful including equivalent examples from whoosh
  • Usage of stuff from https://www.nltk.org/
  • Figure out if, and how, to handle dates. eg a tokenizer in doc mode can colocate various levels of precision, while in query mode it can turn yesterday or last year into tokens matching doc mode
  • Check for ::TODO::s
  • Change category mask in _unicodedb to use 64 bit so we get one bit per category
  • Update the changes doc about the new out of scope exception
  • Possible to highlight ftsq matches in shell - using snippet with colour codes fails because they get quoted. Perhaps a private use char to mark the beginning and end of a highlight that output modes understand?
  • ShowResourceUsage should also get sqlite3_db_status fields and show changes
  • Tokens should be bytes not str - no they should be str
  • content table as view
  • Check xapian doc for features and examples
  • Check and update typing. Generator should be used on yielding functions
  • Consider adding codepoint names to apsw.unicode - need an effective "compression" mechanism
  • fts5-locale branch
  • GIL release in all the places?
  • Move ftstest.py into tests.py
  • Remove makefile ftscoverage rule
  • Remove "utf8" parameter from all encode and decode calls as it is the default
  • Work with all the CPython versions
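
For the bm25 item above, the textbook Okapi BM25 formula is small enough to sketch in pure Python. This is only the standard formula, not the wired-up auxiliary function; turning it into one needs the Fts5ExtensionApi wrapper tracked above to supply the per-term and per-document statistics.

```python
import math

def bm25(term_freq, doc_len, avg_doc_len, total_docs, docs_with_term,
         k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single term in a single document.

    term_freq      occurrences of the term in this document
    doc_len        tokens in this document
    avg_doc_len    average tokens per document in the collection
    total_docs     documents in the collection
    docs_with_term documents containing the term at least once
    """
    idf = math.log((total_docs - docs_with_term + 0.5)
                   / (docs_with_term + 0.5) + 1)
    tf = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf

# a document mentioning the term 3 times, slightly longer than average
print(bm25(term_freq=3, doc_len=120, avg_doc_len=100,
           total_docs=1000, docs_with_term=50))
```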
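For the synonyms tokenizer item, a rough sketch of how a synonym filter could work. FTS5's native synonym support emits extra "colocated" tokens at the same offsets as the original; the (start, end, token) generator interface here is an assumption for illustration only, since the real apsw tokenizer interface is part of what this issue tracks.

```python
def synonym_filter(tokenizer, synonyms):
    "Wrap a tokenizer, emitting synonyms colocated with the original token"
    def wrapped(utf8: bytes, flags: int):
        for start, end, token in tokenizer(utf8, flags):
            yield start, end, token
            for alt in synonyms.get(token, ()):
                # same byte offsets: FTS5 treats this as a colocated token
                yield start, end, alt
    return wrapped

def simple_tokenizer(utf8: bytes, flags: int):
    "Naive whitespace tokenizer that tracks utf8 byte offsets"
    pos = 0
    for word in utf8.split():
        start = utf8.index(word, pos)
        end = start + len(word)
        pos = end
        yield start, end, word.decode("utf8").lower()

tok = synonym_filter(simple_tokenizer, {"first": ["1st"]})
print(list(tok(b"First things first", 0)))
```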

Some inspiration:
