FTS5 auxiliary and tokenization function support #473

Open · 49 of 67 tasks

rogerbinns opened this issue Aug 17, 2023 · 0 comments

rogerbinns (Owner) commented Aug 17, 2023

Make all this possible. Especially useful for ranking functions and synonyms.

By far the biggest difficulty is dealing with utf8 byte offsets in the tokenizer instead of codepoints.
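
A minimal illustration of that mapping (not APSW's implementation, just the underlying arithmetic): Python string operations work in codepoints, while FTS5 wants offsets into the utf8 encoded bytes, so every offset a tokenizer reports has to be translated.

```python
def codepoint_to_utf8_offset(text: str, offset: int) -> int:
    "Return the utf8 byte offset corresponding to a codepoint offset"
    return len(text[:offset].encode("utf8"))

text = "héllo wörld"
start, end = 6, 11  # codepoint offsets of "wörld"
utf8 = text.encode("utf8")
bstart = codepoint_to_utf8_offset(text, start)
bend = codepoint_to_utf8_offset(text, end)
# the multi-byte é and ö shift the byte offsets to 7 and 13
assert utf8[bstart:bend].decode("utf8") == "wörld"
```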

  • Wrap calling existing tokenizers
  • Add an exception type for stale pointers that occur in other places in the code too
  • Implement own tokenizers
  • apsw.fts unicode tokenizer
  • apsw.fts.regex tokenizer
  • apsw.fts tokenizer that tries to work out utf8 offsets, see code in regex. Works provided the text was not changed
  • apsw.fts stopwords tokenizer
  • apsw.fts synonyms tokenizer (see the synonym filter sketch after this list)
  • apsw.fts test strings
  • apsw.fts faceted searching
  • fossil tokenizers? sqlite site search uses tokenize='html stoken unicode61 tokenchars _'
  • aux function to return which columns matched
  • aux function to score matches higher the earlier in a column they occur
  • aux function to give different weighting to different columns
  • bm25 in python to show how to do it (see the BM25 sketch after this list)
  • apsw.fts query builder? Useful for query expansion etc
  • shell dot command .ftsq
  • JSON tokenizer
  • Emoji names synonyms (not doing because they can be multiple words)
  • Ngram tokenizer to use units of grapheme clusters not codepoints
  • Tokenizer filter that allows injecting tokens directly (in TOKENIZE_QUERY mode recognise in-band signalling and directly return, otherwise pass upstream. Check in fts5table if token injection is supported)
  • Check .pyi file has tokenize constants
  • Wrap auxiliary function Fts5ExtensionApi
  • Implement own aux function
  • Rename various things in apsw.fts to be better
  • like shlex.split but using sqlite quoting rules; useful for shell and other contexts
  • key terms
  • IDF in C if Python too slow
  • grapheme aware highlight
  • sentence aware snippet
  • emoji
  • subsequence matching (eg the input characters must appear in the document in order, with any number of characters allowed between each)
  • doc -m apsw.fts
  • consider -m apsw.ftstool for all the import tools
  • allow str or bytes everywhere utf8 is a parameter
  • Query expansion like in whoosh
  • "Did you mean?" replacements for mis-typed words like in whoosh
  • Autocomplete example using ngram and subsequence
  • Better ngram than builtin?
  • Performance profile html tokenizer - .ftsq search zeromalloc on sqlite doc takes one second to show results, which is way too slow. RESULT: snippet calls the tokenizer twice (once to score sentences, once to highlight). Most time is spent in the stdlib html parser parse_start/end_tag, goahead methods and all the regular expression stuff they do. Our code is less than 5% of execution time.
  • Update example code
  • Check tokendata works (embedded null in the token) and perhaps advise its use?
  • Type stubs need overload generation to fixup different returns based on parameters for at least tokenizer call
  • Implement Unicode TR-29 and TR-14 #509
  • Helper to figure out that eg play station as two tokens could be playstation as one token, or vice versa
  • Convenience wrapper around fts5vocab
  • Convenience wrapper around commands
  • Updates for extension api new functions https://www.sqlite.org/src/timeline?r=fts5-token-data
  • Anything else useful including equivalent examples from whoosh
  • Usage of stuff from https://www.nltk.org/
  • Figure out if, and how, to handle dates. eg a tokenizer in doc mode can colocate various levels of precision, while in query mode it can turn yesterday or last year into tokens matching doc mode
  • Check for ::TODO::s
  • Change category mask in _unicodedb to use 64 bit so we get one bit per category
  • Update the changes doc about the new out of scope exception
  • Possible to highlight ftsq matches in shell - using snippet with colour codes fails because they get quoted. Perhaps a private use char to mark the beginning and end of a highlight that output modes understand?
  • ShowResourceUsage should also get sqlite3_db_status fields and show changes
  • Tokens should be bytes not str - no they should be str
  • content table as view
  • Check xapian doc for features and examples
  • Check and update typing. Generator should be used on yielding functions
  • Consider adding codepoint names to apsw.unicode - need an effective "compression" mechanism
  • fts5-locale branch
  • GIL release in all the places?
  • Move ftstest.py into tests.py
  • Remove makefile ftscoverage rule
  • Remove "utf8" parameter from all encode and decode calls as it is the default
  • Work with all the CPython versions
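
For the bm25 item above, the textbook Okapi BM25 formula is small enough to sketch in pure Python. This is only the standard formula, not the wired-up auxiliary function; turning it into one needs the Fts5ExtensionApi wrapper tracked above to supply the per-term and per-document statistics.

```python
import math

def bm25(term_freq, doc_len, avg_doc_len, total_docs, docs_with_term,
         k1=1.2, b=0.75):
    """Okapi BM25 contribution of a single term in a single document.

    term_freq      occurrences of the term in this document
    doc_len        tokens in this document
    avg_doc_len    average tokens per document in the collection
    total_docs     documents in the collection
    docs_with_term documents containing the term at least once
    """
    idf = math.log((total_docs - docs_with_term + 0.5)
                   / (docs_with_term + 0.5) + 1)
    tf = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf

# a document mentioning the term 3 times, slightly longer than average
print(bm25(term_freq=3, doc_len=120, avg_doc_len=100,
           total_docs=1000, docs_with_term=50))
```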
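For the synonyms tokenizer item, a rough sketch of how a synonym filter could work. FTS5's native synonym support emits extra "colocated" tokens at the same offsets as the original; the (start, end, token) generator interface here is an assumption for illustration only, since the real apsw tokenizer interface is part of what this issue tracks.

```python
def synonym_filter(tokenizer, synonyms):
    "Wrap a tokenizer, emitting synonyms colocated with the original token"
    def wrapped(utf8: bytes, flags: int):
        for start, end, token in tokenizer(utf8, flags):
            yield start, end, token
            for alt in synonyms.get(token, ()):
                # same byte offsets: FTS5 treats this as a colocated token
                yield start, end, alt
    return wrapped

def simple_tokenizer(utf8: bytes, flags: int):
    "Naive whitespace tokenizer that tracks utf8 byte offsets"
    pos = 0
    for word in utf8.split():
        start = utf8.index(word, pos)
        end = start + len(word)
        pos = end
        yield start, end, word.decode("utf8").lower()

tok = synonym_filter(simple_tokenizer, {"first": ["1st"]})
print(list(tok(b"First things first", 0)))
```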

Some inspiration:
