feat!: implement improved fulltext search #671

Open
wants to merge 4 commits into
base: main

winged (Contributor) commented Oct 17, 2024

The search endpoint now represents a different data type (not "files"
anymore!). The search also behaves rather differently now:

  • Search results are ordered by search rank
  • Search results get the search context (a fragment of the text)
    as part of the response
  • File and document are referenced as related fields (and includes are available)

This should provide a highly performant, useful search that does what it should[tm]

BREAKING CHANGE: This changes the structure and type of the search
endpoint's data.
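
To make the new result shape concrete, here is a toy, pure-Python sketch of rank-ordered results that carry a context fragment. It does not use PostgreSQL full-text search: the `SearchResult` class, its field names, and the term-count ranking are illustrative assumptions, not the endpoint's actual schema. Only the `<b>` highlighting mirrors the default markers of PostgreSQL's `ts_headline`.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    # Hypothetical field names; the real endpoint's schema may differ.
    file_id: str
    document_id: str
    search_rank: float
    search_context: str


def make_context(text: str, term: str, width: int = 20) -> str:
    """Return a fragment around the first match, with the match in <b> tags."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    match = text[pos:pos + len(term)]
    return text[start:pos] + "<b>" + match + "</b>" + text[pos + len(term):end]


def search(files: list[dict], term: str) -> list[SearchResult]:
    """Toy search: rank = number of occurrences of the term, best match first."""
    results = []
    for f in files:
        rank = f["content_text"].lower().count(term.lower())
        if rank == 0:
            continue  # files without a match are not part of the result set
        results.append(
            SearchResult(
                file_id=f["id"],
                document_id=f["document"],
                search_rank=float(rank),
                search_context=make_context(f["content_text"], term),
            )
        )
    return sorted(results, key=lambda r: r.search_rank, reverse=True)
```

Each result refers to its file and document by ID instead of embedding them, which is roughly what "referenced as related fields" suggests.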

Drive-By: chore: do not restart minio config container

Previously, when MC failed, the container would be restarted forever, even after the Alexandria dev env was stopped.

Drive-By: feat(filters): add "only_newest" filter for files

This helps to select files when we're searching (or otherwise looking
for files) and only want the newest version.
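
A minimal pure-Python sketch of what an "only newest" selection means, assuming each file belongs to a document and carries a creation timestamp. The `FileVersion` class and its fields are hypothetical; the actual filter operates on a Django queryset and may define "newest" differently.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileVersion:
    # Hypothetical fields, for illustration only.
    id: str
    document_id: str
    created_at: datetime


def only_newest(files: list[FileVersion]) -> list[FileVersion]:
    """Keep, per document, only the file with the latest created_at."""
    newest: dict[str, FileVersion] = {}
    for f in files:
        current = newest.get(f.document_id)
        if current is None or f.created_at > current.created_at:
            newest[f.document_id] = f
    return list(newest.values())
```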

Drive-By: feat(cmdline): search utility

This runs the search as if it were run through the search endpoint.
Note that this is mainly intended for performance testing. Auth and
visibility are currently not supported. If you enable
visibility / auth, it will likely break or not return anything.

winged (Contributor, Author) commented Oct 23, 2024

Status update: I let some custom file/document factories run overnight, created over 400k files and then used the FTS in this branch to search.

Using the new command-line utility, I was able to run some tests: typical query duration is between 0.05 and 0.2 seconds.
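
Timings like these can be collected with a small harness. The sketch below is not the command-line utility from this PR; `time_query` and the `run_query` callable are hypothetical stand-ins for whatever executes a single search.

```python
import time
from typing import Callable


def time_query(run_query: Callable[[], object], repeats: int = 5) -> dict:
    """Run the query several times and report min/max/mean wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": sum(durations) / len(durations),
    }
```

Repeating the query helps separate cold-cache from warm-cache behaviour, which matters when judging a range such as the one reported above.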

)
queryset = queryset.annotate(
    search_rank=SearchRank(F("content_vector"), search_query),
    search_context=SearchHeadline(F("content_text"), search_query),
Member

If the search matches the file name but not the content, there will be no context. But that should be OK.
