draft: Nested document query #13

Closed
wants to merge 40 commits into from

@neilyio neilyio commented Nov 26, 2024

No description provided.

neilyio and others added 7 commits November 13, 2024 10:55
Use Levenshtein distance to score documents in fuzzy term queries
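The commit title above names Levenshtein distance as the scoring signal for fuzzy term queries. A minimal sketch of the distance computation follows. How the commit turns a distance into a document score is not shown on this page, so `fuzzy_score` and its `1 / (1 + distance)` formula are purely hypothetical.

```rust
/// Classic two-row dynamic-programming Levenshtein distance
/// (insertions, deletions and substitutions each cost 1).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j]
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            let best = (prev[j] + cost) // substitute (or keep)
                .min(prev[j + 1] + 1)   // delete from `a`
                .min(cur[j] + 1);       // insert into `a`
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// HYPOTHETICAL mapping from distance to score: closer terms score higher.
fn fuzzy_score(query_term: &str, indexed_term: &str) -> f32 {
    1.0 / (1.0 + levenshtein(query_term, indexed_term) as f32)
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    assert_eq!(fuzzy_score("wolf", "wolf"), 1.0);
    println!("ok");
}
```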

Fix managed paths (#5)

add RegexPhraseQuery (quickwit-oss#2516)

* add RegexPhraseQuery

RegexPhraseQuery supports phrase queries with regex. It supports regex
and wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"
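The two examples above can be reproduced with a small standalone sketch. It is not the RegexPhraseQuery implementation: it works on an in-memory token list, supports only the `*` wildcard, and models slop as the total number of extra tokens allowed between consecutive phrase terms. The names `wildcard_match` and `phrase_match` are hypothetical.

```rust
/// True if `word` matches `pattern`, where `*` matches any (possibly empty) run of characters.
fn wildcard_match(pattern: &str, word: &str) -> bool {
    let w: Vec<char> = word.chars().collect();
    // dp[j] = the pattern consumed so far matches w[..j]
    let mut dp = vec![false; w.len() + 1];
    dp[0] = true;
    for pc in pattern.chars() {
        let mut next = vec![false; w.len() + 1];
        if pc == '*' {
            // `*` may absorb any number of characters.
            let mut seen = false;
            for j in 0..=w.len() {
                seen |= dp[j];
                next[j] = seen;
            }
        } else {
            for j in 1..=w.len() {
                next[j] = dp[j - 1] && w[j - 1] == pc;
            }
        }
        dp = next;
    }
    dp[w.len()]
}

/// True if the patterns match tokens in order, skipping at most `slop` tokens in total
/// between consecutive phrase terms (a simplified model of slop).
fn phrase_match(patterns: &[&str], tokens: &[&str], slop: usize) -> bool {
    fn go(patterns: &[&str], tokens: &[&str], start: usize, budget: usize, first: bool) -> bool {
        let Some((head, rest)) = patterns.split_first() else {
            return true;
        };
        // The first term may start anywhere; later terms may skip at most `budget` tokens.
        let end = if first { tokens.len() } else { (start + budget + 1).min(tokens.len()) };
        for i in start..end {
            if wildcard_match(head, tokens[i]) {
                let used = if first { 0 } else { i - start };
                if go(rest, tokens, i + 1, budget - used, false) {
                    return true;
                }
            }
        }
        false
    }
    go(patterns, tokens, 0, slop, true)
}

fn main() {
    let doc = ["big", "bad", "wolf"];
    assert!(phrase_match(&["b*", "b*", "wolf"], &doc, 0));
    assert!(phrase_match(&["b*", "wolf"], &doc, 2));
    assert!(!phrase_match(&["big", "wolf"], &doc, 0));
    println!("ok");
}
```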

Regex queries may match a large number of terms, and we still need to
keep track of which term hit in order to load its positions.
The phrase query algorithm groups terms by frequency within the
union so that groups can be prefiltered early.

This PR comes with some new data structures:

SimpleUnion - A union docset for a list of docsets. It doesn't do any
caching and is therefore well suited for datasets with lots of skipping.
(phrase search, but intersections in general)
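A cache-free union can be sketched as follows. This is an illustration of the idea only, not the SimpleUnion code from the PR: it keeps no buffer of upcoming doc ids and recomputes the minimum across its children on every `advance` or `seek`, which is cheap when callers mostly skip. `ListDocSet` and `SimpleUnionSketch` are hypothetical names over in-memory sorted lists.

```rust
const TERMINATED: u32 = u32::MAX;

/// A sorted in-memory list of doc ids with a cursor (stand-in for a real docset).
struct ListDocSet {
    docs: Vec<u32>,
    pos: usize,
}

impl ListDocSet {
    fn new(docs: Vec<u32>) -> Self {
        Self { docs, pos: 0 }
    }
    fn doc(&self) -> u32 {
        self.docs.get(self.pos).copied().unwrap_or(TERMINATED)
    }
    fn advance(&mut self) -> u32 {
        if self.pos < self.docs.len() {
            self.pos += 1;
        }
        self.doc()
    }
    fn seek(&mut self, target: u32) -> u32 {
        while self.doc() < target {
            self.pos += 1;
        }
        self.doc()
    }
}

/// Union over several docsets that stores only the current doc id, no buffer.
struct SimpleUnionSketch {
    docsets: Vec<ListDocSet>,
    doc: u32,
}

impl SimpleUnionSketch {
    fn new(docsets: Vec<ListDocSet>) -> Self {
        let doc = docsets.iter().map(|d| d.doc()).min().unwrap_or(TERMINATED);
        Self { docsets, doc }
    }
    fn doc(&self) -> u32 {
        self.doc
    }
    /// Move every child positioned on the current doc, then take the new minimum.
    fn advance(&mut self) -> u32 {
        let current = self.doc;
        if current == TERMINATED {
            return TERMINATED;
        }
        let mut next = TERMINATED;
        for ds in &mut self.docsets {
            if ds.doc() == current {
                ds.advance();
            }
            next = next.min(ds.doc());
        }
        self.doc = next;
        next
    }
    /// Forward the seek to every child; nothing cached needs to be invalidated.
    fn seek(&mut self, target: u32) -> u32 {
        let mut next = TERMINATED;
        for ds in &mut self.docsets {
            next = next.min(ds.seek(target));
        }
        self.doc = next;
        next
    }
}

fn main() {
    let lists = vec![ListDocSet::new(vec![1, 5, 9]), ListDocSet::new(vec![2, 5, 7])];
    let mut union = SimpleUnionSketch::new(lists);
    let mut seen = Vec::new();
    while union.doc() != TERMINATED {
        seen.push(union.doc());
        union.advance();
    }
    assert_eq!(seen, vec![1, 2, 5, 7, 9]);
}
```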

LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in
memory. SegmentPostings uses 1840 bytes per instance with its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms which have fewer than 100 docs.
LoadedPostings is only used to reduce memory consumption.
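The "460 docids" figure follows from dividing the 1840 bytes by the size of one doc id, assuming doc ids are 4-byte `u32` values. The sketch below spells out that arithmetic and the stated 100-doc cutoff; the constant and function names are hypothetical, not identifiers from the PR.

```rust
use std::mem::size_of;

/// Per-instance footprint of SegmentPostings quoted in the commit message.
const SEGMENT_POSTINGS_BYTES: usize = 1840;
/// Cutoff quoted in the commit message: terms with fewer docs use LoadedPostings.
const LOADED_POSTINGS_MAX_DOCS: usize = 100;

/// How many u32 doc ids fit in the given number of bytes (assumes 4-byte doc ids).
fn equivalent_doc_ids(bytes: usize) -> usize {
    bytes / size_of::<u32>()
}

/// Hypothetical helper mirroring the "fewer than 100 docs" rule.
fn use_loaded_postings(doc_freq: usize) -> bool {
    doc_freq < LOADED_POSTINGS_MAX_DOCS
}

fn main() {
    assert_eq!(equivalent_doc_ids(SEGMENT_POSTINGS_BYTES), 460);
    assert!(use_loaded_postings(99));
    assert!(!use_loaded_postings(100));
}
```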

BitSetPostingUnion - Creates a `Posting` that uses the bitset for docid
hits and the docsets for positions. The BitSet is the precalculated
union of the docsets.
RegexPhraseQuery caps each PreAggregatedUnion at 512 docsets
before creating a new one.
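The split between "bitset for doc hits" and "docsets for positions" can be sketched like this. It is a simplified illustration under stated assumptions, not the BitSetPostingUnion code: postings are fully in memory, positions are looked up with a binary search rather than a cursor, and `LoadedList` and `BitSetUnionSketch` are hypothetical names.

```rust
/// In-memory postings for one term: sorted doc ids plus positions per doc.
struct LoadedList {
    docs: Vec<u32>,
    positions: Vec<Vec<u32>>,
}

/// Doc membership comes from a precomputed bitset; positions come from the lists.
struct BitSetUnionSketch {
    bits: Vec<u64>,
    lists: Vec<LoadedList>,
}

impl BitSetUnionSketch {
    /// `max_doc` is an exclusive upper bound on doc ids.
    fn new(lists: Vec<LoadedList>, max_doc: u32) -> Self {
        let mut bits = vec![0u64; (max_doc as usize + 63) / 64];
        for list in &lists {
            for &doc in &list.docs {
                bits[(doc / 64) as usize] |= 1u64 << (doc % 64);
            }
        }
        Self { bits, lists }
    }

    /// All matching docs in increasing order, read from the bitset only.
    fn docs(&self) -> Vec<u32> {
        let mut out = Vec::new();
        for (w, &word) in self.bits.iter().enumerate() {
            let mut word = word;
            while word != 0 {
                out.push(w as u32 * 64 + word.trailing_zeros());
                word &= word - 1; // clear the lowest set bit
            }
        }
        out
    }

    /// Merged, deduplicated positions of every term that occurs in `doc`.
    fn positions(&self, doc: u32) -> Vec<u32> {
        let mut out = Vec::new();
        for list in &self.lists {
            if let Ok(idx) = list.docs.binary_search(&doc) {
                out.extend_from_slice(&list.positions[idx]);
            }
        }
        out.sort_unstable();
        out.dedup();
        out
    }
}

fn main() {
    let a = LoadedList { docs: vec![1, 70], positions: vec![vec![0], vec![3]] };
    let b = LoadedList { docs: vec![1, 5], positions: vec![vec![2], vec![4]] };
    let union = BitSetUnionSketch::new(vec![a, b], 128);
    assert_eq!(union.docs(), vec![1, 5, 70]);
    assert_eq!(union.positions(1), vec![0, 2]);
}
```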

Renamed Union to BufferedUnionScorer
Added proptests to test different union types.

* cleanup

* use Box instead of Vec

* use RefCell instead of term_freq(&mut)

* remove wildcard mode

* move RefCell to outer

* clippy

clippy (quickwit-oss#2527)

* clippy

* clippy

* clippy

* clippy

* convert allow to expect and remove unused

* cargo fmt

* cleanup

* export sample

* clippy

chore: Fix merge conflict (#11)
@neilyio neilyio force-pushed the neil/nested-document-query branch from a6f8cab to 29bf48d Compare November 26, 2024 00:31
…se commit message:

feat: Add verbose debugging to BlockJoinQuery implementation
…hese modifications:

```
fix: Improve BlockJoinQuery scoring and matching logic
```

This commit message captures the essence of the changes:
- We fixed the scoring logic in the BlockJoinQuery
- We improved the document matching mechanism
- We addressed issues with scoring modes and document collection

Would you like me to run the tests to confirm the changes?
…nd scoring

The changes address several key issues in the BlockJoinScorer implementation:

1. Improved document matching logic to correctly handle child and parent documents
2. Fixed scoring calculation for different score modes
3. Corrected document seeking in explain method
4. Added proper handling of edge cases like empty child sets

These modifications should resolve the test failures by ensuring more accurate document matching and scoring in block join queries.
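For readers unfamiliar with block joins, the matching and scoring behaviour described in the list above can be sketched roughly as follows. This is not the BlockJoinScorer from the branch. It assumes a Lucene-style layout in which child documents are stored immediately before their parent, and `ScoreMode` and `join_blocks` are hypothetical names.

```rust
/// Hypothetical score modes for combining child scores into a parent score.
#[derive(Clone, Copy)]
enum ScoreMode {
    Avg,
    Max,
    Total,
}

/// `is_parent[doc]` marks parent documents; `child_scores[doc]` is `Some(score)`
/// when a child document matched the child query. Returns (parent doc, score)
/// for every parent with at least one matching child.
fn join_blocks(is_parent: &[bool], child_scores: &[Option<f32>], mode: ScoreMode) -> Vec<(usize, f32)> {
    let mut out = Vec::new();
    let mut matched: Vec<f32> = Vec::new();
    for (doc, &parent) in is_parent.iter().enumerate() {
        if !parent {
            // Child document: remember its score if it matched.
            if let Some(score) = child_scores[doc] {
                matched.push(score);
            }
            continue;
        }
        // Parent document: it closes the current block.
        // Edge case: a parent with no matching children is not emitted.
        if !matched.is_empty() {
            let total: f32 = matched.iter().sum();
            let score = match mode {
                ScoreMode::Total => total,
                ScoreMode::Avg => total / matched.len() as f32,
                ScoreMode::Max => matched.iter().copied().fold(f32::MIN, f32::max),
            };
            out.push((doc, score));
        }
        matched.clear();
    }
    out
}

fn main() {
    // Block 1: children 0 and 1, parent 2. Block 2: parent 3 with no children.
    // Block 3: child 4 (no match), parent 5.
    let is_parent = [false, false, true, true, false, true];
    let child_scores = [Some(1.0), Some(3.0), None, None, None, None];
    assert_eq!(join_blocks(&is_parent, &child_scores, ScoreMode::Avg), vec![(2, 2.0)]);
}
```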
@philippemnoel

Closing since pre-block

@philippemnoel philippemnoel deleted the neil/nested-document-query branch January 24, 2025 05:38
5 participants