-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Indexing Support for Lucene Byte Sized Vector #937
Add Indexing Support for Lucene Byte Sized Vector #937
Conversation
Signed-off-by: Naveen Tatikonda <[email protected]>
Signed-off-by: Naveen Tatikonda <[email protected]>
Codecov Report
@@ Coverage Diff @@
## feature/lucene_byte_vector #937 +/- ##
================================================================
- Coverage 85.32% 85.00% -0.32%
- Complexity 1117 1129 +12
================================================================
Files 152 154 +2
Lines 4519 4616 +97
Branches 405 412 +7
================================================================
+ Hits 3856 3924 +68
- Misses 480 502 +22
- Partials 183 190 +7
|
The test failures are not related to this. The CI is failing on Windows when it is trying to refresh the env after setting mingw path. Created an issue to track it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a unit test for doing exact search with byte vector using K-NN painless. Since we are changing the doc values I want to make sure that all cases are covered.
A high level comment, |
This is where the serialization process is done for docValues. |
Signed-off-by: Naveen Tatikonda <[email protected]>
ba10115
to
7b6a4e1
Compare
src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
return Optional.empty(); | ||
} | ||
validateVectorDimension(dimension, vector.size()); | ||
byte[] array = new byte[vector.size()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use byte[]
from the start instead of converting from ArrayList to [] later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, we don't know what will be the size of the vector provided by user. So, we are using an ArrayList. After parsing everything then we are validating it's length against the dimension value.
|
||
vector.add(value); | ||
validateByteVectorValues(value); | ||
vector.add((byte) value); | ||
token = context.parser().nextToken(); | ||
} | ||
} else if (token == XContentParser.Token.VALUE_NUMBER) { | ||
value = context.parser().floatValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not read intValue instead of floatValue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For bytes user is expected not to provide decimal values. If we try to parse it as intValue then if user provides any decimal values then it will be trimmed and we can't validate that scenario.
src/main/java/org/opensearch/knn/index/mapper/LuceneFieldMapper.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the first glance I think we need BWC for this, say today index has fields and data with float32 then we need to make sure we still can read that data
src/main/java/org/opensearch/knn/index/mapper/LuceneFieldMapper.java
Outdated
Show resolved
Hide resolved
src/test/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperTests.java
Show resolved
Hide resolved
src/test/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperTests.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
// Create an index with byte vector data_type and add a doc with decimal values which should throw exception | ||
public void testInvalidVectorData() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test case should be in unit test for mapper as well, maybe with some more edge cases
96140ee
to
c931a77
Compare
Sure, will raise a separate PR for BWC Tests and any other missing tests |
src/main/java/org/opensearch/knn/index/mapper/LuceneFieldMapper.java
Outdated
Show resolved
Hide resolved
src/test/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperTests.java
Outdated
Show resolved
Hide resolved
56f3930
to
ffd2ffa
Compare
src/main/java/org/opensearch/knn/index/KNNVectorFieldMapperUtil.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/mapper/LuceneFieldMapper.java
Outdated
Show resolved
Hide resolved
ffd2ffa
to
d30019f
Compare
Signed-off-by: Naveen Tatikonda <[email protected]>
d30019f
to
5c73aa2
Compare
1386519
into
opensearch-project:feature/lucene_byte_vector
…#937) * Add Indexing Support for Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add tests for Indexing Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]>
* Add Indexing Support for Lucene Byte Sized Vector (#937) * Add Indexing Support for Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add tests for Indexing Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add Querying Support to Lucene Byte Sized Vector (#956) * Add Querying Support to Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add DocValues Support for Lucene Byte Sized Vector (#953) Signed-off-by: Naveen Tatikonda <[email protected]> * Update Release Notes Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]>
* Add Indexing Support for Lucene Byte Sized Vector (#937) * Add Indexing Support for Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add tests for Indexing Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add Querying Support to Lucene Byte Sized Vector (#956) * Add Querying Support to Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add DocValues Support for Lucene Byte Sized Vector (#953) Signed-off-by: Naveen Tatikonda <[email protected]> * Update Release Notes Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> (cherry picked from commit bf04854)
* Add Indexing Support for Lucene Byte Sized Vector (#937) * Add Indexing Support for Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add tests for Indexing Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add Querying Support to Lucene Byte Sized Vector (#956) * Add Querying Support to Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add DocValues Support for Lucene Byte Sized Vector (#953) Signed-off-by: Naveen Tatikonda <[email protected]> * Update Release Notes Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> (cherry picked from commit bf04854)
* Add Indexing Support for Lucene Byte Sized Vector (#937) * Add Indexing Support for Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add tests for Indexing Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add Querying Support to Lucene Byte Sized Vector (#956) * Add Querying Support to Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add DocValues Support for Lucene Byte Sized Vector (#953) Signed-off-by: Naveen Tatikonda <[email protected]> * Update Release Notes Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> (cherry picked from commit bf04854) Signed-off-by: Naveen Tatikonda <[email protected]>
* Add Indexing Support for Lucene Byte Sized Vector (#937) * Add Indexing Support for Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add tests for Indexing Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add Querying Support to Lucene Byte Sized Vector (#956) * Add Querying Support to Lucene Byte Sized Vector Signed-off-by: Naveen Tatikonda <[email protected]> * Add CHANGELOG Signed-off-by: Naveen Tatikonda <[email protected]> * Address Review Comments Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> * Add DocValues Support for Lucene Byte Sized Vector (#953) Signed-off-by: Naveen Tatikonda <[email protected]> * Update Release Notes Signed-off-by: Naveen Tatikonda <[email protected]> --------- Signed-off-by: Naveen Tatikonda <[email protected]> (cherry picked from commit bf04854) Signed-off-by: Naveen Tatikonda <[email protected]>
Description
This PR contains changes which adds indexing support to lucene byte sized vector and corresponding tests to validate it. It helps users to index vectors as byte sized vectors(which theoretically saves 75% of memory when compared to float vectors) by setting the optional
data_type
field asbyte
while creating the index. As we are not adding support for Quantization Techniques, users are expected to index vectors that are in the byte range [-128 to 127] without any decimal values. Also, right now this feature is only supported for lucene engine.Issues Resolved
#812
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.