Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use MetadataFilter in LanceDB #461

Merged
merged 26 commits into from
Aug 8, 2024
Merged

Use MetadataFilter in LanceDB #461

merged 26 commits into from
Aug 8, 2024

Conversation

blakerosenthal
Copy link
Contributor

@blakerosenthal blakerosenthal commented Jul 25, 2024

Add metadata filter to lancedb storage

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Base automatically changed from metadata-translate to corpus-dev July 26, 2024 09:58
An error occurred while trying to automatically change base from metadata-translate to corpus-dev July 26, 2024 09:58
Copy link
Member

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blakerosenthal Tests and lint / format CI is failing. I'll review the test when everything is green otherwise.

ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
ragna/source_storages/_lancedb.py Outdated Show resolved Hide resolved
test_metadata.ipynb Outdated Show resolved Hide resolved
@blakerosenthal
Copy link
Contributor Author

@pmeier Thanks for all the comments. Ran into a couple quirks with LanceDB's add_columns method, but otherwise all your suggestions worked out. Do you know what's happening with mypy in this commit? It's complaining about missing types in LanceDB but I thought that the overrides in pyproject.toml would take care of that.

@pmeier pmeier changed the title Metadata filter Use MetadataFilter in LanceDB Aug 8, 2024
@pmeier
Copy link
Member

pmeier commented Aug 8, 2024

Do you know what's happening with mypy

Yes. If you have code like this

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from ragna import Rag

def foo() -> Rag:
    from ragna import Rag

    return Rag()

it looks like this should work. Running mypy on this succeeds. But running this script fails with

Traceback (most recent call last):
  File "/home/philip/git/ragna/main.py", line 7, in <module>
    def foo() -> Rag:
NameError: name 'Rag' is not defined

The reason is that Python evaluates the annotations at definition. For example

from ragna import Rag

def foo() -> Rag:
    return Rag()

print(foo.__annotations__)
{'return': <class 'ragna.core.Rag'>}

And since at definition in the top script, the Rag object is not available, you face the error.

That being said, there is an easy solution for it: turn the annotation into a string.

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from ragna import Rag

def foo() -> "Rag":
    from ragna import Rag

    return Rag()

print(foo.__annotations__)
{'return': 'Rag'}

Now mypy and Python are happy. Internally mypy resolves the string back to its actual type with typing.get_type_hints() and so does Ragna as well:

method = getattr(cls, method_name)
params = iter(inspect.signature(method).parameters.values())
annotations = get_type_hints(method)

Finally, because it is a little tedious to always stringify the annotations, there is a shortcut. If you from __future__ import annotations at the top of your module, the stringification is applied automatically in the background:

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from ragna import Rag

def foo() -> Rag:
    from ragna import Rag

    return Rag()

print(foo.__annotations__)

And this is what I did in 4a3b32e.

@pmeier pmeier marked this pull request as ready for review August 8, 2024 12:34
@pmeier
Copy link
Member

pmeier commented Aug 8, 2024

@blakerosenthal I've added the following changes in be943a7:

  • Remove the need to keep a table schema as instance variable. This was only needed because we unconditionally used db.create_table() and this only succeeded if the schema of the existing table matches the one passed exactly. Instead we now check whether the table already exists and if yes just open the table. Thus, table.schema is now the only source of truth.
  • Although it is not possible to add a column of a given type, the string we pass to add_column can contain SQL syntax. Thus, instead of using "''" as value, we use "CAST(NULL {type})" where corresponds to metadata. Thus, we now have the right column type, which is important for later filtering, e.g. > 5 makes no sense on a string.
  • Driveby: instead of calling table.add in a nested for loop, we know collect all new rows first and insert them in one call.
  • Re Use MetadataFilter in LanceDB #461 (comment): my guess into why Chroma and LanceDB had different results is because you defaulted missing metadata in LanceDB to an empty string, while Chroma used None. And None or better NULL is handled differently in databases. For example, checking for a column not equal to some value, will not include the NULL values. To translate in Python, item != value in SQL means roughly item != value and item is not None. With the fix I detailed above, we can now properly NULL missing metadata and thus Chroma and LanceDB behave the same. I also fixed the tests to account for that.

I'm merging this for velocity, but please do take a look if anything is unclear or a follow-up PR is needed.

@pmeier pmeier merged commit 13b1c0f into corpus-dev Aug 8, 2024
20 of 21 checks passed
@pmeier pmeier deleted the metadata-filter branch August 8, 2024 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants