Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

milvus: add array data type for collection create #23219

Merged
merged 41 commits into from
Aug 28, 2024

Conversation

rgupta2508
Copy link
Contributor

@rgupta2508 rgupta2508 commented Jun 20, 2024

Add array data type for milvus vector store collection create

Thank you for contributing to LangChain!

  • PR title: "package: description"

    • Where "package" is whichever of langchain, community, core, experimental, etc. is being modified. Use "docs: ..." for purely docs changes, "templates: ..." for template changes, "infra: ..." for CI changes.
    • Example: "community: add foobar LLM"
  • PR message: Delete this entire checklist and replace with

    • Description: a description of the change
    • Issue: the issue # it fixes, if applicable
    • Dependencies: any dependencies required for this change
    • Twitter handle: if your PR gets announced, and you'd like a mention, we'll gladly shout you out!
  • Add tests and docs: If you're adding a new integration, please include

    1. a test for the integration, preferably unit tests that do not rely on network access,
    2. an example notebook showing its use. It lives in docs/docs/integrations directory.
  • Lint and test: Run make format, make lint and make test from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:

  • Make sure optional dependencies are imported within a function.
  • Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
  • Most PRs should not touch more than one package.
  • Changes should be backwards compatible.
  • If you are adding something to community, do not re-import it in langchain.

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

@efriis efriis added the partner label Jun 20, 2024
@efriis efriis self-assigned this Jun 20, 2024
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 20, 2024
Copy link

vercel bot commented Jun 20, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Aug 28, 2024 4:53pm

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🔌: milvus Primarily related to Milvus vector store integration 🤖:improvement Medium size change to existing code to handle new use-cases labels Jun 20, 2024
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 20, 2024
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Jun 20, 2024
@rgupta2508
Copy link
Contributor Author

Fixing build issue also, numpy-1.24.* not supported in python 3.12 (https://pypi.org/project/numpy/1.24.0/) so build is failing.
finally green build after restricting version to 3.11

Copy link
Member

@efriis efriis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will send to milvus team but not sure we want this. some issues:

we should try to maintain python 3.12 support if possible. I believe you can do this by adding a numpy dep of

numpy = [
    {version = "^1.24.0", python = "<3.12"},
    {version = "^1.26.0", python = ">=3.12"}
]

and on the array type - it looks like it hard-codes it as a list of varchar, which might not always be the case.

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jul 2, 2024
@RafaelXokito
Copy link
Contributor

RafaelXokito commented Jul 17, 2024

Hi @rgupta2508 did you considered using the CollectionSchema (native milvus schema structure) instead of using a Dict?
For instance you could receive a CollectionSchema which would be easier to maintain for langchain and milvus.

Give me your feedback on this.

@rgupta2508 rgupta2508 marked this pull request as ready for review August 9, 2024 05:28
@rgupta2508
Copy link
Contributor Author

Resolved conflict. can you @zc277584121 please review this . if we can do anything on this

@codingjaguar
Copy link

@zc277584121 ping

fields.append(
FieldSchema(key, DataType.VARCHAR, max_length=65_535)
)
elif dtype == DataType.ARRAY:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to still rely on the solution of the pymilvus problem. It seems that pymilvus needs to support return DataType.ARRAY and other informations, before this line of code can take effect milvus-io/pymilvus#2144

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, could you please add the corresponding unit test as the guarantee to the quality of other possible future PRs

@zc277584121
Copy link
Contributor

zc277584121 commented Aug 26, 2024

@rgupta2508 Thank you for you contribution. However, this feature still depends on the resolution of the pymilvus issue, which may not be a high priority for them. @efriis If you have to close this PR because of refactor of pydanit_v2, just close it or draft it.
In the future we will refactor langchain_milvus to use milvus_client SDK instead of ORM SDK, which will also solve this problem.
BTW, @rgupta2508 , we are curious about the scenario of about using ARRAY dtype with LangChain. Can you explain the specific business scenario of this background , or any other useful information will be appreciated.

@efriis
Copy link
Member

efriis commented Aug 26, 2024

closing! Let's move discussion of the above to an issue to get a plan in place before further coding.

@zc277584121 if pymilvus isn't managed by you could you tag in one of your teammates who does?

@efriis efriis closed this Aug 26, 2024
@rgupta2508
Copy link
Contributor Author

@rgupta2508 Thank you for you contribution. However, this feature still depends on the resolution of the pymilvus issue, which may not be a high priority for them. @efriis If you have to close this PR because of refactor of pydanit_v2, just close it or draft it. In the future we will refactor langchain_milvus to use milvus_client SDK instead of ORM SDK, which will also solve this problem. BTW, @rgupta2508 , we are curious about the scenario of about using ARRAY dtype with LangChain. Can you explain the specific business scenario of this background , or any other useful information will be appreciated.

Hi @zc277584121 , there are lot if usecase where we are using array datatype in milvus.

We as, WALMART team trying to embed data of some question answer which is tagged with multiple entities , just to filter using those entity we need array data type. there are some other use cases where we need filter of data.

for example stackoverflow question answer can be tagged with "java", "microservices" ,"api", "rest" or some other tag. just to filter embedding data using one tag, this array data type is very useful .

For now we have developed custom implementation for array data type using langchain. but cant able to upgrade langchian-milvus untill this feature is available . we have some more idea and use case in langchain_milvus but cant add untill this is merged .

@zc277584121
Copy link
Contributor

@rgupta2508 Thanks for your feedback
@efriis sorry, could you please reopen this PR again? Considering the usecase scenario , we intend to re-evaluate whether this PR can be merged as a quick workaround. As a high priority, we will quickly evaluate and give a conclusion within a few days.

@zc277584121
Copy link
Contributor

@rgupta2508 Could you please give me the permission of your repo, so that I can append a commit with some code to your branch array_data_type. After that, this PR can be updated

@rgupta2508
Copy link
Contributor Author

@rgupta2508 Could you please give me the permission of your repo, so that I can append a commit with some code to your branch array_data_type. After that, this PR can be updated

Done

@zc277584121
Copy link
Contributor

@rgupta2508 thanks, I have appended a commit on the top of this branch, including some code refinement and unit test.
rgupta2508@803d2f4
you can also review this code and judge whether it can satisfy your demand

@efriis efriis reopened this Aug 27, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Aug 28, 2024
@zc277584121
Copy link
Contributor

@efriis LGTM, but it is currently just a workaround to support the scenario. Now it supports array type if the user customizes the information passed into the array field.
There is a comment left in the code, suggesting that it may need further refinement in the future, which relies on the improvement of the array_data_type function in pymilvus. (related issue milvus-io/pymilvus#2165, CC @czs007 @longjiquan @xiaocai2333)

@rgupta2508
Copy link
Contributor Author

@rgupta2508 thanks, I have appended a commit on the top of this branch, including some code refinement and unit test. rgupta2508@803d2f4 you can also review this code and judge whether it can satisfy your demand

@zc277584121 Thanks for consideration , Changes are looks good to me.

@efriis efriis changed the title add array data type for milvus vector store collection create milvus: add array data type for collection create Aug 28, 2024
@efriis efriis enabled auto-merge (squash) August 28, 2024 16:53
@efriis efriis merged commit aff50a1 into langchain-ai:master Aug 28, 2024
16 checks passed
@zc277584121
Copy link
Contributor

Found and fix a small issue: #25836

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:improvement Medium size change to existing code to handle new use-cases 🔌: milvus Primarily related to Milvus vector store integration partner size:L This PR changes 100-499 lines, ignoring generated files. Ɑ: vector store Related to vector store module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants