feat(python): lance write huggingface dataset directly #1882
ACTION NEEDED: Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
python/python/lance/dataset.py
Outdated
    try:
        # Huggingface datasets
        import datasets

        if isinstance(data_obj, datasets.Dataset):
            if schema is None:
                schema = data_obj.features.arrow_schema
            data_obj = data_obj.data.to_batches()
    except ImportError:
        pass
Please make this import lazy. Right now it will import huggingface anytime you call this function. It should only import if the object is from the huggingface datasets module.
For how we've done this for Polars and other modules, see: https://github.com/lancedb/lance/blob/main/python/python/lance/dependencies.py
Suggested change:
-    try:
-        # Huggingface datasets
-        import datasets
-        if isinstance(data_obj, datasets.Dataset):
-            if schema is None:
-                schema = data_obj.features.arrow_schema
-            data_obj = data_obj.data.to_batches()
-    except ImportError:
-        pass
+    if _check_for_huggingface(data_obj):
+        # Huggingface datasets
+        import datasets
+        if isinstance(data_obj, datasets.Dataset):
+            if schema is None:
+                schema = data_obj.features.arrow_schema
+            data_obj = data_obj.data.to_batches()
Done
python/python/lance/dataset.py
Outdated
@@ -1988,6 +1989,17 @@ def write_dataset(
        a custom class that defines hooks to be called when each fragment is
        starting to write and finishing writing.
    """
    try:
Is there a generic safe_import utility function in Lance?
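For reference, a generic helper of that kind could look roughly like this. It is a hypothetical sketch: a function named safe_import is not confirmed to exist in Lance, and unlike the lazy check discussed in this review it imports the module eagerly when called.

```python
import importlib
from types import ModuleType
from typing import Optional


def safe_import(name: str) -> Optional[ModuleType]:
    """Import a module by name, returning None if it is not installed.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package ends up here as well.
        return None
```

A caller would then branch on the result, for example by checking `safe_import("datasets") is not None` before touching any Hugging Face types.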
python/python/lance/dataset.py
Outdated
                schema = data_obj.features.arrow_schema
            data_obj = data_obj.data.to_batches()
    except ImportError:
        pass
If it is an HF dataset but datasets is not installed, the user will likely get a very cryptic error message, no?
Changed to use lazy loading.
        if isinstance(data_obj, datasets.Dataset):
            if schema is None:
                schema = data_obj.features.arrow_schema
For the datasets that have embeddings, are they usually list or fsl (fixed-size list)? Do we need to check, cast, or do anything like that?
That seems too smart to do at the Lance level, though.
docs/integrations/huggingface.rst
Outdated
The Hugging Face Hub has a great amount of pre-trained models and datasets available.

It is easy to convert a Huggingface dataset to Lance dataset:
Lance <3 HuggingFace
The HuggingFace Hub has become the go-to place for ML practitioners to find pre-trained models and useful datasets. HuggingFace datasets can be written directly into Lance format by using the lance.write_dataset method. You can write the entire dataset or a particular split. For example:
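A minimal sketch of writing a single split, assuming a Lance version that includes this change. The wrapper function write_hf_split_to_lance and its parameters are hypothetical names used only for illustration; datasets.load_dataset and lance.write_dataset are the only library calls involved.

```python
def write_hf_split_to_lance(name: str, split: str, uri: str):
    """Load one split of a Hugging Face dataset and write it to Lance.

    Hypothetical wrapper for illustration. Imports are local so the
    sketch can be loaded even when the packages are not installed.
    """
    import datasets
    import lance

    # With an explicit split, load_dataset returns a datasets.Dataset.
    hf_ds = datasets.load_dataset(name, split=split)
    # Assumption: write_dataset accepts the Hugging Face dataset directly,
    # which is the behavior this PR adds.
    return lance.write_dataset(hf_ds, uri)
```

The dataset name, split, and output URI are up to the caller; whether a full DatasetDict (all splits at once) is accepted is not stated in this thread, so the sketch sticks to one split.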
Be able to directly write a huggingface dataset