feat(python): lance write huggingface dataset directly #1882

Merged: 7 commits merged on Jan 30, 2024

Conversation

@eddyxu (Contributor) commented Jan 30, 2024

Make it possible to write a Hugging Face dataset directly with lance.write_dataset.

@eddyxu eddyxu self-assigned this Jan 30, 2024

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@eddyxu changed the title from "feat: support huggingface dataset" to "feat(python): lance write huggingface dataset directly" Jan 30, 2024
Comment on lines 1992 to 2002
    try:
        # Huggingface datasets
        import datasets

        if isinstance(data_obj, datasets.Dataset):
            if schema is None:
                schema = data_obj.features.arrow_schema
            data_obj = data_obj.data.to_batches()
    except ImportError:
        pass

Contributor

Please make this import lazy. Right now it will import huggingface anytime you call this function. It should only import if the object is from the huggingface datasets module.

For how we've done this for Polars and other modules, see: https://github.com/lancedb/lance/blob/main/python/python/lance/dependencies.py

Suggested change
-    try:
-        # Huggingface datasets
-        import datasets
-        if isinstance(data_obj, datasets.Dataset):
-            if schema is None:
-                schema = data_obj.features.arrow_schema
-            data_obj = data_obj.data.to_batches()
-    except ImportError:
-        pass
+    if _check_for_huggingface(data_obj):
+        # Huggingface datasets
+        import datasets
+        if isinstance(data_obj, datasets.Dataset):
+            if schema is None:
+                schema = data_obj.features.arrow_schema
+            data_obj = data_obj.data.to_batches()
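
For readers unfamiliar with the pattern, here is a minimal sketch of the kind of pre-check the suggestion relies on. The body of _check_for_huggingface is not shown in this thread, so this is an assumption about the general idea, not Lance's actual code: look at sys.modules and at the module of the object's class, and never trigger the import yourself. The names might_come_from, fake_datasets, and FakeDataset are hypothetical and exist only for illustration.

```python
import sys
import types
from typing import Any


def might_come_from(obj: Any, module_name: str) -> bool:
    """Cheap pre-check: could `obj` be an instance of a class from `module_name`?

    This never imports `module_name`. If the module is not in sys.modules,
    it has normally not been imported, so no object of its classes exists.
    """
    if module_name not in sys.modules:
        return False
    prefix = module_name + "."
    # Walk the MRO so subclasses of the library's classes are also detected.
    return any(
        cls.__module__ == module_name or cls.__module__.startswith(prefix)
        for cls in type(obj).__mro__
    )


# Demo with a stand-in module so the sketch runs without Hugging Face installed.
before_import = might_come_from(object(), "fake_datasets")

fake_module = types.ModuleType("fake_datasets")
FakeDataset = type("FakeDataset", (), {"__module__": "fake_datasets"})
fake_module.FakeDataset = FakeDataset
sys.modules["fake_datasets"] = fake_module

is_fake_dataset = might_come_from(FakeDataset(), "fake_datasets")
is_plain_list = might_come_from([1, 2, 3], "fake_datasets")
```

With a check like this in front, the real import only runs when the caller has already imported the library, so users who never touch Hugging Face pay no import cost.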

Contributor Author

Done

@@ -1988,6 +1989,17 @@ def write_dataset(
        a custom class that defines hooks to be called when each fragment is
        starting to write and finishing writing.
    """
    try:
Contributor

Is there a generic safe_import utility function in Lance?

                schema = data_obj.features.arrow_schema
            data_obj = data_obj.data.to_batches()
    except ImportError:
        pass
Contributor

If it is a HF dataset but datasets is not installed, the user will likely get a very cryptic error message, no?

Contributor Author

changed to use lazy loading
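
On the cryptic-error concern raised above, one common approach is to wrap the optional import and re-raise with a message that names the missing package. The sketch below is only an illustration of that idea, not what this PR does. The helper name require_optional and the feature labels are hypothetical, and the install hint assumes the pip package name matches the module name, which is not always true.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Assumption: the pip package name equals the module name.
        raise ImportError(
            f"{feature} requires the optional package '{module_name}', "
            f"which is not installed. Try: pip install {module_name}"
        ) from exc


# A module from the standard library imports normally.
json_module = require_optional("json", "JSON export")

# A missing module produces a readable error instead of a bare traceback.
try:
    require_optional("surely_not_installed_pkg_xyz", "Hugging Face support")
    message = ""
except ImportError as err:
    message = str(err)
```

Chaining with "from exc" keeps the original traceback available for debugging while the top-level message tells the user what to install.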


        if isinstance(data_obj, datasets.Dataset):
            if schema is None:
                schema = data_obj.features.arrow_schema
Contributor

For datasets that have embeddings, are the embedding columns usually list or fixed-size list (fsl)? Do we need to check or cast them?

Contributor Author

That would be too smart to do at the Lance level, though.

Comment on lines 4 to 6
The Hugging Face Hub has a great amount of pre-trained models and datasets available.

It is easy to convert a Huggingface dataset to Lance dataset:
Contributor

Lance <3 HuggingFace

The HuggingFace Hub has become the go-to place for ML practitioners to find pre-trained models and useful datasets. HuggingFace datasets can be written directly into Lance format by using the lance.write_dataset method. You can write the entire dataset or a particular split. For example:

@eddyxu eddyxu merged commit b3db3cc into main Jan 30, 2024
8 of 9 checks passed
@eddyxu eddyxu deleted the lei/write_hg branch January 30, 2024 07:41
3 participants