feat(python): lance write huggingface dataset directly #1882

Merged (7 commits) on Jan 30, 2024
16 changes: 16 additions & 0 deletions docs/integrations/huggingface.rst
@@ -0,0 +1,16 @@
HuggingFace
-----------

The Hugging Face Hub hosts a vast collection of pre-trained models and datasets.

Converting a Hugging Face dataset to a Lance dataset is straightforward:
Contributor:
Lance <3 HuggingFace

The HuggingFace Hub has become the go-to place for ML practitioners to find pre-trained models and useful datasets. HuggingFace datasets can be written directly into Lance format by using the lance.write_dataset method. You can write the entire dataset or a particular split. For example:


.. code-block:: python

    # Huggingface datasets
    import datasets
    import lance

    # Write the first 10 rows of the "train" split to a Lance dataset
    lance.write_dataset(
        datasets.load_dataset("poloclub/diffusiondb", split="train[:10]"),
        "diffusiondb_train.lance",
    )
1 change: 1 addition & 0 deletions docs/integrations/integrations.rst
@@ -3,4 +3,5 @@ Integrations

.. toctree::

Huggingface <./huggingface>
Tensorflow <./tensorflow>
17 changes: 6 additions & 11 deletions python/pyproject.toml
@@ -2,9 +2,7 @@
name = "pylance"
dependencies = ["pyarrow>=12", "numpy>=1.22"]
description = "python wrapper for Lance columnar format"
authors = [
{ name = "Lance Devs", email = "[email protected]" },
]
authors = [{ name = "Lance Devs", email = "[email protected]" }]
license = { file = "LICENSE" }
repository = "https://github.com/eto-ai/lance"
readme = "README.md"
@@ -48,20 +46,17 @@ build-backend = "maturin"

[project.optional-dependencies]
tests = [
"pandas",
"pytest",
"datasets",
"duckdb",
"ml_dtypes",
"pandas",
"polars[pyarrow,pandas]",
"pytest",
"tensorflow",
"tqdm",
]
benchmarks = [
"pytest-benchmark",
]
torch = [
"torch",
]
benchmarks = ["pytest-benchmark"]
torch = ["torch"]

[tool.ruff]
select = ["F", "E", "W", "I", "G", "TCH", "PERF", "CPY001"]
12 changes: 12 additions & 0 deletions python/python/lance/dataset.py
@@ -1960,6 +1960,7 @@ def write_dataset(
data_obj: Reader-like
The data to be written. Acceptable types are:
- Pandas DataFrame, Pyarrow Table, Dataset, Scanner, or RecordBatchReader
- Huggingface dataset
uri: str or Path
Where to write the dataset to (directory)
schema: Schema, optional
@@ -1988,6 +1989,17 @@
a custom class that defines hooks to be called when each fragment is
starting to write and finishing writing.
"""
try:
    # Huggingface datasets
    import datasets

    if isinstance(data_obj, datasets.Dataset):
        if schema is None:
            schema = data_obj.features.arrow_schema
        data_obj = data_obj.data.to_batches()
except ImportError:
    pass

Contributor:

is there a generic safe_import utility function in lance?

Contributor:

for the datasets that have embeddings, are they usually list or fsl (FixedSizeList)? do we need to check/cast or anything like that?

Contributor Author:

It is too smart at the lance level, though.
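The try/except pattern above can be illustrated in isolation. The sketch below (stdlib only; `maybe_convert` and `missing_optional_pkg` are hypothetical names, with the nonexistent module standing in for an uninstalled `datasets`) shows how the `except ImportError` branch silently leaves the input untouched:

```python
def maybe_convert(data_obj):
    """Mimics the dispatch above: try to import an optional dependency,
    convert recognized objects, and silently fall through otherwise."""
    try:
        # Stand-in for `import datasets`; this module does not exist,
        # so the ImportError branch below is always taken here.
        import missing_optional_pkg

        if isinstance(data_obj, missing_optional_pkg.Dataset):
            data_obj = list(data_obj)
    except ImportError:
        # Optional dependency is absent: data_obj passes through
        # unchanged, and any failure surfaces much later, far from here.
        pass
    return data_obj

print(maybe_convert([1, 2, 3]))  # the input comes back unchanged
```

This is exactly the failure mode discussed in the review: when `datasets` is missing, an unconverted HuggingFace object flows onward and the eventual error says nothing about the missing package.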
Contributor:

If it is a HF dataset but datasets is not installed, then the user will likely get a very cryptic error message, no?

Contributor Author:

Changed to use lazy loading.


Contributor:

Please make this import lazy. Right now it will import huggingface any time you call this function. It should only import if the object is from the huggingface datasets module.

For how we've done this for Polars and other modules, see: https://github.com/lancedb/lance/blob/main/python/python/lance/dependencies.py

Suggested change:

-try:
-    # Huggingface datasets
-    import datasets
-    if isinstance(data_obj, datasets.Dataset):
-        if schema is None:
-            schema = data_obj.features.arrow_schema
-        data_obj = data_obj.data.to_batches()
-except ImportError:
-    pass
+if _check_for_huggingface(data_obj):
+    # Huggingface datasets
+    import datasets
+    if isinstance(data_obj, datasets.Dataset):
+        if schema is None:
+            schema = data_obj.features.arrow_schema
+        data_obj = data_obj.data.to_batches()

Contributor Author:

Done

reader = _coerce_reader(data_obj, schema)
_validate_schema(reader.schema)
# TODO add support for passing in LanceDataset and LanceScanner here