feat(python): lance write huggingface dataset directly #1882
Changes from 3 commits
@@ -0,0 +1,16 @@
HuggingFace
-----------

The Hugging Face Hub has a great number of pre-trained models and datasets available.

It is easy to convert a Huggingface dataset to a Lance dataset:

.. code-block:: python

    # Huggingface datasets
    import datasets
    import lance

    lance.write_dataset(datasets.load_dataset(
        "poloclub/diffusiondb", split="train[:10]"
    ), "diffusiondb_train.lance")
@@ -3,4 +3,5 @@ Integrations

.. toctree::

   Huggingface <./huggingface>
   Tensorflow <./tensorflow>
@@ -2,9 +2,7 @@
name = "pylance"
dependencies = ["pyarrow>=12", "numpy>=1.22"]
description = "python wrapper for Lance columnar format"
authors = [
    { name = "Lance Devs", email = "[email protected]" },
]
authors = [{ name = "Lance Devs", email = "[email protected]" }]
license = { file = "LICENSE" }
repository = "https://github.com/eto-ai/lance"
readme = "README.md"
@@ -48,20 +46,17 @@ build-backend = "maturin"

[project.optional-dependencies]
tests = [
    "pandas",
    "pytest",
    "datasets",
    "duckdb",
    "ml_dtypes",
    "pandas",
    "polars[pyarrow,pandas]",
    "pytest",
    "tensorflow",
    "tqdm",
]
benchmarks = [
    "pytest-benchmark",
]
torch = [
    "torch",
]
benchmarks = ["pytest-benchmark"]
torch = ["torch"]

[tool.ruff]
select = ["F", "E", "W", "I", "G", "TCH", "PERF", "CPY001"]
@@ -1960,6 +1960,7 @@ def write_dataset(
    data_obj: Reader-like
        The data to be written. Acceptable types are:
        - Pandas DataFrame, Pyarrow Table, Dataset, Scanner, or RecordBatchReader
        - Huggingface dataset
    uri: str or Path
        Where to write the dataset to (directory)
    schema: Schema, optional

@@ -1988,6 +1989,17 @@ def write_dataset(
        a custom class that defines hooks to be called when each fragment is
        starting to write and finishing writing.
    """
    try:

Comment: is there a generic …

        # Huggingface datasets
        import datasets

        if isinstance(data_obj, datasets.Dataset):
            if schema is None:
                schema = data_obj.features.arrow_schema

Comment: for the datasets that have embeddings, are they usually list or fsl (FixedSizeList)? do we need to check/cast or anything like that?
Reply: It is too smart at the lance level tho.

            data_obj = data_obj.data.to_batches()
    except ImportError:
        pass

Comment: if it is a HF dataset but …
Reply: changed to use lazy loading

Comment: Please make this import lazy. Right now it will import huggingface anytime you call this function. It should only import if the object is from the huggingface datasets module. For how we've done this for Polars and other modules, see: https://github.com/lancedb/lance/blob/main/python/python/lance/dependencies.py
Suggested change
Reply: Done
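
The suggested change itself is not captured above. As a rough sketch of the kind of lazy check the comment asks for, using a hypothetical helper name and module check that are illustrative assumptions rather than lance's actual dependencies.py API:

# Hypothetical sketch only: inspect the object's defining module before
# importing `datasets`, so the import stays lazy and only fires for
# Hugging Face objects.
def _coerce_hugging_face(data_obj, schema):
    module = type(data_obj).__module__
    if module != "datasets" and not module.startswith("datasets."):
        return data_obj, schema  # not a Hugging Face dataset; leave untouched

    import datasets  # safe: the object's type proves the package is installed

    if isinstance(data_obj, datasets.Dataset):
        if schema is None:
            schema = data_obj.features.arrow_schema  # reuse the HF Arrow schema
        data_obj = data_obj.data.to_batches()  # underlying Arrow record batches
    return data_obj, schema
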
    reader = _coerce_reader(data_obj, schema)
    _validate_schema(reader.schema)
    # TODO add support for passing in LanceDataset and LanceScanner here

Comment: Lance <3 HuggingFace
The HuggingFace Hub has become the go-to place for ML practitioners to find pre-trained models and useful datasets. HuggingFace datasets can be written directly into Lance format by using the lance.write_dataset method. You can write the entire dataset or a particular split. For example:
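
A sketch in the spirit of the docs snippet added in this PR (illustrative, not the reviewer's original example):

# Illustrative sketch: write a slice of a Hugging Face split directly to Lance.
import datasets
import lance

hf_slice = datasets.load_dataset("poloclub/diffusiondb", split="train[:10]")
lance.write_dataset(hf_slice, "diffusiondb_train.lance")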