Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added connector folder and HF file #313

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

abhisomala
Copy link

@abhisomala abhisomala commented Jul 8, 2024

HF file takes in any Huggingface identifier and then returns an AtlasDataset

Updates:

  • Doesn't parse through error for config
  • Updated example usage
  • Updated the way it handles lists

Testing:

  • 20 HF datasets with a couple of larger ones, and most of them are smaller ones

Limitations:

  • Audio files
  • Image files

🚀 This description was created by Ellipsis for commit 9ae14f4

Summary:

Introduced a new connector for Hugging Face datasets, processed data using Apache Arrow, and provided an example usage script.

Key points:

  • Introduced a new connector for Hugging Face datasets.
  • Added connectors/huggingface_connector.py.
  • Implemented connectors/huggingface_connector.get_hfdata to load datasets and handle configuration issues.
  • Added unique IDs to each dataset entry using a sequential counter.
  • Implemented connectors/huggingface_connector.hf_atlasdataset to create an AtlasDataset.
  • Included data processing functions connectors/huggingface_connector.convert_to_string and connectors/huggingface_connector.process_table.
  • Used Apache Arrow for data processing.
  • Included a command-line interface in connectors/huggingface_connector.py.
  • Updated connectors/__init__.py and examples/HF_example_usage.py.
  • add_data accepts arrow tables directly.
  • Made an interactive script using argparse in example file.
  • Tested with ~80 different datasets, including small and large datasets.
  • Works for text, lists, booleans, numbers, special symbols, file paths, columns with special characters.
  • Limitations: Does not support images or audio.

Generated with ❤️ by ellipsis.dev

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to 68df99a in 53 seconds

More details
  • Looked at 53 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_g9aSZ6jnsjudJEhx


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 152a99e in 36 seconds

More details
  • Looked at 29 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_SGhnothmBoTYYbbD


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

connectors/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 2f783ca in 1 minute and 33 seconds

More details
  • Looked at 54 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_qrqWYnwFpCtSGGBI


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.



# Function to fetch data from a Hugging Face dataset
def fetch_data_from_huggingface(dataset_identifier, dataset_split=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding error handling for network-related exceptions such as requests.exceptions.RequestException to ensure robustness when fetching data from external APIs.

raise e # Re-raise other ValueErrors

# Load function to be used as a connector
def load(dataset_identifier, dataset_split=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider also stripping the dataset_split argument to handle potential spaces:

Suggested change
def load(dataset_identifier, dataset_split=None):
data = fetch_data_from_huggingface(dataset_identifier.strip(), dataset_split.strip() if dataset_split else None)

#Takes last two parts of url to get allenai/quartz
atlas_dataset = huggingface_connecter.load('allenai/quartz')

atlas_dataset.create_index(topic_model=True, embedding_model='NomicEmbed')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add error handling around the create_index method to manage potential failures gracefully:

Copy link
Contributor

@RLesser RLesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments, maybe some things to change

connectors/huggingface_connecter.py Outdated Show resolved Hide resolved
entry[key] = str(value)


dataset.add_data(data=data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we making an index here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite; that part mainly just converts anything that isn't a string, like booleans or lists, into strings because it will error out if I don't. I didn't really like how it is just a bunch of conditional statements but I couldn't find a better way to resolve it.



# Convert all booleans and lists to strings
for entry in data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two potential issues here:

  1. This seems like it would be extremely slow (and possibly crash) for large huggingface datasets.
  2. I'm a bit worried about assuming this is how people would want to handle these fields, but i guess they can edit it themselves if they want it done differently...


# Processes dataset entries
data = []
for split in dataset.keys():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here about large datasets - are we sure this will not break?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I solve this with batch processing and by passing streaming=True through load dataset? Aaron mentioned that so it seems like that could prevent that issue

@apage43
Copy link
Member

apage43 commented Jul 9, 2024

Loads dataset using load_dataset library and assigns a unique ULID ID for each entry

(andriy) These should be ULID not sh256 or uuids. @apage43 has a nice function for these

The primary reason to avoid pure-random IDs or hash based IDs is that they cause worst case performance when used as keys in ordered data structures (such b-tree indexes in a database), ULIDs improve on this by making the beginning of the ID a timestamp so that IDs created around the same time have some locality to each other, but, like UUIDs, they are still kinda big

Big (semi)random IDs like ULID are best used when you need uniqueness while also avoiding coordination, e.g. you have multiple processes inserting data into something and it would add a lot of complexity to make them cooperate to assign non-overlapping IDs - but in situations you where can use purely sequential IDs it is usually better to, as smaller ids are cheaper to store and look up

When using map_data the nomic client already has functionality to create a sequential ID field (note that its still required to be a string so it base64s their binary representation), it may make sense to copy that behavior. See here:

data = [{ATLAS_DEFAULT_ID_FIELD: b64int(i)} for i in range(len(embeddings))]


Tested with datasets smaller than 10k for speed but can work with larger datasets

I believe this will not currently work when the dataset size exceeds available RAM on the machine running this - HF datasets understands slice syntax when specifying a split so you can test with portions of a very large dataset with load_dataset("really-big-dataset", split="train[:100000]") to only get the first 100k rows.

Making it work should be possible by working in chunks and using IterableDatasets https://huggingface.co/docs/datasets/v2.20.0/en/about_mapstyle_vs_iterable#downloading-and-streaming

here is a notebook where I'm uploading from an iterabledataset in chunks (note, though, that because I call load_dataset and then to_iterable_dataset this still downloads the entire dataset - you can also pass streaming=True to load_dataset to get an IterableDataset that only downloads as much as you actually read, which may be desirable if you're only working with a subset of a large dataset): https://gist.github.com/apage43/9e80b0f4378ed466ec5d1c0a4042c398

@RLesser
Copy link
Contributor

RLesser commented Jul 9, 2024

Going off @apage43's comment, I feel strongly that we should be taking advantage of huggingface datasets use of Arrow to pass data to atlas, which also speaks fluent arrow. We should also be taking advantage of batching or chunking for arbitrarily large datasets. Using base python iterators means this will break for larger datasets.

…ssing and getting config without parsing through error message
Comment on lines 10 to 17
try:
# Loads dataset without specifying config
dataset = load_dataset(dataset_identifier)
except ValueError as e:
# Grabs available configs and loads dataset using it
configs = get_dataset_split_names(dataset_identifier)
config = configs[0]
dataset = load_dataset(dataset_identifier, config, trust_remote_code=True, streaming=True, split=config + "[:100000]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would, instead of this, just make the split (and max length) optional arguments to this function (and in turn, to the top level hf_atlasdataset as well) instead of silently checking for splits and grabbing the first one which may not be the one that a user intends - for example, on wikimedia/wikipedia which is split by language, that would be Abkhazian wikipedia, since ab is alphabetically first

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also note that streaming=True makes load_dataset return an IterableDataset which has slightly different behaviors than a normal Dataset - should probably always using streaming if you're going to use it - otherwise you'd need to make sure to test both cases.

Also the row limit should probably be an optional argument as well rather than hardcoded - someone may want to upload a >100k row dataset or less - while using split='splitname[:1000] does work its also not the only way - the .take on a dataset will return a new dataset with only that many rows (and on IterableDatasets will do the right thing and only fetch that many): dataset = dataset.take(1000) - this is probably more sensible for exposing the limit as an argument.

another issue is that slicing this or using .take will get the beginning of the dataset - often times if you are wanting to map a sample of a dataset (because you want to quickly get a picture of what's in it without spending the time/compute to map the whole thing) you want a random sample, for example the wikipedia dataset is also in alphabetical order by title so articles near the beginning will just be ones with titles starting with A which - probably won't get a very representative map of the whole dataset.

Datasets also have a .shuffle method which works similarlu to .take and should be applied before .take. E.g. to get 1000 random rows from a dataset you want dataset = dataset.shuffle().take(1000) - it probably makes sense to use this any time a limit is specified, but its not needed if the whole dataset will be uploaded.

import logging

if __name__ == "__main__":
dataset_identifier = input("Enter Hugging Face dataset identifier: ").strip()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about instead of making this an interactive script, use argparse: https://docs.python.org/3.10/library/argparse.html?highlight=argparse#module-argparse

that way it's easier to handle optional args like split and limit



# Convert the data list to an Arrow table
table = pa.Table.from_pandas(pd.DataFrame(data))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could probably do pa.Table.from_pylist(data) instead and avoid roundtripping through pandas here

def process_table(table):
# Converts columns with complex types to strings
for col in table.schema.names:
column = table[col].to_pandas()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyarrow.compute.cast may be able to handle some of this without having to go through pandas/pure python
https://arrow.apache.org/docs/python/generated/pyarrow.compute.cast.html

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My code seems to throw an Attribute error without it. There are some cases that it works but with some like this one it doesn't.
https://huggingface.co/datasets/Anthropic/hh-rlhf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can leave this as is for now then



# Adds data to the AtlasDataset
dataset.add_data(data=processed_table.to_pandas().to_dict(orient='records'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_data accepts arrow tables directly, no need for this conversion (add_data will have to convert it back into an arrow table before uploading if you do this)

@apage43
Copy link
Member

apage43 commented Jul 11, 2024

current version of this has no create_index calls so it'll only create an AtlasDataset with data in it but no map - is that intended?


# Gets data from HF dataset
def get_hfdata(dataset_identifier, split="train", limit=100000):
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docstring to function definition



# Load the dataset
dataset = load_dataset(dataset_identifier, split=split, streaming=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move load_dataset outside of try/except


# Load the dataset
dataset = load_dataset(dataset_identifier, split=split, streaming=True)
except ValueError as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to handle this error, seems like handling config through the error message is too risky

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you can take out the try/except here

# Processes dataset entries using Arrow
id_counter = 0
data = []
if dataset:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if no dataset exists, function should fail

column = table[col]
if pa.types.is_boolean(column.type):
table = table.set_column(table.schema.get_field_index(col), col, pc.cast(column, pa.string()))
elif pa.types.is_list(column.type):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you flatten the column as a list then cast as a string, the column will be too long and not match up with the other columns of the table. We may need to refactor this slightly by making sure column length is the same and structs are handled on a row by row basis


if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Create an AtlasDataset from a Hugging Face dataset.')
parser.add_argument('--dataset_identifier', type=str, required=True, help='The Hugging Face dataset identifier')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may want to explicitly specify in the arg that it's a hf_dataset_identifier because atlas has its own datasets

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 9ae14f4 in 1 minute and 4 seconds

More details
  • Looked at 140 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_5mSlb7zHjU1nAKlz


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

# Gets data from HF dataset
def get_hfdata(dataset_identifier, split="train", limit=100000):
splits = get_dataset_split_names(dataset_identifier)
dataset = load_dataset(dataset_identifier, split=split, streaming=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation does not handle the case where the specified split is not available in the dataset. Previously, there was a mechanism to check available splits and use an alternative if the specified one was not found. Consider reintroducing this functionality to avoid runtime errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants