added connector folder and HF file #313
base: main
Changes from 12 commits
@@ -0,0 +1,94 @@
from datasets import load_dataset, get_dataset_split_names
from nomic import AtlasDataset
import numpy as np
import pandas as pd
import pyarrow as pa
# Gets data from an HF dataset
def get_hfdata(dataset_identifier):
    try:
        # Loads the dataset without specifying a config
        dataset = load_dataset(dataset_identifier)
    except ValueError as e:
Review comment: We don't need to handle this error; handling the config through the error message seems too risky.
Review comment: So you can take out the try/except here.
        # Grabs the available configs and loads the dataset using the first one
        configs = get_dataset_split_names(dataset_identifier)
        config = configs[0]
        dataset = load_dataset(dataset_identifier, config, trust_remote_code=True, streaming=True, split=config + "[:100000]")
Review comment: I would, instead of this, just make the split (and max length) optional arguments to this function (and in turn, to the top-level function).
Review comment: Also note that … Also, the row limit should probably be an optional argument as well rather than hardcoded - someone may want to upload a >100k-row dataset, or fewer rows, while using … Another issue is that slicing this or using … Datasets also have a …
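The reviewer's suggestion could look roughly like the sketch below. The helper name `build_split_expression` and the default values are my own assumptions, not code from the PR; the actual `load_dataset` call is elided since it requires network access:

```python
# Hypothetical sketch of the reviewer's suggestion: make the split and row
# limit optional arguments instead of hardcoding "[:100000]". The helper
# name build_split_expression is an assumption, not part of the PR.
def build_split_expression(split="train", limit=None):
    # The datasets library's split syntax supports slicing, e.g. "train[:5000]"
    if limit is None:
        return split
    return f"{split}[:{limit}]"


def get_hfdata(dataset_identifier, split="train", limit=100000):
    # Only shows how the optional arguments would feed into the split
    # expression; it would then be passed as load_dataset(..., split=split_expr)
    split_expr = build_split_expression(split, limit)
    return split_expr
```

With this shape, callers who want the full dataset can pass `limit=None`, and the top-level `hf_atlasdataset` function can forward both arguments.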
    # Processes dataset entries using Arrow
    id_counter = 0
    data = []
    for split in dataset.keys():
        for example in dataset[split]:
            # Adds a sequential ID
            example['id'] = str(id_counter)
            id_counter += 1
            data.append(example)
    # Converts the data list to an Arrow table
    table = pa.Table.from_pandas(pd.DataFrame(data))
Review comment: you could probably do …
    return table
# Converts booleans, lists etc. to strings
def convert_to_string(value):
    if isinstance(value, bool):
        return str(value)
    elif isinstance(value, list):
        return ' '.join(map(convert_to_string, value))
    elif isinstance(value, np.ndarray):
        return ' '.join(map(str, value.flatten()))
    elif hasattr(value, 'tolist'):
        return ' '.join(map(str, value.tolist()))
    else:
        return str(value)
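A quick standalone check of the conversion logic in this function (the function body is copied from the diff; only the example inputs are mine):

```python
import numpy as np

# Copied from the diff: flattens booleans, nested lists, and arrays to
# space-separated strings.
def convert_to_string(value):
    if isinstance(value, bool):
        return str(value)
    elif isinstance(value, list):
        return ' '.join(map(convert_to_string, value))
    elif isinstance(value, np.ndarray):
        return ' '.join(map(str, value.flatten()))
    elif hasattr(value, 'tolist'):
        return ' '.join(map(str, value.tolist()))
    else:
        return str(value)

print(convert_to_string(True))                        # -> "True"
print(convert_to_string([1, [2, 3]]))                 # -> "1 2 3"
print(convert_to_string(np.array([[1, 2], [3, 4]])))  # -> "1 2 3 4"
```

Note the recursion on lists: nested lists are flattened the same way `ndarray.flatten()` flattens arrays, so both produce a flat space-separated string.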
# Processes an Arrow table and converts necessary fields to strings
def process_table(table):
    # Converts columns with complex types to strings
    for col in table.schema.names:
        column = table[col].to_pandas()
Review comment: My code seems to throw an AttributeError without it. There are some cases where it works, but with some, like this one, it doesn't.
Review comment: Can leave this as is for now then.
        if column.dtype == np.bool_ or column.dtype == object or isinstance(column[0], (list, np.ndarray)):
            column = column.apply(convert_to_string)
            table = table.set_column(table.schema.get_field_index(col), col, pa.array(column))
    return table
# Creates an AtlasDataset from an HF dataset
def hf_atlasdataset(dataset_identifier):
    table = get_hfdata(dataset_identifier.strip())
    map_name = dataset_identifier.replace('/', '_')
    if not table:
        raise ValueError("No data was found for the provided dataset.")
    dataset = AtlasDataset(
        map_name,
        unique_id_field="id",
    )
    # Ensures all values are converted to strings
    processed_table = process_table(table)
    # Adds data to the AtlasDataset
    dataset.add_data(data=processed_table.to_pandas().to_dict(orient='records'))
    return dataset
@@ -0,0 +1,24 @@
from huggingface_connector import hf_atlasdataset
import logging
if __name__ == "__main__":
    dataset_identifier = input("Enter Hugging Face dataset identifier: ").strip()
Review comment: How about, instead of making this an interactive script, use argparse: https://docs.python.org/3.10/library/argparse.html?highlight=argparse#module-argparse - that way it's easier to handle optional args like split and limit.
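A minimal sketch of the reviewer's argparse suggestion. The flag names `--split` and `--limit` are assumptions chosen to match the optional arguments discussed above, not code from the PR:

```python
import argparse

# Sketch of the reviewer's suggestion: replace input() with argparse so the
# split and row limit can be passed as optional flags. The flag names are
# assumptions, not from the PR.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Upload an HF dataset to Atlas")
    parser.add_argument("dataset_identifier", help="e.g. 'username/dataset_name'")
    parser.add_argument("--split", default=None, help="dataset split to load")
    parser.add_argument("--limit", type=int, default=None, help="max rows to upload")
    return parser.parse_args(argv)

args = parse_args(["user/dataset", "--split", "train", "--limit", "500"])
```

Calling `parse_args()` with no argument falls back to `sys.argv`, so the same function works both in tests and as the script's entry point.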
    try:
        atlas_dataset = hf_atlasdataset(dataset_identifier)
        logging.info(f"AtlasDataset has been created for '{dataset_identifier}'")
    except ValueError as e:
        logging.error(f"Error creating AtlasDataset: {e}")
Review comment: Add docstring to function definition.