
Issues with using the released hh dataset. #44

Open · jltchiu opened this issue Jan 23, 2024 · 2 comments

jltchiu commented Jan 23, 2024

Hi, I am trying to use your published dataset on Hugging Face, as follows:

from datasets import load_dataset

# Load both splits of the released dataset
dataset = load_dataset("fnlp/hh-rlhf-strength-cleaned")
print(dataset)

However, I get the error below:

Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11008.67it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2365.65it/s]
Generating train split: 151214 examples [00:05, 25560.14 examples/s]
Generating validation split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
    writer.write_table(table)
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/arrow_writer.py", line 572, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/table.py", line 2328, in table_cast
    return cast_table_to_schema(table, schema)
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
std preference difference: double
rejected score list: list<item: double>
  child 0, item: double
rejected: list<item: string>
  child 0, item: string
chosen score list: list<item: double>
  child 0, item: double
chosen: list<item: string>
  child 0, item: string
mean preference difference: double
GPT4 label: int64
to
{'std preference difference': Value(dtype='float64', id=None), 'rejected score list': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'rejected': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'chosen score list': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'chosen': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'mean preference difference': Value(dtype='float64', id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/chiujustin01-pvc/workspace/work/get_data.py", line 5, in <module>
    dataset = load_dataset("fnlp/hh-rlhf-strength-cleaned")
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I have used the same code for other datasets and it seems to work. Do you know where I should fix my code?

Pattaro commented Jan 29, 2024

I ran into this too. Have you solved it?

refrain-wbh (Contributor) commented

After investigation, we found that the schema inconsistency between the 'train' and 'valid' splits prevented datasets from loading both splits at the same time. We fixed this by adding a placeholder field to the 'train' split (GPT4 label, set to -1 to indicate that the field is empty). With your code, the dataset now downloads successfully.

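For anyone who lands here later, a minimal sketch of how the fixed dataset should behave; the column name GPT4 label comes from the error message above, and the -1 placeholder semantics are those described in the fix (the filter step is just an illustrative way to separate placeholder rows from real labels):

from datasets import load_dataset

# Both splits now share one schema, so this loads without the cast error.
dataset = load_dataset("fnlp/hh-rlhf-strength-cleaned")
print(dataset)

# In the train split, "GPT4 label" is only a placeholder:
# -1 means the example carries no real GPT-4 label.
train = dataset["train"]
labeled = train.filter(lambda ex: ex["GPT4 label"] != -1)
print(f"train examples with a real GPT4 label: {len(labeled)}")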
