
Issues with using the released hh dataset. #44

Open · jltchiu opened this issue Jan 23, 2024 · 2 comments

jltchiu commented Jan 23, 2024

Hi, I am trying to use your published dataset on Hugging Face, as follows:

from datasets import load_dataset

# Load both splits of the released dataset
dataset = load_dataset("fnlp/hh-rlhf-strength-cleaned")
print(dataset)

However, I get the error below:

Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11008.67it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2365.65it/s]
Generating train split: 151214 examples [00:05, 25560.14 examples/s]
Generating validation split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
    writer.write_table(table)
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/arrow_writer.py", line 572, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/table.py", line 2328, in table_cast
    return cast_table_to_schema(table, schema)
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
std preference difference: double
rejected score list: list<item: double>
  child 0, item: double
rejected: list<item: string>
  child 0, item: string
chosen score list: list<item: double>
  child 0, item: double
chosen: list<item: string>
  child 0, item: string
mean preference difference: double
GPT4 label: int64
to
{'std preference difference': Value(dtype='float64', id=None), 'rejected score list': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'rejected': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'chosen score list': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None), 'chosen': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'mean preference difference': Value(dtype='float64', id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/chiujustin01-pvc/workspace/work/get_data.py", line 5, in <module>
    dataset = load_dataset("fnlp/hh-rlhf-strength-cleaned")
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/root/miniconda3/lib/python3.9/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I have used the same code for other datasets and it seems to work. Do you know where I should fix my code?

Pattaro commented Jan 29, 2024

I ran into this too. Have you solved it?

refrain-wbh (Contributor) commented

After investigation, we found that the schema inconsistency between the 'train' and 'valid' splits prevented datasets from loading both splits at the same time. We fixed this by adding a placeholder field to the 'train' split (GPT4 label, set to -1 to indicate that the field is empty). With your code, the dataset now downloads successfully.

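For anyone who lands here later, a minimal sketch of how the fixed dataset should behave; the column name GPT4 label comes from the error message above, and the -1 placeholder semantics are those described in the fix (the filter step is just an illustrative way to separate placeholder rows from real labels):

from datasets import load_dataset

# Both splits now share one schema, so this loads without the cast error.
dataset = load_dataset("fnlp/hh-rlhf-strength-cleaned")
print(dataset)

# In the train split, "GPT4 label" is only a placeholder:
# -1 means the example carries no real GPT-4 label.
train = dataset["train"]
labeled = train.filter(lambda ex: ex["GPT4 label"] != -1)
print(f"train examples with a real GPT4 label: {len(labeled)}")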
