Add HF datasets to the Ludwig docs (#334)
Infernaught authored Dec 21, 2023
1 parent 7c7321f commit 4292df5
Showing 2 changed files with 15 additions and 0 deletions.
Binary file added docs/images/hf_subset_vs_split.png
15 changes: 15 additions & 0 deletions docs/user_guide/datasets/supported_formats.md
@@ -1,3 +1,4 @@
## File Formats
Ludwig is able to read UTF-8 encoded data from 14 file formats.
Supported formats are:

@@ -22,3 +23,17 @@ In case a \*SV file is provided, Ludwig tries to identify the separator (general
The default escape character is `\`.
For example, if `,` is the column separator and one of your data columns has a `,` in it, Pandas would fail to load the data properly.
To handle such cases, we expect the values in the columns to be escaped with backslashes (replace `,` in the data with `\,`).

## Hugging Face Datasets
Ludwig can also load datasets directly from Hugging Face using the following syntax. Not every Hugging Face dataset has a subset; when there is none, omit the `--{dataset_subset}` part.

`"hf://{dataset_name}--{dataset_subset}"`

For example:

- `train_stats, _, _ = ludwig_model.train(dataset="hf://mbpp")`
- `train_stats, _, _ = ludwig_model.train(dataset="hf://Open-Orca/OpenOrca")`
- `train_stats, _, _ = ludwig_model.train(dataset="hf://gsm8k--main")`

Note that a "subset" is not the same as a "split". When specifying the dataset, use the subset name, not the split name:

![Hugging Face subset vs. split](../../images/hf_subset_vs_split.png)
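
The `hf://{dataset_name}--{dataset_subset}` syntax can also be assembled programmatically. The following is a minimal sketch; `hf_dataset_uri` is a hypothetical helper written for illustration and is not part of Ludwig. It only builds the string, leaving the subset out when none is given.

```python
from typing import Optional


def hf_dataset_uri(dataset_name: str, dataset_subset: Optional[str] = None) -> str:
    """Build a dataset string of the form hf://{dataset_name}--{dataset_subset}.

    Hypothetical helper (not a Ludwig API). The subset is appended only
    when one is provided, since not every dataset has a subset.
    """
    if dataset_subset:
        return f"hf://{dataset_name}--{dataset_subset}"
    return f"hf://{dataset_name}"


# Reproduce the three dataset strings used in the examples.
print(hf_dataset_uri("mbpp"))
print(hf_dataset_uri("Open-Orca/OpenOrca"))
print(hf_dataset_uri("gsm8k", "main"))
```

The resulting string is what gets passed as the `dataset` argument, e.g. `ludwig_model.train(dataset=hf_dataset_uri("gsm8k", "main"))`. Pass the subset name (here `main`), not a split name.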
