Skip to content

Latest commit

 

History

History
26 lines (22 loc) · 1.64 KB

README.md

File metadata and controls

26 lines (22 loc) · 1.64 KB

Concedo's Dataset Explorer

Easily view and modify JSON and JSONL datasets for training large language models

image

Features

  • Easily view and modify JSON and JSONL datasets for training large language models
  • Supports Alpaca (Instruct), ShareGPT, and Text formats (and more)
  • Runs fully portable from your web browser, as a single file with zero other dependencies
  • Browse through your training datasets, with easy search and filter functions to segment your data
  • Supports searching and filtering with regex search or simple substrings search
  • Filter multiple samples by contents, length, matches, and number of turns. Allows combining multiple queries for composite results.
  • Includes an N-gram viewer to inspect selected examples for word frequency and repetition (word cloud)
  • Allows splitting and merging datasets by selecting desired subsets with different criteria.
  • Allows easy dataset deduplication
  • Includes a simple inline editor to modify individual samples or correct typos.
  • Pick individual samples or bulk-combine groups of them to curate your dataset, and save the results as a new JSON dataset
  • Fast and efficient, comfortably handles small to medium sized datasets of up to 400 MB. For larger datasets, it's recommended to split them first.
  • Fully open source, capable of running completely offline (just save the HTML file)

Free and open source. Try now at https://lostruins.github.io/datasetexplorer

Tips

  • JSON > Parquet
  • Alpaca > ChatML
  • Kobo > !Kobo