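We found that some "dirty" samples remain in existing, already processed datasets (e.g. Redpajama, The Pile). We therefore use Data-Juicer to refine these datasets and feed them to LLMs, aiming for better performance.

We use a simple 3-σ rule to set the hyperparameters of the OPs in each data processing recipe.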
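As a minimal sketch of what a 3-σ rule can look like in practice, the snippet below derives filter thresholds from the distribution of a per-sample statistic. The statistic (text length), the helper name `three_sigma_bounds`, and the sample values are all hypothetical and chosen for illustration; this is not Data-Juicer's own code, and the actual recipes may compute their thresholds differently.

```python
import statistics

def three_sigma_bounds(values, lower_limit=None):
    """Return (low, high) = mean -/+ 3 * population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    low, high = mean - 3 * std, mean + 3 * std
    # Optionally clamp the lower bound (e.g. lengths cannot be negative).
    if lower_limit is not None:
        low = max(low, lower_limit)
    return low, high

# Hypothetical per-sample statistic: 50 ordinary text lengths plus one outlier.
text_lengths = [100 + (i % 11) for i in range(50)] + [5000]

low, high = three_sigma_bounds(text_lengths, lower_limit=0)

# Keep only samples whose statistic falls inside the 3-sigma interval.
kept = [n for n in text_lengths if low <= n <= high]
print(f"bounds=({low:.1f}, {high:.1f}), kept {len(kept)} of {len(text_lengths)}")
```

Under this reading, `low` and `high` would become the minimum and maximum hyperparameters of a filter OP. Note that with very few samples a single outlier can never exceed three standard deviations, so the rule only becomes meaningful on reasonably large samples.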
Subset | #Samples before refinement | #Samples after refinement | Retention rate | Config link | Data link | Source |
---|---|---|---|---|---|---|
Arxiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun, ModelScope | Redpajama |
Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun, ModelScope | Redpajama |
Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun, ModelScope | Redpajama |
C4 | 364,868,892 | 346,217,856 | 94.89% | redpajama-c4-refine.yaml | Aliyun, ModelScope | Redpajama |
Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-refine/ | Aliyun, ModelScope | Redpajama |
Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-refine/ | Aliyun, ModelScope | Redpajama |
Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-refine/ | Aliyun, ModelScope | Redpajama |
Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-refine/ | Aliyun, ModelScope | Redpajama |
Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-refine/ | Aliyun, ModelScope | Redpajama |
Github Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml, stack-code-refine.yaml, redpajama-stack-code-deduplicate.yaml | Aliyun, ModelScope | Redpajama, The Stack |
StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun, ModelScope | Redpajama, The Pile |
EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun, ModelScope | The Pile |
FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun, ModelScope | The Pile |
HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun, ModelScope | The Pile |
NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun, ModelScope | The Pile |
PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun, ModelScope | The Pile |
PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun, ModelScope | The Pile |
PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun, ModelScope | The Pile |
USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun, ModelScope | The Pile |

Subset | #Samples before refinement | #Samples after refinement | Retention rate | Config link | Data link | Source |
---|---|---|---|---|---|---|
Alpaca-Cot EN | 136,219,879 | Non-deduplicated: 104,573,711; deduplicated: TBD | 76.77% | alpaca-cot-en-refine.yaml | Aliyun, ModelScope | 39 subsets from Alpaca-CoT |
Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun, ModelScope | 28 subsets from Alpaca-CoT |
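The retention-rate column in the tables above appears to be the number of samples after refinement divided by the number before, expressed as a percentage. The short sketch below reproduces two rows under that assumption; the helper name `retention_rate` is hypothetical. For the Github Code row, the two source counts are summed before dividing.

```python
def retention_rate(before, after):
    """Percentage of samples kept after refinement, rounded to two decimals."""
    return round(100 * after / before, 2)

# Github Code combines two sources, so the "before" count is their sum.
github_before = 73_208_524 + 21_387_703
github_rate = retention_rate(github_before, 49_279_344)

# Alpaca-Cot ZH row.
zh_rate = retention_rate(21_197_246, 9_873_214)

print(github_rate)  # 52.09
print(zh_rate)      # 46.58
```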