v0.2.0
What's Changed
- Adds multi node parallelism to local executor by @guipenedo in #85
- Changed fsx default filepath for logging output to user's home by @Anacheron51 in #86
- [Docs] Fix typos by @StandardAI in #91
- bugfix stats file not being saved to s3 by @guipenedo in #92
- Fix url stats by @thomwolf in #89
- Efficiency: np.fromiter instead of np.array by @giorgioangel in #88
- Adds language option for nltk by @guipenedo in #94
- Fix compression type by @jordane95 in #95
- Decoupled reading logic from DedupReader by @guipenedo in #98
- Support for arbitrary fasttext models by @guipenedo in #99
- Adds citation by @guipenedo in #101
- Adds parquet writer by @guipenedo in #103
- Utilities to efficiently parallelize the upload of dataset files to the HuggingFace hub by @guipenedo in #105
- Adding doc strings + adding a faster tokenized doc merger by @thomwolf in #90
- Add email on slurm and extend fasttext filter functionalities by @thomwolf in #111
- Add `jobs_status` command. by @lvwerra in #113
- Re-enable `datasets` test by @mariosasko in #114
- Update warc.py by @jordane95 in #115
- Bug fix: when file is empty by @jordane95 in #126
- Load tokenizer using `from_file` by @guipenedo in #122
- Adds `depends=` to LocalPipelineExecutor by @guipenedo in #100
- Improve C4 filter and dedup by @guipenedo in #124
- Adds option to shuffle input files in readers by @guipenedo in #128
- update Trafilatura version by @adbar in #130
- Changes to text normalization + FTFY and lines symbol formatters by @guipenedo in #133
- Minor Terminology and Documentation Updates for Local Tokenizer Loading by @justHungryMan in #134
- add requeue and QOS slurm options by @marianna13 in #144
- Fix substring dedup range by @jordane95 in #132
- Line dedup min remove words option by @guipenedo in #146
- New options for FastTextClassifierFilter: apply on sentence or paragraph (line) level by @guipenedo in #151
- Url deduplication by @hynky1999 in #145
- Fix race conditions during download/extraction by @hynky1999 in #155
- Adds PII removal by @guipenedo in #156
- Pypi Publish Action by @hynky1999 in #159
New Contributors
- @Anacheron51 made their first contribution in #86
- @StandardAI made their first contribution in #91
- @giorgioangel made their first contribution in #88
- @lvwerra made their first contribution in #113
- @adbar made their first contribution in #130
- @justHungryMan made their first contribution in #134
- @marianna13 made their first contribution in #144
- @hynky1999 made their first contribution in #145
Full Changelog: v0.0.1...v0.2.0