- Bug fixes across the repo
- Minor enhancements and experimentation with single-package techniques using `[extra]` optional dependencies
- Decoupled the release process for each component so we can be more responsive to the needs of our stakeholders
- The minor digit of the release version is incremented, and the patch digit is reset to 0, for all new releases of the data-prep-toolkit
- The patch digit for any one component's release can be incremented independently of other components' patch numbers
- Released the first version of the data-prep-toolkit-connector for crawling websites and downloading HTML and PDF files for ingestion by the pipeline
- Bug fixes across the repo
- Added AI Alliance RAG demo, tutorials, notebooks, and tips for running on Google Colab
- Added new transforms and a single package for transforms published to PyPI
- Improved CI/CD with targeted workflows triggered by changes to specific modules
- New enhancements for cutting a release
- Restructured the repository to distinguish/separate runtime libraries
- Split data-processing-lib/ray into python and ray libraries
- Added a Spark runtime
- Updated pyarrow version
- Defined the required transform() method as abstract in AbstractTableTransform
- Enabled makefile configuration to use either source (src) or PyPI for data-prep-kit library dependencies
- Added a configurable timeout before destroying the deployed Ray cluster
- Added 7 new transforms, including language identification, profiler, repo-level ordering, doc quality, pdf2parquet, HTML2Parquet, and PII
- Added a Python implementation of ededup and incremental ededup
- Added fuzzy floating-point comparison
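The fuzzy floating-point comparison mentioned above can be sketched with Python's standard `math.isclose`; the function name and tolerance defaults below are illustrative assumptions, not the toolkit's actual API:

```python
import math


def columns_close(a: list[float], b: list[float],
                  rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Compare two numeric columns element-wise with a fuzzy tolerance,
    so benign floating-point drift does not fail an equality check."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
               for x, y in zip(a, b))
```

For example, `columns_close([0.1 + 0.2], [0.3])` is `True`, even though `0.1 + 0.2 == 0.3` is `False` in IEEE 754 arithmetic.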
- Many bug fixes across the repo, plus the following specifics.
- Enhanced CI/CD and makefile improvements, including definition of top-level targets (clean, set-versions, build, publish, test)
- Automated release-process branch/tag management
- Documentation improvements
- Split libraries into 3 runtime-specific implementations
- Fixed the missing final count of processed items and added percentages
- Improved fault tolerance in python and ray runtimes
- Reported a global DataAccess retry metric
- Added support for binary data transforms
- Updated Ray version to 2.24
- Updated PyArrow version to 16.1.0
- Added KFP V2 support
- Created a distinct (timestamped) execution.log file for each retry
- Added support for multiple inputs/outputs
- Added language/lang_id - detects the language of documents
- Added universal/profiler - counts words/tokens in documents
- Converted the ingest2parquet tool to a transform named code2parquet
- Split transforms, as appropriate, into python, ray, and/or spark implementations
- Added Spark implementations of the filter, doc_id, and noop transforms
- Switched from requirements.txt to a pyproject.toml file for each transform runtime
- Restructured the repository to move kfp workflow definitions to their associated transform project directories
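The switch from requirements.txt to pyproject.toml means each transform runtime declares its dependencies as PEP 621 project metadata. The fragment below is a hypothetical sketch only; the package name, version pins, and extras are illustrative and do not reflect the repository's actual files:

```toml
[project]
name = "dpk-noop-transform"          # hypothetical package name
version = "0.2.0"
requires-python = ">=3.10"
dependencies = [
    "data-prep-toolkit>=0.2.0",      # illustrative version pin
]

[project.optional-dependencies]
# Extras let a single published package pull in runtime-specific
# dependencies on demand, e.g. `pip install "dpk-noop-transform[ray]"`
ray = ["ray>=2.24"]
spark = ["pyspark>=3.5"]
```

Declaring runtime-specific dependencies as extras keeps the base install small while still supporting the python, ray, and spark runtimes from one package definition.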