Releases: centic9/CommonCrawlDocumentDownload
1.0.0.9
- Switch to Gradle 7.6 and to the new maven-publish plugin
- Update third-party libraries
- Update to a more recent CC-MAIN crawl
- Parse newer fields
- Adjust logging configuration
Full Changelog: 1.0.0.8...1.0.0.9
1.0.0.8
Intermediate release while switching to Gradle 7.6, not uploaded to Maven Central.
Full Changelog: 1.0.0.7...1.0.0.8
1.0.0.10
1.0.0.7
- Add extension .pot for PowerPoint
- Switch to CC-MAIN-2019-39
- Update third-party libraries
Full Changelog: 1.0.0.6...1.0.0.7
1.0.0.6
- Update third-party libraries
- Use Common Crawl 2018-43 by default
- Write accumulated MIME types to a separate text file after each index file
- Add some support for detecting duplicate files and moving them out of the list, so that the post-processing steps do not re-process the same file over and over
- Make some small adjustments for behavior changes in Java 11