3.0.0
-----
* Update FARO to allow for plug-in support.
* Decouple FARO 2.0.0 functionality to be run separately in plug-ins
* Add plug-in template to use as a guide for new plug-in integration
* Add plug-in example (address_bitcoin plus tests) based on plug-in template
* Add option to run all plug-ins in a configurable path
* Move tests to separate package
* Simplify configuration
* Support for logging configuration
* Update to tika 1.24
2.0.0
-----
* Detect password-protected/encrypted files and score them as high sensitivity
* Remove gensim dependency
* Remove pandas dependency
* Remove luhn dependency
* Remove murmurhash dependency
* Remove custom regex library dependency and use standard package
* Clean obsolete or transitive dependencies from requirements
* Fix issue with relative paths with deep ancestors in the spider. Switch output to absolute paths, since they give more context
* Allow non-ASCII characters in detailed entities output file
* Include new contributors
* Simplify configuration
* Add testing and coverage metrics
* Replace custom ML models with standard ones (spaCy). The cost-benefit ratio signals that this is a better approach
* Remove scikit-learn and sklearn-crfsuite
* Update spacy to most recent version
* Decouple tika
* Add Docker Compose to set up development and production environments
1.1.2
-----
* Fix issue with logging while forcing OCR on PDF documents
1.1.1
-----
* Update to tika 1.23
* Add dockerhub image and update documentation on its use: https://hub.docker.com/r/gradiant/faro
* Fix #32: logging duplicates
* Fix #37: metadata handling when a list is extracted in some fields (dates and pages)
1.1.0
-----
* Add OCR capabilities
* Add option to disable OCR for performance reasons
* Let tika handle the supported file formats
* Allow for basic document classification by adding metadata to output: type of doc, author, creation date, filesize, etc.
* Rewrite metadata handling
* Move log and OCR configuration to envvars to integrate better with docker
1.0.1
-----
* Add Docker support
* Fix path with spaces issue
* Fix sensitive information patterns and redesign two-phase approach
* Add more contextual validations
1.0.0
-----
* Initial release.