diff --git a/.dockerignore b/.dockerignore old mode 100644 new mode 100755 diff --git a/.gitignore b/.gitignore old mode 100644 new mode 100755 index db5dfaf..091a3b7 --- a/.gitignore +++ b/.gitignore @@ -4,9 +4,22 @@ data/* *.orig *.log +*.env +*.list output/ +input/ +.idea/ *.list +!docker_faro_env_example.list +test-reports/* +# Ensure no output files are published *.entity *.score +# Ignore coverage stats and config .coverage -!docker_faro_env_example.list +nosetests.xml +.keep +venv +*.log +*.log* +env diff --git a/CHANGELOG b/CHANGELOG index dbe0c9c..08e5d54 100755 --- a/CHANGELOG +++ b/CHANGELOG @@ -1,3 +1,16 @@ +3.0.0 +----- +* Update FARO to allow for plug-in support +* Decouple FARO 2.0.0 functionality to be run separately in plug-ins +* Add plug-in template to use as a guide for new plug-in integration +* Add plug-in example (address_bitcoin plus tests) based on plug-in template +* Add option to run all plugins in configurable path +* Move tests to separate package +* Simplify configuration +* Support for logging configuration +* Update to tika 1.24 + + 2.0.0 ----- * Add password-protected/encrypted file detection and score them as high sensitivity diff --git a/CONTRIBUTORS b/CONTRIBUTORS index 2611629..a403db0 100755 --- a/CONTRIBUTORS +++ b/CONTRIBUTORS @@ -1,7 +1,8 @@ -- Enrique Andrade González (ElevenPaths-TEGRA) -- Hector Cerezo Costas (Gradiant-TEGRA) -- Juan Elosua Tomé (ElevenPaths-TEGRA) -- Rafael P. Martínez Álvarez (Gradiant-TEGRA) +- Enrique Andrade González +- Hector Cerezo Costas +- Juan Elosua Tomé +- Hugo Román García-Pardo Rodríguez +- Rafael P. Martínez Álvarez TEGRA is an R&D Cybersecurity Center based in Galicia (Spain). It is a joint effort from Telefónica, a leading international telecommunications company, through ElevenPaths, its global cybersecurity unit, and Gradiant, an ICT R&D center with more than 100 professionals working in areas like connectivity, security and intelligence, to create innovative products and services inside cybersecurity. diff --git a/Readme.md b/Readme.md index 66bc194..1bbf417 100755 --- a/Readme.md +++ b/Readme.md @@ -47,7 +47,12 @@ IF you are in a rush and just want to give it a try...go [here](docker/README.md The project contains the following folders: * `faro/` : this is the FARO module with the main functionality and tests - * `config/`: yaml configuration files go here. There is one yaml file per language (plus one `nolanguage.yaml` to provide basic functionality for non detected languages) and one yaml file with common configurations for all languages `config/commons.yaml`. + * `conf/`: configuration files go here: `conf/commons.yaml` holds the common configuration for the tool (entities, plugins and sensitivity scoring) and `conf/config.py` sets up logging. + * `plugins/`: stores all the available plugins to detect sensitive information with the appropriate language support. + * `utils/`: utilities for faro execution, for example pre-processing of texts and root classes that implement common plugin functionality. + * `docker/`: everything related to the execution of faro in a containerized setup. + * `test/`: unit tests for faro. + * `logs/` and `logger/`: logger definition and log storage. * `faro_detection.py`: launcher of FARO for standalone operation over a single file. * `faro_spider.sh`: script for bulk processing. 
* `nose.cfg`: Configuration for testing faro @@ -110,7 +115,7 @@ These other dependencies are used for testing: #### Tika dependency -We provide some utilities in order to get tika server up and running on your local machine in case is useful donwload this [zip file](https://github.com/ElevenPaths/FARO/releases/download/v2.0.0/tika_external.zip) and uncompress somewhere in your local filesystem. +We provide some utilities in order to get tika server up and running on your local machine in case is useful donwload this [zip file](https://github.com/ElevenPaths/FARO/releases/download/v3.0.0/tika_external.zip) and uncompress somewhere in your local filesystem. To fire up tika run: ```unix @@ -120,7 +125,7 @@ $ tika_start.sh To stop tika server: ```unix $ tika_stop.sh -`` +``` ### NER models @@ -148,7 +153,7 @@ FARO creates an "output" folder inside the parent folder of `docker` normally th * `output/scan.$CURRENT_TIME.csv`: is a csv file with the score given to the document and the frequence of indicators in each file. ``` -filepath,score,monetary_quantity,signature,personal_email,mobile_phone_number,financial_data,document_id,custom_words,meta:content-type,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:num_words,meta:num_chars,meta:ocr +filepath,score,money,signature,personal_email,mobile,financial_data,id_document,custom_word,meta:content-type,meta:encrypted,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:ocr /Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf,high,0,0,0,0,0,1,4,application/pdf,Powered By Crystal,1,es,,85739,219,1185,False /Users/test/code/FARO_datasets/quick_test_data/Factura_Plancha.pdf,high,6,0,0,0,0,2,8,application/pdf,Python PDF Library - http://pybrary.net/pyPdf/,1,es,,77171,259,1524,True /Users/test/code/FARO_datasets/quick_test_data/20190912-FS2019.pdf,high,3,0,0,0,0,1,2,application/pdf,FPDF 1.6,1,es,2019-09-12T20:08:19Z,1545,62,648,False @@ -157,17 +162,17 @@ filepath,score,monetary_quantity,signature,personal_email,mobile_phone_number,fi * `output/scan.$CURRENT_TIME.entity`: is a json with the list of indicators (disaggregated) extracted in a file. 
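As a side note, the `scan.$CURRENT_TIME.csv` report above can be post-processed with standard tooling. Below is a minimal sketch (the timestamped file name is hypothetical, since the real one is derived from `$CURRENT_TIME`) that loads the scan and keeps only the documents scored as `high`:

```python
import csv

# Hypothetical file name: the real report is output/scan.$CURRENT_TIME.csv
SCAN_CSV = "output/scan.2019-12-11_14-19-17.csv"

with open(SCAN_CSV, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Column names taken from the CSV header shown above
        if row["score"] == "high":
            print(row["filepath"], row["meta:content-type"], row["id_document"], row["money"])
```

Back to the `.entity` report: each line of that file is a standalone JSON record.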
For example: ``` -{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf", "entities": {"custom_words": {"facturar": 3, "total": 1}, "prob_currency": {"12,0021": 1, "12,00": 1, "9,92": 1, "3,9921": 1, "3,99": 1, "3,30": 1, "15,99": 1, "13,21": 1, "1.106.166": 1, "1,00": 1, "99,00": 1}, "document_id": {"89821284M": 1}}, "datetime": "2019-12-11 14:19:17"} -{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_Plancha.pdf", "entities": {"document_id": {"H82547761": 1, "21809943D": 2}, "custom_words": {"factura": 2, "facturar": 2, "total": 2, "importe": 2}, "monetary_quantity": {"156,20": 4, "2,84": 2, "0,00": 2, "159,04": 2, "32,80": 4, "191,84": 2}, "prob_currency": {"1,00": 6, "189,00": 2}}, "datetime": "2019-12-11 14:19:27"} -{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/20190912-FS2019.pdf", "entities": {"document_id": {"C-01107564": 1}, "custom_words": {"factura": 1, "total": 1}, "monetary_quantity": {"3,06": 1, "0,64": 1, "3,70": 1}}, "datetime": "2019-12-11 14:19:33"} +{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf", "entities": {"custom_word": {"facturar": 3, "total": 1}, "probable_currency_amount": {"12,0021": 1, "12,00": 1, "9,92": 1, "3,9921": 1, "3,99": 1, "3,30": 1, "15,99": 1, "13,21": 1, "1.106.166": 1, "1,00": 1, "99,00": 1}, "id_document": {"89821284M": 1}}, "datetime": "2019-12-11 14:19:17"} +{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_Plancha.pdf", "entities": {"id_document": {"H82547761": 1, "21809943D": 2}, "custom_word": {"factura": 2, "facturar": 2, "total": 2, "importe": 2}, "money": {"156,20": 4, "2,84": 2, "0,00": 2, "159,04": 2, "32,80": 4, "191,84": 2}, "probable_currency_amount": {"1,00": 6, "189,00": 2}}, "datetime": "2019-12-11 14:19:27"} +{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/20190912-FS2019.pdf", "entities": {"document_id": {"C-01107564": 1}, "custom_word": {"factura": 1, "total": 1}, "money": {"3,06": 1, "0,64": 1, "3,70": 1}}, "datetime": "2019-12-11 14:19:33"} ``` #### Finetuning Faro Execution After adding OCR there are some configuration that can be customized for FARO execution through environment variables: * `FARO_DISABLE_OCR`: if this variable is found (with any value) FARO will not execute OCR on the documents -* `FARO_REQUESTS_TIMEOUT`: Number of seconds before FARO will timeout if the tika server does not respond (default: 60) -* `FARO_PDF_OCR_RATIO`: Bytes per character used in PDF mixed documents (text and images) to force OCR (default: 150 bytes/char) +* `FARO_REQUESTS_TIMEOUT`: Number of seconds before FARO will timeout if the tika server does not respond (default: 300) +* `FARO_PDF_OCR_RATIO`: Bytes per character used in PDF mixed documents (text and images) to force OCR (default: 500 bytes/char) Logging configuration can also be configured through environment variables: @@ -205,7 +210,7 @@ a) `.entity`: a json with the list of entities ordered by their type b) `.score`: a json with the types of entities and the number this type of entity appears in the text. This json also contains the sensitivy score in the property "score" (it can be "low", "medium" and "high"). 
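Both output files are plain JSON, so they are straightforward to consume from scripts. A minimal sketch, assuming the detection script has produced a `.score` file next to the analysed document (the file name here is hypothetical):

```python
import json

# Hypothetical name: the ".score" report written for the analysed file
with open("invoice.pdf.score", encoding="utf-8") as f:
    report = json.load(f)

print("sensitivity:", report["score"])        # "low", "medium" or "high"
for indicator, count in report["summary"].items():
    print(indicator, count)                   # e.g. money 1, mobile 1, ...
```

An example of the `.score` content: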
``` -{"score": "high", "summary": {"monetary_quantity": 1, "mobile_phone_number": 1, "personal_email": 1, "credit_account_number": 2}} +{"score": "high", "summary": {"money": 1, "mobile": 1, "personal_email": 1, "financial_data": 2}} ``` For information about additional arguments that can be passed to our detection script, take a look [here](#faro-detection-additional-arguments). @@ -220,17 +225,17 @@ The FARO entity detector performs two steps: The list of indicators is the following: - * **monetary_quantity**: money quantity (currently only euros and dollars are supported). + * **money**: money quantity (currently only euros and dollars are supported). * **signature**: it outputs the person who signs a document * **personal_email**: emails that are not corporate (e.g. not info@ rrhh@ ) - * **mobile_phone_number**: mobile phone numbers (filtering out non mobile ones) + * **mobile**: mobile phone numbers (filtering out non mobile ones) * **financial_data**: credit cards and IBAN account numbers - * **document_id**: Spanish NIF and CIF. + * **id_document**: Spanish NIF and CIF. The unique counts of these indicators are gathered in a json object and relayed as input to the next step. @@ -246,51 +251,48 @@ The following rules are applied: ### Configuration -It employs a YAML set of files for configuring its functionality (the YAML files are located inside the "config" folder) - -* common.yaml: has the common functionality to every language +It employs a set of configuration files for its functionality (located inside the "conf" folder): -* .yaml: has the specific configuration for a language (currently only spanish is supported: "es" code). It also indicates where the ML Models are located (e.g. by default inside the "models" folder) +* `commons.yaml`: has the common configuration for the tool. +* `config.py`: sets up logging for faro execution. #### Configuration of the sensitivity score These are a collection of conditions that select a score following the specification of the configuration file. The levels are configured in the sensitivity_list sorted by their intensity (from less to more sensitive). The sensitivity dict contains the conditions (min, max) ordered by type of entity. The system only needs to fulfill one condition of a certain level in order to flag the document with that level of sensitivity. Furthermore, if multiple KPIs of a certain level are found in the document (as marked by the sensitivity_multiple_kpis parameter), the system increases the sensitivity level (e.g. from medium to high). ``` -sensitivity_list: - - low - - medium - - high - - -sensitivity_multiple_kpis: 3 - sensitivity: - low: - person_position: - min: 1 - max: 5 - monetary_quantity: - min: 1 - max: 5 - - signature: - min: 0 - max: 0 - - personal_email: - min: 0 - max: 0 - - .... - + sensitivity_list: + - low + - medium + - high + sensitivity_multiple_kpis: 3 ``` * sensitivity_list is the list of different sensitivity scores ordered by intensity. * sensitivity_multiple_kpis indicates the number of simultaneous scores in a level allowed before leveling up the sensitivity score -* sensitivity is a dict with the sensitivity conditions that must be satisfied in order to reach a sensitivity level. +In addition, each entity can be configured with the number of occurrences needed to reach each level (low, medium or high) by means of a per-entity sensitivity dict holding the conditions that must be satisfied to reach each sensitivity level, as shown in the sketch and YAML excerpt below. 
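To make the min/max windows concrete, here is an illustrative sketch of how an entity count could be mapped to a level under this scheme. It is not FARO's actual implementation (that lives in `faro/sensitivity_score.py`); the thresholds are copied from the `MONEY` entry shown in the excerpt that follows:

```python
from typing import Optional

SENSITIVITY_LIST = ["low", "medium", "high"]   # ordered from least to most sensitive

# Thresholds copied from the MONEY entry in conf/commons.yaml (see excerpt below)
MONEY = {
    "low": {"min": 1, "max": 6},
    "medium": {"min": 6, "max": 65535},
    "high": {"min": 65535, "max": 65535},
}

def level_for_count(count: int, thresholds: dict) -> Optional[str]:
    """Return the most sensitive level whose [min, max) window contains the count."""
    level = None
    for name in SENSITIVITY_LIST:
        window = thresholds[name]
        if window["min"] <= count < window["max"]:
            level = name
    return level

print(level_for_count(3, MONEY))    # -> "low"
print(level_for_count(10, MONEY))   # -> "medium"
```

The corresponding per-entity block in `conf/commons.yaml` looks like this: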
+ +``` +entities: + MONEY: + description: money + output: true + sensitivity: + low: + min: 1 + max: 6 + medium: + min: 6 + max: 65535 + high: + min: 65535 + max: 65535 + .... +``` + ### Supported Input File Formats @@ -309,8 +311,8 @@ Mails are extracted with RegExp. A ML classifier and heuristics are used to dist `--dump`: the system dumps the information of .score to stdout in csv format. E.g. an example of output might be: ``` -id_file,score,person_jobposition_organization,monetary_quantity,sign,personal_email,mobile_phone_number,credit_account_number,id_document -data/test/test2.pdf,medium,3,0,1,0,0,0,0 +filepath,score,money,signature,personal_email,mobile,financial_data,id_document,custom_word,meta:content-type,meta:encrypted,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:ocr +/Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf,high,0,0,0,0,0,1,4,application/pdf,Powered By Crystal,1,es,,85739,219,1185,False ``` diff --git a/faro/test/__init__.py b/conf/__init__.py similarity index 100% rename from faro/test/__init__.py rename to conf/__init__.py diff --git a/conf/commons.yaml b/conf/commons.yaml new file mode 100755 index 0000000..827432d --- /dev/null +++ b/conf/commons.yaml @@ -0,0 +1,149 @@ +entities: + PER: + description: person + output: false + ORG: + description: organization + output: false + LOC: + description: localization + output: false + MISC: + description: miscelaneous + output: false + FINANCIAL_DATA: + description: financial_data + output: true + sensitivity: + low: + min: 0 + max: 0 + medium: + min: 0 + max: 0 + high: + min: 1 + max: 65535 + MONEY: + description: money + output: true + sensitivity: + low: + min: 1 + max: 6 + medium: + min: 6 + max: 65535 + high: + min: 65535 + max: 65535 + PROB_CURRENCY: + description: probable_currency_amount + output: false + EMAIL: + description: personal_email + output: true + sensitivity: + low: + min: 1 + max: 2 + medium: + min: 2 + max: 65535 + high: + min: 65535 + max: 65535 + CORP_EMAIL: + description: corporate_email + output: false + ID_DOCUMENT: + description: id_document + output: true + sensitivity: + low: + min: 0 + max: 0 + medium: + min: 0 + max: 0 + high: + min: 1 + max: 65535 + MOBILE: + description: mobile + output: true + sensitivity: + low: + min: 1 + max: 2 + medium: + min: 2 + max: 4 + high: + min: 4 + max: 65535 + PHONE: + description: phone + output: false + SIGNATURE: + description: signature + output: true + max_distance: 15 + sensitivity: + low: + min: 0 + max: 0 + medium: + min: 1 + max: 2 + high: + min: 2 + max: 65535 + CUSTOM: + description: custom_word + output: true + sensitivity: + low: + min: 0 + max: 0 + medium: + min: 0 + max: 0 + high: + min: 1 + max: 65535 + +plugins: + all: false + available_list: + - financial_data + - mobile + - credit_card + - id_document + - phone + - money + - custom_word + - email + - corporate_email + - probable_currency_amount + - person + - organization + - signature + - address_bitcoin + +# These entities need to be synchronized with faro_spider.sh +spider_output_entities: + - money + - signature + - personal_email + - mobile + - financial_data + - id_document + - custom_word + +sensitivity: + sensitivity_list: + - low + - medium + - high + sensitivity_multiple_kpis: 3 diff --git a/conf/config.py b/conf/config.py new file mode 100755 index 0000000..52906eb --- /dev/null +++ b/conf/config.py @@ -0,0 +1,12 @@ +# Logger +import logging +import os + +LOG_FILE_NAME = 'faro-community.log' +LOG_LEVEL = os.getenv('FARO_LOG_LEVEL', "INFO") + 
+logging.basicConfig( + level=LOG_LEVEL, + format="%(levelname)s: %(name)20s: %(message)s", + handlers=[logging.StreamHandler()] + ) \ No newline at end of file diff --git a/config/commons.yaml b/config/commons.yaml deleted file mode 100755 index cc6f5a4..0000000 --- a/config/commons.yaml +++ /dev/null @@ -1,153 +0,0 @@ -features: - PER: - description: person - output: false - ORG: - description: organization - output: false - LOC: - description: localization - output: false - MISC: - description: miscelaneous - output: false - FINANCIAL_DATA: - description: financial_data - output: true - MONEY: - description: monetary_quantity - output: true - PROB_CURRENCY: - description: probable_currency_amount - output: false - EMAIL: - description: personal_email - output: true - CORP_EMAIL: - description: corporate_email - output: false - ID_DOCUMENT: - description: document_id - output: true - MOBILE: - description: mobile_phone_number - output: true - PHONE: - description: phone_number - output: false - SIGNATURE: - description: signature - output: true - max_distance: 15 - CUSTOM: - description: custom_words - output: true - -# Features analyzed by faro that will be written to the output -# together with the document metadata and final score -# TODO: how to do this better, maybe wait until we have a frontend? -scoring_output_features: - - monetary_quantity - - signature - - personal_email - - mobile_phone_number - - financial_data - - document_id - - custom_words - - -sensitivity_list: - - low - - medium - - high - -sensitivity_multiple_kpis: 3 - -sensitivity: - low: - monetary_quantity: - min: 1 - max: 6 - - signature: - min: 0 - max: 0 - - personal_email: - min: 1 - max: 2 - - mobile_phone_number: - min: 1 - max: 2 - - financial_data: - min: 0 - max: 0 - - document_id: - min: 0 - max: 0 - - custom_words: - min: 0 - max: 0 - - medium: - monetary_quantity: - min: 6 - max: 65535 - - signature: - min: 1 - max: 2 - - personal_email: - min: 2 - max: 65535 - - mobile_phone_number: - min: 2 - max: 4 - - financial_data: - min: 0 - max: 0 - - document_id: - min: 0 - max: 0 - - custom_words: - min: 0 - max: 0 - - high: - monetary_quantity: - min: 65535 - max: 65535 - - signature: - min: 2 - max: 65535 - - personal_email: - min: 65535 - max: 65535 - - mobile_phone_number: - min: 4 - max: 65535 - - financial_data: - min: 1 - max: 65535 - - document_id: - min: 1 - max: 65535 - - custom_words: - min: 1 - max: 65535 - diff --git a/config/es.yaml b/config/es.yaml deleted file mode 100755 index 38736ec..0000000 --- a/config/es.yaml +++ /dev/null @@ -1,35 +0,0 @@ -regexp_config: - CreditCard: - word_file: keywords_creditcard_es.txt - left_span_len: 20 - right_span_len: 0 - - FinancialData: - word_file: keywords_financialdata_es.txt - left_span_len: 20 - right_span_len: 0 - - DNI_SPAIN: - word_file: keywords_dni_spain_es.txt - left_span_len: 20 - right_span_len: 0 - - PHONE: - word_file: keywords_phone_es.txt - left_span_len: 20 - right_span_len: 0 - - MOBILE: - word_file: keywords_mobile_es.txt - left_span_len: 20 - right_span_len: 0 - -email_config: - excl_file: excl_corp_email_es.txt - -ner_config: - nlp_model : es_core_news_sm - -custom_config: - word_file: keywords_custom_words_es.txt - diff --git a/config/nolanguage.yaml b/config/nolanguage.yaml deleted file mode 100755 index c9e5913..0000000 --- a/config/nolanguage.yaml +++ /dev/null @@ -1,35 +0,0 @@ -regexp_config: - CreditCard: - word_file: keywords_creditcard_es.txt - left_span_len: 20 - right_span_len: 0 - - FinancialData: - word_file: 
keywords_financialdata_es.txt - left_span_len: 20 - right_span_len: 0 - - DNI_SPAIN: - word_file: keywords_dni_spain_es.txt - left_span_len: 20 - right_span_len: 0 - - PHONE: - word_file: keywords_phone_es.txt - left_span_len: 20 - right_span_len: 0 - - MOBILE: - word_file: keywords_mobile_es.txt - left_span_len: 20 - right_span_len: 0 - -email_config: - excl_file: excl_corp_email_es.txt - -ner_config: - nlp_model : xx_ent_wiki_sm - -custom_config: - word_file: keywords_custom_words_es.txt - diff --git a/docker/Dockerfiles/faro/commands/test-local.sh b/docker/Dockerfiles/faro/commands/test-local.sh index bb62b82..6e32e13 100755 --- a/docker/Dockerfiles/faro/commands/test-local.sh +++ b/docker/Dockerfiles/faro/commands/test-local.sh @@ -17,4 +17,4 @@ then echo "Error: Looks like tika server is unreachable" exit 1 fi -nosetests -sv ./test_*.py ./faro/test/test_*.py --with-coverage --cover-package=faro +nosetests -sv ./test/test_*.py --with-coverage --cover-package=faro diff --git a/docker/README.md b/docker/README.md index 7f1f910..231baa9 100755 --- a/docker/README.md +++ b/docker/README.md @@ -30,8 +30,8 @@ If on the other hand you want to develop or contribute to faro use the [developm First you'll need to download the images binaries from our repo to your target machine, for example: ``` $ cd ~/Downloads -$ wget https://github.com/ElevenPaths/FARO/releases/download/v2.0.0/faro.tar.gz -$ wget https://github.com/ElevenPaths/FARO/releases/download/v2.0.0/tika.tar.gz +$ wget https://github.com/ElevenPaths/FARO/releases/download/v3.0.0/faro.tar.gz +$ wget https://github.com/ElevenPaths/FARO/releases/download/v3.0.0/tika.tar.gz ``` Once in your target machine you'll need to load those images into docker, for example: @@ -91,7 +91,7 @@ FARO creates an "output" folder inside the parent folder of `docker` normally th * `output/scan.$CURRENT_TIME.csv`: is a csv file with the score given to the document and the frequence of indicators in each file. ``` -filepath,score,monetary_quantity,signature,personal_email,mobile_phone_number,financial_data,document_id,custom_words,meta:content-type,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:num_words,meta:num_chars,meta:ocr +filepath,score,money,signature,personal_email,mobile,financial_data,id_document,custom_word,meta:content-type,meta:encrypted,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:ocr /Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf,high,0,0,0,0,0,1,4,application/pdf,Powered By Crystal,1,es,,85739,219,1185,False /Users/test/code/FARO_datasets/quick_test_data/Factura_Plancha.pdf,high,6,0,0,0,0,2,8,application/pdf,Python PDF Library - http://pybrary.net/pyPdf/,1,es,,77171,259,1524,True /Users/test/code/FARO_datasets/quick_test_data/20190912-FS2019.pdf,high,3,0,0,0,0,1,2,application/pdf,FPDF 1.6,1,es,2019-09-12T20:08:19Z,1545,62,648,False @@ -100,9 +100,9 @@ filepath,score,monetary_quantity,signature,personal_email,mobile_phone_number,fi * `output/scan.$CURRENT_TIME.entity`: is a json with the list of indicators (disaggregated) extracted in a file. 
For example: ``` -{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf", "entities": {"custom_words": {"facturar": 3, "total": 1}, "prob_currency": {"12,0021": 1, "12,00": 1, "9,92": 1, "3,9921": 1, "3,99": 1, "3,30": 1, "15,99": 1, "13,21": 1, "1.106.166": 1, "1,00": 1, "99,00": 1}, "document_id": {"89821284M": 1}}, "datetime": "2019-12-11 14:19:17"} -{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_Plancha.pdf", "entities": {"document_id": {"H82547761": 1, "21809943D": 2}, "custom_words": {"factura": 2, "facturar": 2, "total": 2, "importe": 2}, "monetary_quantity": {"156,20": 4, "2,84": 2, "0,00": 2, "159,04": 2, "32,80": 4, "191,84": 2}, "prob_currency": {"1,00": 6, "189,00": 2}}, "datetime": "2019-12-11 14:19:27"} -{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/20190912-FS2019.pdf", "entities": {"document_id": {"C-01107564": 1}, "custom_words": {"factura": 1, "total": 1}, "monetary_quantity": {"3,06": 1, "0,64": 1, "3,70": 1}}, "datetime": "2019-12-11 14:19:33"} +{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf", "entities": {"custom_word": {"facturar": 3, "total": 1}, "probable_currency_amount": {"12,0021": 1, "12,00": 1, "9,92": 1, "3,9921": 1, "3,99": 1, "3,30": 1, "15,99": 1, "13,21": 1, "1.106.166": 1, "1,00": 1, "99,00": 1}, "id_document": {"89821284M": 1}}, "datetime": "2019-12-11 14:19:17"} +{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_Plancha.pdf", "entities": {"id_document": {"H82547761": 1, "21809943D": 2}, "custom_word": {"factura": 2, "facturar": 2, "total": 2, "importe": 2}, "money": {"156,20": 4, "2,84": 2, "0,00": 2, "159,04": 2, "32,80": 4, "191,84": 2}, "probable_currency_amount": {"1,00": 6, "189,00": 2}}, "datetime": "2019-12-11 14:19:27"} +{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/20190912-FS2019.pdf", "entities": {"document_id": {"C-01107564": 1}, "custom_word": {"factura": 1, "total": 1}, "money": {"3,06": 1, "0,64": 1, "3,70": 1}}, "datetime": "2019-12-11 14:19:33"} ``` ## Run tests diff --git a/faro/detector.py b/faro/detector.py deleted file mode 100755 index 15c6235..0000000 --- a/faro/detector.py +++ /dev/null @@ -1,284 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- -import logging -import os -import spacy -from .utils import normalize_text, clean_text -from stdnum import get_cc_module -from stdnum.luhn import validate -from stdnum.exceptions import InvalidChecksum, InvalidFormat -from .ner import NER -from .email import EmailFilter -from .ner_regex import RegexNer -from .custom_word import CustomWordDetector -from collections import OrderedDict - -CWD = os.path.dirname(__file__) -CONFIG_PATH = os.path.join(CWD, '..', 'config') -MODELS_PATH = os.path.join(CWD, '..', 'models') -_COMMONS_YAML = "%s/commons.yaml" % CONFIG_PATH - -logger = logging.getLogger(__name__) - - -class Detector(object): - """ Main class for extracting KPIs of confidential documents - - """ - - def _get_signature(self, person_signed_idx, next_person_has_signed, total_ent_list): - if next_person_has_signed: - min_itx_signed = self.signature_max_distance - id_min_itx = -1 - - for i in range(len(total_ent_list)): - ent = total_ent_list[i] - if ent[1] == "PER" and int(ent[3]) > person_signed_idx and int( - ent[3]) - person_signed_idx < min_itx_signed: - min_itx_signed = int(ent[3]) - person_signed_idx - id_min_itx = i - next_person_has_signed = False - - if id_min_itx != -1: - ent = total_ent_list[id_min_itx] - 
total_ent_list.append((ent[0], "SIGNATURE", ent[2], ent[3], ent[4])) - - def _extract_entities_ml(self, sent, offset, total_ent_list): - if self.ml_ner is not None: - ent_list_ner = self.ml_ner.get_model_entities(sent) - - for ent in ent_list_ner: - # storing as entity/label pair - new_ent = [ent[0], - ent[1], - "NER", - str(int(ent[2]) + offset), - str(int(ent[3]) + offset)] - - total_ent_list.append(new_ent) - - def _entity_regex_email(self, ent, offset, total_ent_list): - if self.corp_email_class.is_corp_email(ent[0]): - total_ent_list.append(( - ent[0], - "CORP_EMAIL", - ent[1], - str(ent[2] + offset), - str(ent[3] + offset))) - else: - total_ent_list.append((ent[0], - "EMAIL", - ent[1], - str(ent[2] + offset), - str(ent[3] + offset))) - - @staticmethod - def _entity_regex_credit_card(ent, offset, total_ent_list): - sent = clean_text(ent[0]) - try: - if validate(sent): - logger.debug( - "Credit card accepted {}.{}".format(sent, ent[0])) - - total_ent_list.append((ent[0], - "FINANCIAL_DATA", - ent[1], - str(ent[2] + offset), - str(ent[3] + offset))) - - except (InvalidChecksum, InvalidFormat): - logger.debug("Wrong credit card {}.{}.".format(sent, ent[0])) - - def _entity_signed_person(self, total_ent_list, person_signed_idx, next_person_has_signed): - min_itx_signed = self.signature_max_distance - id_min_itx = -1 - - for i in range(len(total_ent_list)): - _ent = total_ent_list[i] - - if _ent[1] == "PER" and int(_ent[3]) > person_signed_idx and int( - _ent[3]) - person_signed_idx < min_itx_signed: - min_itx_signed = (int(_ent[3]) - person_signed_idx) - id_min_itx = i - next_person_has_signed = False - - if id_min_itx != -1: - _ent = total_ent_list[id_min_itx] - - total_ent_list.append((_ent[0], "SIGNATURE", _ent[2], _ent[3], _ent[4])) - return next_person_has_signed - - @staticmethod - def _entity_financial_data(ent, ent_key, offset, total_ent_list): - sent = clean_text(ent[0]) - if get_cc_module('es', 'ccc').is_valid(sent) or get_cc_module('es', 'iban').is_valid(sent): - total_ent_list.append((ent[0], ent_key, ent[1], str(ent[2] + offset), str(ent[3] + offset))) - else: - logger.debug("Invalid financial data {}.{}".format(sent, ent[0])) - - @staticmethod - def _entity_id_document(ent, ent_key, offset, total_ent_list): - sent = clean_text(ent[0]) - if (get_cc_module('es', 'dni').is_valid(sent) or - get_cc_module('es', 'cif').is_valid(sent) or - get_cc_module('es', 'nie').is_valid(sent)): - total_ent_list.append((ent[0], ent_key, ent[1], str(ent[2] + offset), str(ent[3] + offset))) - else: - logger.debug("Invalid data ID document {}.{}".format(sent, ent[0])) - - def _extract_entities_regex(self, offset, sent, full_text, total_ent_list, next_person_has_signed): - ent_list_regex = self.regex_ner.regex_detection(sent, full_text, offset) - - for ent_key in ent_list_regex.keys(): - for ent in ent_list_regex[ent_key]: - - # We treat differently common corporative/personal emails - if ent_key == "EMAIL": - self._entity_regex_email(ent, offset, total_ent_list) - - elif ent_key == "SIGNATURE": - next_person_has_signed = True - person_signed_idx = int(ent[3]) + offset - - elif ent_key == "CREDIT_CARD": - self._entity_regex_credit_card(ent, offset, total_ent_list) - elif ent_key == "FINANCIAL_DATA": - self._entity_financial_data(ent, ent_key, offset, total_ent_list) - elif ent_key == "ID_DOCUMENT": - self._entity_id_document(ent, ent_key, offset, total_ent_list) - else: - total_ent_list.append((ent[0], - ent_key, - ent[1], - str(ent[2] + offset), - str(ent[3] + offset))) - if 
next_person_has_signed: - self._entity_signed_person(total_ent_list, person_signed_idx, next_person_has_signed) - - def _detection_custom_word(self, sent, offset, total_ent_list): - custom_list = self.custom_detector.search_custom_words(sent) - for _ent in custom_list: - total_ent_list.append((_ent[0], - _ent[1], - _ent[0], - str(_ent[2] + offset), - str(_ent[3] + offset))) - - def _get_kpis(self, sent_list): - """ Extract KPIs from document """ - - # full_text is used for proximity detection - full_text = "".join(sent_list) - - total_ent_list = [] - - # Flag to indicate that a sign entity is expected (if True) - next_person_has_signed = False - person_signed_idx = 0 - - offset = 0 - - for sent in sent_list: - line_length = len(sent) - - # extract entities (ML) - self._extract_entities_ml(sent, offset, total_ent_list) - - # extract entities (Regex) - self._extract_entities_regex(offset, sent, full_text, total_ent_list, next_person_has_signed) - - # detection of custom words - self._detection_custom_word(sent, offset, total_ent_list) - - offset += line_length - - self._get_signature(person_signed_idx, next_person_has_signed, total_ent_list) - - return total_ent_list - - @staticmethod - def _get_unique_ents(ent_list): - """ Process the entities to obtain a json object """ - unique_ent_dict = {} - for _ent in ent_list: - if _ent[1] not in unique_ent_dict: - unique_ent_dict[_ent[1]] = {} - if _ent[0] not in unique_ent_dict[_ent[1]]: - unique_ent_dict[_ent[1]][_ent[0]] = 0 - unique_ent_dict[_ent[1]][_ent[0]] += 1 - return unique_ent_dict - - def analyse(self, content): - """ Obtain KPIs from a document and obtain the output in the right format (json) - - Keyword arguments: - content -- list of sentences to obtain the entities - - """ - total_ent_list = self._get_kpis(content) - unique_ent_dict = self._get_unique_ents(total_ent_list) - return unique_ent_dict - - def __init__(self, config): - """ Intialization - - Keyword Arguments: - config -- a dict with yaml configuration parameters - - Properties - nlp -- a spacy model or None - custom_word_list -- list with custom words - regexp_config_dict -- configuration of the proximity detections - signature_max_distance -- maximum distance between distance and signature - low_priority_list -- list of entity types with low priority - - """ - - # build the system here - nlp = None - cfg_section = "ner_config" - cfg_item = "nlp_model" - if cfg_section in config and cfg_item in config[cfg_section]: - nlp = spacy.load(config[cfg_section][cfg_item]) - - # Custom word that the organization wants to detect as sensitive - custom_word_list = [] - cfg_section = "custom_config" - cfg_item = "word_file" - if cfg_section in config and cfg_item in config[cfg_section]: - with open('%s/%s' % (CONFIG_PATH, config[cfg_section][cfg_item]), "r") as f_in: - custom_word_list = [line.rstrip("\n") for line in f_in] - - # configuration of the proximity regexp - regexp_config_dict = {} - if "regexp_config" in config: - for key in config["regexp_config"]: - regexp_config_dict[key] = {} - regexp_config_dict[key]["left_span_len"] = int(config["regexp_config"][key]["left_span_len"]) - regexp_config_dict[key]["right_span_len"] = int(config["regexp_config"][key]["right_span_len"]) - - with open('%s/%s' % ( - CONFIG_PATH, config["regexp_config"][key]["word_file"]), "r") as f_in: - word_list = [normalize_text(line.rstrip("\n").strip()) for line in f_in] - - regexp_config_dict[key]["word_list"] = word_list - - # Email filter known corporative (non sensitive) email accounts - cfg_section 
= "email_config" - cfg_item = "excl_file" - if cfg_section in config and cfg_item in config[cfg_section]: - with open('%s/%s' % (CONFIG_PATH, config[cfg_section][cfg_item]), "r") as f_in: - excl_corp_list = [line.rstrip("\n") for line in f_in] - - if nlp is not None: - self.ml_ner = NER(nlp, None) - else: - self.ml_ner = None - - self.custom_detector = CustomWordDetector(nlp, custom_word_list) - - self.regex_ner = RegexNer(regexp_config_dict=regexp_config_dict) - self.corp_email_class = EmailFilter(excl_corp_list) - - max_distance = config["features"]["SIGNATURE"]["max_distance"] - self.signature_max_distance = max_distance diff --git a/faro/document.py b/faro/document.py index 2fe8351..9c977ae 100755 --- a/faro/document.py +++ b/faro/document.py @@ -1,10 +1,14 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- import logging -from .utils import preprocess_file_content -from .io_parser import parse_file +import os +from conf import config from collections import OrderedDict +from faro.io_parser import parse_file +from logger import logger +from utils.utils import preprocess_file_content + META_AUTHOR = "meta:author" META_CONTENT_TYPE = "meta:content-type" META_ENCRYPTED = "meta:encrypted" @@ -14,7 +18,8 @@ META_FILE_SIZE = "meta:filesize" META_OCR = "meta:ocr" -logger = logging.getLogger(__name__) +script_name = os.path.basename(__file__) +faro_logger = logger.Logger(logger_name=script_name, file_name=config.LOG_FILE_NAME, logging_level=config.LOG_LEVEL) def _assign_author_metadata(metadata): @@ -144,7 +149,8 @@ def _parse_metadata(self, metadata): meta_dict -- dict of metadata (as returned by tika) """ - logger.debug("METADATA DICT {}".format(metadata)) + message = "METADATA DICT {}".format(metadata) + faro_logger.debug(script_name, self._parse_metadata.__name__, message) if metadata is None: self.metadata_error = True @@ -170,14 +176,14 @@ def _parse_metadata(self, metadata): # Creation date self.creation_date = _assign_creation_date_metadata(metadata) - def get_document_data(self): + def parse_document_data(self): """ Launch tika parser and retrieve both content and metadata """ # parse input file and join sentences if requested try: tika_content, tika_metadata = parse_file(self.document_path) - except Exception: + except Exception as e: tika_content = "" tika_metadata = None @@ -196,3 +202,4 @@ def __init__(self, document_path, split_lines): self.document_path = document_path # store wether or not we should split lines or not self.split_lines = split_lines + self.content = {} diff --git a/faro/faro_entrypoint.py b/faro/faro_entrypoint.py index e21265e..fae1f03 100755 --- a/faro/faro_entrypoint.py +++ b/faro/faro_entrypoint.py @@ -1,31 +1,33 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- -import logging -import io -import os -import sys import csv -import yaml +import datetime +import io import json +import sys import time -import datetime -from langdetect import detect +from pathlib import Path + +import yaml from langdetect import DetectorFactory -from langdetect.lang_detect_exception import LangDetectException -from faro.detector import Detector + +from conf import config +from faro.document import FARODocument +from faro.language.language_detection import language_detection from faro.sensitivity_score import SensitivityScorer -from .document import FARODocument +from logger import logger +from plugins.orchestrator import Orchestrator -CWD = os.path.dirname(__file__) -CONFIG_PATH = os.path.join(CWD, '..', 'config') -MODELS_PATH = os.path.join(CWD, '..', 'models') -_COMMONS_YAML = 
"%s/commons.yaml" % CONFIG_PATH +CWD = Path(__file__).parent.parent +CONFIG_PATH = CWD / "conf" +_COMMONS_YAML = CONFIG_PATH / "commons.yaml" ACCEPTED_LANGS = ["es"] # init the seeed of the lang detection algorithm DetectorFactory.seed = 0 -logger = logging.getLogger(__name__) +script_name = Path(__file__).name +faro_logger = logger.Logger(logger_name=script_name, file_name=config.LOG_FILE_NAME, logging_level=config.LOG_LEVEL) def _check_input_params(params): @@ -46,27 +48,13 @@ def _check_input_params(params): if not hasattr(params, 'dump'): params.dump = False - return params - - -def _customize_faro_engine_by_language(lang): - # TODO: refactor code, we need to simplify the flow since docs with no content - # go through a lot of unnecessary processing - if lang in ACCEPTED_LANGS: - with open("%s/%s.%s" % (CONFIG_PATH, lang, "yaml"), "r") as stream: - config = yaml.load(stream, Loader=yaml.FullLoader) - else: - logger.debug("Language {} is not fully supported. All the " + - "functionality is only implemented for these languages: {}".format( - lang, " ".join(ACCEPTED_LANGS))) - - with open("%s/nolanguage.%s" % (CONFIG_PATH, "yaml"), "r") as stream: - config = yaml.load(stream, Loader=yaml.FullLoader) - return config + if not hasattr(params, 'filehash'): + params.filehash = None + return params -def _generate_entities_output(entities, params, config): +def _generate_entities_output(entities, params, conf): """ Generate entities output humanizing feature descriptions """ @@ -75,16 +63,23 @@ def _generate_entities_output(entities, params, config): if not params.verbose: # Dict comprehension to filter out not verbose output filtered_entities = {k: v for k, - v in entities.items() if config["features"][k]["output"] == True} + v in entities.items() if conf["entities"][k]["output"] == True} else: filtered_entities = entities - output_entities = {config["features"][k]["description"]: v for k, - v in filtered_entities.items()} + output_entities = {conf["entities"][k]["description"]: v for k, + v in filtered_entities.items()} entity_dict = {"filepath": params.input_file, "entities": output_entities, "datetime": st} + return entity_dict + + +def _persist_entities_output(entity_dict, params): + """ + Persist detected entities to disk + """ with io.open(params.output_entity_file, "a+") as f_out: f_out.write("{}\n".format(json.dumps(entity_dict, ensure_ascii=False))) @@ -104,7 +99,7 @@ def _compute_scoring(scorer, entities, faro_doc): return result -def _generate_scoring_output(result, params, config, faro_doc): +def _generate_scoring_output(result, params, conf, faro_doc): # Adding metadata to output result.update(faro_doc.get_metadata()) @@ -117,61 +112,62 @@ def _generate_scoring_output(result, params, config, faro_doc): # Create list with output fieldnames header = ["id_file", "score"] #  Add all sensitive info categories - header.extend(config["scoring_output_features"]) + header.extend(conf["spider_output_entities"]) # Add document metadata header.extend(faro_doc.get_metadata().keys()) writer = csv.DictWriter(sys.stdout, fieldnames=header, extrasaction='ignore', restval=0) result["id_file"] = params.input_file - logging.debug("JSON (Entities detected) {}".format( - json.dumps(result, ensure_ascii=False))) + message = "JSON (Entities detected) {}".format( + json.dumps(result, ensure_ascii=False)) + faro_logger.debug(script_name, + _generate_scoring_output.__name__, + message) writer.writerow(result) -def language_detection(file_lines): - try: - lang = detect(" ".join(file_lines)) - except 
LangDetectException: - lang = "unk" - return lang - - def faro_execute(params): """ Execution of the main loop """ # Validate params params = _check_input_params(params) # reading commons configuration - with open(_COMMONS_YAML, "r") as f_stream: + with open(_COMMONS_YAML, "r", encoding='utf8') as f_stream: commons_config = yaml.load(f_stream, Loader=yaml.FullLoader) # parse input file and join sentences if requested - logger.info("Analysing {}".format(params.input_file)) + message = "Analysing {}".format(params.input_file) + faro_logger.info(script_name, faro_execute.__name__, message) # Initialize our document representation faro_doc = FARODocument(params.input_file, params.split_lines) # Parse document and extract content and metadata - faro_doc.get_document_data() + faro_doc.parse_document_data() # Language customization lang = language_detection(faro_doc.content) faro_doc.set_language(lang) - config = _customize_faro_engine_by_language(lang) + lang = {"lang": lang} # joining two dicts with configurations # config becomes a shallowly merged dictionary with values from commons_config - #  replacing those from config - config = {**config, **commons_config} + # replacing those from config + conf = {**lang, **commons_config} - # instantiate detector with current configuration - my_detector = Detector(config) - # Detect features in the document content - entities_dict = my_detector.analyse(faro_doc.content) + faro_logger.debug(script_name, faro_execute.__name__, "Running plug-ins") + orchestrator = Orchestrator(conf) + entities_dict = orchestrator.run_plugins(str(faro_doc.content)) # Initialize our scoring class - scorer = SensitivityScorer(config) + scorer = SensitivityScorer(conf) + # score the document, given the extracted entities - result = _compute_scoring(scorer, entities_dict, faro_doc) + scoring = _compute_scoring(scorer, entities_dict, faro_doc) # output - _generate_entities_output(entities_dict, params, config) - _generate_scoring_output(result, params, config, faro_doc) + result = _generate_entities_output(entities_dict, params, conf) + + faro_logger.debug(script_name, faro_execute.__name__, str(entities_dict)) + faro_logger.debug(script_name, faro_execute.__name__, str(result)) + + _persist_entities_output(result, params) + _generate_scoring_output(scoring, params, conf, faro_doc) diff --git a/faro/io_parser.py b/faro/io_parser.py index 62c5580..7afd8fe 100755 --- a/faro/io_parser.py +++ b/faro/io_parser.py @@ -1,16 +1,25 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- -import logging -from tika import parser, tika import collections +import logging import os +import sys +from pathlib import Path + +from conf import config +from logger import logger +from tika import parser, tika -logger = logging.getLogger(__name__) # Tika-python will assume the server is running and will not try to download nor start a new tika server +from utils.utils import log_exception + tika.TikaClientOnly = True CHARS_PER_PAGE_PDF = 'pdf:charsPerPage' +script_name = Path(__file__).name +faro_logger = logger.Logger(logger_name=script_name, file_name=config.LOG_FILE_NAME, logging_level=config.LOG_LEVEL) + def flatten(iterable): for el in iterable: @@ -35,18 +44,19 @@ def _is_run_ocr(parsed, file_size, pdf_ocr_ratio): force_ocr = True else: filesize_chars_ratio = file_size / chars - logger.debug("PDF filesize_chars_ratio: {:.2f}".format(filesize_chars_ratio)) + message = "size: {}, chars: {}, ratio: {}".format( + file_size, + chars, + filesize_chars_ratio) + faro_logger.debug(script_name, 
_is_run_ocr.__name__, message) if filesize_chars_ratio > pdf_ocr_ratio: force_ocr = True - logger.debug('size: {}, chars: {}, ratio: {}'.format( - file_size, - chars, - filesize_chars_ratio)) return force_ocr def _run_force_ocr(parsed, file_path, request_options): - logger.info("performing OCR on PDF file: {}".format(file_path)) + message = "performing OCR on PDF file: {}".format(file_path) + faro_logger.info(script_name, _run_force_ocr.__name__, message) parsed['metadata']['ocr_parsing'] = True parsed_ocr_text = parser.from_file( file_path, @@ -74,15 +84,16 @@ def _smarter_strategy_ocr_pdf(parsed, disable_ocr, file_size, pdf_ocr_ratio, fil if parsed['metadata']['Content-Type'] == 'application/pdf': force_ocr = _is_run_ocr(parsed, file_size, pdf_ocr_ratio) - + message = "force_ocr {}".format(force_ocr) + faro_logger.debug(script_name, _smarter_strategy_ocr_pdf.__name__, message) if force_ocr: _run_force_ocr(parsed, file_path, request_options) except KeyError as e: - logger.debug("Did not find key {} in metadata".format(e)) + log_exception(faro_logger, script_name, _smarter_strategy_ocr_pdf.__name__, e, sys) raise e except Exception as e: - logger.error("Unexpected exception while treating PDF OCR strategy {}".format(e)) + log_exception(faro_logger, script_name, _smarter_strategy_ocr_pdf.__name__, e, sys) raise e @@ -96,8 +107,8 @@ def parse_file(file_path): """ # Retrieve envvars - timeout = int(os.getenv('FARO_REQUESTS_TIMEOUT', 60)) - pdf_ocr_ratio = int(os.getenv('FARO_PDF_OCR_RATIO', 150)) + timeout = int(os.getenv('FARO_REQUESTS_TIMEOUT', 300)) + pdf_ocr_ratio = int(os.getenv('FARO_PDF_OCR_RATIO', 500)) disable_ocr = os.getenv('FARO_DISABLE_OCR', False) # OCR is time consuming we will need to raise the request timeout to allow for processing @@ -116,7 +127,7 @@ def parse_file(file_path): if 'X-TIKA:EXCEPTION:runtime' in parsed['metadata']: return parsed['content'], parsed['metadata'] except Exception as e: - logger.error("Unexpected exception during parsing {}".format(e)) + log_exception(faro_logger, script_name, _smarter_strategy_ocr_pdf.__name__, e, sys) raise e # try to implement a smarter strategy for OCRing PDFs diff --git a/faro/language/__init__.py b/faro/language/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/faro/language/language_detection.py b/faro/language/language_detection.py new file mode 100755 index 0000000..33415a6 --- /dev/null +++ b/faro/language/language_detection.py @@ -0,0 +1,39 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +import logging +import sys +from conf import config +from langdetect import DetectorFactory +from langdetect import detect_langs +from langdetect.lang_detect_exception import LangDetectException +from logger import logger +from pathlib import Path + +from utils.utils import log_exception + +DetectorFactory.seed = 0 + +script_name = Path(__file__).name +faro_logger = logger.Logger(logger_name=script_name, file_name=config.LOG_FILE_NAME, logging_level=config.LOG_LEVEL) + + +def language_detection(file_lines): + lang = "unk" + try: + """ + El detect no funciona correctamente. 
+ Detecta 'ca' en vez de 'es' + """ + # lang = detect(" ".join(file_lines)) + # print("Detector: " + lang) + file_lines = " ".join(file_lines) + probabilities = detect_langs(file_lines) + # print(probabilities) + if probabilities: + lang = probabilities[0].lang + faro_logger.debug(script_name, + language_detection.__name__, + "lang: %s" % lang) + except LangDetectException as e: + log_exception(faro_logger, script_name, language_detection.__name__, e, sys) + return lang diff --git a/faro/ner.py b/faro/ner.py deleted file mode 100755 index 038d3a4..0000000 --- a/faro/ner.py +++ /dev/null @@ -1,60 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- -import logging -from .utils import preprocess_text - - -logger = logging.getLogger(__name__) - - -class NER(object): - """ A class to extract entities using different NERs """ - - @staticmethod - def _spacy(doc, ent_list): - # using SpaCy - for ent in doc.ents: - if ent.label_.upper() in ["PER", "ORG"]: - ent_list.append((ent.text, ent.label_.upper(), ent.start_char, ent.end_char, ent.start)) - - def _spacy_extra_models(self, u_text, ent_list): - if self.nlp_extra is not None: - for nlp_e in self.nlp_extra: - doc = nlp_e(u_text) - - for ent in doc.ents: - ent_list.append((ent.text, ent.label_, ent.start_char, ent.end_char, ent.start)) - - - def get_model_entities(self, sentence): - """ Get enttities with a NER ML model (Spacy) - - Keyword arguments: - sentence -- a string with a sentence or paragraph - - """ - - u_text = preprocess_text(sentence) - - doc = self.nlp(u_text) - - # extracting entities with spacy - ent_list = [] - - # Detect entities: PER -> Persons and ORG -> Organizations - self._spacy(doc, ent_list) - - # extracting entities with crfs - self._spacy_extra_models(u_text, ent_list) - return ent_list - - def __init__(self, nlp, nlp_extra=None): - """ Initialization - - Keyword arguments: - nlp: spacy model - nlp_extra: additional spacy models (e.g. 
with custom entities) (default None) - """ - - self.nlp = nlp - self.nlp_extra = nlp_extra diff --git a/faro/ner_regex.py b/faro/ner_regex.py deleted file mode 100755 index 6a6275d..0000000 --- a/faro/ner_regex.py +++ /dev/null @@ -1,232 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- -import logging -import copy -import re -from .utils import clean_text, normalize_text - -logger = logging.getLogger(__name__) - -# Email -STRICT_REG_EMAIL_ADDRESS_V0 = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+" - -# Credit Card -STRICT_REG_CREDIT_CARD_V0 = ( - r"(?:(?P((?((?((?= self.ranking_dict[ - self.sensitivity_list[current_idx] - ][key]["min"]: - above_min += 1 - except KeyError: - logger.debug("could not find %s in scoring computation" % ( - key)) + for key in self.features: + if self.features[key]['description'] in summary_dict: + try: + if summary_dict[self.features[key]['description']] >= self.features[key]["sensitivity"][ + self.sensitivity_list[current_idx] + ]["min"]: + above_min += 1 + except KeyError: + message = "Could not find %s in scoring computation" % key + faro_logger.debug(script_name, self._check_index_surpass_min_specified.__name__, message) if (above_min > self.sensitivity_multiple_kpis and current_idx < len(self.sensitivity_list) - 1): @@ -36,23 +41,23 @@ def _get_ranking(self, summary_dict): reached_min = False current_idx = 0 - for key in summary_dict: - try: - while (self.ranking_dict[self.sensitivity_list[ - current_idx] - ][key]["max"] <= summary_dict[key]): - current_idx += 1 - reached_min = True - # check if we are already in the max level of sensitivity - if current_idx == len(self.sensitivity_list) - 1: - break - - if summary_dict[key] >= self.ranking_dict[ - self.sensitivity_list[current_idx]][key]["min"]: - reached_min = True - except KeyError: - logger.debug("could not find %s in scoring computation" % ( - key)) + for key in self.features: + if self.features[key]['description'] in summary_dict: + try: + while self.features[key]["sensitivity"][self.sensitivity_list[current_idx]]["max"] \ + <= summary_dict[self.features[key]['description']]: + current_idx += 1 + reached_min = True + # check if we are already in the max level of sensitivity + if current_idx == len(self.sensitivity_list) - 1: + break + + if summary_dict[self.features[key]['description']] >= \ + self.features[key]["sensitivity"][self.sensitivity_list[current_idx]]["min"]: + reached_min = True + except KeyError: + message = "could not find %s in scoring computation" % key + faro_logger.debug(script_name, self._get_ranking.__name__, message) if reached_min: self._check_index_surpass_min_specified(summary_dict, current_idx) @@ -95,14 +100,14 @@ def get_sensitivity_score(self, entity_dict): result_dict["score"] = self._get_ranking(result_dict) return result_dict - def __init__(self, config): + def __init__(self, conf): """ Initialization Keyword arguments: faro configuration """ - self.ranking_dict = config['sensitivity'] - self.sensitivity_list = config['sensitivity_list'] - self.sensitivity_multiple_kpis = config['sensitivity_multiple_kpis'] - self.features = config['features'] + + self.sensitivity_list = conf['sensitivity']['sensitivity_list'] + self.features = conf['entities'] + self.sensitivity_multiple_kpis = conf['sensitivity']['sensitivity_multiple_kpis'] diff --git a/faro/test/data/sensitive_data.docx b/faro/test/data/sensitive_data.docx deleted file mode 100755 index 5b840ec..0000000 Binary files a/faro/test/data/sensitive_data.docx and /dev/null differ diff --git 
a/faro/test/test_ner_regex.py b/faro/test/test_ner_regex.py deleted file mode 100755 index 5101b1b..0000000 --- a/faro/test/test_ner_regex.py +++ /dev/null @@ -1,1033 +0,0 @@ -#!/usr/bin/env python -# -*- coding: utf-8 -*- -import unittest - -from faro import ner_regex -from faro.utils import clean_text - -MSG_PHONE_NOT_DETECTED = "Phone was not detected" -MSG_PHONE_DETECTED = "PHONE was detected" -MSG_PHONE_WRONG_DETECTED = "wrong phone detected" -MSG_MOBILE_NOT_DETECTED = "Mobile phone was not detected" -MSG_MOBILE_WRONG_DETECTED = "wrong mobile detected" -MSG_IBAN_NOT_DETECTED = "IBAN was not detected" -MSG_IBAN_WRONG_DETECTED = "wrong IBAN detected" -MSG_ID_DOCUMENT_NOT_DETECTED = "ID_DOCUMENT was not detected" -MSG_DNI_WRONG_DETECTED = "wrong dni detected" -MSG_NIF_WRONG_DETECTED = "wrong NIF detected" -MSG_MONEY_NOT_DETECTED = "MONEY was not detected" -MSG_CURRENCY_WRONG_DETECTED = "wrong currency detected" -MSG_EMAIL_NOT_DETECTED = "email was not detected" -MSG_EMAIL_HACK_NOT_DETECTED = "Email Hack was not detected" -MSG_EMAIL_WRONG_DETECTED = "wrong email detected" -MSG_EMAIL_HACK_DETECTED = "Wrong email Hack was detected" -MSG_CREDIT_CARD_NOT_DETECTED = "credit card was not detected" -MSG_CREDIT_CARD_DETECTED = "wrong credit card detected" -MSG_QUANTITY_DETECTED = "Wrong quantity detected" -MSG_PROB_CURRENCY_NOT_DETECTED = "PROB_CURRENCY was not detected" -MONEY_THOUSANDS = "1,000" - -MSG_TEXT = "{} {} {}. Text {}" -MSG_DETECTED = "{} {}. Detected {}" -MSG_EXTRACTED = "{} {}. Extracted {}" - - -class NerRegexTest(unittest.TestCase): - - def setUp(self): - """ Setting up for the test """ - pass - - def tearDown(self): - """ Cleaning up after the test """ - pass - - def test_regexinit(self): - """ Test the initialization of the regex detection class """ - ner_regex.RegexNer() - - def test_0_broad_phone_number_v0(self): - """ Test the detection of a phone number """ - - test = "Mi teléfono es 988 888 888 " - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "broad") - - self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["PHONE"]): - if _regexp[1] == "BROAD_REG_PHONE_NUMBER_APPROX_V3": - idx = i - break - - self.assertEqual(clean_text(result["PHONE"][idx][0].strip()), - "988888888", - "{self.shortDescription()} {MSG_PHONE_NOT_DETECTED}. 
Extracted {result['PHONE'][idx]}") - - def test_1_broad_phone_number_v0(self): - """ Test the detection of a phone number """ - - test = "Mi teléfono es +34 988 888 888 " - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "broad") - - self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["PHONE"]): - if _regexp[1] == "BROAD_REG_PHONE_NUMBER_APPROX_V3": - idx = i - break - - self.assertEqual(clean_text(result["PHONE"][idx][0].strip()), - "34988888888", - MSG_EXTRACTED.format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result["PHONE"][idx])) - - def test_2_broad_phone_number_v0(self): - """ Test the detection of a wrong phone number """ - - test = "Mi teléfono es +34 988 888 888 456" - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "broad") - - self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result)) - - for i, _regexp in enumerate(result["PHONE"]): - if _regexp[1] == "BROAD_REG_PHONE_NUMBER_GEN_V3": - idx = i - break - - self.assertEqual(clean_text(result["PHONE"][idx][0].strip()), - "34988888888", - MSG_EXTRACTED.format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result["PHONE"][idx])) - - def test_3_broad_phone_number_v0(self): - """ Test the detection of a wrong phone number """ - - test = "Mi teléfono es 45 988 888 888" - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "broad") - - self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_DETECTED, - result)) - - for i, _regexp in enumerate(result["PHONE"]): - if _regexp[1] == "BROAD_REG_PHONE_NUMBER_GEN_V3": - idx = i - break - - self.assertEqual(clean_text(result["PHONE"][idx][0].strip()), - "988888888", - MSG_EXTRACTED.format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result["PHONE"][idx])) - - def test_0_CP_MOBILE_NUMBER_V0(self): - """ Test the detection of a phone number """ - - test = "Mi teléfono móvil es 688 888 888 " - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "broad") - - self.assertTrue("MOBILE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MOBILE_NOT_DETECTED, - result)) - - for i, _regexp in enumerate(result["MOBILE"]): - if _regexp[1] == "BROAD_REG_MOBILE_NUMBER_GEN_V3": - idx = i - break - - self.assertEqual(clean_text(result["MOBILE"][idx][0].strip()), - "688888888", - MSG_EXTRACTED.format( - self.shortDescription(), - MSG_MOBILE_NOT_DETECTED, - result["MOBILE"][idx])) - - def test_1_broad_mobile_number_V0(self): - """ Test the detection of a mobile phone number """ - - test = "Mi teléfono es 45 688 888 888 " - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "broad") - - self.assertTrue("MOBILE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MOBILE_NOT_DETECTED, - result)) - - for i, _regexp in enumerate(result["MOBILE"]): - if _regexp[1] == "BROAD_REG_MOBILE_NUMBER_GEN_V3": - idx = i - break - - self.assertEqual(clean_text(result["MOBILE"][idx][0].strip()), - "688888888", - MSG_EXTRACTED.format( - self.shortDescription(), - MSG_MOBILE_NOT_DETECTED, - result["MOBILE"][idx])) - - def test_2_broad_mobile_number_v0(self): - """ Test the detection of a mobile phone number """ - - test = "Mi teléfono es 45688 888 888 " - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "broad") - - self.assertTrue("MOBILE" in result, - "{} {} 
{}".format( - self.shortDescription(), - MSG_MOBILE_NOT_DETECTED, - result)) - - for i, _regexp in enumerate(result["MOBILE"]): - if _regexp[1] == "BROAD_REG_MOBILE_NUMBER_GEN_V3": - idx = i - break - - self.assertEqual(clean_text(result["MOBILE"][idx][0].strip()), - "688888888", - MSG_EXTRACTED.format( - self.shortDescription(), - MSG_MOBILE_NOT_DETECTED, - result["MOBILE"][idx])) - - def test_strict_iban_v0(self): - """ Test the detection of the IBAN account """ - - test = ("This is the IBAN of the account ES91 2100 0418 4502 " + - "0005 1332 .") - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "strict") - - self.assertTrue("FINANCIAL_DATA" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_IBAN_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["FINANCIAL_DATA"]): - if _regexp[1] == "STRICT_REG_IBAN_V1": - idx = i - break - - self.assertEqual(clean_text(result["FINANCIAL_DATA"][idx][0]), - "ES9121000418450200051332", - MSG_DETECTED.format( - self.shortDescription(), - MSG_IBAN_WRONG_DETECTED, - result["FINANCIAL_DATA"][idx])) - - def test_broad_iban_V1(self): - """ Test the detection of the IBAN account """ - - test = "This is the IBAN of the account ES91 2100 4334471600021142." - - proximity_dict = {"FINANCIAL_DATA": {"left_span_len": 20, - "right_span_len": 0, - "word_list": ["iban"]}} - - ner = ner_regex.RegexNer(regexp_config_dict=proximity_dict) - result = ner._detect_regexp(test, "broad") - - self.assertTrue("FINANCIAL_DATA" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_IBAN_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["FINANCIAL_DATA"]): - if _regexp[1] == "BROAD_REG_IBAN_APPROX_V1": - idx = i - break - - self.assertEqual(clean_text(result["FINANCIAL_DATA"][idx][0]), - "ES9121004334471600021142", - MSG_DETECTED.format( - self.shortDescription(), - MSG_IBAN_WRONG_DETECTED, - result["FINANCIAL_DATA"][idx])) - - def test_complete_iban_V1(self): - """ Test the detection of the IBAN account """ - - test = "This is the IBAN of the account ES91 2100 4334471600021142P ." 
- - proximity_dict = {"FINANCIAL_DATA": {"left_span_len": 20, - "right_span_len": 0, - "word_list": ["iban"]}} - - ner = ner_regex.RegexNer(regexp_config_dict=proximity_dict) - result = ner.regex_detection(test, full_text=test) - - self.assertTrue("FINANCIAL_DATA" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_IBAN_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["FINANCIAL_DATA"]): - if _regexp[1] == "BROAD_REG_IBAN_APPROX_V1": - idx = i - break - - self.assertEqual(clean_text(result["FINANCIAL_DATA"][idx][0]), - "ES9121004334471600021142", - MSG_DETECTED.format( - self.shortDescription(), - MSG_IBAN_WRONG_DETECTED, - result["FINANCIAL_DATA"][idx])) - - def test_is_not_iban_V0(self): - """ Test IBAN is not detected """ - - test = ("This is the IBAN of the account ES91 2100 0418 " + - "4502 0005 1332 4576 .") - ner = ner_regex.RegexNer() - - result = ner._detect_regexp(test, "strict") - - self.assertTrue("FINANCIAL_DATA" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_IBAN_NOT_DETECTED, - result)) - idx = -1 - for i, _regexp in enumerate(result["FINANCIAL_DATA"]): - if _regexp[1] == "BROAD_REG_IBAN_APPROX_V1": - idx = i - break - - self.assertEqual(clean_text(result["FINANCIAL_DATA"][idx][0]), - "ES9121000418450200051332", - MSG_DETECTED.format( - self.shortDescription(), - MSG_IBAN_WRONG_DETECTED, - result["FINANCIAL_DATA"][idx])) - - def test_strict_email_v0(self): - """ Detection of email v0 rule """ - - test = "the email of John is deadbeaf@foo.bar" - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("EMAIL" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_EMAIL_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["EMAIL"]): - if _regexp[1] == "STRICT_REG_EMAIL_ADDRESS_V0": - idx = i - break - - self.assertEqual(result["EMAIL"][idx][0], "deadbeaf@foo.bar", - MSG_DETECTED.format( - self.shortDescription(), - MSG_EMAIL_WRONG_DETECTED, - result["EMAIL"][idx])) - - def test_strict_credit_card_v0(self): - """ Detection of card v0 rule """ - - test = "the visa card is 4111111111111111." - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("CREDIT_CARD" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_CREDIT_CARD_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["CREDIT_CARD"]): - if _regexp[1] == "STRICT_REG_CREDIT_CARD_V0": - idx = i - break - - self.assertEqual(result["CREDIT_CARD"][idx][0], "4111111111111111", - MSG_DETECTED.format( - self.shortDescription(), - MSG_CREDIT_CARD_DETECTED, - result["CREDIT_CARD"][idx])) - - def test_strict_dni_v0(self): - """ Detection of DNI v0 rule """ - - test = "el dni de Juan es 66666666Y." - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("ID_DOCUMENT" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_ID_DOCUMENT_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["ID_DOCUMENT"]): - if _regexp[1] == "STRICT_REG_DNI_V0": - idx = i - break - - self.assertEqual(clean_text(result["ID_DOCUMENT"][idx][0]), "66666666Y", - MSG_DETECTED.format( - self.shortDescription(), - MSG_DNI_WRONG_DETECTED, - result["ID_DOCUMENT"][idx])) - - def test_dni_with_dash(self): - """ Detection of DNI v0 rule with letter separated by dash """ - - test = "el dni de Juan es 66666666-Y." 
- - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("ID_DOCUMENT" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_ID_DOCUMENT_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["ID_DOCUMENT"]): - if _regexp[1] == "STRICT_REG_DNI_V0": - idx = i - break - - self.assertEqual(clean_text(result["ID_DOCUMENT"][idx][0]), "66666666Y", - MSG_DETECTED.format( - self.shortDescription(), - MSG_DNI_WRONG_DETECTED, - result["ID_DOCUMENT"][idx])) - - def test_strict_dni_v1(self): - """ Detection of DNI v0 rule with Nº stuck to the number """ - - test = "el N.I.F. Nº15373458B." - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("ID_DOCUMENT" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_ID_DOCUMENT_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["ID_DOCUMENT"]): - if _regexp[1] == "STRICT_REG_DNI_V0": - idx = i - break - - self.assertEqual(clean_text(result["ID_DOCUMENT"][idx][0]), "15373458B", - MSG_DETECTED.format( - self.shortDescription(), - MSG_DNI_WRONG_DETECTED, - result["ID_DOCUMENT"][idx])) - - def test_strict_dni_v2(self): - """ Detection of DNI""" - - test = "15373458B" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("ID_DOCUMENT" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_ID_DOCUMENT_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["ID_DOCUMENT"]): - if _regexp[1] == "STRICT_REG_DNI_V0": - idx = i - break - - self.assertEqual(clean_text(result["ID_DOCUMENT"][idx][0]), "15373458B", - MSG_DETECTED.format( - self.shortDescription(), - MSG_DNI_WRONG_DETECTED, - result["ID_DOCUMENT"][idx])) - - def test_strict_dni_v3(self): - """ Detection of DNI""" - - test = "15373458Bé" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("ID_DOCUMENT" not in result, - "{} {} {}".format( - self.shortDescription(), - MSG_ID_DOCUMENT_NOT_DETECTED, - result)) - - def test_broad_phone_number_v4(self): - """ Detection of phone number """ - - test = "el teléfono de Juan es +34 986 000000" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "broad") - - self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["PHONE"]): - if _regexp[1] == "BROAD_REG_PHONE_NUMBER_GEN_V3": - idx = i - break - - self.assertEqual(clean_text(result["PHONE"][idx][0]), "34986000000", - MSG_DETECTED.format( - self.shortDescription(), - MSG_PHONE_WRONG_DETECTED, - result["PHONE"][idx])) - - def test_broad_phone_number_v5(self): - """ Detection of phone number """ - - test = "tfno.: ESP 980000001" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "broad") - - self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["PHONE"]): - if _regexp[1] == "BROAD_REG_PHONE_NUMBER_GEN_V3": - idx = i - break - - self.assertEqual(clean_text(result["PHONE"][idx][0]), "980000001", - MSG_DETECTED.format( - self.shortDescription(), - MSG_PHONE_WRONG_DETECTED, - result["PHONE"][idx])) - - def test_broad_phone_number_v6(self): - """ Detection of phone number """ - - test = "teléfono: ESP 980000001A" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "broad") - - 
self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result)) - - idx = -1 - for i, _regexp in enumerate(result["PHONE"]): - if _regexp[1] == "BROAD_REG_PHONE_NUMBER_GEN_V3": - idx = i - break - - self.assertEqual(clean_text(result["PHONE"][idx][0]), "980000001", - MSG_DETECTED.format( - self.shortDescription(), - MSG_PHONE_WRONG_DETECTED, - result["PHONE"][idx])) - - def test_complete_phone_number(self): - """ Detection of phone number """ - - test = "tel.: ESP 980000007." - - proximity_dict = {"PHONE": { - "left_span_len": 20, - "right_span_len": 0, - "word_list": ["tel."]}} - - ner = ner_regex.RegexNer(regexp_config_dict=proximity_dict) - result = ner.regex_detection(test, full_text=test) - - self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_NOT_DETECTED, - result)) - - self.assertEqual(clean_text(result["PHONE"][0][0]), "980000007", - MSG_DETECTED.format( - self.shortDescription(), - MSG_PHONE_WRONG_DETECTED, - result["PHONE"][0])) - - def test_complete_phone_number_v1(self): - """ Detection of phone number """ - - test = "fax: ESP 98000000" - - proximity_dict = {"PHONE": { - "left_span_len": 20, - "right_span_len": 0, - "word_list": ["tel."]}} - - ner = ner_regex.RegexNer(regexp_config_dict=proximity_dict) - - result = ner.regex_detection(test, full_text=test) - - self.assertTrue("PHONE" not in result, - "{} {} but it shouldn't{}".format( - self.shortDescription(), - MSG_PHONE_DETECTED, - result)) - - def test_complete_phone_number_v2(self): - """ Detection of phone number """ - - test = "tlf.: ESP 980000000" - - proximity_dict = {"PHONE": { - "left_span_len": 20, - "right_span_len": 0, - "word_list": ["tlf."]}} - - ner = ner_regex.RegexNer(regexp_config_dict=proximity_dict) - - result = ner.regex_detection(test, full_text=test) - - self.assertTrue("PHONE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PHONE_DETECTED, - result)) - - def test_complete_mobile_v0(self): - """ Detection of phone number """ - - test = "móvil: ESP 780000000" - - proximity_dict = {"MOBILE": { - "left_span_len": 20, - "right_span_len": 0, - "word_list": ["movil"]}} - - ner = ner_regex.RegexNer(regexp_config_dict=proximity_dict) - - result = ner.regex_detection(test, full_text=test) - - self.assertTrue("MOBILE" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MOBILE_NOT_DETECTED, - result)) - - self.assertEqual(clean_text(result["MOBILE"][0][0]), "780000000", - MSG_DETECTED.format( - self.shortDescription(), - MSG_MOBILE_WRONG_DETECTED, - result["MOBILE"][0])) - - def test_generic_money_v0(self): - """ Detection of money quantities """ - - test = "el total de la factura es 1,000." - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("PROB_CURRENCY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MONEY_NOT_DETECTED, - result)) - - # search for the rule CP_MONEY_V1 - for i in range(len(result["PROB_CURRENCY"])): - if result["PROB_CURRENCY"][i][1] == "STRICT_REG_MONEY_V1": - idx = i - - self.assertEqual(result["PROB_CURRENCY"][idx][0], MONEY_THOUSANDS, - MSG_DETECTED.format( - self.shortDescription(), - MSG_CURRENCY_WRONG_DETECTED, - result["PROB_CURRENCY"])) - - def test_generic_money_v1(self): - """ Detection of money quantities """ - - test = "el total de la factura es 1.000,000." 
- - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("PROB_CURRENCY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MONEY_NOT_DETECTED, - result)) - - # search for the rule CP_MONEY_V0 - for i in range(len(result["PROB_CURRENCY"])): - if result["PROB_CURRENCY"][i][1] == "STRICT_REG_MONEY_V0": - idx = i - - self.assertEqual(result["PROB_CURRENCY"][idx][0], "1.000,000", - MSG_DETECTED.format( - self.shortDescription(), - MSG_CURRENCY_WRONG_DETECTED, - result["PROB_CURRENCY"])) - - def test_generic_money_v2(self): - """ Detection of money quantities """ - - test = "el total de la factura es 1.000,000,52." - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("PROB_CURRENCY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PROB_CURRENCY_NOT_DETECTED, - result)) - - # search for the rule CP_MONEY_V0 - for i in range(len(result["PROB_CURRENCY"])): - if result["PROB_CURRENCY"][i][1] == "STRICT_REG_MONEY_V0": - idx = i - - self.assertEqual(result["PROB_CURRENCY"][idx][0], "1.000,000,52", - MSG_DETECTED.format( - self.shortDescription(), - MSG_CURRENCY_WRONG_DETECTED, - result["PROB_CURRENCY"])) - - def test_generic_money_v3(self): - """ Detection of money quantities """ - - test = "el total de la factura es C42.333.333" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("PROB_CURRENCY" not in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PROB_CURRENCY_NOT_DETECTED, - result)) - - def test_generic_money_v4(self): - """ Detection of money quantities """ - - test = "el total de la factura es 1.000,000,52." - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("PROB_CURRENCY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_PROB_CURRENCY_NOT_DETECTED, - result)) - - idx = -1 - # search for the rule CP_MONEY_V0 - for i in range(len(result["PROB_CURRENCY"])): - if result["PROB_CURRENCY"][i][1] == "CP_MONEY_V1": - idx = i - - self.assertEqual(idx, -1, - MSG_DETECTED.format( - self.shortDescription(), - MSG_QUANTITY_DETECTED, - result["PROB_CURRENCY"])) - - def test_strict_money_euro_v0(self): - """ Detection of euro currency """ - - test = "este aparato cuesta 1,000€" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("MONEY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MONEY_NOT_DETECTED, - result)) - - # search where euro is rules matches the sentence - for i in range(len(result["MONEY"])): - if result["MONEY"][i][1] == "STRICT_REG_EURO_V0": - idx = i - - self.assertEqual(result["MONEY"][idx][0], MONEY_THOUSANDS, - MSG_DETECTED.format( - self.shortDescription(), - MSG_CURRENCY_WRONG_DETECTED, - result["MONEY"])) - - def test_strict_money_euros_v0(self): - """ Detection of euro currency using the word euro """ - - test = "este aparato cuesta 1,000 euros" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("MONEY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MONEY_NOT_DETECTED, - result)) - - # search where euro is rules matches the sentence - for i in range(len(result["MONEY"])): - if result["MONEY"][i][1] == "STRICT_REG_EURO_V0": - idx = i - - self.assertEqual(result["MONEY"][idx][0], MONEY_THOUSANDS, - MSG_DETECTED.format( - self.shortDescription(), - MSG_CURRENCY_WRONG_DETECTED, - result["MONEY"])) - - def 
test_strict_money_euros_v1(self): - """ Detection of euro currency using the word euro """ - - test = "este aparato cuesta 1000 euros" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("MONEY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MONEY_NOT_DETECTED, - result)) - - # search where euro is rules matches the sentence - for i in range(len(result["MONEY"])): - if result["MONEY"][i][1] == "STRICT_REG_EURO_V0": - idx = i - - self.assertEqual(result["MONEY"][idx][0], "1000", - MSG_DETECTED.format( - self.shortDescription(), - MSG_CURRENCY_WRONG_DETECTED, - result["MONEY"])) - - def test_strict_money_euros_v2(self): - """ Detection of euro currency using the word euro """ - - test = "este aparato cuesta 1,000.00 euros" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("MONEY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MONEY_NOT_DETECTED, - result)) - - # search where euro is rules matches the sentence - for i in range(len(result["MONEY"])): - if result["MONEY"][i][1] == "STRICT_REG_EURO_V0": - idx = i - - self.assertEqual(result["MONEY"][idx][0], "1,000.00", - MSG_DETECTED.format( - self.shortDescription(), - MSG_CURRENCY_WRONG_DETECTED, - result["MONEY"])) - - def test_money_CP_EURO_V0_euros_v3(self): - """ Detection of euro currency using the word euro """ - - test = "este aparato cuesta 1,000.00 Euros" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("MONEY" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_MONEY_NOT_DETECTED, - result)) - - # search where euro is rules matches the sentence - for i in range(len(result["MONEY"])): - if result["MONEY"][i][1] == "STRICT_REG_EURO_V0": - idx = i - - self.assertEqual(result["MONEY"][idx][0], "1,000.00", - MSG_DETECTED.format( - self.shortDescription(), - MSG_CURRENCY_WRONG_DETECTED, - result["MONEY"])) - - def test_strict_cif_company(self): - """ Test the detection of the CIF of the company """ - - test = "El CIF de la compañía es A99151276" - - ner = ner_regex.RegexNer() - result = ner._detect_regexp(test, "strict") - - self.assertTrue("ID_DOCUMENT" in result, - "{} {} {}".format( - self.shortDescription(), - MSG_ID_DOCUMENT_NOT_DETECTED, - result)) - - # search where euro is rules matches the sentence - for i in range(len(result["ID_DOCUMENT"])): - if result["ID_DOCUMENT"][i][1] == "STRICT_REG_CIF_V0": - idx = i - - self.assertEqual(result["ID_DOCUMENT"][idx][0], "A99151276", - MSG_DETECTED.format( - self.shortDescription(), - MSG_NIF_WRONG_DETECTED, - result["ID_DOCUMENT"])) - - def test_email_hack_regex(self): - """ Test the detection of mail hacks """ - - test = "Enviar todos vuestros datos a infoAThacktextDOTcom" - - CP_EMAIL_HACK_V0 = (r"[a-zA-Z0-9_.+-]+\s?(\(|-)?\s?(AT|at)\s?(\)|-)?" 
+ - "\s?[a-zA-Z0-9-]+\s?(\(|-)?\s?(DOT|dot)\s" + - "?(\)|-)?\s?[a-zA-Z0-9-.]+") - - HACK_REGEX = {"Email_Hack": [(CP_EMAIL_HACK_V0, "CP_EMAIL_HACK_V0")]} - - ner = ner_regex.RegexNer(strict_regexp_dict=HACK_REGEX) - result = ner._detect_regexp(test, "strict") - - self.assertTrue("Email_Hack" in result, - MSG_TEXT.format( - self.shortDescription(), - MSG_EMAIL_HACK_NOT_DETECTED, - result, - test)) - - test = "Enviar todos vuestros datos a info AT hacktext DOT com" - - ner = ner_regex.RegexNer(strict_regexp_dict=HACK_REGEX) - result = ner._detect_regexp(test, "strict") - - self.assertTrue("Email_Hack" in result, - MSG_TEXT.format( - self.shortDescription(), - MSG_EMAIL_HACK_NOT_DETECTED, - result, - test)) - - test = "Enviar todos vuestros datos a info (AT) hacktext (DOT) com" - - ner = ner_regex.RegexNer(strict_regexp_dict=HACK_REGEX) - result = ner._detect_regexp(test, "strict") - - self.assertTrue("Email_Hack" in result, - MSG_TEXT.format( - self.shortDescription(), - MSG_EMAIL_HACK_NOT_DETECTED, - result, - test)) - - test = "Enviar todos vuestros datos a info-AT-hacktext-DOT-com" - - ner = ner_regex.RegexNer(strict_regexp_dict=HACK_REGEX) - result = ner._detect_regexp(test, "strict") - - self.assertTrue("Email_Hack" in result, - MSG_TEXT.format( - self.shortDescription(), - MSG_EMAIL_HACK_NOT_DETECTED, - result, - test)) - - test = "Enviar todos vuestros datos a info-at-hacktext-dot-com" - - ner = ner_regex.RegexNer(strict_regexp_dict=HACK_REGEX) - result = ner._detect_regexp(test, "strict") - - self.assertTrue("Email_Hack" in result, - MSG_TEXT.format( - self.shortDescription(), - MSG_EMAIL_HACK_NOT_DETECTED, - result, - test)) - - test = "Enviar todos vuestros datos a at-dot" - - ner = ner_regex.RegexNer(strict_regexp_dict=HACK_REGEX) - result = ner._detect_regexp(test, "strict") - - self.assertTrue("Email_Hack" not in result, - MSG_TEXT.format( - self.shortDescription(), - MSG_EMAIL_HACK_DETECTED, - result, - test)) - - -if __name__ == "__main__": - unittest.main() diff --git a/faro_detection.py b/faro_detection.py index d74ffbe..dd20828 100755 --- a/faro_detection.py +++ b/faro_detection.py @@ -1,25 +1,15 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- -import os import argparse -import logging -from faro.faro_entrypoint import faro_execute - +from pathlib import Path -def run(params): - log_level = os.getenv('FARO_LOG_LEVEL', "INFO") - log_file = os.getenv('FARO_LOG_FILE', None) - handlers = [logging.StreamHandler()] - if log_file is not None: - handlers.append(logging.FileHandler(log_file)) - logging.basicConfig( - level=log_level, - format="%(levelname)s: %(name)20s: %(message)s", - handlers=handlers - ) - faro_execute(params) +from conf import config +from faro.faro_entrypoint import faro_execute +from logger import logger +script_name = Path(__file__).name +faro_logger = logger.Logger(logger_name=script_name, file_name=config.LOG_FILE_NAME, logging_level=config.LOG_LEVEL) if __name__ == "__main__": parser = argparse.ArgumentParser() @@ -58,4 +48,4 @@ def run(params): if params.output_score_file is None: params.output_score_file = "{}{}".format(params.input_file, ".score") - run(params) + faro_execute(params) diff --git a/faro_spider.sh b/faro_spider.sh index 6b846f7..aaa6c79 100755 --- a/faro_spider.sh +++ b/faro_spider.sh @@ -11,7 +11,7 @@ fi CPU_PARALLEL_USAGE="50%" -echo 
"filepath,score,monetary_quantity,signature,personal_email,mobile_phone_number,financial_data,document_id,custom_words,meta:content-type,meta:encrypted,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:ocr" > output/scan.$SUFFIX.csv +echo "filepath,score,money,signature,personal_email,mobile,financial_data,id_document,custom_word,meta:content-type,meta:encrypted,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:ocr" > output/scan.$SUFFIX.csv # Run faro over a recursive list of appropriate filetypes find "$INPUT_PATH" -type f \( ! -regex '.*/\.[^.].*' \) | parallel -P $CPU_PARALLEL_USAGE python faro_detection.py -i {} --output_entity_file output/scan.$SUFFIX.entity --dump >> output/scan.$SUFFIX.csv diff --git a/logger/__init__.py b/logger/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/logger/logger.py b/logger/logger.py new file mode 100755 index 0000000..81fcf42 --- /dev/null +++ b/logger/logger.py @@ -0,0 +1,36 @@ +import logging +import logging.handlers +from pathlib import Path + + +class Logger: + + def __init__(self, logger_name, file_name, logging_level=logging.DEBUG, max_bytes=1000000, backup_count=3): + self.separator = " -- " + formatter = logging.Formatter('%(asctime)s -- %(message)s') + self.my_logger = logging.getLogger(logger_name) + self.my_logger.setLevel(logging_level) + file_log = Path(__file__).parent.parent / "logs" / file_name + handler = logging.handlers.RotatingFileHandler(file_log, maxBytes=max_bytes, backupCount=backup_count) + handler.setFormatter(formatter) + self.my_logger.addHandler(handler) + + def debug(self, file_name, method_name, message): + error_msg = "DEBUG" + self.separator + file_name + self.separator + method_name + self.separator + message + self.my_logger.debug(error_msg) + + def info(self, file_name, method_name, message): + error_msg = "INFO" + self.separator + file_name + self.separator + method_name + self.separator + message + self.my_logger.info(error_msg) + + def error(self, file_name, method_name, message): + error_msg = "ERROR" + self.separator + file_name + self.separator + method_name + self.separator + message + self.my_logger.error(error_msg) + + def warning(self, file_name, method_name, message): + error_msg = "WARNING" + self.separator + file_name + self.separator + method_name + self.separator + message + self.my_logger.warning(error_msg) + + def critical(self, file_name, method_name, message): + error_msg = "CRITICAL" + self.separator + file_name + self.separator + method_name + self.separator + message + self.my_logger.critical(error_msg) diff --git a/plugins/__init__.py b/plugins/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/plugins/address_bitcoin/__init__.py b/plugins/address_bitcoin/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/plugins/address_bitcoin/entrypoint.py b/plugins/address_bitcoin/entrypoint.py new file mode 100755 index 0000000..dd182bc --- /dev/null +++ b/plugins/address_bitcoin/entrypoint.py @@ -0,0 +1,56 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +from utils.pattern.entrypoint_pattern_base import PluginPatternEntrypointBase + +_CWD = os.path.dirname(__file__) + +MANIFEST = { + "name": "Address Bitcoin", + "key": "FINANCIAL_DATA", + "version": "1", + "description": "Address Bitcoin Detection", + "author": "Enrique and Hugo", + "email": "enrique@telefonica.com", + "is_lang_dependent": False, + "info": "https://en.bitcoin.it/wiki/Invoice_address" +} + + +class PluginEntrypoint(PluginPatternEntrypointBase): 
+ """ + Address Bitcoin Plugin entrypoint class. + + """ + + def __init__(self, text, lang='uk'): + """ + Initialize pattern plugin base plus additional parameters. + + :param text: Incoming text + :param lang: Detected language from FARO Core + """ + super().__init__(_CWD, MANIFEST["is_lang_dependent"], MANIFEST["key"], lang, text) + + def output(self, unconsolidated_lax_dict=None, consolidated_lax_dict=None, strict_ent_dict=None, + validate_dict=None): + """ + Default output generation method. It can be overriden. + + :param unconsolidated_lax_dict: Detected entities by lax regex expression from pattern.lax_regexp() not validated by context. + :param consolidated_lax_dict: Detected entities by lax regex expression from pattern.lax_regexp() validated by context. + :param strict_ent_dict: Detected entities by strict regex expression from pattern.strict_regexp() method. + :param validate_dict: Validated detected entities regex expression both from lax and strict from pattern.validate() method. + + :return: Output dictionary with detected entities to be returned. + """ + return super().output(validate_dict=validate_dict) + + def run(self): + """ + Public plugin interface. It can be overriden. + + :return: Output dictionary with detected entities. (Default output is generated by output method). + """ + return super().run() diff --git a/plugins/address_bitcoin/pattern.py b/plugins/address_bitcoin/pattern.py new file mode 100755 index 0000000..35e5b23 --- /dev/null +++ b/plugins/address_bitcoin/pattern.py @@ -0,0 +1,67 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os + +from stdnum.bitcoin import is_valid + +from utils.pattern.pattern_base import PluginPatternBase + +_CWD = os.path.dirname(__file__) + + +class PluginPattern(PluginPatternBase): + """ + Main plugin pattern class. + + """ + + def strict_regexp(self): + """ + Strict regexp method. + + :return: Dictionary with n strict regexp expressions such as + + { + "NAME_V0": r"strict_regex expresion_0", + "NAME_V1": r"strict_regex_expresion_1", + ... + "NAME_Vn": r"strict_regex_expresion_n" + } + + """ + return { + "STRICT_REG_BITCOIN_P2PKH_P2SH_V0": r"[13][a-km-zA-HJ-NP-Z0-9]{26,33}", + "STRICT_REG_BITCOIN_BECH32_V0": r"(bc1)[a-zA-HJ-NP-Z0-9]{25,39}" + + } + + def lax_regexp(self): + """ + Lax regexp method. + + :return: Dictionary with n lax regexp expressions such as + + { + "NAME_V0": r"lax_regex expresion_0", + "NAME_V1": r"lax_regex_expresion_1", + ... + "NAME_Vn": r"lax_regex_expresion_n" + } + + """ + pass + + def validate(self, ent): + """ + Validate detected entities. + + :param ent: Input entity + :return: Return whether (True) or not (False) entity is being validated. + """ + return is_valid(ent) + + def __init__(self, cwd=_CWD, lax_regexp=None, strict_regexp=None): + lax_regexp = self.lax_regexp() if lax_regexp is None else lax_regexp + strict_regexp = self.strict_regexp() if strict_regexp is None else strict_regexp + super().__init__(cwd, lax_regexp, strict_regexp) diff --git a/plugins/address_bitcoin/test/__init__.py b/plugins/address_bitcoin/test/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/plugins/address_bitcoin/test/data/document.txt b/plugins/address_bitcoin/test/data/document.txt new file mode 100755 index 0000000..9bad7dc --- /dev/null +++ b/plugins/address_bitcoin/test/data/document.txt @@ -0,0 +1,27 @@ + ok "bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq" + +n itself, who seeks after it and wants to have it, simply because it is pain..." +What is Lorem Ipsum? 
+ +Lorem Ipsum is simply dummy text bad valid 3J98t1WpEZ33CNmQviecrnyiWrnqRhWNLy of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. +Why do we use it? + +It is a long established fact that a reader will be distracted by the readable content of a bad valid bc1qar0srrr7xfkvy5l343lydnw9re59gtzzwf5mdq page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like). + +Where does it come from? + +Contrary to popular belief, Lorem Ipsum ok 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32. + +The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham. +Where can I get some? + +There bad regex bc2qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq are many variations of passages of Lorem Ipsum ok 3J98t1WpEZ73CNmQviecrny +iWrnqR +hWNLy available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc. 
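The ok / bad valid / bad regex markers in this fixture map onto a two-stage check: the strict regexps from plugins/address_bitcoin/pattern.py propose candidates, and stdnum's checksum validation then discards strings that merely look like addresses. A small sketch under that reading (the helper name is made up for illustration):

```python
import re
from stdnum.bitcoin import is_valid

# Strict patterns as defined in plugins/address_bitcoin/pattern.py
STRICT_REGEXPS = [
    r"[13][a-km-zA-HJ-NP-Z0-9]{26,33}",  # legacy P2PKH / P2SH
    r"(bc1)[a-zA-HJ-NP-Z0-9]{25,39}",    # bech32
]

def find_addresses(text):
    """Hypothetical helper: regex candidates filtered by checksum validation."""
    candidates = [m.group(0) for p in STRICT_REGEXPS for m in re.finditer(p, text)]
    # "bad regex" strings never match a pattern; "bad valid" strings match but fail is_valid()
    return [c for c in candidates if is_valid(c)]

sample = ("ok 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 "
          "bad valid 1BvBMSEYstWetqTFn0Au4m4GFg7xJaNVN2")
print(find_addresses(sample))  # ['1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2']
```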
+ + +bad valid 1BvBMSEYstWetqTFn0Au4m4GFg7xJaNVN2 + +bad regex 0J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy + + diff --git a/plugins/address_bitcoin/test/test_address_bitcoin.py b/plugins/address_bitcoin/test/test_address_bitcoin.py new file mode 100755 index 0000000..3faaeb0 --- /dev/null +++ b/plugins/address_bitcoin/test/test_address_bitcoin.py @@ -0,0 +1,39 @@ +import unittest +from pathlib import Path + +from plugins.address_bitcoin.entrypoint import PluginEntrypoint, MANIFEST + +CWD = Path(__file__).parent +INPUT_PATH = CWD / "data" +FILE_NAME = "document.txt" +GROUND_TRUTH_RESULT = ["1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2", "3J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy", + "bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq"] + + +def load_file(file_path): + with open(INPUT_PATH / file_path, "r", encoding='utf8') as f_stream: + return [f_stream.read().replace('\n', '')] + + +class AddressBitcoinTest(unittest.TestCase): + + def setUp(self): + """ Setting up for the test """ + pass + + def tearDown(self): + """ Cleaning up after the test """ + pass + + def test_for_address_bitcoin(self): + text = load_file(FILE_NAME) + address_bitcoin_plugin = PluginEntrypoint(text=text) + plugin_data = address_bitcoin_plugin.run() + results = list(plugin_data[MANIFEST['key']]) + self.assertTrue(len(results) == len(GROUND_TRUTH_RESULT)) + diff_list = (set(results) ^ set(GROUND_TRUTH_RESULT)) + self.assertTrue(len(diff_list) == 0) + + +if __name__ == '__main__': + unittest.main() diff --git a/plugins/corporate_email/__init__.py b/plugins/corporate_email/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/plugins/corporate_email/entrypoint.py b/plugins/corporate_email/entrypoint.py new file mode 100755 index 0000000..c28201c --- /dev/null +++ b/plugins/corporate_email/entrypoint.py @@ -0,0 +1,62 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +from pathlib import Path + +from utils.base_detector import get_unique_ents +from utils.email import EmailFilter +from utils.pattern.entrypoint_pattern_base import PluginPatternEntrypointBase +from utils.pattern.pattern_detector import PatternDetector + +_CWD = Path(__file__).parent +CONFIG_PATH = _CWD / "excl_corp_email_es.txt" + +MANIFEST = { + "name": "Corporate Email", + "key": "CORP_EMAIL", + "version": "0.1", + "type": "Corporate Email", + "description": "Corporate Email", + "author": "Hugo", + "email": "", +} + + +def config_email(): + # Email filter known corporative (non sensitive) email accounts + with open('%s' % CONFIG_PATH, "r", encoding='utf8') as f_in: + excl_corp_list = [line.rstrip("\n") for line in f_in] + return excl_corp_list + + +class PluginEntrypoint(PluginPatternEntrypointBase): + + def __init__(self, text, lang): + self.is_lang_dependent = False + super().__init__(_CWD, self.is_lang_dependent, MANIFEST["key"], lang, text) + excl_corp_list = config_email() + self.corp_email_class = EmailFilter(excl_corp_list) + self.emails_entities = list() + + def filter_emails(self, total_ent_strict_list): + for ent in total_ent_strict_list: + if self.corp_email_class.is_corp_email(ent[0]): + self.emails_entities.append(ent) + + def run_pattern_detector(self): + pattern_detector = PatternDetector(self.text, self.lang, self.get_pattern()) + + total_ent_strict_list, [], [], [] = pattern_detector.get_kpis( + self.text) + + self.filter_emails(total_ent_strict_list) + + return get_unique_ents(self.emails_entities) + + def output(self, unconsolidated_lax_dict=None, consolidated_lax_dict=None, strict_ent_dict=None, + validate_dict=None): + return 
super().output(strict_ent_dict=strict_ent_dict) + + def run(self): + unique_strict_ent_dict = self.run_pattern_detector() + return self.output(strict_ent_dict=unique_strict_ent_dict) diff --git a/config/excl_corp_email_es.txt b/plugins/corporate_email/excl_corp_email_es.txt similarity index 100% rename from config/excl_corp_email_es.txt rename to plugins/corporate_email/excl_corp_email_es.txt diff --git a/plugins/corporate_email/pattern.py b/plugins/corporate_email/pattern.py new file mode 100755 index 0000000..129f528 --- /dev/null +++ b/plugins/corporate_email/pattern.py @@ -0,0 +1,35 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +""" +**************************************************************** +**************************************************************** +******** ******** +******** La expresión regular incluye €/EUR/euros ******** +******** ******** +**************************************************************** +**************************************************************** +""" + +import os +from utils.pattern.pattern_base import PluginPatternBase + +_CWD = os.path.dirname(__file__) + + +class PluginPattern(PluginPatternBase): + + def strict_regexp(self): + return { + "STRICT_REG_EMAIL_ADDRESS_V0": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+" + } + + def lax_regexp(self): + pass + + def validate(self, ent): + pass + + def __init__(self, cwd=_CWD, lax_regexp=None, strict_regexp=None): + lax_regexp = self.lax_regexp() if lax_regexp is None else lax_regexp + strict_regexp = self.strict_regexp() if strict_regexp is None else strict_regexp + super().__init__(cwd, lax_regexp, strict_regexp) diff --git a/plugins/credit_card/__init__.py b/plugins/credit_card/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/plugins/credit_card/context-left.txt b/plugins/credit_card/context-left.txt new file mode 100755 index 0000000..e69de29 diff --git a/plugins/credit_card/context-right.txt b/plugins/credit_card/context-right.txt new file mode 100755 index 0000000..e69de29 diff --git a/config/keywords_creditcard_es.txt b/plugins/credit_card/context.txt similarity index 51% rename from config/keywords_creditcard_es.txt rename to plugins/credit_card/context.txt index 5a29909..8bc21a2 100755 --- a/config/keywords_creditcard_es.txt +++ b/plugins/credit_card/context.txt @@ -1,7 +1,4 @@ -tarjeta -credito -debito -visa -mastercard -cvc -cvv +visa +mastercard +cvc +cvv diff --git a/plugins/credit_card/context.yaml b/plugins/credit_card/context.yaml new file mode 100755 index 0000000..6e5db40 --- /dev/null +++ b/plugins/credit_card/context.yaml @@ -0,0 +1,5 @@ +regexp_config: + Context: + word_file: context.txt + left_span_len: 20 + right_span_len: 0 \ No newline at end of file diff --git a/plugins/credit_card/entrypoint.py b/plugins/credit_card/entrypoint.py new file mode 100755 index 0000000..32b0a58 --- /dev/null +++ b/plugins/credit_card/entrypoint.py @@ -0,0 +1,31 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +from utils.pattern.entrypoint_pattern_base import PluginPatternEntrypointBase + +_CWD = os.path.dirname(__file__) + +MANIFEST = { + "name": "Credit card", + "key": "FINANCIAL_DATA", + "version": "0.1", + "type": "Financial", + "description": "Credit card", + "author": "Enrique", + "email": "enrique@telefonica.com", +} + + +class PluginEntrypoint(PluginPatternEntrypointBase): + + def __init__(self, text, lang): + self.is_lang_dependent = True + super().__init__(_CWD, self.is_lang_dependent, MANIFEST["key"], lang, text) + + def output(self, 
unconsolidated_lax_dict=None, consolidated_lax_dict=None, strict_ent_dict=None, + validate_dict=None): + return super().output(validate_dict=validate_dict) + + def run(self): + return super().run() diff --git a/plugins/credit_card/es/__init__.py b/plugins/credit_card/es/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/plugins/credit_card/es/context-left.txt b/plugins/credit_card/es/context-left.txt new file mode 100755 index 0000000..e69de29 diff --git a/plugins/credit_card/es/context-right.txt b/plugins/credit_card/es/context-right.txt new file mode 100755 index 0000000..e69de29 diff --git a/plugins/credit_card/es/context.txt b/plugins/credit_card/es/context.txt new file mode 100755 index 0000000..8d07409 --- /dev/null +++ b/plugins/credit_card/es/context.txt @@ -0,0 +1,16 @@ +crédito +credito +credit +débito +debito +tarj. +tarj +tarjeta +tarjeta de débito +tarjeta de debito +tarjeta de crédito +tarjeta de credito +visa +mastercard +cvc +cvv diff --git a/plugins/credit_card/es/context.yaml b/plugins/credit_card/es/context.yaml new file mode 100755 index 0000000..9f7b854 --- /dev/null +++ b/plugins/credit_card/es/context.yaml @@ -0,0 +1,11 @@ +regexp_config: + Context: + word_file: context.txt + left_span_len: 20 + right_span_len: 0 + Context-left: + word_file: context.txt + span_len: 20 + Context-right: + word_file: context.txt + span_len: 0 \ No newline at end of file diff --git a/plugins/credit_card/es/pattern.py b/plugins/credit_card/es/pattern.py new file mode 100755 index 0000000..62e9169 --- /dev/null +++ b/plugins/credit_card/es/pattern.py @@ -0,0 +1,13 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +from ..pattern import PluginPattern as ParentPluginPattern + +_CWD = os.path.dirname(__file__) + + +class PluginPattern(ParentPluginPattern): + + def __init__(self): + super().__init__(cwd=_CWD, lax_regexp=self.lax_regexp(), strict_regexp=self.strict_regexp()) diff --git a/plugins/credit_card/pattern.py b/plugins/credit_card/pattern.py new file mode 100755 index 0000000..2f6f47f --- /dev/null +++ b/plugins/credit_card/pattern.py @@ -0,0 +1,35 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os + +from stdnum.luhn import is_valid + +from utils.pattern.pattern_base import PluginPatternBase + +# Own modules for plugin + +_CWD = os.path.dirname(__file__) + + +class PluginPattern(PluginPatternBase): + + def strict_regexp(self): + return { + "STRICT_REG_CREDIT_CARD_V0": r"(?:(?P((?((?((? 
person_signed_idx and int( + _ent[2]) - person_signed_idx < min_itx_signed: + min_itx_signed = (int(_ent[2]) - person_signed_idx) + id_min_itx = i + + if id_min_itx != -1: + _ent = self.feature_ent_list[id_min_itx] + self.signatures.append((_ent[0], MANIFEST["key"], _ent[2], _ent[3])) + + def update_signatures(self, total_ent_strict_list): + for signature in total_ent_strict_list: + self._entity_signed_person(signature[3]) + + def run_feature_detector(self): + feature_detector = FeatureDetector(self.text, self.lang, self) + self.feature_ent_list = feature_detector.get_kpis(self.text) + + def run_pattern_detector(self): + pattern_detector = PatternDetector(self.text, self.lang, self.get_pattern()) + total_ent_strict_list, [], [], [] = pattern_detector.get_kpis( + self.text) + self.update_signatures(total_ent_strict_list) + return get_unique_ents(self.signatures) + + def output(self, unconsolidated_lax_dict=None, consolidated_lax_dict=None, strict_ent_dict=None, + validate_dict=None): + return super().output(strict_ent_dict=strict_ent_dict) + + def run(self): + self.run_feature_detector() + unique_strict_ent_dict = self.run_pattern_detector() + return self.output(strict_ent_dict=unique_strict_ent_dict) diff --git a/plugins/signature/pattern.py b/plugins/signature/pattern.py new file mode 100755 index 0000000..fb73ccc --- /dev/null +++ b/plugins/signature/pattern.py @@ -0,0 +1,35 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +""" +**************************************************************** +**************************************************************** +******** ******** +******** La expresión regular incluye Signature ******** +******** ******** +**************************************************************** +**************************************************************** +""" + +import os +from utils.pattern.pattern_base import PluginPatternBase + +_CWD = os.path.dirname(__file__) + + +class PluginPattern(PluginPatternBase): + + def strict_regexp(self): + return { + "STRICT_REG_FIRMA_V0": r"Firmado por|Firmado|Fdo\.|Signed by|Firma\s|firma del representante" + } + + def lax_regexp(self): + pass + + def validate(self, ent): + pass + + def __init__(self, cwd=_CWD, lax_regexp=None, strict_regexp=None): + lax_regexp = self.lax_regexp() if lax_regexp is None else lax_regexp + strict_regexp = self.strict_regexp() if strict_regexp is None else strict_regexp + super().__init__(cwd, lax_regexp, strict_regexp) diff --git a/requirements.txt b/requirements.txt index 826983d..4d39b5c 100755 --- a/requirements.txt +++ b/requirements.txt @@ -6,7 +6,7 @@ PyYAML==5.1.1 python-stdnum==1.11 langdetect==1.0.7 fuzzywuzzy==0.17.0 -python-Levenshtein==0.12.0 +python-Levenshtein==0.12.2 spacy==2.3.0 https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.3.0/es_core_news_sm-2.3.0.tar.gz https://github.com/explosion/spacy-models/releases/download/xx_ent_wiki_sm-2.3.0/xx_ent_wiki_sm-2.3.0.tar.gz diff --git a/test/__init__.py b/test/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/faro/test/data/lorem.rtf b/test/data/lorem.rtf similarity index 100% rename from faro/test/data/lorem.rtf rename to test/data/lorem.rtf diff --git a/faro/test/data/no_metadata.pdf b/test/data/no_metadata.pdf similarity index 100% rename from faro/test/data/no_metadata.pdf rename to test/data/no_metadata.pdf diff --git a/faro/test/data/ocr.pdf b/test/data/ocr.pdf similarity index 100% rename from faro/test/data/ocr.pdf rename to test/data/ocr.pdf diff --git 
a/test/data/organizations.txt b/test/data/organizations.txt new file mode 100755 index 0000000..6cc9e1c --- /dev/null +++ b/test/data/organizations.txt @@ -0,0 +1,6 @@ +Hayek, F.A. The Collected Works of F.A. Hayek. Vol. 1, Fatal +Conceit: The Errors of Socialism. Edited by W.W. Bart- +ley. Chicago: University of Chicago Press, 1989. +Hegel, Georg W.F + +Apple Inc. diff --git a/faro/test/data/person_position.pdf b/test/data/person_position.pdf similarity index 100% rename from faro/test/data/person_position.pdf rename to test/data/person_position.pdf diff --git a/faro/test/data/protected.pdf b/test/data/protected.pdf similarity index 100% rename from faro/test/data/protected.pdf rename to test/data/protected.pdf diff --git a/test/data/sensitive_data.docx b/test/data/sensitive_data.docx new file mode 100755 index 0000000..2b91fe5 Binary files /dev/null and b/test/data/sensitive_data.docx differ diff --git a/faro/test/data/sensitive_data.pdf b/test/data/sensitive_data.pdf similarity index 80% rename from faro/test/data/sensitive_data.pdf rename to test/data/sensitive_data.pdf index 6a9caf9..0230b42 100755 Binary files a/faro/test/data/sensitive_data.pdf and b/test/data/sensitive_data.pdf differ diff --git a/faro/test/data/signature_boe.pdf b/test/data/signature_boe.pdf similarity index 100% rename from faro/test/data/signature_boe.pdf rename to test/data/signature_boe.pdf diff --git a/faro/test/data/split_lines.docx b/test/data/split_lines.docx similarity index 100% rename from faro/test/data/split_lines.docx rename to test/data/split_lines.docx diff --git a/faro/test/data/tests.txt b/test/data/tests.txt similarity index 100% rename from faro/test/data/tests.txt rename to test/data/tests.txt diff --git a/test_faro_cli.py b/test/test_faro_cli.py similarity index 71% rename from test_faro_cli.py rename to test/test_faro_cli.py index a8865dd..144c4c0 100755 --- a/test_faro_cli.py +++ b/test/test_faro_cli.py @@ -3,14 +3,16 @@ import os import unittest import subprocess +from pathlib import Path -CWD = os.path.dirname(__file__) -INPUT_PATH = os.path.join(CWD, 'faro/test/data') +CWD = Path(__file__).parent +INPUT_PATH = CWD / "data" INPUT_FILE = 'sensitive_data.pdf' INPUT_SCORE_FILE = '%s.score' % INPUT_FILE INPUT_ENTITY_FILE = '%s.entity' % INPUT_FILE +FARO_DETECTION_PATH = CWD.parent / 'faro_detection.py' -DUMP_DATA = ["sensitive_data.pdf", "high", "2,0,2,3,4,4,0", "application/pdf", "ENRIQUE ANDRADE GONZALEZ"] +DUMP_DATA = ["sensitive_data.pdf", "high", "2,0,2,3,5,4,0", "application/pdf", "ENRIQUE ANDRADE GONZALEZ"] class FaroCommandLineTest(unittest.TestCase): @@ -29,19 +31,19 @@ def tearDown(self): def test_faro_detection_file(self): input_file = '%s/%s' % (INPUT_PATH, INPUT_FILE) - subprocess.run(['./faro_detection.py', '-i', input_file], stdout=subprocess.PIPE, stderr=subprocess.PIPE) + subprocess.run([FARO_DETECTION_PATH, '-i', input_file], stdout=subprocess.PIPE, stderr=subprocess.PIPE) self.assertTrue(os.path.isfile('%s/%s' % (INPUT_PATH, INPUT_ENTITY_FILE))) def test_faro_detection_dump(self): input_file = '%s/%s' % (INPUT_PATH, INPUT_FILE) - result = subprocess.Popen(['./faro_detection.py', '-i', input_file, "--dump"], stdout=subprocess.PIPE, + result = subprocess.Popen([FARO_DETECTION_PATH, '-i', input_file, "--dump"], stdout=subprocess.PIPE, stderr=subprocess.PIPE) out, err = result.communicate() out = out.decode('utf-8') for chain in DUMP_DATA: position = out.find(chain) - self.assertTrue(position!=-1) + self.assertTrue(position != -1) if __name__ == "__main__": diff --git 
a/faro/test/test_faro_entrypoint.py b/test/test_faro_entrypoint.py similarity index 86% rename from faro/test/test_faro_entrypoint.py rename to test/test_faro_entrypoint.py index 4a49855..3af3bb7 100755 --- a/faro/test/test_faro_entrypoint.py +++ b/test/test_faro_entrypoint.py @@ -5,6 +5,8 @@ import argparse import os from os import path +from pathlib import Path + from faro.faro_entrypoint import faro_execute INPUT_FILE = 'sensitive_data.pdf' @@ -15,9 +17,11 @@ INPUT_FILE_SIGNATURE = 'signature_boe.pdf' INPUT_FILE_TESTS_TXT = 'tests.txt' INPUT_FILE_NO_SENSITIVE = 'lorem.rtf' +INPUT_FILE_ORG = 'organizations.txt' + +CWD = Path(__file__).parent +INPUT_PATH = CWD / "data" -CWD = os.path.dirname(__file__) -INPUT_PATH = os.path.join(CWD, 'data') SCORE_EXT = 'score' ENTITY_EXT = 'entity' @@ -26,8 +30,11 @@ EMAILS = ["soia@telefonica.es", "test@csic.es"] MOBILE_PHONES = ["654456654", "666444222", "651.651.651"] DOCUMENT_ID = ["C59933143", "E-38008785", "36663760-N", "96222560J"] -FINANCIAL_DATA = ["ES6621000418401234567891", "5390700823285988", "4916697015082", "4929432390175839"] +FINANCIAL_DATA = ['ES6621000418401234567891', '4916697015082', '5390700823285988', '4929432390175839', '5105 1051 0 5 10 5100 '] +FINANCIAL_DATA_OCR = ['ES6621000418401234567891', '4916697015082', '5390700823285988', '4929432390175839'] + CUSTOM_WORDS = ["confidencial", "contraseña"] +ORGANIZATIONS = ["University of Chicago Press", "Apple Inc"] LANGUAGE_METADA = "meta:lang" @@ -79,7 +86,7 @@ def tearDown(self): pass def test_document_id_detection(self): - faro_document_ids_entity = list(self.FARO_ENTITY_TEST_1['document_id'].keys()) + faro_document_ids_entity = list(self.FARO_ENTITY_TEST_1['id_document'].keys()) self.assertTrue(len(faro_document_ids_entity) == len(DOCUMENT_ID)) diff_list = (set(faro_document_ids_entity) ^ set(DOCUMENT_ID)) self.assertTrue(len(diff_list) == 0) @@ -91,7 +98,7 @@ def test_document_financial_data_detection(self): self.assertTrue(len(diff_list) == 0) def test_mobile_phone_detection(self): - faro_mobile_phone_number_entity = list(self.FARO_ENTITY_TEST_1['mobile_phone_number'].keys()) + faro_mobile_phone_number_entity = list(self.FARO_ENTITY_TEST_1['mobile'].keys()) for i in range(len(faro_mobile_phone_number_entity)): faro_mobile_phone_number_entity[i] = faro_mobile_phone_number_entity[i].replace(" ", "") @@ -106,7 +113,7 @@ def test_email_detection(self): self.assertTrue(len(diff_list) == 0) def test_monetary_quantity_detection(self): - faro_monetary_quantity_entity = list(self.FARO_ENTITY_TEST_1['monetary_quantity'].keys()) + faro_monetary_quantity_entity = list(self.FARO_ENTITY_TEST_1['money'].keys()) self.assertTrue(len(faro_monetary_quantity_entity) == len(MONETARY_QUANTITY)) diff_list = (set(faro_monetary_quantity_entity) ^ set(MONETARY_QUANTITY)) self.assertTrue(len(diff_list) == 0) @@ -120,6 +127,23 @@ def test_language(self): faro_language_score = _faro_run(INPUT_PATH, INPUT_FILE, SCORE_EXT) self.assertTrue(faro_language_score[LANGUAGE_METADA] == "es") + def test_organizations(self): + entity_file_name = 'test_verbose_entity_org' + + params = argparse.Namespace() + params.input_file = '%s/%s' % (INPUT_PATH, INPUT_FILE_ORG) + params.output_entity_file = '%s/%s.%s' % (INPUT_PATH, entity_file_name, ENTITY_EXT) + params.verbose = True + faro_execute(params) + + faro_verbose = _get_file_data(params.output_entity_file) + faro_verbose_entity = faro_verbose['entities']['organization'] + + self.assertTrue(faro_verbose_entity is not None) + self.assertTrue(len(faro_verbose_entity) == 
len(ORGANIZATIONS)) + diff_list = (set(faro_verbose_entity) ^ set(ORGANIZATIONS)) + self.assertTrue(len(diff_list) == 0) + def test_unsupported_language(self): faro_language_score = _faro_run(INPUT_PATH, INPUT_FILE_PROTECTED, SCORE_EXT) self.assertTrue(faro_language_score[LANGUAGE_METADA] == "unk") @@ -160,7 +184,7 @@ def test_params_verbose(self): faro_verbose_entity = faro_verbose['entities'] self.assertTrue(faro_verbose_entity['person'] is not None) - self.assertTrue(faro_verbose_entity['phone_number'] is not None) + self.assertTrue(faro_verbose_entity['phone'] is not None) self.assertTrue(faro_verbose_entity['probable_currency_amount'] is not None) def test_params_split_lines(self): @@ -172,13 +196,13 @@ def test_params_split_lines(self): faro_split_lines = _get_file_data(params.output_entity_file) faro_split_lines_entity = faro_split_lines['entities'] - self.assertTrue(faro_split_lines_entity.get('mobile_phone_number') is None) + self.assertTrue(faro_split_lines_entity.get('mobile') is None) def test_ocr(self): faro_ocr = _faro_run(INPUT_PATH, INPUT_FILE_OCR) faro_ocr_financial_data = list(faro_ocr['financial_data'].keys()) - self.assertTrue(len(faro_ocr_financial_data) == len(FINANCIAL_DATA)) - diff_list = (set(faro_ocr_financial_data) ^ set(FINANCIAL_DATA)) + self.assertTrue(len(faro_ocr_financial_data) == len(FINANCIAL_DATA_OCR)) + diff_list = (set(faro_ocr_financial_data) ^ set(FINANCIAL_DATA_OCR)) self.assertTrue(len(diff_list) == 0) def test_signature(self): @@ -189,9 +213,9 @@ def test_signature(self): self.assertTrue(faro_signature_entity[0] == SIGNATURE[0]) def test_custom_words(self): - faro_custom_score = _faro_run(INPUT_PATH, INPUT_FILE_TESTS_TXT, SCORE_EXT)['custom_words'] + faro_custom_score = _faro_run(INPUT_PATH, INPUT_FILE_TESTS_TXT, SCORE_EXT)['custom_word'] faro_custom_entity = _get_file_data(os.path.join(INPUT_PATH, INPUT_FILE_TESTS_TXT + "." 
+ ENTITY_EXT)) - faro_entities = list(faro_custom_entity['entities']['custom_words']) + faro_entities = list(faro_custom_entity['entities']['custom_word']) self.assertTrue(faro_custom_score == 2) diff_list = (set(faro_entities) ^ set(CUSTOM_WORDS)) self.assertTrue(len(diff_list) == 0) diff --git a/faro/test/test_utils.py b/test/test_utils.py similarity index 96% rename from faro/test/test_utils.py rename to test/test_utils.py index 29b7b03..50b38e0 100755 --- a/faro/test/test_utils.py +++ b/test/test_utils.py @@ -2,7 +2,8 @@ # -*- coding: utf-8 -*- import unittest -from faro import utils +from utils import utils + class UtilsTest(unittest.TestCase): @@ -26,5 +27,6 @@ def test_normalize_text_v0(self): self.shortDescription(), norm_text)) + if __name__ == "__main__": unittest.main() diff --git a/utils/__init__.py b/utils/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/utils/base_detector.py b/utils/base_detector.py new file mode 100755 index 0000000..a77e474 --- /dev/null +++ b/utils/base_detector.py @@ -0,0 +1,33 @@ +import logging +import os +from abc import ABC, abstractmethod + +CWD = os.path.dirname(__file__) + +logger = logging.getLogger(__name__) + + +def get_unique_ents(ent_list): + """ Process the entities to obtain a json object """ + unique_ent_dict = {} + for _ent in ent_list: + if _ent[1] not in unique_ent_dict: + unique_ent_dict[_ent[1]] = {} + if _ent[0] not in unique_ent_dict[_ent[1]]: + unique_ent_dict[_ent[1]][_ent[0]] = 0 + unique_ent_dict[_ent[1]][_ent[0]] += 1 + return unique_ent_dict + + +class BaseDetector(ABC): + def __init__(self, text, lang): + self._results = {} + self.text = text + self.lang = lang + + @abstractmethod + def run(self): + pass + + def get_results(self): + return self._results diff --git a/utils/base_plugin.py b/utils/base_plugin.py new file mode 100755 index 0000000..ab4d094 --- /dev/null +++ b/utils/base_plugin.py @@ -0,0 +1,45 @@ +import importlib +import logging +import os +from abc import ABC, abstractmethod + +import yaml + +logger = logging.getLogger(__name__) + + +def get_supported_languages(path): + with os.scandir(path) as files: + supported_languages = [file.name for file in files if file.is_dir()] + return supported_languages + + +def load_module(plugin_name): + try: + plugin = importlib.import_module(plugin_name) + return plugin + except Exception as e: + logger.error(f"[load_plugin] {e}") + + +def load_config(path, config_path): + commons_yaml = os.path.join(path, config_path) + with open(commons_yaml, "r", encoding='utf8') as f_stream: + commons_config = yaml.load(f_stream, Loader=yaml.FullLoader) + return commons_config + + +def get_plugins_list(path): + with os.scandir(path) as files: + plugins = [file.name for file in files if file.is_dir() and not file.name.startswith("__")] + return plugins + + +class BasePlugin(ABC): + def __init__(self, text, lang): + self.text = text + self.lang = lang + + @abstractmethod + def run(self): + pass diff --git a/faro/email.py b/utils/email.py similarity index 86% rename from faro/email.py rename to utils/email.py index 4efa725..9ef6ca0 100755 --- a/faro/email.py +++ b/utils/email.py @@ -1,11 +1,15 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- -import logging +import os + from fuzzywuzzy import fuzz from fuzzywuzzy import process +from conf import config +from logger import logger -logger = logging.getLogger(__name__) +script_name = os.path.basename(__file__) +faro_logger = logger.Logger(logger_name=script_name, file_name=config.LOG_FILE_NAME, logging_level=config.LOG_LEVEL) 
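Modules now log through the shared rotating-file Logger from logger/logger.py rather than the standard logging.getLogger(__name__). A minimal usage sketch (the message text is illustrative; conf.config is assumed to expose LOG_FILE_NAME and LOG_LEVEL, as the imports above suggest):

```python
from pathlib import Path

from conf import config
from logger import logger

script_name = Path(__file__).name
faro_logger = logger.Logger(logger_name=script_name,
                            file_name=config.LOG_FILE_NAME,
                            logging_level=config.LOG_LEVEL)

# Each level method takes (file_name, method_name, message) and writes a line like
# "<asctime> -- INFO -- email.py -- is_corp_email -- filtering corporate domains"
# to logs/<LOG_FILE_NAME> via a RotatingFileHandler.
faro_logger.info(script_name, "is_corp_email", "filtering corporate domains")
```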
class EmailFilter(object): diff --git a/utils/features/__init__.py b/utils/features/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/utils/features/entrypoint_feature_base.py b/utils/features/entrypoint_feature_base.py new file mode 100755 index 0000000..d368545 --- /dev/null +++ b/utils/features/entrypoint_feature_base.py @@ -0,0 +1,38 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +from abc import abstractmethod + +from utils.base_plugin import BasePlugin, get_supported_languages + +PATTERN_DIRECTORY_NAME = "features" + + +class PluginFeatureEntrypoint(BasePlugin): + + def get_lang(self): + return self._lang + + def get_path_lang(self, path): + _path_lang = path + if self._is_lang_dependent: + if self._lang in get_supported_languages(path): + _path_lang = os.path.join(path, self._lang) + return _path_lang + + def detection(self): + pass + + @staticmethod + def output(_list): + return _list + + def __init__(self, is_lang_dependent, lang, text): + super().__init__(text, lang) + self._is_lang_dependent = is_lang_dependent + self._lang = lang + + @abstractmethod + def run(self): + pass diff --git a/utils/features/feature_detector.py b/utils/features/feature_detector.py new file mode 100755 index 0000000..6cf9080 --- /dev/null +++ b/utils/features/feature_detector.py @@ -0,0 +1,56 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +from utils.base_detector import BaseDetector, get_unique_ents + + +class FeatureDetector(BaseDetector): + """ Main class for extracting KPIs of confidential documents + """ + + def get_kpis(self, sent_list): + """ Extract KPIs from document """ + # full_text is used for proximity detection + full_text = "".join(sent_list) + total_ent_list = [] + + offset = 0 + + for sent in sent_list: + line_length = len(sent) + + result = self.custom_detector.detector(sent) + total_ent_list.extend(result) + + offset += line_length + + return total_ent_list + + def run(self): + """ Obtain KPIs from a document and obtain the output in the right format (json) + + Keyword arguments: + content -- list of sentences to obtain the entities + + """ + ent_list = self.get_kpis(self.text) + unique_list = get_unique_ents(ent_list) + return unique_list + + def __init__(self, text, lang, custom_detector): + """ Intialization + + Keyword Arguments: + config -- a dict with yaml configuration parameters + + Properties + nlp -- a spacy model or None + custom_word_list -- list with custom words + regexp_config_dict -- configuration of the proximity detections + signature_max_distance -- maximum distance between distance and signature + low_priority_list -- list of entity types with low priority + + """ + super().__init__(text, lang) + self.text = text + self.custom_detector = custom_detector diff --git a/utils/features/ner.py b/utils/features/ner.py new file mode 100755 index 0000000..cbfd929 --- /dev/null +++ b/utils/features/ner.py @@ -0,0 +1,52 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +import logging + +from utils.utils import preprocess_text + +logger = logging.getLogger(__name__) + + +class NER: + """ A class to extract entities using different NERs """ + + def _spacy(self, doc): + # using SpaCy + ent_list = [] + for ent in doc: + if ent[1].upper() in [self.entity]: + ent_list.append(ent) + return ent_list + + def _spacy_extra_models(self, u_text, ent_list): + ent_list = self.spacy.run_extra(u_text) + + def get_model_entities(self, sentence): + """ Get entities with a NER ML model (Spacy) + + Keyword arguments: + sentence -- a string with a 
sentence or paragraph + + """ + + u_text = preprocess_text(sentence) + + doc, token_list = self.spacy.run(self.lang, u_text) + + ent_list = self._spacy(doc) + + # extracting entities with crfs + # self._spacy_extra_models(u_text, ent_list) + return ent_list + + def __init__(self, spacy, lang, entity="PER"): + """ Initialization + + Keyword arguments: + nlp: spacy model + nlp_extra: additional spacy models (e.g. with custom entities) (default None) + """ + + self.spacy = spacy + self.entity = entity + self.lang = lang diff --git a/utils/features/spacy.py b/utils/features/spacy.py new file mode 100755 index 0000000..1435492 --- /dev/null +++ b/utils/features/spacy.py @@ -0,0 +1,69 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import logging + +import spacy + +from utils.singleton import Singleton + +logger = logging.getLogger(__name__) + +_SPACY_MODEL_NAME = "_core_news_" +_SPACY_DEFAULT_MODEL_NAME = "xx_ent_wiki_sm" +# lg, md, sm +_SPACY_MODEL_TYPE = "sm" + + +def _load_spacy_model(lang): + model_name = '%s%s%s' % (lang, _SPACY_MODEL_NAME, _SPACY_MODEL_TYPE) + try: + nlp = spacy.load(model_name) + except Exception: + nlp = spacy.load(_SPACY_DEFAULT_MODEL_NAME) + return nlp + + +class Spacy(metaclass=Singleton): + """ A class to manage spacy models to be used as a util resource """ + + def __init__(self): + self._nlp = None + self._nlp_extra = None + self.max_length = 1000000 + self.ent_list = [] + self.token_list = [] + self.ent_list_extra = [] + self.processed = False + self.last_text = "" + self.last_lang = "" + + def run(self, lang, text): + if self.last_lang != lang: + self._nlp = _load_spacy_model(lang) + self.last_lang = lang + self.processed = False + if self.last_text != text: + self.last_text = text + self.processed = False + if not self.processed: + self.processed = True + self.ent_list = [] + self.token_list = [] + chunks = [text[i:i + self.max_length] for i in range(0, len(text), self.max_length)] + for chunk in chunks: + doc = self._nlp(chunk) + [self.ent_list.append((ent.text, ent.label_.upper(), ent.start_char, ent.end_char, ent.start)) + for ent in doc.ents] + [self.token_list.append(token) for token in doc] + return self.ent_list, self.token_list + + def run_extra(self, text): + if self._nlp_extra is None: + return None + else: + for nlp_e in self._nlp_extra: + doc = nlp_e(text) + for ent in doc.ents: + self.ent_list_extra.append((ent.text, ent.label_, ent.start_char, ent.end_char, ent.start)) + return self.ent_list_extra diff --git a/utils/pattern/__init__.py b/utils/pattern/__init__.py new file mode 100755 index 0000000..e69de29 diff --git a/utils/pattern/entrypoint_pattern_base.py b/utils/pattern/entrypoint_pattern_base.py new file mode 100755 index 0000000..12dbe27 --- /dev/null +++ b/utils/pattern/entrypoint_pattern_base.py @@ -0,0 +1,63 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +import os +from abc import abstractmethod + +from utils.base_plugin import BasePlugin, get_supported_languages, load_config, load_module +from utils.pattern.pattern_detector import PatternDetector + +PATTERN_DIRECTORY_NAME = "plugins" +PATTERN_MOD_NAME = "pattern" + + +class PluginPatternEntrypointBase(BasePlugin): + + def get_pattern(self): + return self.pattern + + @abstractmethod + def output(self, unconsolidated_lax_dict=None, consolidated_lax_dict=None, strict_ent_dict=None, + validate_dict=None): + result_dict = {} + for _dict in [unconsolidated_lax_dict, consolidated_lax_dict, strict_ent_dict, validate_dict]: + if _dict is None: + continue + for _key in _dict.keys(): 
+ result_dict.update(_dict[_key]) + if result_dict == {}: + result = {} + else: + result = {self._plugin_key: result_dict} + return result + + def _load_pattern_by_lang(self, is_lang_dependent, plugin_path): + """ + is_lang_dependent: Returns True if there are language directories. + supported_lang: List of supported languages. + lang: language of the text to be analyzed. + plugin_path: plugin directory. + """ + plugin_path = os.path.basename(plugin_path) + lang = self.lang.lower() + is_supported_lang = (lang in self.supported_languages) + if is_lang_dependent and is_supported_lang: + plugin_path = plugin_path + "." + lang + module = PATTERN_DIRECTORY_NAME + "." + plugin_path + "." + PATTERN_MOD_NAME + return load_module(module).PluginPattern() + + def __init__(self, plugin_path, is_lang_dependent, plugin_key, lang, text): + super().__init__(text, lang) + self._plugin_key = plugin_key + self.supported_languages = get_supported_languages(plugin_path) + self.pattern = self._load_pattern_by_lang(is_lang_dependent, plugin_path) + + @abstractmethod + def run(self): + detector = PatternDetector(self.text, self.lang, self.get_pattern()) + unique_strict_ent_dict, unique_consolidated_broad_dict, unique_unconsolidated_broad_dict, unique_validate_list \ + = detector.run() + return self.output(unconsolidated_lax_dict=unique_unconsolidated_broad_dict, + consolidated_lax_dict=unique_consolidated_broad_dict, + strict_ent_dict=unique_strict_ent_dict, + validate_dict=unique_validate_list) diff --git a/utils/pattern/ner_regex.py b/utils/pattern/ner_regex.py new file mode 100755 index 0000000..2e60e6c --- /dev/null +++ b/utils/pattern/ner_regex.py @@ -0,0 +1,177 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +import logging +import copy +import re + +from utils.utils import normalize_text, clean_text + +logger = logging.getLogger(__name__) + + +class RegexNer(object): + """ Detection of some number-based entities with regular expressions """ + + def _detect_regexp(self, sentence, _type): + """ Use broad/strict coverage regexp to detect possible entities + + Keyword arguments: + sentence -- string containing the sentence + _type -- type of regexp [broad or strict] + + """ + result_dict = [] + + for _regexp in self.regexp_compiler_dict[_type]: + it = _regexp[0].finditer(sentence) + + for match in it: + result_dict.append( + (sentence[match.start():match.end()], + _regexp[1], match.start(), match.end())) + + return result_dict + + @staticmethod + def _match_found(word_list, span_text, result_dict, _regexp): + is_match_found = False + for _word in word_list: + if _word in span_text: + is_match_found = True + break + if is_match_found: + result_dict.append(_regexp) + return is_match_found + + def _check_proximity_conditions(self, unconsolidated_dict, full_text, offset): + """ Check the proximity of keywords to a regexp detection + + Keyword arguments: + unconsolidated_dict -- dict with entities that were not consolidated + result_dict -- dict to store consolidated entities + full_text -- a full document + offset -- offset in the full document of the current sentence + + """ + result_dict = [] + is_left_search = "Context-left" in self.regexp_config_dict + is_right_search = "Context-right" in self.regexp_config_dict + + if is_left_search: + left_word_list = self.regexp_config_dict["Context-left"]["word_list"] + left_span_len = self.regexp_config_dict["Context-left"]["span_len"] + if is_right_search: + right_word_list = self.regexp_config_dict["Context-right"]["word_list"] + right_span_len = 
self.regexp_config_dict["Context-right"]["span_len"] + + for _regexp in unconsolidated_dict: + idx_reg_start = _regexp[2] + offset + idx_reg_end = _regexp[3] + offset + is_stop = False + if is_left_search: + span_start = idx_reg_start - left_span_len + span_end = idx_reg_end + # safety check: span_start cannot be lower than 0 (beginning of file) + if span_start < 0: + span_start = 0 + span_text = normalize_text(full_text[span_start:span_end]) + is_stop = self._match_found(left_word_list, span_text, result_dict, _regexp) + + if is_right_search and not is_stop: + span_start = idx_reg_start + span_end = idx_reg_end + right_span_len + span_text = normalize_text(full_text[span_start:span_end]) + self._match_found(right_word_list, span_text, result_dict, _regexp) + return result_dict + + @staticmethod + def _remove_unconsolidated_matches(consolidated_list, unconsolidated_list): + return list(set(unconsolidated_list) - set(consolidated_list)) + + def _validate_list(self, validate_list, _list): + new_list = [] + for regexp in _list: + ent = clean_text(regexp[0]) + # print("ent: " + ent) + if self._func_validate(ent): + # print("\tent:" + ent + " is valid!") + validate_list.append(regexp) + else: + # print("\tent:" + ent + " NO valid!") + new_list.append(regexp) + return validate_list, new_list + + def _validate(self, strict_list, consolidated_list): + validate_list = [] + if self._is_validate: + validate_list, new_strict_list = self._validate_list(validate_list, strict_list) + validate_list, new_broad_list = self._validate_list(validate_list, consolidated_list) + else: + new_strict_list = strict_list + new_broad_list = consolidated_list + return validate_list, new_strict_list, new_broad_list + + def regex_detection(self, sentence, full_text=None, offset=0): + """ Detect entities with a regex in sentence + + Keyword arguments: + sentence -- a sentence in plain text + + """ + # dict to store detections + unconsolidated_broad_list = [] + + result_broad_list = self._detect_regexp(sentence, "broad") + strict_list = copy.deepcopy(self._detect_regexp(sentence, "strict")) + + consolidated_list = [clean_text(regexp[0]) for regexp in strict_list] + + for _broad_regexp in result_broad_list: + if clean_text(_broad_regexp[0]) not in consolidated_list: + unconsolidated_broad_list.append(_broad_regexp) + + # check proximity conditions of broad regexp detections + # Si no se inicializa a [] se duplican resultados + consolidated_broad_list = self._check_proximity_conditions(unconsolidated_broad_list, full_text, offset) + + # Validate + validate_list, strict_list, consolidated_broad_list = self._validate(strict_list, + consolidated_broad_list) + + unconsolidated_broad_list = self._remove_unconsolidated_matches(consolidated_broad_list, + unconsolidated_broad_list) + + return strict_list, consolidated_broad_list, unconsolidated_broad_list, validate_list + + def __init__(self, broad_regexp_dict, strict_regexp_dict, func_validate, regexp_config_dict=None): + """ Initialization + + The process of the application of the regexp is the following: + First and wide coverage regexp is applied to extract as many + candidates as possible. 
["broad" + + Keyword arguments: + broad_regexp_dict -- a dict containing the broad coverage regexp + strict_regexp_dict -- a dict containing stricter regexp + regexp_config_dict -- a dict containing the configuration + on the proximity conditions + + """ + + self._is_validate = func_validate is not None + self._func_validate = func_validate + + if regexp_config_dict is None: + regexp_config_dict = {} + + self.regexp_compiler_dict = {"broad": []} + + for _regexp in broad_regexp_dict: + self.regexp_compiler_dict["broad"].append((re.compile(_regexp[0]), _regexp[1])) + + self.regexp_compiler_dict["strict"] = [] + + for _regexp in strict_regexp_dict: + self.regexp_compiler_dict["strict"].append((re.compile(_regexp[0]), _regexp[1])) + + self.regexp_config_dict = regexp_config_dict diff --git a/utils/pattern/pattern_base.py b/utils/pattern/pattern_base.py new file mode 100755 index 0000000..397956f --- /dev/null +++ b/utils/pattern/pattern_base.py @@ -0,0 +1,52 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- + +from utils.utils import clean_text + + +class PluginPatternBase: + def get_strict_regexp(self): + return self.dict_regex_strict + + def get_lax_regexp(self): + return self.dict_regex_lax + + def get_strict_entities(self): + return self._strict_entities + + def get_consolidated_lax_entities(self): + return self._consolidated_entities + + def get_unconsolidated_lax_entities(self): + return self._unconsolidated_entities + + def get_plugin_path(self): + return self.plugin_path + + @staticmethod + def _dict_to_regex_struct(_dict): + if not _dict: + return [] + return [(k, v) for v, k in _dict.items()] + + @staticmethod + def clean_entity(text): + return clean_text(text) + + def strict_regexp(self): + pass + + def lax_regexp(self): + pass + + def validate(self, ent): + pass + + def __init__(self, plugin_path, regex_lax, regex_strict): + self._strict_entities = {} + self._consolidated_entities = {} + self._unconsolidated_entities = {} + + self.dict_regex_lax = self._dict_to_regex_struct(regex_lax) + self.dict_regex_strict = self._dict_to_regex_struct(regex_strict) + self.plugin_path = plugin_path diff --git a/utils/pattern/pattern_detector.py b/utils/pattern/pattern_detector.py new file mode 100755 index 0000000..a92576b --- /dev/null +++ b/utils/pattern/pattern_detector.py @@ -0,0 +1,128 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +import os +from pathlib import Path + +import yaml + +from utils.base_detector import BaseDetector, get_unique_ents, CWD +from utils.pattern.ner_regex import RegexNer +from utils.utils import normalize_text + +CONFIG_PATH = os.path.join(CWD, '..', 'config') +MODELS_PATH = os.path.join(CWD, '..', 'models') +_COMMONS_YAML = "%s/commons.yaml" % CONFIG_PATH + +_REGEXP_CONFIG_KEY_NAME = "regexp_config" +_CONTEXT_KEY_NAME = "Context" +_CONTEXT_LEFT_KEY_NAME = "Context-left" +_CONTEXT_RIGHT_KEY_NAME = "Context-right" +_SPAN_LEN_PARAM_NAME = "span_len" +_WORD_FILE_PARAM_NAME = "word_file" +_WORD_LIST_NAME = "word_list" + + +class PatternDetector(BaseDetector): + """ Main class for extracting KPIs of confidential documents + """ + + def _load_context(self, plugin_path): + context_yaml = "%s/context.yaml" % plugin_path + regexp_config_dict = {} + if Path(context_yaml).is_file(): + with open(context_yaml, "r", encoding='utf8') as f_stream: + context = yaml.load(f_stream, Loader=yaml.FullLoader) + + # configuration of the proximity regexp + regexp_config_dict = {} + if _REGEXP_CONFIG_KEY_NAME in context: + + if _CONTEXT_KEY_NAME in context[_REGEXP_CONFIG_KEY_NAME]: + 
context_word = self._load_word_file(plugin_path, context[_REGEXP_CONFIG_KEY_NAME][_CONTEXT_KEY_NAME][_WORD_FILE_PARAM_NAME])
+
+                if _CONTEXT_LEFT_KEY_NAME in context[_REGEXP_CONFIG_KEY_NAME]:
+                    regexp_config_dict[_CONTEXT_LEFT_KEY_NAME] = {}
+                    regexp_config_dict[_CONTEXT_LEFT_KEY_NAME][_SPAN_LEN_PARAM_NAME] = int(
+                        context[_REGEXP_CONFIG_KEY_NAME][_CONTEXT_LEFT_KEY_NAME][_SPAN_LEN_PARAM_NAME])
+                    context_left_word = self._load_word_file(plugin_path, context[_REGEXP_CONFIG_KEY_NAME][_CONTEXT_LEFT_KEY_NAME][_WORD_FILE_PARAM_NAME])
+                    context_left_word.extend(context_word)
+                    regexp_config_dict[_CONTEXT_LEFT_KEY_NAME][_WORD_LIST_NAME] = context_left_word
+
+                if _CONTEXT_RIGHT_KEY_NAME in context[_REGEXP_CONFIG_KEY_NAME]:
+                    regexp_config_dict[_CONTEXT_RIGHT_KEY_NAME] = {}
+                    regexp_config_dict[_CONTEXT_RIGHT_KEY_NAME][_SPAN_LEN_PARAM_NAME] = int(
+                        context[_REGEXP_CONFIG_KEY_NAME][_CONTEXT_RIGHT_KEY_NAME][_SPAN_LEN_PARAM_NAME])
+                    context_right_word = self._load_word_file(plugin_path, context[_REGEXP_CONFIG_KEY_NAME][_CONTEXT_RIGHT_KEY_NAME][_WORD_FILE_PARAM_NAME])
+                    context_right_word.extend(context_word)
+                    regexp_config_dict[_CONTEXT_RIGHT_KEY_NAME][_WORD_LIST_NAME] = context_right_word
+
+        return regexp_config_dict
+
+    def get_kpis(self, sent_list):
+        """ Extract KPIs from document """
+        # full_text is used for proximity detection
+        full_text = "".join(sent_list)
+        total_ent_strict_list = []
+        ent_consolidated_list = []
+        ent_unconsolidated_list = []
+        ent_validate_list = []
+        valid_regex = []
+
+        offset = 0
+
+        for sent in sent_list:
+            line_length = len(sent)
+
+            # extract entities (ML)
+            # self._extract_entities_ml(sent, offset, total_ent_strict_list)
+            strict_regex, con_broad_regex, uncon_broad_regex, valid_regex = self.regex_ner.regex_detection(sent,
+                                                                                                           full_text,
+                                                                                                           offset)
+            total_ent_strict_list.extend(strict_regex)
+            ent_consolidated_list.extend(con_broad_regex)
+            ent_unconsolidated_list.extend(uncon_broad_regex)
+            ent_validate_list.extend(valid_regex)
+
+            offset += line_length
+
+        return total_ent_strict_list, ent_consolidated_list, ent_unconsolidated_list, ent_validate_list
+
+    def run(self):
+        """ Obtain KPIs from a document and return the output in the right format (json)
+
+        """
+
+        total_ent_strict_list, consolidated_broad_list, unconsolidated_broad_list, validate_list = self.get_kpis(
+            self.text)
+        unique_strict_ent_dict = get_unique_ents(total_ent_strict_list)
+        unique_consolidated_broad_dict = get_unique_ents(consolidated_broad_list)
+        unique_unconsolidated_broad_dict = get_unique_ents(unconsolidated_broad_list)
+        unique_validate_list = get_unique_ents(validate_list)
+
+        return unique_strict_ent_dict, unique_consolidated_broad_dict, unique_unconsolidated_broad_dict, unique_validate_list
+
+    @staticmethod
+    def _load_word_file(plugin_path, word_file_name):
+        file_path = '%s/%s' % (plugin_path, word_file_name)
+        if os.path.isfile(file_path):
+            with open(file_path, "r", encoding='utf8') as f_in:
+                word_list = [normalize_text(line.rstrip("\n").strip()) for line in f_in if line != "\n"]
+        else:
+            word_list = []
+        return word_list
+
+    def __init__(self, text, lang, pattern):
+        super().__init__(text, lang)
+
+        regexp_config_dict = self._load_context(pattern.get_plugin_path())
+
+        dict_regex_lax = pattern.get_lax_regexp()
+        dict_regex_strict = pattern.get_strict_regexp()
+
+        # if the plugin leaves the base validate() stub (which returns None), disable validation
+        validation_func = pattern.validate
+        if pattern.validate("") is None:
+            validation_func = None
+
+        self.regex_ner = RegexNer(dict_regex_lax, 
dict_regex_strict, validation_func, + regexp_config_dict=regexp_config_dict) diff --git a/utils/singleton.py b/utils/singleton.py new file mode 100755 index 0000000..0f32078 --- /dev/null +++ b/utils/singleton.py @@ -0,0 +1,25 @@ +import functools +import threading + + +def synchronized(lock): + def wrapper(f): + @functools.wraps(f) + def inner_wrapper(*args, **kw): + with lock: + return f(*args, **kw) + + return inner_wrapper + + return wrapper + + +class Singleton(type): + _instance = None + _lock = threading.Lock() + + @synchronized(_lock) + def __call__(cls, *args, **kwargs): + if cls._instance is None: + cls._instance = super(Singleton, cls).__call__(*args, **kwargs) + return cls._instance diff --git a/utils/storage.py b/utils/storage.py new file mode 100755 index 0000000..7beaa29 --- /dev/null +++ b/utils/storage.py @@ -0,0 +1,11 @@ +from utils.singleton import Singleton + + +class Storage: + __metaclass__ = Singleton + + def __init__(self): + pass + + def download_file(self, key): + raise NotImplementedError diff --git a/faro/utils.py b/utils/utils.py similarity index 79% rename from faro/utils.py rename to utils/utils.py index a81459e..1d60509 100755 --- a/faro/utils.py +++ b/utils/utils.py @@ -3,6 +3,15 @@ import re +def log_exception(logger_exceptions, filename, method, exception, system_info): + if system_info is not None: + exc_type, exc_obj, exc_tb = system_info.exc_info() + error_msg = "Error: " + str(exception) + ". Line: " + str(exc_tb.tb_lineno) + logger_exceptions.error(filename, method, error_msg) + else: + logger_exceptions.error(filename, method, str(exception)) + + def normalize_text(message): """ Clean text of dots between chars @@ -63,7 +72,7 @@ def preprocess_file_content(content, split_lines): if split_lines: file_lines = [preprocess_text(line) for line in lines] else: - # Combine lines + #  Combine lines file_lines = [" ".join([preprocess_text(line) for line in lines])] return file_lines
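Usage sketch (not part of the diff above): a minimal example of driving RegexNer from utils/pattern/ner_regex.py the way a pattern plugin entrypoint would, assuming the module layout introduced in this change. The labels, regexps and context keywords below are illustrative assumptions, not shipped configuration; real plugins load them from their own pattern class and context.yaml.

    from utils.pattern.ner_regex import RegexNer

    # (regexp, label) pairs -- made-up examples for illustration only
    lax_regexps = [(r"\b\d{9}\b", "id_document")]          # broad: any 9-digit run
    strict_regexps = [(r"\b\d{8}[A-Z]\b", "id_document")]  # strict: DNI-like token
    context = {"Context-left": {"word_list": ["DNI", "NIF"], "span_len": 20}}

    ner = RegexNer(lax_regexps, strict_regexps, func_validate=None,
                   regexp_config_dict=context)

    sentence = "El DNI 12345678Z y el NIF 123456789 aparecen en la factura."
    strict_hits, consolidated, unconsolidated, validated = ner.regex_detection(
        sentence, full_text=sentence, offset=0)

    # strict_hits keeps the DNI-like token; the lax-only match is consolidated when a
    # context keyword is found nearby, otherwise it stays in the unconsolidated list.
    print(strict_hits, consolidated, unconsolidated, validated)

This mirrors what PluginPatternEntrypointBase.run() does internally through PatternDetector, with the lax/strict regexps and the context configuration normally coming from the plugin's own files.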