diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
deleted file mode 100644
index d31045f..0000000
--- a/.github/CONTRIBUTING.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# Contributing to ASSESS
-
-We conduct the ASSESS project by following an agile approach.
-If you want to contribute to this project, we kindly ask you to refer to the following guidelines.
-
-## Getting Started
-
-We constantly work to design and develop high-quality software for the ASSESS technology.
-
-[User Stories](https://martinfowler.com/bliki/UserStory.html) are chunks of desired behavior of a software system. They are widely used in agile software approaches to divide up a large amount of functionality into smaller pieces for planning purposes.
-
-Stories are the smallest unit of work to be done for a project. The [INVEST mnemonic](https://xp123.com/articles/invest-in-good-stories-and-smart-tasks/) describes the characteristics of good stories. The goal of a story is to deliver a unit of value to the customer.
-
-On [GitHub](https://zube.io/blog/agile-project-management-workflow-for-github-issues/), you can make an issue the same way as you would make a story:
-1. Create a new GitHub issue (you will see the suggested template for writing an issue as a user story).
-2. Define a user story title and use the same as issue title.
-3. Assign a priority (values 1-4, where 1 is high priority and 4 is low) to the user story and select a label accordingly (P1, P2, P3, or P4).
-4. Estimate the size of the story in terms of the number of iterations (i.e., weeks) needed to accomplish it.
-5. Formulate the user story in the description using the "As a ... I want ... So that ..." form. The "As a" clause refers to who wants the story, "I want" describes what the functionality is, and "so that" describes why they want this functionality. The "so that" part provides important context that helps get from what the customer thinks they want to providing what they actually need.
-6. Provide the acceptance criteria by using the [Given-When-Then](https://www.agilealliance.org/glossary/gwt/) formula. Acceptance tests confirm that the story was delivered correctly.
-7. Write notes about the user story.
-
-Some issues can represent open questions (i.e., a matter not yet decided). In this case, you should not follow the aforementioned issue template, but you should just use the _question_ label to identify a GitHub issue as an open question.
-
-You can see [the open issues of the ASSESS project](https://github.com/nasa-jpl/ASSESS/issues).
-
-Any issue related to a specific feature can be marked with labels named after the components of the system, as depicted in the following diagram.
-
-![Overview Diagram](https://github.com/nasa-jpl/ASSESS/blob/master/doc/assess_overview.png)
diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md
deleted file mode 100644
index 2d91556..0000000
--- a/.github/ISSUE_TEMPLATE.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# User Story
-
-\
-
-**Priority**: \
-**Estimate**: \
-
-## Description
-
-**_As a_** \
-
-**_I want to_** \
-
-**_so that_** \
-
-## Acceptance Criteria
-
-### Acceptance Criterion 1
-
-**_Given_** \
-
-**_When_** \
-
-**_Then_** \
-
-## Notes
-
-\
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6404ddd..1251186 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,13 @@
# Change Log
+## [v1.2.0](https://github.com/nasa-jpl/ASSESS/tree/v1.2.0) (2022-01-31)
+- Refactor the codebase entirely: remove old dashboard code, outdated ML, and outdated extractors, and consolidate the code that is actually used.
+- Add increased API capabilities, allow for ML training, and add the new Faiss vector library.
+- Increase search capabilities.
+- Dockerize everything.
+- Allow app to use only Elasticsearch.
+- Use different data sources and allow for bulk ingestion.
+
## [v1.1.0](https://github.com/nasa-jpl/ASSESS/tree/v1.1.0) (2019-09-26)
- Improve underlying ASSESS algorithm (run-time, complexity, extraction, interoperability). See issues [56](https://github.com/nasa-jpl/ASSESS/issues/56) and [48](https://github.com/nasa-jpl/ASSESS/issues/48).
- Upgrade to Python 3. See issue [44](https://github.com/nasa-jpl/ASSESS/issues/44).
diff --git a/README.md b/README.md
index 43ed918..093dd64 100755
--- a/README.md
+++ b/README.md
@@ -1,20 +1,22 @@
# Automatic Semantic Search Engine for Suitable Standards
+ASSESS lets you run an API server that performs document similarity over large troves of text documents, and manage an application pipeline for ingesting, searching, inspecting, deleting, training on, logging, and editing documents.
+
+**The problem**: Given a Statement of Work (SoW), produce standards that may be related to that SoW.
+
+To understand the backend code, view the API in [main.py](https://github.com/nasa-jpl/ASSESS/blob/master/api/main.py).
+
+To understand the ML code, view [ml_core.py](https://github.com/nasa-jpl/ASSESS/blob/master/api/ml_core.py).
+
## Getting Started
There are a few main components to ASSESS:
-
-- A React front-end
- A FastAPI server
- An Elasticsearch server with 3 data indices (main index, system logs, and user statistics)
- Kibana for viewing data
+- A redis service for in-memory data storage and rate limiting
-`docker-compose.yml` shows the software stack. You can run the stack using `docker-compose up -d`. Please note, you need the Elasticsearch index data in order to actually have these components working.
-
-Make sure you edit `api/conf.yaml` with the correct server/port locations for elasticsearch.
-
-To understand the backend code, look at the API in [main.py](https://github.com/nasa-jpl/ASSESS/blob/master/api/main.py)
+Make sure you edit `api/conf.yaml` with the correct server/port locations for Elasticsearch. `docker-compose.yml` shows the software stack. You can run the stack using `docker-compose up -d`. Please note that you need the corresponding feather data in order to actually have everything working and ingested into Elasticsearch.
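+
+A minimal sketch of bringing the stack up and checking that the containers started (assuming you run the commands from the repository root, where `docker-compose.yml` lives):
+
+```bash
+# start Elasticsearch, Kibana, redis, and the FastAPI server in the background
+docker-compose up -d
+
+# follow the logs to confirm the containers came up cleanly
+docker-compose logs -f
+```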
## Testing the stack
-
-You can test the Rest API with [assess_api_calls.py](https://github.com/nasa-jpl/ASSESS/blob/master/api/assess_api_calls.py)
+You can test the Rest API with [assess_api_calls.py](https://github.com/nasa-jpl/ASSESS/blob/master/api/scripts/assess_api_calls.py)
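+
+As a rough sketch, you can also hit the `/recommend_text` endpoint directly. The payload shape mirrors the call in `assess_api_calls.py`; the host, port, and credentials below are illustrative assumptions, so substitute the values from your `conf.yaml`:
+
+```bash
+curl -X POST "http://localhost:8080/recommend_text?size=10" \
+     -H "Content-Type: application/json" \
+     -u username:password \
+     -d '{"text_field": "Example text about airplanes"}'
+```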
diff --git a/api/Dockerfile b/api/Dockerfile
index 40316dd..6a88722 100644
--- a/api/Dockerfile
+++ b/api/Dockerfile
@@ -40,8 +40,8 @@ RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN apt install -y python3.8
RUN mv /usr/bin/python3.8 /usr/local/bin/python
RUN python get-pip.py
-RUN python -m pip install --no-cache-dir -r requirements.txt
-RUN python -m pip install --no-cache-dir -r ml_requirements.txt
+RUN python -m pip install --no-cache-dir -r requirements/requirements.txt
+RUN python -m pip install --no-cache-dir -r requirements/ml_requirements.txt
RUN pip3 install jpl.pipedreams==1.0.3
RUN python -m spacy download en_core_web_sm
RUN python -m pip install --no-cache-dir "uvicorn[standard]" gunicorn fastapi
diff --git a/api/conf.yaml b/api/conf.yaml
index b551fec..d8dc71b 100644
--- a/api/conf.yaml
+++ b/api/conf.yaml
@@ -4,7 +4,7 @@ password:
url: http://localhost:8080
df_paths:
- "data/source_1"
- # - "data/source_2"
+ - "data/source_2"
es_server:
#- localhost
- elasticsearch
diff --git a/api/data/README.md b/api/data/README.md
index 134f322..d698eb6 100644
--- a/api/data/README.md
+++ b/api/data/README.md
@@ -1,4 +1,4 @@
# Instructions
-Place the dataframe of the feather file here. This gets bound to the Docker container and used by the machine learning algorithm.
+Place the feather dataframe files here. Make sure your `api/conf.yaml` is pointing to these files. They get ingested into the Elasticsearch Docker container and used by the machine learning algorithm.
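+
+As a quick sanity check (a sketch; `source_1` and `source_2` are just the example names used in `api/conf.yaml`):
+
+```bash
+# place your feather files here, e.g.:
+#   api/data/source_1
+#   api/data/source_2
+# then confirm that the df_paths entries in api/conf.yaml point at them:
+grep -A 3 "df_paths" api/conf.yaml
+```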
diff --git a/api/gunicorn_conf.py b/api/gunicorn_conf.py
index a437060..8f9f1e4 100644
--- a/api/gunicorn_conf.py
+++ b/api/gunicorn_conf.py
@@ -32,9 +32,9 @@
use_accesslog = accesslog_var or None
errorlog_var = os.getenv("ERROR_LOG", "-")
use_errorlog = errorlog_var or None
-graceful_timeout_str = os.getenv("GRACEFUL_TIMEOUT", "1400")
-timeout_str = os.getenv("TIMEOUT", "1400")
-keepalive_str = os.getenv("KEEP_ALIVE", "10")
+graceful_timeout_str = os.getenv("GRACEFUL_TIMEOUT", "3600")
+timeout_str = os.getenv("TIMEOUT", "3600")
+keepalive_str = os.getenv("KEEP_ALIVE", "3600")
# Gunicorn config variables
loglevel = use_loglevel
diff --git a/api/main.py b/api/main.py
index 832b274..91194ba 100644
--- a/api/main.py
+++ b/api/main.py
@@ -3,23 +3,16 @@
import os
import os.path
import shutil
-import subprocess
import time
from logging.handlers import RotatingFileHandler
from typing import Optional
import yaml
-import dill
-import pandas as pd
-import requests
import uvicorn
-from elasticsearch import Elasticsearch
from fastapi import (
- Body,
+ BackgroundTasks,
Depends,
FastAPI,
File,
- Form,
- HTTPException,
Request,
UploadFile,
)
@@ -36,10 +29,10 @@
from starlette.requests import Request
from starlette.responses import Response
-from standard_extractor import find_standard_ref
-from text_analysis import extract_prep
-import extraction
-from web_utils import connect_to_es, read_logs
+from standards_extraction import parse
+import ml_core
+from utils import connect_to_es
+import ast
# Define api settings.
app = FastAPI()
@@ -77,6 +70,26 @@
startMsg = {}
startMsg["message"] = "*** Starting Server ***"
fastapi_logger.info(json.dumps(startMsg))
+data_schema = {
+ "type": "object",
+ "properties": {
+ "doc_number": {"type": ["string", "null"]},
+ "id": {"type": ["string", "null"]},
+ "raw_id": {"type": ["string", "null"]},
+ "description": {"type": ["string", "null"]},
+ "ingestion_date": {"type": ["string", "null"]},
+ "hash": {"type": ["string", "null"]},
+ "published_date": {"type": ["string", "null"]},
+ "isbn": {"type": ["string", "null"]},
+ "text": {"type": ["array", "null"]},
+ "status": {"type": ["string", "null"]},
+ "technical_committee": {"type": ["string", "null"]},
+ "title": {"type": ["string", "null"]},
+ "url": {"type": ["string", "null"]},
+ "category": {"type": ["object", "null"]},
+ "sdo": {"type": ["object", "null"]},
+ },
+}
@app.on_event("startup")
@@ -111,128 +124,133 @@ def log_stats(request, data=None, user=None):
return
-def run_predict(request, start, in_text, size, vectorizer_types, index_types):
+def str_to_ls(s):
+ if type(s) is str:
+ s = ast.literal_eval(s)
+ return s
+
+def run_predict(request, start, in_text, size, start_from, vectorizer_types, index_types):
# Globally used
# vectorizer_types = ["tf_idf"]
# index_types = ["flat"]
- list_of_texts = extraction.get_list_of_text(es)
- vectorizers, vector_storage, vector_indexes = extraction.load_into_memory(
+ vectorizers, vector_storage, vector_indexes = ml_core.load_into_memory(
index_types, vectorizer_types
)
- list_of_predictions, scores = extraction.predict(
+ list_of_predictions, scores = ml_core.predict(
in_text,
size,
+ start_from,
vectorizers,
vector_storage,
vector_indexes,
- list_of_texts,
vectorizer_types,
index_types,
)
output = {}
+ # Add mget request here
+ """
+ res = es.mget(index = idx_main, body = {'ids': list_of_predictions})
+ results = [hit["_source"] for hit in res["hits"]["hits"]]
+ """
+ # TODO: Refactor
for i, prediction_id in enumerate(list_of_predictions):
res = es.search(
index=idx_main,
- body={"size": 1, "query": {"match": {"doc_number": prediction_id}}},
+ body={"size": 1, "query": {"match": {"_id": prediction_id}}},
)
for hit in res["hits"]["hits"]:
results = hit["_source"]
- output[i] = results
- output[i]["similarity"] = scores[i]
+ j = start_from + i
+ output[j] = results
+ output[j]["similarity"] = scores[j]
+ # End Refactor
json_compatible_item_data = jsonable_encoder(output)
log_stats(request, data=in_text)
print(f"{time.time() - start}")
return JSONResponse(content=json_compatible_item_data)
-# @app.post(
-# "/recommend_text",
-# dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
-# )
-# async def recommend_text(request: Request, sow: Sow, size: int = 10):
-# """Given an input of Statement of Work as text,
-# return a JSON of recommended standards.
-# """
-# start = time.time()
-# in_text = sow.text_field
-# predictions = old_extract_prep.predict(in_text=in_text, size=size)
-# output = {}
-# results = {}
-# i = 0
-# for prediction in predictions["recommendations"]:
-# i += 1
-# raw_id = prediction["raw_id"]
-# res = es.search(
-# index=idx_main, body={"size": 1, "query": {"match": {"raw_id": raw_id}}}
-# )
-# for hit in res["hits"]["hits"]:
-# results = hit["_source"]
-# output[i] = results
-# output[i]["similarity"] = prediction["sim"]
-# # output["embedded_references"] = predictions["embedded_references"]
-# json_compatible_item_data = jsonable_encoder(output)
-# log_stats(request, data=in_text)
-# print(f"{time.time() - start}")
-# return JSONResponse(content=json_compatible_item_data)
-
+# def background_train(es, index_types, vectorizer_types):
+# ml_core.train(es, index_types, vectorizer_types)
@app.post(
"/train",
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
-async def train(index_types=["flat", "flat_sklearn"], vectorizer_types=["tf_idf"]):
- print("Starting training...")
- extraction.train(es, index_types, vectorizer_types)
- return True
+async def train(request: Request, background_tasks: BackgroundTasks, index_types=["flat", "flat_sklearn"], vectorizer_types=["tf_idf"]):
+ vectorizer_types = str_to_ls(vectorizer_types)
+ index_types = str_to_ls(index_types)
+ background_tasks.add_task(ml_core.train, es,
+ index_types, vectorizer_types)
+ log_stats(request, data=None)
+ #message = {}
+ # if in_progress:
+ print("Training task created and sent to the background...")
+ # message = {'status': 'training'}
+ # else:
+ message = {'status': 'in_progress'}
+ return JSONResponse(message)
@app.post(
"/recommend_text",
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def recommend_text(
request: Request,
sow: Sow,
size: int = 10,
+ start_from: int = 0,
vectorizer_types=["tf_idf"],
index_types=["flat"],
):
    """Given an input of Statement of Work as text,
    return a JSON of recommended standards.
    """
+    vectorizer_types = str_to_ls(vectorizer_types)
+    index_types = str_to_ls(index_types)
in_text = sow.text_field
# df_file = "data/feather_text"
return run_predict(
- request, time.time(), in_text, size, vectorizer_types, index_types
+ request, time.time(), in_text, size, start_from, vectorizer_types, index_types,
)
@app.post(
"/recommend_file",
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def recommend_file(
request: Request,
pdf: UploadFile = File(...),
size: int = 10,
+ start_from: int = 0,
vectorizer_types=["tf_idf"],
index_types=["flat"],
):
    """Given an input of a Statement of Work as a PDF,
    return a JSON of recommended standards.
    """
+    vectorizer_types = str_to_ls(vectorizer_types)
+    index_types = str_to_ls(index_types)
print("File received.")
- in_text = extract_prep.parse_text(pdf)
- # df_file = "data/feather_text"
+ print(pdf.content_type)
+ print(pdf.filename)
+ in_text = parse.tika_parse(pdf)
+ print(in_text)
return run_predict(
- request, time.time(), in_text, size, vectorizer_types, index_types
+ request, time.time(), in_text, size, start_from, vectorizer_types, index_types
)
@app.post(
"/extract",
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def extract(request: Request, pdf: UploadFile = File(...)):
"""Given an input of a Statement of Work (SoW) as a PDF,
@@ -242,8 +260,8 @@ async def extract(request: Request, pdf: UploadFile = File(...)):
# with open(file_location, "wb+") as file_object:
# shutil.copyfileobj(pdf.file, file_object)
print({"info": f"file '{pdf.filename}' saved at '{file_location}'"})
- text = extract_prep.parse_text(file_location)
- refs = find_standard_ref(text)
+ text = parse.tika_parse(file_location)
+ refs = parse.find_standard_ref(text)
out = {}
out["embedded_references"] = refs
out["filename"] = pdf.filename
@@ -255,7 +273,8 @@ async def extract(request: Request, pdf: UploadFile = File(...)):
@app.get(
"/standard_info/",
response_class=ORJSONResponse,
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def standard_info(
request: Request,
@@ -274,34 +293,41 @@ async def standard_info(
url: Optional[str] = None,
hash: Optional[str] = None,
size: int = 1,
+ start_from: int = 0,
):
"""Given a standard ID, get standard information from Elasticsearch."""
if id:
res = es.search(
- index=idx_main, body={"size": size, "query": {"match": {"id": id}}}
+ index=idx_main, body={"from": start_from,
+ "size": size, "query": {"match": {"id": id}}}
)
elif raw_id:
res = es.search(
- index=idx_main, body={"size": size, "query": {"match": {"raw_id": raw_id}}}
+ index=idx_main, body={"from": start_from, "size": size, "query": {
+ "match": {"raw_id": raw_id}}}
)
elif isbn:
res = es.search(
- index=idx_main, body={"size": size, "query": {"match": {"isbn": isbn}}}
+ index=idx_main, body={"from": start_from,
+ "size": size, "query": {"match": {"isbn": isbn}}}
)
elif doc_number:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"doc_number": doc_number}}},
+ body={"from": start_from, "size": size, "query": {
+ "match": {"doc_number": doc_number}}},
)
elif status:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"status": status}}},
+ body={"from": start_from, "size": size,
+ "query": {"match": {"status": status}}},
)
elif technical_committee:
res = es.search(
index=idx_main,
body={
+ "from": start_from,
"size": size,
"query": {"match": {"technical_committee": technical_committee}},
},
@@ -309,43 +335,51 @@ async def standard_info(
elif published_date:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"published_date": published_date}}},
+ body={"from": start_from, "size": size, "query": {
+ "match": {"published_date": published_date}}},
)
elif ingestion_date:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"ingestion_date": ingestion_date}}},
+ body={"from": start_from, "size": size, "query": {
+ "match": {"ingestion_date": ingestion_date}}},
)
elif title:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"title": title}}},
+ body={"from": start_from, "size": size,
+ "query": {"match": {"title": title}}},
)
elif sdo:
# res = es.search(index=idx_main, body={"query": {"exists": {"field": sdo_key}}})
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"sdo.abbreviation": sdo}}},
+ body={"from": start_from, "size": size, "query": {
+ "match": {"sdo.abbreviation": sdo}}},
)
elif category:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"category": category}}},
+ body={"from": start_from, "size": size,
+ "query": {"match": {"category": category}}},
)
elif text:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"text": text}}},
+ body={"from": start_from, "size": size,
+ "query": {"match": {"text": text}}},
)
elif url:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"url": url}}},
+ body={"from": start_from, "size": size,
+ "query": {"match": {"url": url}}},
)
elif hash:
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"hash": hash}}},
+ body={"from": start_from, "size": size,
+ "query": {"match": {"hash": hash}}},
)
# print("Got %d Hits:" % res['hits']['total']['value'])
results = {}
@@ -359,15 +393,17 @@ async def standard_info(
@app.get(
"/search/{searchq}",
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def search(
- request: Request, searchq: str = Field(example="Airplanes"), size: int = 10
+ request: Request, searchq: str = Field(example="Airplanes"), size: int = 10, start_from: int = 0,
):
"""Search elasticsearch using text."""
res = es.search(
index=idx_main,
- body={"size": size, "query": {"match": {"description": searchq}}},
+ body={"from": start_from, "size": size, "query": {
+ "match": {"description": searchq}}},
)
# print("Got %d Hits:" % res['hits']['total']['value'])
results = {}
@@ -380,10 +416,12 @@ async def search(
@app.post(
"/add_standards",
response_class=HTMLResponse,
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def add_standards(request: Request, doc: dict):
"""Add standards to the main Elasticsearch index by PUTTING a JSON request here."""
+ validate(instance=doc, schema=data_schema)
res = es.index(index=idx_main, body=json.dumps(doc))
print(res)
json_compatible_item_data = jsonable_encoder(doc)
@@ -394,10 +432,12 @@ async def add_standards(request: Request, doc: dict):
@app.put(
"/edit_standards",
response_class=HTMLResponse,
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def edit_standards(request: Request, doc: dict):
"""Add standards to the main Elasticsearch index by PUTTING a JSON request here."""
+ validate(instance=doc, schema=data_schema)
res = es.search(
index="assess_remap",
query={"match": {"id": doc["id"]}},
@@ -413,7 +453,8 @@ async def edit_standards(request: Request, doc: dict):
@app.delete(
"/delete_standards",
response_class=HTMLResponse,
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def delete_standards(request: Request, id: str):
"""Delete standards to the main Elasticsearch index by PUTTING a JSON request here."""
@@ -430,7 +471,8 @@ async def delete_standards(request: Request, id: str):
@app.post(
"/select_standards",
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def select_standards(request: Request, selected: dict):
"""After a use likes a standard, this endpoint captures the selected standards into the database."""
@@ -452,7 +494,8 @@ async def select_standards(request: Request, selected: dict):
@app.put(
"/set_standards",
- dependencies=[Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
+ dependencies=[
+ Depends(RateLimiter(times=rate_times, seconds=rate_seconds))],
)
async def set_standards(request: Request, set_standards: dict):
"""Validate and set preference of standards (done by Admin)."""
diff --git a/api/extraction.py b/api/ml_core.py
similarity index 90%
rename from api/extraction.py
rename to api/ml_core.py
index 71f01e3..070340b 100644
--- a/api/extraction.py
+++ b/api/ml_core.py
@@ -91,7 +91,7 @@ def load_into_memory(index_types, vectorizer_types):
def train(es, index_types, vectorizer_types):
# ==== train vectorizers (needs to train on all standards in the corpus)
- list_of_texts = get_list_of_text(es)
+ ES_ids, list_of_texts = get_list_of_text(es)
vectorizers = PluginCollection()
print("\nTraining Vectorizers...")
for vectorizer_type in vectorizer_types:
@@ -107,7 +107,7 @@ def train(es, index_types, vectorizer_types):
list_of_texts, type=vectorizer_type, vectorizers=vectorizers
)
# TODO: remove line
- ES_ids = list(range(len(vectors))) # using dummy values
+ # ES_ids = list(range(len(vectors))) # using dummy values
vector_storage.apply(
"plugins.Vector_Storage",
"basic",
@@ -142,10 +142,10 @@ def train(es, index_types, vectorizer_types):
def predict(
sow,
n,
+ start_from,
vectorizers,
vector_storage,
vector_indexes,
- list_of_texts,
vectorizer_types,
index_types,
):
@@ -176,7 +176,7 @@ def predict(
# iso_data = pd.read_feather("data/feather_text")
# for ES_id in top_n_ES_ids[:n]:
# print(ES_id, list_of_texts[ES_id])
- return top_n_ES_ids[:n], scores.tolist()
+ return top_n_ES_ids[start_from:n], scores.tolist()
def get_list_of_text(es=None):
@@ -186,27 +186,32 @@ def get_list_of_text(es=None):
# print(df.columns)
# TODO: get this information from the text column.
# return the text and the elasticsearch ids
- return list(df["title"] + ". " + df["description"])
+ return list(df["_id"]), list(df["title"] + ". " + df["description"])
def es_to_df(es=None, index="assess_remap", path="data/feather_text"):
if not es:
- df = pd.read_feather("data/feather_text")
+ # ADD FILES YOU WANT TO READ HERE:
+ df1 = pd.read_feather("data/source_1")
+        df2 = pd.read_feather("data/source_2")
+ df = pd.concat([df1, df2])
else:
res = list(scan(es, query={}, index=index))
output_all = deque()
- output_all.extend([x["_source"] for x in res])
+ output_all.extend([((x["_source"]['description']), (x["_source"]['title']), (x["_id"])) for x in res])
+ output_all = [{'description': t[0], 'title': t[1], '_id': t[2]} for t in output_all]
df = json_normalize(output_all)
+ df = df[["_id", "title", "description"]]
return df
if __name__ == "__main__":
- # es = Elasticsearch(http_compress=True)
- es = None
+ es = Elasticsearch(http_compress=True)
+ #es = None
do_training = True
index_types = ["flat", "flat_sklearn"]
vectorizer_types = ["tf_idf"]
- list_of_texts = get_list_of_text(es)
+ ES_ids, list_of_texts = get_list_of_text(es)
# list_of_texts=['computer science', 'space science', 'global summit for dummies', 'deep neural nets', 'technology consultants',
# 'space science', 'global summit for dummies', 'deep neural nets', 'technology consultants',
# '', '', '', '']
@@ -222,10 +227,10 @@ def es_to_df(es=None, index="assess_remap", path="data/feather_text"):
r = predict(
"Computer software and stuff!!!",
10,
+ 0,
vectorizers,
vector_storage,
vector_indexes,
- list_of_texts,
vectorizer_types,
index_types,
)
diff --git a/api/models/graph.zip b/api/models/graph.zip
deleted file mode 100644
index b966da2..0000000
Binary files a/api/models/graph.zip and /dev/null differ
diff --git a/api/models/ics_dict_general b/api/models/ics_dict_general
deleted file mode 100644
index e1de43f..0000000
Binary files a/api/models/ics_dict_general and /dev/null differ
diff --git a/api/models/pos_ b/api/models/pos_
deleted file mode 100644
index 7e2231b..0000000
Binary files a/api/models/pos_ and /dev/null differ
diff --git a/api/plugins/Index/flat_sklearn.py b/api/plugins/Index/flat_sklearn.py
index 5261703..0561cbc 100644
--- a/api/plugins/Index/flat_sklearn.py
+++ b/api/plugins/Index/flat_sklearn.py
@@ -17,7 +17,7 @@ def create_index(self, vectors):
# assuming numpy array
vectors_ = vectors
self.index = NearestNeighbors(
- n_neighbors=vectors_.shape[1], algorithm="brute", metric="cosine"
+ n_neighbors=vectors_.shape[0], algorithm="brute", metric="cosine"
)
self.index.fit(vectors_)
diff --git a/api/plugins/Vector_Storage/basic.py b/api/plugins/Vector_Storage/basic.py
index 6d29567..402bf22 100644
--- a/api/plugins/Vector_Storage/basic.py
+++ b/api/plugins/Vector_Storage/basic.py
@@ -2,7 +2,8 @@
import pickle
import os
import numpy as np
-
+from tqdm import tqdm
+import gc
class Basic(Template):
def __init__(self):
@@ -30,6 +31,7 @@ def _remove_vector(self, id, vec_type):
def _save_to_disk(self):
with open("data/basic_vector_storage.pk", "wb") as storage:
pickle.dump(self.vector_storage, storage)
+ gc.collect()
with open("data/basic_sorted_ids.pk", "wb") as ids:
pickle.dump(self.sorted_ids, ids)
@@ -51,9 +53,12 @@ def clean_storage(self):
os.remove("data/basic_sorted_ids.pk")
def add_update_vectors(self, ids, vectors, vec_type):
- for id, vector in zip(ids, vectors.tolist()):
- self._add_update_vector(id, np.array(vector), vec_type)
+ vectors = np.asarray(vectors)
+ for id, vector in tqdm(zip(ids, vectors), total=len(ids)):
+ self._add_update_vector(id, vector, vec_type)
+ print('writing to disk..')
self._save_to_disk()
+ print('writing complete!')
def remove_vectors(self, ids, vec_type):
        for id in ids:
diff --git a/api/plugins/Vectorizer/tf_idf.py b/api/plugins/Vectorizer/tf_idf.py
index 9cb66b5..ca02694 100644
--- a/api/plugins/Vectorizer/tf_idf.py
+++ b/api/plugins/Vectorizer/tf_idf.py
@@ -13,17 +13,18 @@ class TF_IDF(Template):
def __init__(self):
super().__init__()
self.description = 'Implements the TF-IDF vectorizer'
- self.vectorizer=TfidfVectorizer(tokenizer=identity_func, lowercase=False)
+ self.vectorizer=TfidfVectorizer(tokenizer=identity_func, lowercase=False, max_features=5000)
self.nlp = en_core_web_sm.load()
def train(self, list_of_texts):
- print('Preprocessing text (tokenize, lemmatize and punctuation removal)..')
+ print('Preprocessing text for training (tokenize, lemmatize and punctuation removal)..')
        # we add a dummy token 'lemma_lemma' because the Faiss flat index gives higher matching values to vectors for empty strings than to some more relevant ones!
list_of_texts= [spacy_tokenize_lemmatize_punc_remove(item, self.nlp)+['lemma_lemma'] for item in tqdm(list_of_texts)]
self.vectorizer.fit(list_of_texts)
def vectorize(self, list_of_texts):
        # we add a dummy token 'lemma_lemma' because the Faiss flat index gives higher matching values to vectors for empty strings than to some more relevant ones!
+ print('Preprocessing for vectorization (tokenize, lemmatize and punctuation removal)..')
list_of_texts= [spacy_tokenize_lemmatize_punc_remove(item, self.nlp)+['lemma_lemma'] for item in tqdm(list_of_texts)]
return self.vectorizer.transform(list_of_texts).todense()
diff --git a/api/plugins/Vectorizer/utilities.py b/api/plugins/Vectorizer/utilities.py
index 422c739..bf375bc 100644
--- a/api/plugins/Vectorizer/utilities.py
+++ b/api/plugins/Vectorizer/utilities.py
@@ -24,7 +24,8 @@ def get_BERT_vectors(list_of_texts, model, tokenizer, layers=None, batch_size=10
# when we say paragraph we mean any length of text.
paragraph_idx_to_sentence_idxs = []
all_sentences = []
- for text in list_of_texts:
+ print('creating sentences...')
+ for text in tqdm(list_of_texts):
sentences = list(get_sentences(text, 20))
all_sentences_next_id = len(all_sentences)
paragraph_idx_to_sentence_idxs.append(
@@ -55,6 +56,7 @@ def get_activations(list_of_sentences):
# chunk list of sentences into smaller batches to improve memory and can be parallelized in future.
# WARNING: currently, the Hugginface Model class does not lend itself to parallelization, due to some picklization error!
all_sent_vectors = []
+    print(f'creating vectors for {len(all_sentences)} sentences (processing in batches of {batch_size})...')
for chunked_sentences in tqdm(list(divide_chunks(all_sentences, batch_size))):
chunk_of_vector = get_activations(chunked_sentences)
all_sent_vectors.extend(chunk_of_vector)
@@ -76,13 +78,15 @@ def preprocessor(text):
return text
def spacy_tokenize_lemmatize_punc_remove(text, nlp):
-
+    is_there_digit = re.compile(r"\d")
+ #print("******* THIS IS TEXT DEBUG *******")
+ #print(text, type(text))
processed = nlp(text)
lemma_list = []
for token in processed:
if token.is_stop is False:
token_preprocessed = preprocessor(token.lemma_.lower())
- if token_preprocessed != '':
+ if token_preprocessed != '' and is_there_digit.search(token_preprocessed) is None:
lemma_list.append(token_preprocessed)
return lemma_list
diff --git a/api/ml_requirements.txt b/api/requirements/ml_requirements.txt
similarity index 85%
rename from api/ml_requirements.txt
rename to api/requirements/ml_requirements.txt
index b78a6df..236c04b 100644
--- a/api/ml_requirements.txt
+++ b/api/requirements/ml_requirements.txt
@@ -1,12 +1,12 @@
# python 3.9.5
-tensorflow==2.6.0
+tensorflow==2.7.0
faiss-cpu==1.7.1post2
scikit-learn==0.24.2 # gives memory segmentation with faiss error when version 1.0
-transformers==4.10.0
+transformers==4.15.0
elasticsearch==7.15.1
pandas==1.3.4
pyarrow==5.0.0
-datasets==1.14.0
+datasets==1.17.0
spacy==3.2.0
# install the below manually (jpl.pipedreams has some dependency conflicts with others.)
diff --git a/api/requirements.txt b/api/requirements/requirements.txt
similarity index 100%
rename from api/requirements.txt
rename to api/requirements/requirements.txt
diff --git a/api/assess_api_calls.py b/api/scripts/assess_api_calls.py
similarity index 94%
rename from api/assess_api_calls.py
rename to api/scripts/assess_api_calls.py
index 9d158a0..3cf0505 100644
--- a/api/assess_api_calls.py
+++ b/api/scripts/assess_api_calls.py
@@ -15,7 +15,7 @@ def format_json(jsonText):
def train():
- print("Sending GET request to `/train`.")
+ print("Sending POST request to `/train`.")
r = requests.post(
f"{root}/train",
)
@@ -27,7 +27,7 @@ def recommend_text():
print("Sending GET request to `/recommend_text`.")
jsonLoad = {"text_field": "Example text about airplanes"}
r = requests.post(
- f"{root}/recommend_text?size=10",
+ f"{root}/recommend_text?size=10&start_from=5",
json=jsonLoad,
# auth=HTTPBasicAuth(username, password),
)
@@ -37,9 +37,9 @@ def recommend_text():
def recommend_file():
# Recommend an SoW given a PDF.
# Specify file location of an SOW.
- location = "data/example.pdf"
+ location = "../data/sow.pdf"
file = {"pdf": open(location, "rb")}
- print("Sending GET request to `/recommend_file` with a PDF.")
+ print("Sending POST request to `/recommend_file` with a PDF.")
r = requests.post(
f"{root}/recommend_file", files=file, auth=HTTPBasicAuth(username, password)
)
@@ -135,7 +135,7 @@ def select():
print(format_json(r.text))
-def set():
+def set_standard():
set_standards = {
"username": "test_user",
"standard_id": "x0288b9ed144439f8ad8fa017d604eac",
@@ -191,7 +191,7 @@ def set():
},
}
# Insert username, password, and ASSESS root url into `conf.yaml`
- with open("conf.yaml", "r") as stream:
+ with open("../conf.yaml", "r") as stream:
conf = yaml.safe_load(stream)
username = conf.get("username")
password = conf.get("password")
@@ -202,7 +202,10 @@ def set():
% (username, password, root)
)
recommend_text()
+ # recommend_file()
add()
standard_info()
edit()
- delete()
\ No newline at end of file
+ delete()
+ set_standard()
+ # train()
diff --git a/api/bulk_export.py b/api/scripts/bulk_export.py
similarity index 99%
rename from api/bulk_export.py
rename to api/scripts/bulk_export.py
index 62cdf7b..b9616ec 100644
--- a/api/bulk_export.py
+++ b/api/scripts/bulk_export.py
@@ -169,6 +169,7 @@ def df_to_es(df_path, index, client, overwrite=False, normalize=False):
client.indices.delete(index, ignore=[400, 404])
client.indices.create(index, ignore=400)
df = feather.read_feather(df_path)
+ df.fillna("", inplace=True)
bulk(client, doc_generator(df, index, normalize))
return
diff --git a/api/standard_extractor.py b/api/standard_extractor.py
deleted file mode 100755
index 81a603a..0000000
--- a/api/standard_extractor.py
+++ /dev/null
@@ -1,38 +0,0 @@
-# -*- coding: utf-8 -*-
-
-import re
-import io
-
-standard_orgs={}
-for line in io.open("standards/data/standard_orgs.txt",mode="r", encoding="utf-8").readlines():
- line=line.strip()
- abbr=line.split(' — ')[0]
- name=line.split(' — ')[1]
- standard_orgs[abbr]=name
-
-
-def find_standard_ref(text):
- refs=[]
- # match abbreviations in upper case
- words=text.split()
- for i, word in enumerate(words):
- for k in standard_orgs.keys():
- if k in word:
- # check one word before and after for alphanumeric
-                if i < len(words)-1:
- word_before=words[i+1]
- if bool(re.search(r'\d', word_before)):
- refs.append(word_before+' '+word)
-
- return list(set(refs))
-
-# print(find_standard_ref('(IEC) sdd67'))
\ No newline at end of file
diff --git a/api/standards/es_search.py b/api/standards/es_search.py
deleted file mode 100644
index 73f638a..0000000
--- a/api/standards/es_search.py
+++ /dev/null
@@ -1,54 +0,0 @@
-from elasticsearch import Elasticsearch
-import json
-from elasticsearch_dsl import Search
-import requests
-
-
-# es = Elasticsearch(["172.19.0.2"])
-# es_index = "iso_final_clean"
-es = Elasticsearch()
-es_index = "test-csv"
-
-search = Search(using=es)
-
-def search_test(uri, term):
- """Simple Elasticsearch Query"""
- query = json.dumps({
- "query": {
- "match": {
- "content": term
- }
- }
- })
- response = requests.get(uri, data=query)
- results = json.loads(response.text)
- return results
-
-
-def client_search(searchq, n):
- return es.search(index=es_index, body={"query": {"match": {"description":searchq}}}, size=n)
-
-
-def search_by_text(searchq, n=10):
- res = es.search(index=es_index, body={"size": n, "query": {"match": {"description":searchq}}})
- print("Got %d Hits:" % res['hits']['total']['value'])
- results = {}
- for num, hit in enumerate(res['hits']['hits']):
- results[num+1] = hit["_source"]
- json_object = json.dumps(results, indent=4)
- return json_object
-
-def search_by_id(searchq, n=10):
- res = es.search(index=es_index, body={"size": n, "query": {"match": {"num_id":searchq}}})
- print("Got %d Hits:" % res['hits']['total']['value'])
- results = {}
- for num, hit in enumerate(res['hits']['hits']):
- results[num+1] = hit["_source"]
- json_object = json.dumps(results, indent=4)
- return json_object
-
-
-#client_search('localhost:9200/test-csv', 'airplanes')
-print(search_by_text("machine", 2))
-print(search_by_id("22", 1))
-print(client_search("test", 3))
diff --git a/api/standards/scripts/iso_data_prep.py b/api/standards/scripts/iso_data_prep.py
deleted file mode 100644
index 336de05..0000000
--- a/api/standards/scripts/iso_data_prep.py
+++ /dev/null
@@ -1,185 +0,0 @@
-import pandas as pd
-import dill
-
-def savemodel(model,outfile):
- with open(outfile, 'wb') as output:
- dill.dump(model, output)
- return ''
-
-def loadmodel(infile):
- model=''
- with open(infile, 'rb') as inp:
- model = dill.load(inp)
- return model
-
-
-def getID(key):
- global counter
- global ics_map_old_new
- if key not in ics_map_old_new.keys():
- counter += 1
- ics_map_old_new[key] = counter
-
- return ics_map_old_new[key]
-
-def process(a, col, lst=[]):
- a=str(a)
- if (len(lst)==0 or col in lst) and a!='':
- # because when saving to csv 43.060 becomes 43.06!!
- return '~'+str(a)
- return a
-
-"""
-# ===================
-ics categories have an ID, which can be separated into field, group, sub_group and standard.
-"""
-
-
-ics_path='ics.csv' # contains data about how the
-
-df_ics=pd.read_csv(ics_path)
-
-# create the class tree. Have a mapping to actual names of the categories (can get from querying the ics.csv directly).
-df_ics_seperated=pd.DataFrame()
-
-ics_dict={-1:[]} # create a taxonomy to input into the Hclassif algo only top two levels
-ics_dict_general={-1:[]} # create a general tree (to clearly show the heirachial structure of ics)
-for i, row in df_ics.iterrows():
-
- new_row={}
- type=''
- code=row['code']
- field,group,sub_group,standard,new_field, new_group, new_sub_group, new_standard='','','','','','','',''
-
- code = code.split('.ISO')
- code_ = code[0].split('.')
-
- if len(code_) >= 2:
- field = code_[1]
- type = 'field'
- new_field = getID(field)
- ics_dict[-1].append(new_field)
- new_row['id'] = field
- new_row['id_'] = new_field
- ics_dict_general[-1].append(new_field)
-
-
- if len(code_) >= 3:
- group = code_[1] + '.' + code_[2]
- type = 'group'
- new_group = getID(group)
-
- if new_field not in ics_dict.keys():
- ics_dict[new_field] = []
- ics_dict[new_field].append(new_group)
-
- if new_field not in ics_dict_general.keys():
- ics_dict_general[new_field] = []
- ics_dict_general[new_field].append(new_group)
-
- new_row['id'] = group
- new_row['id_'] = new_group
-
-
- if len(code_) >= 4:
- sub_group = code_[1] + '.' + code_[2] + '.' + code_[3]
- type = 'subgroup'
- new_sub_group = getID(sub_group)
-
- if new_group not in ics_dict_general.keys():
- ics_dict_general[new_group] = []
- ics_dict_general[new_group].append(new_sub_group)
-
- new_row['id'] = sub_group
- new_row['id_'] = new_sub_group
-
- if len(code) > 1:
- standard = 'ISO' + code[1]
- new_standard = getID(standard)
- if type=='field':
- if new_field not in ics_dict_general.keys():
- ics_dict_general[new_field] = []
- ics_dict_general[new_field].append(new_standard)
- if type=='group':
- if new_group not in ics_dict_general.keys():
- ics_dict_general[new_group] = []
- ics_dict_general[new_group].append(new_standard)
- if type=='subgroup':
- if new_sub_group not in ics_dict_general.keys():
- ics_dict_general[new_sub_group] = []
- ics_dict_general[new_sub_group].append(new_standard)
- type = 'standard'
-
-
- new_row['id'] = standard
- new_row['id_'] = new_standard
-
-
-
- new_row['field'] = field
- new_row['new_field'] = new_field
- new_row['group'] = group
- new_row['new_group'] = new_group
- new_row['subgroup'] = sub_group
- new_row['new_subgroup'] = new_sub_group
- new_row['standard'] = standard
- new_row['new_standard'] = new_standard
- new_row['code'] = row['code']
- new_row['link'] = row['link']
- new_row['title'] = row['title']
- new_row['type']=type
-
- to_process_list=['field','new_field','group','new_group','subgroup','new_subgroup','standard','new_standard','code','id','id_']
- new_row={k:process(v, k, to_process_list) for k, v in new_row.items()}
-
-
-
- df_ics_seperated=df_ics_seperated.append(new_row, ignore_index=True)
- print(i)
-
-ics_dict_general_={}
-for k,v in ics_dict_general.items():
- k=process(k,'')
- v=list(set(v))
- v=[process(item,'') for item in v]
- ics_dict_general_[k]=v
-ics_dict_general=ics_dict_general_
-
-df_ics_seperated.to_csv('ics_separated.csv') # save all things into a csv so that later on lables, ics labels and names could be correlated
-savemodel(ics_dict,'ics_dict')
-savemodel(ics_dict_general,'ics_dict_general')
-savemodel(ics_map_old_new,'ics_map_old_new')
-
-
-
-
-"""
-# ===================
-merge the ics data with the iso standards
-"""
-df_ics_seperated=pd.read_csv('ics_separated.csv', index_col=0)
-json_to_csv=pd.read_csv('json_to_csv.csv', index_col=0) # this is the csv version of 'iso_flat.json', contains the iso standards metadata
-
-df_final_all=pd.DataFrame()
-counter=0
-for _, row in df_ics_seperated.iterrows():
- counter+=1
- print(counter)
- entry=json_to_csv[json_to_csv['url'] == row['link']].values
- new_row={}
-
- # merge
- if len(entry)!=0:
- for k, v in zip(json_to_csv.columns, entry[0]):
- new_row[k]=v
-
- for k, v in dict(row).items():
- new_row[k]=v
-
-
-
- df_final_all=df_final_all.append(new_row, ignore_index=True)
-
-df_final_all.to_csv('iso_final_all.csv')
-
-# todo: fix the paths for files
diff --git a/api/standards_extraction/LICENSE b/api/standards_extraction/LICENSE
deleted file mode 100755
index 8dada3e..0000000
--- a/api/standards_extraction/LICENSE
+++ /dev/null
@@ -1,201 +0,0 @@
- Apache License
- Version 2.0, January 2004
- http://www.apache.org/licenses/
-
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
- 1. Definitions.
-
- "License" shall mean the terms and conditions for use, reproduction,
- and distribution as defined by Sections 1 through 9 of this document.
-
- "Licensor" shall mean the copyright owner or entity authorized by
- the copyright owner that is granting the License.
-
- "Legal Entity" shall mean the union of the acting entity and all
- other entities that control, are controlled by, or are under common
- control with that entity. For the purposes of this definition,
- "control" means (i) the power, direct or indirect, to cause the
- direction or management of such entity, whether by contract or
- otherwise, or (ii) ownership of fifty percent (50%) or more of the
- outstanding shares, or (iii) beneficial ownership of such entity.
-
- "You" (or "Your") shall mean an individual or Legal Entity
- exercising permissions granted by this License.
-
- "Source" form shall mean the preferred form for making modifications,
- including but not limited to software source code, documentation
- source, and configuration files.
-
- "Object" form shall mean any form resulting from mechanical
- transformation or translation of a Source form, including but
- not limited to compiled object code, generated documentation,
- and conversions to other media types.
-
- "Work" shall mean the work of authorship, whether in Source or
- Object form, made available under the License, as indicated by a
- copyright notice that is included in or attached to the work
- (an example is provided in the Appendix below).
-
- "Derivative Works" shall mean any work, whether in Source or Object
- form, that is based on (or derived from) the Work and for which the
- editorial revisions, annotations, elaborations, or other modifications
- represent, as a whole, an original work of authorship. For the purposes
- of this License, Derivative Works shall not include works that remain
- separable from, or merely link (or bind by name) to the interfaces of,
- the Work and Derivative Works thereof.
-
- "Contribution" shall mean any work of authorship, including
- the original version of the Work and any modifications or additions
- to that Work or Derivative Works thereof, that is intentionally
- submitted to Licensor for inclusion in the Work by the copyright owner
- or by an individual or Legal Entity authorized to submit on behalf of
- the copyright owner. For the purposes of this definition, "submitted"
- means any form of electronic, verbal, or written communication sent
- to the Licensor or its representatives, including but not limited to
- communication on electronic mailing lists, source code control systems,
- and issue tracking systems that are managed by, or on behalf of, the
- Licensor for the purpose of discussing and improving the Work, but
- excluding communication that is conspicuously marked or otherwise
- designated in writing by the copyright owner as "Not a Contribution."
-
- "Contributor" shall mean Licensor and any individual or Legal Entity
- on behalf of whom a Contribution has been received by Licensor and
- subsequently incorporated within the Work.
-
- 2. Grant of Copyright License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- copyright license to reproduce, prepare Derivative Works of,
- publicly display, publicly perform, sublicense, and distribute the
- Work and such Derivative Works in Source or Object form.
-
- 3. Grant of Patent License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- (except as stated in this section) patent license to make, have made,
- use, offer to sell, sell, import, and otherwise transfer the Work,
- where such license applies only to those patent claims licensable
- by such Contributor that are necessarily infringed by their
- Contribution(s) alone or by combination of their Contribution(s)
- with the Work to which such Contribution(s) was submitted. If You
- institute patent litigation against any entity (including a
- cross-claim or counterclaim in a lawsuit) alleging that the Work
- or a Contribution incorporated within the Work constitutes direct
- or contributory patent infringement, then any patent licenses
- granted to You under this License for that Work shall terminate
- as of the date such litigation is filed.
-
- 4. Redistribution. You may reproduce and distribute copies of the
- Work or Derivative Works thereof in any medium, with or without
- modifications, and in Source or Object form, provided that You
- meet the following conditions:
-
- (a) You must give any other recipients of the Work or
- Derivative Works a copy of this License; and
-
- (b) You must cause any modified files to carry prominent notices
- stating that You changed the files; and
-
- (c) You must retain, in the Source form of any Derivative Works
- that You distribute, all copyright, patent, trademark, and
- attribution notices from the Source form of the Work,
- excluding those notices that do not pertain to any part of
- the Derivative Works; and
-
- (d) If the Work includes a "NOTICE" text file as part of its
- distribution, then any Derivative Works that You distribute must
- include a readable copy of the attribution notices contained
- within such NOTICE file, excluding those notices that do not
- pertain to any part of the Derivative Works, in at least one
- of the following places: within a NOTICE text file distributed
- as part of the Derivative Works; within the Source form or
- documentation, if provided along with the Derivative Works; or,
- within a display generated by the Derivative Works, if and
- wherever such third-party notices normally appear. The contents
- of the NOTICE file are for informational purposes only and
- do not modify the License. You may add Your own attribution
- notices within Derivative Works that You distribute, alongside
- or as an addendum to the NOTICE text from the Work, provided
- that such additional attribution notices cannot be construed
- as modifying the License.
-
- You may add Your own copyright statement to Your modifications and
- may provide additional or different license terms and conditions
- for use, reproduction, or distribution of Your modifications, or
- for any such Derivative Works as a whole, provided Your use,
- reproduction, and distribution of the Work otherwise complies with
- the conditions stated in this License.
-
- 5. Submission of Contributions. Unless You explicitly state otherwise,
- any Contribution intentionally submitted for inclusion in the Work
- by You to the Licensor shall be under the terms and conditions of
- this License, without any additional terms or conditions.
- Notwithstanding the above, nothing herein shall supersede or modify
- the terms of any separate license agreement you may have executed
- with Licensor regarding such Contributions.
-
- 6. Trademarks. This License does not grant permission to use the trade
- names, trademarks, service marks, or product names of the Licensor,
- except as required for reasonable and customary use in describing the
- origin of the Work and reproducing the content of the NOTICE file.
-
- 7. Disclaimer of Warranty. Unless required by applicable law or
- agreed to in writing, Licensor provides the Work (and each
- Contributor provides its Contributions) on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- implied, including, without limitation, any warranties or conditions
- of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Work and assume any
- risks associated with Your exercise of permissions under this License.
-
- 8. Limitation of Liability. In no event and under no legal theory,
- whether in tort (including negligence), contract, or otherwise,
- unless required by applicable law (such as deliberate and grossly
- negligent acts) or agreed to in writing, shall any Contributor be
- liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a
- result of this License or out of the use or inability to use the
- Work (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all
- other commercial damages or losses), even if such Contributor
- has been advised of the possibility of such damages.
-
- 9. Accepting Warranty or Additional Liability. While redistributing
- the Work or Derivative Works thereof, You may choose to offer,
- and charge a fee for, acceptance of support, warranty, indemnity,
- or other liability obligations and/or rights consistent with this
- License. However, in accepting such obligations, You may act only
- on Your own behalf and on Your sole responsibility, not on behalf
- of any other Contributor, and only if You agree to indemnify,
- defend, and hold each Contributor harmless for any liability
- incurred by, or claims asserted against, such Contributor by reason
- of your accepting any such warranty or additional liability.
-
- END OF TERMS AND CONDITIONS
-
- APPENDIX: How to apply the Apache License to your work.
-
- To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "{}"
- replaced with your own identifying information. (Don't include
- the brackets!) The text should be enclosed in the appropriate
- comment syntax for the file format. We also recommend that a
- file or class name and description of purpose be included on the
- same "printed page" as the copyright notice for easier
- identification within third-party archives.
-
- Copyright {yyyy} {name of copyright owner}
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
diff --git a/api/standards_extraction/README.md b/api/standards_extraction/README.md
deleted file mode 100755
index c4c66ff..0000000
--- a/api/standards_extraction/README.md
+++ /dev/null
@@ -1,143 +0,0 @@
-# StandardsExtractingContentHandler
-
-Apache Tika currently provides many _ContentHandler_ which help to de-obfuscate specific types of information from text. For instance, the `PhoneExtractingContentHandler` is used to extract phone numbers while parsing.
-
-This improvement adds the **`StandardsExtractingContentHandler`** to Tika, a new ContentHandler that relies on regular expressions in order to identify and extract standard references from text.
-Basically, a standard reference is just a reference to a norm/convention/requirement (i.e., a standard) released by a standard organization. This work is mainly focused on identifying and extracting the references to the standards already cited within a given document (e.g., SOW/PWS) so that the references can be stored and provided to the user as additional metadata when the `StandardsExtractingContentHandler` is used.
-
-In addition to the patch, the first version of the `StandardsExtractingContentHandler`, along with an example class to easily execute the handler, is available on [GitHub](https://github.com/giuseppetotaro/StandardsExtractingContentHandler). The following sections describe in more detail how the `StandardsExtractingContentHandler` has been developed.
-
-All the details are reported on Jira ([TIKA-2449](https://issues.apache.org/jira/browse/TIKA-2449)).
-
-## Getting Started
-
-To build StandardsExtractingContentHandler, you can run the following bash script:
-
-```
-./build.sh
-```
-
-To extract standard references by using the StandardsExtractingContentHandler, you can run the following bash script:
-
-```
-./run.sh /path/to/input threshold
-```
-
-For instance, by running `./run.sh ./example/SOW-TacCOM.pdf 0.75` you can get the standard references from the [SOW-TacCOM.pdf](https://foiarr.cbp.gov/streamingWord.asp?i=607) file using a threshold of 0.75 along with the scope and other metadata:
-
-```
-{
- "Author": "BAF107S",
- "Content-Type": "application/pdf",
- "Creation-Date": "2011-04-21T21:36:36Z",
- "Last-Modified": "2011-06-14T22:04:53Z",
- "Last-Save-Date": "2011-06-14T22:04:53Z",
- "X-Parsed-By": [
- "org.apache.tika.parser.DefaultParser",
- "org.apache.tika.parser.pdf.PDFParser"
- ],
- "access_permission:assemble_document": "false",
- "access_permission:can_modify": "false",
- "access_permission:can_print": "true",
- "access_permission:can_print_degraded": "false",
- "access_permission:extract_content": "false",
- "access_permission:extract_for_accessibility": "true",
- "access_permission:fill_in_form": "false",
- "access_permission:modify_annotations": "false",
- "created": "Thu Apr 21 14:36:36 PDT 2011",
- "creator": "BAF107S",
- "date": "2011-06-14T22:04:53Z",
- "dc:creator": "BAF107S",
- "dc:format": "application/pdf; version\u003d1.7",
- "dc:title": "Microsoft Word - SOW HSBP1010C00056.doc",
- "dcterms:created": "2011-04-21T21:36:36Z",
- "dcterms:modified": "2011-06-14T22:04:53Z",
- "meta:author": "BAF107S",
- "meta:creation-date": "2011-04-21T21:36:36Z",
- "meta:save-date": "2011-06-14T22:04:53Z",
- "modified": "2011-06-14T22:04:53Z",
- "pdf:PDFVersion": "1.7",
- "pdf:docinfo:created": "2011-04-21T21:36:36Z",
- "pdf:docinfo:creator": "BAF107S",
- "pdf:docinfo:creator_tool": "PScript5.dll Version 5.2.2",
- "pdf:docinfo:modified": "2011-06-14T22:04:53Z",
- "pdf:docinfo:producer": "Acrobat Distiller 9.3.3 (Windows)",
- "pdf:docinfo:title": "Microsoft Word - SOW HSBP1010C00056.doc",
- "pdf:encrypted": "true",
- "producer": "Acrobat Distiller 9.3.3 (Windows)",
- "scope": "\n \nThe purpose of this SOW is to describe the products and services that the Contractor \nwill provide to the CBP, Office of Information and Technology’s (OIT), Enterprise \n\n\n\nCBP TACCOM LMR Deployment Equipment and Services - Houlton Ref. No.____________ \nSource Selection Sensitive Information – See FAR 2.101 and 3.104 \n\n \n\n \nUpdated 02/25/10 2 \n\nSource Selection Sensitive Information – See FAR 2.101 and 3.104 \n \n\nNetworks and Technology Support (ENTS), Wireless Technology Programs (WTP) \nTACCOM Project in support of the TACCOM system modernization in the Houlton, \nMaine Focus Area 1. \n \nThe Contractor shall provide LMR Equipment, Development, Deployment and Support \nas needed in support of CBP’s LMR network and systems. LMR Equipment, \nDevelopment, Deployment and Support includes, but is not limited to: assistance in \nengineering design and analysis, site development, equipment configuration, system \ninstallation, system testing, training, warehousing, transportation, field operations \nsupport, and equipment and material supply as called for within this SOW. \n \nThe equipment and services requested under this SOW will be applied in coordination \nwith the Government Contracting Officer’s Technical Representative (COTR), and/or the \nCOTR-designated Task Monitor(s). \n \n \n",
- "standard_references": [
- "ANSI/TIA 222-G",
- "TIA/ANSI 222-G-1",
- "FIPS 140-2",
- "FIPS 197"
- ],
- "title": "Microsoft Word - SOW HSBP1010C00056.doc",
- "xmp:CreatorTool": "PScript5.dll Version 5.2.2",
- "xmpMM:DocumentID": "uuid:13a50f6e-93d9-42f0-b939-eb9aa2c15426",
- "xmpTPg:NPages": "46"
-}
-```
-
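-The JSON output above can also be consumed programmatically. The following is a minimal Python sketch (an illustration only, not part of this module) that runs the script and collects the `standard_references` field; the helper name, the paths, and the default threshold are assumptions:
-
-```
-import json
-import subprocess
-
-
-def extract_references(pdf_path, threshold=0.75):
-    # Run the extractor and parse its pretty-printed JSON output (hypothetical helper).
-    output = subprocess.check_output(["./run.sh", pdf_path, str(threshold)])
-    metadata = json.loads(output)
-    # The field may be absent if no reference scored above the threshold.
-    return metadata.get("standard_references", [])
-
-
-# e.g. extract_references("./example/SOW-TacCOM.pdf")
-```
-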
-## Background
-
-From a technical perspective, a standard reference is a string that is usually composed of two parts:
-1. the name of the standard organization;
-2. the alphanumeric identifier of the standard within the organization.
-
-Specifically, the first part can include the acronym or the full name of the standard organization, or even both, and the second part is an alphanumeric string, possibly containing one or more separation symbols (e.g., "-", "_", ".") depending on the format adopted by the organization, that represents the identifier of the standard within the organization.
-
-Furthermore, the standard references are usually reported within the "Applicable Documents" or "References" section of a SOW, and they can also be cited within sections whose headers include the word "standard", "requirement", "guideline", or "compliance".
-
-Consequently, the citation of standard references within a SOW/PWS document can be summarized by the following rules:
-* **RULE 1**: standard references are usually reported within the section named "Applicable Documents" or "References".
-* **RULE 2**: standard references can also be cited within sections including the word "compliance" or another semantically-equivalent word in their name.
-* **RULE 3**: a standard reference is composed of two parts:
- * Name of the standard organization (acronym, full name, or both).
- * Alphanumeric identifier of the standard within the organization.
-* **RULE 4**: The name of the standard organization includes the acronym or the full name or both. The name must belong to the set of standard organizations `S = O U V`, where `O` represents the set of open standard organizations (e.g., ANSI) and `V` represents the set of vendor-specific standard organizations (e.g., Motorola).
-* **RULE 5**: A separation symbol (e.g., "-", "_", "." or whitespace) can be used between the name of the standard organization and the alphanumeric identifier.
-* **RULE 6**: The alphanumeric identifier of the standard is composed of alphabetic and numeric characters, possibly split in two or more parts by a separation symbol (e.g., "-", "_", ".").
-
-On the basis of the above rules, here are some examples of formats used for reporting standard references within a SOW/PWS:
-* `<organization acronym><separation symbol><identifier>`
-* `<organization acronym> (<organization full name>)<separation symbol><identifier>`
-* `<organization full name> (<organization acronym>)<separation symbol><identifier>`
-
-Moreover, some standards are released jointly by two standard organizations. In this case, the standard reference can be reported as follows:
-* `<organization 1>/<organization 2><separation symbol><identifier>`
-
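-To make the rules above concrete, the following Python sketch (an illustration only; the pattern below is a simplified placeholder, not the expressions used by the handler, which are listed in the next section) splits a few references into the organization part and the identifier part:
-
-```
-import re
-
-# Simplified illustration of RULES 3-6: one or two organization acronyms,
-# an optional separation symbol, and an alphanumeric identifier.
-REFERENCE = re.compile(
-    r"(?P<org>[A-Z]{2,}(?:/[A-Z]{2,})?)"    # e.g. ANSI or ANSI/TIA
-    r"[\s\-_.]?"                            # optional separation symbol
-    r"(?P<id>\d[\w.\-]*|[A-Z]*\d[\w.\-]*)"  # alphanumeric identifier
-)
-
-for ref in ["ANSI/TIA 222-G", "FIPS 140-2", "ISO 9001"]:
-    m = REFERENCE.match(ref)
-    print(m.group("org"), "->", m.group("id"))
-```
-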
-## Regular Expressions
-
-The `StandardsExtractingContentHandler` uses a helper class named `StandardsText` that relies on Java regular expressions and provides some methods to identify headers and standard references, and determine the score of the references found within the given text.
-
-Here are the main regular expressions used within the `StandardsText` class:
-* **REGEX_HEADER**: regular expression to match only uppercase headers.
- ```
- (\d+\.(\d+\.?)*)\p{Blank}+([A-Z]+(\s[A-Z]+)*){5,}
- ```
-* **REGEX_APPLICABLE_DOCUMENTS**: regular expression to match the header of "APPLICABLE DOCUMENTS" and equivalent sections.
- ```
- (?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)
- ```
-* **REGEX_FALLBACK**: regular expression to match a string that is supposed to be a standard reference.
- ```
-  \(?(?<mainOrganization>[A-Z]\w+)\)?((\s?(?<separator>\/)\s?)(\w+\s)*\(?(?<secondOrganization>[A-Z]\w+)\)?)?(\s(Publication|Standard))?(-|\s)?(?<identifier>([0-9]{3,}|([A-Z]+(-|_|\.)?[0-9]{2,}))((-|_|\.)?[A-Z0-9]+)*)
- ```
-* **REGEX_STANDARD**: regular expression to match the standard organization within a string potentially representing a standard reference.
-  This regular expression is obtained by using a helper class named `StandardOrganizations` that provides a list of the most important standard organizations reported on [Wikipedia](https://en.wikipedia.org/wiki/List_of_technical_standard_organisations). Basically, the list is composed of international standard organizations, regional standard organizations, and, among the nationally-based ones, the American and British standard organizations. Other lists of standard organizations are reported on [OpenStandards](http://www.openstandards.net/viewOSnet2C.jsp?showModuleName=Organizations) and [IBR Standards Portal](https://ibr.ansi.org/Standards/).
-
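-These expressions can also be tried outside of Java. Below is a rough Python re-formulation of REGEX_APPLICABLE_DOCUMENTS, used only to check whether a header looks like an "Applicable Documents"-equivalent section (an approximation for illustration; the handler itself runs the Java patterns above):
-
-```
-import re
-
-# Approximate Python equivalent of REGEX_APPLICABLE_DOCUMENTS.
-APPLICABLE = re.compile(
-    r"(?i:.*APPLICABLE\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)"
-)
-
-for header in ["2.0 APPLICABLE DOCUMENTS", "3.1 COMPLIANCE", "4.0 DELIVERABLES"]:
-    print(header, "->", bool(APPLICABLE.search(header)))
-```
-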
-## How To Use The Standards Extraction Capability
-
-The standard reference identification performed by the `StandardsExtractingContentHandler` is based on the following steps (see also the [flow chart](https://issues.apache.org/jira/secure/attachment/12885939/flowchart_standards_extraction_v02.png); a small sketch of the scoring logic follows the list):
-1. searches for headers;
-2. searches for patterns that are supposed to be standard references (basically, every string mostly composed of uppercase letters followed by alphanumeric characters);
-3. each potential standard reference starts with score equal to 0.25;
-4. increases by 0.25 the score of references which include the name of a known standard organization;
-5. increases by 0.25 the score of references which include the word "Publication" or "Standard";
-6. increases by 0.25 the score of references which have been found within "Applicable Documents" and equivalent sections;
-7. returns the standard references along with scores;
-8. adds the standard references as additional metadata.
-
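-The scoring scheme described in steps 3-6 can be summarized as a small function. The sketch below is a minimal Python rendition of the same idea (the real logic lives in the Java `StandardsText` class; the boolean arguments here are hypothetical stand-ins for the checks it performs):
-
-```
-def score_reference(known_org, has_type_keyword, in_applicable_section):
-    """Score a candidate standard reference as in steps 3-6 above."""
-    score = 0.25                   # step 3: every candidate starts at 0.25
-    if known_org:                  # step 4: known standard organization
-        score += 0.25
-    if has_type_keyword:           # step 5: contains "Publication" or "Standard"
-        score += 0.25
-    if in_applicable_section:      # step 6: found in an "Applicable Documents" section
-        score += 0.25
-    return score
-
-
-# A candidate is kept when its score reaches the threshold passed to run.sh, e.g.
-# score_reference(True, False, True) >= 0.75
-```
-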
-The unit test is implemented within the **`StandardsExtractingContentHandlerTest`** class and extracts the standard references from a SOW downloaded from the [FOIA Library](https://foiarr.cbp.gov/streamingWord.asp?i=607). This SOW is also provided on [Jira](https://issues.apache.org/jira/secure/attachment/12884323/SOW-TacCOM.pdf).
-
-**`StandardsExtractionExample`** is a class that demonstrates how to use the `StandardsExtractingContentHandler` to get a list of the standard references from every file in a directory.
diff --git a/api/text_analysis/__init__.py b/api/standards_extraction/__init__.py
similarity index 100%
rename from api/text_analysis/__init__.py
rename to api/standards_extraction/__init__.py
diff --git a/api/standards_extraction/build.sh b/api/standards_extraction/build.sh
deleted file mode 100755
index e4b66e0..0000000
--- a/api/standards_extraction/build.sh
+++ /dev/null
@@ -1,26 +0,0 @@
-#!/bin/bash
-#
-# Script : build.sh
-# Usage : ./build.sh
-# Author : Giuseppe Totaro
-# Date : 09/26/2017 [MM-DD-YYYY]
-# Last Edited:
-# Description: This script compiles all .java files.
-# Notes :
-#
-
-export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
-
-if [ ! -e lib/tika-app-1.16.jar ]
-then
- echo "Error: this program requires Apache Tika 1.16!"
- echo "Please provide \"tika-app-1.16.jar\" file in the \"lib\" folder and try again."
- exit 1
-fi
-
-mkdir -p bin
-
-for file in $(find . -name "*.java" -print)
-do
- javac -cp ./:./lib/tika-app-1.16.jar:./lib/junit-4.12.jar:./src -d ./bin ${file}
-done
diff --git a/api/standards_extraction/lib/junit-4.12.jar b/api/standards_extraction/lib/junit-4.12.jar
deleted file mode 100755
index 3a7fc26..0000000
Binary files a/api/standards_extraction/lib/junit-4.12.jar and /dev/null differ
diff --git a/api/standards_extraction/parse.py b/api/standards_extraction/parse.py
new file mode 100755
index 0000000..4b586be
--- /dev/null
+++ b/api/standards_extraction/parse.py
@@ -0,0 +1,61 @@
+# -*- coding: utf-8 -*-
+
+import re
+import io
+import os
+import subprocess
+
+
+def tika_parse(pdf):
+    filepath = "./data/" + pdf.filename
+    if os.path.exists(filepath + "_parsed.txt"):
+        # todo: remove this. Caches the parsed text.
+        return str(open(filepath + "_parsed.txt", "r").read())
+    pdf.write(filepath)
+    bashCommand = "java -jar standards_extraction/tika-app-1.16.jar -t " + filepath
+    output = ""
+    try:
+        output = subprocess.check_output(["bash", "-c", bashCommand])
+        # file = open(filepath + "_parsed.txt", "wb")
+        # file.write(output)
+        # file.close()
+        # Returns bytestring with lots of tabs and spaces.
+        if type(output) == bytes:
+            output = output.decode("utf-8").replace("\t",
+                                                    " ").replace("\n", " ")
+    except subprocess.CalledProcessError as e:
+        print(e.output)
+    return str(output)
+
+
+def find_standard_ref(text):
+    standard_orgs = {}
+    for line in io.open("standards_extraction/standard_orgs.txt", mode="r", encoding="utf-8").readlines():
+        line = line.strip()
+        abbr = line.split(' — ')[0]
+        name = line.split(' — ')[1]
+        standard_orgs[abbr] = name
+    refs = []
+    # match abbreviations in upper case
+    words = text.split()
+    for i, word in enumerate(words):
+        for k in standard_orgs.keys():
+            if k in word:
+                # check one word before and after for alphanumeric
+                if i + 1 < len(words):
+                    word_after = words[i+1]
+                    if bool(re.search(r'\d', word_after)):
+                        standard_ref = word + ' ' + word_after
+                        # clean a bit
+                        if standard_ref[-1] == '.' or standard_ref[-1] == ',':
+                            standard_ref = standard_ref[:-1]
+                        standard_ref = standard_ref.replace('\\n', '')
+                        refs.append(standard_ref)
+                    elif i > 0:
+                        word_before = words[i-1]
+                        if bool(re.search(r'\d', word_before)):
+                            refs.append(word_before+' '+word)
+
+    return list(set(refs))
+
+# print(find_standard_ref('(IEC) sdd67'))
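+
+
+# Hypothetical quick check (not part of the API): run this module directly to try
+# find_standard_ref on a small snippet; it needs standards_extraction/standard_orgs.txt.
+if __name__ == "__main__":
+    sample_text = "The contractor shall comply with ANSI/TIA 222-G and FIPS 140-2."
+    print(find_standard_ref(sample_text))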
diff --git a/api/standards_extraction/run.sh b/api/standards_extraction/run.sh
deleted file mode 100755
index d50961c..0000000
--- a/api/standards_extraction/run.sh
+++ /dev/null
@@ -1,39 +0,0 @@
-#!/bin/bash
-#
-# Script : run.sh
-# Usage : ./run.sh /path/to/input threshold
-# Author : Giuseppe Totaro
-# Date : 08/28/2017 [MM-DD-YYYY]
-# Last Edited:
-# Description: This script runs the StandardsExtractingContentHandler to
-# extract the standard references from every file in a directory.
-# Notes :
-#
-
-function usage() {
- echo "Usage: run.sh /path/to/input threshold"
- exit 1
-}
-
-INPUT=""
-OUTPUT=""
-UMLS_USER=""
-UMLS_PASS=""
-CTAKES_HOME=""
-
-if [ ! -e lib/tika-app-1.16.jar ]
-then
- echo "Error: this program requires Apache Tika 1.16!"
- echo "Please provide \"tika-app-1.16.jar\" file in the \"lib\" folder and try again."
- exit 1
-fi
-
-if [ $# -lt 2 ]
-then
- usage
-fi
-
-INPUT="$1"
-THRESHOLD="$2"
-
-java -cp ./lib/tika-app-1.16.jar:./bin StandardsExtractor "$INPUT" $THRESHOLD 2> /dev/null
diff --git a/api/standards_extraction/src/StandardsExtractor.java b/api/standards_extraction/src/StandardsExtractor.java
deleted file mode 100755
index 69b5294..0000000
--- a/api/standards_extraction/src/StandardsExtractor.java
+++ /dev/null
@@ -1,178 +0,0 @@
-import java.io.BufferedInputStream;
-import java.io.InputStream;
-import java.io.StringWriter;
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.nio.file.Paths;
-import java.util.Arrays;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.metadata.serialization.JsonMetadata;
-import org.apache.tika.parser.AutoDetectParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.parser.ocr.TesseractOCRConfig;
-import org.apache.tika.parser.pdf.PDFParserConfig;
-import org.apache.tika.sax.BodyContentHandler;
-import org.apache.tika.sax.RomanNumeral;
-import org.apache.tika.sax.StandardsExtractingContentHandler;
-import org.apache.tika.sax.StandardsExtractionExample;
-
-/**
- * StandardsExtractor performs the extraction of the scope within the given
- * document, adds the scope to the Metadata object, and finally serializes the
- * metadata to JSON.
- *
- */
-public class StandardsExtractor {
- public static final String SCOPE = "scope";
- public static final String TEXT = "text";
- private static final String REGEX_ROMAN_NUMERALS = "(CM|CD|D?C{1,3})|(XC|XL|L?X{1,3})|(IX|IV|V?I{1,3})";
- private static final String REGEX_SCOPE = "(?<index>((\\d+|(" + REGEX_ROMAN_NUMERALS + ")+)\\.?)+)\\p{Blank}+(SCOPE|Scope)";
-
- public static void main(String[] args) {
- if (args.length < 2) {
- System.err.println("Usage: " + StandardsExtractor.class.getName() + " /path/to/input threshold");
- System.exit(1);
- }
- String pathname = args[0];
- double threshold = Double.parseDouble(args[1]);
-
- Path input = Paths.get(pathname);
-
- if (!Files.exists(input)) {
- System.err.println("Error: " + input + " does not exist!");
- System.exit(1);
- }
-
- Metadata metadata = null;
-
- try {
- metadata = process(input, threshold);
- } catch (Exception e) {
- metadata = new Metadata();
- }
-
- StringWriter writer = new StringWriter();
- JsonMetadata.setPrettyPrinting(true);
-
- try {
- JsonMetadata.toJson(metadata, writer);
- } catch (TikaException e) {
- writer.write("{}");
- }
-
- System.out.println(writer.toString());
- }
-
- private static Metadata process(Path input, double threshold) throws Exception {
- Parser parser = new AutoDetectParser();
-// ForkParser forkParser = new ForkParser(StandardsExtractor.class.getClassLoader(), parser);
- Metadata metadata = new Metadata();
- StandardsExtractingContentHandler handler = new StandardsExtractingContentHandler(new BodyContentHandler(-1), metadata);
- handler.setThreshold(threshold);
-
- TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
- PDFParserConfig pdfConfig = new PDFParserConfig();
- pdfConfig.setExtractInlineImages(true);
-
- ParseContext parseContext = new ParseContext();
- parseContext.set(TesseractOCRConfig.class, ocrConfig);
- parseContext.set(PDFParserConfig.class, pdfConfig);
- parseContext.set(Parser.class, parser);
-
- try (InputStream stream = new BufferedInputStream(Files.newInputStream(input))) {
- parser.parse(stream, handler, metadata, parseContext);
- }
-// try {
-// //TODO ForkParser gives back text in a different format wrt AutoDetectParser!
-// forkParser.parse(stream, handler, metadata, parseContext);
-// } finally {
-// forkParser.close();
-// }
-
- String text = handler.toString();
-
- Pattern patternScope = Pattern.compile(REGEX_SCOPE);
- Matcher matcherScope = patternScope.matcher(text);
-
- Matcher matchResult = null;
- boolean match = false;
- String scope = "";
-
-// // Gets the second occurrence of SCOPE
-// for (int i = 0; i < 2; i++) {
-// match = matcherScope.find();
-
-// }
- // Gets the last occurrence of scope
- for (int i = 0; i < 2 && (match = matcherScope.find()); i++) {
- matchResult = (Matcher)matcherScope.toMatchResult();
- }
-
- if (matchResult != null && !matchResult.group().isEmpty()) {
- int start = matchResult.end();
- String index = matchResult.group("index");
-
- int end = text.length() - 1;
- match = false;
- String endsWithDot = (index.substring(index.length()-1).equals(".")) ? "." : "";
- String[] parts = index.split("\\.");
-
- do {
-// if (parts.length > 0) {
-// int partsLength = parts.length;
-// int subindex = Integer.parseInt(parts[--partsLength]);
-// while (subindex++ == 0 && partsLength > 0) {
-// subindex = Integer.parseInt(parts[--partsLength]);
-// }
-// parts[partsLength] = Integer.toString(subindex);
-// index = String.join(".", parts) + endsWithDot;
-// }
-
- if (parts.length > 0) {
- int partsLength = parts.length;
- int subIndex = 0;
- RomanNumeral romanNumeral = null;
- boolean roman = false;
-
- do {
- String subindexString = parts[--partsLength];
- try {
- romanNumeral = new RomanNumeral(subindexString);
- subIndex = romanNumeral.toInt();
- roman = true;
- } catch (NumberFormatException e) {
- subIndex = Integer.parseInt(subindexString);
- }
- } while (subIndex++ == 0 && partsLength > 0);
-
- parts[partsLength] = (roman) ? new RomanNumeral(subIndex).toString() : Integer.toString(subIndex);
- index = String.join(".", parts) + endsWithDot;
- }
-
- Pattern patternNextHeader = Pattern.compile(index + "\\p{Blank}+([A-Z]([A-Za-z]+\\s?)*)");
- Matcher matcherNextHeader = patternNextHeader.matcher(text);
-
- if (match = matcherNextHeader.find(start)) {
- end = matcherNextHeader.start();
- }
-
- if (parts.length > 0) {
- parts = Arrays.copyOfRange(parts, 0, parts.length-1);
- }
- } while (!match && parts.length > 0);
-
- //TODO Clean text by removing header, footer, and page number (try to find the patterns associated with header and footer)
- scope = text.substring(start + 1, end);
- }
-
- metadata.add(SCOPE, scope);
- metadata.add(TEXT, text);
-
- return metadata;
- }
-}
\ No newline at end of file
diff --git a/api/standards_extraction/src/org/apache/tika/sax/RomanNumeral.java b/api/standards_extraction/src/org/apache/tika/sax/RomanNumeral.java
deleted file mode 100755
index 49f9493..0000000
--- a/api/standards_extraction/src/org/apache/tika/sax/RomanNumeral.java
+++ /dev/null
@@ -1,153 +0,0 @@
-package org.apache.tika.sax;
-
-/**
- * An object of type RomanNumeral is an integer between 1 and 3999. It can
- * be constructed either from an integer or from a string that represents
- * a Roman numeral in this range. The function toString() will return a
- * standardized Roman numeral representation of the number. The function
- * toInt() will return the number as a value of type int.
- *
- * Reference: http://math.hws.edu/eck/cs124/javanotes3/c9/ex-9-3-answer.html
- *
- */
-public class RomanNumeral {
-
- private final int num; // The number represented by this Roman numeral.
-
- private static int[] numbers = { 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
-
- private static String[] letters = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
-
- /**
- * Creates the Roman number with the int value specified
- * by the parameter. Throws a {@link NumberFormatException} if arabic is
- * not in the range 1 to 3999 inclusive.
- *
- * @param arabic int value to create the Roman number
- */
- public RomanNumeral(int arabic) {
- if (arabic < 1) {
- throw new NumberFormatException("Value of RomanNumeral must be positive.");
- }
- if (arabic > 3999) {
- throw new NumberFormatException("Value of RomanNumeral must be 3999 or less.");
- }
- num = arabic;
- }
-
- /**
- * Creates the Roman number with the given representation.
- * For example, RomanNumeral("xvii") is 17. If the parameter is not a
- * legal Roman numeral, a {@link NumberFormatException} is thrown. Both upper and
- * lower case letters are allowed.
- *
- * @param roman representation of the Roman number
- */
- public RomanNumeral(String roman) {
- if (roman.length() == 0) {
- throw new NumberFormatException("An empty string does not define a Roman numeral.");
- }
-
- roman = roman.toUpperCase(); // Convert to upper case letters.
-
- int i = 0; // A position in the string, roman;
- int arabic = 0; // Arabic numeral equivalent of the part of the string that has
- // been converted so far.
-
- while (i < roman.length()) {
-
- char letter = roman.charAt(i); // Letter at current position in string.
- int number = letterToNumber(letter); // Numerical equivalent of letter.
-
- if (number < 0) {
- throw new NumberFormatException("Illegal character \"" + letter + "\" in roman numeral.");
- }
-
- i++; // Move on to next position in the string
-
- if (i == roman.length()) {
- // There is no letter in the string following the one we have just processed.
- // So just add the number corresponding to the single letter to arabic.
- arabic += number;
- }
- else {
- // Look at the next letter in the string. If it has a larger Roman numeral
- // equivalent than number, then the two letters are counted together as
- // a Roman numeral with value (nextNumber - number).
- int nextNumber = letterToNumber(roman.charAt(i));
- if (nextNumber > number) {
- // Combine the two letters to get one value, and move on to next position in string.
- arabic += (nextNumber - number);
- i++;
- }
- else {
- // Don't combine the letters. Just add the value of the one letter onto the number.
- arabic += number;
- }
- }
-
- } // end while
-
- if (arabic > 3999)
- throw new NumberFormatException("Roman numeral must have value 3999 or less.");
-
- num = arabic;
-
- } // end constructor
-
- /**
- * Finds the integer value of letter considered as a Roman numeral.
- * Returns -1 if letter is not a legal Roman numeral. The letter must be
- * upper case.
- *
- * @param letter considered as a Roman numeral
- * @return the integer value of letter considered as a Roman numeral
- */
- private int letterToNumber(char letter) {
- switch (letter) {
- case 'I':
- return 1;
- case 'V':
- return 5;
- case 'X':
- return 10;
- case 'L':
- return 50;
- case 'C':
- return 100;
- case 'D':
- return 500;
- case 'M':
- return 1000;
- default:
- return -1;
- }
- }
-
- /**
- * Returns the standard representation of this Roman numeral.
- *
- * @return the standard representation of this Roman numeral
- */
- public String toString() {
- String roman = ""; // The roman numeral.
- int N = num; // N represents the part of num that still has
- // to be converted to Roman numeral representation.
- for (int i = 0; i < numbers.length; i++) {
- while (N >= numbers[i]) {
- roman += letters[i];
- N -= numbers[i];
- }
- }
- return roman;
- }
-
- /**
- * Returns the value of this Roman numeral as an int.
- *
- * @return the value of this Roman numeral as an int
- */
- public int toInt() {
- return num;
- }
-}
\ No newline at end of file
diff --git a/api/standards_extraction/src/org/apache/tika/sax/StandardOrganizations.java b/api/standards_extraction/src/org/apache/tika/sax/StandardOrganizations.java
deleted file mode 100755
index 24e7f4b..0000000
--- a/api/standards_extraction/src/org/apache/tika/sax/StandardOrganizations.java
+++ /dev/null
@@ -1,300 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.sax;
-
-import java.util.Map;
-import java.util.TreeMap;
-
-/**
- * This class provides a collection of the most important technical standard organizations.
- * The collection of standard organizations has been obtained from Wikipedia.
- * Currently, the list is composed of the most important international standard organizations, the regional standard organizations (i.e., Africa, Americas, Asia Pacific, Europe, and Middle East), and British and American standard organizations among the national-based ones.
- *
- */
-public class StandardOrganizations {
-
- private static Map<String, String> organizations;
- static {
- organizations = new TreeMap<String, String>();
- //manually added organizations
- organizations.put("CFR", "Code of Federal Regulations");
- organizations.put("BIPM", "International Bureau of Weights and Measures");
- organizations.put("CGPM", "General Conference on Weights and Measures");
- organizations.put("CIPM", "International Committee for Weights and Measures");
-
- //International standard organizations
- organizations.put("3GPP", "3rd Generation Partnership Project");
- organizations.put("3GPP2", "3rd Generation Partnership Project 2");
- organizations.put("ABYC", "The American Boat & Yacht Council");
- organizations.put("Accellera", "Accellera Organization");
- organizations.put("A4L", "Access for Learning Community");
- organizations.put("AES", "Audio Engineering Society");
- organizations.put("AIIM", "Association for Information and Image Management");
- organizations.put("ASAM", "Association for Automation and Measuring Systems");
- organizations.put("ASHRAE", "American Society of Heating, Refrigerating and Air-Conditioning Engineers");
- organizations.put("ASME", "American Society of Mechanical Engineers");
- organizations.put("ASTM", "American Society for Testing and Materials");
- organizations.put("ATIS", "Alliance for Telecommunications Industry Solutions");
- organizations.put("AUTOSAR", "Automotive technology");
- //organizations.put("BIPM, CGPM, and CIPM", "Bureau International des Poids et Mesures and the related organizations established under the Metre Convention of 1875.");
- organizations.put("CableLabs", "Cable Television Laboratories");
- organizations.put("CCSDS", "Consultative Committee for Space Data Sciences");
- organizations.put("CIE", "International Commission on Illumination");
- organizations.put("CISPR", "International Special Committee on Radio Interference");
- organizations.put("CFA", "Compact flash association");
- organizations.put("DCMI", "Dublin Core Metadata Initiative");
- organizations.put("DDEX", "Digital Data Exchange");
- organizations.put("DMTF", "Distributed Management Task Force");
- organizations.put("ECMA", "Ecma International");
- organizations.put("EKOenergy", "EKOenergy");
- organizations.put("FAI", "Fédération Aéronautique Internationale");
- organizations.put("GS1", "Global supply chain standards");
- organizations.put("HGI", "Home Gateway Initiative");
- organizations.put("HFSB", "Hedge Fund Standards Board");
- organizations.put("IATA", "International Air Transport Association");
- organizations.put("IAU", "International Arabic Union");
- organizations.put("ICAO", "International Civil Aviation Organization");
- organizations.put("IEC", "International Electrotechnical Commission");
- organizations.put("IEEE", "Institute of Electrical and Electronics Engineers");
- organizations.put("IEEE-SA", "IEEE Standards Association");
- organizations.put("IETF", "Internet Engineering Task Force");
- organizations.put("IFOAM", "International Federation of Organic Agriculture Movements");
- organizations.put("IFSWF", "International Forum of Sovereign Wealth Funds");
- organizations.put("IMO", "International Maritime Organization");
- organizations.put("IMS", "IMS Global Learning Consortium");
- organizations.put("ISO", "International Organization for Standardization");
- organizations.put("IPTC", "International Press Telecommunications Council");
- organizations.put("ITU", "The International Telecommunication Union");
- organizations.put("ITU-R", "ITU Radiocommunications Sector");
- organizations.put("CCIR", "Comité Consultatif International pour la Radio");
- organizations.put("ITU-T", "ITU Telecommunications Sector");
- organizations.put("CCITT", "Comité Consultatif International Téléphonique et Télégraphique");
- organizations.put("ITU-D", "ITU Telecom Development");
- organizations.put("BDT", "Bureau de développement des télécommunications, renamed ITU-D");
- organizations.put("IUPAC", "International Union of Pure and Applied Chemistry");
- organizations.put("Liberty Alliance", "Liberty Alliance");
- organizations.put("Media Grid", "Media Grid Standards Organization");
- organizations.put("NACE International", "National Association of Corrosion Engineers");
- organizations.put("OASIS", "Organization for the Advancement of Structured Information Standards");
- organizations.put("OGC", "Open Geospatial Consortium");
- organizations.put("OHICC", "Organization of Hotel Industry Classification & Certification");
- organizations.put("OIF", "Optical Internetworking Forum");
- organizations.put("OMA", "Open Mobile Alliance");
- organizations.put("OMG", "Object Management Group");
- organizations.put("OGF", "Open Grid Forum");
- organizations.put("GGF", "Global Grid Forum");
- organizations.put("EGA", "Enterprise Grid Alliance");
- organizations.put("OTA", "OpenTravel Alliance");
- organizations.put("OSGi", "OSGi Alliance");
- organizations.put("PESC", "P20 Education Standards Council");
- organizations.put("SAI", "Social Accountability International");
- organizations.put("SDA", "Secure Digital Association");
- organizations.put("SNIA", "Storage Networking Industry Association");
- organizations.put("SMPTE", "Society of Motion Picture and Television Engineers");
- organizations.put("SSDA", "Solid State Drive Alliance");
- organizations.put("The Open Group", "The Open Group");
- organizations.put("TIA", "Telecommunications Industry Association");
- organizations.put("TM Forum", "Telemanagement Forum");
- organizations.put("UIC", "International Union of Railways");
- organizations.put("UL", "Underwriters Laboratories");
- organizations.put("UPU", "Universal Postal Union");
- organizations.put("WMO", "World Meteorological Organization");
- organizations.put("W3C", "World Wide Web Consortium");
- organizations.put("WSA", "Website Standards Association");
- organizations.put("WHO", "World Health Organization");
- organizations.put("XSF", "The XMPP Standards Foundation");
- organizations.put("FAO", "Food and Agriculture Organization");
- //Regional standards organizations
- //Africa
- organizations.put("ARSO", "African Regional Organization for Standarization");
- organizations.put("SADCSTAN", "Southern African Development Community Cooperation in Standarization");
- //Americas
- organizations.put("COPANT", "Pan American Standards Commission");
- organizations.put("AMN", "MERCOSUR Standardization Association");
- organizations.put("CROSQ", "CARICOM Regional Organization for Standards and Quality");
- organizations.put("AAQG", "America's Aerospace Quality Group");
- //Asia Pacific
- organizations.put("PASC", "Pacific Area Standards Congress");
- organizations.put("ACCSQ", "ASEAN Consultative Committee for Standards and Quality");
- //Europe
- organizations.put("RoyalCert", "RoyalCert International Registrars");
- organizations.put("CEN", "European Committee for Standardization");
- organizations.put("CENELEC", "European Committee for Electrotechnical Standardization");
- organizations.put("URS", "United Registrar of Systems, UK");
- organizations.put("ETSI", "European Telecommunications Standards Institute");
- organizations.put("EASC", "Euro-Asian Council for Standardization, Metrology and Certification");
- organizations.put("IRMM", "Institute for Reference Materials and Measurements");
- organizations.put("WELMEC", "European Cooperation in Legal Metrology");
- organizations.put("EURAMET", "the European Association of National Metrology Institutes");
- //Middle East
- organizations.put("AIDMO", "Arab Industrial Development and Mining Organization");
- organizations.put("IAU", "International Arabic Union");
- //Nationally-based standards organizations
- //United Kingdom
- organizations.put("BSI", "British Standards Institution aka BSI Group");
- organizations.put("DStan", "UK Defence Standardization");
- //United States of America
- organizations.put("ANSI", "American National Standards Institute");
- organizations.put("ACI", "American Concrete Institute");
- organizations.put("NIST", "National Institute of Standards and Technology");
-
- //for and of the in on
-
- //manually added organizations
- organizations.put("Code\\sof\\sFederal\\sRegulations", "CFR");
- organizations.put("International\\sBureau\\sof\\sWeights\\sand\\sMeasures", "BIPM");
- organizations.put("General\\sConference\\son\\sWeights\\sand\\sMeasures", "CGPM");
- organizations.put("International\\sCommittee\\sfor\\sWeights\\sand\\sMeasures", "CIPM");
-
- //International standard organizations
- organizations.put("3rd\\sGeneration\\sPartnership\\sProject", "3GPP");
- organizations.put("3rd\\sGeneration\\sPartnership\\sProject\\s2", "3GPP2");
- organizations.put("The\\sAmerican\\sBoat\\s&\\sYacht\\sCouncil", "ABYC");
- organizations.put("Accellera\\sOrganization", "Accellera");
- organizations.put("Access\\sfor\\sLearning\\sCommunity", "A4L");
- organizations.put("Audio\\sEngineering\\sSociety", "AES");
- organizations.put("Association\\sfor\\sInformation\\sand\\sImage\\sManagement", "AIIM");
- organizations.put("Association\\sfor\\sAutomation\\sand\\sMeasuring\\sSystems", "ASAM");
- organizations.put("American\\sSociety\\sof\\sHeating,\\sRefrigerating\\sand\\sAir-Conditioning\\sEngineers", "ASHRAE");
- organizations.put("American\\sSociety\\sof\\sMechanical\\sEngineers", "ASME");
- organizations.put("American\\sSociety\\sfor\\sTesting\\sand\\sMaterials", "ASTM");
- organizations.put("Alliance\\sfor\\sTelecommunications\\sIndustry\\sSolutions", "ATIS");
- organizations.put("Automotive\\stechnology", "AUTOSAR");
- //organizations.put("BIPM, CGPM, and CIPM", "Bureau International des Poids et Mesures and the related organizations established under the Metre Convention of 1875.");
- organizations.put("Cable\\sTelevision\\sLaboratories", "CableLabs");
- organizations.put("Consultative\\sCommittee\\sfor\\sSpace\\sData\\sSciences", "CCSDS");
- organizations.put("International\\sCommission\\son\\sIllumination", "CIE");
- organizations.put("International\\sSpecial\\sCommittee\\son\\sRadio\\sInterference", "CISPR");
- organizations.put("Compact\\sflash\\sassociation", "CFA");
- organizations.put("Dublin\\sCore\\sMetadata\\sInitiative", "DCMI");
- organizations.put("Digital\\sData\\sExchange", "DDEX");
- organizations.put("Distributed\\sManagement\\sTask\\sForce", "DMTF");
- organizations.put("Ecma\\sInternational", "ECMA");
- organizations.put("EKOenergy", "EKOenergy");
- organizations.put("Fédération\\sAéronautique\\sInternationale", "FAI");
- organizations.put("Global\\ssupply\\schain\\sstandards", "GS1");
- organizations.put("Home\\sGateway\\sInitiative", "HGI");
- organizations.put("Hedge\\sFund\\sStandards\\sBoard", "HFSB");
- organizations.put("International\\sAir\\sTransport\\sAssociation", "IATA");
- organizations.put("International\\sArabic\\sUnion", "IAU");
- organizations.put("International\\sCivil\\sAviation\\sOrganization", "ICAO");
- organizations.put("International\\sElectrotechnical\\sCommission", "IEC");
- organizations.put("Institute\\sof\\sElectrical\\sand\\sElectronics\\sEngineers", "IEEE");
- organizations.put("IEEE\\sStandards\\sAssociation", "IEEE-SA");
- organizations.put("Internet\\sEngineering\\sTask\\sForce", "IETF");
- organizations.put("International\\sFederation\\sof\\sOrganic\\sAgriculture\\sMovements", "IFOAM");
- organizations.put("International\\sForum\\sof\\sSovereign\\sWealth\\sFunds", "IFSWF");
- organizations.put("International\\sMaritime\\sOrganization", "IMO");
- organizations.put("IMS\\sGlobal\\sLearning\\sConsortium", "IMS");
- organizations.put("International\\sOrganization\\sfor\\sStandardization", "ISO");
- organizations.put("International\\sPress\\sTelecommunications\\sCouncil", "IPTC");
- organizations.put("The\\sInternational\\sTelecommunication\\sUnion", "ITU");
- organizations.put("ITU\\sRadiocommunications\\sSector", "ITU-R");
- organizations.put("Comité\\sConsultatif\\sInternational\\spour\\sla\\sRadio", "CCIR");
- organizations.put("ITU\\sTelecommunications\\sSector", "ITU-T");
- organizations.put("Comité\\sConsultatif\\sInternational\\sTéléphonique\\set\\sTélégraphique", "CCITT");
- organizations.put("ITU\\sTelecom\\sDevelopment", "ITU-D");
- organizations.put("Bureau\\sde\\sdéveloppement\\sdes\\stélécommunications", "BDT");
- organizations.put("International\\sUnion\\sof\\sPure\\sand\\sApplied\\sChemistry", "IUPAC");
- organizations.put("Liberty Alliance", "Liberty Alliance");
- organizations.put("Media\\sGrid\\sStandards\\sOrganization", "Media Grid");
- organizations.put("National\\sAssociation\\sof\\sCorrosion\\sEngineers", "NACE International");
- organizations.put("Organization\\sfor\\sthe\\sAdvancement\\sof\\sStructured\\sInformation\\sStandards", "OASIS");
- organizations.put("Open\\sGeospatial\\sConsortium", "OGC");
- organizations.put("Organization\\sof\\sHotel\\sIndustry\\sClassification\\s&\\sCertification", "OHICC");
- organizations.put("Optical\\sInternetworking\\sForum", "OIF");
- organizations.put("Open\\sMobile\\sAlliance", "OMA");
- organizations.put("Object\\sManagement\\sGroup", "OMG");
- organizations.put("Open\\sGrid\\sForum", "OGF");
- organizations.put("Global\\sGrid\\sForum", "GGF");
- organizations.put("Enterprise\\sGrid\\sAlliance", "EGA");
- organizations.put("OpenTravel\\sAlliance", "OTA");
- organizations.put("OSGi\\sAlliance", "OSGi");
- organizations.put("P20\\sEducation\\sStandards\\sCouncil", "PESC");
- organizations.put("Social\\sAccountability\\sInternational", "SAI");
- organizations.put("Secure\\sDigital\\sAssociation", "SDA");
- organizations.put("Storage\\sNetworking\\sIndustry\\sAssociation", "SNIA");
- organizations.put("Society\\sof\\sMotion\\sPicture\\sand\\sTelevision\\sEngineers", "SMPTE");
- organizations.put("Solid\\sState\\sDrive\\sAlliance", "SSDA");
- organizations.put("The\\sOpen\\sGroup", "The Open Group");
- organizations.put("Telecommunications\\sIndustry\\sAssociation", "TIA");
- organizations.put("Telemanagement\\sForum", "TM Forum");
- organizations.put("International\\sUnion\\sof\\sRailways", "UIC");
- organizations.put("Underwriters\\sLaboratories", "UL");
- organizations.put("Universal\\sPostal\\sUnion", "UPU");
- organizations.put("World\\sMeteorological\\sOrganization", "WMO");
- organizations.put("World\\sWide\\sWeb\\sConsortium", "W3C");
- organizations.put("Website\\sStandards\\sAssociation", "WSA");
- organizations.put("World\\sHealth\\sOrganization", "WHO");
- organizations.put("The\\sXMPP\\sStandards\\sFoundation", "XSF");
- organizations.put("Food\\sand\\sAgriculture\\sOrganization", "FAO");
- //Regional standards organizations
- //Africa
- organizations.put("African\\sRegional\\sOrganization\\sfor\\sStandarization", "ARSO");
- organizations.put("Southern\\sAfrican\\sDevelopment\\sCommunity\\sCooperation\\sin\\sStandarization", "SADCSTAN");
- //Americas
- organizations.put("Pan\\sAmerican\\sStandards\\sCommission", "COPANT");
- organizations.put("MERCOSUR\\sStandardization\\sAssociation", "AMN");
- organizations.put("CARICOM\\sRegional\\sOrganization\\sfor\\sStandards\\sand\\sQuality", "CROSQ");
- organizations.put("America's\\sAerospace\\sQuality\\sGroup", "AAQG");
- //Asia Pacific
- organizations.put("Pacific\\sArea\\sStandards\\sCongress", "PASC");
- organizations.put("ASEAN\\sConsultative\\sCommittee\\sfor\\sStandards\\sand\\sQuality", "ACCSQ");
- //Europe
- organizations.put("RoyalCert\\sInternational\\sRegistrars", "RoyalCert");
- organizations.put("European\\sCommittee\\sfor\\sStandardization", "CEN");
- organizations.put("European\\sCommittee\\sfor\\sElectrotechnical\\sStandardization", "CENELEC");
- organizations.put("United\\sRegistrar\\sof\\sSystems", "URS");
- organizations.put("European\\sTelecommunications\\sStandards\\sInstitute", "ETSI");
- organizations.put("Euro-Asian\\sCouncil\\sfor\\sStandardization,\\sMetrology\\sand\\sCertification", "EASC");
- organizations.put("Institute\\sfor\\sReference\\sMaterials\\sand\\sMeasurements", "IRMM");
- organizations.put("European\\sCooperation\\sin\\sLegal\\sMetrology", "WELMEC");
- organizations.put("the\\sEuropean\\sAssociation\\sof\\sNational\\sMetrology\\sInstitutes", "EURAMET");
- //Middle East
- organizations.put("Arab\\sIndustrial\\sDevelopment\\sand\\sMining\\sOrganization", "AIDMO");
- organizations.put("International\\sArabic\\sUnion", "IAU");
- //Nationally-based standards organizations
- //United Kingdom
- organizations.put("British\\sStandards\\sInstitution", "BSI");
- organizations.put("UK\\sDefence\\sStandardization", "DStan");
- //United States of America
- organizations.put("American\\sNational\\sStandards\\sInstitute", "ANSI");
- organizations.put("American\\sConcrete\\sInstitute", "ACI");
- organizations.put("National\\sInstitute\\sof\\sStandards\\sand\\sTechnology", "NIST");
-
- }
-
- /**
- * Returns the map containing the collection of the most important technical standard organizations.
- *
- * @return the map containing the collection of the most important technical standard organizations.
- */
- public static Map<String, String> getOrganizations() {
- return organizations;
- }
-
- /**
- * Returns the regular expression containing the most important technical standard organizations.
- *
- * @return the regular expression containing the most important technical standard organizations.
- */
- public static String getOrganzationsRegex() {
- String regex = "(" + String.join("|", organizations.keySet()) + ")"; //1) regex improved, 2) take care of white space w/ second fxn
- return regex;
- }
-}
\ No newline at end of file
diff --git a/api/standards_extraction/src/org/apache/tika/sax/StandardReference.java b/api/standards_extraction/src/org/apache/tika/sax/StandardReference.java
deleted file mode 100755
index cb3f5a1..0000000
--- a/api/standards_extraction/src/org/apache/tika/sax/StandardReference.java
+++ /dev/null
@@ -1,124 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.sax;
-
-/**
- * Class that represents a standard reference.
- *
- */
-public class StandardReference {
- private String mainOrganization;
- private String separator;
- private String secondOrganization;
- private String identifier;
- private double score;
-
- private StandardReference(String mainOrganizationAcronym, String separator, String secondOrganizationAcronym,
- String identifier, double score) {
- super();
- this.mainOrganization = mainOrganizationAcronym;
- this.separator = separator;
- this.secondOrganization = secondOrganizationAcronym;
- this.identifier = identifier;
- this.score = score;
- }
-
- public String getMainOrganizationAcronym() {
- return mainOrganization;
- }
-
- public void setMainOrganizationAcronym(String mainOrganizationAcronym) {
- this.mainOrganization = mainOrganizationAcronym;
- }
-
- public String getSeparator() {
- return separator;
- }
-
- public void setSeparator(String separator) {
- this.separator = separator;
- }
-
- public String getSecondOrganizationAcronym() {
- return secondOrganization;
- }
-
- public void setSecondOrganizationAcronym(String secondOrganizationAcronym) {
- this.secondOrganization = secondOrganizationAcronym;
- }
-
- public String getIdentifier() {
- return identifier;
- }
-
- public void setIdentifier(String identifier) {
- this.identifier = identifier;
- }
-
- public double getScore() {
- return score;
- }
-
- public void setScore(double score) {
- this.score = score;
- }
-
- @Override
- public String toString() {
- String standardReference = mainOrganization;
-
- if (separator != null && !separator.isEmpty()) {
- standardReference += separator + secondOrganization;
- }
-
- standardReference += " " + identifier;
-
- return standardReference;
- }
-
- public static class StandardReferenceBuilder {
- private String mainOrganization;
- private String separator;
- private String secondOrganization;
- private String identifier;
- private double score;
-
- public StandardReferenceBuilder(String mainOrganization, String identifier) {
- this.mainOrganization = mainOrganization;
- this.separator = null;
- this.secondOrganization = null;
- this.identifier = identifier;
- this.score = 0;
- }
-
- public StandardReferenceBuilder setSecondOrganization(String separator, String secondOrganization) {
- this.separator = separator;
- this.secondOrganization = secondOrganization;
- return this;
- }
-
- public StandardReferenceBuilder setScore(double score) {
- this.score = score;
- return this;
- }
-
- public StandardReference build() {
- return new StandardReference(mainOrganization, separator, secondOrganization, identifier, score);
- }
- }
-}
diff --git a/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractingContentHandler.java b/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractingContentHandler.java
deleted file mode 100755
index 5d63300..0000000
--- a/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractingContentHandler.java
+++ /dev/null
@@ -1,116 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.sax;
-
-import java.util.Arrays;
-import java.util.List;
-
-import org.apache.tika.metadata.Metadata;
-import org.xml.sax.ContentHandler;
-import org.xml.sax.SAXException;
-import org.xml.sax.helpers.DefaultHandler;
-
-/**
- * StandardsExtractingContentHandler is a Content Handler used to extract
- * standard references while parsing.
- *
- */
-public class StandardsExtractingContentHandler extends ContentHandlerDecorator {
- public static final String STANDARD_REFERENCES = "standard_references";
- private Metadata metadata;
- private StringBuilder stringBuilder;
- private double threshold = 0;
-
- /**
- * Creates a decorator for the given SAX event handler and Metadata object.
- *
- * @param handler
- * SAX event handler to be decorated.
- * @param metadata
- * {@link Metadata} object.
- */
- public StandardsExtractingContentHandler(ContentHandler handler, Metadata metadata) {
- super(handler);
- this.metadata = metadata;
- this.stringBuilder = new StringBuilder();
- }
-
- /**
- * Creates a decorator that by default forwards incoming SAX events to a
- * dummy content handler that simply ignores all the events. Subclasses
- * should use the {@link #setContentHandler(ContentHandler)} method to
- * switch to a more usable underlying content handler. Also creates a dummy
- * Metadata object to store standard references in.
- */
- protected StandardsExtractingContentHandler() {
- this(new DefaultHandler(), new Metadata());
- }
-
- /**
- * Gets the threshold to be used for selecting the standard references found
- * within the text based on their score.
- *
- * @return the threshold to be used for selecting the standard references
- * found within the text based on their score.
- */
- public double getThreshold() {
- return threshold;
- }
-
- /**
- * Sets the score to be used as threshold.
- *
- * @param score
- * the score to be used as threshold.
- */
- public void setThreshold(double score) {
- this.threshold = score;
- }
-
- /**
- * The characters method is called whenever a Parser wants to pass raw
- * characters to the ContentHandler. However, standard references are often
- * split across different calls to characters, depending on the specific
- * Parser used. Therefore, we simply add all characters to a StringBuilder
- * and analyze it once the document is finished.
- */
- @Override
- public void characters(char[] ch, int start, int length) throws SAXException {
- try {
- String text = new String(Arrays.copyOfRange(ch, start, start + length));
- stringBuilder.append(text);
- super.characters(ch, start, length);
- } catch (SAXException e) {
- handleException(e);
- }
- }
-
- /**
- * This method is called whenever the Parser is done parsing the file. So,
- * we check the output for any standard references.
- */
- @Override
- public void endDocument() throws SAXException {
- super.endDocument();
- List<StandardReference> standards = StandardsText.extractStandardReferences(stringBuilder.toString(),
- threshold);
- for (StandardReference standardReference : standards) {
- metadata.add(STANDARD_REFERENCES, standardReference.toString());
- }
- }
-}
\ No newline at end of file
diff --git a/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractingContentHandlerTest.java b/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractingContentHandlerTest.java
deleted file mode 100755
index 0fb411d..0000000
--- a/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractingContentHandlerTest.java
+++ /dev/null
@@ -1,53 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.sax;
-
-import static org.junit.Assert.*;
-
-import java.io.InputStream;
-
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.parser.AutoDetectParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.junit.Test;
-
-/**
- * Test class for the {@link StandardsExtractingContentHandler} class.
- */
-public class StandardsExtractingContentHandlerTest {
-
- @Test
- public void testExtractStandards() throws Exception {
- Parser parser = new AutoDetectParser();
- Metadata metadata = new Metadata();
-
- StandardsExtractingContentHandler handler = new StandardsExtractingContentHandler(new BodyContentHandler(-1), metadata);
- handler.setThreshold(0.75);
- InputStream inputStream = StandardsExtractingContentHandlerTest.class.getResourceAsStream("/test-documents/testStandardsExtractor.pdf");
-
- parser.parse(inputStream, handler, metadata, new ParseContext());
-
- String[] standardReferences = metadata.getValues(StandardsExtractingContentHandler.STANDARD_REFERENCES);
-
- assertTrue(standardReferences[0].equals("ANSI/TIA 222-G"));
- assertTrue(standardReferences[1].equals("TIA/ANSI 222-G-1"));
- assertTrue(standardReferences[2].equals("FIPS 140-2"));
- assertTrue(standardReferences[3].equals("FIPS 197"));
- }
-}
\ No newline at end of file
diff --git a/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractionExample.java b/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractionExample.java
deleted file mode 100755
index 63e5b76..0000000
--- a/api/standards_extraction/src/org/apache/tika/sax/StandardsExtractionExample.java
+++ /dev/null
@@ -1,109 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.sax;
-
-import java.io.BufferedInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.nio.file.FileVisitResult;
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.nio.file.Paths;
-import java.nio.file.SimpleFileVisitor;
-import java.nio.file.attribute.BasicFileAttributes;
-import java.util.Collections;
-import java.util.HashSet;
-
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.parser.AutoDetectParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-
-/**
- * Class to demonstrate how to use the {@link StandardsExtractingContentHandler}
- * to get a list of the standard references from every file in a directory.
- *
- *
- * You can run this main method by running
- *
- * mvn exec:java -Dexec.mainClass="org.apache.tika.example.StandardsExtractionExample" -Dexec.args="/path/to/input"
- *
- * from the tika-example directory.
- *
- */
-public class StandardsExtractionExample {
- private static HashSet<String> standardReferences = new HashSet<>();
- private static int failedFiles = 0;
- private static int successfulFiles = 0;
-
- public static void main(String[] args) {
- if (args.length < 1) {
- System.err.println("Usage: " + StandardsExtractionExample.class.getName() + " /path/to/input");
- System.exit(1);
- }
- String pathname = args[0];
-
- Path folder = Paths.get(pathname);
- System.out.println("Searching " + folder.toAbsolutePath() + "...");
- processFolder(folder);
- System.out.println(standardReferences.toString());
- System.out.println("Parsed " + successfulFiles + "/" + (successfulFiles + failedFiles));
- }
-
- public static void processFolder(Path folder) {
- try {
- Files.walkFileTree(folder, new SimpleFileVisitor<Path>() {
- @Override
- public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
- try {
- process(file);
- successfulFiles++;
- } catch (Exception e) {
- failedFiles++;
- // ignore this file
- }
- return FileVisitResult.CONTINUE;
- }
-
- @Override
- public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
- failedFiles++;
- return FileVisitResult.CONTINUE;
- }
- });
- } catch (IOException e) {
- // ignore failure
- }
- }
-
- public static void process(Path path) throws Exception {
- Parser parser = new AutoDetectParser();
- Metadata metadata = new Metadata();
- // The StandardsExtractingContentHandler will examine any characters for
- // standard references before passing them
- // to the underlying Handler.
- StandardsExtractingContentHandler handler = new StandardsExtractingContentHandler(new BodyContentHandler(-1),
- metadata);
- handler.setThreshold(0.75);
- try (InputStream stream = new BufferedInputStream(Files.newInputStream(path))) {
- parser.parse(stream, handler, metadata, new ParseContext());
- }
- String[] references = metadata.getValues(StandardsExtractingContentHandler.STANDARD_REFERENCES);
- Collections.addAll(standardReferences, references);
- }
-}
\ No newline at end of file
diff --git a/api/standards_extraction/src/org/apache/tika/sax/StandardsText.java b/api/standards_extraction/src/org/apache/tika/sax/StandardsText.java
deleted file mode 100755
index e856540..0000000
--- a/api/standards_extraction/src/org/apache/tika/sax/StandardsText.java
+++ /dev/null
@@ -1,188 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.sax;
-
-import java.util.ArrayList;
-import java.util.Iterator;
-import java.util.Map;
-import java.util.Map.Entry;
-import java.util.TreeMap;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
-import org.apache.tika.sax.StandardReference.StandardReferenceBuilder;
-
-/**
- * StandardsText relies on regular expressions to extract standard references
- * from text.
- *
- *
- * This class helps to find the standard references from text by performing the
- * following steps:
- *
- * - searches for headers;
- * - searches for patterns that are supposed to be standard references
- * (basically, every string mostly composed of uppercase letters followed by
- * alphanumeric characters);
- * - each potential standard reference starts with score equal to 0.25;
- * - increases by 0.5 the score of references which include the name of a
- * known standard organization ({@link StandardOrganizations});
- * - increases by 0.25 the score of references which include the word
- * Publication or Standard;
- * - increases by 0.25 the score of references which have been found within
- * "Applicable Documents" and equivalent sections;
- * - returns the standard references along with scores.
- *
- *
- *
- */
-public class StandardsText {
- // Regular expression to match uppercase headers
- private static final String REGEX_HEADER = "(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}";
-
- // Regular expression to match the "APPLICABLE DOCUMENTS" and equivalent
- // sections
- private static final String REGEX_APPLICABLE_DOCUMENTS = "(?i:.*APPLICABLE\\sDOCUMENTS|REFERENCE|STANDARD|REQUIREMENT|GUIDELINE|COMPLIANCE.*)";
-
- // Regular expression to match the alphanumeric identifier of the standard
- private static final String REGEX_IDENTIFIER = "(?<identifier>([0-9]{2,}|([A-Z]+(-|_|\\.)?[0-9]{2,}))((-|_|\\.)?[A-Z0-9]+)*)";//new line here??
-
- // Regular expression to match the standard organization
- private static final String REGEX_ORGANIZATION = StandardOrganizations.getOrganzationsRegex();
-
- // Regular expression to match the type of publication, often reported
- // between the name of the standard organization and the standard identifier
- private static final String REGEX_STANDARD_TYPE = "(\\s(?i:Publication|Standard))";
-
- // Regular expression to match a string that is supposed to be a standard
- // reference
- private static final String REGEX_FALLBACK = "\\(?" + "(?<mainOrganization>(([A-Z]+\\w+\\s?){1,10}))"
- + "\\)?((\\s?(?<separator>\\/)\\s?)(\\w+\\s)*\\(?" + "(?<secondOrganization>[A-Z]+\\w+)" + "\\)?)?"
- + REGEX_STANDARD_TYPE + "?" + "(-|\\s)?" + REGEX_IDENTIFIER; //or new line here??
-
- // Regular expression to match the standard organization within a string
- // that is supposed to be a standard reference
- private static final String REGEX_STANDARD = ".*" + REGEX_ORGANIZATION + ".+" + REGEX_ORGANIZATION + "?.*";
-
- /**
- * Extracts the standard references found within the given text.
- *
- * @param text
- * the text from which the standard references are extracted.
- * @param threshold
- * the lower bound limit to be used in order to select only the
- * standard references with score greater than or equal to the
- * threshold. For instance, using a threshold of 0.75 means that
- * only the patterns with score greater than or equal to 0.75
- * will be returned.
- * @return the list of standard references extracted from the given text.
- */
- public static ArrayList<StandardReference> extractStandardReferences(String text, double threshold) {
- Map<Integer, String> headers = findHeaders(text);
-
- ArrayList<StandardReference> standardReferences = findStandards(text, headers, threshold);
-
- return standardReferences;
- }
-
- /**
- * This method helps to find the headers within the given text.
- *
- * @param text
- * the text from which the headers are extracted.
- * @return the list of headers found within the given text.
- */
- private static Map<Integer, String> findHeaders(String text) {
- Map<Integer, String> headers = new TreeMap<>();
-
- Pattern pattern = Pattern.compile(REGEX_HEADER);
- Matcher matcher = pattern.matcher(text);
-
- while (matcher.find()) {
- headers.put(matcher.start(), matcher.group());
- }
-
- return headers;
- }
-
- /**
- * This method helps to find the standard references within the given text.
- *
- * @param text
- * the text from which the standards references are extracted.
- * @param headers
- * the list of headers found within the given text.
- * @param threshold
- * the lower bound limit to be used in order to select only the
- * standard references with score greater than or equal to the
- * threshold.
- * @return the list of standard references extracted from the given text.
- */
- private static ArrayList<StandardReference> findStandards(String text, Map<Integer, String> headers,
- double threshold) {
- ArrayList<StandardReference> standards = new ArrayList<>();
- double score = 0;
-
- Pattern pattern = Pattern.compile(REGEX_FALLBACK);
- Matcher matcher = pattern.matcher(text);
-
- while (matcher.find()) {
- StandardReferenceBuilder builder = new StandardReference.StandardReferenceBuilder(
- matcher.group("mainOrganization"), matcher.group("identifier"))
- .setSecondOrganization(matcher.group("separator"), matcher.group("secondOrganization"));
- score = 0.25;
-
- // increases by 0.5 the score of references which include the name of a known standard organization
- if (matcher.group().matches(REGEX_STANDARD)) {
- score += 0.5;
- }
-
- // increases by 0.25 the score of references which include the word "Publication" or "Standard"
- if (matcher.group().matches(".*" + REGEX_STANDARD_TYPE + ".*")) {
- score += 0.25;
- }
-
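- // Find the header of the section containing this match: walk the headers (sorted by offset) until one starts after the match.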
- int startHeader = 0;
- int endHeader = 0;
- boolean headerFound = false;
- Iterator<Entry<Integer, String>> iterator = headers.entrySet().iterator();
- while (iterator.hasNext() && !headerFound) {
- startHeader = endHeader;
- endHeader = iterator.next().getKey();
- if (endHeader > matcher.start()) {
- headerFound = true;
- }
- }
-
- String header = headers.get(startHeader);
-
- // increases by 0.25 the score of references which have been found within "Applicable Documents" and equivalent sections
- if (header != null && headers.get(startHeader).matches(REGEX_APPLICABLE_DOCUMENTS)) {
- score += 0.25;
- }
-
- builder.setScore(score);
-
- if (score >= threshold) {
- standards.add(builder.build());
- }
- }
-
- return standards;
- }
-}
\ No newline at end of file
diff --git a/api/standards/data/standard_orgs.txt b/api/standards_extraction/standard_orgs.txt
similarity index 100%
rename from api/standards/data/standard_orgs.txt
rename to api/standards_extraction/standard_orgs.txt
diff --git a/api/standards_extraction/lib/tika-app-1.16.jar b/api/standards_extraction/tika-app-1.16.jar
similarity index 100%
rename from api/standards_extraction/lib/tika-app-1.16.jar
rename to api/standards_extraction/tika-app-1.16.jar
diff --git a/api/test_extract.py b/api/test_extract.py
deleted file mode 100644
index 3d421f7..0000000
--- a/api/test_extract.py
+++ /dev/null
@@ -1,18 +0,0 @@
-import os
-import json
-import subprocess
-from sklearn.neighbors import NearestNeighbors
-from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.preprocessing import normalize
-import dill
-import pandas as pd
-from sklearn.feature_extraction import text
-from standard_extractor import find_standard_ref
-from text_analysis import extract_prep
-
-"""
-TODO: 1) Run where ES is connected and do tests.
- 2) exit()
-"""
-test = extract_prep.predict(in_text="test")
-print(test)
\ No newline at end of file
diff --git a/api/text_analysis/create_elmo_for_all.py b/api/text_analysis/create_elmo_for_all.py
deleted file mode 100644
index d63bdf5..0000000
--- a/api/text_analysis/create_elmo_for_all.py
+++ /dev/null
@@ -1,53 +0,0 @@
-"""
-Create ELMo vectors for all the standards and save them; chunk the work to make it fast.
-"""
-import pandas as pd
-import os
-from api.text_analysis.elmo_util import *
-import dask.dataframe as dd
-import multiprocessing
-import time
-import numpy as np
-
-
-# standards_dir = '../standards/data'
-# df=dd.read_csv(os.path.join(standards_dir,'iso_final_all_clean_text.csv')) # df=pd.read_csv(os.path.join(standards_dir,'iso_final_all_clean_text.csv'), index_col=0)
-# df=df.compute()
-# df=dd.from_pandas(df, npartitions=2*multiprocessing.cpu_count())
-# df=df[df['type']=='standard'].reset_index(drop=True)
-# df=df.fillna('')
-# df=df.map_partitions(lambda df: df.assign(usable_text=df['description_clean'] +' ' + df['title'])).compute()
-# df=df.head(1000)
-# df=df.reset_index()
-# df_=df['usable_text']
-# df_=df_.reset_index()
-# start = time.process_time()
-# # https://stackoverflow.com/questions/40019905/how-to-map-a-column-with-dask
-# df_=dd.from_pandas(df_, npartitions=2*multiprocessing.cpu_count())
-# df['elmo']=df_.usable_text.map(lambda usable_text: give_paragraph_elmo_vector(usable_text), meta=('usable_text', str)).compute() # df['elmo'] = df.apply(lambda row: give_paragraph_elmo_vector(row['usable_text']) , axis=1)
-# print(time.process_time() - start)
-# df.to_csv(os.path.join(standards_dir,'iso_final_all_clean_text_w_elmo.csv'))
-
-# The code above used dask to parallelize the ELMo vector calculations, but it still calls tf_hub once per paragraph. We can try a one-shot approach:
- # -- sentence-tokenize the paragraphs and maintain an array tracking which sentences belong to which data points.
- # -- give all sentences to tf_hub at once and then extract the tokens
- # -- the above is implemented in elmo_utils.py
-
-
-standards_dir = '../standards/data'
-df=dd.read_csv(os.path.join(standards_dir,'iso_final_all_clean_text.csv')) # df=pd.read_csv(os.path.join(standards_dir,'iso_final_all_clean_text.csv'), index_col=0)
-df=df.compute()
-df=dd.from_pandas(df, npartitions=2*multiprocessing.cpu_count())
-df=df[df['type']=='standard'].reset_index(drop=True)
-df=df.fillna('')
-df=df.map_partitions(lambda df: df.assign(usable_text=df['description_clean'] +' ' + df['title'])).compute()
-# df=df.head(100)
-# df=df.reset_index()
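-# Process the standards in 31 chunks so that each call to give_paragraph_elmo_vector_multi handles a bounded batch.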
-df_splits = np.array_split(df, 31)
-
-for df_split in df_splits:
- start = time.process_time()
- df_split['elmo'] = give_paragraph_elmo_vector_multi(list(df_split['usable_text']))
- print(time.process_time() - start)
-df=pd.concat(df_splits)
-df.to_csv(os.path.join(standards_dir,'iso_final_all_clean_standards_text_w_elmo.csv'))
-# merge all the splits now
diff --git a/api/text_analysis/demo.py b/api/text_analysis/demo.py
deleted file mode 100644
index 07d38e2..0000000
--- a/api/text_analysis/demo.py
+++ /dev/null
@@ -1,178 +0,0 @@
-from api.text_analysis.utils.hplotprecdict import *
-from api.text_analysis.utils.utils import *
-import pandas as pd
-import os
-import numpy as np
-
-models_dir='../models/'
-standards_dir='../standards/data/'
-data_dir='data/'
-output_dir='output/'
-temp_dir='temp/'
-
-pos = loadmodel(models_dir + 'pos_')
-graph = loadmodel(models_dir + 'graph')
-text_sow=open(data_dir+'test_input_sow_text.txt','r').read()
-
-standards_df = pd.read_csv(os.path.join(standards_dir, 'iso_final_all_clean_standards_text_w_elmo.csv'), index_col=0)
-standards_df = standards_df[standards_df['type'] == 'standard'].reset_index(drop=True)
-standards_df.fillna('', inplace=True)
-standards_df['id']=standards_df.index
-
-def print_items(inpt):
- for item in inpt:
- print(item[0], '||', item[1])
-
-field_predictions_collect={}
-print('starting to predict (bottom up)')
-field_predictions=bottom_up_hpredict(text_sow, standards_df, pos, graph, algo='cosine-sim', plot_name=output_dir+'cosine_sim_hcat.html')
-field_predictions_collect['cosine-sim']=field_predictions
-print_items(field_predictions)
-field_predictions=bottom_up_hpredict(text_sow, standards_df, pos, graph, algo='elmo-sim', plot_name=output_dir+'elmo_sim_hcat.html')
-field_predictions_collect['elmo-sim']=field_predictions
-print_items(field_predictions)
-field_predictions=bottom_up_hpredict(text_sow, standards_df, pos, graph, algo='glove-sim', plot_name=output_dir+'glove_sim_hcat.html')
-field_predictions_collect['glove-sim']=field_predictions
-print_items(field_predictions)
-field_predictions=bottom_up_hpredict(text_sow, standards_df, pos, graph, algo='w2v-sim', plot_name=output_dir+'w2v_sim_hcat.html')
-field_predictions_collect['w2v-sim']=field_predictions
-print_items(field_predictions)
-savemodel(field_predictions_collect, 'field_predictions_collect')
-exit()
-
-"""
-Todos:
-- get the top 50 from each recall algorithm, combine them, and then create a heatmap of the correlations between all the algorithms (recall and rerank)
-- also, build a heatmap of the category rankings produced by the recall algorithms
-- should we also try removing the min/max from the glove and w2v vectors?
-
-- do the heatmaps for the categories (4 algorithms)
-- do a T-SNE of all the standards with their vectors (for the 4 fast algorithms); color by category and see which ones make the most sense
-- complete the code below and create heatmaps for the 7 algorithms
-
-- WMD calculations for the 7 algorithms are very slow, so we can use soft cosine instead. We have to implement it ourselves, since the
-  existing implementations will not work for ELMo.
-"""
-
-import plotly.graph_objects as go
-from plotly.offline import plot
-from scipy import stats
-
-field_predictions_collect=loadmodel('field_predictions_collect')
-algorithms=list(field_predictions_collect.keys())
-for algo in algorithms:
- rankings=[item[1] for item in sorted(field_predictions_collect[algo], key=lambda x: x[0], reverse=True)]
- field_predictions_collect[algo]=rankings
-
-result_matrix=np.zeros(shape=(len(algorithms),len(algorithms)))
-
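-# Pairwise weighted Kendall's tau between the category rankings produced by each recall algorithm.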
-for i, algo_a in enumerate(algorithms):
- for j, algo_b in enumerate(algorithms):
- if algo_a==algo_b:
- result_matrix[i][j]=1
- else:
- tau, p_value = stats.weightedtau(field_predictions_collect[algo_a], field_predictions_collect[algo_b])
- result_matrix[i][j] = tau
-
-
-fig = go.Figure(data=go.Heatmap(
- z=result_matrix,
- x=algorithms,
- y=algorithms,
- colorscale='aggrnyl',
-reversescale=True))
-
-plot(fig, filename=output_dir+'categorical_rankin_heatmap.html')
-exit()
-
-import plotly.express as px
-from sklearn.manifold import TSNE
-from plotly.offline import plot
-
-vec_types=['elmo','w2v','glove']
-for vec in vec_types:
- standards_df_mod=get_paragraph_vectors(standards_df, vec)
- X = np.array(list(standards_df_mod[vec]))
- print('data shape', X.shape)
-
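- # Project the high-dimensional embeddings down to 2-D with t-SNE for plotting.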
- X_embedded = TSNE(n_components=2).fit_transform(X)
- standards_df_mod['x_axis'] = X_embedded[:, 0]
- standards_df_mod['y_axis'] = X_embedded[:, 1]
-
- fig = px.scatter(standards_df_mod, x='x_axis', y='y_axis', color="field")
- plot(fig, filename=output_dir+'T_SNE_'+vec+'.html')
-
-exit()
-
-# collect top n Ids from each fast algorithm
-collect_results={}
-results_count=50
-algos=['cosine-sim', 'elmo-sim', 'w2v-sim', 'glove-sim']
-for algo in algos:
- ids, distances, _, _= get_similar_standards(text_sow, standards_df, algo=algo)
- collect_results[algo]=(ids[:results_count],distances[:results_count])
- print(ids, distances)
-
-savemodel(collect_results, temp_dir+'collect_results')
-exit()
-
-
-collect_results=loadmodel(temp_dir+'collect_results')
-# merge all the Ids and rank using all slow and fast algorithms
-all_ids=[]
-for k,v in collect_results.items():
- ids=v[0]
- all_ids+=ids
-
-standards_df_top=standards_df[standards_df['id'].isin(all_ids)]
-standards_df_top=standards_df_top.reset_index()
-
-collect_results_2={}
-algos=['w2v-wmd-sim', 'elmo-wmd-sim', 'glove-wmd-sim', 'cosine-sim', 'elmo-sim', 'w2v-sim', 'glove-sim']
-for algo in algos:
- ids, distances, _, _= get_similar_standards(text_sow, standards_df_top, algo=algo)
- collect_results_2[algo]=(ids, distances)
- print(ids, distances)
-
-savemodel(collect_results_2, temp_dir+'collect_results_2')
-exit()
-
-
-# calculate pairwise (for each pair of algorithms) correlations of the rankings:
-import plotly.graph_objects as go
-from plotly.offline import plot
-from scipy import stats
-
-
-collect_results_2=loadmodel(temp_dir+'collect_results_2')
-algorithms=list(collect_results_2.keys())
-
-for algo in algorithms:
- rankings=[item[1] for item in sorted(zip(collect_results_2[algo][0], collect_results_2[algo][1] ), key=lambda x: x[0])]
- collect_results_2[algo]=rankings
-
-result_matrix=np.zeros(shape=(len(algorithms), len(algorithms)))
-
-for i, algo_a in enumerate(algorithms):
- for j, algo_b in enumerate(algorithms):
- if algo_a==algo_b:
- result_matrix[i][j]=1
- else:
- tau, p_value = stats.weightedtau(collect_results_2[algo_a], collect_results_2[algo_b])
- result_matrix[i][j] = tau
- print(tau, algo_a, algo_b)
-
-
-fig = go.Figure(data=go.Heatmap(
- z=result_matrix,
- x=algorithms,
- y=algorithms,
- colorscale='aggrnyl',
- reversescale=True))
-
-plot(fig, filename=output_dir+'algorithms_comparison_heatmap.html')
-exit()
-
-"""
-Note: the results above show that elmo-sim is most similar to elmo-wmd-sim, which gives some confidence in using it as a recall algorithm.
-- Think about a two-step algorithm: use a fast method to get to an area, then do slow matches to find the related standards around it.
-"""
\ No newline at end of file
diff --git a/api/text_analysis/extract_prep.py b/api/text_analysis/extract_prep.py
deleted file mode 100644
index ebbcba6..0000000
--- a/api/text_analysis/extract_prep.py
+++ /dev/null
@@ -1,140 +0,0 @@
-import json
-import os
-import pathlib
-import subprocess
-from collections import deque
-import numpy as np
-import dill
-import pandas as pd
-from elasticsearch import Elasticsearch
-from elasticsearch.helpers import scan
-from pandas.io.json import json_normalize
-from sklearn.feature_extraction import text
-from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.neighbors import NearestNeighbors
-from sklearn.preprocessing import normalize
-from standard_extractor import find_standard_ref
-from web_utils import connect_to_es
-from text_analysis.prepare_h_cat import clean_ngram
-import time
-
-
-def parse_text(filepath):
- if os.path.exists(filepath + "_parsed.txt"):
- # todo: remove this. Caches the parsed text.
- return str(open(filepath + "_parsed.txt", "r").read())
-
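- # Shell out to the bundled Tika jar to extract plain text (-t) from the file.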
- bashCommand = "java -jar standards_extraction/lib/tika-app-1.16.jar -t " + filepath
- output = ""
- try:
- output = subprocess.check_output(["bash", "-c", bashCommand])
- # file = open(filepath + "_parsed.txt", "wb")
- # file.write(output)
- # file.close()
- except subprocess.CalledProcessError as e:
- print(e.output)
- return str(output)
-
-
-def transform(df):
- df = df.reset_index(drop=True)
- df.fillna("", inplace=True)
- print("shape")
- print(df.shape)
- tfidftransformer = TfidfVectorizer(
- ngram_range=(1, 1), stop_words=text.ENGLISH_STOP_WORDS
- )
- # start = time.time()
- # df["description_clean"] = df["description"].apply(
- # lambda x: " ".join(clean_ngram(x))
- # )
- # end = time.time() - start
- # print(end)
- X = tfidftransformer.fit_transform(
- [m + " " + n for m, n in zip(df["description"], df["title"])]
- )
- print("shape", X.shape)
- X = normalize(X, norm="l2", axis=1)
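- # Build a brute-force cosine nearest-neighbour index over every standard (n_neighbors equals the corpus size).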
- nbrs_brute = NearestNeighbors(
- n_neighbors=X.shape[0], algorithm="brute", metric="cosine"
- )
- print("fitting")
- nbrs_brute.fit(X.todense())
- print("fitted")
- return tfidftransformer, X, nbrs_brute
-
-
-def predict(file=None, in_text=None, size=10, read="feather"):
- """
- Predict recommendations given raw text or a PDF file.
- Fields in the dataframe:
- ['id',
- 'raw_id',
- 'doc_number',
- 'description',
- 'status',
- 'technical_committee',
- 'text',
- 'title',
- 'published_date',
- 'isbn',
- 'url',
- 'ingestion_date',
- 'hash',
- 'sdo.iso.code',
- 'sdo.iso.field',
- 'sdo.iso.group',
- 'sdo.iso.subgroup',
- 'sdo.iso.edition',
- 'sdo.iso.number_of_pages',
- 'sdo.iso.section_titles',
- 'sdo.iso.sections',
- 'sdo.iso.type',
- 'sdo.iso.preview_url',
- 'category.ics']
- """
- if file:
- if file.filename == "":
- raise ValueError("No selected file!")
- in_text = parse_text(file.filename)
-
- # Get text from form
- # file = open("temp_text", "w")
- # file.write(str(new_text.encode("utf-8", "ignore")))
- # file.flush()
- # file.close()
- if read == "es":
- # Connect to Elasticsearch
- es, idx_main, idx_log, idx_stats = connect_to_es()
- res = list(scan(es, query={}, index=idx_main))
- output_all = deque()
- output_all.extend([x["_source"] for x in res])
- df = json_normalize(output_all)
- if read == "feather":
- df = pd.read_feather("/app/data/feather_text")
-
- tfidftransformer, X, nbrs_brute = transform(df)
-
- result = {}
- result["recommendations"] = []
- sow = tfidftransformer.transform([in_text])
- sow = normalize(sow, norm="l2", axis=1)
-
- # This is memory intensive.
- distances, indices = nbrs_brute.kneighbors(sow.todense())
- print(distances)
- distances = list(distances[0])
- indices = list(indices[0])
-
- for indx, dist in zip(indices[:size], distances[:size]):
- st_id = df.iloc[indx]["id"]
- result["recommendations"].append(
- {
- "sim": 100 * round(1 - dist, 21),
- "id": st_id,
- }
- )
- print(st_id)
- print("debugging...")
- print(result)
- return result
diff --git a/api/text_analysis/old_extract_prep.py b/api/text_analysis/old_extract_prep.py
deleted file mode 100644
index 7acc7fd..0000000
--- a/api/text_analysis/old_extract_prep.py
+++ /dev/null
@@ -1,229 +0,0 @@
-import os
-import json
-import subprocess
-from sklearn.neighbors import NearestNeighbors
-from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.preprocessing import normalize
-import dill
-import pandas as pd
-from sklearn.feature_extraction import text
-from standard_extractor import find_standard_ref
-import pathlib
-from elasticsearch import Elasticsearch
-from web_utils import connect_to_es
-
-
-# Connect to Elasticsearch
-es, idx_main, idx_log, idx_stats = connect_to_es()
-
-
-def parse_text(filepath):
- if os.path.exists(filepath + "_parsed.txt"):
- # todo: remove this. Caches the parsed text.
- return str(open(filepath + "_parsed.txt", "r").read())
-
- bashCommand = "java -jar standards_extraction/lib/tika-app-1.16.jar -t " + filepath
- output = ""
- try:
- output = subprocess.check_output(["bash", "-c", bashCommand])
- file = open(filepath + "_parsed.txt", "wb")
- file.write(output)
- file.close()
- except subprocess.CalledProcessError as e:
- print(e.output)
- return str(output)
-
-
-def transform(df):
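- # Fit a TF-IDF vectorizer on description+title and build a brute-force cosine nearest-neighbour index over the standards.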
- df = df[df["type"] == "standard"].reset_index(drop=True)
- df.fillna("", inplace=True)
- tfidftransformer = TfidfVectorizer(
- ngram_range=(1, 1), stop_words=text.ENGLISH_STOP_WORDS
- )
- X = tfidftransformer.fit_transform(
- [m + " " + n for m, n in zip(df["description_clean"], df["title"])]
- )
- X = normalize(X, norm="l2", axis=1)
- nbrs_brute = NearestNeighbors(
- n_neighbors=X.shape[0], algorithm="brute", metric="cosine"
- )
- nbrs_brute.fit(X.todense())
- return tfidftransformer, X, nbrs_brute
-
-
-def predict_from_es(file=None, text=None, size=10):
- if file:
- if file.filename == "":
- return "No selected file!"
- new_text = parse_text(file.filename)
-
- else:
- # Get text from form
- new_text = text
- file = open("temp_text", "w")
- file.write(str(new_text.encode("utf-8", "ignore")))
- file.flush()
- file.close()
- res = es.search(index=idx_main, body={"query": {"match_all": {}}})
- df = pd.concat(map(pd.DataFrame.from_dict, res), axis=1)["fields"].T
- print(df)
- exit()
- tfidftransformer, X, nbrs_brute = transform(df)
-
- standard_refs = find_standard_ref(new_text)
- result = {}
- result["embedded_references"] = standard_refs
- result["recommendations"] = []
- sow = tfidftransformer.transform([new_text])
- sow = normalize(sow, norm="l2", axis=1)
-
- distances, indices = nbrs_brute.kneighbors(sow.todense())
- print(distances)
- distances = list(distances[0])
- indices = list(indices[0])
-
- for indx, dist in zip(indices[:size], distances[:size]):
- title = df.iloc[indx]["title"]
- description = df.iloc[indx]["description"]
- link = df.iloc[indx]["link"]
- standard_code = df.iloc[indx]["standard"]
- standard_id = df.iloc[indx]["id"]
- code = df.iloc[indx]["code"]
- tc = df.iloc[indx]["tc"]
- result["recommendations"].append(
- {
- "sim": 100.0 * round(1 - dist, 2),
- "raw_id": standard_id,
- "code": code,
- }
- )
- return result
-
-
-def predict_test(file=None, in_text=None):
- dirPath = str(pathlib.Path(__file__).parent.absolute())
-
- standards_dir = dirPath + "/../standards/data"
- json_output_dir = "output"
- models_dir = "models"
-
- # TODO: Fix this line
- # df = pd.concat(map(pd.DataFrame.from_dict, res), axis=1)
- df = pd.read_csv(
- os.path.join(standards_dir, "iso_final_all_clean_text.csv"), index_col=0
- )
- print(df)
- exit()
-
-
-def predict(file=None, in_text=None, size=10):
- dirPath = str(pathlib.Path(__file__).parent.absolute())
-
- standards_dir = dirPath + "/../standards/data"
- json_output_dir = "output"
- models_dir = "models"
-
- # TODO: Fix this line
- # df = pd.concat(map(pd.DataFrame.from_dict, res), axis=1)
- df = pd.read_csv(
- os.path.join(standards_dir, "iso_final_all_clean_text.csv"), index_col=0
- )
- # print(df2)
- df = df[df["type"] == "standard"].reset_index(drop=True)
- df.fillna("", inplace=True)
- print("shape")
- print(df.shape)
- tfidftransformer = TfidfVectorizer(
- ngram_range=(1, 1), stop_words=text.ENGLISH_STOP_WORDS
- )
- X = tfidftransformer.fit_transform(
- [m + " " + n for m, n in zip(df["description_clean"], df["title"])]
- ) # using both description and title to predict
- # tfidftransformer = TfidfVectorizer(ngram_range=(1,1))
- # X = tfidftransformer.fit_transform([m+' '+n for m, n in zip(df['description'], df['title'])]) # using both description and title to predict
- print("shape", X.shape)
- X = normalize(X, norm="l2", axis=1)
- nbrs_brute = NearestNeighbors(
- n_neighbors=X.shape[0], algorithm="brute", metric="cosine"
- )
- print("fitting")
- nbrs_brute.fit(X.todense())
- print("fitted")
-
- # How do we get request.file
- new_text = ""
- # ======================== find the referenced standards
- filename = "temp_text"
- # check if the post request has the file part
- if file:
- print("made it to the PDF part")
- # INSERT FILES OBJECT HERE
- # file = files['file']
- # if the user does not select a file, the browser also
- # submits an empty part without a filename
- if file.filename == "":
- return "no selected file!"
- # return redirect(request.url)
- if file:
- filename = file.filename
- ### Save the file
- new_text = parse_text(filename)
- print("parsed")
- else:
- # get text from form
- new_text = in_text
- file = open(filename, "w")
- file.write(str(new_text.encode("utf-8", "ignore")))
- file.flush()
- file.close()
-
- print("extracting standards")
- # standard_refs=extract_standard_ref(filename)
- standard_refs = find_standard_ref(new_text)
- print("standards extracted")
-
- # ======================== find the recommended standards
-
- result = {}
- result["embedded_references"] = standard_refs
- result["recommendations"] = []
-
- sow = tfidftransformer.transform([new_text])
- sow = normalize(sow, norm="l2", axis=1)
-
- print("scoring standards")
- distances, indices = nbrs_brute.kneighbors(sow.todense())
- distances = list(distances[0])
- indices = list(indices[0])
-
- for indx, dist in zip(indices[:size], distances[:size]):
- title = df.iloc[indx]["title"]
- description = df.iloc[indx]["description"]
- link = df.iloc[indx]["link"]
- standard_code = df.iloc[indx]["standard"]
- standard_id = df.iloc[indx]["id"].replace("~", "")
- code = df.iloc[indx]["code"].replace("~", "")
- tc = df.iloc[indx]["tc"]
- type_standard = ["Information Technology"]
-
- # TODO: this code calculates the word importances for the top results (slows the operation, hence commented)
- # print(title)
- # print(description)
- #
- # to_print = [
- # tfidftransformer.get_feature_names()[i]
- # + ' ' +
- # str(abs(np.array(sow[0].todense()).flatten()[i] - np.array(X[indx].todense()).flatten()[i]))
- # for i in set(sow.indices).intersection(X[indx].indices)]
- # print(' || '.join(to_print), '\n')
-
- result["recommendations"].append(
- {
- "sim": 100 * round(1 - dist, 21),
- "raw_id": standard_id,
- "code": code,
- }
- )
- print("Make sure it's an error here and not in ES.")
- print(result)
- return result
diff --git a/api/text_analysis/output/T_SNE_elmo.html b/api/text_analysis/output/T_SNE_elmo.html
deleted file mode 100644
index 2dd4603..0000000
--- a/api/text_analysis/output/T_SNE_elmo.html
+++ /dev/null
@@ -1,31 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/api/text_analysis/output/T_SNE_glove.html b/api/text_analysis/output/T_SNE_glove.html
deleted file mode 100644
index 816dc4c..0000000
--- a/api/text_analysis/output/T_SNE_glove.html
+++ /dev/null
@@ -1,31 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/api/text_analysis/output/T_SNE_w2v.html b/api/text_analysis/output/T_SNE_w2v.html
deleted file mode 100644
index e9f5183..0000000
--- a/api/text_analysis/output/T_SNE_w2v.html
+++ /dev/null
@@ -1,31 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/api/text_analysis/output/algorithms_comparison_heatmap.html b/api/text_analysis/output/algorithms_comparison_heatmap.html
deleted file mode 100644
index 0500d4e..0000000
--- a/api/text_analysis/output/algorithms_comparison_heatmap.html
+++ /dev/null
@@ -1,31 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/api/text_analysis/output/categorical_rankin_heatmap.html b/api/text_analysis/output/categorical_rankin_heatmap.html
deleted file mode 100644
index fc70a4d..0000000
--- a/api/text_analysis/output/categorical_rankin_heatmap.html
+++ /dev/null
@@ -1,31 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/api/text_analysis/output/cosine_sim_hcat.html b/api/text_analysis/output/cosine_sim_hcat.html
deleted file mode 100644
index cbf5cee..0000000
--- a/api/text_analysis/output/cosine_sim_hcat.html
+++ /dev/null
@@ -1,31 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/api/text_analysis/output/elmo_sim_hcat.html b/api/text_analysis/output/elmo_sim_hcat.html
deleted file mode 100644
index f81ed2f..0000000
--- a/api/text_analysis/output/elmo_sim_hcat.html
+++ /dev/null
@@ -1,31 +0,0 @@
-
-
-
-
-
-
\ No newline at end of file
diff --git a/api/text_analysis/output/glove_sim_hcat.html b/api/text_analysis/output/glove_sim_hcat.html
deleted file mode 100644
index 915e581..0000000
--- a/api/text_analysis/output/glove_sim_hcat.html
+++ /dev/null
@@ -1,31 +0,0 @@
-
-
-
-
-
-
-