Skip to content
forked from texta-tk/texta

Terminology EXtraction and Text Analytics (TEXTA) Toolkit

License

Notifications You must be signed in to change notification settings

WMRamadan/texta

 
 

Repository files navigation

FastAPI

TEXTA Toolkit 3

Documentation

https://docs.texta.ee

Wiki

https://git.texta.ee/texta/texta-rest/wikis/home

Requirements

  • Anaconda or Miniconda (Recommended)
  • Docker & Docker-Compose (Only required for Docker deployment)

Notes

Works with Python 3.8

Creating environment:

conda env create -f environment.yaml

Running migrations:

python3 migrate.py

  • This script will also create an admin account with the default username "admin". You can use python migrate.py -u {{username}} instead for a custom username of your choice.
  • Password for that admin account will be generated automatically and printed to the console. This behaviour can be overwritten with the environment variable TEXTA_ADMIN_PASSWORD, in which case the password will be set to the same value as 'TEXTA_ADMIN_PASSWORD'.
  • Running python migrate.py -o will overwrite the password with whatever value you have inside the environment variable TEXTA_ADMIN_PASSWORD.

Running application:

python3 manage.py runserver

celery -A toolkit.taskman worker -l info

Import testing data:

python3 import_test_data.py

Run all tests:

python3 manage.py test

Run tests for specific app:

python3 manage.py test appname (eg python3 manage.py test toolkit.neurotagger)

Run performance tests (not run by default as they are slow):

python3 manage.py test toolkit.performance_tests

Building Docker:

docker build -t texta-rest:latest -f docker/Dockerfile .

Running Docker:

docker run -p 8000:8000 texta-rest:latest

Building Docker with GPU support:

docker build -t texta-rest:gpu-latest -f docker/gpu.Dockerfile .

Running Docker with GPU support requires NVIDIA Container Toolkit to be installed on the host machine: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker. When Container Toolkit is installed:

docker run --gpus all -p 8000:8000 texta-rest:latest-gpu

Environment variables

Deploy & Testing variables

  • TEXTA_ENV_FILE - Optional file path for a typical .env file to be loaded into memory for TEXTA Toolkit (Default: None).

  • TEXTA_SECRET_KEY - String key for cryptographic security purposes. ALWAYS SET IN PRODUCTION.

  • TEXTA_CORS_ALLOW_CREDENTIALS - Whether to allow cookies to be included in cross-site HTTP requests (Default: True)

  • TEXTA_CORS_ALLOW_ALL_ORIGINS - Whether to allow requests from all Origins (Default: false)

  • TEXTA_CELERY_USED_QUEUES - Comma separated list of Celery queues you are using for TEXTA Toolkit. No need to touch when running a standard configuration.

  • TEXTA_ELASTIC_VERSION - Must equal to the integer of the main Elasticsearch cluster version (Default: 6).

  • TEXTA_DEPLOY_KEY - Used to separate different Toolkit instances for cases where Elasticsearch or the database are shared amongst multiple instances. Best to give this a simple number (Default: 1).

  • TEXTA_ADMIN_PASSWORD - Password of the admin user created on first run.

  • TEXTA_USE_CSRF - Whether to disable CSRF for integration tests (Default: false).

  • TEXTA_CELERY_ALWAYS_EAGER - Whether to use Celerys async features or not, useful for testing purposes locally. ( Default: False)

  • TEXTA_DATA_DIR - Path to the directory in which TEXTA Toolkit saves the models it generates, and the binary model dependencies it needs (Default: data).

  • TEXTA_EXTERNAL_DATA_DIR - Path to the base directory in which 3rd party models (MLP/BERT/etc) are kept (Default: data/models).

  • TEXTA_CACHE_DIR - Path for the cache folder which BERT uses (Default: data/external/.cache).

  • TEXTA_RELATIVE_MODELS_DIR - Relative path of the directory in which all the different types of models are stored in. (Default: "/data/models").

  • TEXTA_LANGUAGE_CODES - Comma separated string of Stanza supported language codes to use for Multilingual Processing. (Default: "").

  • TEXTA_MLP_USE_GPU - Use GPU to speed up MLP (Default: False).

  • TEXTA_MLP_MODEL_DIRECTORY_PATH - Relative path to the directory into which Stanza models will be stored under the " stanza" folder (setting this to ./home/texta will create ./home/texta/stanza which contains subfolders for every language like ./home/texta/stanza/et etc). (Default: "./data/external/mlp").

  • TEXTA_ALLOW_BERT_MODEL_DOWNLOADS - Boolean flag indicating if the users can download additional BERT models. (Default: True).

  • TEXTA_BERT_MODEL_DIRECTORY_PATH - Relative path to the directory into which pretrained and fine-tuned BERT models will be stored under the "bert_tagger" folder. (setting this to ./home/texta will create ./home/texta/bert_tagger/pretrained/ which contains subfolders for every downloaded bert more like ./home/texta/bert_model/pretrained/bert-base-multilingual-cased etc and ./home/texta/bert_model/fine_tuned/ which will store fine-tuned BERT models. (Default: "./data/models").

  • TEXTA_NLTK_DATA_DIRECTORY_PATH - Path of the directory where the NLTK library keeps its resources (Default: data/external/nltk).

  • TEXTA_BERT_MODELS - Comma seprated string of pretrained BERT models to download. (Default: "bert-base-multilingual-cased,bert-base-uncased,EMBEDDIA/finest-bert").

  • SKIP_BERT_RESOURCES - If set "True", skips downloading pretrained BERT models. (Default: false).

  • SKIP_MLP_RESOURCES - Whether to skip downloading MLP resources on application boot-up (Default: false).

  • SKIP_NLTK_RESOURCES - Whether to skip downloading NLTK library resources on application boot-up (Default: false).

  • TEXTA_EVALUATOR_MEMORY_BUFFER_GB - The minimum amount of memory that should be left free while using the evaluator, unit = GB. (Default = 50% of available_memory)

  • TEXTA_DATASOURCE_CHOICES - Choices for index domain field given as a list ex: [['prefix_name', 'display_name']]. ( Default = [["emails", "emails"], ["news articles", "news articles"], ["comments", "comments"] , ["court decisions", "court decisions"], ["tweets", "tweets"], ["forum posts", "forum posts"] , ["formal documents", "formal documents"], ["other", "other"]])

  • TOOLKIT_PROJECT_DATA_PATH - Path of the directory in which project specific data is kept (Default: data/projects).

External services

  • TEXTA_ES_PREFIX - String used to limit Elasticsearch index access. Only indices matched will be the ones matching " {TEXTA_ES_PREFIX}*".
  • TEXTA_ES_URL - URL of the Elasticsearch instance including the protocol, host and port (ex. http://localhost:9200).
  • TEXTA_REDIS_URL - URL of the Redis instance including the protocol, host and port (ex. redis://localhost:6379).

Django specifics

  • TEXTA_CORS_ORIGIN_WHITELIST - Comma separated string of urls (NO WHITESPACE) for the CORS whitelist. Needs to include the protocol (ex. http://* or http://*,http://localhost:4200).
  • TEXTA_ALLOWED_HOSTS - Comma separated string (NO WHITESPACE) representing the host/domain names that this Django site can serve (ex. * or *,http://localhost:4200).
  • TEXTA_DEBUG - True/False values on whether to run Django in its Debug mode or not (Default: true).
  • TEXTA_MAX_UPLOAD - Maximum size of files in bytes that are allowed to be updated, which Django validates (Default: 1073741824 aka 1GB)

Database credentials

  • DJANGO_DATABASE_ENGINE - https://docs.djangoproject.com/en/3.0/ref/settings/#engine
  • DJANGO_DATABASE_NAME - The name of the database to use. For SQLite, it’s the full path to the database file. When specifying the path, always use forward slashes, even on Windows.
  • DJANGO_DATABASE_USER - The username to use when connecting to the database. Not used with SQLite.
  • DJANGO_DATABASE_PASSWORD - The password to use when connecting to the database. Not used with SQLite.
  • DJANGO_DATABASE_HOST - Which host to use when connecting to the database. An empty string means localhost. Not used with SQLite.
  • DJANGO_DATABASE_PORT - The port to use when connecting to the database. An empty string means the default port. Not used with SQLite.

Docker specific configurations:

  • TEXTA_SHORT_TASK_WORKERS - Number of processes available for short term tasks (Default: 2).
  • TEXTA_LONG_TASK_WORKERS - Number of processes available for long term tasks (Default: 4).
  • TEXTA_MLP_TASK_WORKERS - Number of processes available for MLP based tasks (Default: 2).
  • TEXTA_SHORT_MAX_TASKS - Number of tasks per worker for short term tasks (Default: 10).
  • TEXTA_LONG_MAX_TASKS - Number of tasks per worker for long term tasks (Default: 10).
  • TEXTA_MLP_MAX_TASKS - Number of tasks per worker for MLP based tasks (Default: 10).
  • TEXTA_BEAT_LOG_LEVEL - Which log level should beat output within the Docker image (Default: WARNING).
  • TEXTA_CELERY_LOG_LEVEL - Which log level should Celery workers output within the Docker image (Default: WARNING).

Extra Elasticsearch connection configurations

Unless you have a specially configured Elasticsearch instance, you can ignore these options.

  • TEXTA_ES_USER - Username to authenticate to a secured Elasticsearch instance.
  • TEXTA_ES_PASSWORD - Password to authenticate to a secured Elasticsearch instance.

https://elasticsearch-py.readthedocs.io/en/6.3.1/connection.html#elasticsearch.Urllib3HttpConnection:

  • TEXTA_ES_USE_SSL
  • TEXTA_ES_VERIFY_CERTS
  • TEXTA_ES_CA_CERT_PATH
  • TEXTA_ES_CLIENT_CERT_PATH
  • TEXTA_ES_CLIENT_KEY_PATH
  • TEXTA_ES_TIMEOUT
  • TEXTA_ES_SNIFF_ON_START
  • TEXTA_ES_SNIFF_ON_FAIL

UAA specific configurations

  • TEXTA_USE_UAA - Whether to include UAA authentication with the default authentication (Default: false).

  • TEXTA_UAA_SCOPES - Which scopes should be sent with communication between TEXTA Toolkit and UAA (Default: openid texta.*).

  • TEXTA_UAA_SUPERUSER_SCOPE - Which scope to use for determining whether an UAA user is a superuser (Default: texta.admin).

  • TEXTA_UAA_PROJECT_ADMIN_SCOPE - Which scope to use to specify whether an UAA user has project administrator rights to ANY project which is available to them (Default: texta.project_admin).

  • TEXTA_UAA_SCOPE_PREFIX - Prefix for determining UAA user access to TEXTA Toolkit. Any user who does not have a scope which matches the pattern "{TEXTA_UAA_SCOPE_PREFIX}.*" will be denied entry to TEXTA Toolkit (Default: texta).

  • TEXTA_UAA_URL - URI for the UAA service (Default: http://localhost:8080).

  • TEXTA_UAA_REDIRECT_URI - URI into which the user will be redirected after a successful UAA login ( Default: http://localhost:8000/api/v2/uaa/callback).

  • TEXTA_UAA_FRONT_REDIRECT_URL - Configuration for the front end to determine where it will redirect the user after a successful login with UAA by Toolkit (Default: http://localhost:4200/oauth/uaa)

  • TEXTA_UAA_CLIENT_ID - UAA client ID for authenticating the TEXTA Toolkit application for UAA. Must be kept secret

  • TEXTA_UAA_CLIENT_SECRET - Password for authenticating the TEXTA Toolkit application with UAA. Must be kept secret

Used By

About

Terminology EXtraction and Text Analytics (TEXTA) Toolkit

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 99.6%
  • Other 0.4%