Path to network implementation of OCR-D
- In the simplest (and current) form, the controller will be an SSH login server for a full command-line OCR-D installation. Files must be mounted locally (if they reside on network shares, the mount must be done on the host side running the container).
- Next, the SSH server can also dynamically receive and send data.
- The first true network implementation will offer an HTTP interface for processing (like the workflow server).
- From there, the actual processing could be further delegated into different processing servers.
- A more powerful workflow engine would then allow instantiating different workflows and monitoring jobs.
- In the final form, the controller will implement (most parts of) the OCR-D Web API.
Build or pull the Docker image:
make build # or docker pull ghcr.io/slub/ocrd_controller
Then run the container – providing host-side directories for the volumes …
- DATA: directory for data processing (including images or existing workspaces); defaults to the current working directory
- MODELS: directory for persistent storage of processor resource files; defaults to ~/.local/share (models will be under ./ocrd-resources/*)
- CONFIG: directory for persistent storage of the processor resource list; defaults to ~/.config (the file will be under ./ocrd/resources.yml)
… but also a file KEYS with public-key credentials for logging in to the controller, and (optionally) some environment variables …
- WORKERS: number of parallel jobs (i.e. concurrent login sessions for ocrd); should be set to match the available computing resources
- UID: numerical user identifier to be used by programs in the container (will affect the files modified/created); defaults to the current user
- GID: numerical group identifier to be used by programs in the container (will affect the files modified/created); defaults to the current group
- UMASK: numerical user mask to be used by programs in the container (will affect the files modified/created); defaults to 0002
- PORT: numerical TCP port to expose the SSH server on the host side; defaults to 8022 (for non-privileged access)
- NETWORK: name of the Docker network to use; defaults to bridge (the default Docker network)
… thus, for example:
make run DATA=/mnt/workspaces MODELS=~/.local/share KEYS=~/.ssh/id_rsa.pub PORT=8022 WORKERS=3
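For reference, the make run target amounts to a docker run invocation roughly like the following (a sketch only: the authoritative flags live in the Makefile, and all container-side mount points and ports except /data are assumptions here):
# hypothetical expansion of the make run call above
docker run -d --name ocrd_controller --network bridge \
  -p 8022:22 \
  -e WORKERS=3 -e UID="$(id -u)" -e GID="$(id -g)" -e UMASK=0002 \
  -v /mnt/workspaces:/data \
  -v ~/.local/share:/models \
  -v ~/.config:/config \
  -v ~/.ssh/id_rsa.pub:/authorized_keys:ro \
  ghcr.io/slub/ocrd_controller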
Then you can log in as user ocrd from remote (but let's use the host name controller in the following – without loss of generality):
ssh -p 8022 ocrd@controller bash -i
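Log-in is authenticated against the public key(s) passed as KEYS when starting the container. If you do not have a key pair yet, you can generate one with standard OpenSSH tooling and pass its public half on the next make run:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ''
make run KEYS=~/.ssh/id_ed25519.pub ...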
Unless you already have the data in workspaces, you need to create workspaces prior to processing. For example:
ssh -p 8022 ocrd@controller "ocrd-import -P some-document"
For actual processing, you will first need to download some models into your MODELS volume (note the quoting, so the remote shell does not expand the *):
ssh -p 8022 ocrd@controller "ocrd resmgr download ocrd-tesserocr-recognize '*'"
Subsequently, you can use these models on your DATA files:
ssh -p 8022 ocrd@controller "ocrd process -m some-document/mets.xml 'tesserocr-recognize -P segmentation_level region -P model Fraktur'"
# or equivalently:
ssh -p 8022 ocrd@controller "ocrd-tesserocr-recognize -m some-document/mets.xml -P segmentation_level region -P model Fraktur"
If your data files cannot be directly mounted on the host (not even as a network share), then you can use rsync, scp or sftp to transfer them to the server:
rsync -av -e "ssh -p 8022" some-directory ocrd@controller:/data
scp -P 8022 -r some-directory ocrd@controller:/data
echo "put -r some-directory /data" | sftp -P 8022 ocrd@controller
Analogously, to transfer the results back:
rsync -av -e "ssh -p 8022" ocrd@controller:/data/some-directory .
scp -P 8022 -r ocrd@controller:/data/some-directory .
echo "get -r /data/some-directory" | sftp -P 8022 ocrd@controller
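Putting these pieces together, a complete round trip for a directory of images (upload, import, process, download; names are illustrative) might look like this:
rsync -av -e "ssh -p 8022" some-document ocrd@controller:/data
ssh -p 8022 ocrd@controller "ocrd-import -P some-document && ocrd process -m some-document/mets.xml 'tesserocr-recognize -P segmentation_level region -P model Fraktur'"
rsync -av -e "ssh -p 8022" ocrd@controller:/data/some-document .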
For parallel processing, you can either
- run multiple processes on a single controller, by
  - logging in multiple times, or
  - issuing parallel commands
    - via basic shell scripting (see the sketch after this list), or
    - via ocrd-make calls, or
- run processes on multiple controllers.
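For instance, basic shell scripting could queue one job per workspace and let the controller throttle them (a sketch with hypothetical workspace names):
# all sessions start immediately; excess jobs block in the controller's queue
for doc in doc1 doc2 doc3; do
    ssh -p 8022 ocrd@controller "ocrd process -m $doc/mets.xml 'tesserocr-recognize -P segmentation_level region -P model Fraktur'" &
done
wait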
Note: internally, WORKERS is implemented as a (GNU parallel-based) semaphore wrapping the SSH sessions inside blocking sem --fg calls within .ssh/rc. Thus, commands will get queued, but not processed until a 'worker' is free.
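To see these semantics in isolation, GNU parallel's sem can be exercised standalone (a demo on any machine with GNU parallel installed, not the controller's actual rc file):
sem --fg --id demo --jobs 2 'sleep 10'  # blocks while holding one of 2 slots
Starting a third such call while two are running will block until a slot is freed.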
All logs are accumulated on standard output, which can be inspected via Docker:
docker logs ocrd_controller
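To follow the log live, the usual flag applies:
docker logs -f ocrd_controller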
If you have any questions or encounter any problems, please do not hesitate to contact me.