This document describes the main design concepts and implementation details of the `deepfacility` tool.
- Environment Variables
- Configuration
- Location Format
- UX Sessions
- UX Background Tasks
- Multi-Language Support
- Logging
- Parallelization
- Data Caching
- Other Considerations and Limitations
The following environment variables are used in the tool:
| Environment Variable | Default | Description |
|---|---|---|
| `DEEPFACILITY_ROOT_DIR` | `app-data` | The root directory of the app. |
| `DEEPFACILITY_LANG_MODEL` | NLP | Sets the translation ML model to use (the default is `Helsinki-NLP/opus-mt-en-fr`). |
| `DEEPFACILITY_HOST` | `localhost` | Demo web app host name. |
| `DEEPFACILITY_PORT` | `8000` | Demo web app port. |
| `DEEPFACILITY_SID` | None | Sets the session id for the CLI scenario. |
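As a hypothetical sketch (the actual code may read these variables differently), the settings can be read with `os.environ`, falling back to the documented defaults:

```python
import os

# Read tool settings from the environment, using the documented defaults.
root_dir = os.environ.get("DEEPFACILITY_ROOT_DIR", "app-data")
host = os.environ.get("DEEPFACILITY_HOST", "localhost")
port = int(os.environ.get("DEEPFACILITY_PORT", "8000"))
sid = os.environ.get("DEEPFACILITY_SID")  # None unless set for the CLI scenario

print(root_dir, host, port)
```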
The configuration consists of user and system configuration TOML files. The user configuration focuses on files a user needs to provide and describes the required column names. The system configuration describes workflow input and results files.
During initialization, the following variables are replaced with values collected from users or determined based on user inputs:
- `{app_dir}`: App root directory containing the `cache`, `data`, and `downloads` directories and the config file. The default is `app-data`, positioned relative to the working directory (for example, the repo dir). It can be customized using the `DEEPFACILITY_ROOT_DIR` environment variable.
- `{data_dir}`: Data directory containing country directories with input and results files.
- `{country_code}`: Country code determined based on the village centers coordinates.
- `{level}`: Admin level determined based on the village centers coordinates.
The following variables are replaced with values determined at runtime:
- `{run_name}`: A unique run name based on the selected locations.
- `{location}`: The location iterator value.
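The placeholder substitution can be sketched as follows (a minimal illustration using `str.format`; the function name and path template are hypothetical, not the actual `Config` code):

```python
def resolve_path(template: str, **variables: str) -> str:
    """Replace {name}-style placeholders in a configured path template."""
    return template.format(**variables)

# Illustrative template; real templates come from the configuration files.
path = resolve_path(
    "{app_dir}/data/{country_code}/inputs/households.csv",
    app_dir="app-data",
    country_code="BFA",
)
print(path)  # app-data/data/BFA/inputs/households.csv
```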
The configuration initialization is done in the `deepfacility.config` module. The `Config` class is used to load, populate, and merge the user and system configuration files.
In the UX scenario, users are not directly exposed to configuration files. Instead, the session object, from the `deepfacility.ux` module, generates the configuration based on user inputs and uploaded files.
In the CLI scenario, the user can create a configuration file using the `config` command. Then the user can populate the configuration file to match the files and columns in the data.
Methods of `WorkflowEntity` subclasses have access to the config through the `cfg` field. Other functions receive the configuration object as an argument.
The configuration contains most of the information needed for the tool to run. This means that most functions could receive the configuration object as the only argument. However, this approach would make the code less modular and harder to test. Therefore, the general principle is for functions to take input values as explicit arguments and use parameters and file paths from the configuration object to generate outputs. This allows the code to be tested independently of the configuration object.
The configuration file is a TOML file which contains sections for user and system configurations:
- `args`: basic parameters like country, data directory, and thresholds, plus file paths and column names for user-provided files.
- `downloads`: URLs for downloading external data, directories for storing downloaded files, and coordinates column names.
- `inputs`: file paths and column names for input files generated by the data preparation workflow.
- `results`: file paths and column names for output files generated by the scientific workflow.
For more detail, see the configuration templates, which contain detailed descriptions of all configuration parameters.
The configuration file facilitates the directory structure of the app. A typical structure of the root `app-data` dir used by the web app looks like this:
```
app-data
│
├── downloads                        # all downloads for all countries
│   ├── GADM_shapes                  # all GADM shapes for all countries
│   │   └── gadm41_BFA_shp.zip
│   └── google_buildings             # all Google Open Buildings for all countries
│       ├── 0e3_buildings.csv.gz
│       ├── 0e5_buildings.csv.gz
│       ├── 0fb_buildings.csv.gz
│       ├── 0fd_buildings.csv.gz
│       └── 11d_buildings.csv.gz
│
├── 9593161cb53f                     # session id
│   ├── config.village_centers.toml  # generated user and system configuration file
│   └── data                         # all data for all countries
│       └── BFA                      # all data for a specific country
│           ├── args                 # args: user provided data
│           │   ├── health_facilities.csv
│           │   └── locality_villages.csv
│           │
│           ├── inputs               # inputs: data generated by the data preparation workflow
│           │   ├── all_locations.csv            # all available admin locations
│           │   ├── baseline_facilities.csv      # baseline health facilities with matched admin names
│           │   ├── baseline_facilities.geojson  # baseline health facilities points for visualization
│           │   ├── buildings_BFA.feather        # Google Open Buildings clipped for BFA
│           │   ├── households.csv               # households coordinates with matched admin names
│           │   ├── households.stats.csv         # households stats
│           │   ├── prep.log                     # data preparation workflow log
│           │   ├── shapes                       # GADM shapes
│           │   ├── village_centers.csv          # village centers with matched admin names
│           │   └── village_centers.geojson      # village centers points for visualization
│           │
│           └── results              # results: data generated by the scientific workflow
│               └── Bale-Boromo_3_41353ad        # results for 3 communes (Bale-Boromo and two others)
│                   ├── cluster_centers.csv      # cluster centers
│                   ├── cluster_counts.csv       # cluster counts
│                   ├── cluster_stats.csv        # cluster stats
│                   ├── clustered_households.csv # clustered households
│                   ├── locations.csv            # locations
│                   ├── optimal_facilities.csv   # optimal health facilities
│                   ├── population_coverage_optimal.png   # optimal population coverage plot
│                   ├── population_coverage_baseline.png  # existing population coverage plot
│                   ├── run.log                  # scientific workflow log
│                   ├── village_shapes.geojson   # village shapes for visualization
│                   └── www                      # interactive visualization map
│
└── cache                            # cache directory managed by the system (joblib.Memory python package)
    └── deepfacility                 # cache directory for the deepfacility package
        ├── data                     # data preparation functions cache
        ├── tasks                    # scientific workflow functions cache
        └── utils                    # utility functions cache
```
As described in the main readme, a location is an administrative area where the clustering is performed. Location values are constructed from colon-separated names of administrative levels.
The default configuration for Burkina Faso specifies the following location formats:
- Communes: `{province}:{commune}` (e.g., `Tapoa:Diapaga`)
  - Data from GADM shapes (columns NAME2, NAME3)
- Villages: `{province}:{commune}:{village}` (e.g., `Tapoa:Diapaga:Mangou`)
  - Data from the user-provided village centers file (custom column)
Location columns are specified in the configuration:
- Commune column: system configuration > `[inputs.shapes].adm_cols`
- Village column: user configuration > `[args.village_centers].adm_cols`
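Constructing the colon-separated location values from admin-level names can be sketched as follows (the helper name is illustrative, not the actual code):

```python
def make_location(*admin_names: str) -> str:
    """Join administrative level names into a colon-separated location value."""
    return ":".join(admin_names)

commune = make_location("Tapoa", "Diapaga")
village = make_location("Tapoa", "Diapaga", "Mangou")
print(commune)  # Tapoa:Diapaga
print(village)  # Tapoa:Diapaga:Mangou
```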
The main purpose of the session object is to preserve references to the config and translator objects between requests and to facilitate the execution of background tasks.
UX session support is implemented in the `deepfacility.ux.session` module, with `deepfacility.ux.session.init` being the function which handles most of the session management: creation and retrieval.
The session object is created when the user starts the web app. At that time a session id is generated and used to store the reference to a session object in the session dictionary. The session dictionary is created in the FastAPI state object and preserved until the app is stopped.
The session id is also stored as a long-lived cookie in the user's browser. The session id is used to retrieve the session object from the session dictionary when the user makes a new request.
If the session id cookie is lost, the user would have to start a new session. This would also mean losing all the previously generated files. To mitigate that, a user has a few options:
- Set the session id in the URL: `http://localhost:8000?sid=1234`
- Set the session id in the `DEEPFACILITY_SID` environment variable before running the `deepfacility ux` command:

  ```bash
  # Set the session id with the environment variable
  export DEEPFACILITY_SID=1234
  deepfacility ux
  ```

- Use the CLI scenario where the session id is set as a command line argument:

  ```bash
  # Hardcode the session id for all sessions
  deepfacility ux --sid 1234
  ```
Background tasks are run using the FastAPI `BackgroundTasks` object to asynchronously execute the data preparation and scientific workflow commands in the background.
The background task execution is abstracted using the `config.Operation` and `ux.Session` classes.
The `Operation` class fields are used to control workflow function execution:
- `control_file` — a file used to signal the workflow function running in the background task to stop.
- `log_file` — points to a log file where workflow function logs are stored (and from which the logs are streamed to the UI).
- `logger` — a logger object used to log messages to the log file.
`Operation` is an abstract class inherited by the `Inputs` and `Results` config classes, which are used by the data preparation and scientific workflow functions.
The `ux.Session` class has a private member `_operation` which points to a workflow operation object when a user triggers execution. It also has `start_task` and `stop_task` methods meant to facilitate the execution of background tasks.
This section describes how workflow execution progress is monitored in the UI.
The steps below are for the data preparation workflow. The scientific workflow monitoring uses the same approach.
- User initiates data preparation:
  - The user clicks the "Prepare Input Data" button on the data prep page 30-prep.html.
  - This triggers the `ux.main.prep` function which:
    - Starts the data preparation background tasks.
    - Renders the status monitoring page 30-prep-status-container.html.
- Status monitoring:
  - The status monitoring page automatically sends a `/prep/status` request every 5 seconds.
    - This is specified with the htmx `hx-trigger` attribute as shown in the example below.
  - The `ux.main.prep_status` function on the server:
    - Checks the status of input files and background tasks.
    - Renders a status update response based on the progress.
  - The response is displayed in the `status` div in the page 30-prep-status.html.

    ```html
    <div id='container' hx-get='/prep/status' hx-target='#status' hx-swap='innerHTML' hx-trigger="load, every 5s">
    </div>
    <div id='status' style="overflow-y: auto">
        <p>Waiting for status...</p>
    </div>
    ```

- Success:
  - If all files are ready, the server prompts the browser to refresh.
    - The page refresh triggers the top-level htmx div element which sends the `/info` and `/driver` requests.
  - The `/driver` request handler renders 40-run.html, allowing execution of the scientific workflow.
- Failure:
  - If files are not ready, the `/driver` request handler:
    - Deletes the config file.
    - Returns a response to clear the `download` div.
  - The `/info` request handler renders the upload page again 10-upload.html.
    - This prompts the user to reconfigure input files and do the data prep again.
This section describes:
- the existing multi-language support design and implementation, and how to extend it
- how to enable automated translation to support additional languages or UI text updates
The default multi-language support is implemented using a simple dictionary containing English-French translations for all the text currently used in the tool. This approach has the advantage of being simple and fast. The main disadvantage is that it requires manual translation of all the text used in the tool and maintaining translations as the text is updated.
To address the limitations of the default multi-language support, the tool also implements multi-language support using pre-trained language models. This approach is more flexible and can be used to translate arbitrary text. In other words, it can be used to support additional languages or to support new text in the tool.
Both multi-language support approaches are implemented in the `deepfacility.lang` module by inheriting the `BaseTranslator` abstract class, which defines the translation interface. The interface consists of two functions:
- `set_language` — sets the language to be used for translation and performs all necessary initializations.
- `translate` — translates the input string to the current language.
The `BaseTranslator` abstract class also implements the factory class method `create` which instantiates a translator and sets the language. It takes the language and the request as arguments and, if the language is not explicitly set, extracts it from the request headers. It then calls `set_language` and returns the translator.
The `DefaultTranslator` implements the translation interface using a simple dictionary of English-French translations for all the text currently used in the tool. Its `set_language` method loads the translation dictionary and sets the language string. Translation is a simple English message lookup in the `messages` dictionary. The `DefaultTranslator` is simple and fast, and it fully supports the current text used in the tool.
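The interface and the dictionary-based translator can be sketched as follows (a simplified illustration; the actual classes in `deepfacility.lang` differ in detail, e.g. the real `create` can also extract the language from request headers, and the real dictionary covers all UI text):

```python
from abc import ABC, abstractmethod

class BaseTranslator(ABC):
    """Defines the translation interface."""

    @abstractmethod
    def set_language(self, language: str) -> None: ...

    @abstractmethod
    def translate(self, text: str) -> str: ...

    @classmethod
    def create(cls, language: str = "en") -> "BaseTranslator":
        # Factory: instantiate a translator and set its language.
        translator = cls()
        translator.set_language(language)
        return translator

class DefaultTranslator(BaseTranslator):
    """Dictionary-based English-French lookup."""

    def set_language(self, language: str) -> None:
        self.language = language
        # Illustrative sample message; the real dictionary is loaded from a file.
        self.messages = {"Waiting for status...": "En attente du statut..."}

    def translate(self, text: str) -> str:
        if self.language == "en":
            return text
        return self.messages.get(text, text)  # fall back to the original text

translator = DefaultTranslator.create("fr")
print(translator.translate("Waiting for status..."))  # En attente du statut...
```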
The i18n translators are an experimental feature. These translators are based on ML language models and are capable of translating arbitrary text, so they can be used when the tool's text is modified or a new language is added. The downside is that using them requires installing additional dependencies, and when the app starts they need to download the model (NLP is 250MB, NLLB is 2GB). They inherit from the default translator, so they still use the dictionary lookup when a match can be found; if not, they fall back to the pre-trained language model. This makes page transitions noticeably longer (for new text not covered by the default message dictionary).
ML translators are based on pre-trained machine translation models. Some of the models considered for the tool are:
- Hugging Face provides a range of pre-trained machine translation models accessible via the transformers library.
- Helsinki-NLP/opus-mt-en-fr (NLP) can be used for English to French translation.
- Facebook's NLLB model supports more than two hundred languages.
To add a new language to the tool you first need to list that language and locale in the language dictionary.
Then you need to do one of the following:
- Add a message translation file for the new language in the src/deepfacility/lang/messages dir, or
- Enable the automated translation, as described in the next section.
To generate the message translation file you can:
- Copy the existing fr.json file and name it to match the locale string listed in the language dictionary.
- Translate the messages to the new language manually or using one of the AI assistants.
You can enable automated translation by installing the tool with the `i18n` extra dependencies:

```bash
# Install the tool with gettext multi-language support.
pip install -e .[i18n]
```
Once the `i18n` extra dependencies are installed, the tool will use the ML translators by default.
Note that the `i18n` extra dependencies include the `torch` package, which is required for the ML translators.
For more information about using PyTorch see the documentation.
Logging is implemented using the Python `logging` module. The default level is set to `INFO`.
Because the UX scenario can run multiple user sessions simultaneously, the code does not use a single global logger. Instead, separate loggers are instantiated for each workflow execution. This is facilitated by the `Operation` class in the `deepfacility.config` module.
All logs are displayed in the console. Workflow logs are also stored in log files whose paths match the workflow, per the configuration. In the UX scenario, background tasks running workflows stream logs from those log files into the UI.
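A per-workflow logger can be sketched as follows (the helper name is illustrative; in the tool this setup is handled by the `Operation` class):

```python
import logging
import tempfile
from pathlib import Path

def make_workflow_logger(name: str, log_file: Path) -> logging.Logger:
    """Create an isolated logger that writes to the workflow's log file."""
    logger = logging.getLogger(name)  # a unique name per workflow execution
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_file)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

# Demo: each workflow run gets its own log file.
log_file = Path(tempfile.mkdtemp()) / "prep.log"
logger = make_workflow_logger("prep-demo", log_file)
logger.info("data preparation started")
logger.handlers[0].flush()
```

Using a distinct logger name per execution keeps concurrent sessions from writing into each other's log files.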
Parallelization in this tool is implemented using the `concurrent.futures` module which provides a high-level interface for asynchronous execution using either `ThreadPoolExecutor` (on Windows and Mac) or `ProcessPoolExecutor` (on Linux).
The scientific workflow functions which implement asynchronous execution are (both in the flows.py module):
- `cluster_households`: submits the `cluster_houses_by_villages_centers` function for each location
- `outline_and_place`: submits the `outline_clusters_and_place_facilities` function for each location
In both cases, each `future` is given a callback closure `process_future` which:
- receives the `future` object after completion,
- retrieves the resulting `ClusteredHouseholds` object from the future,
- stores it in the result dictionary, and
- reports the progress using the `util.report_progress` function.
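The submit-with-callback pattern can be sketched as follows (the location values and the worker function are illustrative stand-ins for the per-location clustering work):

```python
from concurrent.futures import Future, ThreadPoolExecutor

def cluster_location(location: str) -> str:
    # Stand-in for the per-location clustering work.
    return location.upper()

results: dict[str, str] = {}

def process_future(location: str, future: Future) -> None:
    """Callback: store the result and report progress."""
    results[location] = future.result()
    print(f"done: {location}")  # the tool calls util.report_progress here

locations = ["Tapoa:Diapaga", "Bale:Boromo"]
with ThreadPoolExecutor() as executor:
    for loc in locations:
        future = executor.submit(cluster_location, loc)
        # Bind loc via a default argument so each callback keeps its own location.
        future.add_done_callback(lambda f, loc=loc: process_future(loc, f))

print(sorted(results))  # all locations processed
```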
Data caching is set up using a `joblib.Memory` object which is used to cache the results of the data preparation and scientific workflow functions. The `joblib.Memory` object is configured to store cached data in a `cache` directory in the app root (`app-data/cache`) so that the cache can be shared between sessions.
The criteria for selecting functions for caching are that they:
- are time-consuming,
- are called multiple times with the same arguments,
- have arguments that are hashable in a way that uniquely identifies the result, and
- return pickleable objects which contain data, like a dataframe.
To locate cached functions in the code, search for the `@memory.cache` decorator.
To invalidate function's cache (e.g., due to a code change), delete the entire cache directory, or a subdir corresponding to that function.
You can also use the `deepfacility reset` command, which will remove the cache directory.
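The caching setup can be sketched as follows (the cache location and function are illustrative; the real decorated functions live in the data, tasks, and utils modules, and the real cache lives under `app-data/cache`):

```python
import tempfile
from joblib import Memory

# Illustrative cache location; the tool uses app-data/cache in the app root.
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def expensive_stats(values: tuple[float, ...]) -> float:
    # Hashable arguments (a tuple) uniquely identify the cached result.
    return sum(values) / len(values)

first = expensive_stats((1.0, 2.0, 3.0))   # computed and written to disk
second = expensive_stats((1.0, 2.0, 3.0))  # served from the cache
assert first == second == 2.0
```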
The UX provided in the `deepfacility` package was designed for demo purposes and should ONLY be used on a local machine.
This demo web application is not ready to be used as a hosted web application and is missing important security and reliability features a hosted web application must have.
That said, the demo web app UI can be a good starting point for developing a fully fledged web application.