This document describes the main design concepts and implementation details of the `deepfacility` tool.
- Environment Variables
- Configuration
- Location Format
- UX Sessions
- UX Background Tasks
- Multi-Language Support
- Logging
- Parallelization
- Data Caching
- Other Considerations and Limitations
The following environment variables are used in the tool:
| Environment Variable | Default | Description |
|---|---|---|
| `DEEPFACILITY_ROOT_DIR` | `app-data` | The root directory of the app. |
| `DEEPFACILITY_LANG_MODEL` | NLP | Sets the translation ML model to use (the default is `Helsinki-NLP/opus-mt-en-fr`). |
| `DEEPFACILITY_HOST` | `localhost` | Demo web app host name. |
| `DEEPFACILITY_PORT` | `8000` | Demo web app port. |
| `DEEPFACILITY_SID` | None | Sets the session id for the CLI scenario. |
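As a hypothetical sketch (the actual code may read these variables differently), the settings can be read with `os.environ`, falling back to the documented defaults:

```python
import os

# Read tool settings from the environment, using the documented defaults.
root_dir = os.environ.get("DEEPFACILITY_ROOT_DIR", "app-data")
host = os.environ.get("DEEPFACILITY_HOST", "localhost")
port = int(os.environ.get("DEEPFACILITY_PORT", "8000"))
sid = os.environ.get("DEEPFACILITY_SID")  # None unless set for the CLI scenario

print(root_dir, host, port)
```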
The configuration consists of user and system configuration TOML files. The user configuration focuses on files a user needs to provide and describes the required column names. The system configuration describes workflow input and results files.
During initialization, the following variables are replaced with values collected from users or determined based on user inputs:
- `{app_dir}`: App root directory containing the `cache`, `data`, and `downloads` directories and the config file. The default is `app-data`, positioned relative to the working directory (for example, the repo dir). It can be customized using the `DEEPFACILITY_ROOT_DIR` environment variable.
- `{data_dir}`: Data directory containing country directories with input and results files.
- `{country_code}`: Country code determined based on the village centers coordinates.
- `{level}`: Admin level determined based on the village centers coordinates.
The following variables are replaced with values determined at runtime:
- `{run_name}`: A unique run name based on the selected locations.
- `{location}`: The location iterator value.
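The placeholder substitution can be sketched as follows (a minimal illustration using `str.format`; the function name and path template are hypothetical, not the actual `Config` code):

```python
def resolve_path(template: str, **variables: str) -> str:
    """Replace {name}-style placeholders in a configured path template."""
    return template.format(**variables)

# Illustrative template; real templates come from the configuration files.
path = resolve_path(
    "{app_dir}/data/{country_code}/inputs/households.csv",
    app_dir="app-data",
    country_code="BFA",
)
print(path)  # app-data/data/BFA/inputs/households.csv
```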
The configuration initialization is done in the `deepfacility.config` module. The `Config` class is used to load, populate, and merge the user and system configuration files.
In the UX scenario, users are not directly exposed to configuration files. Instead, the session object, from the `deepfacility.ux` module, generates the configuration based on user inputs and uploaded files.
In the CLI scenario, the user can create a configuration file using the `config` command. Then the user can populate the configuration file to match the files and columns in the data.
Methods of `WorkflowEntity` subclasses have access to the config through the `cfg` field. Other functions receive the configuration object as an argument.
The configuration contains most of the information needed for the tool to run. This means that most functions could receive the configuration object as the only argument. However, this approach would make the code less modular and harder to test. Therefore, the general principle is for functions to take input values as explicit arguments and use parameters and file paths from the configuration object to generate outputs. This allows the code to be tested independently of the configuration object.
The configuration file is a TOML file which contains sections for user and system configurations:
- `args`: basic parameters like country, data directory, and thresholds, plus file paths and column names for user-provided files.
- `downloads`: URLs for downloading external data, directories for storing downloaded files, and coordinates column names.
- `inputs`: file paths and column names for input files generated by the data preparation workflow.
- `results`: file paths and column names for output files generated by the scientific workflow.
For more detail, see the configuration templates, which contain detailed descriptions of all configuration parameters.
The configuration file facilitates the directory structure of the app. A typical structure of the root `app-data` dir used by the web app looks like this:
```
app-data
│
├── downloads                        # all downloads for all countries
│   ├── GADM_shapes                  # all GADM shapes for all countries
│   │   └── gadm41_BFA_shp.zip
│   └── google_buildings             # all Google Open Buildings for all countries
│       ├── 0e3_buildings.csv.gz
│       ├── 0e5_buildings.csv.gz
│       ├── 0fb_buildings.csv.gz
│       ├── 0fd_buildings.csv.gz
│       └── 11d_buildings.csv.gz
│
├── 9593161cb53f                     # session id
│   ├── config.village_centers.toml  # generated user and system configuration file
│   └── data                         # all data for all countries
│       └── BFA                      # all data for a specific country
│           ├── args                 # args: user provided data
│           │   ├── health_facilities.csv
│           │   └── locality_villages.csv
│           │
│           ├── inputs               # inputs: data generated by the data preparation workflow
│           │   ├── all_locations.csv            # all available admin locations
│           │   ├── baseline_facilities.csv      # baseline health facilities with matched admin names
│           │   ├── baseline_facilities.geojson  # baseline health facilities points for visualization
│           │   ├── buildings_BFA.feather        # Google Open Buildings clipped for BFA
│           │   ├── households.csv               # households coordinates with matched admin names
│           │   ├── households.stats.csv         # households stats
│           │   ├── prep.log                     # data preparation workflow log
│           │   ├── shapes                       # GADM shapes
│           │   ├── village_centers.csv          # village centers with matched admin names
│           │   └── village_centers.geojson      # village centers points for visualization
│           │
│           └── results              # results: data generated by the scientific workflow
│               └── Bale-Boromo_3_41353ad        # results for 3 communes (Bale-Boromo and two others)
│                   ├── cluster_centers.csv      # cluster centers
│                   ├── cluster_counts.csv       # cluster counts
│                   ├── cluster_stats.csv        # cluster stats
│                   ├── clustered_households.csv # clustered households
│                   ├── locations.csv            # locations
│                   ├── optimal_facilities.csv   # optimal health facilities
│                   ├── population_coverage_optimal.png   # optimal population coverage plot
│                   ├── population_coverage_baseline.png  # existing population coverage plot
│                   ├── run.log                  # scientific workflow log
│                   ├── village_shapes.geojson   # village shapes for visualization
│                   └── www                      # interactive visualization map
│
└── cache                            # cache directory managed by the system (joblib.Memory python package)
    └── deepfacility                 # cache directory for the deepfacility package
        ├── data                     # data preparation functions cache
        ├── tasks                    # scientific workflow functions cache
        └── utils                    # utility functions cache
```
As described in the main readme, a location is an administrative area where the clustering is performed. Location values are constructed from colon-separated names of administrative levels.
The default configuration for Burkina Faso specifies the following location formats:
- Communes: `{province}:{commune}` (e.g., `Tapoa:Diapaga`)
  - Data from GADM shapes (columns NAME2, NAME3)
- Villages: `{province}:{commune}:{village}` (e.g., `Tapoa:Diapaga:Mangou`)
  - Data from the user-provided village centers file (custom column)
Location columns are specified in the configuration:
- Commune column: system configuration > `[inputs.shapes].adm_cols`
- Village column: user configuration > `[args.village_centers].adm_cols`
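Constructing the colon-separated location values from admin-level names can be sketched as follows (the helper name is illustrative, not the actual code):

```python
def make_location(*admin_names: str) -> str:
    """Join administrative level names into a colon-separated location value."""
    return ":".join(admin_names)

commune = make_location("Tapoa", "Diapaga")
village = make_location("Tapoa", "Diapaga", "Mangou")
print(commune)  # Tapoa:Diapaga
print(village)  # Tapoa:Diapaga:Mangou
```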
The main purpose of the session object is to preserve references to the config and translator objects between requests and to facilitate the execution of background tasks.
UX session support is implemented in the `deepfacility.ux.session` module, with `deepfacility.ux.session.init` being the function which handles most of the session management: creation and retrieval.
The session object is created when the user starts the web app. At that time a session id is generated and used to store the reference to a session object in the session dictionary. The session dictionary is created in the FastAPI state object and preserved until the app is stopped.
The session id is also stored as a long-lived cookie in the user's browser. The session id is used to retrieve the session object from the session dictionary when the user makes a new request.
If the session id cookie is lost, the user would have to start a new session. This would also mean losing all the previously generated files. To mitigate that, a user has a few options:
- Set the session id in the URL: `http://localhost:8000?sid=1234`
- Set the session id in the `DEEPFACILITY_SID` environment variable before running the `deepfacility ux` command:

  ```bash
  # Set the session id with the environment variable
  export DEEPFACILITY_SID=1234
  deepfacility ux
  ```

- Use the CLI scenario where the session id is set as a command line argument:

  ```bash
  # Hardcode the session id for all sessions
  deepfacility ux --sid 1234
  ```
Background tasks are run using the FastAPI `BackgroundTasks` object to asynchronously execute the data preparation and scientific workflow commands in the background.
The background task execution is abstracted using the `config.Operation` and `ux.Session` classes.
The `Operation` class fields are used to control workflow function execution:
- `control_file` — a file used to signal the workflow function running in the background task to stop.
- `log_file` — points to a log file where workflow function logs are stored (and from which the logs are streamed to the UI).
- `logger` — a logger object used to log messages to the log file.
`Operation` is an abstract class inherited by the `Inputs` and `Results` config classes, which are used by the data preparation and scientific workflow functions.
The `ux.Session` class has a private member `_operation` which points to a workflow operation object when a user triggers execution. It also has `start_task` and `stop_task` methods meant to facilitate the execution of background tasks.
This section describes how workflow execution progress is monitored in the UI.
The steps below are for the data preparation workflow. The scientific workflow monitoring uses the same approach.
- User initiates data preparation:
  - The user clicks the "Prepare Input Data" button on the data prep page 30-prep.html.
  - This triggers the `ux.main.prep` function which:
    - Starts the data preparation background tasks.
    - Renders the status monitoring page 30-prep-status-container.html.
- Status monitoring:
  - The status monitoring page automatically sends a `/prep/status` request every 5 seconds.
    - This is specified with the htmx `hx-trigger` attribute as shown in the example below.
  - The `ux.main.prep_status` function on the server:
    - Checks the status of input files and background tasks.
    - Renders a status update response based on the progress.
  - The response is displayed in the `status` div in the page 30-prep-status.html.

    ```html
    <div id='container' hx-get='/prep/status' hx-target='#status' hx-swap='innerHTML' hx-trigger="load, every 5s">
    </div>
    <div id='status' style="overflow-y: auto">
        <p>Waiting for status...</p>
    </div>
    ```

- Success:
  - If all files are ready, the server prompts the browser to refresh.
    - The page refresh triggers the top-level htmx div element which sends the `/info` and `/driver` requests.
  - The `/driver` request handler renders 40-run.html, allowing execution of the scientific workflow.
- Failure:
  - If files are not ready, the `/driver` request handler:
    - Deletes the config file.
    - Returns a response to clear the `download` div.
  - The `/info` request handler renders the upload page again 10-upload.html.
    - This prompts the user to reconfigure input files and do the data prep again.
This section describes:
- the existing multi-language support design and implementation, and how to extend it
- how to enable automated translation to support additional languages or UI text updates
The default multi-language support is implemented using a simple dictionary containing English-French translations for all the text currently used in the tool. This approach has the advantage of being simple and fast. The main disadvantage is that it requires manual translation of all the text used in the tool and maintaining translations as the text is updated.
To address the limitations of the default multi-language support, the tool also implements multi-language support using pre-trained language models. This approach is more flexible and can be used to translate arbitrary text. In other words, it can be used to support additional languages or to support new text in the tool.
Both multi-language support approaches are implemented in the `deepfacility.lang` module by inheriting the `BaseTranslator` abstract class, which defines the translation interface. The interface consists of two functions:
- `set_language` — sets the language to be used for translation and performs all necessary initializations.
- `translate` — translates the input string to the current language.
The `BaseTranslator` abstract class also implements the factory class method `create` which instantiates a translator and sets the language. It takes the language and the request as arguments and, if the language is not explicitly set, extracts it from the request headers. It then calls `set_language` and returns the translator.
The `DefaultTranslator` implements the translation interface using a simple dictionary of English-French translations for all the text currently used in the tool. Its `set_language` method loads the translation dictionary and sets the language string. Translation is a simple English message lookup in the `messages` dictionary. The `DefaultTranslator` is simple and fast, and it fully supports the current text used in the tool.
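The interface and the dictionary-based translator can be sketched as follows (a simplified illustration; the actual classes in `deepfacility.lang` differ in detail, e.g. the real `create` can also extract the language from request headers, and the real dictionary covers all UI text):

```python
from abc import ABC, abstractmethod

class BaseTranslator(ABC):
    """Defines the translation interface."""

    @abstractmethod
    def set_language(self, language: str) -> None: ...

    @abstractmethod
    def translate(self, text: str) -> str: ...

    @classmethod
    def create(cls, language: str = "en") -> "BaseTranslator":
        # Factory: instantiate a translator and set its language.
        translator = cls()
        translator.set_language(language)
        return translator

class DefaultTranslator(BaseTranslator):
    """Dictionary-based English-French lookup."""

    def set_language(self, language: str) -> None:
        self.language = language
        # Illustrative sample message; the real dictionary is loaded from a file.
        self.messages = {"Waiting for status...": "En attente du statut..."}

    def translate(self, text: str) -> str:
        if self.language == "en":
            return text
        return self.messages.get(text, text)  # fall back to the original text

translator = DefaultTranslator.create("fr")
print(translator.translate("Waiting for status..."))  # En attente du statut...
```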
The i18n translators are an experimental feature. These translators are based on ML language models and are capable of translating arbitrary text, so they can be used when the tool's text is modified or a new language is added. The downside is that using them requires installing additional dependencies, and when the app starts they need to download the model (NLP is 250MB, NLLB is 2GB). They inherit from the default translator, so they still use the dictionary lookup when a match can be found; if not, they fall back to the pre-trained language model. This makes page transitions noticeably longer (for new text not covered by the default message dictionary).
ML translators are based on pre-trained machine translation models. Some of the models considered for the tool are:
- Hugging Face provides a range of pre-trained machine translation models accessible via the transformers library.
- Helsinki-NLP/opus-mt-en-fr (NLP) can be used for English to French translation.
- Facebook's NLLB model supports more than two hundred languages.
To add a new language to the tool you first need to list that language and locale in the language dictionary.
Then you need to do one of the following:
- Add a message translation file for the new language in the src/deepfacility/lang/messages dir, or
- Enable the automated translation, as described in the next section.
To generate the message translation file you can:
- Copy the existing fr.json file and name it to match the locale string listed in the language dictionary.
- Translate the messages to the new language manually or using one of the AI assistants.
You can enable automated translation by installing the tool with the `i18n` extra dependencies:

```bash
# Install the tool with gettext multi-language support.
pip install -e .[i18n]
```
Once the `i18n` extra dependencies are installed, the tool will use the ML translators by default.
Note that the `i18n` extra dependencies include the `torch` package, which is required for the ML translators.
For more information about using PyTorch see the documentation.
Logging is implemented using the Python `logging` module. The default level is set to `INFO`.
Because the UX scenario can run multiple user sessions simultaneously, the code does not use a single global logger. Instead, separate loggers are instantiated for each workflow execution. This is facilitated by the `Operation` class in the `deepfacility.config` module.
All logs are displayed in the console. Workflow logs are also stored in log files whose paths match the workflow, per the configuration. In the UX scenario, background tasks running workflows stream logs from those log files into the UI.
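A per-workflow logger can be sketched as follows (the helper name is illustrative; in the tool this setup is handled by the `Operation` class):

```python
import logging
import tempfile
from pathlib import Path

def make_workflow_logger(name: str, log_file: Path) -> logging.Logger:
    """Create an isolated logger that writes to the workflow's log file."""
    logger = logging.getLogger(name)  # a unique name per workflow execution
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_file)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

# Demo: each workflow run gets its own log file.
log_file = Path(tempfile.mkdtemp()) / "prep.log"
logger = make_workflow_logger("prep-demo", log_file)
logger.info("data preparation started")
logger.handlers[0].flush()
```

Using a distinct logger name per execution keeps concurrent sessions from writing into each other's log files.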
Parallelization in this tool is implemented using the `concurrent.futures` module which provides a high-level interface for asynchronous execution using either `ThreadPoolExecutor` (on Windows and Mac) or `ProcessPoolExecutor` (on Linux).
The scientific workflow functions which implement asynchronous execution are (both in the flows.py module):
- `cluster_households`: submits the `cluster_houses_by_villages_centers` function for each location
- `outline_and_place`: submits the `outline_clusters_and_place_facilities` function for each location
In both cases, each `future` is given a callback closure `process_future` which:
- receives the `future` object after completion,
- retrieves the resulting `ClusteredHouseholds` object from the future,
- stores it in the result dictionary, and
- reports the progress using the `util.report_progress` function.
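The submit-with-callback pattern can be sketched as follows (the location values and the worker function are illustrative stand-ins for the per-location clustering work):

```python
from concurrent.futures import Future, ThreadPoolExecutor

def cluster_location(location: str) -> str:
    # Stand-in for the per-location clustering work.
    return location.upper()

results: dict[str, str] = {}

def process_future(location: str, future: Future) -> None:
    """Callback: store the result and report progress."""
    results[location] = future.result()
    print(f"done: {location}")  # the tool calls util.report_progress here

locations = ["Tapoa:Diapaga", "Bale:Boromo"]
with ThreadPoolExecutor() as executor:
    for loc in locations:
        future = executor.submit(cluster_location, loc)
        # Bind loc via a default argument so each callback keeps its own location.
        future.add_done_callback(lambda f, loc=loc: process_future(loc, f))

print(sorted(results))  # all locations processed
```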
Data caching is set up using a `joblib.Memory` object which is used to cache the results of the data preparation and scientific workflow functions. The `joblib.Memory` object is configured to store cached data in a `cache` directory in the app root (`app-data/cache`) so that the cache can be shared between sessions.
The criteria for selecting functions for caching are that they:
- are time-consuming,
- are called multiple times with the same arguments,
- have arguments that are hashable in a way that uniquely identifies the result, and
- return pickleable objects which contain data, like a dataframe.
To locate cached functions in the code, search for the `@memory.cache` decorator.
To invalidate function's cache (e.g., due to a code change), delete the entire cache directory, or a subdir corresponding to that function.
You can also use the `deepfacility reset` command, which will remove the cache directory.
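The caching setup can be sketched as follows (the cache location and function are illustrative; the real decorated functions live in the data, tasks, and utils modules, and the real cache lives under `app-data/cache`):

```python
import tempfile
from joblib import Memory

# Illustrative cache location; the tool uses app-data/cache in the app root.
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def expensive_stats(values: tuple[float, ...]) -> float:
    # Hashable arguments (a tuple) uniquely identify the cached result.
    return sum(values) / len(values)

first = expensive_stats((1.0, 2.0, 3.0))   # computed and written to disk
second = expensive_stats((1.0, 2.0, 3.0))  # served from the cache
assert first == second == 2.0
```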
The UX provided in the `deepfacility` package was designed for demo purposes and should ONLY be used on a local machine.
This demo web application is not ready to be used as a hosted web application and is missing important security and reliability features a hosted web application must have.
That said, the demo web app UI can be a good starting point for developing a fully fledged web application.