3. Methodology
This section describes the three experiments conducted to create Project Jetson and improve predictions of the movements of both internally displaced people (IDPs) inside Somalia and refugees crossing the border into the Dollo Ado region (southern Ethiopia).
The overall goal of the experiments was to use machine learning (ML) for multivariate time series analysis (TSA), combined with other techniques from data science and statistics, to build a predictive analytics project, as opposed to relying on conventional statistical forecasting techniques for modeling.
We call them experiments because we combined different "ingredients" (variables) and time lags, and used different applications for modeling. We learned from and iterated on the experiments as trends in the data became clearer and new data entered the system. The experiments are the following:
- Experiment #1 (mid-2017 to 2018): predicting forced displacement one month in advance. First experiment, using off-the-shelf software for multivariate TSA, combined with open-source scripts for ML (R and Python).
- Experiment #2 (2018 onwards): predicting forced displacement one month and three months in advance. Second experiment, using fully open-source scripting for multivariate TSA with ML (R and Python).
- Experiment #3 (planned, 2019): predicting forced displacement three months in advance, with room to extend to more months, using fully open-source scripting for ML together with a gravity model.
If we were to write a simple “recipe” with all the “ingredients” required to undertake this predictive analytics project, regardless of the experiment (see the sections below), the recipe would be the following, mirroring a traditional data science process:
- Data (Input Data Section)
- Initial Data Exploration (Input Data Section)
- Modelling Applications
- Open-source scripting knowledge
- Server and strong computing power
- Technical Capacity (a team)
We visualize the results of the process in both a map and graphs of the models, which we call the Jetson engine. This engine portrays the historical predictions up to April 2019.
Data and initial data exploration: these are covered in the Input Data section, which explores the problem setup and the data pre-processing, including the handling of missing values and the generation of new variables. It also includes some initial exploration, which is developed further in the Experiment #2 visualizations.
Modelling application(s): this is a flexible component in Jetson, and it varies depending on data protection requirements. We have tested open-source applications, meaning that we have built upon existing research on predictive analytics with time series analysis (TSA) in R and Python. We have also tested, or requested a demo of, some off-the-shelf options, including licensed software and other commercial applications for modeling purposes.
From a rapid comparison of applications for this type of work, these are the five essential tech specs a modeling application needs to have for TSA predictive analytics work:
- The application needs to support integration with tabular applications (e.g. Excel/Google Sheets) and ideally needs to have a Python/JS/R API;
- It has to be able to conduct multivariate time series forecasting, defining time lags and windows of time in order to perform machine learning (ML) on the past and project future values. The machine should be able to see the dependency between the independent variables (x) and the target variable (y), but also the inter-dependency among the independent variables (x1, x2, x3…). This is one of the main concepts of dynamic modeling. See, for example, a paper on data-driven dynamic modeling for prediction with time series analysis;
- It needs to be able to run predictions both locally (for data security/protection purposes or small-scale testing) and in the cloud;
- It contains fairly good interpretability elements to understand the machine's calculations (and avoid the A.I. black-box problem); and
- Ideally, it has a ‘feedback loop’ element, meaning that as new data comes in, the machine incorporates the new data points into future predictions, or makes anomaly detection easier.
For this reason, each experiment used a different combination of applications to obtain at least four of the five required functionalities. If you know of an application that is open-source software (e.g. Python/R), or your team is building one, that contains all these tech specs together, please let us know at [email protected]. We have explored at least 12 modeling applications, both off-the-shelf and open-source, and the majority of them featured only 3 or 4 of the 5 tech specs needed.
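To make the "time lags and windows" requirement from the list above more concrete, here is a minimal sketch in plain Python of how aligned monthly series could be turned into lagged training rows. It is an illustration under stated assumptions, not the Jetson implementation: the function name `make_lagged_rows`, the variable names (`arrivals`, `rainfall`) and the numbers are all hypothetical.

```python
from typing import Dict, List, Tuple


def make_lagged_rows(
    series: Dict[str, List[float]],
    target: str,
    lags: List[int],
    horizon: int,
) -> List[Tuple[Dict[str, float], float]]:
    """Turn aligned monthly series into (features, label) pairs.

    For every month t, the features are the values of each variable at
    t - lag (for each lag), and the label is the target `horizon` months
    after t. Lagged values of the target itself are included, so a model
    can pick up dependencies among all variables, not only x -> y.
    """
    n_obs = len(series[target])
    if any(len(values) != n_obs for values in series.values()):
        raise ValueError("all series must cover the same months")

    max_lag = max(lags)
    rows = []
    # Start once the longest lag is available; stop while the label exists.
    for t in range(max_lag, n_obs - horizon):
        features = {
            f"{name}_lag{lag}": series[name][t - lag]
            for name in sorted(series)
            for lag in lags
        }
        rows.append((features, series[target][t + horizon]))
    return rows


# Hypothetical, made-up monthly values for illustration only.
toy_series = {
    "arrivals": [10.0, 20.0, 30.0, 40.0, 50.0],
    "rainfall": [1.0, 2.0, 3.0, 4.0, 5.0],
}
toy_rows = make_lagged_rows(toy_series, target="arrivals", lags=[0, 1], horizon=1)
for features, label in toy_rows:
    print(features, "->", label)
```

Changing `horizon` from 1 to 3 would produce rows for a three-month-ahead target, at the cost of fewer usable rows at the end of the series.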
To know more about the modeling applications tested and used, please refer to sections 3a. Experiment #1 and 3b. Experiment #2.
Open-source scripting knowledge: for experiments #1 and #2 we needed to build parsers and transformation scripts to automatically collect some of the data sources, as well as to convert some regression functions into predictions (experiment #1 only, given the output of the modeling application we used). These parsers pull data from certain websites, collate it (originally into a repository, and now directly into the data visualizations) and push it to the public website. We also built an additional application (R Shiny) with visualizations of performance metrics (e.g. heatmap, graphs) for selecting the best-performing models. Finally, the data visualization elements: 1) the dashboard is also based on R Shiny and 2) the map is based on JavaScript.
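As an illustration of what "converting regression functions into predictions" could look like, the sketch below assumes the modeling application exports a fitted function as a plain arithmetic formula in text form. That export format, the function name `evaluate_formula`, the example formula and the column names are all assumptions made for this example, not a description of the actual Jetson scripts. The formula is parsed with Python's standard `ast` module and only basic arithmetic is allowed, rather than calling `eval` on arbitrary text.

```python
import ast
import operator
from typing import Dict

# Arithmetic operators the sketch is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def evaluate_formula(formula: str, row: Dict[str, float]) -> float:
    """Evaluate an arithmetic formula using the variable values in `row`."""

    def visit(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.Name):
            return float(row[node.id])  # KeyError if the variable is missing
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        raise ValueError(f"unsupported element in formula: {ast.dump(node)}")

    return visit(ast.parse(formula, mode="eval"))


# Hypothetical formula and input values, for illustration only.
example_formula = "100 + 2 * rainfall_lag1 - 0.5 * price"
example_row = {"rainfall_lag1": 10.0, "price": 40.0}
print(evaluate_formula(example_formula, example_row))
```

Applying the same function to every month of input data would yield a series of predictions that can then be passed to the visualizations.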
Server and strong computing power: depending on their features, some modeling applications integrate with cloud-based instances or can connect to virtual machines, which helps run models more efficiently. However, if the data is sensitive, it is recommended to run locally or on-premises. For this reason, we needed computers with at least a 4-core processor and ample RAM. Any computer designed for rendering, design or video games will be enough. To publish applications (e.g. the website and its domain, the Shiny app, automation), it is recommended to have a server or a virtual machine that can host all the elements for public consumption.
Technical capacity (a team): last but not least, it is imperative to have skilled staff (e.g. computer scientists, data scientists, information systems engineers, artificial intelligence engineers) to maintain the project. To create the two experiments, we needed a technical team working on both the website and the applications. Our technical team is composed of the following people: a) a UX/UI designer, b) 2 data scientists and c) 1 artificial intelligence engineer. For maintenance, it is recommended that one or two people with good knowledge of data science and artificial intelligence regularly maintain the systems.
We recommend reading each experiment's methodology in the experiment sections to understand the steps for training/testing, evaluation, cross-validation and model development.
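For readers unfamiliar with training/test splits on time series, the sketch below shows one common approach, walk-forward (expanding window) evaluation, in which a model only ever sees months that precede the month it predicts. It is a generic illustration and not necessarily the procedure used in the experiments. The naive "repeat the last observed value" forecast stands in for a real model, and the function names and numbers are hypothetical.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_obs: int, min_train: int, horizon: int
) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_indices, test_index) pairs in time order.

    The training window always starts at the first observation and grows by
    one month per split; the test month lies `horizon` months after the
    last training month.
    """
    for end in range(min_train, n_obs - horizon + 1):
        yield list(range(end)), end + horizon - 1


def mean_absolute_error(actual: List[float], predicted: List[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def evaluate_naive_forecast(series: List[float], min_train: int, horizon: int) -> float:
    """Score a naive forecast (repeat the last training value) on each split."""
    actual, predicted = [], []
    for train_idx, test_idx in walk_forward_splits(len(series), min_train, horizon):
        predicted.append(series[train_idx[-1]])  # placeholder for a fitted model
        actual.append(series[test_idx])
    return mean_absolute_error(actual, predicted)


# Hypothetical monthly values, for illustration only.
toy_monthly_values = [10.0, 20.0, 30.0, 40.0, 50.0]
print(evaluate_naive_forecast(toy_monthly_values, min_train=3, horizon=1))
```

Because each split retrains on all months available so far, the same loop also reflects the "feedback loop" idea of folding new data points into later predictions.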