2. Input Data


Data sources

The variables selected for this experiment represent some of the most important influencing factors for forced displacement in the operational context of Somalia. They were selected according to different assumptions that we later needed to validate. For selecting and validating the variables, we used several innovation methodologies, such as principles of human-centered design.

The influencing factors or variables were:

  • Originally assumed through an initial literature review of the operational context.
  • Proposed by UNHCR staff and/or the field operation, with extensive field experience in that particular region.
  • Directly suggested by persons of concern themselves, particularly refugees.

Independent Variables

According to these sources, the most influential independent (x) variables - and therefore the datasets collected - for understanding forced displacement and the push-pull factors of population movement in Somalia are:

Violent conflict: defined as per the ACLED codebook; ACLED is one of the main violent conflict data sources in the region. We included all violent events and demonstrations, in other words we excluded all non-violent actions (event type: strategic developments). We used two main variables from this data source: the number of violent incidents per month per region and the number of fatalities (deaths) per month per region. ACLED is the only data source for Jetson with a public API.
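
As an illustration of this aggregation step, here is a minimal sketch in pandas. It assumes a hypothetical CSV export of ACLED events with event_date, admin1, event_type and fatalities columns; the actual export and column names should be checked against the ACLED codebook.

```python
import pandas as pd

# Load an ACLED export for Somalia (hypothetical file name; column names
# should be checked against the ACLED codebook).
events = pd.read_csv("acled_somalia.csv", parse_dates=["event_date"])

# Keep violent events and demonstrations, i.e. drop non-violent actions
# (the "Strategic developments" event type).
violent = events[events["event_type"] != "Strategic developments"]

# Aggregate to the Jetson input format: one row per region (admin1) per month,
# with the count of incidents and the sum of reported fatalities.
monthly = (
    violent
    .groupby(["admin1", pd.Grouper(key="event_date", freq="MS")])
    .agg(incidents=("event_type", "size"), fatalities=("fatalities", "sum"))
    .reset_index()
)
```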

Climate & weather anomalies: climate and weather predictive analytics is a rigorous science with many meteorology- and environment-based methods. To keep the experiment as simple as possible, we analysed two main variables as proxies for climate and weather anomalies, since they are among the main anomalies that disrupt livelihoods in the region:

  • Rain patterns: this dataset's sharing arrangements have evolved over time. The data source is the FAO SWALIM team. It was originally publicly available on this site, but was later shared with us via email (Excel tables with rain gauge sensor data). It is now compiled in the FAO FSNAU dashboard (see note below). A sketch of rolling the daily gauge readings up to months follows this list.
  • River levels: data publicly available from FAO SWALIM Flood Risk and Response Information Management; we originally built a parser to extract the data from the river level graphs.
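
As referenced in the rain patterns bullet above, here is a minimal sketch of rolling the daily gauge readings up to the monthly format used by Jetson. It assumes a hypothetical Excel file with one sheet per gauge station and date / rainfall_mm columns; the actual SWALIM tables shared by email may be structured differently.

```python
import pandas as pd

# Hypothetical layout: one sheet per gauge station, each with "date" and
# "rainfall_mm" columns.
sheets = pd.read_excel("swalim_rain_gauges.xlsx", sheet_name=None,
                       parse_dates=["date"])

monthly_frames = []
for station, df in sheets.items():
    monthly = (
        df.set_index("date")["rainfall_mm"]
          .resample("MS")       # calendar-month buckets
          .sum(min_count=1)     # keep NaN when a month has no readings at all
          .rename("rainfall_mm")
          .reset_index()
    )
    monthly["station"] = station
    monthly_frames.append(monthly)

monthly_rain = pd.concat(monthly_frames, ignore_index=True)
```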

Note: recently, the FAO SWALIM and FSNAU teams joined forces and collated the climate, weather and market datasets in a single dashboard (registration required): FAO FSNAU Early Warning/Early Action in Somalia, leapfrogging our automation work to build a more user-friendly product.

Market prices: these were suggested for inclusion in this experiment by refugees themselves, via key informant interviews. They highlighted the importance of two commodities for their livelihoods: water drum prices and [local] goat market prices, the latter being a proxy for movement. Refugees stated that goats are sensitive to extreme weather conditions; prior to fleeing, they sell them to avoid the animals dying on the road due to lack of water or pasture to feed on. This creates a ripple effect on the local supply and demand for the commodity. The market prices are collected by the FAO FSNAU Integrated Database System team.

We also originally considered the following independent variables, but in the end did not include them, for the following reasons:

  • Remittances data, as a pull factor: this dataset is privately owned and commercially sensitive for competitiveness-related reasons, and the company did not release it.
  • AWD/cholera cases and deaths, as a push factor: the issue was homogenizing epidemiological weeks with the calendar-month format of the time series. Later on we discovered the epiweeks Python package (see the sketch after this list).
  • Humanitarian cash-assistance data, as a pull/push factor: there was not enough historical data; only 2 years of data disaggregated per admin level 1 (per region) were available.
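
As mentioned in the AWD/cholera bullet above, the epiweeks Python package would have made it straightforward to map epidemiological weeks onto calendar months. A minimal sketch follows; the assignment rule here, taking the month of the week's start date, is our own simplification (a week straddling two months could also be split proportionally).

```python
from epiweeks import Week

def epiweek_to_month(year: int, week: int) -> str:
    """Return the calendar month (YYYY-MM) of an epidemiological week,
    based on the week's start date."""
    start = Week(year, week).startdate()
    return f"{start.year}-{start.month:02d}"

# Example: epidemiological week 30 of 2017 falls in July 2017.
print(epiweek_to_month(2017, 30))  # "2017-07"
```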

Dependent Variable

Forced displacement: as per UNHCR's mandate, forced displacement is the target (dependent) variable, calculated as the number of arrivals per region and per month - taking Dollo Ado as an extra "region" for modeling purposes. For internal displacement, we obtained historical data (7 years, covering the 18 regions of Somalia) from the Protection and Return Monitoring Network (PRMN) led by the UNHCR Somalia operation. For cross-border displacement - that is, refugee movement - we use official historical and current UNHCR registration data from the UNHCR Dollo Ado, Melkadida sub-office in Ethiopia.

Data wrangling

Wrangling or munging data is the process of converting raw data into a machine-readable format for analysis purposes. Unlike in many other sectors, humanitarian data, in terms of both access and quality, is far from this objective.

Acquisition

Obtaining access to the different datasets was originally very human-resource intensive. Certain datasets were not publicly available and/or were not in a machine-readable format (e.g. graphs with values on a website, public PDFs with historical figures, or formatted tables shared by email). Building Jetson from scratch required extensive coordination with data providers and partners, as well as building multiple data parsers.

Cleaning

The first remark about the quality of humanitarian data is the presence of missing values. Access to humanitarian data is tied to security conditions and to the presence of staff with technical skills on the ground. For this reason, teams dealing with humanitarian data need to develop deep knowledge of imputation techniques.

For Project Jetson, different imputation methods were used after observing missing values in the majority of the input datasets. Missing values affect the accuracy of predictions. This is particularly true for machine learning in regions where the target variable values are missing, for example in Sanaag region, where the majority of the values for the target variable (historical arrivals) are missing. Several imputation techniques were used to address this issue: for experiment #1, the multivariate imputation by chained equations (MICE) algorithm, complemented by some of the solutions suggested by the modeling tool's blog (Eureqa); for experiment #2, a binary indicator was developed to flag missing values, so that each calculation has both the imputed value and the missingness flag.
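
A minimal sketch of both ideas, using scikit-learn's IterativeImputer as a stand-in for MICE and an explicit missingness flag per feature; this is an illustration, not the exact implementation used in the experiments.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy region-month feature matrix with missing values (NaN).
X = pd.DataFrame({
    "incidents":  [12, 7, np.nan, 30],
    "fatalities": [3, np.nan, 2, 11],
    "arrivals":   [np.nan, 150, 90, np.nan],
})

# Experiment #1 style: MICE-like multivariate imputation.
mice = IterativeImputer(max_iter=10, random_state=0)
X_imputed = pd.DataFrame(mice.fit_transform(X), columns=X.columns)

# Experiment #2 style: keep a binary flag per feature, so each calculation
# has both the (imputed) value and the fact that it was originally missing.
flags = X.isna().astype(int).add_suffix("_was_missing")
X_with_flags = pd.concat([X_imputed, flags], axis=1)
```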

In addition, some datasets contain values that are typical of their measurement tools' error codes. For example, a .999 value means that the rain gauge sensor was not working (an error), as opposed to 0, which means no rain (rain = 0 mm). Other techniques were used to deal with this type of error. For example, in experiment #1, when more than 50% of the rain data was missing for a specific station, we excluded that station from the overall calculations. For rain stations with missing values but more than 50% of the values present, we averaged the values over the days present.
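
A minimal sketch of these cleaning rules, assuming a hypothetical CSV of daily readings with one column per station:

```python
import numpy as np
import pandas as pd

# Daily gauge readings: one row per day, one column per station (hypothetical file).
rain = pd.read_csv("daily_rain_by_station.csv", index_col="date", parse_dates=True)

# Sensor error code: .999 means the gauge was not working, not 0 mm of rain.
rain = rain.replace(0.999, np.nan)

# Experiment #1 rules: drop stations with more than 50% of days missing,
# then average the remaining stations over the days actually present
# (mean() skips NaN by default).
keep = rain.loc[:, rain.isna().mean() <= 0.5]
monthly_mean = keep.resample("MS").mean()
```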

To date, we are still experimenting with different methods of dealing with missing values in order to produce more accurate predictions.

Aggregation

One of the most common data formats for predictive analytics is the time series: data points indexed in time/date order. The time series aggregation format for Project Jetson input data is per admin level 1 (per region) and per month, because some of the variables used or considered, as described in the variables sections above, are reported at that frequency.

Although we fully acknowledge that Dollo Ado comprises a larger area that includes five UNHCR refugee camps (Melkadida, Kobe, Hilawayn, Bokolmanyo and Buramino), we considered it "region 19" for modeling purposes and treated the number of refugee arrivals there as an admin level 1 figure.
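
A minimal sketch of this step, with hypothetical file and column names:

```python
import pandas as pd

# Internal arrivals per region per month for the 18 regions (PRMN data).
idp = pd.read_csv("prmn_arrivals.csv", parse_dates=["month"])         # region, month, arrivals
# Refugee arrivals registered at Dollo Ado / Melkadida (UNHCR registration data).
dollo = pd.read_csv("dollo_ado_arrivals.csv", parse_dates=["month"])  # month, arrivals

# Treat Dollo Ado as "region 19" so both movements share one region-month panel.
dollo["region"] = "Dollo Ado"
panel = pd.concat([idp, dollo[["region", "month", "arrivals"]]], ignore_index=True)
panel = panel.sort_values(["region", "month"]).reset_index(drop=True)
```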

The time series starts on January 1, 2010 and, for machine learning purposes, finishes on June 1 or September 1, 2017 (experiment #1) and June 1, 2018 (for the initial stage of experiment #2). The data available for the validation set covers July 1, 2018 to July 1, 2019.
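
A minimal sketch of the corresponding date-based split for experiment #2 (hypothetical file and column names):

```python
import pandas as pd

# Region-month panel with features and the arrivals target.
panel = pd.read_csv("jetson_panel.csv", parse_dates=["month"])

# Train on January 1, 2010 through June 1, 2018; validate on
# July 1, 2018 through July 1, 2019.
train = panel[panel["month"] <= "2018-06-01"]
valid = panel[(panel["month"] >= "2018-07-01") & (panel["month"] <= "2019-07-01")]
```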

Bias

It is important to recognize that all data is biased, and therefore so are the outputs of predictive analytics. Bias in Project Jetson data comes partly from the methodological approach at the collection point (see the PRMN methodological notes): there may be underlying displacement movements that are not represented in the data. The ACLED methodology has the same kind of bias, since there may be underlying violent conflict not captured by the enumerators. In addition, the selection of variables took into consideration the opinions of humanitarian workers and refugees (observer bias). Finally, as explained in the cleaning section, bias from missing values (N/A) can skew some of the calculations.

Data exploration

Initial data exploration was conducted to observe the different datasets and to help us identify potential trends. For example, the following graph represents the aggregation of all ACLED violent conflict incidents across the 18 regions of Somalia, per month.

Violent Conflict Somalia

The following graph represents the number of arrivals of internally displaced people aggregated across all 18 regions of Somalia, per month, including arrivals within and across regions.

IDP arrivals

In addition, we explored the data from a geospatial perspective. The following map shows the aggregated number of violent conflict incidents per month, per region. The bubble size represents the number (frequency) of incidents.

Violent Conflict map

Additionally, the following map contains the aggregated number of deaths (fatalities) per month, per region.

Fatalities map