Gathering accurate information in rural areas is often costly and difficult, which means local leaders may lack the knowledge of their constituents’ needs required to make evidence-based decisions. Given IDinsight’s focus on data-driven decisions in the social sector, we see this as an important information gap.
Further, we identified the need to collect household-level data from a representative sample of households. However, no reliable, up-to-date list of households (or villages) existed that we could use as a sampling frame (i.e., from which to draw a sample). An additional challenge is Nano’s mandate to be scalable and provide information at low cost, so censusing the whole chiefdom to create a sampling frame was not an option. We therefore strove to create a sampling methodology that was economical, scalable, and easy for surveyors to implement.
This notebook contains code to implement the geographic segmentation portion of Nano's sampling strategy. A more detailed description is contained in the corresponding blog post.
The geographic segmentation is implemented in Python as follows:
- Plot all relevant boundary and population/structure datasets to visualize the study area, its boundaries (e.g. rivers and roads), and its population information (e.g. the Facebook and OpenStreetMap datasets), and to check that the data make sense.
- Divide the study area into smaller cells (we call them enumeration areas - EAs). We use 500 by 500 meter squares as EAs, but other shapes are possible, such as village cluster boundaries.
- Determine areas with high probability of household presence, using the Facebook population and OpenStreetMap buildings datasets.
- Identify EAs with a non-zero probability of household presence. We assume these are EAs that have a non-zero population from Facebook and/or buildings from OpenStreetMap; other rules could be used. (A minimal code sketch of steps 2-4 follows this list.)
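The sketch below (not the repo's exact code) illustrates steps 2-4 with geopandas and shapely: it builds a 500 by 500 meter grid over the study area and flags grid cells that contain Facebook population points and/or OpenStreetMap buildings. File names follow the Inputs section; column names such as ea_id and populated are illustrative assumptions.

```python
import numpy as np
import geopandas as gpd
from shapely.geometry import box

# The study area must be in a metric CRS (EPSG: 32735) so cell sizes are in meters.
study_area = gpd.read_file("shapefiles/study_area_32735.shp")
xmin, ymin, xmax, ymax = study_area.total_bounds

# Step 2: build a 500 m grid covering the study area's bounding box.
cell = 500
cells = [box(x, y, x + cell, y + cell)
         for x in np.arange(xmin, xmax, cell)
         for y in np.arange(ymin, ymax, cell)]
grid = gpd.GeoDataFrame(geometry=cells, crs=study_area.crs)

# Keep only cells that intersect the study area, then give each EA an identifier.
grid = grid[grid.intersects(study_area.unary_union)].reset_index(drop=True)
grid["ea_id"] = grid.index

# Steps 3-4: flag EAs containing Facebook population and/or OpenStreetMap buildings.
fb = gpd.read_file("shapefiles/fb_roofs_4326.shp").to_crs(grid.crs)
osm = gpd.read_file("shapefiles/roofs_4326.shp").to_crs(grid.crs)
grid["has_fb"] = grid.intersects(fb.unary_union)
grid["has_osm"] = grid.intersects(osm.unary_union)
grid["populated"] = grid["has_fb"] | grid["has_osm"]
```

Note that this sketch does not split EAs along roads and rivers; the full script also uses those boundaries when constructing the grid.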
File structure:
- geographical_segmentation.ipynb: the main Python script as a Jupyter Notebook,
- geographical_segmentation.py: the main Python script (identical to geographical_segmentation.ipynb),
- functions/clean_data.py: user-built functions to clean data,
- functions/mapping.py: user-built functions to create maps,
- shapefiles: folder to contain all Shapefiles used in analysis,
- plots: folder to save all plots created in analysis,
- data: folder to contain the primary output from analysis (see the first item of the Outputs section below).
Inputs (all saved as shapefiles in shapefiles folder):
- Boundary of study area (Mukobela Chiefdom in our case). This should be in a coordinate reference system (CRS) using meters so that the EAs can be constructed using a width and length specified in meters. The CRS using meters in southern Africa is EPSG: 32735, and a shapefile's CRS can be changed in QGIS,
- Population estimates (Facebook and roofs datasets in our case). These should be in the CRS using latitude and longitude, which is EPSG: 4326,
- Any other relevant boundaries to separate EAs (roads and rivers in our case). These should be in the CRS using latitude and longitude, which is EPSG: 4326.
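Before processing, it can help to confirm each input is in its expected CRS. A minimal check, assuming the file names above:

```python
import geopandas as gpd

study_area = gpd.read_file("shapefiles/study_area_32735.shp")   # metric CRS
fb_pop = gpd.read_file("shapefiles/fb_roofs_4326.shp")          # lat/lon CRS
roads = gpd.read_file("shapefiles/roads_4326.shp")
rivers = gpd.read_file("shapefiles/rivers_4326.shp")

assert study_area.crs.to_epsg() == 32735, "study area should be in EPSG: 32735"
for name, layer in [("fb_roofs", fb_pop), ("roads", roads), ("rivers", rivers)]:
    assert layer.crs.to_epsg() == 4326, f"{name} should be in EPSG: 4326"
```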
Outputs:
- data/EA_information.csv: a CSV with all EAs coordinates and population estimates. This file allows us to construct our sampling frame.
- shapefiles: there are 4 new shapefiles saved to the shapefiles folder. They are described below:
2a) study_area_4326.shp: the study area converted into the latitude and longitude coordinate reference system.
2b) grids_final_4326.shp: the EAs, each with a unique identifier and flags for whether the EA contains a non-zero Facebook population estimate and/or OpenStreetMap buildings.
2c) roads_final_4326.shp: the roads used as a boundary to construct the EAs. This only includes "large" roads.
2d) rivers_final_4326.shp: the rivers used as a boundary to construct the EAs. This only includes "large" rivers.
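As an illustration of how the primary output feeds the sampling frame, the sketch below (the populated column name is an assumption, and 30 EAs is an arbitrary sample size) reads data/EA_information.csv, keeps EAs flagged as populated, and draws a simple random sample:

```python
import pandas as pd

eas = pd.read_csv("data/EA_information.csv")
frame = eas[eas["populated"]]                   # keep EAs with people/buildings
sample = frame.sample(n=30, random_state=2021)  # reproducible random draw
sample.to_csv("data/sampled_EAs.csv", index=False)
```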
To install all prerequisites, you first need to download Python and Jupyter Notebook. An easy way to do so is to install both through Anaconda.
Then, the Python modules required to run the analysis can be installed using the requirements.txt file.
pip install -r requirements.txt
The input files, with pre-processing instructions, are listed below. All of these files should be saved in the shapefiles folder.
- study_area_32735.shp: boundary of study area. We created this by drawing the boundary in QGIS. If the study area is an administrative region (e.g. country, state, district), then publicly available shapefiles may exist. A good place to look for such files is the HumData website.
1a) This should be in a CRS using meters, which is EPSG: 32735 for southern Africa. A shapefile's CRS can be changed in QGIS3 by importing the shapefile, right-clicking the imported layer, selecting Export, then Save Features As, and changing the CRS in the drop-down (alternatively, see the geopandas sketch after this list).
1b) A description of CRS is here.
1c) Line 90 of geographical_segmentation.py contains the original CRS. The CRS for the study area shapefile should be reflected in this line of code.
- roofs_4326.shp and fb_roofs_4326.shp: shapefiles with the OpenStreetMap buildings and Facebook population datasets.
2a) Save the datasets in the shapefiles folder using the longitude and latitude CRS (EPSG: 4326).
- roads_4326.shp and rivers_4326.shp: shapefiles with road and river boundaries in your study area.
3a) Save the datasets in the shapefiles folder using the longitude and latitude CRS (EPSG: 4326).
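If you prefer to reproject in Python instead of QGIS, here is a short geopandas sketch (the input file name is hypothetical) that converts a shapefile to the required CRS:

```python
import geopandas as gpd

# Reproject a hypothetical raw boundary file to the metric CRS used for the EAs.
layer = gpd.read_file("shapefiles/study_area_raw.shp")
layer.to_crs(epsg=32735).to_file("shapefiles/study_area_32735.shp")

# The same pattern converts layers to latitude/longitude (EPSG: 4326).
# layer.to_crs(epsg=4326).to_file("shapefiles/layer_4326.shp")
```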
The code can be run from the command line or from Jupyter Notebook.
To run the code from the command line, open a terminal, navigate to the local directory with the GitHub repo, and enter the commands below:
cd enter_local_directory
python geographical_segmentation.py
To run the Jupyter Notebook, open Jupyter Notebook, navigate to the local directory with the GitHub repo, and open the Notebook. Then, select Cell and then Run All.
This project is licensed under the GPLv3 license - see the LICENSE file for details.