Skip to content

Commit

Permalink
Merge pull request #20 from transport-data/issue/18
Browse files Browse the repository at this point in the history
Adapt to changes in the ADB ATO data source
  • Loading branch information
khaeru committed Jun 18, 2024
2 parents d52312e + 1765e43 commit 47e2c3a
Show file tree
Hide file tree
Showing 3 changed files with 54 additions and 3 deletions.
35 changes: 35 additions & 0 deletions doc/adb.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,39 @@
Asian Development Bank
**********************

This module handles the `Asian Transport Outlook (ATO) <https://asiantransportoutlook.com>`_ source maintained by the Asian Development Bank (ADB, initially) and Asian Infrastructure Investment Bank (AIIB, more recently), specifically the `ATO National Database <https://asiantransportoutlook.com/snd/>`_.

In particular, it converts data from the :ref:`ATO native Excel file format <ato-format>`—both the 2022-10-07 and 2024-05-20 formats—to SDMX and extracts metadata.

.. contents::
:local:
:backlinks: none

.. _ato-format:

ATO National Database format
============================

The ATO native Excel file format is characterized by the following.
There is an `ATO National Database User Guide <https://asiantransportoutlook.com/snd/snd-userguide/>`_ that contains some of the information below, but does not describe the file format.

- Data flow IDs like ``TAS-PAT-001(1)``, wherein:

- ``TAS`` is the ID of a ‘category’ code with a corresponding name like “Transport Activity & Services”.
Individual files (‘workbooks’) contain data for flows within one category.
- ``PAT`` is the ID of a ‘subcategory’ code with a corresponding name like “Passenger Activity Transit”.
- ``001(1)`` is the ID of an ‘indicator’ code with a corresponding name like “Passengers Kilometer Travel - Railways”.
- All data flows with the same initial part like ``TAS-PAT-001`` contain data for the same *measure*.
The final part ``(1)`` indicates data from alternate sources for the same measure.
- Files contain a “TOC” sheet and further individual sheets.
See :func:`~.adb.read_sheet`, which reads these sheets, for a detailed description of the apparent format.
- The files are periodically updated.
- The update schedule is not fixed in advance.
- Previous versions of the files do not appear to be available.
- The file metadata contains a "Created by:" field with information like "2022-10-07, 11:41:56, openpyxl".
At least two different dates have been observed:

- 2022-10-07
- 2024-05-20.

.. include:: _api/transport_data.adb.rst
2 changes: 2 additions & 0 deletions doc/whatsnew.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@ Next release
============

- Add :doc:`standards` and :doc:`roadmap` documentation pages (:pull:`9`).
- Adjust :mod:`.adb` for changes in data format in the 2024-05-20 edition of the ATO National Database (:pull:`20`, :issue:`18`).
Document the :ref:`current file format <ato-format>` that the code supports.

v24.2.5
=======
Expand Down
20 changes: 17 additions & 3 deletions transport_data/adb/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
from typing import Callable, Tuple
from urllib.parse import quote

import numpy as np
import pandas as pd
import sdmx.model.v21 as m

Expand Down Expand Up @@ -167,11 +168,24 @@ def read_sheet(
data_col_mask = list(map(str.isnumeric, df.columns))

# Handle values not parsed by pd.ExcelFile.parse(). Some cells have values like
# "12,345.6\t", which are not parsed to float. Strip the trailing whitespace and
# thousands separators, then convert.
# "12,345.6\t", which are not parsed to float.
#
# - Strip trailing whitespace.
# - Remove thousands separators ("," in 2022-10-17 edition; " " in 2024-05-20
# edition) and whitespace before the decimal separator ("10860 .6", 2024-05-20
# edition).
# - Replace "-" (no data) and "long ton" (erroneous) appearing since 2024-05-20
# edition.
# - Finally, convert to float.
dtypes = df.loc[:, data_col_mask].dtypes # Dtypes of data columns only
for col, _ in filter(lambda x: x[1] != "float", dtypes.items()):
df[col] = df[col].str.strip().str.replace(",", "").astype(float)
df[col] = (
df[col]
.str.strip()
.str.replace(r"(\d)[, ]([\d\.])", r"\1\2", regex=True)
.replace("^(-|long ton)$", np.nan, regex=True)
.astype(float)
)

# Identify remarks columns: any entries at the *end* of `df.columns` with non-
# numeric labels. Use the index of last data column, counting backwards.
Expand Down

0 comments on commit 47e2c3a

Please sign in to comment.