Geocode during ETL #139

Open
slifty opened this issue Aug 24, 2021 · 2 comments

Comments

slifty commented Aug 24, 2021

As part of our maps exploration we spent some time geocoding data; there's a desire to actually do that geocoding as part of the ETL pipeline.

This would give us a few things:

  1. Any application would have access to geocoded data without needing to re-transform. This also opens possibilities for search and other functions.
  2. If data can't be geocoded, we would be able to know that right away.

As part of this issue we should try to leverage the R&D done in that analysis repository, though it may turn out that the tools used there aren't a perfect fit for our ETL.

slifty commented Aug 24, 2021

Did some digging, here are the results so far.

The Situation

The prototype we created so far uses the Nominatim API, which is ultimately powered by OpenStreetMap. This is fine, but they have a usage policy which makes it fairly clear that they don't really want full-blown ETL pipelines built against it.

That said, our data volume is on the lower end of things (e.g. thousands of addresses rather than millions). This means it might be possible for us to use the OSM Nominatim API while staying in the spirit of their terms, but it would require a bit of engineering (in particular: caching and the ability to handle unexpected application of rate limits for larger batches).

The Nominatim API also might dramatically slow the ETL pipeline: their usage policy says they want no more than 1 request per second (though later it does say small batch jobs are OK as well).

Some Options

1: Use the OSM Nominatim API

Using the existing API, as the prototype does, can be done. There is a bit of overhead associated with being a good FOSS citizen (and of course we risk them shutting down our requests if we don't follow their rules). Specifically, we would want to:

  1. Figure out a way to cache geocoded data, possibly by treating geocoding as a preprocessing step that runs only once on a given column, with the results written back to the original data file and ultimately committed to the SVN repo.

  2. Make sure that ETL submits only one request at a time, rather than running multiple addresses through the API at once.

  3. Ensure that we never have infrastructure that runs multiple batches of addresses at the same time (I don't think this will be a problem, but it's still a policy with long-term technical implications).
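
To make the caching and one-request-at-a-time points concrete, here is a rough sketch of what a wrapper could look like. Everything here is hypothetical (the `CachedThrottledGeocoder` name, the in-memory dict cache, the whitespace/case normalization); it takes any callable that maps an address to a lat / lng pair, so it is not tied to a specific geocoding library. It is an illustration under those assumptions, not the approach the prototype uses.

```python
import time
from typing import Callable, Dict, Optional, Tuple

LatLng = Tuple[float, float]


class CachedThrottledGeocoder:
    """Hypothetical wrapper: caches results and spaces out live requests."""

    def __init__(
        self,
        geocode: Callable[[str], Optional[LatLng]],
        min_delay_seconds: float = 1.0,
        cache: Optional[Dict[str, Optional[LatLng]]] = None,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._geocode = geocode
        self._min_delay = min_delay_seconds
        # The cache could be loaded from (and saved back to) the data file.
        self.cache = {} if cache is None else cache
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def lookup(self, address: str) -> Optional[LatLng]:
        # Normalize whitespace and case so trivial variants share one entry.
        key = " ".join(address.split()).lower()
        if key in self.cache:
            return self.cache[key]
        self._wait_for_turn()
        result = self._geocode(address)
        self._last_request = self._clock()
        # Failed lookups (None) are cached too, so they aren't retried each run.
        self.cache[key] = result
        return result

    def _wait_for_turn(self) -> None:
        # Sleep just long enough to keep min_delay between live requests.
        if self._last_request is None:
            return
        remaining = self._min_delay - (self._clock() - self._last_request)
        if remaining > 0:
            self._sleep(remaining)
```

Because lookups happen sequentially through a single object, only one request is ever in flight, and repeated ETL runs over the same column would only hit the API for addresses that are not cached yet.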

2: Self-host the Nominatim API

This can be done! It is a big undertaking and we won't want to do it, but just in case somehow that changes, the instructions are here.

3: Use a paid API

There are a few third-party APIs, some of which use Nominatim under the hood plus their own mix of spices / other FOSS tools.

Ideally we could use something that GeoPy supports out of the box; that way it's easy to swap in something truly FOSS (e.g. Nominatim) at any point in time.

slifty commented Aug 24, 2021

Good news, everyone: it looks like OpenCage is supported by GeoPy.

I don't think the service itself is FOSS, though they do publish a lot of their code.

Importantly, switching geocoding providers should be pretty darn simple thanks to GeoPy. Just as important, the data is open.
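
As a rough illustration of why the swap should be simple: GeoPy exposes `get_geocoder_for_service`, which returns a geocoder class for a service name such as `"nominatim"` or `"opencage"`, and the geocoders share a `geocode()` method that returns a location (with `latitude` / `longitude`) or `None`. The helper names `build_geocoder` and `to_lat_lng` below are made up for this sketch, and how the ETL would actually pass configuration is an assumption.

```python
from typing import Any, Optional, Tuple


def build_geocoder(service: str, **config: Any) -> Any:
    """Instantiate a GeoPy geocoder by service name, e.g. "nominatim"."""
    # Imported here so the helper below stays usable without GeoPy installed.
    from geopy.geocoders import get_geocoder_for_service

    geocoder_class = get_geocoder_for_service(service)
    return geocoder_class(**config)


def to_lat_lng(geocoder: Any, address: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude), or None if the address can't be geocoded."""
    location = geocoder.geocode(address)
    if location is None:
        return None
    return (location.latitude, location.longitude)
```

With this shape, changing providers would mostly mean changing one call, for example `build_geocoder("opencage", api_key="...")` versus `build_geocoder("nominatim", user_agent="our-etl-pipeline")` (the key and user agent values are placeholders), while the rest of the pipeline keeps calling `to_lat_lng`.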

slifty added a commit that referenced this issue Aug 30, 2021
The GeocodeAdder is an InformationAdder which will generate a lat / lng
pair for a given address in a proposal. The adder is able to combine
several separate columns into one, since our addresses tend to be split
into parts.

Issue #139
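
For readers who have not opened the commit, the column-combining step might look roughly like the sketch below. This is not the actual GeocodeAdder code; the function name, the column names, and the comma-separated output format are assumptions made for illustration.

```python
from typing import Iterable, Mapping, Optional


def combine_address_columns(
    row: Mapping[str, Optional[str]], columns: Iterable[str]
) -> str:
    """Join the non-empty address parts of a row into one string."""
    parts = []
    for column in columns:
        # Missing or empty columns are skipped rather than leaving gaps.
        value = (row.get(column) or "").strip()
        if value:
            parts.append(value)
    return ", ".join(parts)
```

The combined string is what would then be handed to the geocoder to produce the lat / lng pair.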
slifty added a commit that referenced this issue Aug 30, 2021
This geocodes the various addresses associated with proposals in the
LLIIA2020 competition.

Issue #139
slifty added a commit that referenced this issue Sep 1, 2021
The GeocodeAdder is an InformationAdder which will generate a lat / lng
pair for a given address in a proposal. The adder is able to combine
several separate columns into one, since our addresses tend to be split
into parts.

Issue #139
slifty added a commit that referenced this issue Sep 1, 2021
This geocodes the various addresses associated with proposals in the
LLIIA2020 competition.

Issue #139
slifty added a commit that referenced this issue Sep 16, 2021
SimpleMaps is a mediawiki plugin which will convert certain wiki tables
to Leaflet maps.  This adds a new map TOC in the SimpleMap format.

Issue #139
slifty added a commit that referenced this issue Sep 16, 2021
Geocoding is an expensive process, and sometimes we want to be able to
run local ETL scripts that generate geo data but don't actually run
against a geocoder.  The `debug` flag enables this functionality.

This is intended to only be used for local testing, and the flag may
disappear in future.

Issue #139
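
One way a `debug` switch like this could work is sketched below. The real flag's behavior is not described beyond the commit message, so the placeholder coordinates and the `geocode_address` function are purely hypothetical.

```python
from typing import Callable, Optional, Tuple

LatLng = Tuple[float, float]

# Placeholder coordinates returned in debug mode (an arbitrary choice).
DEBUG_LAT_LNG: LatLng = (0.0, 0.0)


def geocode_address(
    address: str,
    geocode: Callable[[str], Optional[LatLng]],
    debug: bool = False,
) -> Optional[LatLng]:
    """Geocode an address, or return a placeholder without any request."""
    if debug:
        # Local testing: produce geo data without touching the geocoder.
        return DEBUG_LAT_LNG
    return geocode(address)
```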