Get changesets for the OSM ways and relations #27

Open
wants to merge 1 commit into
base: main
84 changes: 84 additions & 0 deletions openskistats/changeset.py
Owner

Let's rename this file to openstreetmap.py

@@ -0,0 +1,84 @@
import requests
import xml.etree.ElementTree as ET
import polars as pl
from openskistats.analyze import load_runs_pl, load_ski_areas_pl

import re

def parse_osm_url(url: str):
Owner

return type annotation missing

    """Extract type and ID from an OpenStreetMap URL."""
    match = re.match(r"https://www\.openstreetmap\.org/(way|relation)/(\d+)", url)
    if match:
        osm_type, osm_id = match.groups()
        return osm_type, int(osm_id)
    return None, None

def fetch_changeset_history(osm_type: str, osm_id: int):
    """Fetch the changeset history for a given OSM entity."""
    url = f"https://api.openstreetmap.org/api/0.6/{osm_type}/{osm_id}/history"
Owner

Looks like our usage is borderline according to their policies https://operations.osmfoundation.org/policies/api/.

If going this route, we should rate limit requests and set up some method of caching. For a complex decision with many ramifications, like the caching method, consider discussing multiple options, either here or in the issue, prior to implementation.

We should identify ourselves as well, similar to

headers = {
    "From": "https://github.com/dhimmel/openskistats",
}
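
To make the rate-limiting and identification ideas concrete, here is a minimal sketch. Everything beyond the `From` header is an assumption: `RateLimiter` and `fetch_history_xml` are hypothetical names, the `User-Agent` value is a placeholder, and the one-second interval is an illustrative number rather than a figure taken from the OSM API usage policy. The sketch uses `urllib` from the standard library to stay dependency-free; with `requests`, the same dict can be passed as `requests.get(url, headers=OSM_HEADERS)`.

```python
import time
import urllib.request

# Identification headers. "From" follows the suggestion above;
# the User-Agent value is an assumed placeholder.
OSM_HEADERS = {
    "From": "https://github.com/dhimmel/openskistats",
    "User-Agent": "openskistats (https://github.com/dhimmel/openskistats)",
}


class RateLimiter:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable for testing
        self._sleep = sleep  # injectable for testing
        self._last = None  # time at which the previous call was released

    def wait(self):
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


def fetch_history_xml(osm_type, osm_id, limiter, opener=urllib.request.urlopen):
    """Fetch the raw history XML for one OSM element, throttled and identified."""
    url = f"https://api.openstreetmap.org/api/0.6/{osm_type}/{osm_id}/history"
    request = urllib.request.Request(url, headers=OSM_HEADERS)
    limiter.wait()  # throttle before every outgoing request
    with opener(request, timeout=30) as response:
        return response.read()
```

A single shared `RateLimiter(1.0)` instance passed to every call would keep the whole run under one request per second, whatever the appropriate interval turns out to be.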

Contributor Author

Ok, will think about this a bit more, and discuss.

    response = requests.get(url)
    response.raise_for_status()
Owner

I'm including the response from https://api.openstreetmap.org/api/0.6/relation/6981791/history for reference:

Expand for xml response
<osm version="0.6" generator="openstreetmap-cgimap 2.0.1 (3288085 spike-08.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
<relation id="6981791" visible="true" version="1" changeset="46092405" timestamp="2017-02-14T23:01:29Z" user="BK_man" uid="242352">
<member type="way" ref="474283680" role=""/>
<member type="way" ref="474283683" role=""/>
<member type="way" ref="474283679" role=""/>
<tag k="name" v="11. Warming Hut"/>
<tag k="piste:difficulty" v="easy"/>
<tag k="piste:type" v="nordic"/>
<tag k="route" v="piste"/>
<tag k="type" v="route"/>
</relation>
<relation id="6981791" visible="true" version="2" changeset="46385760" timestamp="2017-02-25T05:49:23Z" user="BK_man" uid="242352">
<member type="way" ref="474283680" role=""/>
<member type="way" ref="474283679" role=""/>
<member type="way" ref="474283683" role=""/>
<tag k="name" v="11. Warming Hut"/>
<tag k="piste:difficulty" v="easy"/>
<tag k="piste:type" v="nordic"/>
<tag k="route" v="piste"/>
<tag k="type" v="route"/>
</relation>
<relation id="6981791" visible="true" version="3" changeset="94749008" timestamp="2020-11-25T06:09:29Z" user="David Sanderson" uid="7679993">
<member type="way" ref="474283680" role=""/>
<tag k="name" v="11. Warming Hut"/>
<tag k="piste:difficulty" v="easy"/>
<tag k="piste:type" v="nordic"/>
<tag k="route" v="piste"/>
<tag k="type" v="route"/>
</relation>
<relation id="6981791" visible="true" version="4" changeset="94970016" timestamp="2020-11-29T05:36:54Z" user="David Sanderson" uid="7679993">
<member type="way" ref="878819927" role=""/>
<member type="way" ref="474283680" role=""/>
<member type="way" ref="878819926" role=""/>
<tag k="name" v="11. Warming Hut"/>
<tag k="piste:difficulty" v="easy"/>
<tag k="piste:type" v="nordic"/>
<tag k="route" v="piste"/>
<tag k="type" v="route"/>
</relation>
<relation id="6981791" visible="true" version="5" changeset="95638024" timestamp="2020-12-10T19:40:18Z" user="David Sanderson" uid="7679993">
<member type="way" ref="878819926" role=""/>
<member type="way" ref="878819927" role=""/>
<member type="way" ref="474283680" role=""/>
<member type="way" ref="877632846" role=""/>
<tag k="name" v="11. Warming Hut"/>
<tag k="piste:difficulty" v="easy"/>
<tag k="piste:type" v="nordic"/>
<tag k="route" v="piste"/>
<tag k="type" v="route"/>
</relation>
<relation id="6981791" visible="true" version="6" changeset="127640460" timestamp="2022-10-16T22:54:39Z" user="SMS03" uid="15395108">
<member type="way" ref="878819926" role=""/>
<member type="way" ref="1104527478" role=""/>
<member type="way" ref="878819927" role=""/>
<member type="way" ref="1104527481" role=""/>
<member type="way" ref="474283680" role=""/>
<member type="way" ref="877632846" role=""/>
<tag k="name" v="11. Warming Hut"/>
<tag k="piste:difficulty" v="easy"/>
<tag k="piste:type" v="nordic"/>
<tag k="route" v="piste"/>
<tag k="type" v="route"/>
</relation>
</osm>

Let's capture a bit more of this response, including version and uid (user ID).
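
As an illustration of what capturing those extra attributes could look like, here is a sketch that parses a trimmed copy of the response above. `parse_history_xml` is a hypothetical name, and casting the IDs to integers is a design choice that the current code does not make. `user` and `uid` are read defensively on the assumption that some elements may lack them.

```python
import xml.etree.ElementTree as ET

# Two versions trimmed from the relation/6981791 response above (members and tags omitted).
SAMPLE_XML = b"""<osm version="0.6">
  <relation id="6981791" visible="true" version="1" changeset="46092405" timestamp="2017-02-14T23:01:29Z" user="BK_man" uid="242352"/>
  <relation id="6981791" visible="true" version="3" changeset="94749008" timestamp="2020-11-25T06:09:29Z" user="David Sanderson" uid="7679993"/>
</osm>"""


def parse_history_xml(xml_bytes, osm_type):
    """Return one record per element version found in a history response."""
    root = ET.fromstring(xml_bytes)
    records = []
    for element in root.findall(f"./{osm_type}"):
        attrib = element.attrib
        records.append(
            {
                "version": int(attrib["version"]),
                "changeset_id": int(attrib["changeset"]),
                "timestamp": attrib["timestamp"],
                # Read defensively in case user details are absent (assumption).
                "user": attrib.get("user"),
                "uid": int(attrib["uid"]) if "uid" in attrib else None,
                "visible": attrib.get("visible") == "true",
            }
        )
    return records


records = parse_history_xml(SAMPLE_XML, "relation")
```

Keeping the parser as a function of raw bytes means it can run against stored XML without making any requests.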


    # Parse the XML response
    root = ET.fromstring(response.content)
    changesets = []
    for element in root.findall(f"./{osm_type}"):
        changesets.append({
            "changeset_id": element.attrib.get("changeset"),
            "user": element.attrib.get("user"),
            "timestamp": element.attrib.get("timestamp"),
        })
    return changesets

def process_batch(batch: pl.DataFrame) -> pl.DataFrame:
    """Process a batch of OSM URLs to fetch their changeset histories."""
    changeset_records = []
    for url in batch["osm_url"]:
        osm_type, osm_id = parse_osm_url(url)
        if osm_type and osm_id:
            history = fetch_changeset_history(osm_type, osm_id)
            for record in history:
                record["osm_url"] = url
                record["osm_type"] = osm_type
                record["osm_id"] = osm_id
                changeset_records.append(record)
    return pl.DataFrame(changeset_records)

def get_changesets_for_runs_and_ski_areas() -> pl.DataFrame:
    run_sources = (
        load_runs_pl()
        .explode("run_sources")
        .select(
            "run_id",
            "run_name",
            "ski_area_ids",
            pl.col("run_sources").alias("run_source"),
        )
        .collect()
    )

    ski_area_sources = (
        load_ski_areas_pl()
        .explode("ski_area_sources")
        .select(
            "ski_area_id",
            "ski_area_name",
            pl.col("ski_area_sources").alias("ski_area_source"),
        )
        .filter(pl.col("ski_area_source").str.starts_with("https://www.openstreetmap.org"))
    )

    osm_urls = sorted(set(ski_area_sources["ski_area_source"].to_list()) | set(run_sources["run_source"].to_list()))

    osm_urls_df = pl.DataFrame({"osm_url": osm_urls})

    # Process the OSM URLs using a map function
    changeset_pl_df = osm_urls_df.select(
        pl.col("osm_url").map_elements(
            lambda url: process_batch(pl.DataFrame({"osm_url": [url]})),
            return_dtype=pl.Object
        ).alias("changeset_data")
    ).explode("changeset_data")
Contributor Author

This is incorrect. @dhimmel, could you help me with the polars syntax to parallelize / collect the calls to process_batch?

Owner

Ah yes, will help. But I think we should first collect all XML responses prior to polars, with some sort of persistent caching. We could proceed with a dev sample of 100 or so OSM elements. Once we have a database/file with the XML, we can then figure out how to read all records into polars, but I think it will make more sense to handle requests outside of the polars dataframe creation.
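
As one possible option to feed the caching discussion, and not a settled design, the sketch below stores each raw XML response in a file keyed by element type and ID, and only calls the fetcher on a cache miss. `get_history_xml_cached` and the directory layout are hypothetical, and the fetcher is injected so the example runs without network access.

```python
import tempfile
from pathlib import Path


def get_history_xml_cached(osm_type, osm_id, cache_dir, fetch):
    """Return history XML from disk if present; otherwise fetch it and store it."""
    path = Path(cache_dir) / osm_type / f"{osm_id}.xml"
    if path.exists():
        return path.read_bytes()
    xml_bytes = fetch(osm_type, osm_id)  # only reached on a cache miss
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(xml_bytes)
    return xml_bytes


# Usage with a stand-in fetcher; a real one would call the OSM API.
calls = []


def fake_fetch(osm_type, osm_id):
    calls.append((osm_type, osm_id))
    return b"<osm/>"


with tempfile.TemporaryDirectory() as tmp:
    first = get_history_xml_cached("relation", 6981791, tmp, fake_fetch)
    second = get_history_xml_cached("relation", 6981791, tmp, fake_fetch)  # served from disk
```

A file-per-element cache never refreshes on its own, so new edits to an element would go unseen until the file is deleted. That trade-off, along with alternatives such as a single SQLite file or an HTTP-level cache, is the kind of thing worth settling in the issue first.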


    # Display the resulting DataFrame
    return changeset_pl_df
Owner

Make sure pre-commit hooks are installed and then run pre-commit run --all-files. This might fix the failing CI runs.

Contributor Author

Thanks, yah the CI failures are on pre-commit.
