GTFS Schedule Schema: add filesize and calendar range metadata #525

Open
wants to merge 2 commits into
base: main


mil
Contributor

@mil mil commented Oct 10, 2024

This PR adds three new system-generated metadata properties to the bounding_box specification for newly added GTFS schedule sources:

  • extracted_filesize: indicates GTFS archive filesize in bytes
  • extracted_calendar_start: indicates GTFS archive calendar/calendar_dates min date
  • extracted_calendar_end: indicates GTFS archive calendar/calendar_dates max date

These properties function similarly to the existing extracted_on metadata property in the schema.

These three properties (extracted_filesize, extracted_calendar_start, and extracted_calendar_end) would be useful to end-consumers in several applications. Take filesize: consuming a 10 MB archive is very different from consuming a 500 MB archive in both download and processing time, so it helps end-consumers to know the size ahead of time. (One application that would benefit is my Android app, Transito, which consumes the GTFS feeds listed in MDB; I would greatly appreciate being able to pass metadata such as filesize along to my end-users.) The calendar start/end range additionally records what the source's calendar range was at the time it was added or updated.

This PR only adds the three new properties. Once it is applied, all new sources would carry this metadata by default; follow-up PR(s) could update existing sources.
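As a rough illustration of the consumer side, here is a minimal sketch of how an app might use extracted_filesize before downloading. The function names, the 100 MB threshold, and the assumption that the consumer already holds the source's bounding_box as a dict are all hypothetical and not part of this PR.

```python
def format_filesize(num_bytes):
    """Render a byte count with decimal units (1 MB = 1,000,000 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        # Stop at the first unit that fits, or at GB as the largest unit.
        if size < 1000 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1000


def should_warn_before_download(bounding_box, limit_bytes=100_000_000):
    """Return True when the recorded archive size exceeds limit_bytes.

    Sources added before this metadata existed have no extracted_filesize,
    so they never trigger a warning here (hypothetical policy).
    """
    size = bounding_box.get("extracted_filesize")
    return size is not None and size > limit_bytes
```

A source without the new property simply skips the warning, so older catalog entries keep working unchanged.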

In addition to running the updated tests, you can quickly see what the new format looks like for a sample source with, for example:

add_gtfs_schedule_source(provider="foo", country_code="bar", direct_download_url='http://data.trilliumtransit.com/gtfs/cedarrapids-ia-us/cedarrapids-ia-us.zip')

New properties for schedule schema:
  extracted_filesize: The filesize in bytes of GTFS archive extracted
  extracted_calendar_start: Earliest date referenced in calendar/calendar_dates
  extracted_calendar_end: Latest date referenced in calendar/calendar_dates

Also adds related helper functions:
  extract_gtfs_calendar_range: Extract calendar range from a GTFS archive
  get_filesize: Gets the filesize in bytes given a filepath
  is_gtfs_yyyymmdd_format: Determines if date is in GTFS YYYYMMDD format
Also adds tests for new helper extract_gtfs_calendar_range function
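
For readers who want a feel for what these helpers do, the following is a minimal sketch and not the code in this PR. The signatures, the choice to return raw YYYYMMDD strings, and the (None, None) result for archives without calendar data are assumptions made for illustration.

```python
import csv
import io
import os
import re
import zipfile
from datetime import datetime


def is_gtfs_yyyymmdd_format(value):
    """Return True if value is an 8-digit string naming a real calendar date."""
    if not isinstance(value, str) or not re.fullmatch(r"\d{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def get_filesize(path):
    """Return the size in bytes of the file at path."""
    return os.path.getsize(path)


def extract_gtfs_calendar_range(path):
    """Return (min_date, max_date) across calendar.txt and calendar_dates.txt.

    Dates are returned as YYYYMMDD strings (assumed output format).
    Returns (None, None) if no valid date is found.
    """
    date_columns = (
        ("calendar.txt", ("start_date", "end_date")),
        ("calendar_dates.txt", ("date",)),
    )
    dates = []
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
        for filename, columns in date_columns:
            if filename not in names:
                continue  # both files are optional on their own
            with archive.open(filename) as raw:
                # utf-8-sig drops a leading byte order mark if one is present
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                for row in csv.DictReader(text):
                    for column in columns:
                        value = (row.get(column) or "").strip()
                        if is_gtfs_yyyymmdd_format(value):
                            dates.append(value)
    if not dates:
        return None, None
    # YYYYMMDD strings sort lexicographically in chronological order
    return min(dates), max(dates)
```

Comparing the strings directly avoids a round trip through date objects, and malformed values are skipped rather than raising, which seems a reasonable default for catalog metadata.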