Update routes for specific dates #638

Open
wants to merge 56 commits into
base: master
56 commits
614bfe3
renamed save_routes method to save_new_routes
Brian-Lee Apr 22, 2020
857fe04
fixed naming error with routeconfig.new_save_routes
Brian-Lee Apr 22, 2020
c87b43e
added save_old_routes method
Brian-Lee Apr 22, 2020
3817c06
called save_old_routes after calling save_new_routes - doesnt save pr…
Brian-Lee Apr 22, 2020
22cb4c3
some progress making a versioned cache dir for old route
Brian-Lee Apr 23, 2020
e56fa1e
eliminated unecessary use_versioning variable
Brian-Lee Apr 23, 2020
28b1931
consolidated save_old_routes and save_new_routes into save_routes
Brian-Lee Apr 23, 2020
cb55856
framework for executing a scrape saving routes normally followed by a…
Brian-Lee Apr 23, 2020
33a8e53
removed 'notdated' from non-archived routes JSON files
Brian-Lee Apr 23, 2020
690e0cc
removed unused method download_gtfs_data and 'dated' from filenames
Brian-Lee Apr 23, 2020
0c9fba6
put in a more realistic date for the archive date version for the sin…
Brian-Lee Apr 23, 2020
26f1772
moved imports to the top
Brian-Lee Apr 23, 2020
9998299
added reminder comment to properly get archived GTFS data
Brian-Lee Apr 23, 2020
ee2ebe5
can add multiple archive urls to archive routes
Brian-Lee Apr 23, 2020
50812c4
pulling archive urls from a list
Brian-Lee Apr 23, 2020
5f7112f
make url from date and loop through archiving urls for archiving routes
Brian-Lee Apr 23, 2020
652a459
use transitfeeds api to get old routes to version by date and cache -…
Brian-Lee Apr 24, 2020
36565da
eliminate unecessary param archiving_old and other cleanup
Brian-Lee Apr 24, 2020
7ac1b10
some cleanup
Brian-Lee Apr 24, 2020
6ebe655
passed archiving_date instead of current date per reviewer suggestion
Brian-Lee Apr 27, 2020
ff9500b
remove unecessary archive_date
Brian-Lee Apr 27, 2020
e618a1f
framework to take archiving_date argument
Brian-Lee Apr 27, 2020
aa0b423
added some comments
Brian-Lee Apr 27, 2020
d4f13f8
changed archived_date to gtfs_date
Brian-Lee Apr 27, 2020
6e0c3b3
combined GtfsScraper calls for both cases
Brian-Lee Apr 27, 2020
24f56e9
eliminated variable d
Brian-Lee Apr 27, 2020
bb7dbf4
added backwards date search if gtfs_date doesnt match exact zipfile date
Brian-Lee Apr 27, 2020
4d568e4
date suffix now matches actual date found and used
Brian-Lee Apr 27, 2020
d5a7cbc
removed duplicative checking for dated gtfs zipfile
Brian-Lee Apr 27, 2020
0bf3cf2
pass gtfs_path to scraper instead of gtfs_date
Brian-Lee Apr 27, 2020
e496466
some cleanup
Brian-Lee Apr 27, 2020
93b90b7
fixed bug where save_routes.py broken without gtsf_date argument
Brian-Lee Apr 27, 2020
af19055
added a comment
Brian-Lee Apr 27, 2020
483a0b2
combined duplicative lines
Brian-Lee Apr 30, 2020
2084742
changed command line argument gtfs_date to date
Brian-Lee Apr 30, 2020
385e209
changed the method of finding most recent gtfs zip
Brian-Lee Apr 30, 2020
8711770
Merge branch 'master' of https://github.com/trynmaps/metrics-mvp into…
Brian-Lee May 7, 2020
f35f14a
combined two identical lines into one
Brian-Lee May 7, 2020
e219b42
reduced if-else to just if
Brian-Lee May 7, 2020
198a3ae
eliminated unecessary else keyword
Brian-Lee May 7, 2020
8fcbb5b
removed outdated comments
Brian-Lee May 7, 2020
46c9927
removed unecessary assignment of save_to_s3
Brian-Lee May 7, 2020
d4f8b72
changed
Brian-Lee May 7, 2020
2a4c807
setting starting date more appropriately to date argument
Brian-Lee May 7, 2020
a5f5a62
chained two lines into one
Brian-Lee May 21, 2020
4d24dfe
eliminated unecessary import shutil
Brian-Lee May 21, 2020
fdba9b7
changed all occurrences of version_date to gtfs_date
Brian-Lee May 21, 2020
d2be8d4
changed one missed version_date to gtfs_date and removed unecessary i…
Brian-Lee May 21, 2020
78e5386
removed unecessary imports
Brian-Lee May 21, 2020
ed4c4d7
improved the comment
Brian-Lee May 21, 2020
3b30c68
removed unecessary parameter from save_routes method
Brian-Lee May 21, 2020
7b81c1a
simplified vars - removed date_to_use
Brian-Lee May 21, 2020
8a7443c
added error msg for dated gtfs file not found and moved code into new…
Brian-Lee May 22, 2020
de0b556
corrected inconsistency -YYYY-MM-DD vs _YYYY-MM-DD in routes path
Brian-Lee May 22, 2020
5639610
fixed introduced bug resetting gtfs_path and gtfs_date outside of ELSE
Brian-Lee May 22, 2020
db1bacc
conditionally load gtfs data from the cache-dir or gtfs_path
Brian-Lee May 22, 2020
35 changes: 23 additions & 12 deletions backend/models/gtfs.py
@@ -8,6 +8,8 @@
import gzip
import hashlib
import zipfile
import os
from datetime import datetime, timedelta

from . import config, util, nextbus, routeconfig, timetables

@@ -49,28 +51,35 @@ def get_stop_geometry(stop_xy, shape_lines_xy, shape_cumulative_dist, start_inde
'after_index': best_index, # the index of the coordinate of the shape just before this stop
'offset': int(best_offset) # distance in meters between this stop and the closest line segment of shape
}


def download_gtfs_data(agency: config.Agency, gtfs_cache_dir):
def get_gtfs_data(agency: config.Agency, gtfs_cache_dir, gtfs_path=None):
cache_dir = Path(gtfs_cache_dir)
zip_path = f'{util.get_data_dir()}/gtfs-{agency.id}.zip'
gtfs_url = agency.gtfs_url


if gtfs_url is None:
raise Exception(f'agency {agency.id} does not have gtfs_url in config')

cache_dir = Path(gtfs_cache_dir)

if not cache_dir.exists():
print(f'downloading gtfs data from {gtfs_url}')
r = requests.get(gtfs_url)

if r.status_code != 200:
raise Exception(f"Error fetching {gtfs_url}: HTTP {r.status_code}: {r.text}")

zip_path = f'{util.get_data_dir()}/gtfs-{agency.id}.zip'

with open(zip_path, 'wb') as f:
f.write(r.content)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
zip_ref.extractall(gtfs_cache_dir)
if gtfs_path is not None:
zip_path = gtfs_path

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
zip_ref.extractall(gtfs_cache_dir)


def is_subsequence(smaller, bigger):
smaller_len = len(smaller)
bigger_len = len(bigger)
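
The reworked `get_gtfs_data` above has two sources for the zip: an archived file passed as `gtfs_path`, or a fresh download when the cache directory is missing. A minimal sketch of that branching follows. It assumes a hypothetical `fetch_latest` callable that stands in for the `requests.get` download and returns the path of the saved zip; neither `unpack_gtfs` nor `fetch_latest` is part of this PR.

```python
import zipfile
from pathlib import Path
from typing import Callable, Optional


def unpack_gtfs(cache_dir: str,
                fetch_latest: Callable[[], str],
                gtfs_path: Optional[str] = None) -> Path:
    """Extract a GTFS zip into cache_dir and return that directory.

    Hypothetical helper for illustration, not the PR's code.
    """
    target = Path(cache_dir)

    if gtfs_path is None:
        if target.exists():
            # Current feed already extracted: reuse it, as the original does.
            return target
        # fetch_latest is an assumed stand-in for the HTTP download step.
        zip_path = Path(fetch_latest())
    else:
        zip_path = Path(gtfs_path)
        if not zip_path.is_file():
            raise FileNotFoundError(f'GTFS zip not found: {zip_path}')

    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(target)
    return target
```

Extracting an archived zip into the same directory as the current feed would mix the two, so a date-specific `cache_dir` is probably the safer choice for the archived case.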
@@ -108,15 +117,18 @@ def contains_excluded_stop(shape_stop_ids, excluded_stop_ids):
return False

class GtfsScraper:
def __init__(self, agency: config.Agency):
def __init__(self, agency: config.Agency, gtfs_path=None):
self.agency = agency
self.agency_id = agency_id = agency.id
gtfs_cache_dir = f'{util.get_data_dir()}/gtfs-{agency_id}'

download_gtfs_data(agency, gtfs_cache_dir)

self.feed = ptg.load_geo_feed(gtfs_cache_dir, {})
get_gtfs_data(agency, gtfs_cache_dir, gtfs_path=gtfs_path)
Member

For trimet, say, the GTFS directory ends up being
/data/gtfs-trimet/gtfs-trimet-2020-02-22 instead of /data/gtfs-trimet_2020-02-22 (for a sample date).

Then, on line 127, the GTFS that gets loaded is the one in gtfs_cache_dir, so gtfs_path ends up not being used.

Contributor Author

I added an if-else which should solve that. I think there are cleaner ways, and I haven't yet done a good job of testing this fix.

Contributor Author

I am going to re-request a review at this point. Thank you so much for looking through this over and over again!


if gtfs_path is None:
self.feed = ptg.load_geo_feed(gtfs_cache_dir, {})
else:
self.feed = ptg.load_geo_feed(gtfs_path, {})

self.errors = []
self.stop_times_by_trip = None
self.stops_df = None
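
The review thread above turns on where a dated feed is extracted and which directory is then loaded. One way to keep the two cases apart is to derive the cache directory from the date as well. This is only a sketch: `gtfs_cache_dir_for` and its `data_dir` parameter are hypothetical, and the `_YYYY-MM-DD` suffix simply follows the sample path in the review comment.

```python
from datetime import date
from typing import Optional


def gtfs_cache_dir_for(data_dir: str, agency_id: str,
                       gtfs_date: Optional[date] = None) -> str:
    """Return the extraction directory for the current or a dated feed.

    Hypothetical helper; the naming follows the reviewer's sample path.
    """
    base = f'{data_dir}/gtfs-{agency_id}'
    if gtfs_date is None:
        # Current feed: same directory the scraper already uses.
        return base
    # Archived feed: separate directory so it cannot shadow the current one.
    return f'{base}_{gtfs_date.isoformat()}'
```

With a single directory chosen up front, the loader could always read from that directory, and the `if gtfs_path is None` branch might no longer be needed.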
@@ -261,7 +273,6 @@ def save_timetables(self, save_to_s3=False, skip_existing=False):
agency_id = self.agency_id

dates_map = self.get_services_by_date()

#
# Typically, many dates have identical scheduled timetables (with times relative to midnight on that date).
# Instead of storing redundant timetables for each date, store one timetable per route for each unique set of service_ids.
@@ -1078,4 +1089,4 @@ def save_routes(self, save_to_s3, d):

routes = [routeconfig.RouteConfig(agency_id, route_data) for route_data in routes_data]

routeconfig.save_routes(agency_id, routes, save_to_s3=save_to_s3)
routeconfig.save_routes(agency_id, routes, save_to_s3=save_to_s3, gtfs_date=d)
19 changes: 14 additions & 5 deletions backend/models/routeconfig.py
@@ -1,4 +1,5 @@
import re, os, time, requests, json, boto3, gzip
from pathlib import Path
from . import util, config

DefaultVersion = 'v3a'
@@ -121,8 +122,13 @@ def get_directions_for_stop(self, stop_id):
for s in direction['stops'] if s == stop_id
]

def get_cache_path(agency_id, version=DefaultVersion):
return f'{util.get_data_dir()}/routes_{version}_{agency_id}.json'
def get_cache_path(agency_id, version=DefaultVersion, gtfs_date=None):
if gtfs_date == None:
return f'{util.get_data_dir()}/routes_{version}_{agency_id}.json'

return f'{util.get_data_dir()}/routes_{version}_{agency_id}-{gtfs_date}/routes_{version}_{agency_id}-{gtfs_date}.json'



def get_s3_path(agency_id, version=DefaultVersion):
return f'routes/{version}/routes_{version}_{agency_id}.json.gz'
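
The dated branch of `get_cache_path` nests the file in a directory of the same name. A small sketch of the same naming scheme is given below, written with `is None` rather than `== None` and with the data directory passed in explicitly. `routes_cache_path`, `DEFAULT_VERSION`, and `data_dir` are illustrative names, not the PR's.

```python
from typing import Optional

DEFAULT_VERSION = 'v3a'  # mirrors DefaultVersion in routeconfig.py


def routes_cache_path(data_dir: str, agency_id: str,
                      version: str = DEFAULT_VERSION,
                      gtfs_date: Optional[object] = None) -> str:
    """Build the local routes JSON path, optionally versioned by date.

    gtfs_date may be a datetime.date or a 'YYYY-MM-DD' string; it is
    interpolated with str(), as in the diff.
    """
    name = f'routes_{version}_{agency_id}'
    if gtfs_date is None:
        return f'{data_dir}/{name}.json'
    dated = f'{name}-{gtfs_date}'
    # Dated files live in a directory that carries the same dated name.
    return f'{data_dir}/{dated}/{dated}.json'
```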
@@ -179,14 +185,17 @@ def get_route_config(agency_id, route_id, version=DefaultVersion):
return route
return None

def save_routes(agency_id, routes, save_to_s3=False):
def save_routes(agency_id, routes, save_to_s3=False, gtfs_date=None):
data_str = json.dumps({
'version': DefaultVersion,
'routes': [route.data for route in routes]
}, separators=(',', ':'))

cache_path = get_cache_path(agency_id)

cache_path = get_cache_path(agency_id, gtfs_date=gtfs_date)
cache_dir = Path(cache_path).parent
if not cache_dir.exists():
cache_dir.mkdir(parents = True, exist_ok = True)

with open(cache_path, "w") as f:
f.write(data_str)
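
Because `mkdir` is called with `exist_ok=True`, the preceding `exists()` check is not strictly needed. A minimal sketch of the write step under that assumption is shown here; `write_routes_cache` is a hypothetical name, and `routes_data` stands in for the list of `route.data` dicts.

```python
import json
from pathlib import Path
from typing import Any, List


def write_routes_cache(cache_path: str, routes_data: List[Any],
                       version: str = 'v3a') -> Path:
    """Serialize routes compactly and write them, creating parent dirs."""
    data_str = json.dumps({
        'version': version,
        'routes': routes_data,
    }, separators=(',', ':'))

    path = Path(cache_path)
    # exist_ok=True makes this a no-op when the directory is already there.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(data_str)
    return path
```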

71 changes: 61 additions & 10 deletions backend/save_routes.py
@@ -1,7 +1,8 @@
from models import gtfs, config
from models import gtfs, config, util
from compute_stats import compute_stats_for_dates
import argparse
from datetime import date
from datetime import date, datetime, timedelta
import os

# Downloads and parses the GTFS specification
# and saves the configuration for all routes to S3.
@@ -32,41 +33,91 @@
#}
#
#
# Currently the script just overwrites the one S3 path, but this process could be extended in the future to
# store different paths for different dates, to allow fetching historical data for route configurations.
#
# When no date is provided, the script just overwrites the one S3 path
# representing the most recent active GTFS that an agency has made available.
# Providing the date adds -YYYY-MM-DD to the routes path,
# which would allow the backend to use versioned route files.


def get_recentmost_date_qualified_gtfs_path(gtfs_date):
'''
Find most recent zip file before gtfs_date.
recentmost_date_qualified_zip_file is:
"date qualified" and "recentmost"

"date qualified" means the date of the file is no later than the date
argument given.

"recentmost" means it is the most recent file that qualifies.
'''

recentmost_date_qualified_zip_file = ""
recentmost_date_qualified_date = gtfs_date
smallest_timedelta_so_far = timedelta.max
for candidate_zip_file in os.listdir(util.get_data_dir()):
if f'gtfs-{agency.id}-' in candidate_zip_file and '.zip' in candidate_zip_file:
candidate_year = candidate_zip_file.split('-')[2]
candidate_month = candidate_zip_file.split('-')[3]
candidate_day = candidate_zip_file.split('-')[4].split(".zip")[0]
candidate_date_string = candidate_year+"-"+candidate_month+"-"+candidate_day
candidate_date = datetime.strptime(candidate_date_string,"%Y-%m-%d").date()
if candidate_date - gtfs_date <= smallest_timedelta_so_far and candidate_date <= gtfs_date:
recentmost_date_qualified_date = candidate_date
recentmost_date_qualified_zip_file = candidate_zip_file

gtfs_date = recentmost_date_qualified_date
gtfs_path = f'{util.get_data_dir()}/{recentmost_date_qualified_zip_file}'
if recentmost_date_qualified_zip_file == "":
print("an active GTFS for this date was not found")
raise SystemExit
return gtfs_path, gtfs_date
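
As written, `smallest_timedelta_so_far` is never updated and the function reads `agency` from module scope, so the file chosen appears to depend on the order in which `os.listdir` returns names. A sketch of the selection step that keeps the stated intent (the latest zip dated on or before the requested date) is shown below. It assumes files are named `gtfs-<agency_id>-YYYY-MM-DD.zip`, as in the diff; `find_latest_gtfs_zip` is a hypothetical name, and the caller passes in the directory listing and the agency id.

```python
import re
from datetime import date, datetime
from typing import Iterable, Optional, Tuple


def find_latest_gtfs_zip(filenames: Iterable[str], agency_id: str,
                         target_date: date) -> Optional[Tuple[str, date]]:
    """Return (filename, date) of the newest zip dated <= target_date.

    Returns None when no file qualifies. Hypothetical helper.
    """
    # Anchor on the full name so agency ids containing '-' still parse.
    pattern = re.compile(
        rf'^gtfs-{re.escape(agency_id)}-(\d{{4}}-\d{{2}}-\d{{2}})\.zip$')

    best: Optional[Tuple[str, date]] = None
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            continue
        try:
            candidate = datetime.strptime(match.group(1), '%Y-%m-%d').date()
        except ValueError:
            continue  # digits in the right shape but not a real date

        # "Date qualified": not later than the requested date.
        # "Recentmost": later than any qualifying date seen so far.
        if candidate <= target_date and (best is None or candidate > best[1]):
            best = (name, candidate)
    return best
```

The caller could pass `os.listdir(util.get_data_dir())` as `filenames` and treat a `None` result as the "active GTFS for this date was not found" case.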

if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Save route configuration from GTFS and possibly Nextbus API')
parser.add_argument('--agency', required=False, help='Agency ID')
parser.add_argument('--s3', dest='s3', action='store_true', help='store in s3')
parser.add_argument('--timetables', dest='timetables', action='store_true', help='also save timetables')
parser.add_argument('--scheduled-stats', dest='scheduled_stats', action='store_true', help='also compute scheduled stats if the timetable has new dates (requires --timetables)')
parser.add_argument('--date', required=False)
parser.set_defaults(s3=False)
parser.set_defaults(timetables=False)
parser.set_defaults(scheduled_stats=False)
parser.set_defaults(gtfs_date=None)

args = parser.parse_args()

agencies = [config.get_agency(args.agency)] if args.agency is not None else config.agencies

save_to_s3 = args.s3
d = date.today()
gtfs_date = args.date

errors = []

for agency in agencies:
scraper = gtfs.GtfsScraper(agency)
scraper.save_routes(save_to_s3, d)

if gtfs_date is None:
# save the normal way, downloading the most recent GTFS file
gtfs_date=date.today()
gtfs_path = None
else:
# save with date suffix, using the GTFS file provided
gtfs_date=datetime.strptime(gtfs_date, "%Y-%m-%d").date()
gtfs_path = f'{util.get_data_dir()}/gtfs-{agency.id}-{gtfs_date}.zip'

gtfs_path, gtfs_date = get_recentmost_date_qualified_gtfs_path(gtfs_date)

# save the routes
scraper = gtfs.GtfsScraper(agency, gtfs_path=gtfs_path)
scraper.save_routes(save_to_s3, gtfs_date)
errors += scraper.errors


if args.timetables:
timetables_updated = scraper.save_timetables(save_to_s3=save_to_s3, skip_existing=True)

if timetables_updated and args.scheduled_stats:
dates = sorted(scraper.get_services_by_date().keys())
compute_stats_for_dates(dates, agency, scheduled=True, save_to_s3=save_to_s3)

errors += scraper.errors

if errors:
raise Exception("\n".join(errors))
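
In the script above, the `--date` string is parsed by hand inside the agency loop, and the `gtfs_date` default set on the parser does not correspond to any declared argument. A possible alternative, sketched here with hypothetical `parse_iso_date` and `build_parser` helpers, is to let `argparse` do the conversion so that `args.date` is either `None` or a `datetime.date`. The help text for `--date` is an assumption.

```python
import argparse
from datetime import date, datetime


def parse_iso_date(value: str) -> date:
    """argparse type converter for YYYY-MM-DD strings (hypothetical helper)."""
    try:
        return datetime.strptime(value, '%Y-%m-%d').date()
    except ValueError:
        # ArgumentTypeError makes argparse print a usage error and exit.
        raise argparse.ArgumentTypeError(
            f'expected YYYY-MM-DD, got {value!r}')


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description='Save route configuration from GTFS and possibly Nextbus API')
    parser.add_argument('--agency', required=False, help='Agency ID')
    parser.add_argument('--date', type=parse_iso_date, default=None,
                        help='use the archived GTFS active on this date (YYYY-MM-DD)')
    return parser
```

A malformed value is then rejected before any agency is processed, rather than raising partway through the loop.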