Baseline - Predicting Replaced (2nd choice) mode with logistic regression #1087

Open
Abby-Wheelis opened this issue Sep 4, 2024 · 19 comments

Comments

@Abby-Wheelis
Member

Abby-Wheelis commented Sep 4, 2024

There are two main components to predicting mode choice with a choice model:

  1. The choice model itself, representing people's preferences, e.g. Abby's preferences for travel are: (time: -2, cost: -5, fun: 10)
  2. The possible modes for the trip, e.g. [Car: (cost=$15, time=10min, fun=0), E-bike: (cost=$1, time=15min, fun=1), Walk: (cost=$0, time=60min, fun=-1)]

With these factors we can predict that Abby would choose the e-bike (utility -25), would choose the car (-95) if the e-bike were not available, and would not choose to walk (-130), approximately.
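The arithmetic in this toy example can be reproduced with a short sketch. The weights and mode attributes are the illustrative numbers from the example; the function names are made up here and do not refer to any existing code.

```python
# Toy numbers from the example above; not fitted values.
PREFERENCES = {"time": -2, "cost": -5, "fun": 10}

ALTERNATIVES = {
    "car": {"cost": 15, "time": 10, "fun": 0},
    "e-bike": {"cost": 1, "time": 15, "fun": 1},
    "walk": {"cost": 0, "time": 60, "fun": -1},
}


def utility(preferences, attributes):
    """Linear utility: sum of (preference weight * attribute value)."""
    return sum(weight * attributes[name] for name, weight in preferences.items())


def rank_modes(preferences, alternatives):
    """Return mode names sorted from highest to lowest utility."""
    return sorted(
        alternatives,
        key=lambda mode: utility(preferences, alternatives[mode]),
        reverse=True,
    )


ranking = rank_modes(PREFERENCES, ALTERNATIVES)
chosen = ranking[0]
# The "replaced" (2nd choice) mode is the best option once the chosen mode is removed.
replaced = ranking[1]
```

Removing the chosen mode from `ALTERNATIVES` and ranking again gives the same second choice, which is the quantity the replaced-mode question is trying to capture.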

As a baseline, we want to build a logistic regression model, since that is what is most commonly used in research and planning to model mode choice (i.e., what would the ridership returns on this transit investment be?).
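One common formulation of such a model is the multinomial logit, where utilities are turned into choice probabilities with a softmax. The sketch below assumes that formulation and uses hypothetical utilities, not fitted coefficients; it only illustrates how dropping an alternative redistributes probability to the remaining modes.

```python
import math


def choice_probabilities(utilities):
    """Multinomial logit: P(m) = exp(V_m) / sum_k exp(V_k)."""
    max_u = max(utilities.values())  # subtract the max for numerical stability
    exps = {mode: math.exp(u - max_u) for mode, u in utilities.items()}
    total = sum(exps.values())
    return {mode: e / total for mode, e in exps.items()}


# Hypothetical utilities for one trip (assumed values for illustration).
utilities = {"car": -1.0, "e-bike": -0.5, "walk": -3.0}

probs = choice_probabilities(utilities)

# Predict the replaced mode by removing the mode of interest and re-normalizing.
without_ebike = choice_probabilities(
    {mode: u for mode, u in utilities.items() if mode != "e-bike"}
)
```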

We have ground truth data about 2nd choice modes, through the replaced mode collected by programs that have a mode of interest, often e-bike. This is used to show the impact of the mode of interest, through things like emissions savings/reductions which we map on the public dashboard.

To build up the alternatives, we'll need a few different pieces of data, which could be complex to figure out:

  • what modes are available:
    • initial demographic survey asks people about their options - "do you have a license" "what modes are available to you"
    • we can check for transit availability: NTD? What does MEP use? Google or other API?
      • same method as Jack implemented for carbon/energy? (No, because there are buses in Golden that could get me to work but not to the climbing gym; we need routing)
  • cost:
    • cars - reimbursement rate to account for amortized ownership/maintenance cost
    • transit - what does MEP use? Does NTD have cost data? Google or other API?
    • bikes/ebikes?
    • shared micromobility?
  • time:
    • use overpass to query OSM?
    • what does MEP use?
    • general approximation factors?
  • any other factors? - likely something to pay attention to in the literature

@shankari @jpfleischer for visibility

@shankari
Contributor

shankari commented Sep 4, 2024

FYI, I think that the uprm-civic also has replaced mode (scootershare)

@jpfleischer

jpfleischer commented Sep 5, 2024

Hi everyone!
You currently use overpass-api.de but @Abby-Wheelis you said you need routing.
Please find this transitland route in Denver CO https://www.transit.land/routes/r-9xj3-h
Is this what we need?

"public transport routing ... requires timetable data to work properly, and OSM doesn't have that."
https://www.reddit.com/r/openstreetmap/comments/v914h0/comment/ibttco1/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

EDIT: I found that OSM does this
for example

[out:json];
area[name="Gainesville"]->.searchArea;
(
  relation["route"="bus"](area.searchArea);
);
out body;
>;
out skel qt;

https://overpass-turbo.eu

Realtime GTFS

Some places provide realtime data such as boston MBTA https://www.mbta.com/developers/gtfs-realtime
Can we find a free opensource resource that gives all available GTFS Realtime data sources worldwide?

https://github.com/MobilityData/awesome-transit?tab=readme-ov-file
https://mobilitydatabase.org/feeds/mdb-1602
https://mobilitydata.github.io/mobility-feed-api/SwaggerUI/index.html
https://docs.opentripplanner.org/en/latest/

@shankari
Contributor

shankari commented Sep 5, 2024

@jpfleischer I meant we need routing in the sense of:

I went from my house to the drugstore by car. I want to be able to run a query (ideally via API) that will give me the time and cost of the alternatives, e.g. the equivalent of this:
[Screenshot 2024-09-05 at 1 57 53 PM]

but with cost included.

OSM has transit data, and we query it through overpass for mode detection (look at `emission/net/ext_services`), but it doesn't do routing. OSM-based routing services such as OSRM or GraphHopper typically do not support transit, so we cannot use them to find transit alternatives.

[Screenshot 2024-09-05 at 1 59 53 PM]

There is an open-source routing engine that takes transit into account (Open Trip Planner)
https://opentransitsoftwarefoundation.org/

We are friends with the OTP folks and have tried using their software before. But for us to use this in a production system, somebody still needs to run the software, load the data, keep it updated, etc. Ideally, there would be an overpass-like system that we could use for routing and that we could pay for if needed. But I am not sure that such a Google Maps alternative exists.

"Can we find a free opensource resource that gives all available GTFS Realtime data sources worldwide?"

transit.land is intended to do that, at least for the US. But somebody needs to load that data.

@shankari
Contributor

shankari commented Sep 5, 2024

One final comment on this: wrt the framing of this problem, we have discussed how there are people's preferences (which are related to the person) and the alternatives (which are related to the environment)

So the same person may make different choices in a different environment (e.g. @jpfleischer taking transit in Boston but not in FL) even though their internalized preferences have not changed.

Just wanted to highlight the flip side of that, which is that different people can have different preferences. While @jpfleischer would not ever take the bus in FL, there are clearly people who do (otherwise, the bus system would have shut down).

For the replaced mode project, we want to understand individual or group preferences, specifically as a set of factors that influence their (assumed rational) choices. We can then apply those preferences to a different set of alternatives (new transit line, no e-bike available, parking restrictions...) and get a sense of how they will behave, and by extension, what the impact of the modification to the alternatives is.

@Abby-Wheelis
Member Author

@jpfleischer Here is a PR related to the NTD data processing and integration for energy and emissions, maybe similar methods would allow us to extract transit cost? e-mission-common PR

I think the notebooks in metrics/footprint/.archive could be a good place to start

@jpfleischer

@Abby-Wheelis
Average fare collected per passenger is a column here https://data.transportation.gov/Public-Transit/2022-NTD-Annual-Data-Metrics/ekg5-frzt/explore

@Abby-Wheelis
Member Author

For frequency - NTD glossary defines "Headway" as "The time interval between vehicles moving in the same direction on a particular route. Can be found in: S-10" - now if I can just figure out where S-10 is...

@Abby-Wheelis
Member Author

S-10 is a form that agencies fill out for reporting to NTD: the 2023 version here includes many of the fields that we saw in the data table with time periods and when they are active (AM peak, Sunday, etc) but I don't see "Headway" in the form or the data table, unfortunately

@Abby-Wheelis
Member Author

I have not been able to find service frequency or headway, but I did find a paper (from 2011) describing methods for evaluating performance using NTD data, "System for Transit Performance Analysis Using the National Transit Database", notably:

Average Headway (in minutes). This is an important measure of service frequency. It is computed by first dividing the total directional route mileage from Form S-10 by the system’s calculated average speed, as defined above, to obtain an estimate of the number of hours it takes to traverse the entire system’s total route miles. This time (in hours) is then divided by the system’s average weekday total vehicles from Form S-10 to determine the amount of time in hours it takes for a vehicle to complete its portion of the total route miles, one time. The resulting time is then multiplied by 60 for conversion from hours to minutes.

@jpfleischer

The way to get headway

It is true that GTFS agencies publish their stop times a lot more frequently than they publish their fares. However, as @Abby-Wheelis has found, there is a documented way to discover headway within a paper, and it will be more straightforward to apply such logic (after verifying its accuracy and reasoning).

It would be quite complicated to get the stop times also because there is no NTD ID in the GTFS data, only the stop coordinates, so we would have to add logic to convert coordinates to UACE.

We may consider comparing both options if time allows, but for now we will just do the NTD headway calculation.

@Abby-Wheelis
Member Author

A few new notes from our meeting today:

  • the fare information has been added to the existing JSON files!
  • the next step for fare information is to add functionality to pull the information for a given trip - it seems feasible to add this alongside the energy and emissions intensities as "cost intensity" - @jpfleischer feel free to chime in with more specifics about your plan
  • thinking ahead to headway -- the "formula" from the paper is quite complex and relies on a number of columns, including revenue miles. "Train" and "passenger" revenue miles have separate columns, but some rail mode entries have values in both columns that don't match. Which should take precedence?
  • we also need to consider data formatting for headway:
    • ideally, we would find the headway by mode and time of day - which is listed in the table
    • but this gets hard to aggregate, e.g. all 5 agencies that operate buses in Seattle define "AM peak" differently
    • we could just aggregate by named time of day, assume "AM Peak" means the same thing to everyone, and store something like headway: {"am_peak":15, "sunday":90 ...} for each mode/agency
    • Alternately we can just use the "total annual" numbers as a general average - this reduces our ability to catch things like "Abby would take the bus on an AM peak (20 mins) but not a Sunday (1 hr) in Denver" but would catch "Abby will take the bus in Denver because they generally come fairly often (avg 30 mins), but not in Danville KY because they only ever run every 60 minutes (also only run 1 day a week)"

@Abby-Wheelis
Member Author

Given that the pseudoformula I found in the paper is fairly complicated, I wanted to think it through as a sanity check, to make sure we agree with it before trying to implement it:

Average Speed. This is the average speed of vehicles in revenue service operation (i.e., not including travel to and from the garage or any other deadhead) and is calculated by dividing the total actual vehicle (for non-rail modes) or train (for rail modes) revenue miles by the total actual vehicle/train revenue hours. Both of the variables come from Form S-10.
Average Headway (in minutes). This is an important measure of service frequency. It is computed by first dividing the total directional route mileage from Form S-10 by the system’s calculated average speed, as defined above, to obtain an estimate of the number of hours it takes to traverse the entire system’s total route miles. This time (in hours) is then divided by the system’s average weekday total vehicles from Form S-10 to determine the amount of time in hours it takes for a vehicle to complete its portion of the total route miles, one time. The resulting time is then multiplied by 60 for conversion from hours to minutes.

speed = revenue miles / revenue hours

headway = (directional mileage / speed) / num vehicles
= time taken to cover entire system / num vehicles
= how long it would take one vehicle to complete a lap? so how often it arrives?

I'm not sure I've gotten my head wrapped around this formula; if anyone sees it differently, please let me know how we should interpret it.
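To make the units easier to check, here is a minimal sketch of the calculation as the paper describes it. The parameter names are made up for readability and are not NTD column names, and the example inputs are invented round numbers, not real agency data.

```python
def average_speed_mph(revenue_miles, revenue_hours):
    """Average speed in revenue service: revenue miles / revenue hours."""
    if revenue_hours <= 0:
        raise ValueError("revenue_hours must be positive")
    return revenue_miles / revenue_hours


def average_headway_minutes(directional_route_miles, revenue_miles, revenue_hours, vehicles):
    """Headway estimate following the paper's description.

    1. hours to traverse all directional route miles at the average speed
    2. divided by the number of vehicles operating
    3. converted from hours to minutes
    """
    if vehicles <= 0:
        raise ValueError("vehicles must be positive")
    speed = average_speed_mph(revenue_miles, revenue_hours)
    hours_to_traverse_system = directional_route_miles / speed
    hours_per_vehicle = hours_to_traverse_system / vehicles
    return hours_per_vehicle * 60


# Invented inputs: 12 mph average speed, 300 directional route miles, 50 vehicles.
example_headway = average_headway_minutes(
    directional_route_miles=300,
    revenue_miles=1_200_000,
    revenue_hours=100_000,
    vehicles=50,
)
```

With these inputs the system takes 25 hours to traverse at 12 mph, so 50 evenly spaced vehicles would pass a given point roughly every 30 minutes. That reading assumes vehicles are spread evenly over the route miles, which is one way to interpret the formula rather than a confirmed one.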

@jpfleischer

Abby is right: we have reused the preexisting NTD script and its logic to add fares, fixing a bug along the way to get it to work. For the function that returns a fare for a given coordinate, we are considering:

  • either combining cost with the intensities because the cost is part of the intensities
    line 89 prg metrics/footprint/transit.py-
    proposal is to just say intensities['fare'] = my_fare_variable
    @JGreenlee does this sound reasonable? I will likely make the comment in that repository.
  • OR making a duplicate function (maybe it is repeating ourselves) specifically for fares

I think half of the preexisting function can be generalized, but for now, I will go with the first option.

We also anticipate that, since the fare information is only attached to a UACE, we will calculate a general average fare across the entire UACE, weighted by number of passenger trips, to return fare information for a particular coordinate.
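A minimal sketch of that trip-weighted average is below. The record layout and the keys `average_fare` and `passenger_trips` are assumptions for illustration; they are not the actual field names in the JSON files or in the NTD tables.

```python
def weighted_average_fare(agencies):
    """Average fare across the agencies in one UACE, weighted by passenger trips.

    Returns None when there are no trips to weight by.
    """
    total_trips = sum(agency["passenger_trips"] for agency in agencies)
    if total_trips == 0:
        return None
    total_fare_revenue = sum(
        agency["average_fare"] * agency["passenger_trips"] for agency in agencies
    )
    return total_fare_revenue / total_trips


# Invented numbers: a large agency at $2.00/trip and a small one at $1.00/trip.
example_uace = [
    {"average_fare": 2.0, "passenger_trips": 300},
    {"average_fare": 1.0, "passenger_trips": 100},
]
example_fare = weighted_average_fare(example_uace)
```

Weighting by trips keeps a small agency with unusual fares from dominating the figure for the whole urban area.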

@shankari
Contributor

@Abby-Wheelis @jpfleischer have you started working with the OTP yet? We definitely need travel time as well. I wonder if the OTP API supports any generic queries related to GTFS and headways, similar to https://nycplanning.github.io/td-travelshed/mapbox/public/

@Abby-Wheelis
Member Author

Short-term goals:

  • implement "Get fare by trip" functionality alongside the existing "get emissions/energy for trip" functionality
  • research/test OTP and weigh options - OTP1 / OTP2 / r5 / r5 python wrapper? NYC planning methodology?
  • weigh difference between using OTP and using a previously explored option like OpenRoutePlanner (main con, did not have transit)

@jpfleischer

jpfleischer commented Sep 27, 2024

We now have a mechanism to launch OpenTripPlanner within a docker container and to build an instance for Denver's RTD.

The shortcoming is that the GTFS source has to be specified manually. Since I know how to pull the GTFS links from Mobility Database by state, we can combine the two projects into an automated GTFS fetcher and an automated transit time calculator (fetched from the OTP API on our local docker instance).

JGreenlee/e-mission-common@e012774

The next step is to write the OTP API logic to get transit times for a given pair of coordinates.
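As a rough sketch of what that logic could look like, the code below builds a trip-planning request and reads the shortest duration from the response. It assumes the legacy OTP REST endpoint (`/otp/routers/default/plan`) on a local instance at port 8080 and a response shaped like `{"plan": {"itineraries": [{"duration": <seconds>}, ...]}}`; the endpoint, port, and field names should be checked against the OTP version actually running, since newer releases favor a GraphQL API. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of a local OTP instance; adjust to the actual deployment.
OTP_PLAN_URL = "http://localhost:8080/otp/routers/default/plan"


def build_plan_url(origin, destination, date, time, base_url=OTP_PLAN_URL):
    """Build a plan request URL from (lat, lon) tuples and a departure date/time."""
    params = {
        "fromPlace": f"{origin[0]},{origin[1]}",
        "toPlace": f"{destination[0]},{destination[1]}",
        "date": date,
        "time": time,
        "mode": "TRANSIT,WALK",
    }
    return f"{base_url}?{urlencode(params)}"


def fastest_transit_minutes(plan_response):
    """Return the shortest itinerary duration in minutes, or None if no itinerary."""
    itineraries = plan_response.get("plan", {}).get("itineraries", [])
    if not itineraries:
        return None
    return min(itinerary["duration"] for itinerary in itineraries) / 60


# Example request for a trip in Denver (coordinates chosen for illustration).
url = build_plan_url((39.7392, -104.9903), (39.7508, -104.9966), "2024-09-27", "08:00")
```

The URL would be fetched with any HTTP client and the parsed JSON passed to `fastest_transit_minutes`; keeping the parsing separate from the request makes it testable without a running OTP instance.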

image

An issue is that currently, transit times cannot be calculated for trips from more than several months prior. A potential solution is using Mobility Database to pull historical GTFS.

What are the key findings?

GTFS is bad for fare, great for stop times.
OTP uses GTFS feeds as input-- the developer has to specify these GTFS feeds and provide them.
OTP is great to serve as a local calculator and provider of transit times. No reliance on external API or website needed.
OTP even appears to use OneBusAway in its logic.

@jpfleischer

jpfleischer commented Sep 28, 2024

Mobility Database only has records for RTD Denver dating back to Feb 2024: https://mobilitydatabase.org/feeds/mdb-178

However, I downloaded the GTFS using the Wayback Machine and then successfully calculated trips for 2022.
The URL was the same in 2022 as it is now, so I was able to add the archived feed to my OTP instance.

image

...however, it is much more straightforward to use the older version of Mobility Database, called OpenMobilityData,
https://transitfeeds.com/p/rtd-denver/188?p=26 to get the old GTFS files.

@jpfleischer

With ~60 GTFS zip files, OTP takes many hours to fully start up.
The same is likely true for r5, as its Python wrapper is shown here taking a while to initialize.

image

Possible workarounds are merging all the GTFS files into one with a tool such as gtfsmerge, or using fewer files, e.g. one every three months instead of one per month.
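The second workaround could be sketched as follows: given the dated feeds that were downloaded, keep only the earliest one in each calendar quarter. The `(date, path)` layout and the file names are assumptions for illustration.

```python
from datetime import date


def one_feed_per_quarter(feeds):
    """Keep the earliest feed in each calendar quarter.

    feeds: iterable of (feed_date, path) tuples.
    Returns the selected tuples in chronological order.
    """
    selected = {}
    for feed_date, path in sorted(feeds):
        quarter = (feed_date.year, (feed_date.month - 1) // 3 + 1)
        # setdefault keeps the first (earliest) feed seen for each quarter
        selected.setdefault(quarter, (feed_date, path))
    return [selected[quarter] for quarter in sorted(selected)]


# Hypothetical monthly feeds.
monthly_feeds = [
    (date(2022, 2, 1), "rtd-2022-02.zip"),
    (date(2022, 1, 1), "rtd-2022-01.zip"),
    (date(2022, 4, 1), "rtd-2022-04.zip"),
    (date(2022, 3, 1), "rtd-2022-03.zip"),
]
quarterly_feeds = one_feed_per_quarter(monthly_feeds)
```

The trade-off is that a trip is then matched against a schedule that can be up to three months out of date.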

@jpfleischer

jpfleischer commented Oct 7, 2024

The most centralized status of this objective is located at the README.md at my e-mission-common fork:
https://github.com/jpfleischer/e-mission-common/blob/03b789d344ab9d55ec1d6b6bd668262e33003401/scripts/otp/README.md?plain=1#L1-L14

The most crucial aspect is using historical GTFS data. Right now that data lives on AWS servers belonging to OpenMobilityData. The website has a banner on its front page declaring that it is deprecated. I am hoping that this data does not disappear because it is quite crucial.

It would be ideal, upon returning to this objective, to save a json of all the agencies, as in:

rtd-denver: https://transitfeeds.com/p/rtd-denver/188
miami-dade-county-transit: https://transitfeeds.com/p/miami-dade-county-transit/48
# .... and so on .....

We scrape because OpenMobilityData no longer gives out API keys.

The URL value in the key-value pairing (taken from above) is needed, and the existing logic in https://github.com/jpfleischer/e-mission-common/blob/master/scripts/otp/scrape.ipynb takes care of the rest.
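A minimal sketch of saving and reloading such a mapping is shown below. The two entries are the ones listed above; the file name and function names are hypothetical, and the scraping itself stays in the existing notebook.

```python
import json

# Agency slug -> OpenMobilityData feed page (entries taken from the list above).
AGENCY_FEED_PAGES = {
    "rtd-denver": "https://transitfeeds.com/p/rtd-denver/188",
    "miami-dade-county-transit": "https://transitfeeds.com/p/miami-dade-county-transit/48",
}


def save_agencies(path, agencies=AGENCY_FEED_PAGES):
    """Write the agency -> feed page mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(agencies, f, indent=2, sort_keys=True)


def load_agencies(path):
    """Read the agency -> feed page mapping back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `save_agencies("agencies.json")` once and `load_agencies("agencies.json")` afterwards would let the list of agencies grow without touching the scraping code.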

We could ideally get more agencies for Colorado but as of now, we only use RTD Denver.
