
Paper Reproducibility Changes #102

Open · wants to merge 62 commits into base: main
Conversation

Abby-Wheelis
Member

As I am going through the charts in the paper to polish them up, I am also taking the time to organize, document, and check in the code used to produce those results. This maintains transparency for future researchers who might want to reproduce our results.

Added a DataFiltering Notebook:

  • This notebook compiles the data treatments performed by Cemal in the analysis notebook that generated many of the charts used in the paper.
  • Its purpose is to make the results in our paper reproducible: it takes in the raw set of trips as a CSV, applies all data treatments, and saves the results to be loaded into the analysis notebooks.
  • Note that I have not yet had a chance to verify that this works on the data from TSDC, but when run on the raw file Cemal gave me it does yield the participant and trip counts we quote, both in aggregate and at the program level.
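As a rough sketch of the filtering flow described above, here is a minimal example with synthetic data. The column names (`mode_confirm`, `start_ts`), the `pilot_ebike` label, and the specific treatments shown are assumptions for illustration, not the notebook's actual schema:

```python
import pandas as pd

# Tiny synthetic example -- real runs read the raw trips CSV instead.
trips = pd.DataFrame({
    "user_id":      ["a", "a", "a", "b"],
    "mode_confirm": ["walk", "pilot_ebike", "walk", None],
    "start_ts":     [1, 2, 3, 4],
})

# Treatment 1: drop trips with no labeled mode.
trips = trips.dropna(subset=["mode_confirm"])

# Treatment 2: keep only trips on/after each user's first e-bike trip;
# users with no e-bike trip at all fall out of the dataset.
first_ebike = (trips[trips["mode_confirm"] == "pilot_ebike"]
               .groupby("user_id")["start_ts"].min())
trips = trips[trips["start_ts"] >= trips["user_id"].map(first_ebike)]

# The filtered results would then be saved for the analysis notebooks:
# trips.to_csv("filtered_trips.csv", index=False)
```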

Analysis Notebooks: planning for one with non-spatial data and one to work with spatial data. I will update as I make these changes, since the plan may change depending on the data formats.

Another note: much of this code is coming from a previous researcher who worked on the paper, Cemal Akcicek. My work is focused on organizing and polishing.

@Abby-Wheelis
Member Author

This is ending up being VERY tricky and confusing. The goal is for the results and charts we show in the paper to be 100% reproducible from the TSDC data: open-source data and scripts that allow for full transparency and reproducibility. This will not only benefit the credibility of this paper, but will hopefully lay the groundwork to make analysis of other future OpenPATH programs archived in the TSDC easy and accessible.

However, the data originally used to generate the paper is not the same as what the TSDC will be providing. The column names in the CSVs are almost all different, and the TSDC is redacting a fair bit of information. The problematic columns being redacted (so far) include Age (used for some analysis of the effect of age on e-bike usage) and trip timestamps (a key datapoint for calculating when someone's first e-bike trip was, and then removing all data before that point).

I've been spinning my wheels for a little more than a week now, trying to reverse-engineer a way to clean and filter the TSDC data so that it matches the process we outline in the paper and the numbers that process yielded. I'm hoping it will help to write out my thought process a bit more; I've been doing that in a scattered manner so far, but should really keep it here.

If I'm able to get the data to "match", the next hurdle is that a fair number of the charts rely on the data being loaded into the database, which I personally can't do because my computer can't process that much data at once without completely wigging out. Even if I could (there are some parts of the dataset that I won't need for this high-level analysis, so I could remove them and try loading a smaller subset), that's not the format the TSDC is providing. So I'll then need to either A) reconcile the chart-generating tactics with different data sourcing, or B) wrestle the "matched" data into a zipped file compatible with the database loading.

I'll keep updating here with my thoughts and progress as I continue to try different things.

@Abby-Wheelis
Member Author

One issue I've encountered is the lack of the Mode_confirm, etc. columns in the TSDC data. @shankari pointed out that these columns come from the dictionaries in viz_scripts/auxillary_files and are leveraged by viz_scripts/scaffolding.py. I examined the code in scaffolding.py and ended up extracting the following lines from that file in order to use them in the data cleaning process I'm trying to develop:

# first, add the cleaned mode
data['Mode_confirm'] = data['data_user_input_mode_confirm'].map(dic_re)

# second, add the cleaned replaced mode (ASSUMES PROGRAM)
data['Replaced_mode'] = data['data_user_input_replaced_mode'].map(dic_re)

# third, add the cleaned purpose
data['Trip_purpose'] = data['data_user_input_purpose_confirm'].map(dic_pur)

I did not simply use the functions in scaffolding.py because they assume the database paradigm, while I am working from CSVs.
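For anyone reconciling counts later, the key behavior of `Series.map` with a dictionary is that labels missing from the dictionary become NaN. A small self-contained sketch (the two-entry `dic_re` and the sample labels here are stand-ins, not the real dictionary from viz_scripts/auxillary_files):

```python
import pandas as pd

# Hypothetical stand-in for the dic_re mapping dictionary;
# the real one in viz_scripts/auxillary_files is much larger.
dic_re = {"drove_alone": "Gas Car, drove alone",
          "pilot_ebike": "E-bike"}

data = pd.DataFrame({"data_user_input_mode_confirm":
                     ["drove_alone", "pilot_ebike", "hot_air_balloon"]})

data["Mode_confirm"] = data["data_user_input_mode_confirm"].map(dic_re)

# Any label not in the dictionary maps to NaN, which is worth
# counting when checking numbers against the paper.
unmapped = data["Mode_confirm"].isna().sum()
```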

@shankari
Contributor

shankari commented Dec 6, 2023

I am fine with this for now, but we should revisit this whole mapping when we re-do the energy/emissions work.

In general, we should implement scaffolding as an abstract class with two concrete subclasses. We already use something similar for the abstract timeseries, and since we use the base image anyway, we can switch to that standard approach and add a new implementation instead of having 5 different implementations for data access.

Abstractions FTW!
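The abstract-class idea could look something like the sketch below: one `load_trips()` interface, with one concrete class per data source. All names here (`Scaffolding`, `CsvScaffolding`) are hypothetical, not existing code in this repo:

```python
from abc import ABC, abstractmethod
import io
import pandas as pd

class Scaffolding(ABC):
    """Hypothetical abstract base: one interface, concrete per data source."""
    @abstractmethod
    def load_trips(self) -> pd.DataFrame:
        ...

class CsvScaffolding(Scaffolding):
    """Reads TSDC-style CSV exports (path or file-like object)."""
    def __init__(self, source):
        self.source = source

    def load_trips(self) -> pd.DataFrame:
        return pd.read_csv(self.source)

# A DbScaffolding subclass would wrap the existing database paradigm
# behind the same load_trips() interface.

demo = CsvScaffolding(io.StringIO("user_id,mode\nu1,walk\n"))
trips = demo.load_trips()
```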

@Abby-Wheelis
Member Author

Another issue is that the numbers are not matching up. When I read in the TSDC data, I read it in program by program, matching the confirmed trips with the sociodemographic data, then concatenating the merged data into a single dataframe so that I end up with all of the merged data together. But the number of users in this dataset is lower than I would expect.

As I accumulate the data, the programs have 13, 47, 29, 14, 14, and 9 users, for a total of 126 users. That seems a little low for having only merged sociodemographic data and done no other cleaning, but it is still > 122, so OK. After all the data is put together, however, there are only 112 unique IDs. This is a problem.

I'm worried that some of the users in different programs ended up with the same random ID as a result of the data cleaning process. I'm going to work on a way to prevent this, maybe by appending the program name to the beginning of the ID before I add that program's data to the other programs.
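The collision and the prefixing fix can be shown with a toy example (program names and IDs here are made up; the real per-program dataframes have many more columns):

```python
import pandas as pd

# Two programs whose per-program random IDs happen to collide on "u1".
vail = pd.DataFrame({"user_id": ["u1", "u2"]})
pueblo = pd.DataFrame({"user_id": ["u1", "u3"]})

# Naive concatenation undercounts: 3 unique IDs instead of 4.
naive = pd.concat([vail, pueblo])

# Prefixing the program name keeps IDs unique across programs.
for name, df in [("vail", vail), ("pueblo", pueblo)]:
    df["user_id"] = name + "_" + df["user_id"]
combined = pd.concat([vail, pueblo])
```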

@Abby-Wheelis
Member Author

Abby-Wheelis commented Dec 6, 2023

I'm worried that some of the users in different programs ended up with the same random ID

Yep, I think this is exactly what happened: after adding the program name to the ID, the count after compiling the programs went from 112 to the accurate 126. One step closer to finding the right data cleaning process!

BUT after just part of the filtering we're down to 118 users ... maybe the socio merging is off just a little somehow?

@Abby-Wheelis
Member Author

[Screenshot 2023-12-06 12:21 PM: debug print output of per-program unique ID counts]
Adding more print statements revealed that there were 15 unique IDs in the surveys and 14 unique IDs in the trips, but only 13 once they were merged. That continues for the other programs, dropping just 1-2 users per program. I wonder how that could be happening?

  1. ghost users - people who logged in and filled out a survey but never took any trips
  2. errors in the merging code... I feel like if there were ID-matching issues the gap would be larger than the 5-ish total users

Best guess would be ghost users, but keeping this in mind.
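One way to tell the two explanations apart is an outer merge with `indicator=True`, which labels each user by which side(s) they appear on. A minimal sketch with made-up IDs (the real merge is on whatever the TSDC user ID column is):

```python
import pandas as pd

surveys = pd.DataFrame({"user_id": ["u1", "u2", "u3"]})
trips   = pd.DataFrame({"user_id": ["u1", "u2", "u4", "u4"]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
check = trips.drop_duplicates("user_id").merge(
    surveys, on="user_id", how="outer", indicator=True)

# Users with trips but no survey ("the dip" being investigated):
trips_no_survey = check.loc[check["_merge"] == "left_only", "user_id"]
# Potential ghost users: survey but no trips.
survey_no_trips = check.loc[check["_merge"] == "right_only", "user_id"]
```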

@Abby-Wheelis
Member Author

Wait, ok, so the bigger concern is the dip between the number of users who have trips and the number of users who have both trips and a survey. You can't enter the app without filling out a survey; even if you answer "wish not to say" for every question, we still have a survey record for you.

@shankari is this correct? If so, then we might have a bigger problem, because there are some places where, just reading in the CSVs, there are fewer entries in the surveys CSV than there are unique users in the trip dataset. I don't feel like that should be possible, should it?
[Screenshot 2023-12-06 12:53 PM: debug print output of Vail user and survey counts]

This is saying there were 12 unique users in Vail and 11 entries in the survey list, down to 9 after deduplication, so the number of Vail users dropped from 12 to 9 because 3 did not have a survey entry.
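For reference, the survey deduplication step mentioned above can be done per user with `drop_duplicates`. A sketch assuming a timestamp column (`ts` is a hypothetical name) and a keep-the-latest-response policy:

```python
import pandas as pd

surveys = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],  # u1 submitted twice
    "ts":      [1, 2, 3],
})

# Keep only each user's most recent survey response.
deduped = surveys.sort_values("ts").drop_duplicates("user_id", keep="last")
```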

Abby Wheelis added 10 commits December 7, 2023 13:11
keeping git up to date as I update the paper, changes are messy but need to be kept
loading from the database / showing participation rates

working on centralizing to as few scripts as possible
@Abby-Wheelis
Member Author

Yesterday I discovered that (at least part of) the problem was leading/trailing whitespace on some of the user IDs; fixing that got rid of the problem where I was randomly dropping users in the trip-survey merging process. I'm still working with the TSDC and my data cleaning scripts to verify that, when starting from TSDC data, the preparation process is equivalent to the one we used in the paper.
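The whitespace failure mode is easy to reproduce: `"u1 "` and `"u1"` are different merge keys, so an inner merge silently drops the user. A minimal sketch of the problem and the `str.strip()` fix:

```python
import pandas as pd

trips   = pd.DataFrame({"user_id": ["u1 ", "u2"]})  # note trailing space
surveys = pd.DataFrame({"user_id": ["u1", "u2"]})

# "u1 " != "u1", so the inner merge silently drops that user.
before = trips.merge(surveys, on="user_id")["user_id"].nunique()

# Stripping leading/trailing whitespace before merging restores the match.
trips["user_id"] = trips["user_id"].str.strip()
after = trips.merge(surveys, on="user_id")["user_id"].nunique()
```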

Abby Wheelis added 3 commits December 19, 2023 16:27
mine is not currently working for some reason, erroring out over "within" in one part of the code but not others
very close to the TSDC numbers

all but 8 of the charts are now in this Analysis file
@Abby-Wheelis
Member Author

Abby-Wheelis commented Dec 23, 2023

New status update heading into the holidays:

My TSDC data work file takes in the files that the TSDC will have and outputs similar numbers (same number of users, off by around 1,000 trips). The TSDC data does still have some issues that I can see:

  • Pueblo County seems to be missing around 300 e-bike trips
  • Community Cycles has some messy data - an off-by-one column issue messing up 8-10 trips

My Analysis script has almost all of the charts generated now:

  • added the labeling rate charts - reading from the TSDC now!
  • added more charts from Cemal's notebooks to centralize what's in the paper
  • spatial charts in their own notebook - not running
  • trip mode splits over distance not working great
  • timelines also somewhat broken - could be the data mistakes in CC - affecting timestamps
  • emissions charts I have not yet copied over

Left to address:

  • lingering TSDC issues [should have fixed data in the first few days of 2024]
  • spatial charts (2) [Denver's is a little off still]
  • mode charts (2)
  • time charts (2) [too many axis labels, and does not match the paper plot (could be the data)]
  • emissions charts (2) [waiting to see if data matches]

Abby Wheelis and others added 7 commits December 28, 2023 12:56
working through the energy calculations, might still need some updates to the data used as input until the paper is matched exactly
figured out what the underlying function could have been - was able to use the seaborn library
Abby Wheelis and others added 24 commits March 16, 2024 18:43
discovered some of the header code was no longer needed since we are working with csvs and not mongodump

factored repeated code into a function
these have been replaced by the notebooks in the Abby folder, and are no longer needed as the newer notebooks are compatible with the TSDC data
implemented function, removed unnecessary code, centralized import statements
needed for spatial analysis notebook functions
after refactoring, the outputs now reflect the refined notebooks
This reverts commit 05398e2.
@Abby-Wheelis
Member Author

A large chunk of refactoring is now done on this branch. Based on commentary from @iantei on #118, this PR is now much smaller, with older code eliminated and duplicate code significantly reduced. Changes fall into three folders:

  • muni_boundaries - the shapefiles used for spatial analysis
  • Abby - notebooks with code for processing TSDC data and generating each of the charts in the paper
  • viz_outputs - the data processing, spatial, and analysis charting notebooks

Abby-Wheelis pushed a commit to Abby-Wheelis/em-public-dashboard that referenced this pull request Apr 1, 2024
replaced by work in e-mission#102
Abby-Wheelis pushed a commit to Abby-Wheelis/em-public-dashboard that referenced this pull request Apr 1, 2024
Abby-Wheelis pushed a commit to Abby-Wheelis/em-public-dashboard that referenced this pull request Apr 1, 2024
these are the scripts that I got from Cemal, on which the later work on the paper visualizations in e-mission#102 is based
this way expected outputs can be viewed, separate from the code itself