Congressional Hearing Parser #290
Conversation
@connorjoleary thank you for this - it's a really useful area for us to go in. I'd like to hear from @JoshData about it, in particular about taking on a dependency on ProPublica's API (full disclosure: I currently run that API, but I'm not full-time at ProPublica and I can't guarantee that I'd be able to immediately address errors or downtime in every case). It appears that this PR uses the API to get current members of the House and Senate; I suspect there are other ways to do that (using the congress-legislators repository, for example).
Oh, a very good point. I'd be happy to switch out ProPublica's API for that. It should be a fairly straightforward change, assuming they both use the same IDs.
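For illustration, here is a minimal sketch of what that switch could look like once the congress-legislators data has been loaded. It assumes the record layout of `legislators-current.yaml` as I understand it (an `id` mapping with a `bioguide` key, a `name` mapping, and a `terms` list ordered oldest first); the function name and the sample records are hypothetical and are not taken from this PR.

```python
from collections import defaultdict


def current_members_by_chamber(legislators):
    """Group legislators by the chamber of their most recent term.

    `legislators` is assumed to be the parsed content of
    legislators-current.yaml: a list of dicts with `id`, `name`, and `terms`.
    """
    chambers = {"rep": "house", "sen": "senate"}
    grouped = defaultdict(list)
    for person in legislators:
        latest_term = person["terms"][-1]  # assumes terms are listed oldest first
        chamber = chambers[latest_term["type"]]
        grouped[chamber].append(
            {
                "bioguide": person["id"]["bioguide"],
                "last_name": person["name"]["last"],
                "state": latest_term["state"],
            }
        )
    return dict(grouped)


# Two hand-written records in the assumed shape, for illustration only.
sample = [
    {"id": {"bioguide": "X000001"}, "name": {"first": "Ann", "last": "Example"},
     "terms": [{"type": "rep", "state": "WI"}, {"type": "sen", "state": "WI"}]},
    {"id": {"bioguide": "X000002"}, "name": {"first": "Bob", "last": "Sample"},
     "terms": [{"type": "rep", "state": "OH"}]},
]
members = current_members_by_chamber(sample)
```

Keying on the Bioguide ID keeps the lookup independent of name spelling, which matters if the IDs do turn out to match across sources.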
There's also the new official congress.gov API, which also should have a
list of current members of the House and Senate.
https://www.congress.gov/help/using-data-offsite
Thanks for sharing this, @connorjoleary. Yeah, it would be nice if all of the data were fetched in a consistent way throughout this repository: legislators from congress-legislators, GPO documents from the fdsys scraper. But I won't block it based on that. I'd like incoming code to remain maintained by its maintainer for some reasonable period of time and to be documented in a similar way to other tools in this repo (in the main README and the GitHub wiki section). And if that's the case, there's no need to put things inside a
Thank you all very much for the comments. I'm happy to maintain this code for a while after it is in place, but I do worry that the quality of this code is not up to snuff. The transcripts do not always follow consistent formatting, names are sometimes misspelled, and attributing who is speaking can be difficult (for example, one hearing had two people with the same last name, distinguished only by gender). This means that the output of this hearing parser is not always accurate. With that being said, do you all still feel like this code would fit in alongside everything else?
Hey all, an update on this project: I created a website to easily search and visualize this data. I'll likely continue to make updates to my fork of this repo, as well as use it to collect more text data. Please let me know if you would like this data to be available from this project by approving this PR. Link to website: congresstext.com
Depending on how far back you want to go, you may be interested in #236.
Overview
This project gathers transcripts made available by the US Government Publishing Office and uses them to attribute who said what during congressional hearings. The data can then be used to gather insights on the speaking patterns of each member.
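As a rough illustration of the attribution step, the sketch below splits a transcript into speaker turns. It assumes a speaker turn begins with an indented line of the form "Title Surname. text"; the regular expression, the function name, and the sample text are hypothetical and are not the parser in this PR. Real transcripts are less regular, and keying on title plus surname will not always separate two speakers who share a last name.

```python
import re

# Assumed speaker-turn format: an indented line opening with a title,
# a surname, and a period, e.g. "    Mr. Smith. Thank you."
SPEAKER_RE = re.compile(
    r"^\s+(?P<title>Mr|Mrs|Ms|Dr|Senator|Representative|Chairman|Chairwoman)\.?"
    r"\s+(?P<name>[A-Z][A-Za-z'\-]+)\.\s+(?P<text>.*)$"
)


def split_turns(transcript):
    """Return a list of (speaker, text) pairs in the order they appear."""
    turns = []
    for line in transcript.splitlines():
        match = SPEAKER_RE.match(line)
        if match:
            speaker = f"{match['title']} {match['name']}"
            turns.append([speaker, match["text"].strip()])
        elif turns and line.strip():
            # Continuation line: append to the current speaker's turn.
            turns[-1][1] += " " + line.strip()
    return [tuple(turn) for turn in turns]


# Hand-written sample text in the assumed format, for illustration only.
sample_transcript = (
    "    Mr. Smith. Thank you, Madam Chair.\n"
    "I have two questions.\n"
    "    Senator Jones. Please proceed.\n"
)
turns = split_turns(sample_transcript)
```

Lines that appear before the first recognized speaker are dropped in this sketch; a fuller parser would need to decide how to handle headers and witness lists.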
How to use
1. Follow the installation instructions in the main README to install the correct Python libraries.
2. Go to https://api.govinfo.gov/docs/ and create an API key.
3. Create a .env file in this folder with the key.
4. Run:
   python congress/contrib/congressional_hearing_info/grab_congressional_hearings.py --num 10
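The key setup above might be wired together roughly as follows. This is a sketch under stated assumptions, not the script's actual code: the variable name `GOVINFO_API_KEY` is hypothetical (the name the script expects is not given here), and the collections endpoint, the `CHRG` collection code, and the `offsetMark`, `pageSize`, and `api_key` parameters reflect my reading of the govinfo API docs and should be checked against them.

```python
from pathlib import Path
from urllib.parse import urlencode


def read_env(path):
    """Parse simple KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def hearings_url(api_key, start_date, page_size=10):
    """Build a govinfo collections URL for hearings modified since start_date.

    The endpoint shape and parameter names are assumptions; see the lead-in.
    """
    query = urlencode({"offsetMark": "*", "pageSize": page_size, "api_key": api_key})
    return f"https://api.govinfo.gov/collections/CHRG/{start_date}?{query}"
```

With a .env file containing a line such as `GOVINFO_API_KEY=your-key`, calling `hearings_url(read_env(".env")["GOVINFO_API_KEY"], "2023-01-01T00:00:00Z")` yields a URL that can be passed to any HTTP client.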