Link directly to 990 PDFs #160

Open
4 tasks
hampelm opened this issue May 24, 2017 · 4 comments

Comments


hampelm commented May 24, 2017

As much as we dislike 'em, the 990 PDFs aren't going away. Since they provide so much info and are behind a login wall in many places, it'd be helpful to link directly to them on organization pages.

To do:

  • Find a stable source of them: either clone the S3 bucket or find someone we trust who does (the ProPublica terms might not work for us, but I can reach out to them: https://projects.propublica.org/nonprofits/)
  • Identify a mapping of EIN => 990 by year
  • Import that into our database
  • Include an array of 990 links sorted by year in our org response. I envision something like:
{
  ...org details...
  990s: [{ year: 2015, url: 'https://s3...'}, ...]
}
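
A minimal sketch of how that response could be assembled, assuming the EIN => 990 mapping is available as (ein, year, url) rows. The function names, the `ein` key on the org, and the newest-first ordering are illustrative assumptions, not something decided in this issue:

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical shape of one row in the EIN => 990 mapping: (ein, year, url).
FilingRow = Tuple[str, int, str]


def filings_for_org(ein: str, rows: Iterable[FilingRow]) -> List[Dict]:
    """Collect the 990 links for one EIN, sorted by year (newest first)."""
    matches = [
        {"year": year, "url": url}
        for row_ein, year, url in rows
        if row_ein == ein
    ]
    return sorted(matches, key=lambda filing: filing["year"], reverse=True)


def org_response(org: Dict, rows: Iterable[FilingRow]) -> Dict:
    """Return a copy of the org details with a '990s' array attached."""
    response = dict(org)
    response["990s"] = filings_for_org(org["ein"], rows)
    return response
```

Orgs with no matching rows would get an empty `990s` array rather than a missing key, which keeps the response shape predictable for clients.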

hampelm commented May 27, 2017

The Internet Archive has 900+ ISOs of filings from the IRS, organized by date and type:
https://archive.org/details/IRS990?sort=-publicdate

I opened up a sample. Each ISO contains the PDFs plus a tab-delimited manifest file that lists the file path, EIN, org name, filing type, date, and other metadata.
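
Reading a manifest like that could look roughly like the sketch below. The column order, the field names, and the absence of a header row are all assumptions (I haven't pinned down the exact layout), and the sample line is made up:

```python
import csv
import io

# Assumed column order; the real manifest may differ and has more columns.
MANIFEST_FIELDS = ["path", "ein", "org_name", "filing_type", "filing_date"]


def read_manifest(handle):
    """Parse a tab-delimited manifest into a list of dicts."""
    # QUOTE_NONE so stray quote characters in org names are kept literally.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # zip() stops at the shorter sequence, so extra metadata columns are ignored.
        records.append(dict(zip(MANIFEST_FIELDS, row)))
    return records


# Made-up sample line, only to show the expected shape.
SAMPLE = "some/dir/filing.pdf\t123456789\tExample Org\t990\t2014-12\n"
records = read_manifest(io.StringIO(SAMPLE))
```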


hampelm commented May 27, 2017

I'm hopeful that someone already has these on S3. Otherwise, scripting this won't be that hard and it'll cost us about $1/month to host. We'll just have to find a good way to index them; preserving the existing structure (year+type/ein+year+type.pdf) seems the most straightforward, with the lookup stored in a single flat table.
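
A rough sketch of that layout plus the flat lookup table, using an in-memory SQLite database as a stand-in for our real database. The hyphen separators, the table name, and the column names are assumptions for illustration:

```python
import sqlite3


def s3_key(ein, year, filing_type):
    """Build a key shaped like year+type/ein+year+type.pdf (hyphens are an assumption)."""
    folder = f"{year}-{filing_type}"
    return f"{folder}/{ein}-{folder}.pdf"


def build_lookup(filings):
    """Load (ein, year, filing_type) tuples into a single flat table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE filings (ein TEXT, year INTEGER, filing_type TEXT, s3_key TEXT)"
    )
    conn.executemany(
        "INSERT INTO filings VALUES (?, ?, ?, ?)",
        [(ein, year, ftype, s3_key(ein, year, ftype)) for ein, year, ftype in filings],
    )
    conn.commit()
    return conn


def keys_for_ein(conn, ein):
    """Return (year, s3_key) pairs for one EIN, newest year first."""
    cursor = conn.execute(
        "SELECT year, s3_key FROM filings WHERE ein = ? ORDER BY year DESC", (ein,)
    )
    return cursor.fetchall()
```

With a table this flat, an index on `ein` is probably all the lookup needs.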


hampelm commented May 29, 2017

Following the instructions here:
https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/

Do an advanced search and ask for a CSV: https://archive.org/advancedsearch.php?q=collection:IRS990
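
For reference, a small sketch that builds the CSV version of that search URL. The `fl[]`, `rows`, and `output` parameters are how I understand the advanced search endpoint from the bulk-download post, so treat them as assumptions to verify; the default row count is arbitrary:

```python
from urllib.parse import urlencode

ADVANCED_SEARCH = "https://archive.org/advancedsearch.php"


def search_csv_url(collection, rows=1000):
    """Build an advanced-search URL that asks for item identifiers as CSV."""
    params = [
        ("q", f"collection:{collection}"),
        ("fl[]", "identifier"),  # only return the identifier field
        ("rows", rows),          # must be at least the number of items
        ("output", "csv"),
    ]
    return ADVANCED_SEARCH + "?" + urlencode(params)


url = search_csv_url("IRS990")
```

Nothing here hits the network; the resulting URL can be passed to wget or opened in a browser.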

AWS machine for processing

ssh [email protected] -i ~/.ssh/matth.pem

EBS volume mounted at /data


hampelm commented May 29, 2017

Here's the list of 990 uploads on the archive: https://gist.github.com/hampelm/c5e22d1ac19bea8fd57b44aee4f09962

Work-in-progress wget command to capture a single one:

wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -A "*.iso" https://archive.org/download/IRS990-2010-09

We'll probably want to add a column called "s3path" to the file to define where each one will be uploaded on S3, since the directory paths vary.
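
Adding that column could be sketched like this, assuming the list is a CSV with an `identifier` column. The rule used to fill in `s3path` is only a placeholder (a prefix plus the identifier); the real mapping still needs to be worked out because the directory paths vary:

```python
import csv
import io


def add_s3path(src, dst, prefix="990s"):
    """Copy a CSV from src to dst, appending an 's3path' column to each row."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + ["s3path"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Placeholder rule: one S3 "directory" per archive.org item.
        row["s3path"] = f"{prefix}/{row['identifier']}/"
        writer.writerow(row)


# Tiny in-memory example using the one identifier mentioned in this thread.
source = io.StringIO("identifier\nIRS990-2010-09\n")
output = io.StringIO()
add_s3path(source, output)
```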

Downloads are running pretty slowly (2-3MB/s on EC2), so this first part will take a while; the next step will be to mount the ISOs with something like

sudo mount -o loop whatever.iso /mnt/iso
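
Since there are hundreds of ISOs, the mount step will probably end up scripted. A dry-run sketch that only builds the commands for one ISO; the function name and the pairing with `umount` are assumptions, and actually running them (e.g. with `subprocess.run`) needs root:

```python
import shlex


def mount_commands(iso_path, mount_point="/mnt/iso"):
    """Return the mount and unmount commands for one ISO as argument lists."""
    return [
        ["sudo", "mount", "-o", "loop", str(iso_path), mount_point],
        ["sudo", "umount", mount_point],
    ]


# Print the commands instead of running them.
for command in mount_commands("whatever.iso"):
    print(shlex.join(command))
```

The PDFs and manifest would be copied off the mount point between the two commands.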
