Link directly to 990 PDFs #160

hampelm · 2017-05-24T15:33:33Z

As much as we dislike 'em, the 990 PDFs aren't going away. Since they provide so much info and are behind a login wall in many places, it'd be helpful to link directly to them on organization pages.

To do:

Find a stable source of them: either clone the S3 bucket or find someone we trust who does (the ProPublica terms might not work for us, but I can reach out to them: https://projects.propublica.org/nonprofits/)
Identify a mapping of EIN => 990 by year
Import that into our database
Include an array of 990 links sorted by year in our org response. I envision something like:

{
  ...org details...
  990s: [{ year: 2015, url: 'https://s3...'}, ...]
}
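
A minimal sketch of how that field might be assembled, assuming the EIN => 990 mapping has already been imported as rows with `ein`, `year`, and `url` keys. Those key names, the helper `build_990_links`, the sample rows, and the newest-first ordering are all assumptions for illustration, not anything already in our codebase:

```python
def build_990_links(rows, ein):
    """Return [{'year': ..., 'url': ...}, ...] for one EIN, newest year first."""
    matches = [r for r in rows if r["ein"] == ein]
    matches.sort(key=lambda r: r["year"], reverse=True)
    return [{"year": r["year"], "url": r["url"]} for r in matches]


# Hypothetical rows standing in for the imported lookup table.
rows = [
    {"ein": "000000001", "year": 2014, "url": "https://example.com/2014.pdf"},
    {"ein": "000000001", "year": 2015, "url": "https://example.com/2015.pdf"},
    {"ein": "000000002", "year": 2015, "url": "https://example.com/other.pdf"},
]

# The org response would carry the links under the "990s" key.
org = {"ein": "000000001", "990s": build_990_links(rows, "000000001")}
```

An org with no filings on record would simply get an empty `990s` array.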


hampelm · 2017-05-27T14:04:08Z

The Internet Archive has 900+ ISOs of filings from the IRS, organized by date and type:
https://archive.org/details/IRS990?sort=-publicdate

I opened up a sample. Each ISO contains the PDFs plus a tab-delimited manifest file listing the file path, EIN, org name, filing type, date, and other metadata.
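
Reading one of those manifests could look roughly like this. The column order in `FIELDS`, the absence of a header row, and the sample line are assumptions; I haven't pinned down the exact layout, so adjust to match the real file:

```python
import csv
import io

# Assumed column order; the real manifest may differ or carry extra fields.
FIELDS = ["path", "ein", "name", "filing_type", "date"]


def read_manifest(handle):
    """Parse a tab-delimited manifest into a list of dicts keyed by FIELDS."""
    # QUOTE_NONE keeps stray quote characters in org names from confusing the parser.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) < len(FIELDS):
            continue  # skip blank or short lines
        records.append(dict(zip(FIELDS, row)))  # extra trailing columns are ignored
    return records


# Made-up sample line, not copied from a real manifest.
sample = io.StringIO("2010_09_EO/example.pdf\t000000001\tEXAMPLE ORG\t990\t200912\n\n")
records = read_manifest(sample)
```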

hampelm · 2017-05-27T14:08:27Z

I'm hopeful that someone already has these on S3. Otherwise, scripting this shouldn't be that hard, and it'll cost us about $1/month to host. We'll just have to find a good way to index them. Preserving the existing structure (year+type/ein+year+type.pdf) seems the most straightforward, with the lookup stored in a single flat table.
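
A rough sketch of that key layout and flat lookup table, using an in-memory SQLite database as a stand-in for ours. The underscore separators, the `filings` table, its column names, and the sample values are assumptions; the note above only fixes the general shape of the path:

```python
import sqlite3


def s3_key(ein, year, filing_type):
    """Build a key shaped like year+type/ein+year+type.pdf (separators assumed)."""
    folder = f"{year}_{filing_type}"
    return f"{folder}/{ein}_{year}_{filing_type}.pdf"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE filings (ein TEXT, year INTEGER, filing_type TEXT, s3_key TEXT)"
)

# Hypothetical filings for a single made-up EIN.
for ein, year, filing_type in [("000000001", 2014, "990"), ("000000001", 2015, "990")]:
    conn.execute(
        "INSERT INTO filings VALUES (?, ?, ?, ?)",
        (ein, year, filing_type, s3_key(ein, year, filing_type)),
    )

# Looking up every filing for one EIN is a single indexed-friendly query.
keys = [
    row[0]
    for row in conn.execute(
        "SELECT s3_key FROM filings WHERE ein = ? ORDER BY year", ("000000001",)
    )
]
```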

hampelm · 2017-05-29T14:16:01Z

Following the instructions here:
https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/

Do an advanced search and ask for a CSV: https://archive.org/advancedsearch.php?q=collection:IRS990
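
For scripting, the same search URL can be built in code. This is a sketch assuming the query parameters the Advanced Search form appears to use (`fl[]` for returned fields, `rows` for the result limit, `output=csv`); the `rows` value of 1000 is an arbitrary guess sized to cover the 900+ items:

```python
from urllib.parse import urlencode

BASE = "https://archive.org/advancedsearch.php"


def search_url(collection, rows=1000):
    """Build an Advanced Search URL asking for item identifiers as CSV."""
    params = [
        ("q", f"collection:{collection}"),
        ("fl[]", "identifier"),  # only return the item identifier column
        ("rows", str(rows)),     # assumed upper bound on results
        ("output", "csv"),
    ]
    return f"{BASE}?{urlencode(params)}"


url = search_url("IRS990")
```

The resulting URL can be handed to wget or curl to save the identifier list.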

AWS machine for processing

ssh [email protected] -i ~/.ssh/matth.pem

EBS volume mounted at /data

hampelm · 2017-05-29T15:04:15Z

Here's the list of 990 uploads on the archive: https://gist.github.com/hampelm/c5e22d1ac19bea8fd57b44aee4f09962

Work-in-progress wget command to capture a single one:

wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -A "*.iso" https://archive.org/download/IRS990-2010-09

Probably want to add a column called "s3path" to the file to define where each one will be uploaded on S3, since the directory paths vary.
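
Adding that column could be sketched as follows, assuming the list is a CSV with an `identifier` header. The `irs990` prefix and the one-folder-per-identifier rule are placeholders; the real mapping would need to account for the varying directory paths:

```python
import csv
import io


def add_s3path(source, dest, prefix="irs990"):
    """Copy a CSV from source to dest, appending an s3path column to each row."""
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames or []) + ["s3path"]
    writer = csv.DictWriter(dest, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Placeholder rule: one S3 folder per archive item identifier.
        row["s3path"] = f"{prefix}/{row['identifier']}/"
        writer.writerow(row)


source = io.StringIO("identifier\nIRS990-2010-09\n")
dest = io.StringIO()
add_s3path(source, dest)
result = dest.getvalue()
```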

Downloads are running pretty slowly (2-3 MB/s on EC2), so this first part will take a while. The next step will be to mount the ISOs with something like

sudo mount -o loop whatever.iso /mnt/iso
