Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Importing bulk IRS data #138

Open
hampelm opened this issue Jul 30, 2016 · 15 comments
Open

Importing bulk IRS data #138

hampelm opened this issue Jul 30, 2016 · 15 comments

Comments

@hampelm
Copy link
Member

hampelm commented Jul 30, 2016

This statement creates the unified 990s table:

create table form (
id bigserial primary key,
created_at timestamp default current_timestamp,
updated_at timestamp default current_timestamp,
source varchar(50),
org bigint,
ein varchar(50),
subseccd varchar(50),
pdf varchar(255),
filing_type varchar(50),
start_year int,
end_year int,
irs_year int,
filing_date varchar(25),
tax_period varchar(25),
contributions_and_grants bigint,
program_service_revenue bigint,
investment_income bigint,
other_revenue bigint,
total_revenue bigint,
grants_paid bigint,
benefits_paid bigint,
compensation bigint,
fundraising_fees bigint,
total_fundraising_expenses bigint,
other_expenses bigint,
total_expenses bigint,
revenue_less_expenses bigint,
total_assets bigint,
total_liabilities bigint,
net_assets bigint,
data jsonb
);

CREATE INDEX ein_idx ON combined (ein);
@hampelm hampelm changed the title Create statement for new IRS data table Importing bulk IRS data Jul 31, 2016
@hampelm
Copy link
Member Author

hampelm commented Jul 31, 2016

To import:

A. Data downloads are at https://www.irs.gov/uac/soi-tax-stats-annual-extract-of-tax-exempt-organization-financial-data

B. Get desired file:

wget https://www.irs.gov/pub/irs-soi/12eofinextract990.zip

C. Extract file

unzip 12eofinextract990.zip

D. Generate schema for file:

csvsql -i postgresql py12_990.dat -d ' ' > py12_990.sql

This may give a segmentation fault for large files. It's probably running out of memory; try it on your personal computer if this happens.

E. Create a table using that statement:

psql postgres < py12_990.sql

F. Import the data into postgres:

psql postgres
postgres=#  copy py12_990 from '/tmp/py12_990.dat' delimiter ' ' csv header;

@jessicamcinchak
Copy link
Member

Note that 2015 extracts are linked directly as .dat (not zipped). Adjust import by skipping step C.

Revise step D:
csvsql -i postgresql 15eofinextract990pf.dat -d ' ' > py15_990pf.sql

Once imported, rename 2015 990 tables for consistency:
postgres=# ALTER TABLE "15eofinextract990pf" RENAME TO "py15_990pf";

@hampelm
Copy link
Member Author

hampelm commented Aug 5, 2016

Here are the statements I've come up with to standardize the 2012 data. I think I was overzealous in the schema and we can remove a lot of fields, but I'll think about that later.

To test these, I created a test table (matt, you should create one for yourself using the schema above).

You'll notice that these seem pretty redundant. Why select from a select statement? That's a quirk of postgres/SQL: if instead we did the much cleaner

insert into matt 
(ein, subseccd)
select subseccd as subseccd, ein as ein 
from 2012_data

the insert wouldn't save subseccd to the subseccd column of matt. Instead, it'd insert the data in the order of the columns in matt, no matter the column names. That's a recipe for hard-to-find bugs. This level of indirection means we can write the fields in any order.

Here goes -- you can use these as templates for doing the other tables:

2012 990s

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data
from (select
'2012 990' as source,
subseccd,
"EIN" as ein,
'990' as filing_type,
tax_prd as tax_period,
totcntrbgfts as contributions_and_grants,
totprgmrevnue as program_service_revenue,
invstmntinc as investment_income,
totrevenue as total_revenue,
othrsalwages + compnsatncurrofcr as compensation,
profndraising as fundraising_fees,
totfuncexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetend as net_assets,
row_to_json(py12_990) as data
from py12_990 ) as stuff
limit 10

PY12 990 EZ

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
total_expenses,
total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
 total_expenses,
total_assets, total_liabilities,
net_assets, data
from (select
'PY12 990 EZ' as source,
subseccd,
ein,
'990ez' as filing_type,
tax_prd as tax_period,
totcntrbs as contributions_and_grants,
prgmservrev as program_service_revenue,
othrinvstinc as investment_income,
-- othrsalwages + compnsatncurrofcr as compensation,
-- profndraising as fundraising_fees,
totrevnue as total_revenue,
totexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetsend as net_assets,
row_to_json(py12_990ez) as data
from py12_990ez ) as stuff
limit 10

2012 PF

insert into matt
(
source,
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
)
select
source,
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
from (select
'2012 990PF' as source,
"SUBCD" as subseccd,
"EIN" as ein,
'990pf' as filing_type,
"TAX_PRD" as tax_period,
"TAX_YR" as irs_year,
"GRSCONTRGIFTS" as contributions_and_grants,
-- prgmservrev as program_service_revenue,
"NETINVSTINC" as investment_income,
"COMPOFFICERS" as compensation,
-- profndraising as fundraising_fees,
"TOTRCPTPERBKS" as total_revenue,
"TOTEXPNSPBKS" as total_expenses,
"TOTASSETSEND" as total_assets,
"TOTLIABEND" as total_liabilities,
"CONTRPDPBKS" as grants_paid,
-- totnetassetsend as net_assets,
row_to_json(py12_990pf) as data
from py12_990pf) as stuff
limit 10

@bnchdrff
Copy link
Member

schemas for 2013, 2014, and 2015 990: https://gist.github.com/bnchdrff/14413f0bf8253729bd646373d91519a0

@hampelm
Copy link
Member Author

hampelm commented Aug 22, 2016

2013 990

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data
from (select
'2013 990' as source, 
subseccd,
"EIN" as ein,
'990' as filing_type,
tax_pd as tax_period,
totcntrbgfts as contributions_and_grants,
totprgmrevnue as program_service_revenue,
invstmntinc as investment_income,
totrevenue as total_revenue,
othrsalwages + compnsatnandothr + compnsatncurrofcr as compensation,
profndraising as fundraising_fees,
totfuncexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetend as net_assets,
row_to_json(py13_990) as data
from py13_990 ) as stuff
limit 10

@hampelm
Copy link
Member Author

hampelm commented Aug 22, 2016

2013 EZ

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
total_expenses,
total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
 total_expenses,
total_assets, total_liabilities,
net_assets, data
from (select
'2013 EZ' as source, 
subseccd,
"EIN" as ein,
'990ez' as filing_type,
tax_pd as tax_period,
totcntrbs as contributions_and_grants,
prgmservrev as program_service_revenue,
othrinvstinc as investment_income,
-- othrsalwages + compnsatncurrofcr as compensation,
-- profndraising as fundraising_fees,
totrevnue as total_revenue,
totexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetsend as net_assets,
row_to_json("py13_EZ") as data
from "py13_EZ" ) as stuff
limit 10

@hampelm
Copy link
Member Author

hampelm commented Aug 22, 2016

2013 PF

insert into matt
(
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
)
select
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
from (select
'2013 PF' as source, 
"SUBCD" as subseccd,
"EIN" as ein,
'990ez' as filing_type,
"TAX_PRD" as tax_period,
"TAX_YR" as irs_year,
"GRSCONTRGIFTS" as contributions_and_grants,
-- prgmservrev as program_service_revenue,
"NETINVSTINC" as investment_income,
"COMPOFFICERS" as compensation,
-- profndraising as fundraising_fees,
"TOTRCPTPERBKS" as total_revenue,
"TOTEXPNSPBKS" as total_expenses,
"TOTASSETSEND" as total_assets,
"TOTLIABEND" as total_liabilities,
"CONTRPDPBKS" as grants_paid,
-- totnetassetsend as net_assets,
row_to_json(py13_990pf) as data
from py13_990pf) as stuff
limit 10 

@jessicamcinchak
Copy link
Member

jessicamcinchak commented Aug 26, 2016

2014 990

insert into matt
(
source, 
ein, 
subseccd, 
filing_type, 
tax_period,
contributions_and_grants, 
program_service_revenue,
investment_income, 
total_revenue, 
compensation,
fundraising_fees, 
total_expenses, 
total_assets, 
total_liabilities,
net_assets, 
data
)
select
source, 
ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data
from (select
'2014 990' as source, 
subseccd,
"EIN" as ein,
'990' as filing_type,
tax_prd as tax_period,
totcntrbgfts as contributions_and_grants,
totprgmrevnue as program_service_revenue,
invstmntinc as investment_income,
totrevenue as total_revenue,
othrsalwages + compnsatnandothr + compnsatncurrofcr as compensation,
profndraising as fundraising_fees,
totfuncexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetend as net_assets,
row_to_json(py14_990) as data
from py14_990 ) as stuff
limit 10 

@jessicamcinchak
Copy link
Member

jessicamcinchak commented Aug 26, 2016

2014 990ez

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
total_expenses,
total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
 total_expenses,
total_assets, total_liabilities,
net_assets, data
from (select
'2014 990ez' as source, 
subseccd,
"EIN" as ein,
'990ez' as filing_type,
taxprd as tax_period,
totcntrbs as contributions_and_grants,
prgmservrev as program_service_revenue,
othrinvstinc as investment_income,
-- othrsalwages + compnsatncurrofcr as compensation,
-- profndraising as fundraising_fees,
totrevnue as total_revenue,
totexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetsend as net_assets,
row_to_json("py14_EZ") as data
from "py14_EZ" ) as stuff
limit 10

@jessicamcinchak
Copy link
Member

jessicamcinchak commented Aug 26, 2016

2014 990pf

insert into matt
(
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
)
select
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
from (select
'2014 990PF' as source, 
"SUBCD" as subseccd,
"EIN" as ein,
'990pf' as filing_type,
"TAX_PRD" as tax_period,
"TAX_YR" as irs_year,
"GRSCONTRGIFTS" as contributions_and_grants,
-- prgmservrev as program_service_revenue,
"NETINVSTINC" as investment_income,
"COMPOFFICERS" as compensation,
-- profndraising as fundraising_fees,
"TOTRCPTPERBKS" as total_revenue,
"TOTEXPNSPBKS" as total_expenses,
"TOTASSETSEND" as total_assets,
"TOTLIABEND" as total_liabilities,
"CONTRPDPBKS" as grants_paid,
-- totnetassetsend as net_assets,
row_to_json(py14_990pf) as data
from py14_990pf) as stuff
limit 10 

@jessicamcinchak
Copy link
Member

jessicamcinchak commented Aug 26, 2016

2015 990

insert into matt
(
source, 
ein, 
subseccd, 
filing_type, 
tax_period,
contributions_and_grants, 
program_service_revenue,
investment_income, 
total_revenue, 
compensation,
fundraising_fees, 
total_expenses, 
total_assets, 
total_liabilities,
net_assets, 
data
)
select
source, 
ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data
from (select
'2015 990' as source, 
subseccd,
"EIN" as ein,
'990' as filing_type,
tax_prd as tax_period,
totcntrbgfts as contributions_and_grants,
totprgmrevnue as program_service_revenue,
invstmntinc as investment_income,
totrevenue as total_revenue,
othrsalwages + compnsatnandothr + compnsatncurrofcr as compensation,
profndraising as fundraising_fees,
totfuncexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetend as net_assets,
row_to_json(py15_990) as data
from py15_990 ) as stuff
limit 10 

@jessicamcinchak
Copy link
Member

jessicamcinchak commented Aug 26, 2016

2015 990ez

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
total_expenses,
total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
 total_expenses,
total_assets, total_liabilities,
net_assets, data
from (select
'2015 990 EZ' as source, 
subseccd,
"EIN" as ein,
'990ez' as filing_type,
a_tax_prd as tax_period,
totcntrbs as contributions_and_grants,
prgmservrev as program_service_revenue,
othrinvstinc as investment_income,
-- othrsalwages + compnsatncurrofcr as compensation,
-- profndraising as fundraising_fees,
totrevnue as total_revenue,
totexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetsend as net_assets,
row_to_json("py15_990ez") as data
from "py15_990ez" ) as stuff
limit 10

@jessicamcinchak
Copy link
Member

jessicamcinchak commented Aug 26, 2016

2015 990pf

insert into matt
(
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
)
select
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
from (select
'2015 990 pf' as source, 
"SUBCD" as subseccd,
"EIN" as ein,
'990pf' as filing_type,
"TAX_PRD" as tax_period,
"TAX_YR" as irs_year,
"GRSCONTRGIFTS" as contributions_and_grants,
-- prgmservrev as program_service_revenue,
"NETINVSTINC" as investment_income,
"COMPOFFICERS" as compensation,
-- profndraising as fundraising_fees,
"TOTRCPTPERBKS" as total_revenue,
"TOTEXPNSPBKS" as total_expenses,
"TOTASSETSEND" as total_assets,
"TOTLIABEND" as total_liabilities,
"CONTRPDPBKS" as grants_paid,
-- totnetassetsend as net_assets,
row_to_json(py15_990pf) as data
from py15_990pf) as stuff
limit 10

@hampelm
Copy link
Member Author

hampelm commented Oct 21, 2016

Note that the real inserts should go into the combined table or another official table. matt is just a scratchpad for my testing, so don't rely on it to have good data. Use the schema above to create your own table for testing imports.

@JerryPoore
Copy link

Hi Hampelm. I know nothing about how this works. I read about it in Friedman's book Thank you for Being Late. There is something here I know but how to mine it, I don't. I am an Enrolled Agent for the IRS and I help people resolve tax issues. I am looking for a data base that tells me who the IRS has sent Tax audits, tax notices, and tax liens and etc. The faster I know this the faster I can help people avoid overspending on high priced professionals. It's digitized somewhere, I need an expert to dig it out on a program like GitHub. Any comments? Thank you Jerry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants