Calibrate number of people age {0-17, 18-64, 65+} per tax unit by s,j #9

Open
MaxGhenis opened this issue Nov 10, 2020 · 6 comments

@MaxGhenis
Contributor

Implementing UBI directly in OG-USA (https://github.com/PSLmodels/OG-USA/issues/626) requires calibrating the number of people per tax unit by s,j, split for each of the age groups that could have different UBI amounts, currently 0-17, 18-64, and 65+. We'll want to calculate the value per s,j and then apply kernel density smoothing.

@prrathi and I calculated unsmoothed values using CPS tax units in this notebook. Next step is to do it with PSID instead.

Seems like we can use psid_data_setup.py for this. Our first try crashed Colab but @prrathi will try it again.

@jdebacker, is psid_lifetime_income.pkl, produced in that script, too big for GitHub?

Or will we have to hold onto the columns listed in #6 and aggregate them along the way anyway, requiring modification to psid_data_setup.py?

@jdebacker
Member

@MaxGhenis Yes, psid_lifetime_income.pkl is too big for GH (~124 MB).

I haven't run that script on Colab, but it runs fine locally (assuming you have all dependencies installed).

All columns in Issue #6 are already included in the PSID data saved to the repo.

@MaxGhenis
Contributor Author

Recapping next steps from a meeting with @prrathi:

  1. Verify that head_age, spouse_age and num_children_under18 are exported by psid_download.R (see comments in Household structure variables from the PSID #6 on why these are the fields needed)
  2. Verify that psid_lifetime_income.pkl also preserves these variables; if not, we may need to add them to constant_vars
  3. Create a new file, e.g. household_structure.py, which (a) calculates nu18, n1864, and n65 from these variables for each record in psid_lifetime_income (per Household structure variables from the PSID #6), (b) calculates the average of each of these by s,j, and (c) applies the MVKDE function to smooth these cells (see Create smoothing file to centralize MVKDE function #25).
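
A minimal sketch of steps 3(a) and 3(b), not the actual household_structure.py. It assumes that only the head, the spouse and children under 18 are counted, that a missing spouse_age means no spouse, and that the s and j cells are stored in hypothetical columns named `age` and `li_group`. The averages are unweighted.

```python
import numpy as np
import pandas as pd


def count_age_groups(df):
    """Add nu18, n1864 and n65 from head/spouse ages and the child count."""
    out = df.copy()
    head = out["head_age"]
    spouse = out["spouse_age"]  # assumption: NaN when there is no spouse
    out["nu18"] = out["num_children_under18"]
    # between() is inclusive and returns False for NaN
    out["n1864"] = head.between(18, 64).astype(int) + spouse.between(18, 64).astype(int)
    out["n65"] = (head >= 65).astype(int) + (spouse >= 65).astype(int)
    return out


def mean_by_s_j(df, s_col="age", j_col="li_group"):
    """Unweighted average of each count within every (s, j) cell."""
    return df.groupby([s_col, j_col])[["nu18", "n1864", "n65"]].mean()


# Toy records (made up for illustration, not PSID data)
toy = pd.DataFrame({
    "age": [30, 30, 70, 70],
    "li_group": [1, 1, 2, 2],
    "head_age": [30, 30, 70, 70],
    "spouse_age": [28, np.nan, 66, 60],
    "num_children_under18": [2, 0, 0, 0],
})
cells = mean_by_s_j(count_age_groups(toy))
print(cells)
```

Step 3(c) would then pass these cell averages (or shares derived from them) to the smoother.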

@MaxGhenis
Contributor Author

MaxGhenis commented Mar 1, 2021

The KDE functions and the dependent scipy.stats.gaussian_kde require probability data. I think we have two options:

  1. Smooth with something like LOESS, though I couldn't find a multivariate LOESS smoother in Python
  2. Apply KDE using an extra dimension of the number of people, e.g. determining cells in s by j by nu18 (or n1864 or n65, separately). scipy.stats.gaussian_kde accepts multivariate (not just bivariate) data, so this should work, and then we can compute the average in each s x j cell using the density estimates.
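
Option 2 could look roughly like the sketch below: fit a three-dimensional `scipy.stats.gaussian_kde` over (s, j, n) and take the density-weighted mean of n in each s x j cell. The function name, the grids and the synthetic data are all hypothetical, and treating the discrete j and n as continuous is a simplifying assumption.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_cell_means(s, j, n, s_grid, j_grid, n_grid):
    """Approximate E[n | s, j] from a 3-D Gaussian KDE over (s, j, n)."""
    data = np.vstack([s, j, n]).astype(float)  # shape (3, n_records)
    kde = gaussian_kde(data)
    S, J, N = np.meshgrid(s_grid, j_grid, n_grid, indexing="ij")
    pts = np.vstack([S.ravel(), J.ravel(), N.ravel()]).astype(float)
    dens = kde(pts).reshape(S.shape)
    # Density-weighted average of n along the last (n) axis
    return (dens * N).sum(axis=2) / dens.sum(axis=2)


# Synthetic records for illustration only
rng = np.random.default_rng(0)
s = rng.integers(20, 61, size=500)
j = rng.integers(0, 7, size=500)
n = rng.poisson(1.0, size=500)

s_grid, j_grid, n_grid = np.arange(20, 61), np.arange(7), np.arange(0, 7)
means = kde_cell_means(s, j, n, s_grid, j_grid, n_grid)
print(means.shape)
```

Because the mean is taken over a finite `n_grid`, every cell value is bounded by the smallest and largest count on that grid.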

@jdebacker what would you suggest?

@MaxGhenis
Contributor Author

Actually @prrathi and I realized that we could use the existing KDE function where we model each sxj's share of total children/adults/seniors in the same way that e.g. the share of total transfers by sxj is modeled. Then we can multiply that by the current number of children/adults/seniors to get the average by sxj.

@jdebacker
Member

Yes - that is a good solution!

@MaxGhenis
Contributor Author

Some updates:

@prrathi tried the KDE with some PSID data, but it was still noisy because it's the quotient of a smoothed numerator (# kids in bin) and unsmoothed denominator (# families in bin). He's going to try smoothing the denominator too.
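
One way to smooth both the numerator and the denominator is sketched below. It is not the project's MVKDE function: it uses `scipy.stats.gaussian_kde` directly, with the people count as `weights` for the numerator and equal weights for the denominator, and forces both to share one bandwidth factor. The function and variable names are hypothetical and the data are synthetic.

```python
import numpy as np
from scipy.stats import gaussian_kde


def smoothed_ratio(s, j, n_people, s_grid, j_grid):
    """Average people per unit by (s, j), smoothing numerator and denominator."""
    n_people = np.asarray(n_people, dtype=float)
    pts = np.vstack([s, j]).astype(float)
    S, J = np.meshgrid(s_grid, j_grid, indexing="ij")
    grid = np.vstack([S.ravel(), J.ravel()]).astype(float)
    bw = gaussian_kde(pts).factor  # Scott's factor from the unweighted fit
    # Numerator: each cell's share of all people (records weighted by count)
    num = gaussian_kde(pts, bw_method=bw, weights=n_people)(grid)
    # Denominator: each cell's share of all units (equal weights)
    den = gaussian_kde(pts, bw_method=bw)(grid)
    # share of people / share of units, rescaled by the overall average
    return (n_people.mean() * num / den).reshape(S.shape)


# Synthetic records for illustration only
rng = np.random.default_rng(1)
s = rng.uniform(20, 80, size=400)
j = rng.integers(0, 7, size=400).astype(float)
n_kids = rng.poisson(1.0, size=400)

s_grid, j_grid = np.arange(20, 81, 5), np.arange(7)
avg_kids = smoothed_ratio(s, j, n_kids, s_grid, j_grid)
print(avg_kids.shape)
```

A quick sanity check on this design: if every unit has the same count, the two densities coincide and the ratio returns that count in every cell.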

Given the PSID data issues described in #28, we tried returning to the taxdata CPS file in this notebook, and using stratified LOESS. Here's the raw data for 18-64:
[Image: raw data for ages 18-64]
To avoid the jumps, @prrathi is going to start with the counts excluding the household head, then add the household head to the appropriate count based on their age post hoc.

Here's the LOESS smoother with the 18-64 bin, just for household head ages 18-64 to avoid smoothing that spike:
[Image: LOESS fit for the 18-64 bin, household heads aged 18-64]

And the residuals:
[Image: LOESS residuals]

We tried several values of frac (essentially the bandwidth, defaults to 0.67) and found that 0.4 avoided large sustained residuals while also avoiding too many inflection points, which seem implausible.
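
For reference, a stratified LOESS along these lines can be sketched with statsmodels' `lowess`, fitting a separate curve over s within each j. This is an assumption about tooling, not the notebook's code: the column names `s`, `j` and `y` are hypothetical, `y` stands for the cell average excluding the household head, and the data are synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess


def stratified_lowess(cells, frac=0.4):
    """Smooth y over s separately within each j stratum."""
    out = cells.sort_values(["j", "s"]).copy()
    out["y_smooth"] = np.nan
    for _, grp in out.groupby("j"):
        # return_sorted=False keeps fitted values in the input order
        fit = lowess(grp["y"].to_numpy(), grp["s"].to_numpy(),
                     frac=frac, return_sorted=False)
        out.loc[grp.index, "y_smooth"] = fit
    return out


# Synthetic cell averages: a smooth curve plus noise, three j strata
rng = np.random.default_rng(0)
s = np.tile(np.arange(18, 65), 3)
j = np.repeat(np.arange(3), 47)
truth = 0.5 + 0.3 * np.sin(s / 10)
cells = pd.DataFrame({"s": s, "j": j, "y": truth + rng.normal(0, 0.1, s.size)})

smoothed = stratified_lowess(cells, frac=0.4)
# Add the household head back post hoc: heads aged 18-64 count toward n1864
smoothed["n1864"] = smoothed["y_smooth"] + smoothed["s"].between(18, 64).astype(int)
print(smoothed.head())
```

Smoothing the head-excluded count and adding the head afterwards keeps the step at the age-group boundary out of the smoother's input.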

If the KDE smoothing for the numerator and denominator doesn't work as well, this stratified LOESS seems pretty good (though a multivariate LOESS would be better). @rickecon fyi.
