Imputing wealth/parent income and Removing parent savings from parent net worth in college enrollment count #11

Merged: 7 commits, Dec 12, 2023

Conversation

storresrod
Collaborator

Addressing Issue #1 by imputing wealth/parent income in new script nlsy_impute.

Main question for team discussion in Issue #1: The current imputation uses a limited set of predictor variables, including student age, sex, and a combined race and ethnicity variable. I originally attempted including a larger set of predictor variables, but multiple imputation with the mice package is sensitive to predictor variables with a lot of missingness and to highly correlated predictors. At the same time, it would be good to review literature to see what types of predictors are traditionally included when imputing missing wealth and income, and to ensure we have sufficient predictor variables for imputation to be more accurate.

Addressing Issue #3 by removing parent savings from college enrollment count. Adapted the nlsy_lib script and the nlsy_get_col_stat_annual_df function. Utilized newly imputed net worth and subtracted net savings. If needed, this can be replicated in the function which only looks at fall semester enrollment.

Also, I created a new joint race and ethnicity variable in the nlsy_lib script.

…e enrollment count

Addressing Issue #1 by imputing wealth/parent income in new script. Addressing Issue #3 by removing parent savings from college enrollment count.
@storresrod storresrod requested a review from dc0sic November 9, 2023 07:47
@storresrod storresrod self-assigned this Nov 9, 2023
@dc0sic
Collaborator

dc0sic commented Nov 9, 2023

Thank you, Sonia! I have a couple of thoughts regarding the predictors used in imputation.

Predictors should be mostly parents' characteristics rather than the youth's, except perhaps for race and ethnicity, because the youth's race captures the parents' race. Aside from race, parents' characteristics should include their age and education at the time of the interview (i.e., in 1997). NLSY has mother's age at youth's birth (although it's not in our current sample), but I couldn't find father's age. The survey also has parent's highest grade completed. Other potential predictors are indicators for renting vs. owning a home and for having retirement savings.

We should also get a better understanding of what parents' income and wealth contain for youths who don't live with both parents, and include variables that indicate such living arrangements.

To verify that predictors are significant and to check for collinearity, you could run a regression of the variable being predicted (e.g. wealth) on the set of predictors being considered, while restricting the sample to those with non-missing wealth observations.
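The check described above could look roughly like the following sketch. It uses synthetic data, and every column name (`mom_age`, `owns_home`, `par_educ`, `pnetworth`) is a hypothetical stand-in rather than the actual variable names in nlsy_impute.R.

```r
# Synthetic stand-in for the NLSY sample; all column names are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(
  mom_age   = round(rnorm(n, 40, 6)),
  owns_home = rbinom(n, 1, 0.6),
  par_educ  = sample(8:18, n, replace = TRUE)
)
toy$pnetworth <- 2000 * toy$par_educ + 30000 * toy$owns_home + rnorm(n, 0, 20000)
toy$pnetworth[sample(n, 50)] <- NA  # mimic missing wealth

# Restrict the sample to rows where the variable being predicted is observed.
observed <- toy[!is.na(toy$pnetworth), ]

# Regress wealth on the candidate predictors and inspect the p-values.
fit <- lm(pnetworth ~ mom_age + owns_home + par_educ, data = observed)
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))

# Pairwise correlations among predictors as a first look at collinearity.
print(round(cor(observed[, c("mom_age", "owns_home", "par_educ")]), 2))
```

The same pattern would apply with parent income as the outcome.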

We should also think of steps that would help us validate imputation results. For example, plotting the distributions of non-missing and imputed values would be helpful. Also, because we are imputing income and wealth jointly, a scatterplot of these two variables would be useful.

I hope this is helpful. Aside from these thoughts on imputation, I have some questions regarding changes in nlsy_lib.R. I will leave them as inline comments.

NLSY/nlsy_lib.R Outdated Show resolved Hide resolved
Collaborator

@dc0sic dc0sic left a comment

I would like to discuss some of the changes in nlsy_lib.R before we merge this PR.

@storresrod
Collaborator Author

storresrod commented Nov 15, 2023

Thanks for the comments and support, Damir! I have reverted all suggestions in the lib script, and addressed the comments on the imputation script. Updates on this script include:

  • Added new parent indicators (such as mom age, homeownership, retirement savings)
  • Created new dummy variables for relevant categorical variables (like parent education)
  • Checked for significance on the subset of data with non-missing income and wealth. Excluded father education and the dummy for deceased parents.
  • Added new multicollinearity checks (correlation plots, stepwise regression, and VIF). Did not identify major concerns, so proceeded with the selected parent indicators.
  • Created new data frames which show original wealth and income, values for original indicators, and the 10 iterations of newly imputed income and wealth
  • Added plots comparing imputed and non-imputed values (for both income and wealth)
  • Added scatter plot comparing income and wealth for non-imputed values and each iteration of imputation

If this looks closer to what we would like, we can make a decision about how to call these imputed values in future modelling. Look forward to the team's thoughts!

NLSY/nlsy_impute.R Outdated Show resolved Hide resolved
Collaborator

@dc0sic dc0sic left a comment

Thank you Sonia! This looks good, but if you are moving the imputation code to a Quarto document, maybe we should wait with the merge. Let me know if you have preferences.

I have substantial comments on two things: missing values in variables used as predictors, and encoding of categorical variables and adding their interactions to the predictor set. I am happy to talk if that would be helpful.

NLSY/nlsy_impute.R Outdated Show resolved Hide resolved
…variables, adding interaction terms, and increasing checks for missingness

Updates to imputation script, including new encoding for categorical variables, adding interaction terms, and increasing checks for missingness
@dc0sic dc0sic merged commit b6173ef into main Dec 12, 2023
race_eth = ifelse(hisp == "No" & race == "White", "White NonHispanic", race_eth),
race_eth = ifelse(hisp == "Yes" & race == "Black or African American", "Black Hispanic", race_eth),
race_eth = ifelse(hisp == "No" & race == "Black or African American", "Black NonHispanic", race_eth),
race_eth = ifelse(hisp == "Yes" & race == "American Indian, Eskimo, or Aleut", "NonBW Hispanic", race_eth),
Collaborator

In my previous comment about coding race and ethnicity, I suggested using all the race categories present in the sample because it would increase the predictive power of this variable. Knowing that a person is American Indian or Asian likely provides more information about the person's income and wealth than if we know only that the person is NonBW.

Collaborator Author

Good flag!! I was still thinking about this from a missingness concern, which is not applicable here. Updated!
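One way to keep every race category, rather than collapsing to "NonBW", is to build the joint variable by pasting the two source columns together. This is only a sketch on toy rows: the `hisp` and `race` column names follow the diff above, the "Asian or Pacific Islander" label is an assumed category, and the resulting labels differ from the shortened ones in the diff (for example "Black or African American Hispanic" instead of "Black Hispanic").

```r
# Toy rows only; labels follow the diff above, and the Asian category label is assumed.
toy <- data.frame(
  hisp = c("No", "Yes", "No", "Yes", NA),
  race = c("White", "Black or African American",
           "American Indian, Eskimo, or Aleut",
           "Asian or Pacific Islander", "White"),
  stringsAsFactors = FALSE
)

# Label the ethnicity part, then combine it with the full race category.
hisp_label <- ifelse(toy$hisp == "Yes", "Hispanic", "NonHispanic")
toy$race_eth <- ifelse(
  is.na(toy$hisp) | is.na(toy$race),
  NA_character_,                      # keep missingness explicit
  paste(toy$race, hisp_label)
)
toy$race_eth <- factor(toy$race_eth)
print(table(toy$race_eth, useNA = "ifany"))
```

Small cells may still need to be pooled if some combinations have very few observations.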

select(-id, -hispanic, -race, -mom_education, -dad_education, -savings, -home, -both_parents, -par1_deceased, -par2_deceased, -mom_age_birth, -pincome, -pnetworth, -par_dec, -mom_educ_hs, -dad_educ_hs, -hisp, -race_eth) %>%
drop_na()

M <- cor(testing_cor)
Collaborator

Showing the correlation table would be helpful.

Collaborator Author

Added the correlation table to be displayed right before the correlation plot

```

### Checking for multi-collinearity - VIF and Successive addition of regressors

Collaborator

Some notes here would be helpful, especially with respect to vif, which I am not familiar with.

Contributor

  • The documentation has more technical information about VIF: https://rdrr.io/cran/car/man/vif.html.
  • The Wiki has the formula and steps for calculation: https://en.wikipedia.org/wiki/Variance_inflation_factor. The intuition is that you regress a given variable on all the other regressors and take the coefficient of determination (R^2) from that regression; the VIF is then 1/(1-R^2).
  • For practical purposes, a rule of thumb is that VIF values above 10 are considered indications of multicollinearity, and it looks like in this code chunk, none of the values come close to that, which is good.

Collaborator Author

Thanks Judah!

```
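The VIF intuition from the thread above can be reproduced by hand in base R. This is a sketch on synthetic predictors (`x1`, `x2`, `x3` are made up, with `x2` deliberately built to be collinear with `x1`), not the PR's actual predictor set.

```r
# Synthetic predictors; x2 is deliberately collinear with x1.
set.seed(2)
n <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# For each predictor: regress it on the others, take R^2, compute 1 / (1 - R^2).
vif_by_hand <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

Here `x1` and `x2` should show clearly elevated values while `x3` stays close to 1. For purely numeric predictors this should agree with what `car::vif()` reports on a model containing the same regressors; for factor predictors `car::vif()` reports a generalized version instead.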

## Checking for missings in the predictor variables

Collaborator

Same here, some explanation would be helpful. What does this ratio mr/(mr+mm) tell us and how do we use it to make a decision about the imputation process? For example, (race, pincome) is 0.6, and (pincome, race) is 0.987. Does this mean that when pincome is missing, race is also missing in 60% of cases, and when race is missing, pincome is also missing in 98.7% of cases?

Contributor

The interpretation of mr/(mr+mm) is "percentage of usable cases to impute row variable from column variable." So in the example above, a ratio of 0.6 means that for 60% of observations with missing race, pincome is non-missing and can thus be used in imputation.

Documentation here for reference: https://www.rdocumentation.org/packages/mice/versions/3.16.0/topics/md.pairs

Collaborator Author

It's a non-intuitive measure, but it's the one used in the documentation. As Judah shared, a lower value indicates that the predictor (column variable) is more often missing in the cases where the variable being imputed (row variable) is missing.
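To make the ratio concrete, here is a sketch that rebuilds the `mr` and `mm` counts by hand on a tiny made-up data frame, using the same row/column convention as `md.pairs` (row variable missing, column variable observed or missing). The data values are invented for illustration.

```r
# Tiny made-up data frame; values are for illustration only.
toy <- data.frame(
  race    = c("A", NA, "B", NA, "A", NA, NA, "B"),
  pincome = c(10, 20, NA, NA, 30, 40, NA, 50)
)

miss <- is.na(toy) * 1   # 1 where a value is missing
obs  <- 1 - miss         # 1 where a value is observed

mr <- t(miss) %*% obs    # row variable missing, column variable observed
mm <- t(miss) %*% miss   # both row and column variable missing
usable <- mr / (mr + mm) # share of usable cases per (row, column) pair
print(round(usable, 3))
```

In this toy example race is missing in four rows and pincome is observed in two of them, so the (race, pincome) entry is 0.5. The diagonal is always 0, and a row variable with no missing values yields `NaN` because the denominator is zero.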


# Income and Wealth imputation
set.seed(25)
imputed = mice(impute, method=meth, predictorMatrix=predM, m=10)
Collaborator

I am not sure why we're doing 10 imputation sets; I don't think we need that many. Five or even just three should be enough.

Collaborator Author

Reduced to 5.

```

## Scatter Plots for both parent income and wealth (non-imputed is zero)

Collaborator

The scatter plots reveal some concerning outliers in the imputed sets. The minimum net-worth in the original set is about -$100,000 and some imputed sets have values that approach -$1,000,000. I am not sure what the best way to deal with this is, but one thing we could do is limit min and max values to those from the original set.

It would also be useful to see whisker plots or summary tables with mean, median, min, max, and perhaps 25th and 75th percentile.
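A summary table of this kind could be assembled along the following lines. The sketch uses simulated vectors in place of the observed net worth and the imputed sets; `observed`, `imp1`, and `imp2` are hypothetical names.

```r
# Simulated stand-ins for observed net worth and two imputed sets.
set.seed(3)
observed <- rnorm(100, mean = 50000, sd = 40000)
imputed_sets <- list(
  imp1 = rnorm(30, mean = 50000, sd = 60000),
  imp2 = rnorm(30, mean = 50000, sd = 60000)
)

# Min, quartiles, mean, and max for one vector of values.
summarise_values <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(min = min(x, na.rm = TRUE), p25 = q[1], median = q[2],
    mean = mean(x, na.rm = TRUE), p75 = q[3], max = max(x, na.rm = TRUE))
}

# One row per set, so observed and imputed values can be compared side by side.
tab <- t(sapply(c(list(observed = observed), imputed_sets), summarise_values))
print(round(tab))
```

Comparing the min and max columns across rows makes outliers in the imputed sets easy to spot.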

Collaborator Author

Added whisker plots! I think this is a great flag. Unfortunately, it does not seem that the mice package has this as built-in functionality, so we will have to get creative.

I think it would be pretty simple to do this in post-processing, by taking all values that fall outside the min and max and replacing them with the data set min and max. However, this doesn't seem to be what we would ideally want, which is to bound the min and max before the imputation so that the values are generated within those limits. I think we may need to think about a custom function with mice embedded in it.
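The post-processing option could be as small as the sketch below, which clamps imputed values to the observed range. The function name `clamp_to_observed` and the example numbers are hypothetical; the numbers only echo the magnitudes mentioned in the thread.

```r
# Clamp imputed values to the range of the observed (non-missing) values.
clamp_to_observed <- function(imputed, observed) {
  lo <- min(observed, na.rm = TRUE)
  hi <- max(observed, na.rm = TRUE)
  pmin(pmax(imputed, lo), hi)
}

# Made-up numbers echoing the magnitudes discussed above.
observed_networth <- c(-100000, 5000, 80000, 600000)
imputed_networth  <- c(-950000, 20000, 750000)

clamped <- clamp_to_observed(imputed_networth, observed_networth)
print(clamped)
```

Clamping after the fact piles values up at the bounds, which is the drawback noted above. The mice documentation also describes a `post` argument and a `squeeze()` helper for constraining imputed values while the algorithm runs; whether that fits this case would need to be checked against the package documentation.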
