Imputing wealth/parent income and Removing parent savings from parent net worth in college enrollment count #11

storresrod · 2023-11-09T07:47:26Z

Addressing Issue #1 by imputing wealth/parent income in new script nlsy_impute.

Main question for team discussion in Issue #1: The current imputation uses a limited set of predictor variables, including student age, sex, and a combined race and ethnicity variable. I originally attempted including a larger set of predictor variables, but multiple imputation with the mice package is sensitive to predictor variables with a lot of missingness and to highly correlated predictors. At the same time, it would be good to review literature to see what types of predictors are traditionally included when imputing missing wealth and income, and to ensure we have sufficient predictor variables for imputation to be more accurate.

Addressing Issue #3 by removing parent savings from college enrollment count. Adapted the nlsy_lib script and the nlsy_get_col_stat_annual_df function. Utilized newly imputed net worth and subtracted net savings. If needed, this can be replicated in the function which only looks at fall semester enrollment.

Also, I created a new joint race and ethnicity variable in the nlsy_lib script.


dc0sic · 2023-11-09T14:33:09Z

Thank you, Sonia! I have a couple of thoughts regarding the predictors used in imputation.

Predictors should be mostly parents' characteristics and not youth's, except maybe for race and ethnicity, because youth's race captures parents' race. Aside from race, parents' characteristics should include their age and education, measured as of the time of the interview (i.e., in 1997). NLSY has mother's age at youth's birth (although it's not in our current sample) but I couldn't find father's age. The survey also has parents' highest grade completed. Some other potential predictors are indicators for renting vs. owning a home, and having retirement savings.

We should also get a better understanding of what parents' income and wealth contain for youths who don't live with both parents, and include variables that indicate such living arrangements.

To verify that predictors are significant and to check for collinearity, you could run a regression of the variable being predicted (e.g. wealth) on the set of predictors being considered, while restricting the sample to those with non-missing wealth observations.
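A minimal sketch of this check, using base R only. The data are simulated and the column names (`mom_age`, `owns_home`, `pnetworth`) are illustrative assumptions, not the project's actual variables:

```r
# Toy data standing in for the NLSY sample (hypothetical values and names).
set.seed(1)
n <- 200
toy <- data.frame(
  mom_age   = round(rnorm(n, mean = 40, sd = 6)),
  owns_home = rbinom(n, 1, 0.6)
)
toy$pnetworth <- 20000 + 1500 * toy$mom_age + 60000 * toy$owns_home +
  rnorm(n, sd = 30000)
toy$pnetworth[sample(n, 50)] <- NA  # wealth is missing for 50 youths

# Restrict the sample to rows with observed wealth, then regress wealth
# on the candidate predictors.
observed <- toy[!is.na(toy$pnetworth), ]
fit <- lm(pnetworth ~ mom_age + owns_home, data = observed)

# Coefficient table: estimate, standard error, t statistic, p-value.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))
```

Predictors with large p-values are candidates for exclusion. Unusually large standard errors on predictors that are expected to matter can be a first hint of collinearity.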

We should also think of steps that would help us validate imputation results. For example, plotting the distributions of non-missing and imputed values would be helpful. Also, because we are imputing income and wealth jointly, a scatterplot of these two variables would be useful.

I hope this is helpful. Aside from these thoughts on imputation, I have some questions regarding changes in nlsy_lib.R. I will leave them as inline comments.

NLSY/nlsy_lib.R

dc0sic

I would like to discuss some of the changes in nlsy_lib.R before we merge this PR.

…earity, and validation plots

storresrod · 2023-11-15T19:20:21Z

Thanks for the comments and support, Damir! I have reverted all suggestions in the lib script, and addressed the comments on the imputation script. Updates on this script include:

Added new parent indicators (such as mom age, homeownership, retirement savings)
Created new dummy variables for relevant categorical variables (like parent education)
Checked for significance on subset of data with non-missing income and wealth. Excluded father education and dummy for deceased parents.
Added new multicollinearity checks (correlation plots, stepwise regression, and VIF). Did not identify major concerns, so proceeded with selected parent indicators.
Created new data frames which show original wealth and income, values for original indicators, and the 10 iterations of newly imputed income and wealth
Added plots comparing imputed and non-imputed values (for both income and wealth)
Added scatter plot comparing income and wealth for non-imputed and each iteration of imputation
If this looks closer to what we would like, we can make a decision about how to call these imputed values in future modelling. Look forward to the team's thoughts!

NLSY/nlsy_impute.R

dc0sic

Thank you, Sonia! This looks good, but if you are moving the imputation code to a Quarto document, maybe we should hold off on the merge. Let me know if you have preferences.

I have substantial comments on two things: missing values in variables used as predictors, and encoding of categorical variables and adding their interactions to the predictor set. I am happy to talk if that would be helpful.

NLSY/nlsy_impute.R

Updates to imputation script, including new encoding for categorical variables, adding interaction terms, and increasing checks for missingness

dc0sic · 2023-12-12T13:14:53Z

NLSY/nlsy_imputation.qmd

+        race_eth = ifelse(hisp == "No" & race == "White", "White NonHispanic", race_eth),
+        race_eth = ifelse(hisp == "Yes" & race == "Black or African American", "Black Hispanic", race_eth),
+        race_eth = ifelse(hisp == "No" & race == "Black or African American", "Black NonHispanic", race_eth),
+        race_eth = ifelse(hisp == "Yes" & race == "American Indian, Eskimo, or Aleut", "NonBW Hispanic", race_eth),


In my previous comment about coding race and ethnicity, I suggested using all the race categories present in the sample because it would increase the predictive power of this variable. Knowing that a person is American Indian or Asian likely provides more information about the person's income and wealth than knowing only that the person is NonBW.

Good flag!! I was still thinking about this from a missingness concern, which is not applicable here. Updated!
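One way to keep every race category is to build the joint label from the two source columns instead of chaining one `ifelse()` per combination. This is only a sketch: `make_race_eth` is a hypothetical helper, and the resulting labels (for example "Black or African American Hispanic") are an assumption that differs from the shortened labels in the diff above.

```r
# Hypothetical helper: cross every race category with Hispanic ethnicity.
make_race_eth <- function(race, hisp) {
  suffix <- ifelse(hisp == "Yes", "Hispanic", "NonHispanic")
  out <- paste(race, suffix)
  out[is.na(race) | is.na(hisp)] <- NA  # keep missing values missing
  out
}

race <- c("White", "Black or African American",
          "American Indian, Eskimo, or Aleut", NA)
hisp <- c("No", "Yes", "Yes", "No")
race_eth <- make_race_eth(race, hisp)
print(race_eth)
```

Categories with very few observations may still need to be pooled before imputation, which is a separate judgment call.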

dc0sic · 2023-12-12T13:15:36Z

NLSY/nlsy_imputation.qmd

+    select(-id, -hispanic, -race, -mom_education, -dad_education, -savings, -home, -both_parents, -par1_deceased, -par2_deceased, - mom_age_birth, -pincome, -pnetworth, -par_dec, -mom_educ_hs,-dad_educ_hs, -hisp,-race_eth) %>%
+    drop_na()
+
+M <- cor(testing_cor)


Showing correlation table would be helpful.

Added the correlation table to be displayed right before the correlation plot
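For reference, a small sketch of how such a table can be displayed and screened. The data are simulated, the 0.8 cutoff is an arbitrary assumption, and only the names `testing_cor` and `M` come from the chunk above:

```r
# Simulated stand-in for testing_cor (columns a, b, c are hypothetical).
set.seed(2)
testing_cor <- data.frame(a = rnorm(100))
testing_cor$b <- testing_cor$a + rnorm(100, sd = 0.1)  # nearly collinear with a
testing_cor$c <- rnorm(100)                             # unrelated to a and b

M <- cor(testing_cor)
print(round(M, 2))  # the correlation table itself

# List each pair once (upper triangle) when |r| exceeds the assumed cutoff.
high <- which(abs(M) > 0.8 & upper.tri(M), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(M)[high[, "row"]],
  var2 = colnames(M)[high[, "col"]],
  r    = M[high]
)
print(flagged)
```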

dc0sic · 2023-12-12T13:16:48Z

NLSY/nlsy_imputation.qmd

+```
+
+### Checking for multi-collinearity - VIF and Successive addition of regressors
+


Some notes here would be helpful, especially with respect to vif, which I am not familiar with.

The documentation has more technical information about VIF: https://rdrr.io/cran/car/man/vif.html.

The Wiki has the formula and steps for calculation: https://en.wikipedia.org/wiki/Variance_inflation_factor. The intuition is that you regress a given variable against all other regressors, take the coefficient of determination (R^2) from that regression, and compute VIF = 1/(1-R^2).

For practical purposes, a rule of thumb is that VIF values above 10 are considered indications of multicollinearity, and it looks like in this code chunk, none of the values come close to that, which is good.
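To make the formula concrete, here is a sketch that computes VIF by hand in base R on simulated data (the variables `x1`, `x2`, `x3` are hypothetical). For numeric predictors like these, the values should agree with what `car::vif` reports for a model containing the same regressors.

```r
set.seed(3)
n <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)  # strongly correlated with x1
x3 <- rnorm(n)                 # independent of the others
X <- data.frame(x1, x2, x3)

# For each variable: regress it on all the others, take R^2,
# then apply VIF = 1 / (1 - R^2).
manual_vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(manual_vif, 2))
```

In this toy setup `x1` and `x2` land well above the rule-of-thumb threshold of 10, while `x3` stays close to 1.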

Thanks Judah!

dc0sic · 2023-12-12T13:24:25Z

NLSY/nlsy_imputation.qmd

+```
+
+## Checking for missings in the predictor variables
+


Same here, some explanation would be helpful. What does this ratio mr/(mr+mm) tell us and how do we use it to make a decision about the imputation process? For example, (race, pincome) is 0.6, and (pincome, race) is 0.987. Does this mean that when pincome is missing, race is also missing in 60% of cases, and when race is missing, pincome is also missing in 98.7% of cases?

The interpretation of mr/(mr+mm) is "percentage of usable cases to impute row variable from column variable." So in the example above, a ratio of 0.6 means that for 60% of observations with missing race, pincome is non-missing and can thus be used in imputation.

Documentation here for reference: https://www.rdocumentation.org/packages/mice/versions/3.16.0/topics/md.pairs

It's a non-intuitive measure, but it's the one used in the documentation. As Judah shared, a lower value indicates higher missingness in the predictor for the outcome of interest.
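A tiny worked example may help with the interpretation. The sketch below rebuilds the `mr` and `mm` counts in base R on made-up data, following the definitions in the md.pairs documentation (row variable missing and column variable observed, and both missing):

```r
# Made-up data: race is missing in 5 of 8 rows, pincome in 3 of 8.
toy <- data.frame(
  race    = c("A", NA, NA, "B", NA, "A", NA, NA),
  pincome = c(10,  20, NA, NA,  30, 40, 50, NA)
)

miss <- is.na(toy) * 1  # 1 where a value is missing
obs  <- 1 - miss        # 1 where a value is observed

mr <- t(miss) %*% obs   # row variable missing, column variable observed
mm <- t(miss) %*% miss  # both variables missing
usable <- mr / (mr + mm)
print(round(usable, 3))
```

Here `usable["race", "pincome"]` is 0.6: of the five rows where race is missing, three have an observed pincome. In the other direction, `usable["pincome", "race"]` is 1/3, because race is observed in only one of the three rows where pincome is missing. A variable with no missing values would produce 0/0 in its row, so the ratio is only meaningful for variables that actually need imputation.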

dc0sic · 2023-12-12T13:29:22Z

NLSY/nlsy_imputation.qmd

+
+# Income and Wealth imputation
+set.seed(25)
+imputed = mice(impute, method=meth, predictorMatrix=predM, m=10)


I am not sure why we're doing 10 imputation sets; I don't think we need that many. Five or even just three should be enough.

Fixed down to 5

dc0sic · 2023-12-12T13:39:32Z

NLSY/nlsy_imputation.qmd

+```
+
+## Scatter Plots for both parent income and wealth (non-imputed is zero)
+


The scatter plots reveal some concerning outliers in the imputed sets. The minimum net-worth in the original set is about -$100,000 and some imputed sets have values that approach -$1,000,000. I am not sure what the best way to deal with this is, but one thing we could do is limit min and max values to those from the original set.

It would also be useful to see whisker plots or summary tables with mean, median, min, max, and perhaps 25th and 75th percentile.

Added whisker plots! I think this is a great flag. Unfortunately, it does not seem the mice package has this as a built-in functionality, so we will have to get creative.

I think it would be pretty simple to do this in post-processing, by taking all values that fall outside the observed min and max and replacing them with the data set min and max. However, this doesn't seem to be what we would ideally want, which is to bound the min and max before the imputation so that the values are generated within those limits. I think we may need to think about a custom function with mice embedded in it.
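The post-processing option described above can be sketched in a few lines of base R. The numbers are made up, and `summary_stats` is a hypothetical helper for the summary table requested earlier in this thread:

```r
observed_nw <- c(-100000, -5000, 0, 25000, 400000)       # toy observed net worth
imputed_nw  <- c(-950000, -20000, 10000, 380000, 2.5e6)  # toy imputed draws

# Clamp imputed values to the range of the observed data.
lo <- min(observed_nw)
hi <- max(observed_nw)
clamped <- pmin(pmax(imputed_nw, lo), hi)
print(clamped)

# Hypothetical helper: min, quartiles, mean, and max in one named vector.
summary_stats <- function(x) {
  c(min = min(x), quantile(x, c(0.25, 0.5, 0.75)), mean = mean(x), max = max(x))
}
print(rbind(observed = summary_stats(observed_nw),
            clamped  = summary_stats(clamped)))
```

Clamping this way piles imputed values up at the two bounds, which is one reason bounding during imputation would be preferable. It may be worth checking whether the `post` argument of `mice()` together with the package's `squeeze()` helper can apply such bounds inside the imputation loop; that is untested here.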

Imputing wealth/parent income and Removing parent savings from colleg…

56a7fb8

…e enrollment count Addressing Issue #1 by imputing wealth/parent income in new script. Addressing Issue #3 by removing parent savings from college enrollment count.

storresrod requested a review from dc0sic November 9, 2023 07:47

storresrod self-assigned this Nov 9, 2023