Imputing wealth/parent income and Removing parent savings from parent net worth in college enrollment count #11
Thank you, Sonia! I have a couple of thoughts regarding the predictors used in imputation.

Predictors should be mostly parents' characteristics and not the youth's, except maybe for race and ethnicity, because the youth's race captures the parents' race. Aside from race, parents' characteristics should include their age and education, measured at the time of the interview (i.e., in 1997). NLSY has mother's age at youth's birth (although it's not in our current sample), but I couldn't find father's age. The survey also has parent's highest grade completed. Some other potential predictors are indicators for renting vs. owning a home and for having retirement savings. We should also get a better understanding of what parents' income and wealth contain for youths who don't live with both parents, and include variables that indicate such living arrangements.

To verify that predictors are significant and to check for collinearity, you could run a regression of the variable being predicted (e.g., wealth) on the set of predictors being considered, while restricting the sample to those with non-missing wealth observations.

We should also think of steps that would help us validate the imputation results. For example, plotting the distributions of non-missing and imputed values would be helpful. Also, because we are imputing income and wealth jointly, a scatterplot of these two variables would be useful.

I hope this is helpful. Aside from these thoughts on imputation, I have some questions regarding changes in nlsy_lib.R. I will leave them as inline comments.
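As a rough illustration of the regression check described above, here is a minimal base-R sketch. The data are simulated, and every column name (`par_educ`, `owns_home`, `race_eth`, `pnetworth`) is a placeholder chosen for the example, not necessarily what the script uses.

```r
# Simulated stand-in for the NLSY sample; all column names are hypothetical.
set.seed(1)
n <- 200
nlsy <- data.frame(
  par_educ  = sample(8:20, n, replace = TRUE),   # parent's highest grade completed
  owns_home = rbinom(n, 1, 0.6),                 # 1 = owns, 0 = rents
  race_eth  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
nlsy$pnetworth <- 5000 * nlsy$par_educ + 40000 * nlsy$owns_home +
  rnorm(n, sd = 30000)
nlsy$pnetworth[sample(n, 50)] <- NA              # make 50 wealth values missing

# Restrict to rows with observed wealth, then regress wealth on the candidates.
observed <- nlsy[!is.na(nlsy$pnetworth), ]
fit <- lm(pnetworth ~ par_educ + owns_home + race_eth, data = observed)

# Coefficient table: estimates, standard errors, t statistics, p-values.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))
```

The p-values in the coefficient table give a first read on which candidates carry information about wealth; the same model object can then be reused for collinearity diagnostics.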
I would like to discuss some of the changes in nlsy_lib.R before we merge this PR.
…earity, and validation plots
Thanks for the comments and support Damir! I have reverted all suggestions in the lib script, and addressed the comments on the imputation script. Updates on this script include:
Thank you Sonia! This looks good, but if you are moving the imputation code to a Quarto document, maybe we should wait with the merge. Let me know if you have preferences.
I have substantial comments on two things: missing values in variables used as predictors, and encoding of categorical variables and adding their interactions to the predictor set. I am happy to talk if that would be helpful.
Updates to imputation script, including new encoding for categorical variables, adding interaction terms, and increasing checks for missingness
race_eth = ifelse(hisp == "No" & race == "White", "White NonHispanic", race_eth),
race_eth = ifelse(hisp == "Yes" & race == "Black or African American", "Black Hispanic", race_eth),
race_eth = ifelse(hisp == "No" & race == "Black or African American", "Black NonHispanic", race_eth),
race_eth = ifelse(hisp == "Yes" & race == "American Indian, Eskimo, or Aleut", "NonBW Hispanic", race_eth),
In my previous comment about coding race and ethnicity, I suggested using all the race categories present in the sample because it would increase the predictive power of this variable. Knowing that a person is American Indian or Asian likely provides more information about the person's income and wealth than knowing only that the person is NonBW.
Good flag!! I was still thinking about this from a missingness concern, which is not applicable here. Updated!
select(-id, -hispanic, -race, -mom_education, -dad_education, -savings, -home, -both_parents, -par1_deceased, -par2_deceased, -mom_age_birth, -pincome, -pnetworth, -par_dec, -mom_educ_hs, -dad_educ_hs, -hisp, -race_eth) %>%
  drop_na()

M <- cor(testing_cor)
Showing the correlation table would be helpful.
Added the correlation table to be displayed right before the correlation plot
### Checking for multi-collinearity - VIF and Successive addition of regressors
Some notes here would be helpful, especially with respect to vif, which I am not familiar with.
- The documentation has more technical information about VIF: https://rdrr.io/cran/car/man/vif.html.
- The Wiki has the formula and steps for calculation: https://en.wikipedia.org/wiki/Variance_inflation_factor. The intuition is that you regress a given predictor on all the other regressors, take the coefficient of determination (R^2) from that regression, and compute VIF = 1/(1-R^2).
- For practical purposes, a rule of thumb is that VIF values above 10 are considered indications of multicollinearity, and it looks like in this code chunk, none of the values come close to that, which is good.
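To make the calculation concrete, here is a small base-R sketch that computes VIF by hand on simulated data, following the regress-then-invert steps above. The variables `x1`, `x2`, `x3` are made up for the illustration; for purely numeric predictors this should agree with what `car::vif` reports, though I have not checked it against the chunk in this PR.

```r
# Simulated predictors: x3 is built to be nearly a copy of x1.
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.1)
X  <- data.frame(x1, x2, x3)

# For each predictor: regress it on the others, take R^2, compute 1 / (1 - R^2).
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

Here `x1` and `x3` come out far above the rule-of-thumb threshold of 10, while the unrelated `x2` stays close to 1.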
Thanks Judah!
## Checking for missings in the predictor variables
Same here, some explanation would be helpful. What does the ratio mr/(mr+mm) tell us, and how do we use it to make a decision about the imputation process? For example, (race, pincome) is 0.6, and (pincome, race) is 0.987. Does this mean that when pincome is missing, race is also missing in 60% of cases, and when race is missing, pincome is also missing in 98.7% of cases?
The interpretation of mr/(mr+mm) is "percentage of usable cases to impute row variable from column variable." So in the example above, a ratio of 0.6 means that for 60% of observations with missing race, pincome is non-missing and can thus be used in imputation.
Documentation here for reference: https://www.rdocumentation.org/packages/mice/versions/3.16.0/topics/md.pairs
It's a non-intuitive measure, but it's the one used in the documentation. As Judah shared, a lower value indicates higher missingness in the predictor for the outcome of interest.
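A toy example may help with reading the ratio. The sketch below rebuilds the mr and mm counts in base R, which is how I understand `md.pairs` to define them (row variable missing with column variable observed, and both missing). The column names mirror the thread, but the values are made up.

```r
# Toy data; values are invented for the illustration.
toy <- data.frame(
  race    = c("A", NA, "B", NA, "A", NA, "B", "A"),
  pincome = c(10,  20, NA,  NA, 30,  40, 60,  50)
)

r  <- !is.na(toy)          # TRUE where a value is observed
mr <- t(!r) %*% r          # row variable missing, column variable observed
mm <- t(!r) %*% (!r)       # row and column variables both missing

# Share of cases with the row variable missing where the column variable is usable.
usable <- mr / (mr + mm)
print(round(usable, 3))
```

In this toy data race is missing in three rows and pincome is observed in two of them, so the (race, pincome) entry is 2/3. The (pincome, race) entry is 1/2, because pincome is missing in two rows and race is observed in only one of those. The diagonal is always 0, since a variable cannot be used to impute its own missing values.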
# Income and Wealth imputation
set.seed(25)
imputed = mice(impute, method=meth, predictorMatrix=predM, m=10)
I am not sure why we're doing 10 imputation sets; I don't think we need that many. Five or even just three should be enough.
Fixed down to 5
## Scatter Plots for both parent income and wealth (non-imputed is zero)
The scatter plots reveal some concerning outliers in the imputed sets. The minimum net-worth in the original set is about -$100,000 and some imputed sets have values that approach -$1,000,000. I am not sure what the best way to deal with this is, but one thing we could do is limit min and max values to those from the original set.
It would also be useful to see whisker plots or summary tables with mean, median, min, max, and perhaps 25th and 75th percentile.
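For the summary table, something along these lines might be enough. It is a base-R sketch with made-up numbers; `observed`, `imputed_1`, and `imputed_2` are placeholders for the original net worth values and two of the completed sets.

```r
# Made-up net worth values standing in for the observed data and two imputed sets.
sets <- list(
  observed  = c(-100000, 5000, 20000, 60000, 750000),
  imputed_1 = c(-950000, 12000, 30000, 80000, 700000),
  imputed_2 = c(-120000, 8000, 25000, 90000, 2000000)
)

# summary() returns min, 1st quartile, median, mean, 3rd quartile, and max;
# sapply() lines the sets up as columns of one table.
tab <- sapply(sets, summary)
print(tab)
```

Reading across the Min. and Max. rows makes it easy to spot imputed sets whose range departs from the observed one.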
Added whisker plots! I think this is a great flag. Unfortunately, it does not seem the mice package has this as built-in functionality, so we will have to get creative.
I think it would be pretty simple to do this in post-processing, by taking all values that fall outside the min and max and replacing them with the data set min and max. However, this doesn't seem to be what we would ideally want, which is to bound the min and max before the imputation so that the values are generated within those limits. I think we may need to think about what a custom function, with mice embedded in it, would look like.
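The post-processing option could look roughly like this in base R. The vectors are made up, and `observed` and `imputed` are placeholders for the original net worth values and one set of imputed draws.

```r
# Made-up values: observed net worth and one set of imputed draws.
observed <- c(-100000, 5000, 20000, 750000)
imputed  <- c(-950000, 12000, 300000, 2000000)

# Clamp imputed draws to the observed range: raise anything below the
# observed minimum, lower anything above the observed maximum.
bounds  <- range(observed)
clamped <- pmin(pmax(imputed, bounds[1]), bounds[2])
print(clamped)
```

This only truncates after the fact, so values pile up at the bounds. It may also be worth checking whether the `post` argument of `mice()` combined with `squeeze()` can apply the same bounds during the iterations, which would be closer to bounding before imputation.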
Addressing Issue #1 by imputing wealth/parent income in new script nlsy_impute. Main question for team discussion in Issue #1: The current imputation uses a limited set of predictor variables, including student age, sex, and a combined race and ethnicity variable. I originally attempted including a larger set of predictor variables, but multiple imputation with the mice package is sensitive to predictor variables with a lot of missingness and to highly correlated predictors. At the same time, it would be good to review literature to see what types of predictors are traditionally included when imputing missing wealth and income, and to ensure we have sufficient predictor variables for imputation to be more accurate.

Addressing Issue #3 by removing parent savings from college enrollment count. Adapted the nlsy_lib script and the nlsy_get_col_stat_annual_df function. Utilized newly imputed net worth and subtracted net savings. If needed, this can be replicated in the function which only looks at fall semester enrollment.

Also, I created a new joint race and ethnicity variable in the nlsy_lib script.