-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wts
option doesn't automatically subset when a DataFrame has missing values
#255
Comments
Rather than duck typing, we could at least do But I'm afraid there's a more fundamental issue here: missing values are skipped by the code in StatsModels, and AFAIK GLM has no idea about them. We need to change this. Potentially related to JuliaStats/StatsModels.jl#32. |
I think it would be nice if we could leave missing values out of this package. I don't know if it is practical but I think we should try to come up with a way to handle this at the StatsModels level such that the weights vector doesn't have any missing values once it is here in this package. @kleinschmidt Have you thought about weights? |
I know this has been repeated a lot, but ultimately, what I would really like is for a |
It sounds a bit unusual to put weights in the formula, but why not... In terms of implementation that's not really simple though since that means GLM needs to be aware of StatsModels to extract the weights from the object it returns. That's completely different from the current system. Whatever the chosen design, GLM doesn't and shouldn't have to deal with missing values in weights. What it would need to know is which elements have to be dropped because they have been skipped by StatsModels, or directly get the subsetted weights vector from StatsModels. |
Could you expand on this? A weight is just like a covariate: it's a vector of observed data you use to estimate a coefficient. Both clustering and weights aren't separate from covariates in my mind. |
The reason for having an intermediate step of Original Data Frame -> Model Frame -> Model Matrix in R was to isolate operations like skipping rows with missing data, only considering variables that are used in the model, in the first step. Evaluations of model matrices, model response, constrasts, weights, etc. are done after that step. |
That was also the reason for the bizarre "Standard Non-Standard R Evaluation Model" that I mentioned in a slack discussion with @kleinschmidt . The idea was to have all the arguments to the model-fitting function available to be examined at the time of creating a model frame. Of course R also has a delayed evaluation model which would allow these kinds of shenanigans but means that, effectively, it is impossible to compile an R function effectively because you can't tell what the types of the arguments are going to be until you are in the middle of a function. So I am not advocating this behavior - I am just saying that it has special treatment in R. |
That makes sense, thanks.
I guess my point is a philosophical one, in that weights are just as much as part of how I imagine a "model" as any covariate. So they should be included in the |
Currently GLM just takes the model matrix from StatsModels, and combines it with the weights which are provided directly. If weights are moved to the formula, GLM will have to take also take them from StatsModels. Or StatsModels will have to call GLM and pass it the weights, for example as a keyword argument. The decisive point is whether GLM needs to be aware of the existence of StatsModels or not (currently it isn't, but see JuliaStats/StatsModels.jl#32). |
I see. StatsModels just knows how to make Sounds good. Let me know if there are any good introductory PRs for this if we decide to go that direction. |
Maybe one simple solution after JuliaStats/StatsModels.jl#71 would be to allow packages to declare that a special function term (like |
I think it would definitely be possible to treat weights that way (using a special function within the formula) but it feels a little weird that the syntax would look so different. Then again, the Along those lines it might also be worth thinking about a general kind of "data transformer" component for the model fitting API, which could be a formula but could be something else too. It seems like one of the components of that would have to be expanding data in "raw" form (e.g., a table) into a combination of positional and/or keyword arguments which get spliced into calls to predict, fit, etc. |
Also, to @dmbates comment about the motivation for the "standard non-standard evaluation", that might motivate having a separate But perhaps that could expand to something like |
Is this actually breaking? Currently we don't support missing values in weights, so if we make this work after 2.0 nothing should break. |
If I have a weight variable that's part of the DataFrame, I might want to do this:
However this doens't work if
x
ory
contain missing values.glm
should subset a weights vector the same way it subsets the DataFrame.Additionally, if a weight vector is of type
Union{T, Missing}
GLM throws the following error:It seems like GLM should use duck-typing here, and chug along even if the weight vector could conceivably contain missing values.
The text was updated successfully, but these errors were encountered: