Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wts option doesn't automatically subset when a DataFrame has missing values #255

Open
pdeffebach opened this issue Sep 18, 2018 · 15 comments

Comments

@pdeffebach
Copy link
Contributor

If I have a weight variable that's part of the DataFrame, I might want to do this:

glm(@formula(y ~ x), df, ..., wts = df[:weight]

However this doens't work if x or y contain missing values.

ERROR: LoadError: DimensionMismatch("wts must have length 4011 or length 0 but was 9447")

glm should subset a weights vector the same way it subsets the DataFrame.

Additionally, if a weight vector is of type Union{T, Missing} GLM throws the following error:

ERROR: LoadError: TypeError: in #fit, in typeassert, expected Array{Float64,1}, got Array{Union{Missing, Float64},1}

It seems like GLM should use duck-typing here, and chug along even if the weight vector could conceivably contain missing values.

@nalimilan
Copy link
Member

Rather than duck typing, we could at least do <:Union{Real,Missing} (as in many other places in JuliaStats packages).

But I'm afraid there's a more fundamental issue here: missing values are skipped by the code in StatsModels, and AFAIK GLM has no idea about them. We need to change this. Potentially related to JuliaStats/StatsModels.jl#32.

@andreasnoack
Copy link
Member

I think it would be nice if we could leave missing values out of this package. I don't know if it is practical but I think we should try to come up with a way to handle this at the StatsModels level such that the weights vector doesn't have any missing values once it is here in this package. @kleinschmidt Have you thought about weights?

@pdeffebach
Copy link
Contributor Author

I know this has been repeated a lot, but ultimately, what I would really like is for a weight and cluster to be inside the formula object and have all of those handled at once. If that's feasible that seems like the best solution in the long run.

@nalimilan
Copy link
Member

It sounds a bit unusual to put weights in the formula, but why not... In terms of implementation that's not really simple though since that means GLM needs to be aware of StatsModels to extract the weights from the object it returns. That's completely different from the current system.

Whatever the chosen design, GLM doesn't and shouldn't have to deal with missing values in weights. What it would need to know is which elements have to be dropped because they have been skipped by StatsModels, or directly get the subsetted weights vector from StatsModels.

@pdeffebach
Copy link
Contributor Author

GLM needs to be aware of StatsModels to extract the weights from the object it returns.

Could you expand on this? A weight is just like a covariate: it's a vector of observed data you use to estimate a coefficient. Both clustering and weights aren't separate from covariates in my mind.

@dmbates
Copy link
Contributor

dmbates commented Sep 18, 2018

The reason for having an intermediate step of Original Data Frame -> Model Frame -> Model Matrix in R was to isolate operations like skipping rows with missing data, only considering variables that are used in the model, in the first step. Evaluations of model matrices, model response, constrasts, weights, etc. are done after that step.

@dmbates
Copy link
Contributor

dmbates commented Sep 18, 2018

That was also the reason for the bizarre "Standard Non-Standard R Evaluation Model" that I mentioned in a slack discussion with @kleinschmidt . The idea was to have all the arguments to the model-fitting function available to be examined at the time of creating a model frame. Of course R also has a delayed evaluation model which would allow these kinds of shenanigans but means that, effectively, it is impossible to compile an R function effectively because you can't tell what the types of the arguments are going to be until you are in the middle of a function. So I am not advocating this behavior - I am just saying that it has special treatment in R.

@pdeffebach
Copy link
Contributor Author

That makes sense, thanks.

Evaluations of model matrices, model response, constrasts, weights, etc. are done after that step.

I guess my point is a philosophical one, in that weights are just as much as part of how I imagine a "model" as any covariate. So they should be included in the @formula call.

@nalimilan
Copy link
Member

Could you expand on this? A weight is just like a covariate: it's a vector of observed data you use to estimate a coefficient. Both clustering and weights aren't separate from covariates in my mind.

Currently GLM just takes the model matrix from StatsModels, and combines it with the weights which are provided directly. If weights are moved to the formula, GLM will have to take also take them from StatsModels. Or StatsModels will have to call GLM and pass it the weights, for example as a keyword argument. The decisive point is whether GLM needs to be aware of the existence of StatsModels or not (currently it isn't, but see JuliaStats/StatsModels.jl#32).

@pdeffebach
Copy link
Contributor Author

I see. StatsModels just knows how to make (X, y). This is all @formula serves to keep track of. And the whole pipeline would need to be expanded to allow (X, y, weight, cluster, strata).

Sounds good. Let me know if there are any good introductory PRs for this if we decide to go that direction.

@nalimilan
Copy link
Member

Maybe one simple solution after JuliaStats/StatsModels.jl#71 would be to allow packages to declare that a special function term (like weights) should be passed as a keyword argument to the package's fitting function after dropping observations with missing values, or stored as a field in the model, using the same name. That would be general enough to accommodate the needs of various model families.

@kleinschmidt
Copy link
Member

I think it would definitely be possible to treat weights that way (using a special function within the formula) but it feels a little weird that the syntax would look so different. Then again, the fit(GLM, X, y) syntax looks very different from the fit(GLM, @formula( y ~ x1 + x2 + x3)) so maybe that's not such a big deal. Another possibility would be to allow KW args in the formula expression (as in JuliaStats/StatsModels.jl#21).

Along those lines it might also be worth thinking about a general kind of "data transformer" component for the model fitting API, which could be a formula but could be something else too. It seems like one of the components of that would have to be expanding data in "raw" form (e.g., a table) into a combination of positional and/or keyword arguments which get spliced into calls to predict, fit, etc.

@kleinschmidt
Copy link
Member

kleinschmidt commented Sep 19, 2018

Also, to @dmbates comment about the motivation for the "standard non-standard evaluation", that might motivate having a separate @model or @fit API like @fit(GLM, y ~ x1+x2+x3, weights=~wt, another_kw=1) that similarly has access to all the arguments; there are problems there with being able to store the extracted transformations so that they can be re-used for predict later, unless a wrapper type is created.

But perhaps that could expand to something like fit(GLM, Deferred(), Deferred(), weights=Deferred(), another_kw=1, transformer=@formula(...)) where Defferred is some kind of singleton type like Missing that tells fit to get values from the transformer or something I dunno just spitballin'

@nalimilan
Copy link
Member

Is this actually breaking? Currently we don't support missing values in weights, so if we make this work after 2.0 nothing should break.

@andreasnoack
Copy link
Member

I didn't analyze the implications. I just assumed that somebody else had concluded that it was breaking since it was listed in #500. @palday do you know if this is breaking or can we move it off the 2.0 milestone?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants