`wts` option doesn't automatically subset when a DataFrame has missing values #255

pdeffebach · 2018-09-18T19:25:50Z

If I have a weight variable that's part of the DataFrame, I might want to do this:

glm(@formula(y ~ x), df, ..., wts = df[:weight]

However this doens't work if x or y contain missing values.

ERROR: LoadError: DimensionMismatch("wts must have length 4011 or length 0 but was 9447")

glm should subset a weights vector the same way it subsets the DataFrame.

Additionally, if a weight vector is of type Union{T, Missing} GLM throws the following error:

ERROR: LoadError: TypeError: in #fit, in typeassert, expected Array{Float64,1}, got Array{Union{Missing, Float64},1}

It seems like GLM should use duck-typing here, and chug along even if the weight vector could conceivably contain missing values.

The text was updated successfully, but these errors were encountered:

nalimilan · 2018-09-18T20:00:01Z

Rather than duck typing, we could at least do <:Union{Real,Missing} (as in many other places in JuliaStats packages).

But I'm afraid there's a more fundamental issue here: missing values are skipped by the code in StatsModels, and AFAIK GLM has no idea about them. We need to change this. Potentially related to JuliaStats/StatsModels.jl#32.

andreasnoack · 2018-09-18T20:10:17Z

I think it would be nice if we could leave missing values out of this package. I don't know if it is practical but I think we should try to come up with a way to handle this at the StatsModels level such that the weights vector doesn't have any missing values once it is here in this package. @kleinschmidt Have you thought about weights?

pdeffebach · 2018-09-18T20:11:52Z

I know this has been repeated a lot, but ultimately, what I would really like is for a weight and cluster to be inside the formula object and have all of those handled at once. If that's feasible that seems like the best solution in the long run.

nalimilan · 2018-09-18T20:27:03Z

It sounds a bit unusual to put weights in the formula, but why not... In terms of implementation that's not really simple though since that means GLM needs to be aware of StatsModels to extract the weights from the object it returns. That's completely different from the current system.

Whatever the chosen design, GLM doesn't and shouldn't have to deal with missing values in weights. What it would need to know is which elements have to be dropped because they have been skipped by StatsModels, or directly get the subsetted weights vector from StatsModels.

pdeffebach · 2018-09-18T20:30:22Z

GLM needs to be aware of StatsModels to extract the weights from the object it returns.

Could you expand on this? A weight is just like a covariate: it's a vector of observed data you use to estimate a coefficient. Both clustering and weights aren't separate from covariates in my mind.

dmbates · 2018-09-18T20:43:52Z

The reason for having an intermediate step of Original Data Frame -> Model Frame -> Model Matrix in R was to isolate operations like skipping rows with missing data, only considering variables that are used in the model, in the first step. Evaluations of model matrices, model response, constrasts, weights, etc. are done after that step.

dmbates · 2018-09-18T20:47:53Z

That was also the reason for the bizarre "Standard Non-Standard R Evaluation Model" that I mentioned in a slack discussion with @kleinschmidt . The idea was to have all the arguments to the model-fitting function available to be examined at the time of creating a model frame. Of course R also has a delayed evaluation model which would allow these kinds of shenanigans but means that, effectively, it is impossible to compile an R function effectively because you can't tell what the types of the arguments are going to be until you are in the middle of a function. So I am not advocating this behavior - I am just saying that it has special treatment in R.

pdeffebach · 2018-09-18T20:54:06Z

That makes sense, thanks.

Evaluations of model matrices, model response, constrasts, weights, etc. are done after that step.

I guess my point is a philosophical one, in that weights are just as much as part of how I imagine a "model" as any covariate. So they should be included in the @formula call.

nalimilan · 2018-09-18T20:56:42Z

Could you expand on this? A weight is just like a covariate: it's a vector of observed data you use to estimate a coefficient. Both clustering and weights aren't separate from covariates in my mind.

Currently GLM just takes the model matrix from StatsModels, and combines it with the weights which are provided directly. If weights are moved to the formula, GLM will have to take also take them from StatsModels. Or StatsModels will have to call GLM and pass it the weights, for example as a keyword argument. The decisive point is whether GLM needs to be aware of the existence of StatsModels or not (currently it isn't, but see JuliaStats/StatsModels.jl#32).

pdeffebach · 2018-09-18T20:59:32Z

I see. StatsModels just knows how to make (X, y). This is all @formula serves to keep track of. And the whole pipeline would need to be expanded to allow (X, y, weight, cluster, strata).

Sounds good. Let me know if there are any good introductory PRs for this if we decide to go that direction.

nalimilan · 2018-09-18T21:04:41Z

Maybe one simple solution after JuliaStats/StatsModels.jl#71 would be to allow packages to declare that a special function term (like weights) should be passed as a keyword argument to the package's fitting function after dropping observations with missing values, or stored as a field in the model, using the same name. That would be general enough to accommodate the needs of various model families.

kleinschmidt · 2018-09-19T18:29:19Z

I think it would definitely be possible to treat weights that way (using a special function within the formula) but it feels a little weird that the syntax would look so different. Then again, the fit(GLM, X, y) syntax looks very different from the fit(GLM, @formula( y ~ x1 + x2 + x3)) so maybe that's not such a big deal. Another possibility would be to allow KW args in the formula expression (as in JuliaStats/StatsModels.jl#21).

Along those lines it might also be worth thinking about a general kind of "data transformer" component for the model fitting API, which could be a formula but could be something else too. It seems like one of the components of that would have to be expanding data in "raw" form (e.g., a table) into a combination of positional and/or keyword arguments which get spliced into calls to predict, fit, etc.

kleinschmidt · 2018-09-19T18:32:05Z

Also, to @dmbates comment about the motivation for the "standard non-standard evaluation", that might motivate having a separate @model or @fit API like @fit(GLM, y ~ x1+x2+x3, weights=~wt, another_kw=1) that similarly has access to all the arguments; there are problems there with being able to store the extracted transformations so that they can be re-used for predict later, unless a wrapper type is created.

But perhaps that could expand to something like fit(GLM, Deferred(), Deferred(), weights=Deferred(), another_kw=1, transformer=@formula(...)) where Defferred is some kind of singleton type like Missing that tells fit to get values from the transformer or something I dunno just spitballin'

nalimilan · 2024-11-21T08:09:42Z

Is this actually breaking? Currently we don't support missing values in weights, so if we make this work after 2.0 nothing should break.

andreasnoack · 2024-11-21T08:55:38Z

I didn't analyze the implications. I just assumed that somebody else had concluded that it was breaking since it was listed in #500. @palday do you know if this is breaking or can we move it off the 2.0 milestone?

pdeffebach mentioned this issue Sep 19, 2018

Multipart Formula JuliaStats/StatsModels.jl#21

Open

palday mentioned this issue Sep 29, 2022

Version 2.0 Breaking Changes #500

Open

andreasnoack added bug breaking labels Nov 20, 2024

andreasnoack added this to the Release 2.0 milestone Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`wts` option doesn't automatically subset when a DataFrame has missing values #255

`wts` option doesn't automatically subset when a DataFrame has missing values #255

pdeffebach commented Sep 18, 2018

nalimilan commented Sep 18, 2018

andreasnoack commented Sep 18, 2018

pdeffebach commented Sep 18, 2018

nalimilan commented Sep 18, 2018

pdeffebach commented Sep 18, 2018

dmbates commented Sep 18, 2018

dmbates commented Sep 18, 2018

pdeffebach commented Sep 18, 2018

nalimilan commented Sep 18, 2018

pdeffebach commented Sep 18, 2018

nalimilan commented Sep 18, 2018

kleinschmidt commented Sep 19, 2018

kleinschmidt commented Sep 19, 2018 •

edited

Loading

nalimilan commented Nov 21, 2024

andreasnoack commented Nov 21, 2024

wts option doesn't automatically subset when a DataFrame has missing values #255

wts option doesn't automatically subset when a DataFrame has missing values #255

Comments

pdeffebach commented Sep 18, 2018

nalimilan commented Sep 18, 2018

andreasnoack commented Sep 18, 2018

pdeffebach commented Sep 18, 2018

nalimilan commented Sep 18, 2018

pdeffebach commented Sep 18, 2018

dmbates commented Sep 18, 2018

dmbates commented Sep 18, 2018

pdeffebach commented Sep 18, 2018

nalimilan commented Sep 18, 2018

pdeffebach commented Sep 18, 2018

nalimilan commented Sep 18, 2018

kleinschmidt commented Sep 19, 2018

kleinschmidt commented Sep 19, 2018 • edited Loading

nalimilan commented Nov 21, 2024

andreasnoack commented Nov 21, 2024

`wts` option doesn't automatically subset when a DataFrame has missing values #255

`wts` option doesn't automatically subset when a DataFrame has missing values #255

kleinschmidt commented Sep 19, 2018 •

edited

Loading