Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Taking weighting seriously #487

Open
wants to merge 117 commits into
base: master
Choose a base branch
from

Conversation

gragusa
Copy link

@gragusa gragusa commented Jul 15, 2022

This PR addresses several problems with the current GLM implementation.

Current status
In master, GLM/LM only accepts weights through the keyword wts. These weights are implicitly frequency weights.

With this PR
FrequencyWeights, AnalyticWeights, and ProbabilityWeights are possible. The API is the following

## Frequency Weights
lm(@formula(y~x), df; wts=fweights(df.wts)
## Analytic Weights
lm(@formula(y~x), df; wts=aweights(df.wts)
## ProbabilityWeights
lm(@formula(y~x), df; wts=pweights(df.wts)

The old behavior -- passing a vector wts=df.wts is deprecated and for the moment, the array os coerced df.wts to FrequencyWeights.

To allow dispatching on the weights, CholPred takes a parameter T<:AbstractWeights. The unweighted LM/GLM has UnitWeights as the parameter for the type.

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool = false). The new signature for these two methods is pending in StatsApi.

There are many changes that I had to make to make everything work. Tests are passing, but some new feature needs new tests. Before implementing them, I wanted to ensure that the approach taken was liked.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I arrived to the conclusion that it does not make sense to have a keyword argument weighted. Thus I will amend JuliaStats/StatsAPI.jl#16 to remove such a keyword from the signature.

Update

I think I covered all the suggestions/comments with this exception as I have to think about it. Maybe this can be addressed later. The new standard errors (the one for ProbabilityWeights) also work in the rank deficient case (and so does cooksdistance).

Tests are passing and I think they cover everything that I have implemented. Also, added a section in the documentation about using Weights and updated jldoc with the new signature of CholeskyPivoted.

To do:

  • Deal with weighted standard errors with rank deficient designs
  • Document the new API
  • Improve testing

Closes #186.

@codecov-commenter
Copy link

codecov-commenter commented Jul 16, 2022

Codecov Report

Attention: Patch coverage is 79.81073% with 64 lines in your changes missing coverage. Please review.

Project coverage is 86.45%. Comparing base (89493a4) to head (574ec69).

Files with missing lines Patch % Lines
src/glmfit.jl 78.30% 23 Missing ⚠️
src/lm.jl 75.60% 20 Missing ⚠️
src/linpred.jl 84.74% 18 Missing ⚠️
src/glmtools.jl 62.50% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   90.33%   86.45%   -3.89%     
==========================================
  Files           8        8              
  Lines        1107     1277     +170     
==========================================
+ Hits         1000     1104     +104     
- Misses        107      173      +66     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lrnv
Copy link

lrnv commented Jul 20, 2022

Hey,

Would that fix the issue I am having, which is that if rows of the data contains missing values, GLM discard those rows, but does not discard the corresponding values of df.weights and then yells that there are too many weights ?

I think the interfacing should allow for a DataFrame input of weights, that would take care of such things (like it does for the other variables).

@gragusa
Copy link
Author

gragusa commented Jul 20, 2022

Would that fix the issue I am having, which is that if rows of the data contains missing values, GLM discard those rows, but does not discard the corresponding values of df.weights and then yells that there are too many weights ?

not really. But it would be easy to make this a feature. But before digging further on this I would like to know whether there is consensus on the approach of this PR.

@alecloudenback
Copy link

alecloudenback commented Aug 14, 2022

FYI this appears to fix #420; a PR was started in #432 and the author closed for lack of time on their part to investigate CI failures.

Here's the test case pulled from #432 which passes with the in #487.

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(100)
    x1_2 = 3 * x1
    x2 = 10 * randn(100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn() * x1 + randn() * x2 + 2 * randn(100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X)\X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end

Can this test set be added?

Is there any other feedback for @gragusa ? It would be great to get this merged if good to go.

@nalimilan
Copy link
Member

Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read the code. Lots of comments, but all of these are minor. The main one is mostly stylistic: in most cases it seems that using if wts isa UnitWeights inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!

What are you thoughts regarding testing? There are a lot of combinations to test and it's not easy to see how to integrate that into the current organization of tests. One way would be to add code for each kind of test to each @testset that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.

src/GLM.jl Outdated Show resolved Hide resolved
src/GLM.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/lm.jl Outdated Show resolved Hide resolved
src/lm.jl Outdated Show resolved Hide resolved
src/lm.jl Outdated Show resolved Hide resolved
src/lm.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Contributor

bkamins commented Aug 31, 2022

A very nice PR. In the tests can we have some test set that compares the results of aweights, fweights, and pweights for the same set of data (coeffs, predictions, covariance matrix of the estimates, p-values etc.).

@andreasnoack
Copy link
Member

It looks like one of the last digits is flipping in a doctests. Would you be able to add a regex filter to that block?

src/linpred.jl Outdated
Comment on lines 351 to 357
"""
nobs(obj::LinearModel)
nobs(obj::GLM)
residuals(obj::LinPredModel; weighted::Bool=false) = residuals(obj.rr; weighted=weighted)

For linear and generalized linear models, returns the number of rows, or,
when prior weights are specified, the sum of weights.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you have removed this docstring, which explains why CI fails when building docs.

@nalimilan
Copy link
Member

Sorry to point this, but a few comments @bkamins and I made are still unresolved AFAICT. Can you have a look? Codecov also indicates that some parts of the code that have been changed are not tested.

@gragusa
Copy link
Author

gragusa commented Dec 13, 2024

@nalimilan I think I addressed all issues and comments.

@nalimilan
Copy link
Member

Thanks and sorry for the delay. I think we're close, but I still see some comments from reviews by @bkamins and I in 2022 which still seem to apply. For example https://github.com/JuliaStats/GLM.jl/pull/487/files#r1032949805, which is an important point to decide.

Also Codecov indicates that only 80% of the diff is tested, ideally it should be 100%, at least for code that was introduced by this PR. For example right below the comment I mentioned there seem to be loglik_apweights_obs methods that are not tested at all. Same for some isweighted, loglikelihood or residuals methods.

docs/src/index.md Outdated Show resolved Hide resolved
docs/src/index.md Outdated Show resolved Hide resolved
src/linpred.jl Outdated
Comment on lines 482 to 488
"""
nobs(obj::LinearModel)
nobs(obj::GLM)

For linear and generalized linear models, returns the number of rows, or,
when prior weights are specified, the sum of weights.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning the sum of weights is only correct when using FrequencyWeights, right? For other weights the number of rows is more appropriate.


r2(obj::LinearModel) = 1 - deviance(obj)/nulldeviance(obj)
adjr2(obj::LinearModel) = 1 - (1 - r²(obj))*(nobs(obj)-hasintercept(obj))/dof_residual(obj)

working_residuals(x::LinearModel) = residuals(x)
working_weights(x::LinearModel) = x.pp.wts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define working_weights(x::LinPred) and call that from here for consistency.

u = residuals(obj)
mse = dispersion(obj,true)
u = residuals(obj; weighted=isweighted(obj))
mse = GLM.dispersion(obj,true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really needed AFAICT?

Suggested change
mse = GLM.dispersion(obj,true)
mse = dispersion(obj,true)

end
nobs(obj::LinPredModel) = nobs(obj.rr)

weights(obj::RegressionModel) = weights(obj.model)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is type piracy and no longer needed anyway in git master as we don't use TableRegressionModel anymore.

Suggested change
weights(obj::RegressionModel) = weights(obj.model)

src/linpred.jl Outdated Show resolved Hide resolved

f, (y, X) = modelframe(f, data, contrasts, LinearModel)
_wts = convert_weights(wts)
_wts = isempty(_wts) ? uweights(length(y)) : _wts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also print a deprecation warning when weights have a different length from y. We don't want to continue accepting empty vectors in the future as people should use UnitWeights instead.

Comment on lines +260 to +262
N = length(m.rr.y)
n = sum(log, wts)
0.5*(n - N * (log(2π * nulldeviance(m)/N) + 1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure this definition is OK for both analytical weights and probability weights? I think we discussed this before, but loglikelihood throws an error for probability weights so I'm surprised that nullloglikelihood doesn't.

@@ -316,8 +355,7 @@ function StatsModels.predict!(res::Union{AbstractVector,
prediction, lower, upper = res
length(prediction) == length(lower) == length(upper) == size(newx, 1) ||
throw(DimensionMismatch("length of vectors in `res` must equal the number of rows in `newx`"))
length(mm.rr.wts) == 0 || error("prediction with confidence intervals not yet implemented for weighted regression")

mm.rr.wts isa UnitWeights || error("prediction with confidence intervals not yet implemented for weighted regression")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mm.rr.wts isa UnitWeights || error("prediction with confidence intervals not yet implemented for weighted regression")
isweighted(mm) && error("prediction with confidence intervals not yet implemented for weighted regression")

gragusa and others added 4 commits December 18, 2024 12:51
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Path towards GLMs with fweights, pweights, and aweights