
Divergence from GLMNet when using a matrix with many variables #78

Open

tiemvanderdeure opened this issue Jan 5, 2024 · 2 comments

@tiemvanderdeure

When fitting models with a large number of variables, Lasso.jl and GLMNet return different paths, and the difference grows as the number of variables increases.

An example to illustrate this:

using Lasso, GLMNet, Statistics

# fits identical models in Lasso and GLMNet from mock data
# and returns the mean absolute difference of the betas of both models
function lasso_glmnet_dif(nrow, ncol, n_col_contributing)
    data = rand(nrow, ncol)
    # per-row mean of the contributing columns determines the outcome probability
    outcome = mean(data[:, 1:n_col_contributing], dims = 2)[:, 1] .> rand(nrow)
    presence_matrix = [1 .- outcome outcome]

    l = Lasso.fit(LassoPath, data, outcome, Binomial())
    g = GLMNet.glmnet(data, presence_matrix, Binomial())

    lcoefs = Vector(l.coefs[:,end])
    gcoefs = g.betas[:, end]

    mean(abs, lcoefs .- gcoefs)
end

# 1000 records, 5 variables that all contribute to outcome
lasso_glmnet_dif(1000, 5, 5) # order of magnitude 1e-9
# 1000 records, 1000 variables of which 5 contribute to the outcome
lasso_glmnet_dif(1000, 1000, 5) # around 0.05

The context for this problem is that I'm working on a Julia implementation of maxnet, where a fairly large model matrix (hundreds of columns) is generated and a lasso path is used to select the most important variables.

@AsafManela
Collaborator

The packages may be generating a different path of regularization lambdas.
You can get them from the lasso path as l.λ, and from GLMNet as g.lambda.
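
For instance, a quick check along these lines (just a sketch, reusing the l and g objects from the example above) would show whether the two lambda sequences actually agree:

# the two paths may have different lengths, so compare only the overlapping part
n = min(length(l.λ), length(g.lambda))
maximum(abs, l.λ[1:n] .- g.lambda[1:n])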

Also, in your example it looks like you are picking the last coefs of the regularization path, but those are not necessarily the most interesting ones. Take a look at the docs here.
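
For example, something like the following (just a sketch using the path fields from your example; the docs also describe segment selectors such as MinAICc for choosing a segment by criterion) picks the segment whose lambda is closest to a chosen value instead of always taking the last one:

# sketch: inspect the coefficients at the segment whose λ is closest to a target
# value (0.01 here is arbitrary), rather than at the end of the path
target = 0.01
i = argmin(abs.(l.λ .- target))
lcoefs_i = Vector(l.coefs[:, i])
gcoefs_i = g.betas[:, i]    # only comparable if both paths used the same λs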

@tiemvanderdeure
Author

Hi, thanks for chipping in.

I don't think the regularization lambdas are where the differences come from. In the maxnet algorithm, the lambdas are generated inside the algorithm itself rather than by the packages, and even if I force the lambdas to be identical I see the same kind of behaviour.

The same goes when I look at some other part of the regularization path (maxnet always takes the last one, but I see your point).

E.g. in this example I take the coefficients halfway along the path and force the lambdas to be identical, and lasso_glmnet_dif(1000, 1000, 5) is still around 0.02.

That seems like too big a difference to come from floating point error, which leads me to think the algorithms are somehow different?

# same as above, but forcing GLMNet to use the λ sequence from Lasso.jl
# and comparing the coefficients halfway along the path
function lasso_glmnet_dif(nrow, ncol, n_col_contributing)
    data = rand(nrow, ncol)
    outcome = mean(data[:, 1:n_col_contributing], dims = 2)[:, 1] .> rand(nrow)
    presence_matrix = [1 .- outcome outcome]

    l = Lasso.fit(LassoPath, data, outcome, Binomial())
    g = GLMNet.glmnet(data, presence_matrix, Binomial(); lambda = l.λ)

    lcoefs = Vector(l.coefs[:, floor(Int, end/2)])
    gcoefs = g.betas[:, floor(Int, end/2)]

    mean(abs, lcoefs .- gcoefs)
end
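
One thing I could still try, to rule out a pure convergence effect, is tightening both solvers' convergence tolerances inside the function above and checking whether the gap shrinks. A rough sketch (the keyword names cd_tol/irls_tol for Lasso.jl and tol for GLMNet are my assumption from the respective docs and may differ between versions):

# sketch: replace the two fit calls above with tighter tolerances and re-check;
# keyword names are assumptions taken from the packages' documentation
l = Lasso.fit(LassoPath, data, outcome, Binomial(); cd_tol = 1e-12, irls_tol = 1e-10)
g = GLMNet.glmnet(data, presence_matrix, Binomial(); lambda = l.λ, tol = 1e-12)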
