Incorrect optimal support size in a multitask learning problem with nonnegativity constraints #510
@tomwenseleers, thanks for your questions! I have reproduced the results based on your pasted code, and I notice that the sample size may be insufficient for recovering such a relatively dense coefficient matrix with 50 non-zero rows. According to A Polynomial Algorithm for Best-Subset Selection Problem, the recommended maximum support size is roughly n / (log(p) * log(log(n))). A direct solution is collecting more samples. In the following example, 3 times the sample size is used:
library(abess)
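# (Illustrative addition, not from the original reply.) The bound above, roughly
# n / (log(p) * log(log(n))), shows why n = p = 500 is under-sampled for a support
# of size 50, while n = 3*p is comfortable:
max_support <- function(n, p) floor(n / (log(p) * log(log(n))))
max_support(500, 500)    # about 44  < 50 -> too few samples
max_support(1500, 500)   # about 121 > 50 -> enough samples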
# simulate blurred multichannel spike train
s <- 0.1 # sparsity (% of timepoints where there is a peak)
p <- 500 # 500 variables
# simulate multichannel blurred spike train with Gaussian noise
sd_noise <- 1
sim <- simulate_spike_train(n=3*p,
p=p,
k=round(s*p), # true support size = 0.1*500 = 50
mean_beta = 10000,
sd_logbeta = 1,
family="gaussian",
sd_noise = sd_noise,
multichannel=TRUE, sparse=TRUE,
seed = 123
)
X <- sim$X # covariate matrix with shifted copies of point spread function, n x p matrix
Y <- sim$y # multichannel signal (blurred spike train), n x m matrix
colnames(X) = paste0("x", 1:ncol(X)) # NOTE: if colnames of X and Y are not set abess gives an error message, maybe fix this?
colnames(Y) = paste0("y", 1:ncol(Y))
true_coefs <- sim$beta_true # true coefficients
m <- ncol(Y) # nr of tasks
n <- nrow(X) # nr of observations
p <- ncol(X) # nr of independent variables (shifted copies of point spread functions)
# best subset multitask learning with family="mgaussian" using abess
abess_fit <- abess(x = X,
y = Y,
family = "mgaussian",
tune.path = "sequence",
lambda=0,
warm.start = TRUE,
tune.type = "gic") # or cv or ebic or bic or aic
plot(abess_fit, type="tune")
beta_abess = as.matrix(extract(abess_fit)$beta) # coefficient matrix for best subset
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_abess)!=0) # 50 peaks detected
real <- as.numeric(rowMax(sim$beta_true)!=0)
est <- as.numeric(rowMin(beta_abess)!=0)
mean(real == est)
So, with 3 times the sample size the 50 true peaks are recovered exactly.
As for the question about multitask identity-link Poisson models, I conducted a simple experiment to check your idea:
library(abess)
# simulate blurred multichannel spike train
s <- 0.1 # sparsity (% of timepoints where there is a peak)
p <- 500 # 500 variables
# simulate multichannel blurred spike train with Gaussian noise
sd_noise <- 1
sim <- simulate_spike_train(n=3*p,
p=p,
k=round(s*p), # true support size = 0.1*500 = 50
mean_beta = 10000,
sd_logbeta = 1,
family="poisson",
sd_noise = sd_noise,
multichannel=TRUE, sparse=TRUE,
seed = 123
)
X <- sim$X # covariate matrix with shifted copies of point spread function, n x p matrix
Y <- sim$y # multichannel signal (blurred spike train), n x m matrix
colnames(X) = paste0("x", 1:ncol(X))
colnames(Y) = paste0("y", 1:ncol(Y))
true_coefs <- sim$beta_true # true coefficients
m <- ncol(Y) # nr of tasks
n <- nrow(X) # nr of observations
p <- ncol(X) # nr of independent variables (shifted copies of point spread functions)
W <- 1/(Y+0.1) # approx 1/variance Poisson observation weights with family="poisson", n x m matrix
# best subset multitask learning with family="mgaussian" using abess
abess_fit <- abess(x = X,
y = Y,
weights = W,
family = "mgaussian",
tune.path = "sequence",
# support.size = c(1:50),
lambda=0,
warm.start = TRUE,
tune.type = "gic") # or cv or ebic or bic or aic
plot(abess_fit, type="tune")
beta_abess = as.matrix(extract(abess_fit)$beta) # coefficient matrix for best subset
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_abess)!=0) # 51 peaks detected
real <- as.numeric(rowMax(sim$beta_true)!=0)
est <- as.numeric(rowMin(beta_abess)!=0)
mean(real == est) # 0.998
The estimated result is very close to the true one. Feel free to contact us if you still have any questions.
Many thanks for that - that's very helpful, and glad to hear it! As a first step we then do this nonnegative group LASSO:
library(glmnet)
system.time(cvfit <- cv.glmnet(X,
Y,
alpha = 1, # lasso
family = "mgaussian",
nlambda = 100,
nfolds = 5,
standardize = FALSE,
standardize.response = FALSE,
intercept = FALSE,
relax = FALSE,
lower.limits = rep(0, ncol(X)+1))) # for nonnegativity constraints
# 26.71s
plot(cvfit)
# Fit the glmnet model over a sequence of lambda values
system.time(fit_glmnet <- glmnet(X,
Y,
alpha = 1, # lasso
family = "mgaussian",
nlambda = 100,
standardize = FALSE,
standardize.response = FALSE,
intercept = FALSE,
relax = FALSE,
lower.limits = rep(0, ncol(X)+1))) # for nonnegativity constraints
# 8.55s
# Select the best lambda value (I am using cvfit$lambda.1se to get slightly sparser solution)
best_lambda <- cvfit$lambda.1se
plot(fit_glmnet, xvar="lambda")
# get the coefficients for each task for the best lambda value
coefs <- coef(fit_glmnet, s = best_lambda)
beta_mgaussian_glmnet <- do.call(cbind, lapply(seq_len(m), function (channel) as.matrix(coefs[[channel]][-1,,drop=F])))
colnames(beta_mgaussian_glmnet) = 1:ncol(beta_mgaussian_glmnet)
beta_mgaussian_glmnet[abs(beta_mgaussian_glmnet)<0.01] = 0 # slight amount of thresholding of small coefficients
image(x=1:nrow(sim$y), y=1:ncol(sim$y), z=beta_mgaussian_glmnet^0.01, col = topo.colors(255),
useRaster=TRUE,
xlab="Time", ylab="Channel", main="nonnegative group Lasso glmnet (red=true support)")
abline(v=(1:nrow(sim$X))[as.vector(rowMax(sim$beta_true)!=0)], col="red")
# abline(v=(1:nrow(sim$X))[as.vector(rowMin(beta_mgaussian_glmnet)!=0)], col="cyan")
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_mgaussian_glmnet)!=0) # 71 peaks detected
Then we do best subset selection among these variables, which are now known to yield nonnegative coefficients, using abess:
library(abess)
preselvars = which(rowMin(beta_mgaussian_glmnet)!=0) # pre-select variables based on nonnegative mgaussian group LASSO
system.time(abess_fit <- abess(x = X[,preselvars,drop=F],
y = Y,
family = "mgaussian",
tune.path = "sequence",
support.size = c(1:length(preselvars)),
lambda=0,
max.splicing.iter = 20,
warm.start = TRUE,
tune.type = "ebic") # or cv or bic or aic or gic
) # 0.29s
extract(abess_fit)$support.size # 55
plot(abess_fit, type="tune")
beta_abess = beta_mgaussian_glmnet
beta_abess[preselvars,] = as.matrix(extract(abess_fit)$beta) # coefficient matrix for best subset
beta_abess[beta_abess < 0] = 0
image(x=1:nrow(sim$y), y=1:ncol(sim$y), z=beta_abess^0.1, col = topo.colors(255),
useRaster=TRUE,
xlab="Time", ylab="Channel", main="abess multitask mgaussian (red=true peaks)")
abline(v=(1:nrow(sim$X))[as.vector(rowMax(sim$beta_true)!=0)], col="red")
abline(v=(1:nrow(sim$X))[as.vector(rowMin(beta_abess)!=0)], col="cyan")
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_abess)!=0) # 55 peaks detected
length(intersect(which(rowMax(sim$beta_true)!=0),
which(rowMin(beta_abess)!=0))) # 45 out of 50 peaks detected
With regard to the calculation of the information criteria - one thing I was not 100% sure about myself was how the effective degrees of freedom should be defined in the multitask learning setting.
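To make that question concrete, here is a small illustration (my own sketch, not abess internals) of how much the two obvious definitions of the effective degrees of freedom differ under a GIC/SIC-style penalty of the form df * log(p) * log(log(n)):
# hypothetical illustration: df counted per selected row vs. per nonzero coefficient
k <- 55                       # selected support size from the fit above
m <- ncol(Y); n <- nrow(X); p <- ncol(X)
df_per_row  <- k              # one df per selected variable (row of the coefficient matrix)
df_per_cell <- k * m          # one df per nonzero coefficient (row x task)
c(df_per_row, df_per_cell) * log(p) * log(log(n))   # the two penalties differ by a factor m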
I will first address your questions.
Finally, we are grateful to you for actively using abess.
Well, I was asking these questions also partly to compare the solution quality of abess with my own method. Right now the method can recover the true support for a noisy multichannel signal like the one simulated above. By way of comparison, a glmnet group LASSO fit selects far too many variables and also has much worse prediction performance.
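As a rough illustration of how such a prediction comparison could be made with the objects from the code above (a sketch added here, not output from the original post):
# in-sample root mean squared error of the two coefficient matrices (no intercepts were fitted)
rmse <- function(B) sqrt(mean((Y - X %*% B)^2))
rmse(beta_abess)              # best-subset solution
rmse(beta_mgaussian_glmnet)   # group-lasso solution; its shrunken coefficients typically fit worse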
Thank you for providing a detailed description of the algorithm and for generously sharing the details of L0glm. Intuitively, clipping appears to be a viable approach, particularly when the true regression coefficients of the variables fall within the specified range and the variables are independent. While this point requires a rigorous proof, preliminary implementations from a software perspective could benefit users. We will consider implementing this feature as soon as possible and will notify you once it is complete.
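A minimal sketch of the "fit then clip" idea under discussion, using plain least squares on a given active set (illustrative only; the eventual abess implementation may differ):
# refit the currently selected variables without constraints, then clip to the feasible range
A <- which(rowMin(beta_abess) != 0)            # active set from the earlier fit, for illustration
beta_A <- qr.solve(X[, A, drop = FALSE], Y)    # unconstrained multitask least-squares refit
beta_A <- pmax(beta_A, 0)                      # clipping: project onto the nonnegative orthant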
In this particular example, my variables are not independent, and yet clipping still seemed to perform very well in recovering the true solution.
Yes. I agree with your point of view that orthogonality is not necessary. I just used it as a direct entry point to understand why clipping is effective. Moreover, thank you for sharing the experimental results with us and explaining the power of clipping. We are willing to implement this in abess.
@tomwenseleers Hello, I want to inform you that we have implemented the clipping operation in C++, which has made the entire iteration process very fast. Currently, this feature is available in the Python software package (see https://github.com/abess-team/abess/pull/516/files), but requires compiling and installing the latest code from GitHub. We will soon extend this feature to R and release it on CRAN, making it more convenient to use. It may not take long, and once it's done, we will notify you immediately. Finally, thank you for providing valuable suggestions for the abess project.
Ha, that's very cool - many thanks, and looking forward to it! Did that result in good performance for this problem, not only in terms of speed but also in terms of support recovery and coefficient estimates? To what problem sizes does it scale? Would the example above also still run well, e.g. for n = p = 10000 (in my application that's my problem size)?
Yes, it also supports the accurate recovery of relevant variables. Additionally, we tested it with n = p = 10000. For simulated data, we were able to compute the results on a personal computer in approximately 1 to 2 minutes. Below is the specific code:
# %%
import numpy as np
import abess
np.random.seed(123)
# %%
%%time
n = 10000
p = 10000
k = 100
coef_real = np.zeros(p)
coef_real[np.random.choice(p, k, replace=False)] = 1
data = abess.make_glm_data(n, p, k, "gaussian", coef_=coef_real)
print(f"n={n}, p={p}, k={k}, ")
# %%
%%time
model = abess.LinearRegression()
model.fit(data.x, data.y, beta_low=0)
# %%
nonzero1 = np.nonzero(model.coef_)[0]
nonzero2 = np.nonzero(data.coef_)[0]
print(f"Successful? {(nonzero1==nonzero2).all()}") I hope our follow-up developments will be helpful for your real-world data applications. |
Many thanks - that sounds really promising, thanks for your work on this! My iterative adaptive ridge method took half a minute for that problem size (I'm currently finishing up the documentation) - it will be interesting to compare the performance in terms of solution quality! I think for small problems abess was faster than my method, whereas for big problems mine was slightly faster. One slight remark perhaps - I see beta_low and beta_high are passed as single values and not as vectors specifying the bounds on each coefficient. Just bear in mind that setting beta_low to zero will then work for gaussian or poisson, but not for binomial or multinomial, as the intercept term there should always be allowed to be unbounded and only the remaining coefficients can be constrained to be nonnegative. But I imagine you've thought of that and taken care of it by not applying box constraints to the intercept term(s)? Another option of course would be to allow the lower and upper bounds to be vectors instead of single values, as in e.g. glmnet. That would also allow the flexibility to support different box constraints on different parameters if desired.
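For reference, glmnet already exposes the vector-valued interface suggested here: lower.limits / upper.limits take a vector of length ncol(X) applied to the slopes only, and the intercept is never bounded. A minimal sketch using the objects from the earlier examples:
library(glmnet)
fit <- glmnet(X, Y[, 1],
              lower.limits = rep(0, ncol(X)),    # nonnegativity on every slope
              upper.limits = rep(Inf, ncol(X)))  # no upper bound; the intercept stays unconstrained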
I'm delighted to inform you that version 0.4.8, which includes the functionality of restricting estimation ranges through truncation, has been successfully uploaded to CRAN. You can now conveniently use it as follows:
library(abess)
n = 10000
p = 10000
k = 100
coef_real = rep(0, p)
coef_real[sample(1:p, k, replace = FALSE)] <- 1
data <- generate.data(n, p, k, "gaussian", beta = coef_real)
abess_fit = abess(
data$x,
data$y,
family="gaussian",
beta.low = 0
)
true_support_set = which(coef_real != 0)
est_support_set = which(coef(abess_fit, support.size = abess_fit[["best.size"]], sparse = FALSE)[-1] != 0)
all(true_support_set == est_support_set)
Thank you for your valuable feedback that contributed to this enhancement!
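A quick extra check (my own sketch, reusing the coef() call from above) that the returned coefficients indeed respect the bound:
est_beta = coef(abess_fit, support.size = abess_fit[["best.size"]], sparse = FALSE)[-1]
all(est_beta >= 0)   # should be TRUE with beta.low = 0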
Many thanks for this! Today I was just testing with the latest version on CRAN (abess_0.4.8), but I am still running into a couple of problems - here is a multitask learning example:
Here I am simulating the multitask case with Gaussian noise, but I am also interested in the multitask case with Poisson noise (and identity link), which I would like to handle by using an observation weight matrix for each outcome.
If I would add
Here I noticed that
This looks like a bug perhaps... Any advice welcome! Especially on whether or not you might allow the weights to be a matrix as opposed to a vector, to also allow me to approximately fit nonnegative identity-link Poisson models. Alternatively, I suppose it would not be possible to allow all standard R GLM families and link functions? E.g. also
Thanks for your feedback! First, there is indeed an old bug here. Regarding your two suggestions - allowing the weights to be a matrix, and specifying exponential family distributions and link functions in a glm-like manner - there is some documentation on how developers can implement new GLM models: users simply need to inherit from the corresponding base model class.
It's worth noting that
First, simulate the multitask case with Gaussian noise.
import skscope
from skscope import ScopeSolver
import abess
import numpy as np
import jax.numpy as jnp
import pandas as pd
np.random.seed(2023)
n = 500
p = 1000
y_dim = 10
k = 100
coef_real = np.zeros((p, y_dim))
nonzero_features = np.random.choice(np.arange(1, p + 1), k, replace=False)
coef_real[nonzero_features - 1, :] = np.maximum(
np.random.normal(1, 1, size=(k, y_dim)), 0
)
data = abess.make_multivariate_glm_data(
n, p, k, family="multigaussian", coef_=coef_real, M=y_dim
)
weight = 1  # /(data.y+0.1) ## matrix weights default to 1
Then, run ScopeSolver.
# ScopeSolver uses a similar algorithm to abess; other solvers may be useful too.
solver = ScopeSolver(
dimensionality=p * y_dim,
sparsity=range(2*k), # too slow to run from 0 to p
sample_size=n,
group=[i for i in range(p) for _ in range(y_dim)],
ic_type="sic",
)
# Key is the objective function, which is squared error loss with weight matrix
solver.solve(
objective=lambda params: jnp.sum(jnp.square(weight * (data.x @ params.reshape((p, y_dim)) - data.y))),
layers=[skscope.layer.NonNegative(dimensionality = p * y_dim)], # non-negative constraint
jit=True,
)
coef_est = solver.get_estimated_params().reshape(p, y_dim)
Finally, print the results.
comparison_data = {
"estimated coefs nonzero": np.sum(coef_est != 0, axis=1) != 0,
"true coefs nonzero": np.sum(coef_real != 0, axis=1) != 0
}
comparison_df = pd.DataFrame(comparison_data)
table = pd.crosstab(comparison_df["estimated coefs nonzero"], comparison_df["true coefs nonzero"])
print(table)
Furthermore, I attempted to write a Poisson regression with an identity link, but the logarithmic term in the loss function necessitates a good initial value for the optimization in each subproblem, to avoid taking the logarithm of a negative number. If there's a way to easily specify good initial values, this code might be completed quickly. Once again, thank you for your friendly and helpful feedback!
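To make the initial-value issue concrete, here is a minimal sketch (written in R, a hypothetical helper that is not part of abess or skscope) of the identity-link Poisson negative log-likelihood; log(mu) is undefined whenever a fitted mean X %*% beta is non-positive, which is why the optimizer needs a feasible (e.g. nonnegative) starting point:
nll_pois_identity <- function(beta, X, Y) {
  mu <- X %*% beta                 # identity link: the mean is linear in the covariates
  if (any(mu <= 0)) return(Inf)    # log(mu) undefined outside the feasible region
  sum(mu - Y * log(mu))            # Poisson negative log-likelihood up to a constant
}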
Many thanks for this - that's very helpful! I was just sort of hoping that allowing the observation weights to be a matrix instead of a vector might be a small change in the code - but of course I fully understand if it's not, and that implementing this change would be too much work right now!
Describe the bug
I was testing abess on the multitask learning problem below (deconvolution of a multichannel signal using a known point spread function / blur kernel, by regressing the multichannel signal on shifted copies of the point spread function), but I am experiencing poor performance: the best subset is inferred to be much larger than it actually is (support size of the best subset = 157, while the true support size of the simulated dataset is 50). Is there a way to resolve this? Does this have to do with the fact that abess currently does not support nonnegativity constraints (all my simulated model coefficients are positive)?

Is there any way to allow for nonnegativity or box constraints? Even if not taken into account during fitting, it could be handy if the coefficients could be clipped within the allowed range, so that the information criteria would at least indicate the correct support size of the best allowed model. Taking the constraints into account during fitting would of course be even better.

I was also interested in fitting a multitask identity-link Poisson model. As currently only the Poisson log link is incorporated, I could potentially still get something close to that by using an observation weights matrix 1/(Y+0.1), which would be approximately 1/Poisson-variance weights, and then using family="mgaussian". Is that allowed by any chance? Can observation weights be a matrix with family="mgaussian", to allow for different observation weights per outcome channel/task? If not, could this be allowed for?
Code for Reproduction
Paste your code for reproducing the bug:
Expected behavior
Couple of questions here: