Wrong results in a large data set with one set of FE #249
Thanks for the report. Does this happen without the survey fixed effects too?
Random thought: with WildBootTests.jl, I have struggled with the lack of a function like Stata's invsym()--a matrix inversion function that is precise and gracefully handles rank-deficient, symmetric, positive-semi-definite matrices like X'X when X doesn't have full column rank. The lack of it leads to strange numerical behavior in sporadic instances. I've found that pinv() is often imprecise, I think even when I follow the recommendation to set […]. Perhaps something like that is happening here?
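For illustration, here is a minimal sketch of what an invsym()-style helper could look like in Julia; the function name and tolerance rule are my own, not an existing API, and it assumes a symmetric positive-semi-definite input such as X'X:

```julia
using LinearAlgebra

# Hypothetical invsym()-like helper: invert a symmetric PSD matrix such as
# X'X, zeroing out directions associated with numerically zero eigenvalues
# (i.e., rank deficiency) instead of blowing up like inv() would.
function invsym_like(A::AbstractMatrix; rtol=size(A, 1) * eps(Float64))
    E = eigen(Symmetric(Matrix(A)))
    thresh = rtol * maximum(abs, E.values)
    # Reciprocal of eigenvalues above the tolerance; zero otherwise.
    vals = [v > thresh ? inv(v) : 0.0 for v in E.values]
    return E.vectors * Diagonal(vals) * E.vectors'
end
```

This is essentially pinv() specialized to the symmetric case; whether its tolerance behaves better than pinv's default on any given problem is an empirical question.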
Don't have anything to contribute on the numerical issues here, but just thought I'd point out that when you've got a formula that long, it's often useful to construct it programmatically; here I think your formula is equivalent to: […]
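As an illustration of the programmatic approach (a sketch only; the dummy-variable names and the skipped year 1961 mirror the formula posted elsewhere in this thread), StatsModels lets you build terms from symbols and sum them:

```julia
using StatsModels

# Sketch: build the long birth-year-dummy formula programmatically rather
# than typing ~90 terms by hand. `term` and `+`/`sum` composition of terms
# are part of the StatsModels API.
years = [1907:1960; 1962:2000]          # note: 1961 is skipped
rhs = sum(term.(Symbol.("_Ibirthyr_", years)))
f = term(:part) ~ rhs
```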
Thanks @nilshg. In fact that long command is programmatically generated. I've written a Stata program called […]
I should have guessed that you're an advanced enough user to know this, but I thought I'd mention it just in case. Have learned a lot over the years from your blogs; good to see you involved here!
I'm actually using some version of invsym to get the basis of a set of vectors: […]
Oh interesting. I was looking at how GLM does it: it calls cholesky() with check = false, zeros out certain entries, then divides X'X into X'Y using LAPACK.potrs!(). https://github.com/JuliaStats/GLM.jl/blob/b1ba4c5fdd5030b98a8cf9fe9c46319e5f5eb20e/src/linpred.jl#L186
@matthieugomez, yes, it happens even with […]. One idea in that discussion is that it is more stable to use the QR decomposition of X than the Cholesky decomposition of X'X. This says that's what R does. That may be the root of the matter.
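The conditioning argument can be seen directly in a toy example (not the thread's data): forming the normal equations squares the condition number, since cond(X'X) = cond(X)^2, so Cholesky on X'X loses roughly twice as many digits as QR on X.

```julia
using LinearAlgebra

# Toy illustration: the condition number of X'X is the square of the
# condition number of X, which is why solving via the normal equations
# is less accurate than factorizing X directly.
X = [1.0 1.0;
     1.0 1.01;
     1.0 2.0]
c  = cond(X)
c2 = cond(X' * X)
@assert isapprox(c2, c^2; rtol=1e-8)
```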
Ah. GLM is gaining a […]
I cannot reproduce this issue on my computer (Mac). Can you try:
using CSV, DataFrames, LinearAlgebra, FixedEffectModels, Plots
df = CSV.read(".../FEbug.csv", DataFrame)
f = @formula(part ~ _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
_Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
_Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
_Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
_Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
_Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
_Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
_Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
_Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
_Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)
X = modelmatrix(f, df)
y = response(f, df)
coef_qr = qr(X) \ y
coef_chol = cholesky!(X' * X) \ (X' * y)
plot([coef_qr coef_chol])
And tell me whether […]
Yes, they look identical, and they look good.
Thanks. Can you try […]
Also, I have tried the original example on a Mac with an M2 and an Intel Mac, and it worked fine on both.
Alas, reg(df, f) is failing: […]
Sorry, just to recap: is this a good summary of what you obtain?
using CSV, DataFrames, LinearAlgebra, FixedEffectModels, Plots
df = CSV.read(".../FEbug.csv", DataFrame)
f = @formula(part ~ _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
_Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
_Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
_Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
_Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
_Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
_Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
_Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
_Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
_Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)
X = modelmatrix(f, df)
y = response(f, df)
# works everywhere
coef_qr = qr(X) \ y
coef_chol = cholesky!(X' * X) \ (X' * y)
# works correctly with Mac Intel or M2 but fails on Linux
reg(df, f)
Yes! (See FixedEffectModels.jl/src/fit.jl, line 397 at commit eda1b08.)
On Windows...
Sure, that would be helpful. Could you also try this (note that I add an explicit intercept in the formula, which matters when using modelmatrix directly):
using CSV, DataFrames, LinearAlgebra, FixedEffectModels, Plots
df = CSV.read(".../FEbug.csv", DataFrame)
f = @formula(part ~ 1 + _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
_Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
_Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
_Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
_Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
_Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
_Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
_Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
_Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
_Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)
X = modelmatrix(f, df)
y = response(f, df)
coef_qr = qr(X) \ y
coef_chol = cholesky!(X' * X) \ (X' * y)
coef_chol2 = cholesky!(Symmetric(X' * X)) \ (X' * y)
Separately, yes, switching to the QR decomposition fixes the problem! I just replaced […]. Sources I looked at a few days ago were nearly unanimous in saying that software should use the QR, and that it's probably what's used in Stata and core R packages. E.g., "QR decomposition is indeed the standard way of solving least square problems effectively with much simplicity". So I think this package should have an option for QR, at least in the post-FE-removal stage, and I would have […]. Relatedly, see this discussion of argument naming.
I don't understand why coef_chol2 gives you different results than […]. Just before https://github.com/FixedEffects/FixedEffectModels.jl/blob/master/src/fit.jl#L397 I have added the lines:
f2 = @formula(part ~ 1 + _Ibirthyr_1907+_Ibirthyr_1908+_Ibirthyr_1909+_Ibirthyr_1910+_Ibirthyr_1911+_Ibirthyr_1912+_Ibirthyr_1913+_Ibirthyr_1914+
_Ibirthyr_1915+_Ibirthyr_1916+_Ibirthyr_1917+_Ibirthyr_1918+_Ibirthyr_1919+_Ibirthyr_1920+_Ibirthyr_1921+_Ibirthyr_1922+_Ibirthyr_1923+_Ibirthyr_1924+
_Ibirthyr_1925+_Ibirthyr_1926+_Ibirthyr_1927+_Ibirthyr_1928+_Ibirthyr_1929+_Ibirthyr_1930+_Ibirthyr_1931+_Ibirthyr_1932+_Ibirthyr_1933+_Ibirthyr_1934+
_Ibirthyr_1935+_Ibirthyr_1936+_Ibirthyr_1937+_Ibirthyr_1938+_Ibirthyr_1939+_Ibirthyr_1940+_Ibirthyr_1941+_Ibirthyr_1942+_Ibirthyr_1943+_Ibirthyr_1944+
_Ibirthyr_1945+_Ibirthyr_1946+_Ibirthyr_1947+_Ibirthyr_1948+_Ibirthyr_1949+_Ibirthyr_1950+_Ibirthyr_1951+_Ibirthyr_1952+_Ibirthyr_1953+_Ibirthyr_1954+
_Ibirthyr_1955+_Ibirthyr_1956+_Ibirthyr_1957+_Ibirthyr_1958+_Ibirthyr_1959+_Ibirthyr_1960+_Ibirthyr_1962+_Ibirthyr_1963+_Ibirthyr_1964+
_Ibirthyr_1965+_Ibirthyr_1966+_Ibirthyr_1967+_Ibirthyr_1968+_Ibirthyr_1969+_Ibirthyr_1970+_Ibirthyr_1971+_Ibirthyr_1972+_Ibirthyr_1973+_Ibirthyr_1974+
_Ibirthyr_1975+_Ibirthyr_1976+_Ibirthyr_1977+_Ibirthyr_1978+_Ibirthyr_1979+_Ibirthyr_1980+_Ibirthyr_1981+_Ibirthyr_1982+_Ibirthyr_1983+_Ibirthyr_1984+
_Ibirthyr_1985+_Ibirthyr_1986+_Ibirthyr_1987+_Ibirthyr_1988+_Ibirthyr_1989+_Ibirthyr_1990+_Ibirthyr_1991+_Ibirthyr_1992+_Ibirthyr_1993+_Ibirthyr_1994+
_Ibirthyr_1995+_Ibirthyr_1996+_Ibirthyr_1997+_Ibirthyr_1998+_Ibirthyr_1999+_Ibirthyr_2000)
Xhat2 = modelmatrix(f2, df)
y2 = response(f2, df)
coef2 = cholesky!(Xhat2' * Xhat2) \ (Xhat2' * y2)
crossx = cholesky!(Symmetric(Xhat'Xhat))
coef = ldiv!(crossx, Xhat'y)
@assert sum(basis_coef) == size(Xhat2, 2)
@assert Xhat == Xhat2
@assert y == y2
@assert coef == coef2
@show "success"
When I run reg(df, f), the four assertions pass for me and "success" gets printed. They should not pass for you. Could you try to run it? Understanding which one fails would help us pinpoint the issue.
I get: […]
Thanks. Can you try again after replacing the line […]
The upshot here is that replacing y2 in your test code with collect(y2), in order to convert it to […]. Meanwhile, using collect(y2) doesn't solve my original problem. What does solve the problem is […]. Here's a complicated screenshot. The last item in the Watch list pulls together coef2 as defined in what is now line 408, involving […]. In my experience, this is how it is with numerical issues in linear algebra routines. You drill and drill to find where the computation goes off the rails, and it turns out it's not really a logical bug, just imprecision in the matrix inversion/factorization that you previously didn't need to think about. Then you have to confront a speed-precision tradeoff.
Thanks a lot. I am open to using QR in some cases, but if you look at the condition number of X, it is not particularly different from the condition number of any other matrix (which, I think, is the same as saying that the system is not close to being collinear). So I'm unsure why this happens and how one could predict this issue.
I think that's what SAS does btw (and Stata is probably similar): https://blogs.sas.com/content/iml/2018/04/18/sweep-operator-sas.html, https://blogs.sas.com/content/iml/2021/07/14/performance-ls-regression.html
Yes, this other SAS blog post about the QR says "most SAS regression procedures use the sweep operator to construct least-squares solutions." OTOH, it starts out with "A SAS programmer recently mentioned that some open-source software uses the QR algorithm to solve least-squares regression problems," and we can guess what that refers to. And indeed it looks like R uses the QR (this, this). I don't know what Stata does, but evidently it sometimes uses something more precise than Cholesky. It looks like NumPy uses the SVD, which I read is even slower and more precise than the QR (this, this). I'm getting a condition number of 346 for Xhat, thus 346^2 for crossx (which is faster to compute, especially since crossx will still be computed anyway, right?). Isn't that pretty big? I can see a few options: […]
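For reference, the SVD route mentioned above can be sketched as follows (my own helper, not the API of any package in this thread; NumPy's lstsq does roughly this via LAPACK's gelsd driver):

```julia
using LinearAlgebra

# Sketch of SVD-based least squares: the slowest of the three factorizations
# but the most robust, with an explicit singular-value cutoff so that
# rank-deficient directions are dropped rather than amplified.
function svd_lstsq(X, y; rcond=eps(Float64) * maximum(size(X)))
    F = svd(X)
    cutoff = rcond * maximum(F.S)
    s_inv = [s > cutoff ? inv(s) : 0.0 for s in F.S]
    return F.V * (s_inv .* (F.U' * y))
end
```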
Hi both, just for reference, in R:
lm.fit(x, y, offset = NULL, method = "qr", tol = 1e-7,
       singular.ok = TRUE, ...)
method: currently, only method = "qr" is supported.
tol: tolerance for the qr decomposition (https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/qr-methods.html). Default is 1e-7.
Apologies as it is a bit off-topic, but I was wondering if the data could be shared or tested with LinearRegressionKit.jl; I would like to know, out of curiosity, if this data set is an issue for the sweep operator. Personally, I am not a big fan of a threshold on the condition number, as it only considers the Xs; in my view, this leaves too much uncertainty by leaving the Y out. My preference would be to have QR by default, which could be changed by the user.
@ericqu I'd be happy to send you the data if you provide a more private way to do so.
I am still not convinced that QR is needed in a setup where one obtains small SEs for all coefficients. Relatedly, to compute standard errors, one needs to compute inv(X'X), even though the computer science / numerical methods people would say it's a horrible thing to do. In any case, having QR as a default would be bad, as it would slow down everything (this is one reason why R is much slower than Stata for standard OLS regressions with big data). I have pushed a version that solves for coefficients using sweep operations, using the same algorithm I was already using to test for collinearity. Hopefully that solves your problem. #253
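For readers unfamiliar with it, the sweep operator can be sketched like this (a textbook version, not the package's actual implementation): sweeping the augmented cross-product matrix on the first p pivots leaves the OLS coefficients in the last column, and a numerically zero pivot can simply be skipped, which is how the same pass doubles as a collinearity check.

```julia
using LinearAlgebra

# Textbook sweep operator: sweep symmetric matrix A on pivot k, in place.
function sweep!(A::Matrix{Float64}, k::Int)
    d = A[k, k]
    A[k, :] ./= d
    for i in axes(A, 1)
        if i != k
            b = A[i, k]
            A[i, :] .-= b .* A[k, :]
            A[i, k] = -b / d
        end
    end
    A[k, k] = 1 / d
    return A
end

# Sweeping [X'X X'y; y'X y'y] on pivots 1..p leaves the OLS coefficients in
# the last column (and the residual sum of squares in the bottom-right cell).
function sweep_ols(X, y)
    p = size(X, 2)
    A = [X' * X X' * y; y' * X y' * y]
    for k in 1:p
        sweep!(A, k)   # a rank-deficient (near-zero) pivot could be skipped here
    end
    return A[1:p, end]
end
```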
Thank you @matthieugomez. I upgraded to 1.10.0 but unfortunately the results have not changed. If you are getting tired of this, I can just fork the project, tweak it, and point reghdfejl to that.
Well, I would prefer to understand the difference, but it's hard for me to solve something that does not happen on my computer (I have double-checked and I get the exact same coefficients in Stata and Julia). If you time Stata, you can see that it returns a result faster than what a QR factorization would take, so I really don't think they use QR.
Part of it is JuliaStats/StatsModels.jl#222. You should be able to avoid it by passing factors directly rather than the continuous variables. In any case, the point is that 4.85 seconds is less than what my computer takes to compute a QR factorization.
Interesting. Yeah, on my machine the QR takes ~8.1 seconds.
Yes, you are correct. I don't know if it calls for a new issue on that package.
Well, so that explains the problem: there is an issue with your version of Julia or with your underlying BLAS / MKL. Both Xhat and y_vector are vectors/matrices of 0s and 1s, so there should not be any difference between the two numbers. Have you tried with Julia 1.10? Can you try to find a version of this issue with simpler arrays so that you can open an issue on the Julia repository? (Maybe Xhat = Xhat[:, end:end] and y = ones(length(y_vector)).)
A few unsolicited, but hopefully still helpful, thoughts: […]
@grantmcdermott I could be convinced to add an option for QR, but the first step would be for someone to come up with an example where the difference matters relative to the standard errors of the coefficients. The issue right now is a bug, not one of instability of Cholesky versus QR.
@matthieugomez I'm not sure what you mean by "the difference matters relative to the standard errors." In the motivating example here, the discontinuity is very large relative to the standard errors.
Your example has nothing to do with the instability of Cholesky versus QR. It's about the fact that your computer returns a wrong result for X'y (X and y are just arrays of 0-1, so it is not related to numerical instability). You should make sure you are on the latest version of Julia (1.9.3), simplify the example so that you can share it, and open an issue mentioning your OS and the result of […]. Here is some code that might help you get a reproducible example (it produces a vector y and a matrix X similar to the ones obtained from your data).
N = 10_000_000
X = hcat(ones(N), rand(N, 100) .>= 0.95)
y = Vector{Float64}(rand(N) .>= 0.9)
@assert (X'y)[end] ≈ X[:, end]'y
Submitted. I had thought that conversion of that Sentinel vector to a regular vector had eliminated the effect on the computations in this package. It did so in the debugging watches, but that's not quite the same thing. Ironically, @grantmcdermott, switching from OpenBLAS to MKL eliminates the problem(!). What I don't understand is why sticking in a simple coef = qr(X) \ y also did. I guess the ldiv operator is not hitting this weird math issue.
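When chasing backend-specific bugs like this one, it can help to record which BLAS is actually loaded (this uses the standard LinearAlgebra API; with MKL.jl loaded, MKL appears in the list of libraries and is used for subsequent BLAS calls):

```julia
using LinearAlgebra

# Print the loaded BLAS backend(s) and thread count; useful context when a
# numerical bug reproduces under one BLAS (here OpenBLAS) but not another.
println(BLAS.get_config())
println("threads: ", BLAS.get_num_threads())
```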
Thanks. If you replace […]
I tried adding […]. Following this comment, I added […].
Great, but I'd like a solution even for people who don't use reghdfejl. What if you replace X' * y with X' * reshape(y, length(y), 1) in your reproducible example?
Yes, that works too!
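A note on why the reshape trick can matter (my reading, stated as an assumption rather than a fact from the thread): with y an n-vector, X' * y dispatches to the BLAS matrix-vector kernel (gemv), while an n-by-1 matrix routes the product through the matrix-matrix kernel (gemm), so the two forms exercise different BLAS code paths.

```julia
using LinearAlgebra

# Same product through two different BLAS kernels: gemv for the vector form,
# gemm for the n-by-1 matrix form. On a healthy BLAS the results agree up to
# floating-point roundoff.
X = rand(1_000, 5)
y = rand(1_000)
v_gemv = X' * y                            # matrix-vector path
v_gemm = X' * reshape(y, length(y), 1)     # matrix-matrix path, 5x1 result
@assert vec(v_gemm) ≈ v_gemv
```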
Well, BLISBLAS can probably also create other types of bugs in the future... Out of curiosity, were you able to replicate this bug with other matrices? I'm wondering how common it is in the wild.
Solved by #255 |
The example I posted about OpenBLAS had about half the columns dropped. I found that the problem went away with even small changes to the number of rows, so I figured I'd hit diminishing returns and stopped.
I'm running reg() on a data set with 9 million rows and nearly 100 non-absorbed regressors, most of which are dummies for certain birth years. I am absorbing only 1 set of FE. Occasionally some of the estimates are clearly wrong. The coefficients on the birth year dummies should vary rather smoothly with birth year, but they sometimes make big jumps. Running the same regression with reghdfe or areg in Stata does not have this problem. I will paste the output of an example. I cannot share the data publicly but will email it to @matthieugomez