Taking weighting seriously #487
base: master
@@ -0,0 +1,2 @@
{
}
@@ -0,0 +1,24 @@
\setlength{\LTpost}{0mm}

Review comment: What's this? :-)

\begin{longtable}{l|rrrrrrrr}
\toprule
\multicolumn{1}{l}{} & \multicolumn{2}{c}{\emph{G} = 50} & \multicolumn{2}{c}{\emph{G} = 100} & \multicolumn{2}{c}{\emph{G} = 150} & \multicolumn{2}{c}{\emph{G} = 200} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9}
\multicolumn{1}{l}{} & \(\frac{\sqrt{G}\bar{X}_{n}}{\bar{\sigma}_{n}}\) & \(\frac{\sqrt{G}\bar{X}_{n}}{\hat{\sigma}_{n}}\) & \(\frac{\sqrt{G}\bar{X}_{n}}{\bar{\sigma}_{n}}\) & \(\frac{\sqrt{G}\bar{X}_{n}}{\hat{\sigma}_{n}}\) & \(\frac{\sqrt{G}\bar{X}_{n}}{\bar{\sigma}_{n}}\) & \(\frac{\sqrt{G}\bar{X}_{n}}{\hat{\sigma}_{n}}\) & \(\frac{\sqrt{G}\bar{X}_{n}}{\bar{\sigma}_{n}}\) & \(\frac{\sqrt{G}\bar{X}_{n}}{\hat{\sigma}_{n}}\) \\
\midrule\addlinespace[2.5pt]
\multicolumn{9}{l}{\(\alpha=1.5, \beta=1.9\)} \\
\midrule\addlinespace[2.5pt]
10\% & $0.03$ & $0.06$ & $0.04$ & $0.03$ & $0.03$ & $0.04$ & $0.05$ & $0.06$ \\
5\% & $0.02$ & $0.03$ & $0.02$ & $0.01$ & $0.00$ & $0.01$ & $0.04$ & $0.06$ \\
1\% & $0.01$ & $0.00$ & $0.01$ & $0.00$ & $0.00$ & $0.01$ & $0.01$ & $0.02$ \\
\midrule\addlinespace[2.5pt]
\multicolumn{9}{l}{\(\alpha=1.5, \beta=2.1\)} \\
\midrule\addlinespace[2.5pt]
10\% & $0.05$ & $0.03$ & $0.03$ & $0.08$ & $0.03$ & $0.03$ & $0.02$ & $0.08$ \\
5\% & $0.02$ & $0.02$ & $0.01$ & $0.05$ & $0.02$ & $0.01$ & $0.02$ & $0.05$ \\
1\% & $0.01$ & $0.00$ & $0.00$ & $0.02$ & $0.00$ & $0.00$ & $0.00$ & $0.01$ \\
\bottomrule
\end{longtable}
\begin{minipage}{\linewidth}
The table shows rejection rates for \(G \in \{50, 100, 150, 200\}\), \(\alpha = 1.5\), and \(\beta \in \{1.9, 2.1\}\). The Monte Carlo results are based on 10,000 simulations. The simulation standard errors are 0.009 at the 10\% level, 0.007 at the 5\% level, and 0.0031 at the 1\% level.\\
\end{minipage}
@@ -1,15 +1,20 @@ | ||
[deps] | ||
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" | ||
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597" | ||
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" | ||
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f" | ||
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" | ||
GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a" | ||
Optim = "429524aa-4258-5aef-a3af-852621145aeb" | ||
RCall = "6f49c342-dc21-5d91-9882-a32aef131414" | ||
RDatasets = "ce6b1742-4840-55fa-b093-852dadbb1d8b" | ||
Revise = "295af30f-e4ad-537b-8983-00126c2a3abe" | ||
Comment on lines
+9
to
+11
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we really need these? |
||
StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3" | ||
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91" | ||
StatsFuns = "4c63d2b9-4356-54db-8cca-17b64c39e42c" | ||
StatsModels = "3eaba693-59b7-5ba5-a881-562e759f1c8d" | ||
|
||
[compat] | ||
DataFrames = "1" | ||
Documenter = "1" | ||
Optim = "1.6.2" | ||
Optim = "1.6.2" |
@@ -123,6 +123,110 @@ x: 4 -0.032673 0.0797865 -0.41 0.6831 -0.191048 0.125702
───────────────────────────────────────────────────────────────────────────
```

## Weighting

Both `lm` and `glm` allow weighted estimation. The four different
[types of weights](https://juliastats.org/StatsBase.jl/stable/weights/) defined in
[StatsBase.jl](https://github.com/JuliaStats/StatsBase.jl) can be used to fit a model:

- `AnalyticWeights` describe a non-random relative importance (usually between 0 and 1) for
  each observation. These weights may also be referred to as reliability weights, precision
  weights, or inverse variance weights. They are typically used when the observations being
  weighted are aggregate values (e.g., averages) with differing variances.
- `FrequencyWeights` describe the number of times (or frequency) each observation was seen.
  These weights may also be referred to as case weights or repeat weights.
- `ProbabilityWeights` represent the inverse of the sampling probability for each observation,
  providing a correction mechanism for under- or over-sampling certain population groups.
  These weights may also be referred to as sampling weights.
- `UnitWeights` attribute a weight of 1 to each observation, which corresponds
  to unweighted regression (the default).
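As a quick sketch of how these types are constructed (using only StatsBase; the weight values here are made up for illustration):

```julia
using StatsBase

w = [1, 2, 3, 4]

aw = AnalyticWeights(w)     # shorthand: aweights(w)
fw = FrequencyWeights(w)    # shorthand: fweights(w)
pw = ProbabilityWeights(w)  # shorthand: pweights(w)
uw = uweights(length(w))    # UnitWeights: a weight of 1 per observation

# Each weight vector caches its total, which weighted statistics use:
sum(fw)  # 10
```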

Review comment: can we add a comment somewhere how these weights are later treated in estimation?

To indicate which kind of weights should be used, the vector of weights must be wrapped in
one of the three weight types and then passed to the `wts` keyword argument.
The short-hand functions `aweights`, `fweights`, and `pweights` can be used to construct
`AnalyticWeights`, `FrequencyWeights`, and `ProbabilityWeights`, respectively.

We illustrate the API with randomly generated data.

```jldoctest weights
julia> using StableRNGs, DataFrames, GLM

julia> data = DataFrame(y = rand(StableRNG(1), 100), x = randn(StableRNG(2), 100), weights = repeat([1, 2, 3, 4], 25));

julia> m = lm(@formula(y ~ x), data)
LinearModel

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.517369    0.0280232  18.46    <1e-32   0.461758  0.57298
x            -0.0500249   0.0307201  -1.63    0.1066  -0.110988  0.0109382
──────────────────────────────────────────────────────────────────────────

julia> m_aweights = lm(@formula(y ~ x), data, wts=aweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.51673     0.0270707  19.09    <1e-34   0.463009  0.570451
x            -0.0478667   0.0308395  -1.55    0.1239  -0.109067  0.0133333
──────────────────────────────────────────────────────────────────────────

julia> m_fweights = lm(@formula(y ~ x), data, wts=fweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)   0.51673     0.0170172  30.37    <1e-84   0.483213    0.550246
x            -0.0478667   0.0193863  -2.47    0.0142  -0.0860494  -0.00968394
─────────────────────────────────────────────────────────────────────────────

julia> m_pweights = lm(@formula(y ~ x), data, wts=pweights(data.weights))
LinearModel

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.51673     0.0287193  17.99    <1e-32   0.459737  0.573722
x            -0.0478667   0.0265532  -1.80    0.0745  -0.100561  0.00482739
───────────────────────────────────────────────────────────────────────────
```

!!! warning

    In the old API, weights were passed as plain `AbstractVector`s and were silently
    treated as `FrequencyWeights` in the internal computation of standard errors and
    related quantities. Passing weights as an `AbstractVector` is still allowed for
    backward compatibility, but it is deprecated: such weights are now coerced to
    `FrequencyWeights` and a deprecation warning is issued.
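For example, a call following the old, deprecated pattern (a sketch continuing the session above, not doctested output) would still fit but emit a deprecation warning:

```julia
# Deprecated old API: a plain vector is silently coerced to FrequencyWeights,
# and a deprecation warning is issued.
m_old = lm(@formula(y ~ x), data, wts=data.weights)
```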

The type of the weights affects the variance of the estimated coefficients and all
quantities derived from it. The coefficient point estimates are the same regardless
of the type of weights.
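For instance, continuing the session above (a sketch rather than doctested output), the fitted models share point estimates while their standard errors differ, as the coefficient tables show:

```julia
# Point estimates coincide across weight types
# (the weights enter the normal equations identically):
coef(m_aweights) ≈ coef(m_fweights) ≈ coef(m_pweights)  # true

# Standard errors depend on the weight type:
stderror(m_aweights)  # ≈ [0.0270707, 0.0308395]
stderror(m_fweights)  # ≈ [0.0170172, 0.0193863]
```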

```jldoctest weights
julia> loglikelihood(m_aweights)
-16.296307561384253

julia> loglikelihood(m_fweights)
-25.51860961756451
```

## Comparing models with F-test

Comparisons between two or more linear models can be performed using the `ftest` function,

@@ -176,8 +280,8 @@ Many of the methods provided by this package have names similar to those in [R](
- `vcov`: variance-covariance matrix of the coefficient estimates

Note that the canonical link for negative binomial regression is `NegativeBinomialLink`,
but in practice one typically uses `LogLink`.
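As a sketch (the data frame and column names here are invented for illustration), a negative binomial fit with the log link can be done either via `glm` with a fixed dispersion parameter, or via `negbin`, which estimates the dispersion by maximum likelihood:

```julia
using GLM, DataFrames

# Hypothetical count data.
df = DataFrame(counts = [2, 4, 1, 0, 3, 7, 5, 2, 4, 1],
               group  = repeat([0, 1], 5))

# Fixed dispersion parameter (here 2.0) with the log link:
m1 = glm(@formula(counts ~ group), df, NegativeBinomial(2.0), LogLink())

# Let `negbin` estimate the dispersion by maximum likelihood:
m2 = negbin(@formula(counts ~ group), df, LogLink())
```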

```jldoctest methods
julia> using GLM, DataFrames, StatsBase

@@ -209,7 +313,9 @@ julia> round.(predict(mdl, test_data); digits=8)
9.33333333
```


The [`cooksdistance`](@ref) method computes
[Cook's distance](https://en.wikipedia.org/wiki/Cook%27s_distance) for each observation
used to fit a linear model, giving an estimate of the influence of each data point.
Note that it's currently only implemented for linear models without weights.

```jldoctest methods
Review comment: Remove this file?