Update documentation
ethanweed committed May 2, 2024
1 parent 4123473 commit 4a4751c
Showing 3 changed files with 2,743 additions and 135 deletions.
14 changes: 7 additions & 7 deletions 05.04-regression.html
@@ -3182,7 +3182,7 @@ <h3><span class="section-number">16.9.1. </span>Three kinds of residuals<a class
<th>Date:</th> <td>Thu, 02 May 2024</td> <th> Prob (F-statistic):</th> <td>2.15e-36</td>
</tr>
<tr>
-<th>Time:</th> <td>11:13:42</td> <th> Log-Likelihood: </th> <td> -287.48</td>
+<th>Time:</th> <td>13:53:03</td> <th> Log-Likelihood: </th> <td> -287.48</td>
</tr>
<tr>
<th>No. Observations:</th> <td> 100</td> <th> AIC: </th> <td> 581.0</td>
@@ -3250,7 +3250,7 @@ <h3><span class="section-number">16.9.1. </span>Three kinds of residuals<a class
<th>Date:</th> <td>Thu, 02 May 2024</td> <th> Prob (F-statistic):</th> <td>2.78e-35</td>
</tr>
<tr>
-<th>Time:</th> <td>11:13:42</td> <th> Log-Likelihood: </th> <td> -287.48</td>
+<th>Time:</th> <td>13:53:03</td> <th> Log-Likelihood: </th> <td> -287.48</td>
</tr>
<tr>
<th>No. Observations:</th> <td> 100</td> <th> AIC: </th> <td> 581.0</td>
@@ -3717,7 +3717,7 @@ <h3><span class="section-number">16.10.1. </span>Backward elimination<a class="h
<th>Date:</th> <td>Thu, 02 May 2024</td> <th> Prob (F-statistic):</th> <td>3.42e-35</td>
</tr>
<tr>
-<th>Time:</th> <td>11:13:42</td> <th> Log-Likelihood: </th> <td> -287.43</td>
+<th>Time:</th> <td>13:53:03</td> <th> Log-Likelihood: </th> <td> -287.43</td>
</tr>
<tr>
<th>No. Observations:</th> <td> 100</td> <th> AIC: </th> <td> 582.9</td>
@@ -3846,17 +3846,17 @@ <h3><span class="section-number">16.10.1. </span>Backward elimination<a class="h
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>mod1 = pg.linear_regression(df[&#39;dan_sleep&#39;], df[&#39;dan_grump&#39;])
</pre></div>
</div>
-<p>This process, as I have described it, is fairly straightforward, especially once you a function to just grab the AIC from the <code class="docutils literal notranslate"><span class="pre">statsmodels</span></code> output. But if you find yourself doing this sort of thing often, you might want to automate it even further. It wouldn’t take <em>that</em> much work to expand on the code I have given you, and build a function that takes as input your full model, and then <em>automatically</em> considers all possible variations, and continues eliminating predictors until it finds the optimal model. But this I leave as a programming exercise for you, if you feel so inclined.</p>
+<p>This process, as I have described it, is fairly straightforward, especially once you have a function to just grab the AIC from the <code class="docutils literal notranslate"><span class="pre">statsmodels</span></code> output. But if you find yourself doing this sort of thing often, you might want to automate it even further. It wouldn’t take <em>that</em> much work to expand on the code I have given you, and build a function that takes as input your full model, and then <em>automatically</em> considers all possible variations, and continues eliminating predictors until it finds the optimal model. But this I leave as a programming exercise for you, if you feel so inclined.</p>
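<p>To make that concrete, here is one minimal sketch of how such a function might look, assuming <code>df</code> holds the parenthood data and <code>smf</code> is <code>statsmodels.formula.api</code>; the name <code>backward_eliminate</code> and the greedy drop-one-at-a-time strategy are illustrative choices, not the book’s implementation:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span>import statsmodels.formula.api as smf

def backward_eliminate(outcome, predictors, data):
    # Start from the full model and record its AIC.
    current = list(predictors)
    best_aic = smf.ols(f"{outcome} ~ {' + '.join(current)}", data=data).fit().aic
    improved = True
    while improved and len(current) > 1:
        improved = False
        # Try dropping each predictor in turn; keep the first drop that lowers AIC.
        for p in current:
            candidate = [q for q in current if q != p]
            aic = smf.ols(f"{outcome} ~ {' + '.join(candidate)}", data=data).fit().aic
            if aic &lt; best_aic:
                best_aic, current, improved = aic, candidate, True
                break
    return current, best_aic

# For example: backward_eliminate('dan_grump', ['dan_sleep', 'baby_sleep', 'day'], df)
</pre></div>
</div>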
</section>
<section id="a-caveat">
<h3><span class="section-number">16.10.2. </span>A caveat<a class="headerlink" href="#a-caveat" title="Permalink to this heading">#</a></h3>
<p>Automated variable selection methods are seductive things. They provide an element of objectivity to your model selection, and that’s kind of nice. Unfortunately, they’re sometimes used as an excuse for thoughtlessness. No longer do you have to think carefully about which predictors to add to the model and what the theoretical basis for their inclusion might be… everything is solved by the magic of AIC. And if we start throwing around phrases like Ockham’s razor, well, it sounds like everything is wrapped up in a nice neat little package that no-one can argue with.</p>
-<p>Or, perhaps not. Firstly, there’s very little agreement on what counts as an appropriate model selection criterion. When I was taught backward elimination as an undergraduate, we used F-tests to do it, because that was the default method used by the software. Here we are using AIC, and since this is an introductory text that’s the only method I’ve described, but the AIC is hardly the Word of the Gods of Statistics. It’s an approximation, derived under certain assumptions, and it’s guaranteed to work only for large samples when those assumptions are met. Alter those assumptions and you get a different criterion, like the BIC for instance. Take a different approach again and you get the NML criterion. Decide that you’re a Bayesian and you get model selection based on posterior odds ratios. Then there are a bunch of regression specific tools that I haven’t mentioned. And so on. All of these different methods have strengths and weaknesses, and some are easier to calculate than others (AIC is probably the easiest of the lot, which might account for its popularity). Almost all of them produce the same answers when the answer is “obvious” but there’s a fair amount of disagreement when the model selection problem becomes hard.</p>
+<p>Or, perhaps not. First of all, there’s very little agreement on what counts as an appropriate model selection criterion. When I was taught backward elimination as an undergraduate, we used F-tests to do it, because that was the default method used by the software. Here we are using AIC, and since this is an introductory text that’s the only method I’ve described, but the AIC is hardly the Word of the Gods of Statistics. It’s an approximation, derived under certain assumptions, and it’s guaranteed to work only for large samples when those assumptions are met. Alter those assumptions and you get a different criterion, like the BIC for instance. Take a different approach again and you get the NML criterion. Decide that you’re a Bayesian and you get model selection based on posterior odds ratios. Then there are a bunch of regression-specific tools that I haven’t mentioned. And so on. All of these different methods have strengths and weaknesses, and some are easier to calculate than others (AIC is probably the easiest of the lot, which might account for its popularity). Almost all of them produce the same answers when the answer is “obvious”, but there’s a fair amount of disagreement when the model selection problem becomes hard.</p>
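<p>For reference, the two criteria mentioned here are conventionally defined in terms of the maximised log-likelihood <span class="math notranslate nohighlight">\(\ln \hat{L}\)</span>, the number of estimated parameters <span class="math notranslate nohighlight">\(k\)</span>, and the sample size <span class="math notranslate nohighlight">\(N\)</span>:</p>
<div class="math notranslate nohighlight">
\[
\mbox{AIC} = 2k - 2 \ln \hat{L}, \qquad \mbox{BIC} = k \ln N - 2 \ln \hat{L}
\]
</div>
<p>The two differ only in the penalty term: BIC’s penalty grows with the sample size, which is why the two criteria can favour different models even when fitted to the same data.</p>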
<p>What does this mean in practice? Well, you could go and spend several years teaching yourself the theory of model selection, learning all the ins and outs of it; so that you could finally decide on what you personally think the right thing to do is. Speaking as someone who actually did that, I wouldn’t recommend it: you’ll probably come out the other side even more confused than when you started. A better strategy is to show a bit of common sense… if you’re staring at the results of a stepwise AIC model comparison procedure, and the model that makes sense is close to having the smallest AIC, but is narrowly defeated by a model that doesn’t make any sense… trust your instincts. Statistical model selection is an inexact tool, and as I said at the beginning, interpretability matters.</p>
</section>
<section id="comparing-two-regression-models">
<h3><span class="section-number">16.10.3. </span>Comparing two regression models<a class="headerlink" href="#comparing-two-regression-models" title="Permalink to this heading">#</a></h3>
-<p>An alternative to using automated model selection procedures is for the researcher to explicitly select two or more regression models to compare to each other. You can do this in a few different ways, depending on what research question you’re trying to answer. Suppose we want to know whether or not the amount of sleep that my son got has any relationship to my grumpiness, over and above what we might expect from the amount of sleep that I got. We also want to make sure that the day on which we took the measurement has no influence on the relationship. That is, we’re interested in the relationship between <code class="docutils literal notranslate"><span class="pre">baby_sleep</span></code> and <code class="docutils literal notranslate"><span class="pre">dan_grump</span></code>, and from that perspective <code class="docutils literal notranslate"><span class="pre">dan_sleep</span></code> and <code class="docutils literal notranslate"><span class="pre">day</span></code> are nuisance variable or <strong><em>covariates</em></strong> that we want to control for. In this situation, what we would like to know is whether <code class="docutils literal notranslate"><span class="pre">dan_grump</span> <span class="pre">~</span> <span class="pre">dan_sleep</span> <span class="pre">+</span> <span class="pre">day</span> <span class="pre">+</span> <span class="pre">baby_sleep</span></code> (which I’ll call Model 1, or <code class="docutils literal notranslate"><span class="pre">M1</span></code>) is a better regression model for these data than <code class="docutils literal notranslate"><span class="pre">dan_grump</span> <span class="pre">~</span> <span class="pre">dan_sleep</span> <span class="pre">+</span> <span class="pre">day</span></code> (which I’ll call Model 0, or <code class="docutils literal notranslate"><span class="pre">M0</span></code>). There are two different ways we can compare these two models, one based on a model selection criterion like AIC, and the other based on an explicit hypothesis test. I’ll show you the AIC based approach first because it’s simpler, and follows naturally from the method we used in the last section. The first thing I need to do is actually run the regressions. Since we want to calculate AIC, it will be easier to use <code class="docutils literal notranslate"><span class="pre">statsmodels</span></code> than <code class="docutils literal notranslate"><span class="pre">pingouin</span></code>. First we’ll define the two models, then use our handy-dandy AIC function to get the AIC for each of them.</p>
+<p>An alternative to using automated model selection procedures is for the researcher to explicitly select two or more regression models to compare to each other. You can do this in a few different ways, depending on what research question you’re trying to answer. Suppose we want to know whether or not the amount of sleep that my son got has any relationship to my grumpiness, over and above what we might expect from the amount of sleep that I got. We also want to make sure that the day on which we took the measurement has no influence on the relationship. That is, we’re interested in the relationship between <code class="docutils literal notranslate"><span class="pre">baby_sleep</span></code> and <code class="docutils literal notranslate"><span class="pre">dan_grump</span></code>, and from that perspective <code class="docutils literal notranslate"><span class="pre">dan_sleep</span></code> and <code class="docutils literal notranslate"><span class="pre">day</span></code> are nuisance variables or <strong><em>covariates</em></strong> that we want to control for. In this situation, what we would like to know is whether <code class="docutils literal notranslate"><span class="pre">dan_grump</span> <span class="pre">~</span> <span class="pre">dan_sleep</span> <span class="pre">+</span> <span class="pre">day</span> <span class="pre">+</span> <span class="pre">baby_sleep</span></code> (which I’ll call Model 1, or <code class="docutils literal notranslate"><span class="pre">M1</span></code>) is a better regression model for these data than <code class="docutils literal notranslate"><span class="pre">dan_grump</span> <span class="pre">~</span> <span class="pre">dan_sleep</span> <span class="pre">+</span> <span class="pre">day</span></code> (which I’ll call Model 0, or <code class="docutils literal notranslate"><span class="pre">M0</span></code>). There are two different ways we can compare these two models, one based on a model selection criterion like AIC, and the other based on an explicit hypothesis test. I’ll show you the AIC-based approach first because it’s simpler, and follows naturally from the method we used in the last section. The first thing I need to do is actually run the regressions. Since we want to calculate AIC, it will be easier to use <code class="docutils literal notranslate"><span class="pre">statsmodels</span></code> than <code class="docutils literal notranslate"><span class="pre">pingouin</span></code>. First we’ll define the two models, then use our handy-dandy AIC function to get the AIC for each of them.</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">M0</span> <span class="o">=</span> <span class="n">smf</span><span class="o">.</span><span class="n">ols</span><span class="p">(</span><span class="s1">&#39;dan_grump ~ dan_sleep + day&#39;</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">)</span>
@@ -3912,7 +3912,7 @@ <h3><span class="section-number">16.10.3. </span>Comparing two regression models
</div>
</div>
<p>BIC is also smaller for <code class="docutils literal notranslate"><span class="pre">M0</span></code> than for <code class="docutils literal notranslate"><span class="pre">M1</span></code>, so based on both AIC and BIC, it looks like Model 0 is the better choice.</p>
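<p>If you don’t want to pull these numbers out of the summary table by hand, a fitted <code>statsmodels</code> result also exposes both criteria directly as attributes; a minimal sketch, assuming <code>M0</code> and <code>M1</code> are the unfitted models defined above:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span># Fit both models and read the information criteria straight off the results.
fit0, fit1 = M0.fit(), M1.fit()
print(f"M0: AIC = {fit0.aic:.1f}, BIC = {fit0.bic:.1f}")
print(f"M1: AIC = {fit1.aic:.1f}, BIC = {fit1.bic:.1f}")
</pre></div>
</div>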
-<p>A somewhat different approach to the problem comes out of the hypothesis testing framework. Suppose you have two regression models, where one of them (Model 0) contains a <em>subset</em> of the predictors from the other one (Model 1). That is, Model 1 contains all of the predictors included in Model 0, plus one or more additional predictors. When this happens we say that Model 0 is <strong><em>nested</em></strong> within Model 1, or possibly that Model 0 is a <strong><em>submodel</em></strong> of Model 1. Regardless of the terminology what this means is that we can think of Model 0 as a null hypothesis and Model 1 as an alternative hypothesis. And in fact we can construct an <span class="math notranslate nohighlight">\(F\)</span> test for this in a fairly straightforward fashion. We can fit both models to the data and obtain a residual sum of squares for both models. I’ll denote these as SS<span class="math notranslate nohighlight">\(_{res}^{(0)}\)</span> and SS<span class="math notranslate nohighlight">\(_{res}^{(1)}\)</span> respectively. The superscripting here just indicates which model we’re talking about. Then our <span class="math notranslate nohighlight">\(F\)</span> statistic is</p>
+<p>A somewhat different approach to the problem comes out of the hypothesis testing framework. Suppose you have two regression models, where one of them (Model 0) contains a <em>subset</em> of the predictors from the other one (Model 1). That is, Model 1 contains all of the predictors included in Model 0, plus one or more additional predictors. When this happens we say that Model 0 is <strong><em>nested</em></strong> within Model 1, or possibly that Model 0 is a <strong><em>submodel</em></strong> of Model 1. Regardless of the terminology, what this means is that we can think of Model 0 as a null hypothesis and Model 1 as an alternative hypothesis. And in fact we can construct an <span class="math notranslate nohighlight">\(F\)</span> test for this in a fairly straightforward fashion. We can fit both models to the data and obtain a residual sum of squares for both models. I’ll denote these as SS<span class="math notranslate nohighlight">\(_{res}^{(0)}\)</span> and SS<span class="math notranslate nohighlight">\(_{res}^{(1)}\)</span> respectively. The superscripting here just indicates which model we’re talking about. Then our <span class="math notranslate nohighlight">\(F\)</span> statistic is</p>
<div class="math notranslate nohighlight">
\[
F = \frac{(\mbox{SS}_{res}^{(0)} - \mbox{SS}_{res}^{(1)})/k}{(\mbox{SS}_{res}^{(1)})/(N-p-1)}
Expand Down
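<p>In <code>statsmodels</code> you don’t have to assemble this <span class="math notranslate nohighlight">\(F\)</span> statistic by hand if you don’t want to; a minimal sketch, assuming <code>fit0</code> and <code>fit1</code> are the fitted results from the sketch above:</p>
<div class="highlight-none notranslate"><div class="highlight"><pre><span></span># compare_f_test() takes the restricted (null) model and returns the F
# statistic, its p-value, and the difference in degrees of freedom.
f_stat, p_value, df_diff = fit1.compare_f_test(fit0)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}, df diff = {df_diff:.0f}")
</pre></div>
</div>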
