updated parenthood data

replaced . with _ in variable names
ethanweed · Apr 12, 2022 · b92c108 · b92c108
1 parent 24e9143
commit b92c108
Show file tree

Hide file tree

Showing 3 changed files with 455 additions and 107 deletions.
diff --git a/Chapters/.ipynb_checkpoints/05.04-regression-checkpoint.ipynb b/Chapters/.ipynb_checkpoints/05.04-regression-checkpoint.ipynb
@@ -5,21 +5,188 @@
    "id": "constant-nation",
    "metadata": {},
    "source": [
+    "(regression)=\n",
     "# Linear regression"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "ae78dc30",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "The goal in this chapter is to introduce **_linear regression_**, the standard tool that statisticians rely on when analysing the relationship between interval scale predictors and interval scale outcomes. Stripped to its bare essentials, linear regression models are basically a slightly fancier version of the [Pearson correlation](correl) though as we'll see, regression models are much more powerful tools."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6eff7bf8",
+   "metadata": {},
+   "source": [
+    "Since the basic ideas in regression are closely tied to correlation, we'll return to the `parenthood.csv` file that we were using to illustrate how correlations work. Recall that, in this data set, we were trying to find out why Dan is so very grumpy all the time, and our working hypothesis was that I'm not getting enough sleep. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "adc7553d",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>dan.sleep</th>\n",
+       "      <th>baby.sleep</th>\n",
+       "      <th>dan.grump</th>\n",
+       "      <th>day</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>7.59</td>\n",
+       "      <td>10.18</td>\n",
+       "      <td>56</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>7.91</td>\n",
+       "      <td>11.66</td>\n",
+       "      <td>60</td>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>5.14</td>\n",
+       "      <td>7.92</td>\n",
+       "      <td>82</td>\n",
+       "      <td>3</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>7.71</td>\n",
+       "      <td>9.61</td>\n",
+       "      <td>55</td>\n",
+       "      <td>4</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>6.68</td>\n",
+       "      <td>9.75</td>\n",
+       "      <td>67</td>\n",
+       "      <td>5</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   dan.sleep  baby.sleep  dan.grump  day\n",
+       "0       7.59       10.18         56    1\n",
+       "1       7.91       11.66         60    2\n",
+       "2       5.14        7.92         82    3\n",
+       "3       7.71        9.61         55    4\n",
+       "4       6.68        9.75         67    5"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "file = 'https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/parenthood.csv'\n",
+    "df = pd.read_csv(file)\n",
+    "\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b55a3a72",
+   "metadata": {},
+   "source": [
+    "We drew some scatterplots to help us examine the relationship between the amount of sleep I get, and my grumpiness the following day. The actual scatterplot that we draw is the one shown in Figure \\@ref(fig:regression0), and as we saw previously this corresponds to a correlation of $r=-.90$, but what we find ourselves secretly imagining is something that looks closer to Figure \\@ref(fig:regression1a). That is, we mentally draw a straight line through the middle of the data. In statistics, this line that we're drawing is called a **_regression line_**. Notice that -- since we're not idiots -- the regression line goes through the middle of the data. We don't find ourselves imagining anything like the rather silly plot shown in Figure \\@ref(fig:regression1b). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1776f606",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "97314458",
+   "metadata": {},
+   "source": [
+    "This is not highly surprising: the line that I've drawn in Figure \\@ref(fig:regression1b) doesn't \"fit\" the data very well, so it doesn't make a lot of sense to propose it as a way of summarising the data, right? This is a very simple observation to make, but it turns out to be very powerful when we start trying to wrap just a little bit of maths around it. To do so, let's start with a refresher of some high school maths. The formula for a straight line is usually written like this:\n",
+    "\n",
+    "$$\n",
+    "y = mx + c\n",
+    "$$ \n",
+    "\n",
+    "Or, at least, that's what it was when I went to high school all those years ago. The two *variables* are $x$ and $y$, and we have two *coefficients*, $m$ and $c$. The coefficient $m$ represents the *slope* of the line, and the coefficient $c$ represents the *$y$-intercept* of the line. Digging further back into our decaying memories of high school (sorry, for some of us high school was a long time ago), we remember that the intercept is interpreted as \"the value of $y$ that you get when $x=0$\". Similarly, a slope of $m$ means that if you increase the $x$-value by 1 unit, then the $y$-value goes up by $m$ units; a negative slope means that the $y$-value would go down rather than up. Ah yes, it's all coming back to me now. \n",
+    "\n",
+    "Now that we've remembered that, it should come as no surprise to discover that we use the exact same formula to describe a regression line. If $Y$ is the outcome variable (the DV) and $X$ is the predictor variable (the IV), then the formula that describes our regression is written like this:\n",
+    "\n",
+    "$$\n",
+    "\\hat{Y_i} = b_1 X_i + b_0\n",
+    "$$\n",
+    "\n",
+    "Hm. Looks like the same formula, but there's some extra frilly bits in this version. Let's make sure we understand them. Firstly, notice that I've written $X_i$ and $Y_i$ rather than just plain old $X$ and $Y$. This is because we want to remember that we're dealing with actual data. In this equation, $X_i$ is the value of predictor variable for the $i$th observation (i.e., the number of hours of sleep that I got on day $i$ of my little study), and $Y_i$ is the corresponding value of the outcome variable (i.e., my grumpiness on that day). And although I haven't said so explicitly in the equation, what we're assuming is that this formula works for all observations in the data set (i.e., for all $i$). Secondly, notice that I wrote $\\hat{Y}_i$ and not $Y_i$. This is because we want to make the distinction between the *actual data* $Y_i$, and the *estimate* $\\hat{Y}_i$ (i.e., the prediction that our regression line is making). Thirdly, I changed the letters used to describe the coefficients from $m$ and $c$ to $b_1$ and $b_0$. That's just the way that statisticians like to refer to the coefficients in a regression model. I've no idea why they chose $b$, but that's what they did. In any case $b_0$ always refers to the intercept term, and $b_1$ refers to the slope.\n",
+    "\n",
+    "Excellent, excellent. Next, I can't help but notice that -- regardless of whether we're talking about the good regression line or the bad one -- the data don't fall perfectly on the line. Or, to say it another way, the data $Y_i$ are not identical to the predictions of the regression model $\\hat{Y_i}$. Since statisticians love to attach letters, names and numbers to everything, let's refer to the difference between the model prediction and that actual data point as a *residual*, and we'll refer to it as $\\epsilon_i$.[^noteepsilon] Written using mathematics, the residuals are defined as:\n",
+    "\n",
+    "$$\n",
+    "\\epsilon_i = Y_i - \\hat{Y}_i\n",
+    "$$\n",
+    "\n",
+    "which in turn means that we can write down the complete linear regression model as:\n",
+    "\n",
+    "$$\n",
+    "Y_i = b_1 X_i + b_0 + \\epsilon_i\n",
+    "$$\n",
+    "\n",
+    "[^noteepsilon]: The $\\epsilon$ symbol is the Greek letter epsilon. It's traditional to use $\\epsilon_i$ or $e_i$ to denote a residual."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "actual-genealogy",
+   "id": "1c5bbcfe",
    "metadata": {},
    "outputs": [],
    "source": []
   }
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -33,7 +200,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.2"
+   "version": "3.10.4"
   }
  },
  "nbformat": 4,