diff --git a/Chapters/.ipynb_checkpoints/05.02-ttest-checkpoint.ipynb b/Chapters/.ipynb_checkpoints/05.02-ttest-checkpoint.ipynb
index c85d95a4..1c5c4442 100644
--- a/Chapters/.ipynb_checkpoints/05.02-ttest-checkpoint.ipynb
+++ b/Chapters/.ipynb_checkpoints/05.02-ttest-checkpoint.ipynb
@@ -230,7 +230,7 @@
},
{
"cell_type": "markdown",
- "id": "spanish-procurement",
+ "id": "musical-kingston",
"metadata": {},
"source": [
"### Constructing the hypothesis test\n",
@@ -255,7 +255,7 @@
{
"cell_type": "code",
"execution_count": 253,
- "id": "present-conservative",
+ "id": "demonstrated-tragedy",
"metadata": {
"tags": [
"hide-input"
@@ -327,7 +327,7 @@
},
{
"cell_type": "markdown",
- "id": "royal-retailer",
+ "id": "overhead-haven",
"metadata": {},
"source": [
"```{glue:figure} ztesthyp-fig\n",
@@ -344,7 +344,7 @@
},
{
"cell_type": "markdown",
- "id": "alive-implementation",
+ "id": "neither-balance",
"metadata": {},
"source": [
"The next step is to figure out what would be a good choice for a diagnostic test statistic; something that would help us discriminate between $H_0$ and $H_1$. Given that the hypotheses all refer to the population mean $\\mu$, you'd feel pretty confident that the sample mean $\\bar{X}$ would be a pretty useful place to start. What we could do is look at the difference between the sample mean $\\bar{X}$ and the value that the null hypothesis predicts for the population mean. In our example, that would mean we calculate $\\bar{X} - 67.5$. More generally, if we let $\\mu_0$ refer to the value that the null hypothesis claims is our population mean, then we'd want to calculate\n",
@@ -399,7 +399,7 @@
{
"cell_type": "code",
"execution_count": 254,
- "id": "treated-defeat",
+ "id": "prompt-regression",
"metadata": {
"tags": [
"hide-input"
@@ -478,7 +478,7 @@
},
{
"cell_type": "markdown",
- "id": "increased-harbor",
+ "id": "norwegian-beaver",
"metadata": {},
"source": [
"```{glue:figure} ztest-fig\n",
@@ -495,7 +495,7 @@
},
{
"cell_type": "markdown",
- "id": "established-vocabulary",
+ "id": "fixed-linux",
"metadata": {},
"source": [
"And what this meant, way back in the days where people did all their statistics by hand, is that someone could publish a table like this:"
@@ -503,7 +503,7 @@
},
{
"cell_type": "markdown",
- "id": "proud-programmer",
+ "id": "opposite-offering",
"metadata": {},
"source": [
"| || critical z value |\n",
@@ -517,7 +517,7 @@
},
{
"cell_type": "markdown",
- "id": "suspected-champagne",
+ "id": "swiss-david",
"metadata": {},
"source": [
"which in turn meant that researchers could calculate their $z$-statistic by hand, and then look up the critical value in a textbook. That was an incredibly handy thing to be able to do back then, but it's kind of unnecessary these days, since it's trivially easy to do it with software like Python."
@@ -525,7 +525,7 @@
},
{
"cell_type": "markdown",
- "id": "superb-guide",
+ "id": "charitable-commerce",
"metadata": {},
"source": [
"### A worked example using Python\n",
@@ -536,7 +536,7 @@
{
"cell_type": "code",
"execution_count": 262,
- "id": "waiting-advantage",
+ "id": "multiple-match",
"metadata": {},
"outputs": [
{
@@ -558,7 +558,7 @@
},
{
"cell_type": "markdown",
- "id": "beneficial-austria",
+ "id": "ruled-revision",
"metadata": {},
"source": [
"Then, I create variables corresponding to the known population standard deviation ($\\sigma = 9.5$), and the value of the population mean that the null hypothesis specifies ($\\mu_0 = 67.5$):"
@@ -567,7 +567,7 @@
{
"cell_type": "code",
"execution_count": 261,
- "id": "tight-society",
+ "id": "occasional-korean",
"metadata": {},
"outputs": [],
"source": [
@@ -577,7 +577,7 @@
},
{
"cell_type": "markdown",
- "id": "animal-feeding",
+ "id": "sealed-happiness",
"metadata": {},
"source": [
"Let's also create a variable for the sample size. We could count up the number of observations ourselves, and type `N = 20` at the command prompt, but counting is tedious and repetitive. Let's get Python to do the tedious repetitive bit by using the `len()` function, which tells us how many elements there are in a vector:"
@@ -586,7 +586,7 @@
{
"cell_type": "code",
"execution_count": 263,
- "id": "coral-skirt",
+ "id": "spoken-calibration",
"metadata": {},
"outputs": [
{
@@ -607,7 +607,7 @@
},
{
"cell_type": "markdown",
- "id": "unusual-toolbox",
+ "id": "different-turtle",
"metadata": {},
"source": [
"Next, let's calculate the (true) standard error of the mean:"
@@ -616,7 +616,7 @@
{
"cell_type": "code",
"execution_count": 266,
- "id": "instructional-second",
+ "id": "julian-palace",
"metadata": {},
"outputs": [
{
@@ -638,7 +638,7 @@
},
{
"cell_type": "markdown",
- "id": "transsexual-spiritual",
+ "id": "right-homework",
"metadata": {},
"source": [
"And finally, we calculate our $z$-score:"
@@ -647,7 +647,7 @@
{
"cell_type": "code",
"execution_count": 268,
- "id": "finished-harrison",
+ "id": "tested-saver",
"metadata": {},
"outputs": [
{
@@ -668,7 +668,7 @@
},
{
"cell_type": "markdown",
- "id": "billion-workplace",
+ "id": "amazing-complement",
"metadata": {},
"source": [
"At this point, we would traditionally look up the value 2.26 in our table of critical values. Our original hypothesis was two-sided (we didn't really have any theory about whether psych students would be better or worse at statistics than other students) so our hypothesis test is two-sided (or two-tailed) also. Looking at the little table that I showed earlier, we can see that 2.26 is bigger than the critical value of 1.96 that would be required to be significant at $\\alpha = .05$, but smaller than the value of 2.58 that would be required to be significant at a level of $\\alpha = .01$. Therefore, we can conclude that we have a significant effect, which we might write up by saying something like this:\n",
@@ -681,7 +681,7 @@
{
"cell_type": "code",
"execution_count": 286,
- "id": "comic-departure",
+ "id": "standard-hundred",
"metadata": {},
"outputs": [
{
@@ -703,7 +703,7 @@
},
{
"cell_type": "markdown",
- "id": "appointed-andrew",
+ "id": "suffering-controversy",
"metadata": {},
"source": [
"`NormalDist().cdf()` calculates the \"cumulative distribution function\" for a normal distribution. Translated to something slightly less opaque, this means that `NormalDist().cdf()` gives us the probability that a random variable X will be less than or equal to a given value. In our case, the given value for the lower tail of the distribution was the negative of our z-score, $-2.259$. So `NormalDist().cdf(-z_score)` gives us the probability that a random value drawn from a normal distribution would be less than or equal to $-2.259$.\n",
@@ -716,7 +716,7 @@
{
"cell_type": "code",
"execution_count": 288,
- "id": "earned-marketing",
+ "id": "intelligent-relief",
"metadata": {},
"outputs": [
{
@@ -739,7 +739,7 @@
},
{
"cell_type": "markdown",
- "id": "solar-island",
+ "id": "initial-affairs",
"metadata": {},
"source": [
"(zassumptions)=\n",
@@ -756,7 +756,7 @@
},
{
"cell_type": "markdown",
- "id": "considered-adoption",
+ "id": "controlling-copying",
"metadata": {},
"source": [
"(onesamplettest)=\n",
@@ -768,7 +768,7 @@
{
"cell_type": "code",
"execution_count": 289,
- "id": "completed-reproduction",
+ "id": "decimal-arlington",
"metadata": {},
"outputs": [
{
@@ -789,7 +789,7 @@
},
{
"cell_type": "markdown",
- "id": "decimal-incentive",
+ "id": "cognitive-prisoner",
"metadata": {},
"source": [
"In other words, while I can't say that I know that $\\sigma = 9.5$, I *can* say that $\\hat\\sigma = 9.52$. \n",
@@ -800,7 +800,7 @@
{
"cell_type": "code",
"execution_count": 291,
- "id": "incoming-duplicate",
+ "id": "right-vancouver",
"metadata": {
"tags": [
"hide-input"
@@ -873,7 +873,7 @@
},
{
"cell_type": "markdown",
- "id": "elect-direction",
+ "id": "premium-philosophy",
"metadata": {},
"source": [
"\n",
@@ -893,7 +893,7 @@
},
{
"cell_type": "markdown",
- "id": "reported-statistics",
+ "id": "raised-height",
"metadata": {},
"source": [
"### Introducing the $t$-test\n",
@@ -910,7 +910,7 @@
{
"cell_type": "code",
"execution_count": 325,
- "id": "closing-armenia",
+ "id": "dedicated-julian",
"metadata": {
"tags": [
"hide-input"
@@ -966,7 +966,7 @@
},
{
"cell_type": "markdown",
- "id": "romantic-accuracy",
+ "id": "faced-fundamentals",
"metadata": {},
"source": [
"```{glue:figure} ttestdist-fig\n",
@@ -982,7 +982,7 @@
},
{
"cell_type": "markdown",
- "id": "fresh-synthetic",
+ "id": "electoral-cricket",
"metadata": {},
"source": [
"### Doing the test in Python\n",
@@ -995,7 +995,7 @@
{
"cell_type": "code",
"execution_count": 359,
- "id": "legendary-turkey",
+ "id": "auburn-symphony",
"metadata": {},
"outputs": [
{
@@ -1016,7 +1016,7 @@
},
{
"cell_type": "markdown",
- "id": "finnish-thousand",
+ "id": "engaged-designation",
"metadata": {},
"source": [
"So that seems straightforward enough. Our calculation resulted in a $t$-statistic of 2.25, and a $p$-value of 0.036. Now what do we *do* with this output? Well, since we're pretending that we actually care about my toy example, we're overjoyed to discover that the result is statistically significant (i.e. $p$ value below .05), and we will probably want to report our result. We could report the result by saying something like this:\n",
@@ -1033,7 +1033,7 @@
{
"cell_type": "code",
"execution_count": 352,
- "id": "adopted-portfolio",
+ "id": "superior-poison",
"metadata": {},
"outputs": [
{
@@ -1055,7 +1055,7 @@
},
{
"cell_type": "markdown",
- "id": "universal-customs",
+ "id": "recreational-engineering",
"metadata": {},
"source": [
"Now at least we have the bare minimum of what is necessary to report our results. Still, it would be sweet if we could get those confidence intervals as well. `scipy` actually has all the tools we need, and why these are not just built into the `ttest_1samp()` method is beyond me. To find the confidence interval, we need to:\n",
@@ -1071,7 +1071,7 @@
{
"cell_type": "code",
"execution_count": 356,
- "id": "unlimited-mystery",
+ "id": "sized-african",
"metadata": {},
"outputs": [
{
@@ -1099,7 +1099,7 @@
},
{
"cell_type": "markdown",
- "id": "gothic-disclaimer",
+ "id": "disturbed-gravity",
"metadata": {},
"source": [
"Whew. Now at least we have everything we need for a full report of our results.\n",
@@ -1113,7 +1113,7 @@
},
{
"cell_type": "markdown",
- "id": "lovely-medication",
+ "id": "electric-calgary",
"metadata": {},
"source": [
"(ttestoneassumptions)=\n",
@@ -1129,7 +1129,7 @@
},
{
"cell_type": "markdown",
- "id": "middle-paris",
+ "id": "overhead-anderson",
"metadata": {},
"source": [
"(studentttest)=\n",
@@ -1140,7 +1140,7 @@
},
{
"cell_type": "markdown",
- "id": "corrected-cleaning",
+ "id": "adult-clarity",
"metadata": {},
"source": [
"### The data\n",
@@ -1151,7 +1151,7 @@
{
"cell_type": "code",
"execution_count": 360,
- "id": "canadian-hunger",
+ "id": "authorized-handling",
"metadata": {},
"outputs": [
{
@@ -1232,7 +1232,7 @@
},
{
"cell_type": "markdown",
- "id": "honest-simpson",
+ "id": "demanding-happening",
"metadata": {},
"source": [
"As we can see, there's a single data frame with two variables, `grade` and `tutor`. The `grade` variable is a numeric vector, containing the grades for all $N = 33$ students taking Dr Harpo's class; the `tutor` variable is a factor that indicates who each student's tutor was. The first five observations in this data set are shown above, and below is a nice little table with some summary statistics:"
@@ -1241,7 +1241,7 @@
{
"cell_type": "code",
"execution_count": 437,
- "id": "possible-dance",
+ "id": "orange-newsletter",
"metadata": {
"tags": [
"hide-input"
@@ -1320,7 +1320,7 @@
},
{
"cell_type": "markdown",
- "id": "material-wagon",
+ "id": "suffering-luxury",
"metadata": {},
"source": [
"To give you a more detailed sense of what's going on here, I've plotted histograms showing the distribution of grades for both tutors in {numref}`fig-harpohist`. Inspection of these histograms suggests that the students in Anastasia's class may be getting slightly better grades on average, though they also seem a little more variable."
@@ -1329,7 +1329,7 @@
{
"cell_type": "code",
"execution_count": 408,
- "id": "flying-being",
+ "id": "authentic-basin",
"metadata": {
"tags": [
"hide-input"
@@ -1371,7 +1371,7 @@
},
{
"cell_type": "markdown",
- "id": "professional-block",
+ "id": "expired-challenge",
"metadata": {},
"source": [
" ```{glue:figure} harpohist_fig\n",
@@ -1384,7 +1384,7 @@
},
{
"cell_type": "markdown",
- "id": "manufactured-ordering",
+ "id": "permanent-summit",
"metadata": {},
"source": [
"{numref}`fig-ttestci` is a simpler plot showing the means and corresponding confidence intervals for both groups of students."
@@ -1393,7 +1393,7 @@
{
"cell_type": "code",
"execution_count": 413,
- "id": "explicit-frame",
+ "id": "academic-prospect",
"metadata": {
"tags": [
"hide-input"
@@ -1430,7 +1430,7 @@
},
{
"cell_type": "markdown",
- "id": "defensive-mouth",
+ "id": "intense-utilization",
"metadata": {},
"source": [
" ```{glue:figure} ttestci-fig\n",
@@ -1444,7 +1444,7 @@
},
{
"cell_type": "markdown",
- "id": "compressed-preference",
+ "id": "known-directory",
"metadata": {},
"source": [
"### Introducing the test\n",
@@ -1464,7 +1464,7 @@
{
"cell_type": "code",
"execution_count": 436,
- "id": "aggregate-democrat",
+ "id": "supreme-metadata",
"metadata": {
"tags": [
"hide-input"
@@ -1528,7 +1528,7 @@
},
{
"cell_type": "markdown",
- "id": "fleet-semiconductor",
+ "id": "generic-infection",
"metadata": {},
"source": [
" ```{glue:figure} ttesthyp_fig\n",
@@ -1541,7 +1541,7 @@
},
{
"cell_type": "markdown",
- "id": "opened-aberdeen",
+ "id": "theoretical-establishment",
"metadata": {},
"source": [
"To construct a hypothesis test that handles this scenario, we start by noting that if the null hypothesis is true, then the difference between the population means is *exactly* zero, \n",
@@ -1563,7 +1563,7 @@
},
{
"cell_type": "markdown",
- "id": "hairy-injection",
+ "id": "civilian-visitor",
"metadata": {},
"source": [
"### A \"pooled estimate\" of the standard deviation\n",
@@ -1594,7 +1594,7 @@
},
{
"cell_type": "markdown",
- "id": "changed-bishop",
+ "id": "efficient-problem",
"metadata": {},
"source": [
"### The same pooled estimate, described differently\n",
@@ -1622,7 +1622,7 @@
},
{
"cell_type": "markdown",
- "id": "limited-hampshire",
+ "id": "encouraging-superintendent",
"metadata": {},
"source": [
"(indsamplesttest_formula)=\n",
@@ -1645,7 +1645,7 @@
},
{
"cell_type": "markdown",
- "id": "unexpected-designer",
+ "id": "beginning-retreat",
"metadata": {},
"source": [
"### Doing the test in Python\n",
@@ -1658,7 +1658,7 @@
{
"cell_type": "code",
"execution_count": 454,
- "id": "democratic-edgar",
+ "id": "norman-collective",
"metadata": {},
"outputs": [
{
@@ -1684,7 +1684,7 @@
},
{
"cell_type": "markdown",
- "id": "auburn-sleeve",
+ "id": "reserved-uruguay",
"metadata": {},
"source": [
"This is fairly straightforward, and just as it was for the one-sample $t$-test, `scipy` does very little to format its results or give you any information over the bare minimum. You get a $t$-statistic and a $p$-value and that's that. Luckily, as was the case with the one-sample $t$-test, getting the other elements we need to report our results isn't too bad. We will need the $t$-statistic, the $p$-value, the mean of each group, and the degrees of freedom. The first two we already have, and the last two are easy to get. As we discussed [above](indsamplesttest_formula), the degrees of freedom for an independent samples $t$-test is $N-2$, so..."
@@ -1693,7 +1693,7 @@
{
"cell_type": "code",
"execution_count": 465,
- "id": "thrown-macintosh",
+ "id": "valuable-blind",
"metadata": {},
"outputs": [
{
@@ -1717,7 +1717,7 @@
},
{
"cell_type": "markdown",
- "id": "anticipated-neighborhood",
+ "id": "forty-blond",
"metadata": {},
"source": [
"You probably noticed that in addition to telling `ttest_ind` which means I wanted to compare, I also added the argument `equal_var = True` to the command. This wasn't strictly necessary in this case, because by default this argument is set to `True`. But I made it explicit anyway, because we will be using this argument again later. By saying `equal_var = True`, what we're really doing is telling Python to use the *Student* independent samples $t$-test. More on this later."
@@ -1725,7 +1725,7 @@
},
{
"cell_type": "markdown",
- "id": "european-cosmetic",
+ "id": "arabic-narrow",
"metadata": {},
"source": [
"In any case, the difference between the two groups is significant (just barely), so we might write up the result using text like this:\n",
@@ -1735,7 +1735,7 @@
},
{
"cell_type": "markdown",
- "id": "economic-origin",
+ "id": "recognized-portrait",
"metadata": {},
"source": [
" \n",
@@ -1767,7 +1767,7 @@
},
{
"cell_type": "markdown",
- "id": "integral-consistency",
+ "id": "accurate-world",
"metadata": {},
"source": [
"(studentassumptions)= \n",
@@ -1783,7 +1783,7 @@
},
{
"cell_type": "markdown",
- "id": "amber-clear",
+ "id": "complimentary-logistics",
"metadata": {},
"source": [
"(welchttest)=\n",
@@ -1817,7 +1817,7 @@
{
"cell_type": "code",
"execution_count": 530,
- "id": "silver-contest",
+ "id": "beautiful-graham",
"metadata": {
"tags": [
"hide-input"
@@ -1888,7 +1888,7 @@
},
{
"cell_type": "markdown",
- "id": "endless-chosen",
+ "id": "strong-given",
"metadata": {},
"source": [
" ```{glue:figure} ttesthyp2_fig\n",
@@ -1901,7 +1901,7 @@
},
{
"cell_type": "markdown",
- "id": "scientific-modem",
+ "id": "plastic-utilization",
"metadata": {},
"source": [
"### Doing the test in Python\n",
@@ -1912,7 +1912,7 @@
{
"cell_type": "code",
"execution_count": 531,
- "id": "prompt-spoke",
+ "id": "innovative-payroll",
"metadata": {},
"outputs": [
{
@@ -1933,7 +1933,7 @@
},
{
"cell_type": "markdown",
- "id": "every-morris",
+ "id": "cross-platinum",
"metadata": {},
"source": [
"Not too difficult, right? Not surprisingly, the output has exactly the same format as it did last time too: a test statistic $t$, and a $p$-value. So that's all pretty easy. \n",
@@ -1948,7 +1948,7 @@
},
{
"cell_type": "markdown",
- "id": "administrative-quarterly",
+ "id": "proud-knowing",
"metadata": {},
"source": [
"(pairedsamplesttest)=\n",
@@ -1959,7 +1959,7 @@
},
{
"cell_type": "markdown",
- "id": "recovered-bunch",
+ "id": "extraordinary-amount",
"metadata": {},
"source": [
"### The data\n",
@@ -1970,7 +1970,7 @@
{
"cell_type": "code",
"execution_count": 563,
- "id": "korean-contest",
+ "id": "cardiovascular-prevention",
"metadata": {},
"outputs": [],
"source": [
@@ -1980,7 +1980,7 @@
},
{
"cell_type": "markdown",
- "id": "coordinate-moment",
+ "id": "distinguished-rachel",
"metadata": {},
"source": [
"The data frame `chico` contains three variables: an `id` variable that identifies each student in the class, the `grade_test1` variable that records the student grade for the first test, and the `grade_test2` variable that has the grades for the second test. Here's the first five students:"
@@ -1989,7 +1989,7 @@
{
"cell_type": "code",
"execution_count": 534,
- "id": "future-telescope",
+ "id": "marine-uncertainty",
"metadata": {},
"outputs": [
{
@@ -2073,7 +2073,7 @@
},
{
"cell_type": "markdown",
- "id": "changed-bobby",
+ "id": "union-survey",
"metadata": {},
"source": [
"At a glance, it does seem like the class is a hard one (most grades are between 50\\% and 60\\%), but it does look like there's an improvement from the first test to the second one. If we take a quick look at the descriptive statistics"
@@ -2082,7 +2082,7 @@
{
"cell_type": "code",
"execution_count": 535,
- "id": "clean-oxford",
+ "id": "facial-millennium",
"metadata": {},
"outputs": [
{
@@ -2178,7 +2178,7 @@
},
{
"cell_type": "markdown",
- "id": "superb-amendment",
+ "id": "labeled-botswana",
"metadata": {},
"source": [
"we see that this impression seems to be supported. Across all 20 students[^note12] the mean grade for the first test is 57\\%, but this rises to 58\\% for the second test. Although, given that the standard deviations are 6.6\\% and 6.4\\% respectively, it's starting to feel like maybe the improvement is just illusory; maybe just random variation. This impression is reinforced when you see the means and confidence intervals plotted in {numref}`fig-pairedta` panel A. If we were to rely on this plot alone, we'd come to the same conclusion that we got from looking at the descriptive statistics that the `describe()` method produced. Looking at how wide those confidence intervals are, we'd be tempted to think that the apparent improvement in student performance is pure chance."
@@ -2187,7 +2187,7 @@
{
"cell_type": "code",
"execution_count": 595,
- "id": "passing-fraction",
+ "id": "motivated-least",
"metadata": {
"tags": [
"hide-input"
@@ -2237,7 +2237,7 @@
},
{
"cell_type": "markdown",
- "id": "difficult-montreal",
+ "id": "serious-spain",
"metadata": {},
"source": [
"```{glue:figure} pairedta_fig\n",
@@ -2251,7 +2251,7 @@
},
{
"cell_type": "markdown",
- "id": "front-madagascar",
+ "id": "proud-printer",
"metadata": {},
"source": [
"Nevertheless, this impression is wrong. To see why, take a look at the scatterplot of the grades for test 1 against the grades for test 2, shown in {numref}`fig-pairedta` panel B. \n",
@@ -2264,7 +2264,7 @@
{
"cell_type": "code",
"execution_count": 596,
- "id": "mature-mouth",
+ "id": "incorrect-witch",
"metadata": {},
"outputs": [],
"source": [
@@ -2273,7 +2273,7 @@
},
{
"cell_type": "markdown",
- "id": "present-breakfast",
+ "id": "synthetic-individual",
"metadata": {},
"source": [
"Notice that I assigned the output to a variable called `df['improvement']`. That has the effect of creating a new column called `improvement` inside the `chico` data frame. Now that we've created and stored this `improvement` variable, we can draw a histogram showing the distribution of these improvement scores, shown in {numref}`fig-pairedta` panel C. \n",
@@ -2285,7 +2285,7 @@
{
"cell_type": "code",
"execution_count": 598,
- "id": "instructional-interpretation",
+ "id": "athletic-upgrade",
"metadata": {},
"outputs": [
{
@@ -2310,7 +2310,7 @@
},
{
"cell_type": "markdown",
- "id": "disturbed-radius",
+ "id": "decent-patio",
"metadata": {},
"source": [
"we see that it is 95\\% certain that the true (population-wide) average improvement would lie between 0.95\\% and 1.86\\%. So you can see, qualitatively, what's going on: there is a real \"within student\" improvement (everyone improves by about 1\\%), but it is very small when set against the quite large \"between student\" differences (student grades vary by about 20\\% or so). "
@@ -2318,7 +2318,7 @@
},
{
"cell_type": "markdown",
- "id": "violent-empty",
+ "id": "spiritual-personal",
"metadata": {},
"source": [
"### What is the paired samples $t$-test?\n",
@@ -2357,7 +2357,7 @@
},
{
"cell_type": "markdown",
- "id": "cardiac-liver",
+ "id": "emotional-thanks",
"metadata": {},
"source": [
"### Doing the test in Python \n",
@@ -2368,7 +2368,7 @@
{
"cell_type": "code",
"execution_count": 600,
- "id": "greek-consciousness",
+ "id": "wound-bobby",
"metadata": {},
"outputs": [
{
@@ -2389,7 +2389,7 @@
},
{
"cell_type": "markdown",
- "id": "continent-routine",
+ "id": "through-borough",
"metadata": {},
"source": [
"However, suppose you're lazy and you don't want to go to all the effort of creating a new variable. Or perhaps you just want to keep the difference between one-sample and paired-samples tests clear in your head. In that case, `scipy` also has a built-in method for conducting paired $t$-tests called `ttest_rel` (the `_rel` part is for \"related\"). Using this method, we get:"
@@ -2398,7 +2398,7 @@
{
"cell_type": "code",
"execution_count": 599,
- "id": "mediterranean-clinic",
+ "id": "silent-compromise",
"metadata": {},
"outputs": [
{
@@ -2420,7 +2420,7 @@
},
{
"cell_type": "markdown",
- "id": "lyric-composer",
+ "id": "usual-shade",
"metadata": {},
"source": [
"Either way, the result is exactly the same, which is strangely comforting, actually. Not only that, but the result confirms our intuition. There’s an average improvement of 1.4% from test 1 to test 2, and this is significantly different from 0 ($t$(19) = 6.48, $p$ < .001). In fact, $p$ is quite a bit less than .001: the $p$-value has been given in scientific notation, and the exact value is $3.32 \\times 10^{-6}$, that is, $p$ = 0.00000332."
@@ -2428,7 +2428,7 @@
},
{
"cell_type": "markdown",
- "id": "orange-address",
+ "id": "helpful-dallas",
"metadata": {},
"source": [
"## One sided tests\n",
@@ -2439,7 +2439,7 @@
{
"cell_type": "code",
"execution_count": 606,
- "id": "promising-recovery",
+ "id": "worst-shift",
"metadata": {},
"outputs": [
{
@@ -2464,7 +2464,7 @@
},
{
"cell_type": "markdown",
- "id": "absolute-allocation",
+ "id": "academic-quantity",
"metadata": {},
"source": [
"The $t$-statistics are exactly the same, which makes sense, if you think about it, because the calculation of $t$ is based on the mean and standard deviation, and these do not change. The $p$-value, on the other hand, is lower for the one-sided test. The only thing that changes between the two tests is the _expectation_ that we bring to the data. The way that the $p$-value is calculated depends on those expectations, and they are the reason for choosing one test over the other. It should go without saying, but maybe is worth saying anyway, that our reasons for choosing one test over the other should be theoretical, and not based on which test is more likely to give us the $p$-value we want!\n",
@@ -2476,7 +2476,7 @@
{
"cell_type": "code",
"execution_count": 613,
- "id": "popular-helmet",
+ "id": "fallen-multimedia",
"metadata": {},
"outputs": [
{
@@ -2510,7 +2510,7 @@
},
{
"cell_type": "markdown",
- "id": "cultural-petersburg",
+ "id": "geographic-reconstruction",
"metadata": {},
"source": [
"What about the paired samples $t$-test? Suppose we wanted to test the hypothesis that grades go *up* from test 1 to test 2 in Dr. Chico's class, and are not prepared to consider the idea that the grades go down. Again, we can use the `alternative` argument to specify the one-sided test, and it works the same way it does for the independent samples $t$-test. Since we are comparing test 1 to test 2 by subtracting one from the other, it makes a difference whether we subtract test 1 from test 2, or test 2 from test 1. So, to test the hypothesis that grades for test 2 are higher than test 1, we will need to enter the grades from test 2 first; otherwise we are testing the opposite hypothesis: "
@@ -2519,7 +2519,7 @@
{
"cell_type": "code",
"execution_count": 616,
- "id": "solved-swiss",
+ "id": "clean-performer",
"metadata": {},
"outputs": [
{
@@ -2545,7 +2545,7 @@
},
{
"cell_type": "markdown",
- "id": "latin-collar",
+ "id": "productive-monte",
"metadata": {},
"source": [
"(cohensd)=\n",
@@ -2564,7 +2564,7 @@
},
{
"cell_type": "markdown",
- "id": "premier-jumping",
+ "id": "eligible-switch",
"metadata": {},
"source": [
"(dinterpretation)=\n",
@@ -2574,7 +2574,7 @@
},
{
"cell_type": "markdown",
- "id": "light-start",
+ "id": "technological-nursery",
"metadata": {},
"source": [
"| d-value | rough interpretation |\n",
@@ -2586,7 +2586,7 @@
},
{
"cell_type": "markdown",
- "id": "opponent-intellectual",
+ "id": "successful-delight",
"metadata": {},
"source": [
"### Cohen's $d$ from one sample\n",
@@ -2601,7 +2601,7 @@
{
"cell_type": "code",
"execution_count": 624,
- "id": "pressed-medicaid",
+ "id": "subsequent-complex",
"metadata": {},
"outputs": [
{
@@ -2628,7 +2628,7 @@
},
{
"cell_type": "markdown",
- "id": "vietnamese-swimming",
+ "id": "clean-classics",
"metadata": {},
"source": [
"What does this effect size mean? Overall, then, the psychology students in Dr Zeppo's class are achieving grades (mean = 72.3\\%) that are about .5 standard deviations higher than the level that you'd expect (67.5\\%) if they were performing at the same level as other students. Judged against Cohen's rough guide, this is a moderate effect size."
@@ -2636,7 +2636,7 @@
},
{
"cell_type": "markdown",
- "id": "latin-trick",
+ "id": "located-sweden",
"metadata": {},
"source": [
"### Cohen's $d$ from a Student $t$ test\n",
@@ -2658,7 +2658,7 @@
},
{
"cell_type": "markdown",
- "id": "pacific-remainder",
+ "id": "novel-microwave",
"metadata": {},
"source": [
"However, there are other possibilities, which I'll briefly describe. Firstly, you may have reason to want to use only one of the two groups as the basis for calculating the standard deviation. This approach (often called Glass' $\\Delta$) makes the most sense when you have good reason to treat one of the two groups as a purer reflection of \"natural variation\" than the other. This can happen if, for instance, one of the two groups is a control group. Secondly, recall that in the usual calculation of the pooled standard deviation we divide by $N-2$ to correct for the bias in the sample variance; in one version of Cohen's $d$ this correction is omitted. Instead, we divide by $N$. This version makes sense primarily when you're trying to calculate the effect size in the sample, rather than estimating an effect size in the population. Finally, there is a version based on @Hedges1985, who point out there is a small bias in the usual (pooled) estimation for Cohen's $d$. Thus they introduce a small correction, by multiplying the usual value of $d$ by $(N-3)/(N-2.25)$. \n",
@@ -2669,7 +2669,7 @@
{
"cell_type": "code",
"execution_count": 654,
- "id": "executed-organizer",
+ "id": "extensive-stick",
"metadata": {},
"outputs": [
{
@@ -2719,7 +2719,7 @@
},
{
"cell_type": "markdown",
- "id": "directed-regard",
+ "id": "recent-gothic",
"metadata": {},
"source": [
"### Cohen's $d$ from a Welch test\n",
@@ -2749,7 +2749,7 @@
{
"cell_type": "code",
"execution_count": 657,
- "id": "white-figure",
+ "id": "twenty-colleague",
"metadata": {},
"outputs": [
{
@@ -2784,7 +2784,7 @@
},
{
"cell_type": "markdown",
- "id": "wired-latest",
+ "id": "normal-resistance",
"metadata": {},
"source": [
"### Cohen's $d$ from a paired-samples test\n",
@@ -2802,7 +2802,7 @@
{
"cell_type": "code",
"execution_count": 663,
- "id": "fatal-jaguar",
+ "id": "minus-pressure",
"metadata": {},
"outputs": [
{
@@ -2831,7 +2831,7 @@
},
{
"cell_type": "markdown",
- "id": "loving-infrared",
+ "id": "annoying-turning",
"metadata": {},
"source": [
"The only wrinkle is figuring out whether this is the measure you want or not. To the extent that you care about the practical consequences of your research, you often want to measure the effect size relative to the *original* variables, not the *difference* scores (e.g., the 1\\% improvement in Dr Chico's class is pretty small when measured against the amount of between-student variation in grades), in which case you use the same versions of Cohen's $d$ that you would use for a Student or Welch test. For instance, when we do that for Dr Chico's class, "
@@ -2840,7 +2840,7 @@
{
"cell_type": "code",
"execution_count": 666,
- "id": "crazy-savage",
+ "id": "aging-abuse",
"metadata": {},
"outputs": [
{
@@ -2874,7 +2874,7 @@
},
{
"cell_type": "markdown",
- "id": "miniature-theme",
+ "id": "protective-northwest",
"metadata": {},
"source": [
"what we see is that the overall effect size is quite small, when assessed on the scale of the original variables."
@@ -2882,7 +2882,7 @@
},
{
"cell_type": "markdown",
- "id": "metropolitan-horizon",
+ "id": "attended-worcester",
"metadata": {},
"source": [
"(shapiro)=\n",
@@ -2893,7 +2893,7 @@
},
{
"cell_type": "markdown",
- "id": "seasonal-casino",
+ "id": "introductory-iceland",
"metadata": {},
"source": [
"### QQ plots\n",
@@ -2904,7 +2904,7 @@
{
"cell_type": "code",
"execution_count": 706,
- "id": "oriented-mobile",
+ "id": "educated-contamination",
"metadata": {
"tags": [
"hide-input"
@@ -2954,7 +2954,7 @@
},
{
"cell_type": "markdown",
- "id": "reasonable-research",
+ "id": "swiss-daily",
"metadata": {},
"source": [
" ```{glue:figure} qq_fig\n",
@@ -2967,7 +2967,7 @@
},
{
"cell_type": "markdown",
- "id": "talented-california",
+ "id": "personal-tender",
"metadata": {},
"source": [
"And the results are shown in {numref}`fig-qq`, above.\n",
@@ -2978,7 +2978,7 @@
{
"cell_type": "code",
"execution_count": 704,
- "id": "buried-aberdeen",
+ "id": "decreased-television",
"metadata": {
"tags": [
"hide-input"
@@ -3011,7 +3011,7 @@
},
{
"cell_type": "markdown",
- "id": "alleged-thirty",
+ "id": "phantom-warehouse",
"metadata": {},
"source": [
" ```{glue:figure} qqskew_fig\n",
@@ -3026,7 +3026,7 @@
{
"cell_type": "code",
"execution_count": 707,
- "id": "technical-lafayette",
+ "id": "domestic-denver",
"metadata": {
"tags": [
"hide-input"
@@ -3071,7 +3071,7 @@
},
{
"cell_type": "markdown",
- "id": "actual-supplier",
+ "id": "parallel-reply",
"metadata": {},
"source": [
" ```{glue:figure} qqheavy_fig\n",
@@ -3085,7 +3085,7 @@
},
{
"cell_type": "markdown",
- "id": "entire-spring",
+ "id": "developmental-ballet",
"metadata": {},
"source": [
"### Shapiro-Wilk tests\n",
@@ -3103,7 +3103,7 @@
},
{
"cell_type": "markdown",
- "id": "hourly-desktop",
+ "id": "through-performance",
"metadata": {},
"source": [
"```{figure} ../img/ttest2/shapirowilkdist.png\n",
@@ -3118,7 +3118,7 @@
},
{
"cell_type": "markdown",
- "id": "going-leeds",
+ "id": "smoking-hospital",
"metadata": {},
"source": [
"To run the test in Python, we use the `scipy.stats.shapiro` method. It has only a single argument `x`, which is a numeric vector containing the data whose normality needs to be tested. For example, when we apply this function to our `normal_data`, we get the following:"
@@ -3127,7 +3127,7 @@
{
"cell_type": "code",
"execution_count": 708,
- "id": "grateful-grocery",
+ "id": "thrown-seminar",
"metadata": {},
"outputs": [
{
@@ -3149,7 +3149,7 @@
},
{
"cell_type": "markdown",
- "id": "sporting-improvement",
+ "id": "packed-bidder",
"metadata": {},
"source": [
"So, not surprisingly, we have no evidence that these data depart from normality. When reporting the results for a Shapiro-Wilk test, you should (as usual) make sure to include the test statistic $W$ and the $p$ value, though given that the sampling distribution depends so heavily on $N$ it would probably be a politeness to include $N$ as well."
@@ -3157,7 +3157,7 @@
},
{
"cell_type": "markdown",
- "id": "afraid-relevance",
+ "id": "reduced-combat",
"metadata": {},
"source": [
"(wilcox)=\n",
@@ -3170,7 +3170,7 @@
},
{
"cell_type": "markdown",
- "id": "computational-shape",
+ "id": "introductory-concern",
"metadata": {},
"source": [
"### Two sample Wilcoxon test\n",
@@ -3180,8 +3180,8 @@
},
{
"cell_type": "code",
- "execution_count": 710,
- "id": "experimental-prophet",
+ "execution_count": 728,
+ "id": "judicial-swiss",
"metadata": {},
"outputs": [
{
@@ -3205,62 +3205,62 @@
" \n",
" \n",
" | \n",
- " scores | \n",
- " group | \n",
+ " score_A | \n",
+ " score_B | \n",
"
\n",
" \n",
"
\n",
" \n",
" 0 | \n",
" 6.4 | \n",
- " A | \n",
+ " 14.5 | \n",
"
\n",
" \n",
" 1 | \n",
" 10.7 | \n",
- " A | \n",
+ " 10.4 | \n",
"
\n",
" \n",
" 2 | \n",
" 11.9 | \n",
- " A | \n",
+ " 12.9 | \n",
"
\n",
" \n",
" 3 | \n",
" 7.3 | \n",
- " A | \n",
+ " 11.7 | \n",
"
\n",
" \n",
" 4 | \n",
" 10.0 | \n",
- " A | \n",
+ " 13.0 | \n",
"
\n",
" \n",
"\n",
""
],
"text/plain": [
- " scores group\n",
- "0 6.4 A\n",
- "1 10.7 A\n",
- "2 11.9 A\n",
- "3 7.3 A\n",
- "4 10.0 A"
+ " score_A score_B\n",
+ "0 6.4 14.5\n",
+ "1 10.7 10.4\n",
+ "2 11.9 12.9\n",
+ "3 7.3 11.7\n",
+ "4 10.0 13.0"
]
},
- "execution_count": 710,
+ "execution_count": 728,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "df = pd.read_csv(\"https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/awesome.csv\")\n",
+ "df = pd.read_csv(\"https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/awesome2.csv\")\n",
"df"
]
},
{
"cell_type": "markdown",
- "id": "growing-better",
+ "id": "driven-pursuit",
"metadata": {},
"source": [
"As long as there are no ties (i.e., people with the exact same awesomeness score), then the test that we want to do is surprisingly simple. All we have to do is construct a table that compares every observation in group $A$ against every observation in group $B$. Whenever the group $A$ datum is larger, we place a check mark in the table:"
@@ -3268,7 +3268,7 @@
},
{
"cell_type": "markdown",
- "id": "forced-biography",
+ "id": "instant-dollar",
"metadata": {},
"source": [
"\n",
@@ -3284,15 +3284,40 @@
},
{
"cell_type": "markdown",
- "id": "miniature-enclosure",
+ "id": "recreational-comparison",
+ "metadata": {},
+ "source": [
+ "We then count up the number of checkmarks. This is our test statistic, $W$.[^note15] The actual sampling distribution for $W$ is somewhat complicated, and I'll skip the details. For our purposes, it's sufficient to note that the interpretation of $W$ is qualitatively the same as the interpretation of $t$ or $z$. That is, if we want a two-sided test, then we reject the null hypothesis when $W$ is very large or very small; but if we have a directional (i.e., one-sided) hypothesis, then we only use one or the other. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 733,
+ "id": "revolutionary-biodiversity",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "WilcoxonResult(statistic=1.0, pvalue=0.125)"
+ ]
+ },
+ "execution_count": 733,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from scipy.stats import wilcoxon\n",
+ " \n",
+ "wilcoxon(df['score_A'], df['score_B'])\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "statewide-dealing",
"metadata": {},
"source": [
- "We then count up the number of checkmarks. This is our test statistic, $W$.[^note15] The actual sampling distribution for $W$ is somewhat complicated, and I'll skip the details. For our purposes, it's sufficient to note that the interpretation of $W$ is qualitatively the same as the interpretation of $t$ or $z$. That is, if we want a two-sided test, then we reject the null hypothesis when $W$ is very large or very small; but if we have a directional (i.e., one-sided) hypothesis, then we only use one or the other. \n",
- "\n",
- "The structure of the `wilcox.test()` function should feel very familiar to you by now. When you have your data organised in terms of an outcome variable and a grouping variable, then you use the `formula` and `data` arguments, so your command looks like this:\n",
- "```{r}\n",
- "wilcox.test( formula = scores ~ group, data = awesome)\n",
- "```\n",
"Just like we saw with the `t.test()` function, there is an `alternative` argument that you can use to switch between two-sided tests and one-sided tests, plus a few other arguments that we don't need to worry too much about at an introductory level. \n",
"Similarly, the `wilcox.test()` function allows you to use the `x` and `y` arguments when you have your data stored separately for each group. For instance, suppose we use the data from the `awesome2.Rdata` file:\n",
"```{r}\n",
diff --git a/Chapters/05.02-ttest.ipynb b/Chapters/05.02-ttest.ipynb
index 24f5c541..63a5e3cc 100644
--- a/Chapters/05.02-ttest.ipynb
+++ b/Chapters/05.02-ttest.ipynb
@@ -230,7 +230,7 @@
},
{
"cell_type": "markdown",
- "id": "drawn-pacific",
+ "id": "roman-canvas",
"metadata": {},
"source": [
"### Constructing the hypothesis test\n",
@@ -255,7 +255,7 @@
{
"cell_type": "code",
"execution_count": 253,
- "id": "common-investing",
+ "id": "rapid-brooklyn",
"metadata": {
"tags": [
"hide-input"
@@ -327,7 +327,7 @@
},
{
"cell_type": "markdown",
- "id": "shared-element",
+ "id": "organized-rescue",
"metadata": {},
"source": [
"```{glue:figure} ztesthyp-fig\n",
@@ -344,7 +344,7 @@
},
{
"cell_type": "markdown",
- "id": "religious-female",
+ "id": "pacific-liabilities",
"metadata": {},
"source": [
"The next step is to figure out what we would be a good choice for a diagnostic test statistic; something that would help us discriminate between $H_0$ and $H_1$. Given that the hypotheses all refer to the population mean $\\mu$, you'd feel pretty confident that the sample mean $\\bar{X}$ would be a pretty useful place to start. What we could do, is look at the difference between the sample mean $\\bar{X}$ and the value that the null hypothesis predicts for the population mean. In our example, that would mean we calculate $\\bar{X} - 67.5$. More generally, if we let $\\mu_0$ refer to the value that the null hypothesis claims is our population mean, then we'd want to calculate\n",
@@ -399,7 +399,7 @@
{
"cell_type": "code",
"execution_count": 254,
- "id": "wooden-housing",
+ "id": "interim-mongolia",
"metadata": {
"tags": [
"hide-input"
@@ -478,7 +478,7 @@
},
{
"cell_type": "markdown",
- "id": "regulation-mentor",
+ "id": "varying-sleeping",
"metadata": {},
"source": [
"```{glue:figure} ztest-fig\n",
@@ -495,7 +495,7 @@
},
{
"cell_type": "markdown",
- "id": "rental-cambridge",
+ "id": "pretty-guidance",
"metadata": {},
"source": [
"And what this meant, way back in the days where people did all their statistics by hand, is that someone could publish a table like this:"
@@ -503,7 +503,7 @@
},
{
"cell_type": "markdown",
- "id": "municipal-married",
+ "id": "arctic-cooperation",
"metadata": {},
"source": [
"| || critical z value |\n",
@@ -517,7 +517,7 @@
},
{
"cell_type": "markdown",
- "id": "liked-comedy",
+ "id": "cognitive-preliminary",
"metadata": {},
"source": [
"which in turn meant that researchers could calculate their $z$-statistic by hand, and then look up the critical value in a text book. That was an incredibly handy thing to be able to do back then, but it's kind of unnecessary these days, since it's trivially easy to do it with software like Python."
@@ -525,7 +525,7 @@
},
{
"cell_type": "markdown",
- "id": "saved-luther",
+ "id": "offshore-siemens",
"metadata": {},
"source": [
"### A worked example using Python\n",
@@ -536,7 +536,7 @@
{
"cell_type": "code",
"execution_count": 262,
- "id": "conditional-prescription",
+ "id": "irish-prophet",
"metadata": {},
"outputs": [
{
@@ -558,7 +558,7 @@
},
{
"cell_type": "markdown",
- "id": "amino-vegetation",
+ "id": "thermal-alexandria",
"metadata": {},
"source": [
"Then, I create variables corresponding to known population standard deviation ($\\sigma = 9.5$), and the value of the population mean that the null hypothesis specifies ($\\mu_0 = 67.5$):"
@@ -567,7 +567,7 @@
{
"cell_type": "code",
"execution_count": 261,
- "id": "solved-steam",
+ "id": "suitable-throat",
"metadata": {},
"outputs": [],
"source": [
@@ -577,7 +577,7 @@
},
{
"cell_type": "markdown",
- "id": "hundred-lying",
+ "id": "peaceful-brazilian",
"metadata": {},
"source": [
"Let's also create a variable for the sample size. We could count up the number of observations ourselves, and type `N = 20` at the command prompt, but counting is tedious and repetitive. Let's get Python to do the tedious repetitive bit by using the `len()` function, which tells us how many elements there are in a vector:"
@@ -586,7 +586,7 @@
{
"cell_type": "code",
"execution_count": 263,
- "id": "accurate-canberra",
+ "id": "floppy-macintosh",
"metadata": {},
"outputs": [
{
@@ -607,7 +607,7 @@
},
{
"cell_type": "markdown",
- "id": "supported-estate",
+ "id": "statutory-kuwait",
"metadata": {},
"source": [
"Next, let's calculate the (true) standard error of the mean:"
@@ -616,7 +616,7 @@
{
"cell_type": "code",
"execution_count": 266,
- "id": "seventh-clarity",
+ "id": "antique-affiliate",
"metadata": {},
"outputs": [
{
@@ -638,7 +638,7 @@
},
{
"cell_type": "markdown",
- "id": "union-bobby",
+ "id": "usual-lodge",
"metadata": {},
"source": [
"And finally, we calculate our $z$-score:"
@@ -647,7 +647,7 @@
{
"cell_type": "code",
"execution_count": 268,
- "id": "photographic-wichita",
+ "id": "positive-knock",
"metadata": {},
"outputs": [
{
@@ -668,7 +668,7 @@
},
{
"cell_type": "markdown",
- "id": "social-genius",
+ "id": "suited-speaker",
"metadata": {},
"source": [
"At this point, we would traditionally look up the value 2.26 in our table of critical values. Our original hypothesis was two-sided (we didn't really have any theory about whether psych students would be better or worse at statistics than other students) so our hypothesis test is two-sided (or two-tailed) also. Looking at the little table that I showed earlier, we can see that 2.26 is bigger than the critical value of 1.96 that would be required to be significant at $\\alpha = .05$, but smaller than the value of 2.58 that would be required to be significant at a level of $\\alpha = .01$. Therefore, we can conclude that we have a significant effect, which we might write up by saying something like this:\n",
@@ -681,7 +681,7 @@
{
"cell_type": "code",
"execution_count": 286,
- "id": "according-henry",
+ "id": "unable-sperm",
"metadata": {},
"outputs": [
{
@@ -703,7 +703,7 @@
},
{
"cell_type": "markdown",
- "id": "proprietary-rouge",
+ "id": "medium-hollywood",
"metadata": {},
"source": [
"`NormalDist().cdf()` calculates the \"cumulative density function\" for a normal distribution. Translated to something slightly less opaque, this means that `NormalDist().cdf()` gives us the probability that a random variable X will be less than or equal to a given value. In our case, the given value for the lower tail of the distribution was our z-score, $2.259$. So `NormalDist().cdf(-z_score)` gives us the probability that a random value draw from a normal distribution would be less than or equal to $-2.259$.\n",
@@ -716,7 +716,7 @@
{
"cell_type": "code",
"execution_count": 288,
- "id": "congressional-accent",
+ "id": "cross-opposition",
"metadata": {},
"outputs": [
{
@@ -739,7 +739,7 @@
},
{
"cell_type": "markdown",
- "id": "municipal-newman",
+ "id": "incredible-music",
"metadata": {},
"source": [
"(zassumptions)=\n",
@@ -756,7 +756,7 @@
},
{
"cell_type": "markdown",
- "id": "toxic-freeware",
+ "id": "solar-hughes",
"metadata": {},
"source": [
"(onesamplettest)=\n",
@@ -768,7 +768,7 @@
{
"cell_type": "code",
"execution_count": 289,
- "id": "hired-richardson",
+ "id": "accepting-blogger",
"metadata": {},
"outputs": [
{
@@ -789,7 +789,7 @@
},
{
"cell_type": "markdown",
- "id": "prescription-miracle",
+ "id": "rising-nightlife",
"metadata": {},
"source": [
"In other words, while I can't say that I know that $\\sigma = 9.5$, I *can* say that $\\hat\\sigma = 9.52$. \n",
@@ -800,7 +800,7 @@
{
"cell_type": "code",
"execution_count": 291,
- "id": "egyptian-layer",
+ "id": "surgical-reasoning",
"metadata": {
"tags": [
"hide-input"
@@ -873,7 +873,7 @@
},
{
"cell_type": "markdown",
- "id": "economic-russian",
+ "id": "charitable-water",
"metadata": {},
"source": [
"\n",
@@ -893,7 +893,7 @@
},
{
"cell_type": "markdown",
- "id": "cheap-rochester",
+ "id": "loaded-overview",
"metadata": {},
"source": [
"### Introducing the $t$-test\n",
@@ -910,7 +910,7 @@
{
"cell_type": "code",
"execution_count": 325,
- "id": "played-rubber",
+ "id": "hidden-cancer",
"metadata": {
"tags": [
"hide-input"
@@ -966,7 +966,7 @@
},
{
"cell_type": "markdown",
- "id": "yellow-shanghai",
+ "id": "muslim-purchase",
"metadata": {},
"source": [
"```{glue:figure} ttestdist-fig\n",
@@ -982,7 +982,7 @@
},
{
"cell_type": "markdown",
- "id": "cordless-theory",
+ "id": "dietary-failure",
"metadata": {},
"source": [
"### Doing the test in Python\n",
@@ -995,7 +995,7 @@
{
"cell_type": "code",
"execution_count": 359,
- "id": "accepted-corrections",
+ "id": "silent-shooting",
"metadata": {},
"outputs": [
{
@@ -1016,7 +1016,7 @@
},
{
"cell_type": "markdown",
- "id": "sacred-charles",
+ "id": "administrative-mediterranean",
"metadata": {},
"source": [
"So that seems straightforward enough. Our calculation resulted in a $t$-statistic of 2.54, and a $p$-value of 0.36. Now what do we *do* with this output? Well, since we're pretending that we actually care about my toy example, we're overjoyed to discover that the result is statistically significant (i.e. $p$ value below .05), and we will probably want to report our result. We could report the result by saying something like this:\n",
@@ -1033,7 +1033,7 @@
{
"cell_type": "code",
"execution_count": 352,
- "id": "technological-vegetable",
+ "id": "smart-federation",
"metadata": {},
"outputs": [
{
@@ -1055,7 +1055,7 @@
},
{
"cell_type": "markdown",
- "id": "noble-telescope",
+ "id": "measured-pencil",
"metadata": {},
"source": [
"Now at least we have the bare minimum of what is necessary to report our results. Still, it would be sweet if we could get those confidence intervals as well. `scipy` actually has all the tools we need, and why these are not just built into the `ttest_1samp()` method is beyond me. To find the confidence interval, we need to:\n",
@@ -1071,7 +1071,7 @@
{
"cell_type": "code",
"execution_count": 356,
- "id": "structural-librarian",
+ "id": "figured-lotus",
"metadata": {},
"outputs": [
{
@@ -1099,7 +1099,7 @@
},
{
"cell_type": "markdown",
- "id": "supposed-exclusion",
+ "id": "ethical-dylan",
"metadata": {},
"source": [
"Whew. Now at least we have everything we need for a full report of our results.\n",
@@ -1113,7 +1113,7 @@
},
{
"cell_type": "markdown",
- "id": "dimensional-advocate",
+ "id": "matched-involvement",
"metadata": {},
"source": [
"(ttestoneassumptions)=\n",
@@ -1129,7 +1129,7 @@
},
{
"cell_type": "markdown",
- "id": "italian-cookie",
+ "id": "mighty-tracy",
"metadata": {},
"source": [
"(studentttest)=\n",
@@ -1140,7 +1140,7 @@
},
{
"cell_type": "markdown",
- "id": "breeding-sitting",
+ "id": "ready-overview",
"metadata": {},
"source": [
"### The data\n",
@@ -1151,7 +1151,7 @@
{
"cell_type": "code",
"execution_count": 360,
- "id": "constant-powder",
+ "id": "atmospheric-ghana",
"metadata": {},
"outputs": [
{
@@ -1232,7 +1232,7 @@
},
{
"cell_type": "markdown",
- "id": "present-afghanistan",
+ "id": "cathedral-constitution",
"metadata": {},
"source": [
"As we can see, there's a single data frame with two variables, `grade` and `tutor`. The `grade` variable is a numeric vector, containing the grades for all $N = 33$ students taking Dr Harpo's class; the `tutor` variable is a factor that indicates who each student's tutor was. The first five observations in this data set are shown above, and below is a nice little table with some summary statistics:"
@@ -1241,7 +1241,7 @@
{
"cell_type": "code",
"execution_count": 437,
- "id": "strange-creation",
+ "id": "surgical-front",
"metadata": {
"tags": [
"hide-input"
@@ -1320,7 +1320,7 @@
},
{
"cell_type": "markdown",
- "id": "unnecessary-shannon",
+ "id": "impressive-pressing",
"metadata": {},
"source": [
"To give you a more detailed sense of what's going on here, I've plotted histograms showing the distribution of grades for both tutors {numref}`fig-harpohist`. Inspection of these histograms suggests that the students in Anastasia's class may be getting slightly better grades on average, though they also seem a little more variable."
@@ -1329,7 +1329,7 @@
{
"cell_type": "code",
"execution_count": 408,
- "id": "careful-climb",
+ "id": "cross-ribbon",
"metadata": {
"tags": [
"hide-input"
@@ -1371,7 +1371,7 @@
},
{
"cell_type": "markdown",
- "id": "graphic-optics",
+ "id": "dense-armenia",
"metadata": {},
"source": [
" ```{glue:figure} harpohist_fig\n",
@@ -1384,7 +1384,7 @@
},
{
"cell_type": "markdown",
- "id": "considerable-indicator",
+ "id": "mathematical-employer",
"metadata": {},
"source": [
"{numref}`fig-ttestci` is a simpler plot showing the means and corresponding confidence intervals for both groups of students."
@@ -1393,7 +1393,7 @@
{
"cell_type": "code",
"execution_count": 413,
- "id": "alert-drill",
+ "id": "nervous-grammar",
"metadata": {
"tags": [
"hide-input"
@@ -1430,7 +1430,7 @@
},
{
"cell_type": "markdown",
- "id": "molecular-crazy",
+ "id": "parental-detail",
"metadata": {},
"source": [
" ```{glue:figure} ttestci-fig\n",
@@ -1444,7 +1444,7 @@
},
{
"cell_type": "markdown",
- "id": "mineral-barbados",
+ "id": "sustainable-carrier",
"metadata": {},
"source": [
"### Introducing the test\n",
@@ -1464,7 +1464,7 @@
{
"cell_type": "code",
"execution_count": 436,
- "id": "serious-scotland",
+ "id": "exotic-singing",
"metadata": {
"tags": [
"hide-input"
@@ -1528,7 +1528,7 @@
},
{
"cell_type": "markdown",
- "id": "powerful-riding",
+ "id": "colonial-survey",
"metadata": {},
"source": [
" ```{glue:figure} ttesthyp_fig\n",
@@ -1541,7 +1541,7 @@
},
{
"cell_type": "markdown",
- "id": "undefined-newton",
+ "id": "considerable-arctic",
"metadata": {},
"source": [
"To construct a hypothesis test that handles this scenario, we start by noting that if the null hypothesis is true, then the difference between the population means is *exactly* zero, \n",
@@ -1563,7 +1563,7 @@
},
{
"cell_type": "markdown",
- "id": "creative-warrant",
+ "id": "civilian-brown",
"metadata": {},
"source": [
"### A \"pooled estimate\" of the standard deviation\n",
@@ -1594,7 +1594,7 @@
},
{
"cell_type": "markdown",
- "id": "informal-nicholas",
+ "id": "graduate-cabinet",
"metadata": {},
"source": [
"### The same pooled estimate, described differently\n",
@@ -1622,7 +1622,7 @@
},
{
"cell_type": "markdown",
- "id": "major-praise",
+ "id": "experienced-storm",
"metadata": {},
"source": [
"(indsamplesttest_formula)=\n",
@@ -1645,7 +1645,7 @@
},
{
"cell_type": "markdown",
- "id": "generic-tonight",
+ "id": "olive-strength",
"metadata": {},
"source": [
"### Doing the test in Python\n",
@@ -1658,7 +1658,7 @@
{
"cell_type": "code",
"execution_count": 454,
- "id": "local-delay",
+ "id": "regular-probe",
"metadata": {},
"outputs": [
{
@@ -1684,7 +1684,7 @@
},
{
"cell_type": "markdown",
- "id": "voluntary-rebecca",
+ "id": "disturbed-halifax",
"metadata": {},
"source": [
"This is fairly straightforward, and just as it was for the one-sample $t$-test, `scipy` does very little to format its results or give you any information over the bare minimum. You get a $t$-statistic and a $p$-value and that's that. Luckily, as was the case with the one-sampel $t$-test, getting the other elements we need to report our results isn't too bad. We will need the $t$-statistic, the $p$-value, the mean of each group, and the degrees of freedom. The first two we already have, and the last two are easy to get. As we discussed [above](indsamplesttest_formula), the degrees of freedom for an independent samples $t$-test is $N-2$, so..."
@@ -1693,7 +1693,7 @@
{
"cell_type": "code",
"execution_count": 465,
- "id": "sharing-benefit",
+ "id": "subsequent-galaxy",
"metadata": {},
"outputs": [
{
@@ -1717,7 +1717,7 @@
},
{
"cell_type": "markdown",
- "id": "cross-drove",
+ "id": "adjusted-handbook",
"metadata": {},
"source": [
"You probably noticed that in addition to telling `ttest_ind` which means I wanted to compare, I also added the argument `equal_var = True` to the command. This wasn't strictly necessary in this case, because by default this argument is set to `True`. But I made it explicit anyway, because we will be using this argument again later. By saying `equal_var = True`, what we're really doing is telling Python to use the *Student* independent samples $t$-test. More on this later."
@@ -1725,7 +1725,7 @@
},
{
"cell_type": "markdown",
- "id": "comprehensive-departure",
+ "id": "practical-stephen",
"metadata": {},
"source": [
"In any case, the difference between the two groups is significant (just barely), so we might write up the result using text like this:\n",
@@ -1735,7 +1735,7 @@
},
{
"cell_type": "markdown",
- "id": "accessible-conflict",
+ "id": "exposed-pizza",
"metadata": {},
"source": [
" \n",
@@ -1767,7 +1767,7 @@
},
{
"cell_type": "markdown",
- "id": "worldwide-absorption",
+ "id": "terminal-titanium",
"metadata": {},
"source": [
"(studentassumptions)= \n",
@@ -1783,7 +1783,7 @@
},
{
"cell_type": "markdown",
- "id": "impaired-cleveland",
+ "id": "moderate-flower",
"metadata": {},
"source": [
"(welchttest)=\n",
@@ -1817,7 +1817,7 @@
{
"cell_type": "code",
"execution_count": 530,
- "id": "diagnostic-thirty",
+ "id": "eight-garlic",
"metadata": {
"tags": [
"hide-input"
@@ -1888,7 +1888,7 @@
},
{
"cell_type": "markdown",
- "id": "great-georgia",
+ "id": "southwest-guide",
"metadata": {},
"source": [
" ```{glue:figure} ttesthyp2_fig\n",
@@ -1901,7 +1901,7 @@
},
{
"cell_type": "markdown",
- "id": "complete-circular",
+ "id": "durable-medicaid",
"metadata": {},
"source": [
"### Doing the test in Python\n",
@@ -1912,7 +1912,7 @@
{
"cell_type": "code",
"execution_count": 531,
- "id": "incorporated-thread",
+ "id": "vietnamese-hammer",
"metadata": {},
"outputs": [
{
@@ -1933,7 +1933,7 @@
},
{
"cell_type": "markdown",
- "id": "talented-director",
+ "id": "victorian-threat",
"metadata": {},
"source": [
"Not too difficult, right? Not surprisingly, the output has exactly the same format as it did last time too: a test statistic $t$, and a $p$-value. So that's all pretty easy. \n",
@@ -1948,7 +1948,7 @@
},
{
"cell_type": "markdown",
- "id": "level-pressing",
+ "id": "metallic-april",
"metadata": {},
"source": [
"(pairedsamplesttest)=\n",
@@ -1959,7 +1959,7 @@
},
{
"cell_type": "markdown",
- "id": "little-antique",
+ "id": "geographic-thread",
"metadata": {},
"source": [
"### The data\n",
@@ -1970,7 +1970,7 @@
{
"cell_type": "code",
"execution_count": 563,
- "id": "existing-nylon",
+ "id": "simplified-success",
"metadata": {},
"outputs": [],
"source": [
@@ -1980,7 +1980,7 @@
},
{
"cell_type": "markdown",
- "id": "adverse-failing",
+ "id": "alternative-canvas",
"metadata": {},
"source": [
"The data frame `chico` contains three variables: an `id` variable that identifies each student in the class, the `grade_test1` variable that records the student grade for the first test, and the `grade_test2` variable that has the grades for the second test. Here's the first five students:"
@@ -1989,7 +1989,7 @@
{
"cell_type": "code",
"execution_count": 534,
- "id": "hungarian-showcase",
+ "id": "cleared-funds",
"metadata": {},
"outputs": [
{
@@ -2073,7 +2073,7 @@
},
{
"cell_type": "markdown",
- "id": "rational-honey",
+ "id": "skilled-copying",
"metadata": {},
"source": [
"At a glance, it does seem like the class is a hard one (most grades are between 50\\% and 60\\%), but it does look like there's an improvement from the first test to the second one. If we take a quick look at the descriptive statistics"
@@ -2082,7 +2082,7 @@
{
"cell_type": "code",
"execution_count": 535,
- "id": "about-brazil",
+ "id": "technical-remove",
"metadata": {},
"outputs": [
{
@@ -2178,7 +2178,7 @@
},
{
"cell_type": "markdown",
- "id": "simple-cornwall",
+ "id": "rough-chess",
"metadata": {},
"source": [
"we see that this impression seems to be supported. Across all 20 students[^note12] the mean grade for the first test is 57\\%, but this rises to 58\\% for the second test. Although, given that the standard deviations are 6.6\\% and 6.4\\% respectively, it's starting to feel like maybe the improvement is just illusory; maybe just random variation. This impression is reinforced when you see the means and confidence intervals plotted in {numref}`pairedta` panel A. If we were to rely on this plot alone, we'd come to the same conclusion that we got from looking at the descriptive statistics that the `describe()` method produced. Looking at how wide those confidence intervals are, we'd be tempted to think that the apparent improvement in student performance is pure chance."
@@ -2187,7 +2187,7 @@
{
"cell_type": "code",
"execution_count": 595,
- "id": "revised-metadata",
+ "id": "liked-resort",
"metadata": {
"tags": [
"hide-input"
@@ -2237,7 +2237,7 @@
},
{
"cell_type": "markdown",
- "id": "disabled-charger",
+ "id": "scheduled-processor",
"metadata": {},
"source": [
"```{glue:figure} pairedta_fig\n",
@@ -2251,7 +2251,7 @@
},
{
"cell_type": "markdown",
- "id": "close-brother",
+ "id": "orange-meditation",
"metadata": {},
"source": [
"Nevertheless, this impression is wrong. To see why, take a look at the scatterplot of the grades for test 1 against the grades for test 2. shown in {numref}`fig-pairedta` panel B. \n",
@@ -2264,7 +2264,7 @@
{
"cell_type": "code",
"execution_count": 596,
- "id": "accessory-surface",
+ "id": "compressed-guard",
"metadata": {},
"outputs": [],
"source": [
@@ -2273,7 +2273,7 @@
},
{
"cell_type": "markdown",
- "id": "martial-gossip",
+ "id": "handy-blond",
"metadata": {},
"source": [
"Notice that I assigned the output to a variable called `df['improvement]`. That has the effect of creating a new column called `improvement` inside the `chico` data frame. Now that we've created and stored this `improvement` variable, we can draw a histogram showing the distribution of these improvement scores, shown in {numref}`fig-pairedta` panel C. \n",
@@ -2285,7 +2285,7 @@
{
"cell_type": "code",
"execution_count": 598,
- "id": "corresponding-synthetic",
+ "id": "consolidated-third",
"metadata": {},
"outputs": [
{
@@ -2310,7 +2310,7 @@
},
{
"cell_type": "markdown",
- "id": "annual-chorus",
+ "id": "executed-fiber",
"metadata": {},
"source": [
"we see that it is 95\\% certain that the true (population-wide) average improvement would lie between 0.95\\% and 1.86\\%. So you can see, qualitatively, what's going on: there is a real \"within student\" improvement (everyone improves by about 1\\%), but it is very small when set against the quite large \"between student\" differences (student grades vary by about 20\\% or so). "
@@ -2318,7 +2318,7 @@
},
{
"cell_type": "markdown",
- "id": "operating-resistance",
+ "id": "average-sentence",
"metadata": {},
"source": [
"### What is the paired samples $t$-test?\n",
@@ -2357,7 +2357,7 @@
},
{
"cell_type": "markdown",
- "id": "civic-society",
+ "id": "everyday-staff",
"metadata": {},
"source": [
"### Doing the test in Python \n",
@@ -2368,7 +2368,7 @@
{
"cell_type": "code",
"execution_count": 600,
- "id": "increasing-valve",
+ "id": "aggregate-template",
"metadata": {},
"outputs": [
{
@@ -2389,7 +2389,7 @@
},
{
"cell_type": "markdown",
- "id": "embedded-receiver",
+ "id": "turkish-phone",
"metadata": {},
"source": [
"However, suppose you're lazy and you don't want to go to all the effort of creating a new variable. Or perhaps you just want to keep the difference between one-sample and paired-samples tests clear in your head. In that case, `scipy` also has a built-in method for conducting paired $t$-tests called `ttest_rel` (the `_rel` part is for \"related\"). Using this method, we get:"
@@ -2398,7 +2398,7 @@
{
"cell_type": "code",
"execution_count": 599,
- "id": "minute-miniature",
+ "id": "little-raise",
"metadata": {},
"outputs": [
{
@@ -2420,7 +2420,7 @@
},
{
"cell_type": "markdown",
- "id": "ignored-division",
+ "id": "following-skill",
"metadata": {},
"source": [
"Either way, the result is exactly the same, which is strangely comforting, actually. Not only that, but the result confirms our intuition. There’s an average improvement of 1.4% from test 1 to test 2, and this is significantly different from 0 ($t$(19) = 6.48, $p$ < .001). In fact, $p$ is quite a bit less than one, since the $p$-value has been given in scientific notation. The exact $p$-value is $3.32^{-06}$, that is, $p$ = 0.0000032."
@@ -2428,7 +2428,7 @@
},
{
"cell_type": "markdown",
- "id": "focused-cross",
+ "id": "divided-doctrine",
"metadata": {},
"source": [
"## One sided tests\n",
@@ -2439,7 +2439,7 @@
{
"cell_type": "code",
"execution_count": 606,
- "id": "cleared-framing",
+ "id": "dressed-lecture",
"metadata": {},
"outputs": [
{
@@ -2464,7 +2464,7 @@
},
{
"cell_type": "markdown",
- "id": "fantastic-penny",
+ "id": "exclusive-abuse",
"metadata": {},
"source": [
"The $t$-statistics are exactly the same, which makes sense, if you think about it, because the calculation of the $t$ is based on the mean and standard deviation, and these do not change. The $p$-value, on the other hand, is lower for the one-sided test. The only thing that changes between the two tests is the _expectation_ that we bring to data. The way that the $p$-value is calculated depends on those expectations, and they are the reason for choosing one test over the other. It should go without saying, but maybe is worth saying anyway, that our reasons for choosing one test over the other should be theoretical, and not based on which test is more likely to give us the $p$-value we want!\n",
@@ -2476,7 +2476,7 @@
{
"cell_type": "code",
"execution_count": 613,
- "id": "urban-giving",
+ "id": "junior-vessel",
"metadata": {},
"outputs": [
{
@@ -2510,7 +2510,7 @@
},
{
"cell_type": "markdown",
- "id": "controversial-imperial",
+ "id": "center-wayne",
"metadata": {},
"source": [
"What about the paired samples $t$-test? Suppose we wanted to test the hypothesis that grades go *up* from test 1 to test 2 in Dr. Chico's class, and are not prepared to consider the idea that the grades go down. Again, we can use the `alternative` argument to specify the one-sided test, and it works the same way it does for the independent samples $t$-test. Since we are comparing test 1 to test 2 by substracting one from the other, it makes a difference whether we subract test 1 from test 2, or test 2 from test 1. So, to test the hypothesis that grades for test 2 are higher than test 2, we will need to enter the grades from test 2 first; otherwise we are testing the opposite hypothesis: "
@@ -2519,7 +2519,7 @@
{
"cell_type": "code",
"execution_count": 616,
- "id": "excessive-interaction",
+ "id": "cellular-soundtrack",
"metadata": {},
"outputs": [
{
@@ -2545,7 +2545,7 @@
},
{
"cell_type": "markdown",
- "id": "numerical-episode",
+ "id": "furnished-effect",
"metadata": {},
"source": [
"(cohensd)=\n",
@@ -2564,7 +2564,7 @@
},
{
"cell_type": "markdown",
- "id": "registered-expense",
+ "id": "changing-partnership",
"metadata": {},
"source": [
"(dinterpretation)=\n",
@@ -2574,7 +2574,7 @@
},
{
"cell_type": "markdown",
- "id": "german-piano",
+ "id": "agricultural-spare",
"metadata": {},
"source": [
"| d-value | rough interpretation |\n",
@@ -2586,7 +2586,7 @@
},
{
"cell_type": "markdown",
- "id": "blessed-silly",
+ "id": "animated-giving",
"metadata": {},
"source": [
"### Cohen's $d$ from one sample\n",
@@ -2601,7 +2601,7 @@
{
"cell_type": "code",
"execution_count": 624,
- "id": "statutory-magic",
+ "id": "documentary-melissa",
"metadata": {},
"outputs": [
{
@@ -2628,7 +2628,7 @@
},
{
"cell_type": "markdown",
- "id": "inclusive-effort",
+ "id": "conceptual-profession",
"metadata": {},
"source": [
"What does this effect size mean? Overall, then, the psychology students in Dr Zeppo's class are achieving grades (mean = 72.3\\%) that are about .5 standard deviations higher than the level that you'd expect (67.5\\%) if they were performing at the same level as other students. Judged against Cohen's rough guide, this is a moderate effect size."
@@ -2636,7 +2636,7 @@
},
{
"cell_type": "markdown",
- "id": "russian-packing",
+ "id": "indian-drove",
"metadata": {},
"source": [
"### Cohen's $d$ from a Student $t$ test\n",
@@ -2658,7 +2658,7 @@
},
{
"cell_type": "markdown",
- "id": "metallic-traveler",
+ "id": "efficient-lounge",
"metadata": {},
"source": [
"However, there are other possibilities, which I'll briefly describe. Firstly, you may have reason to want to use only one of the two groups as the basis for calculating the standard deviation. This approach (often called Glass' $\\Delta$) only makes most sense when you have good reason to treat one of the two groups as a purer reflection of \"natural variation\" than the other. This can happen if, for instance, one of the two groups is a control group. Secondly, recall that in the usual calculation of the pooled standard deviation we divide by $N-2$ to correct for the bias in the sample variance; in one version of Cohen's $d$ this correction is omitted. Instead, we divide by $N$. This version makes sense primarily when you're trying to calculate the effect size in the sample; rather than estimating an effect size in the population. Finally, there is a version based on @Hedges1985, who point out there is a small bias in the usual (pooled) estimation for Cohen's $d$. Thus they introduce a small correction, by multiplying the usual value of $d$ by $(N-3)/(N-2.25)$. \n",
@@ -2669,7 +2669,7 @@
{
"cell_type": "code",
"execution_count": 654,
- "id": "political-certificate",
+ "id": "norman-delivery",
"metadata": {},
"outputs": [
{
@@ -2719,7 +2719,7 @@
},
{
"cell_type": "markdown",
- "id": "acknowledged-george",
+ "id": "streaming-spelling",
"metadata": {},
"source": [
"### Cohen's $d$ from a Welch test\n",
@@ -2749,7 +2749,7 @@
{
"cell_type": "code",
"execution_count": 657,
- "id": "human-protocol",
+ "id": "better-wright",
"metadata": {},
"outputs": [
{
@@ -2784,7 +2784,7 @@
},
{
"cell_type": "markdown",
- "id": "portuguese-discussion",
+ "id": "young-scanner",
"metadata": {},
"source": [
"### Cohen's $d$ from a paired-samples test\n",
@@ -2802,7 +2802,7 @@
{
"cell_type": "code",
"execution_count": 663,
- "id": "mexican-tennessee",
+ "id": "exotic-adaptation",
"metadata": {},
"outputs": [
{
@@ -2831,7 +2831,7 @@
},
{
"cell_type": "markdown",
- "id": "streaming-enlargement",
+ "id": "hindu-subscription",
"metadata": {},
"source": [
"The only wrinkle is figuring out whether this is the measure you want or not. To the extent that you care about the practical consequences of your research, you often want to measure the effect size relative to the *original* variables, not the *difference* scores (e.g., the 1\\% improvement in Dr Chico's class is pretty small when measured against the amount of between-student variation in grades), in which case you use the same versions of Cohen's $d$ that you would use for a Student or Welch test. For instance, when we do that for Dr Chico's class, "
@@ -2840,7 +2840,7 @@
{
"cell_type": "code",
"execution_count": 666,
- "id": "textile-consequence",
+ "id": "jewish-costs",
"metadata": {},
"outputs": [
{
@@ -2874,7 +2874,7 @@
},
{
"cell_type": "markdown",
- "id": "emotional-webster",
+ "id": "honest-hammer",
"metadata": {},
"source": [
"what we see is that the overall effect size is quite small, when assessed on the scale of the original variables."
@@ -2882,7 +2882,7 @@
},
{
"cell_type": "markdown",
- "id": "automotive-stephen",
+ "id": "underlying-postage",
"metadata": {},
"source": [
"(shapiro)=\n",
@@ -2893,7 +2893,7 @@
},
{
"cell_type": "markdown",
- "id": "double-cooperation",
+ "id": "undefined-secretariat",
"metadata": {},
"source": [
"### QQ plots\n",
@@ -2904,7 +2904,7 @@
{
"cell_type": "code",
"execution_count": 706,
- "id": "coordinated-brain",
+ "id": "different-spanish",
"metadata": {
"tags": [
"hide-input"
@@ -2954,7 +2954,7 @@
},
{
"cell_type": "markdown",
- "id": "collected-clearing",
+ "id": "square-capitol",
"metadata": {},
"source": [
" ```{glue:figure} qq_fig\n",
@@ -2967,7 +2967,7 @@
},
{
"cell_type": "markdown",
- "id": "important-beginning",
+ "id": "surrounded-hanging",
"metadata": {},
"source": [
"And the results are shown in {numref}(`fig-qq`), above.\n",
@@ -2978,7 +2978,7 @@
{
"cell_type": "code",
"execution_count": 704,
- "id": "authorized-opinion",
+ "id": "outstanding-banking",
"metadata": {
"tags": [
"hide-input"
@@ -3011,7 +3011,7 @@
},
{
"cell_type": "markdown",
- "id": "associate-inspection",
+ "id": "recent-reverse",
"metadata": {},
"source": [
" ```{glue:figure} qqskew_fig\n",
@@ -3026,7 +3026,7 @@
{
"cell_type": "code",
"execution_count": 707,
- "id": "parental-convergence",
+ "id": "growing-chain",
"metadata": {
"tags": [
"hide-input"
@@ -3071,7 +3071,7 @@
},
{
"cell_type": "markdown",
- "id": "collect-donor",
+ "id": "perceived-ozone",
"metadata": {},
"source": [
" ```{glue:figure} qqheavy_fig\n",
@@ -3085,7 +3085,7 @@
},
{
"cell_type": "markdown",
- "id": "nutritional-arrival",
+ "id": "isolated-metallic",
"metadata": {},
"source": [
"### Shapiro-Wilk tests\n",
@@ -3103,7 +3103,7 @@
},
{
"cell_type": "markdown",
- "id": "surprising-freeware",
+ "id": "orange-platform",
"metadata": {},
"source": [
"```{figure} ../img/ttest2/shapirowilkdist.png\n",
@@ -3118,7 +3118,7 @@
},
{
"cell_type": "markdown",
- "id": "received-working",
+ "id": "fifth-english",
"metadata": {},
"source": [
"To run the test in Python, we use the `scipy.stats.shapiro` method. It has only a single argument `x`, which is a numeric vector containing the data whose normality needs to be tested. For example, when we apply this function to our `normal_data`, we get the following:"
@@ -3127,7 +3127,7 @@
{
"cell_type": "code",
"execution_count": 708,
- "id": "printable-sucking",
+ "id": "analyzed-dream",
"metadata": {},
"outputs": [
{
@@ -3149,7 +3149,7 @@
},
{
"cell_type": "markdown",
- "id": "international-electronics",
+ "id": "searching-creator",
"metadata": {},
"source": [
"So, not surprisingly, we have no evidence that these data depart from normality. When reporting the results for a Shapiro-Wilk test, you should (as usual) make sure to include the test statistic $W$ and the $p$ value, though given that the sampling distribution depends so heavily on $N$ it would probably be a politeness to include $N$ as well."
@@ -3157,7 +3157,7 @@
},
{
"cell_type": "markdown",
- "id": "terminal-water",
+ "id": "boxed-model",
"metadata": {},
"source": [
"(wilcox)=\n",
@@ -3170,7 +3170,7 @@
},
{
"cell_type": "markdown",
- "id": "brutal-depth",
+ "id": "emerging-intake",
"metadata": {},
"source": [
"### Two sample Wilcoxon test\n",
@@ -3180,39 +3180,87 @@
},
{
"cell_type": "code",
- "execution_count": 725,
- "id": "indian-wilson",
+ "execution_count": 728,
+ "id": "difficult-virus",
"metadata": {},
"outputs": [
{
- "ename": "ParserError",
- "evalue": "Error tokenizing data. C error: Expected 1 fields in line 118, saw 2\n",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mParserError\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"https://github.com/ethanweed/pythonbook/blob/main/Data/awesome2.csv\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m/opt/anaconda3/envs/pythonbook/lib/python3.9/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mread_csv\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)\u001b[0m\n\u001b[1;32m 608\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkwds_defaults\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 609\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 610\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 611\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 612\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m/opt/anaconda3/envs/pythonbook/lib/python3.9/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 466\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 467\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mparser\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 468\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mparser\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 469\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 470\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m/opt/anaconda3/envs/pythonbook/lib/python3.9/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mread\u001b[0;34m(self, nrows)\u001b[0m\n\u001b[1;32m 1055\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnrows\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1056\u001b[0m \u001b[0mnrows\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mvalidate_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"nrows\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1057\u001b[0;31m \u001b[0mindex\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcol_dict\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1058\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1059\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mindex\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m/opt/anaconda3/envs/pythonbook/lib/python3.9/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mread\u001b[0;34m(self, nrows)\u001b[0m\n\u001b[1;32m 2059\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnrows\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2060\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2061\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reader\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2062\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mStopIteration\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2063\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_first_chunk\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader.read\u001b[0;34m()\u001b[0m\n",
- "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._read_low_memory\u001b[0;34m()\u001b[0m\n",
- "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._read_rows\u001b[0;34m()\u001b[0m\n",
- "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._tokenize_rows\u001b[0;34m()\u001b[0m\n",
- "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.raise_parser_error\u001b[0;34m()\u001b[0m\n",
- "\u001b[0;31mParserError\u001b[0m: Error tokenizing data. C error: Expected 1 fields in line 118, saw 2\n"
- ]
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " score_A | \n",
+ " score_B | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 6.4 | \n",
+ " 14.5 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 10.7 | \n",
+ " 10.4 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 11.9 | \n",
+ " 12.9 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 7.3 | \n",
+ " 11.7 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 10.0 | \n",
+ " 13.0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " score_A score_B\n",
+ "0 6.4 14.5\n",
+ "1 10.7 10.4\n",
+ "2 11.9 12.9\n",
+ "3 7.3 11.7\n",
+ "4 10.0 13.0"
+ ]
+ },
+ "execution_count": 728,
+ "metadata": {},
+ "output_type": "execute_result"
}
],
"source": [
- "df = pd.read_csv(\"https://github.com/ethanweed/pythonbook/blob/main/Data/awesome2.csv\")\n",
+ "df = pd.read_csv(\"https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/awesome2.csv\")\n",
"df"
]
},
{
"cell_type": "markdown",
- "id": "removed-isolation",
+ "id": "allied-dancing",
"metadata": {},
"source": [
"As long as there are no ties (i.e., people with the exact same awesomeness score), then the test that we want to do is surprisingly simple. All we have to do is construct a table that compares every observation in group $A$ against every observation in group $B$. Whenever the group $A$ datum is larger, we place a check mark in the table:"
@@ -3220,7 +3268,7 @@
},
{
"cell_type": "markdown",
- "id": "sealed-round",
+ "id": "aerial-questionnaire",
"metadata": {},
"source": [
"\n",
@@ -3236,7 +3284,7 @@
},
{
"cell_type": "markdown",
- "id": "cathedral-equity",
+ "id": "standard-orbit",
"metadata": {},
"source": [
"We then count up the number of checkmarks. This is our test statistic, $W$.[^note15] The actual sampling distribution for $W$ is somewhat complicated, and I'll skip the details. For our purposes, it's sufficient to note that the interpretation of $W$ is qualitatively the same as the interpretation of $t$ or $z$. That is, if we want a two-sided test, then we reject the null hypothesis when $W$ is very large or very small; but if we have a directional (i.e., one-sided) hypothesis, then we only use one or the other. "
@@ -3244,8 +3292,8 @@
},
{
"cell_type": "code",
- "execution_count": 721,
- "id": "front-fifteen",
+ "execution_count": 737,
+ "id": "musical-danish",
"metadata": {},
"outputs": [
{
@@ -3254,7 +3302,7 @@
"(1.0, 0.125)"
]
},
- "execution_count": 721,
+ "execution_count": 737,
"metadata": {},
"output_type": "execute_result"
}
@@ -3262,40 +3310,26 @@
"source": [
"from scipy.stats import wilcoxon\n",
" \n",
- "A = df.loc[df['group'] == 'A']['scores']\n",
- "B = df.loc[df['group'] == 'B']['scores']\n",
- "\n",
- "w, p = wilcoxon(A, B)\n",
- "w, p"
+ "w,p = wilcoxon(df['score_A'], df['score_B'], )\n",
+ "w,p"
]
},
{
"cell_type": "markdown",
- "id": "olive-deficit",
+ "id": "operating-tourism",
"metadata": {},
"source": [
- "Just like we saw with the `t.test()` function, there is an `alternative` argument that you can use to switch between two-sided tests and one-sided tests, plus a few other arguments that we don't need to worry too much about at an introductory level. \n",
- "Similarly, the `wilcox.test()` function allows you to use the `x` and `y` arguments when you have your data stored separately for each group. For instance, suppose we use the data from the `awesome2.Rdata` file:\n",
- "```{r}\n",
- "load( file.path(projecthome, \"data/awesome2.Rdata\" ))\n",
- "score.A\n",
- "score.B\n",
- "```\n",
- "When your data are organised like this, then you would use a command like this: \n",
- "```{r}\n",
- "wilcox.test( x = score.A, y = score.B )\n",
- "```\n",
- "The output that R produces is pretty much the same as last time.\n",
+ "\n",
"\n",
"\n",
"### One sample Wilcoxon test\n",
"\n",
"\n",
"What about the **_one sample Wilcoxon test_** (or equivalently, the paired samples Wilcoxon test)? Suppose I'm interested in finding out whether taking a statistics class has any effect on the happiness of students. Here's my data:\n",
- "```{r}\n",
- "load( file.path(projecthome, \"data/happy.Rdata\" ))\n",
- "print( happiness )\n",
- "```\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
"What I've measured here is the happiness of each student `before` taking the class and `after` taking the class; the `change` score is the difference between the two. Just like we saw with the $t$-test, there's no fundamental difference between doing a paired-samples test using `before` and `after`, versus doing a one-sample test using the `change` scores. As before, the simplest way to think about the test is to construct a tabulation. The way to do it this time is to take those change scores that are positive valued, and tabulate them against all the complete sample. What you end up with is a table that looks like this:\n",
"\n",
"```{r echo=FALSE}\n",
diff --git a/Data/happiness.csv b/Data/happiness.csv
new file mode 100644
index 00000000..b4f138d4
--- /dev/null
+++ b/Data/happiness.csv
@@ -0,0 +1,11 @@
+before,after,change
+30,6,-24
+43,29,-14
+21,11,-10
+24,31,7
+23,17,-6
+40,2,-38
+29,31,2
+56,21,-35
+38,8,-30
+16,21,5