We're going to learn many terms and concepts this semester. This page catalogs many of the important ones, with pointers to the resources in which they are introduced.
:::{glossary}
ablation study A study in which we turn off different components of a complex model to see how much each one contributes to the overall model's performance.
Introduced in {video}`week11:ablation`.
aggregate
A function that computes a single value from a series (or matrix) of values.
Often used to compute a {term}`statistic`.
Introduced in {video}`week2:group-aggregate`.
aleatoric uncertainty
Uncertainty that arises due to inherent randomness, such that further information will not make us more certain.
Contrast {term}`epistemic uncertainty`.
arithmetic mean
The most common type of {term}`mean`, computed from a sequence of observations as $\bar{x} = \frac{1}{n} \sum_i x_i$.
Bayesianism A school of thought for statistical inference and the interpretation of probability that is concerned with using probability to quantify uncertainty or coherent states of belief. In statistical inference, this results in methods that quantify knowledge with probability distributions, and update those distributions based on the results of an experiment or data analysis.
Not to be confused with {term}`Bayes' Theorem`, which is a fundamental building block of Bayesian inference but has many other uses as well.
Bayes' theorem
A theorem or identity in probability theory that allows us to reverse a {term}`conditional probability`:
$$\P[B|A] = \frac{\P[A|B] \P[B]}{\P[A]}$$
Statisticians of all schools of thought make use of Bayes' theorem — all it does is relate $\P[A|B]$ to $\P[B|A]$, allowing us to (with additional information) reverse a conditional probability.
Introduced in {video}`week4:joint-conditional`.
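As a minimal worked sketch of reversing a conditional probability, the snippet below applies the identity above to made-up numbers. The three input probabilities are hypothetical values chosen only for illustration; $\P[A]$ is obtained from the law of total probability.

```python
# Hypothetical inputs, for illustration only.
p_b = 0.01               # P[B]: prior probability of B
p_a_given_b = 0.95       # P[A|B]
p_a_given_not_b = 0.05   # P[A|B^c]

# P[A] via the law of total probability.
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes' theorem: P[B|A] = P[A|B] P[B] / P[A]
p_b_given_a = p_a_given_b * p_b / p_a
print(f"P[B|A] = {p_b_given_a:.3f}")
```

Even though $\P[A|B]$ is high here, $\P[B|A]$ stays fairly low because $B$ is rare.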
bootstrap A technique for estimating sampling distributions by repeatedly resampling the available sample with replacement.
Introduced in {video}`week4:bootstrap`.
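A minimal sketch of the idea, assuming NumPy and synthetic data: resample the observed sample with replacement many times, compute the mean of each resample, and use the spread of those means as an estimate of the sampling distribution. The sample, seed, and number of resamples are arbitrary choices for illustration, and the percentile interval is just one common way to summarize the result.

```python
import numpy as np

rng = np.random.default_rng(20240101)
# Synthetic "observed" sample (an assumption for this sketch).
sample = rng.normal(loc=50, scale=10, size=200)

n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample the available sample with replacement, same size.
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means[i] = resample.mean()

# Percentile interval from the bootstrap distribution of the mean.
lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(f"sample mean {sample.mean():.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```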
central limit theorem
The theorem that describes the sampling distribution of the sample mean.
If we take a random sample
classification
A {term}`supervised learning` problem where the goal is to predict a discrete class for an instance.
This is often binary classification, where instances are categorized into one of two classes.
This is the major topic of {module}`week10`.
conditional probability
The conditional probability
Introduced in {video}`week4:joint-conditional` and discussed in [Notes on Probability](prob-conditional).
confidence interval
An interval used to estimate the precision of an estimate.
A 95% confidence interval is an interval computed from a procedure (including both taking a sample and computing a statistic from that sample) that, when repeated, will return an interval containing the true parameter value 95% of the time.
Discussed in {video}`week4:confidence`, {reading}`week4:confidence-in-confidence`, and Handbook section 1.3.5.2.
A confidence interval is **not** a probabilistic statement about either the population mean $\mu$ or the sample mean $\bar{x}$.
correlation The extent to which two variables change with each other. If one variable usually increases when the other one increases, the variables are correlated; if one decreases when the other increases, they are anticorrelated.
Correlation is measured with the correlation coefficient:
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2}\sqrt{\sum(y_i - \bar{y})^2}}$$
This is equivalent to the **{term}`covariance`** scaled by the {term}`standard deviations <standard deviation>` of the variables:
$$\Cor(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
Defined in {video}`week6:correlation` and [Notes on Probability](prob-variance). Used extensively in [Assignment 4](a4-covariance).
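The two formulas above can be checked against each other numerically. This is a sketch on synthetic data (the variables `x` and `y` are made up), using population-style (divide by $n$) covariance and standard deviations so the scaling cancels.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)  # y depends on x, plus noise

dx = x - x.mean()
dy = y - y.mean()

# Correlation coefficient, directly from the sums.
r = np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

# Covariance scaled by the standard deviations (both dividing by n).
cov = np.mean(dx * dy)
r_from_cov = cov / (x.std() * y.std())

print(r, r_from_cov, np.corrcoef(x, y)[0, 1])
```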
covariance A non-normalized measure of the extent to which two variables change with each other:
$$\Cov(X, Y) = \E[(X - \E[X]) (Y - \E[Y])]$$
Defined in {video}`week6:correlation` and [Notes on Probability](prob-variance). Used extensively in [Assignment 4](a4-covariance).
cumulative distribution function
A function describing a distribution by defining the fraction of elements that are less than a particular value (
Discussed in [Notes on Probability](random-variables). See {term}`empirical CDF`.
degrees of freedom
The number of observations in a series that can independently vary to affect a calculation.
This is usually the number of observations, minus the number of intermediate statistics.
For example, the degrees of freedom for the sample standard deviation for $n$ observations is $n - 1$, since the mean $\bar{x}$ is computed from the same observations:
$$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$$
Introduced in {video}`week5:t-test`.
disaggregation When we take something that is usually aggregated over the total population (e.g. the completion rate for students at a school) and instead aggregate it over subsets of the population (e.g. computing a completion rate for each racial group). Practiced in Assignment 6.
elementary event In probability theory, an individual distinct outcome of a process we are modeling as random.
Introduced in {video}`week4:probability` and [Notes on Probability](resources/probability.md).
embedding
As a noun, a vector-space representation of a data point or instance.
This is often a lower-dimensional representation produced through some form of matrix decomposition such as SVD.
Introduced in {module}`week13`.
As a verb, to convert an instance to such a representation.
empirical CDF
A {term}`cumulative distribution function` computed from data.
Introduced in {video}`week2:distributions`.
encoding
How we record a piece of data (especially an observation of a {term}`variable`) in the computer system.
Introduced in {video}`week2:encodings`.
entropy A measure of the “uninformativeness” or uncertainty represented by a probability distribution. For a discrete distribution, it is computed as:
$$H(X) = - \sum_x \P[x] \log_2 \P[x]$$
The entropy is the expected number of bits required to record a draw from the distribution (or a message resolving the uncertainty)
using an efficient {term}`encoding`, assuming the recipient knows the distribution and the encoding.
Introduced in {video}`week13:entropy`.
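As a small sketch of the formula above, assuming NumPy, the helper below computes the entropy in bits of a discrete distribution given as a list of probabilities. The function name `entropy` is a choice made for this example; zero-probability outcomes are dropped, following the usual convention that they contribute nothing to the sum.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution with probabilities p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # outcomes with zero probability contribute nothing
    return -np.sum(p * np.log2(p))

fair_coin = entropy([0.5, 0.5])        # two equally likely outcomes
four_way = entropy([0.25] * 4)         # four equally likely outcomes
biased_coin = entropy([0.9, 0.1])      # less uncertain than a fair coin
print(fair_coin, four_way, biased_coin)
```

A fair coin needs one bit per draw and a uniform four-way choice needs two; the biased coin needs less than one bit on average.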
environment variable A string variable associated with a process by the operating system. Often used for configuring the behavior of software, such as the number of threads to use in parallel computation. Child processes inherit their parents' environment variables.
Environment variables for the current process can be accessed and set in Python via the dictionary `os.environ`.
In the Unix shell, set an environment variable with:
`export MY_VAR="contents"`
In PowerShell, set it with:
`$env:MY_VAR="contents"`
Set an environment variable *before* running commands that need to be governed by it.
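A brief sketch of the Python side, using only the standard library: `os.environ` behaves like a dictionary, and a child process started afterwards inherits the value. The variable names `MY_VAR` and `SOME_UNSET_VAR_FOR_DEMO` are placeholders for this example.

```python
import os
import subprocess
import sys

# Set a variable for the current process (and any children started later).
os.environ["MY_VAR"] = "contents"

# Read it back; .get() allows a fallback for variables that are not set.
value = os.environ.get("MY_VAR")
missing = os.environ.get("SOME_UNSET_VAR_FOR_DEMO", "default")

# A child Python process inherits the parent's environment variables.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['MY_VAR'])"],
    capture_output=True, text=True, check=True,
)
child_value = child.stdout.strip()
print(value, missing, child_value)
```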
epistemic uncertainty
Uncertainty that arises due to incomplete knowledge about a process or future outcomes.
Contrast {term}`aleatoric uncertainty`.
Euclidean norm
See {term}`L₂ Norm`.
event
In probability theory, an outcome for which we want to estimate the probability.
Formally, given a set
Introduced in {video}`week4:probability` and [Notes on Probability](resources/probability.md).
estimand An unknown quantity that we try to estimate. See {term}`estimator`.
estimate
*n.* A value computed to approximate the value of some estimand. See {term}`estimator`.
*v.* The process of computing an estimate for an estimand.
estimator
A computation (or computed value) that we use to try to estimate an unknown value.
Formally, an estimator is a computation to produce an estimate of an estimand.
The sample mean
Introduced in {video}`week4:introduction`.
expected value
The mean of a {term}`random variable`.
Discussed in {video}`week4:continuous` and [Notes on Probability](resources/probability.md).
frequentism A school of thought for statistical inference and the interpretation of probability that is concerned with probabilities as descriptions of the long-run behavior of a random process: how frequent would various outcomes be if the process were repeated infinitely many times? In statistical inference, this results in methods that are characterized by their behavior if a sampling procedure or experiment were repeated, such as confidence intervals (defined in terms of the behavior of calculating them over multiple samples) and p-values (the probability that a random sample would produce a statistic at least as large as the observed statistic if the sampling procedure were repeated).
geometric mean
A measure of central tendency where sums are replaced by products. It is less sensitive to large outliers than the {term}`arithmetic mean` (the usual kind of mean). It is computed by:
$$\sqrt[n]{\prod_i x_i}$$
Or alternatively (so long as $\forall i. x_i > 0$):
$$e^{\frac{1}{n}\sum_i \operatorname{log}(x_i)}$$
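The two forms can be compared numerically. This sketch assumes NumPy and a small made-up array of positive values; the log form is usually preferred in practice because a long product can overflow or underflow.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])  # illustrative positive values

# n-th root of the product.
gm_product = np.prod(x) ** (1 / len(x))

# Exponential of the mean of the logs.
gm_logs = np.exp(np.mean(np.log(x)))

print(gm_product, gm_logs, x.mean())
```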
HARKing
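“Hypothesizing After Results are Known”, a statistical error where we formulate the hypotheses to test after looking at the data. A {term}`null hypothesis significance test` computes the probability of a result under a hypothesis fixed in advance, so that probability is not meaningful when the hypothesis was chosen from the same data. Discussed in {video}`week5:hypotest`.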
heteroskedasticity
Having unequal {term}`variance`. The opposite of {term}`homoskedasticity`.
homoskedasticity
Having the same {term}`variance`. The opposite of {term}`heteroskedasticity`.
hyperparameter A value that controls a model's training or prediction behavior that is not learned from the data. Examples include learning rates, iteration counts, and regularization terms. These hyperparameters usually control one of three things:
- A configurable aspect of the model's *structure*, such as the number of dimensions in a {term}`dimensionality reduction`.
- A configurable aspect of the model's *objective function*, such as the regularization strength.
- A configurable aspect of the model's *optimization process*, such as the number of iterations to run for an {term}`iterative method`.
In programming, we would usually call these “parameters”, but that term is taken by the statistical or machine learning notion of a
{term}`parameter`, so we call these “hyperparameters”.
inference
As we primarily use it in this class, inference is the act of learning from the data; in particular, when we are trying to learn something about the world or the data generating process from the data we observe. It contrasts with {term}`prediction`, discussed in {video}`week8:pred-inf` and at length in {module}`week4`.
In machine learning deployment, inference is often used to refer to using the model to score or classify new instances at runtime, as opposed to the training stage of the model.
Inference can also be used to refer to learning the model parameters itself, but we won't be using it this way to avoid confusion.
instance One entity of the data for a modeling or prediction problem. Typically one row of the training or testing data; each row is an observation of an instance. In general, however, it is one entity about which we are trying to learn or predict, such as one transaction.
iterative method A computational method that works by computing an initial solution (or guess) and iteratively refining it, usually until some stopping condition is met (often the number of iterations, or a convergence criterion such as the change from one iteration to the next dropping below a threshold).
{py:func}`scipy.optimize.minimize` as demonstrated in {video}`week9:optimizing-loss` is an example of an iterative method.
joint probability
The joint probability
Introduced in {video}`week4:joint-conditional` and [Notes on Probability](resources/probability.md).
L₁ Norm A measure of the magnitude of a vector, sometimes called the Manhattan distance. It is the sum of the absolute values of the elements in the vector:
$$\| \mathbf{x} \|_1 = \sum_i |x_i|$$
L₂ Norm A measure of the magnitude of a vector, also called the Euclidean norm or Euclidean length. It is the square root of the sum of squares of the elements in the vector:
$$\| \mathbf{x} \|_2 = \sqrt{\sum_i x_i^2}$$
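A short sketch of both norms, assuming NumPy: each is computed directly from its formula and compared with `numpy.linalg.norm`. The example vector is arbitrary.

```python
import numpy as np

x = np.array([3.0, -4.0])

# L1 norm: sum of absolute values.
l1 = np.sum(np.abs(x))

# L2 norm: square root of the sum of squares.
l2 = np.sqrt(np.sum(x ** 2))

# The same quantities from NumPy's built-in norm function.
l1_np = np.linalg.norm(x, ord=1)
l2_np = np.linalg.norm(x)  # ord=2 is the default for vectors
print(l1, l2)
```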
label
An observed outcome for an {term}`instance`, used for supervised learning. Sometimes called a {term}`supervision signal`.
leakage When your predictive model benefits from information that would not be available when the model is in actual use. Setting aside test data until the model is ready for final evaluation helps reduce leakage.
linear model
A model of the form
Linear models are introduced in {module}`week8`.
logistic function
A sigmoid function that maps unbounded real values to the range $(0, 1)$:
$$\mathrm{logistic}(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$$
The logistic function is the inverse of the {term}`logit function`.
Logistic regressions are introduced in {module}`week10`.
logit function
The inverse of the {term}`logistic function`:
$$\mathrm{logit}(x) = \mathrm{logistic}^{-1}(x) = \operatorname{log} \frac{x}{1-x} = \operatorname{log} x - \operatorname{log} (1-x)$$
Applying *logit* to a probability yields the {term}`log odds`.
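The inverse relationship can be sketched in a few lines, assuming NumPy. The function names `logistic` and `logit` simply mirror the formulas in this glossary; the probabilities are arbitrary examples.

```python
import numpy as np

def logistic(x):
    """Map a real value to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-x))

def logit(p):
    """Map a probability in (0, 1) to its log odds."""
    return np.log(p) - np.log(1 - p)

p = np.array([0.1, 0.5, 0.9])
log_odds = logit(p)             # negative, zero, positive
recovered = logistic(log_odds)  # round-trips back to p
print(log_odds, recovered)
```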
log odds
The logarithm of the {term}`odds`. Introduced in {video}`week10:logistic`.
majority-class classifier A classifier that classifies every data point with the most common class from the training data. If 72% of the training data is in class A, the majority-class classifier will classify every test point as A, no matter what its input feature values are.
Described in {video}`week10:baselines`.
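A minimal sketch of this baseline using only the standard library. The labels reproduce the 72% example above, and the `predict` helper and test features are hypothetical names and values for illustration.

```python
from collections import Counter

# Training labels: 72% class A, 28% class B.
train_labels = ["A"] * 72 + ["B"] * 28

# "Training" just finds the most common class.
majority_class = Counter(train_labels).most_common(1)[0][0]

def predict(features):
    """Predict the majority class for every row, ignoring the features."""
    return [majority_class for _ in features]

test_features = [[1.0, 2.0], [0.3, 9.1], [5.5, 0.0]]
predictions = predict(test_features)
print(predictions)
```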
marginal probability
The probability of a single event, or distribution of a single dimension,
Described in {video}`week4:joint-conditional` and [Notes on Probability](resources/probability.md).
matrix A two-dimensional array of numbers. Alternatively, a linear map between vector spaces.
matrix decomposition A decomposition of a matrix into other matrices, such that multiplying the decomposition back together yields the original matrix or an approximation thereof. An example is the singular value decomposition (SVD):
$$M = P \Sigma Q^T$$
where $P \in \Reals^{m \times k}$ and $Q \in \Reals^{n \times k}$ have orthonormal columns, and $\Sigma \in \Reals^{k \times k}$ is diagonal.
Introduced in {video}`week13:decomp`.
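As a sketch of the SVD above, assuming NumPy and a random matrix: `numpy.linalg.svd` returns $P$, the diagonal of $\Sigma$ as a vector, and $Q^T$. Multiplying all factors back together recovers the matrix, and keeping only the first few singular values gives a lower-rank approximation. The matrix size and the truncation rank `k = 2` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.normal(size=(6, 4))

# Reduced SVD: P is 6x4, sigma has 4 values, Qt is 4x4.
P, sigma, Qt = np.linalg.svd(M, full_matrices=False)

# Multiplying the factors back together recovers M.
M_rebuilt = P @ np.diag(sigma) @ Qt

# Keeping only the k largest singular values gives an approximation.
k = 2
M_approx = P[:, :k] @ np.diag(sigma[:k]) @ Qt[:k, :]
print(np.abs(M - M_rebuilt).max(), np.abs(M - M_approx).max())
```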
mean
A measure of central tendency; the expected value of a random variable. Without any further specifier, such as geometric or harmonic, the mean is taken to refer to the arithmetic mean. The sample mean is:
$$\bar{x} = \frac{1}{n} \sum_i x_i$$
The mean of a vector or data series can be computed with {py:func}`numpy.mean` or {py:meth}`pandas.Series.mean`.
Introduced in {video}`week2:descriptive-statistics`.
naïve Bayes
A classification technique that uses Bayes' theorem to classify instances given (counts of) discrete features. Given a sequence of tokens
$$\P[Y=y|T] \propto \P[T|Y=y] \P[Y=y]$$
The "naïve" term comes from the simplifying assumption that tokens are conditionally independent of each other given the class, so that $\P[T|Y=y]$ can be
computed from $\P[t|Y=y]$:
$$\P[T|Y=y] = \prod_{t \in T} \P[t | Y=y]$$
Naïve Bayes is a good baseline model for many text classification tasks.
It is implemented (for arbitrarily many classes) by {py:class}`sklearn.naive_bayes.MultinomialNB`, and introduced in {video}`week12:classifying-text`.
null hypothesis
A formalization of the idea of “no effect”, used for {term}`null hypothesis significance testing <null hypothesis significance test>` and typically denoted $H_0$; see also p-value.
null hypothesis significance test
A significance test that assesses whether the data provide evidence to reject the {term}`null hypothesis` by computing a p-value, the probability of seeing an effect at least as large as the one observed if the null hypothesis is true, and rejecting the null hypothesis if this probability is sufficiently small.
objective function A function describing a model's performance that is used as the goal for learning its parameters. This can be a loss function (where the goal is to minimize it) or a utility function (which should be maximized).
Defined in {video}`week11:eval-intro`, and introduced in {video}`week9:optimizing-loss`.
operationalization The mapping of a goal or question to a specific, measurable quantity (or measurement procedure). When we operationalize a question, we translate it into the precise computations and measurements we will use to attempt to answer it.
Introduced in {video}`week1:asking-questions`.
odds An alternative way of framing probability, as the ratio of the probability that an event happens to the probability that it does not:
$$\Odds(A) = \frac{\P[A]}{\P[A^c]}$$
The {term}`log odds` is a particularly convenient way of working with odds, and is $\log \P[A] - \log (1 - \P[A])$.
See the [{{mnote}} probability notes](prob-odds).
odds ratio The ratio of the odds of two different outcomes.
$$\operatorname{OR}(A, B) = \frac{\Odds(A)}{\Odds(B)}$$
See the [{{mnote}} probability notes](resources/probability.md#odds).
overfitting When a model learns too much from its training data, so it cannot do an effective job of predicting future unseen data.
Introduced in {video}`week9:overfitting`.
parameter In inferential statistics: a “true” value in the population, such as the mean flipper length of Chinstrap penguins. The goal of inferential statistics is often to estimate parameters, because we typically do not have direct access to them.
Introduced in {video}`week4:sampling`.
In *model fitting*: a variable in a statistical or machine learning model whose value is learned from the data.
Contrast {term}`hyperparameter`, a variable that controls the model or the model-fitting process but is not learned from the data.
population The complete set of entities we want to study. This is not only all entities that do exist, but under some philosophies, all entities that could exist. For example, the set of all possible adult Chinstrap penguins would be the population.
Discussed in more detail in {video}`week4:sampling`.
prediction
Using a model to estimate or predict a score or label from explanatory variables for instances that were not seen during training.
Contrasts with {term}`inference` as one of the major goals of modeling, discussed in {video}`week8:pred-inf`.
probability mass The amount of probability on a particular event. Discussed in [Notes on Probability](resources/probability.md).
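p-value
The probability of seeing an effect at least as large as the one observed if the {term}`null hypothesis` is true.
Typically the null hypothesis is an appropriate formalization of “nothing interesting”, so the *p*-value is the probability of seeing an effect as large as the one observed if there is no true effect to observe.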
Discussed in {video}`week5:hypotest`.
random variable
A variable that takes on random values, usually as the result of a random process or because we are using randomness and probability to model uncertainty about the variable's actual value in any particular case. For our purposes, random variables may be discrete (integer-valued) or continuous (real-valued), but are always numeric. We denote random variables with capital letters (
The probability distribution of a continuous random variable is defined by a distribution function $F_X(x) = \P[X < x]$.
Two common operations on a random variable are to take its {term}`expected value` or compute its {term}`variance`.
Formally, a random variable is a function $f_X: E \to \Reals$, where $E$ is the set of {term}`elementary events <elementary event>` from a probability space $(E, \Field, \P)$, and $F_X(x) = \P[f_X(e) < x]$. For the purposes of this class, we will not need this distinction.
Discussed in [Notes on Probability](resources/probability.md) and {video}`week4:continuous`.
regression
A modeling or prediction problem where we try to estimate or predict a continuous variable
This is the focus of {module}`week8`.
regularization A penalty term added to a loss function, typically penalizing large values. Used to encourage sparsity or to require coefficients to be supported by larger quantities of data.
Introduced in {video}`week11:regularization`.
residual
The error in estimating a variable with a model. For a model fitting an estimator
Introduced in {video}`week8:single-regression`.
sample
*n.* A subset of the population, for which we have observations.
Discussed in more detail in {video}`week4:sampling`.
sample size
The number of items in the sample. Often denoted $n$.
sampling distribution
The distribution of a statistic when it is computed over many repeated samples of the same size from the same population.
The sampling distribution of the sample mean from a population with mean
Discussed in {video}`week4:sampling`.
statistic
A value computed from a set of observations.
For example, the sample mean
Discussed in {video}`week4:introduction`.
standard deviation
A measure of the spread of a {term}`random variable`. It is the square root of the mean squared deviation from the mean:
$$\sigma_X = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$$
Sometimes we compute the **sample standard deviation**:
$$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$$
The sample variance $s^2$ is an {term}`unbiased estimator` of the population variance, although its square root $s$ is still slightly biased for the population standard deviation.
Dividing by $n$ instead of $n-1$ adds a small further bias when estimating the population standard deviation, but in reasonably large data sets this difference
is minuscule, and often is [not very important](https://dansblog.netlify.app/posts/2021-10-11-n-sane-in-the-membrane/)
(there are usually more impactful discrepancies between the sample estimate and population s.d. than this bias).
The standard deviation is the square root of the {term}`variance`.
Standard deviations can be computed with:
- {py:meth}`pandas.Series.std` (computes sample $s$, pass `ddof=0` to compute population $\sigma$)
- {py:func}`numpy.std` (computes population $\sigma$, pass `ddof=1` to compute sample $s$)
Introduced in {video}`week2:descriptive-statistics`.
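The effect of `ddof` can be sketched with NumPy on a small made-up array: both formulas are computed by hand and compared with `numpy.std`. Only NumPy is used here; per the list above, `pandas.Series.std` defaults to the other convention.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)
sq_dev = np.sum((x - x.mean()) ** 2)

# Population standard deviation: divide by n.
sigma = np.sqrt(sq_dev / n)

# Sample standard deviation: divide by n - 1.
s = np.sqrt(sq_dev / (n - 1))

print(sigma, np.std(x))        # NumPy default is ddof=0
print(s, np.std(x, ddof=1))    # ddof=1 gives the sample version
```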
standard error
The standard deviation of the sampling distribution of a statistic. The standard error of the mean (Pandas method {py:meth}`pandas.Series.sem`) is $s / \sqrt{n}$, where $s$ is the sample standard deviation and $n$ is the sample size.
Discussed in {video}`week4:confidence`.
standardization
Normalizing a variable to be in units of “standard deviations from the mean”, instead of the original units. This is done by
subtracting the mean and dividing by the standard deviation (in this formula, $\bar{x}$ is the mean and $s$ is the standard deviation):
$$\tilde{x}_i = \frac{x_i - \bar{x}}{s}$$
Demonstrated in [One Sample notebook](resources/tutorials/OneSample.ipynb).
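A minimal sketch of the formula above, assuming NumPy and a made-up array, and using the sample standard deviation (`ddof=1`) for $s$. After standardizing, the values have mean 0 and standard deviation 1.

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Subtract the mean, then divide by the sample standard deviation.
z = (x - x.mean()) / x.std(ddof=1)

print(z)
print(z.mean(), z.std(ddof=1))
```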
supervision signal
The label or outcome observations used for supervised machine learning. See {term}`label`.
This term is introduced in {video}`week13:unsupervised-intro`.
supervised learning
Training a model to predict an observed outcome or {term}`label`. We use this when we have known outcomes for training and evaluation data,
and want to build a model that will predict those outcomes for future data before they are observed (or when they cannot be observed).
Contrast {term}`unsupervised learning`.
test set
A portion of your data set that is held back to evaluate the effectiveness of the final model.
Contrast with {term}`training set`.
Sometimes erroneously called the {term}`validation set`.
Data is typically split into three pieces:
1. The test set
2. The tuning or validation set
3. The training set
Once model tuning is done, the model may be retrained on the union of the training and tuning sets, or it may be used as-is.
We can think of these either as three separate sets, or as a sequence of splits:
- Split the initial data into train and test data
- Re-split the training data into tuning data and a “train'” set
Introduced in {video}`week8:prediction-accuracy` and discussed in more detail in {video}`week11:workflow`.
See also [Training, validation, and test sets](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) on Wikipedia, and
[this answer on Cross Validated](https://stats.stackexchange.com/a/96869/389).
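The sequence of splits can be sketched with NumPy by shuffling row indices. The data set size and the split fractions (20% test, then 25% of the remainder for tuning) are arbitrary choices for illustration, not course requirements.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                      # hypothetical number of rows
idx = rng.permutation(n)      # shuffled row indices

# First split: hold back the test set.
n_test = int(0.2 * n)
test_idx = idx[:n_test]
rest = idx[n_test:]

# Second split: carve the tuning set out of the remaining training data.
n_tune = int(0.25 * len(rest))
tune_idx = rest[:n_tune]
train_idx = rest[n_tune:]

print(len(train_idx), len(tune_idx), len(test_idx))
```

Each row lands in exactly one of the three sets, so the test rows are never seen during training or tuning.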
training set
The portion of your data set on which you train your model.
Contrast with {term}`test set` and {term}`tuning set`.
See {term}`test set` for more details.
t-test A statistical test for means of normally-distributed data. T-tests come in three varieties:
1. One-sample *t*-test that tests whether a single mean is different from zero (or another fixed value $\mu_0$). $H_0: \mu=0$
2. Two-sample independent *t*-test that tests whether the means of two independent samples are the same. $H_0: \mu_1 = \mu_2$
3. Paired *t*-test that tests, for a sample of paired observations, whether the mean difference between observations for each sample is zero (the measurements are, on average, the same). $H_0: \mu_{x_{i1} - x_{i2}} = 0$
Discussed in {video}`week5:hypotest`, {video}`week5:t-test`, and associated readings.
tuning set
A portion of your data set that you use to compare the performance of different candidate models, for hyperparameter tuning, feature selection, and similar tasks.
Distinct from the {term}`test set`, which is only used once to test the performance of your final model.
Often called a validation set, but I avoid this term because it is ambiguous.
See {term}`test set` for more details.
unbiased estimator
An {term}`estimator` whose expected value is the population parameter.
unsupervised learning
Learning when we do not have a specific observed outcome to predict; this typically tries to learn patterns or structure in the training data,
but no external ground truth is available to know if the patterns it learns are “correct”. Contrast {term}`supervised learning`. Introduced in {video}`week13:unsupervised-intro`.
validation set
A widely-used name for the {term}`tuning set`. Sometimes validation and test are switched, so an author will talk about trying out different models with their test set and doing the final evaluation with a validation set. I avoid the term due to this confusion.
See {term}`test set` for more details.
variable
In statistics, a particular data value that can be observed. For example, a penguin's mass is a variable for penguin entities.
A {term}`random variable` is a variable that takes on random values (or unknown values, where we model the unknowns with randomness).
In **programming**, a name used to refer to a piece of data. The following Python code assigns the value 3 to the variable `x`:
```python
x = 3
```
variance A measure of the spread of a random variable (which may be observable quantities in the population).
$$\Var(X) = \E[(X - \E[X])^2]$$
Variance is the square of the {term}`standard deviation`, and is sometimes written $\sigma^2$. It is also related to the {term}`covariance`: $\Var(X) = \Cov(X, X)$.
Variance can be computed with:
- {py:meth}`pandas.Series.var` (computes sample variance, pass `ddof=0` to compute population variance)
- {py:func}`numpy.var` (computes population variance, pass `ddof=1` to compute sample variance)
vector
A sequence or array of numbers;
vectorization Writing a computation so that mathematical operations are done across entire arrays at a time, rather than looping over individual data points in Python code.
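A small sketch of the difference, assuming NumPy: the same computation written as an explicit Python loop and as a single array expression. The array and the formula are arbitrary examples.

```python
import numpy as np

x = np.arange(10000, dtype=float)

# Looping over individual data points in Python code.
squares_loop = np.empty_like(x)
for i in range(len(x)):
    squares_loop[i] = x[i] ** 2 + 1

# Vectorized: the operations apply across the entire array at once.
squares_vec = x ** 2 + 1

print(np.allclose(squares_loop, squares_vec))
```

Both versions produce the same values; the vectorized form is shorter and typically much faster because the loop runs inside NumPy rather than in the Python interpreter.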
:::