#[fit] AI 1
#[fit] Learning a Model
#[fit] Validation and Regularization
- SMALL World vs BIG World
- Approximation
- THE REAL WORLD HAS NOISE
- Complexity amongst Models
- Validation
- (0) Recap of key concepts from earlier
- (1) Validation and Cross Validation
- (2) Regularization
- (3) Multiple Features
##[fit] 0. From before
- SMALL World: given a map or model of the world, how do we do things within this map?
- BIG World: compares maps or models. Asks: what's the best map?
(Behaim Globe: 21 inches (51 cm) in diameter, fashioned from a type of papier-mache and coated with gypsum. (Wikipedia))
#[fit]RISK: What does it mean to FIT?
Minimize distance from the line?
Minimize the squared distance from the line: this is Empirical Risk Minimization (ERM).
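As a minimal sketch (the data and numbers here are made up for illustration), minimizing the empirical risk with squared loss for a line $$h(x) = \theta_0 + \theta_1 x$$ is just ordinary least squares:

```python
import numpy as np

# hypothetical sample: 30 noisy points around a "true" line
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 0.3, size=30)

def empirical_risk(theta0, theta1, x, y):
    """Mean squared distance of the data from the line h(x) = theta0 + theta1*x."""
    return np.mean((y - (theta0 + theta1 * x)) ** 2)

# ERM: pick the (theta0, theta1) that minimize the in-sample risk.
# For squared loss this is ordinary least squares, solvable in closed form.
X = np.column_stack([np.ones_like(x), x])          # design matrix with intercept column
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # [theta0, theta1]
print(theta_hat, empirical_risk(theta_hat[0], theta_hat[1], x, y))
```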
##[fit]Get intercept
#[fit] HYPOTHESIS SPACES
For example, a polynomial looks so:
All polynomials of a degree or complexity
$$ \cal{H}_1: h_1(x) = \theta_0 + \theta_1 x $$
$$ \cal{H}_{20}: h_{20}(x) = \sum_{i=0}^{20} \theta_i x^i$$
A sample of 30 points of data. Which fit is better? The line in $$\cal{H}_1$$ or the 20th-order polynomial in $$\cal{H}_{20}$$?
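A hedged sketch of fitting those two hypothesis spaces to such a sample (the data is synthetic; numpy's polynomial fitting is assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 30))               # 30 sample points
y = np.sin(np.pi * x) + rng.normal(0, 0.2, 30)    # noisy "real world" data

# h_1 in H_1: a straight line; h_20 in H_20: a 20th-order polynomial
h1 = np.polynomial.Polynomial.fit(x, y, deg=1)
h20 = np.polynomial.Polynomial.fit(x, y, deg=20)

mse = lambda h: np.mean((y - h(x)) ** 2)          # in-sample (training) risk
print("line   in-sample MSE:", mse(h1))
print("deg 20 in-sample MSE:", mse(h20))          # much smaller -- but is it better?
```

The 20th-order fit drives the in-sample risk toward zero; whether that makes it the better fit is exactly the question that follows.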
#Statement of the Learning Problem
The sample must be representative of the population!
A: The in-sample risk is small.
B: The population, or out-of-sample, risk is WELL estimated by the in-sample risk.

Thus the out-of-sample risk is also small.
Which fit is better now? The line or the curve?
#UNDERFITTING (Bias) vs OVERFITTING (Variance)
#TRAIN AND TEST
#MODEL COMPARISON: A Large World approach
- want to choose which Hypothesis set is best
- it should be the one that minimizes risk
- but minimizing the training risk alone tells us nothing: a complex enough model can interpolate the training points exactly
- we need to minimize the training risk, but not at the cost of generalization
- thus only minimize until the test-set risk starts going up (see the sketch below)
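A sketch of that recipe on synthetic data, using common scikit-learn conveniences (the dataset and the split sizes are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 60).reshape(-1, 1)
y = np.sin(np.pi * x).ravel() + rng.normal(0, 0.2, 60)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

for d in range(1, 16):
    g = make_pipeline(PolynomialFeatures(degree=d), LinearRegression()).fit(x_tr, y_tr)
    r_train = mean_squared_error(y_tr, g.predict(x_tr))
    r_test = mean_squared_error(y_te, g.predict(x_te))
    print(f"d={d:2d}  train risk={r_train:.3f}  test risk={r_test:.3f}")
# training risk keeps falling with d; test risk falls, then starts going back up
```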
##[fit] 1. Validation and
##[fit] Cross Validation
##[fit] Do we still have a test set?
Trouble:
- no discussion on the error bars on our error estimates
- "visually fitting" a value of
$$d \implies$$ contaminated test set.
The moment we use it in the learning process, it is not a test set.
#[fit]VALIDATION
- train-test is not enough, as we fit for $$d$$ on the test set and contaminate it; thus do train-validate-test
- we wrongly already attempted to fit $$d$$ on our previous test set
- choose the $$d, g^{-*}$$ combination with the lowest validation-set risk
- $$R_{val}(g^{-*}, d^*)$$ has an optimistic bias, since $$d$$ was effectively fit on the validation set
- finally, retrain on the entire train+validation set using the appropriate $$d^*$$; this works because training a given hypothesis space on more data typically reduces the risk even further (see the sketch below)
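Here is a minimal sketch of the full train-validate-test recipe just listed, on synthetic data with hypothetical split proportions (scikit-learn utilities assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def poly_model(d):
    return make_pipeline(PolynomialFeatures(degree=d), LinearRegression())

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 90).reshape(-1, 1)
y = np.sin(np.pi * x).ravel() + rng.normal(0, 0.2, 90)

# split into train / validation / test (proportions are arbitrary here)
x_trval, x_test, y_trval, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_tr, x_val, y_tr, y_val = train_test_split(x_trval, y_trval, test_size=0.25, random_state=0)

# fit g^- on the training set for each d, score it on the validation set
val_risk = {d: mean_squared_error(y_val, poly_model(d).fit(x_tr, y_tr).predict(x_val))
            for d in range(1, 16)}
d_star = min(val_risk, key=val_risk.get)

# retrain at d* on the full train+validation set, then estimate R_out on the test set
g = poly_model(d_star).fit(x_trval, y_trval)
print("d* =", d_star, " test risk =", mean_squared_error(y_test, g.predict(x_test)))
```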
What if we, just by chance, had an iffy validation set?
This problem is dire when we are in low-data situations. In large-data situations, not so much.
We then do cross-validation.
Key Idea: Repeat the validation process on different pieces of left out data. Make these left-out parts not overlap so that the risks/errors/mse calculated on each are not correlated.
#[fit]CROSS-VALIDATION
#[fit]CROSS-VALIDATION
#is
- a resampling method
- robust to outlier validation set
- allows for larger training sets
- allows for error estimates
Here we find:

- the validation process estimates $$R_{out}$$ directly, on the validation set. Its critical use is in the model selection process.
- once you do that, you can estimate $$R_{out}$$ using the test set as usual, but now you also have the benefit of a robust average and error bars.
- key subtlety: in the risk-averaging process, you are actually averaging over different $$g^-$$ models, with different parameters (see the sketch below).
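A hedged sketch of k-fold cross-validation used this way (synthetic data; scikit-learn's `cross_val_score` assumed): each non-overlapping left-out fold gives one estimate of $$R_{out}$$, and their spread gives error bars.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 60).reshape(-1, 1)
y = np.sin(np.pi * x).ravel() + rng.normal(0, 0.2, 60)

# 5-fold CV: each left-out fold yields one risk estimate,
# and each fold trains a *different* g^- (different parameters).
for d in (1, 3, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    risks = -cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    print(f"d={d:2d}  CV risk = {risks.mean():.3f} +/- {risks.std():.3f}")
```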
Consider a "small-world" approach to deal with finding the right model, where we'll choose a Hypothesis set that includes very complex models, and then find a way to subset this set.
This method is called
##[fit] 2. Regularization
##REGULARIZATION: A SMALL WORLD APPROACH
Keep higher a priori complexity and impose a
##complexity penalty
on risk instead, to choose a SUBSET of the complex hypothesis set.
Consider the set of 10th order polynomials:
$$\mathcal{H}_{10}=\left\{h(x)=w_{0}+w_{1} \Phi_{1}(x)+w_{2} \Phi_{2}(x)+w_{3} \Phi_{3}(x)+\cdots+w_{10} \Phi_{10}(x)\right\}$$
Now suppose we just set some of these to 0, then we get
$$\mathcal{H}_{2}=\left\{\begin{array}{c}{h(x)=w_{0}+w_{1} \Phi_{1}(x)+w_{2} \Phi_{2}(x)+w_{3} \Phi_{3}(x)+\cdots+w_{10} \Phi_{10}(x)} \\ {\text { such that: } w_{3}=w_{4}=\cdots=w_{10}=0}\end{array}\right\}$$
This is called a hard-order constraint.
$$\mathcal{H}_{C}=\left\{\begin{array}{c}{h(x)=w_{0}+w_{1} \Phi_{1}(x)+w_{2} \Phi_{2}(x)+w_{3} \Phi_{3}(x)+\cdots+w_{10} \Phi_{10}(x)} \\ {\text { such that: } \sum_{q=0}^{10} w_{q}^{2} \leq C}\end{array}\right\}$$

a soft budget constraint
- Optimal $$\mathbf{w}$$ tries to get as 'close' to $$\mathbf{w}_{lin}$$ as possible. Thus, the optimal $$\mathbf{w}$$ will use the full budget and lie on the surface $$\mathbf{w}^{T} \mathbf{w}=C$$.
- The surface $$\mathbf{w}^{T} \mathbf{w}=C$$, at the optimal $$\mathbf{w}$$, should be perpendicular to $$\nabla E_{\text{in}}$$.
- The normal to the surface $$\mathbf{w}^{T} \mathbf{w}=C$$ is the vector $$\mathbf{w}$$ itself.
- Since $$\nabla E_{\text{in}}$$ is perpendicular to the surface, it must lie along the normal $$\mathbf{w}$$, pointing opposite to it:

$$\nabla E_{\text{in}}\left(\mathbf{w}_{\text{reg}}\right)=-2 \lambda_{C} \mathbf{w}_{\text{reg}}$$
$$\begin{array}{l}{\qquad E_{\mathrm{in}}(\mathbf{w}) \quad \text { is minimized, subject to: } \mathbf{w}^{\mathrm{T}} \mathbf{w} \leq C} \\ {\Leftrightarrow \quad \nabla E_{\mathrm{in}}\left(\mathbf{w}_{\mathrm{reg}}\right)+2 \lambda_{C} \mathbf{w}_{\mathrm{reg}}=\mathbf{0}} \\ {\Leftrightarrow \quad \left.\nabla\left(E_{\mathrm{in}}(\mathbf{w})+\lambda_{C} \mathbf{w}^{\mathrm{T}} \mathbf{w}\right)\right|_{\mathbf{w}=\mathbf{w}_{\mathrm{reg}}}=\mathbf{0}} \\ {\Leftrightarrow \quad E_{\mathrm{in}}(\mathbf{w})+\lambda_{C} \mathbf{w}^{\mathrm{T}} \mathbf{w} \quad \text { is minimized, unconditionally }} \\ {\text { There is a correspondence: } C \uparrow \quad \lambda_{C} \downarrow}\end{array}$$
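The unconditional form $$E_{\mathrm{in}}(\mathbf{w})+\lambda \mathbf{w}^{\mathrm{T}} \mathbf{w}$$ has a closed-form minimizer for squared loss (ridge regression). A minimal sketch, with made-up data and $$\lambda$$ values chosen only for illustration (the $$1/N$$ normalization of $$E_{\mathrm{in}}$$ is glossed over here):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize ||y - Phi w||^2 + lam * w^T w  (E_in up to a 1/N factor)."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# hypothetical 10th-order polynomial features on a small noisy sample
rng = np.random.default_rng(5)
x = np.sort(rng.uniform(-1, 1, 20))
y = np.sin(np.pi * x) + rng.normal(0, 0.2, 20)
Phi = np.vander(x, 11, increasing=True)      # columns 1, x, x^2, ..., x^10

for lam in (0.0, 1e-3, 1e-1, 10.0):
    w = ridge_fit(Phi, y, lam)
    print(f"lambda={lam:g}  ||w|| = {np.linalg.norm(w):.2f}")  # the "budget" shrinks as lambda grows
```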
#[fit]REGULARIZATION
As we increase $$\lambda$$, the coefficients shrink and the fit gets simpler; as $$\lambda \to 0$$ we recover the unregularized, overfit-prone fit.
Lasso uses the $$L_1$$ penalty $$\sum_{i} |w_i|$$ instead of the ridge $$L_2$$ penalty $$\sum_{i} w_i^2$$; it tends to set some coefficients exactly to zero.
- Regularization is now a (soft) subsetting of a complex hypothesis set.
- If you subset too much, you underfit;
- if you do not subset enough, you overfit (see the sketch below).
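A sketch of that trade-off using scikit-learn's `Ridge` and `Lasso` on synthetic data (the `alpha` values are arbitrary): larger penalties shrink the $$L_2$$ norm of the ridge weights and zero out more lasso coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 40).reshape(-1, 1)
y = np.sin(np.pi * x).ravel() + rng.normal(0, 0.2, 40)

for alpha in (1e-4, 1e-2, 1.0):
    ridge = make_pipeline(PolynomialFeatures(10), Ridge(alpha=alpha)).fit(x, y)
    lasso = make_pipeline(PolynomialFeatures(10), Lasso(alpha=alpha, max_iter=100000)).fit(x, y)
    w_ridge = ridge.named_steps["ridge"].coef_
    w_lasso = lasso.named_steps["lasso"].coef_
    print(f"alpha={alpha:g}  ridge ||w||={np.linalg.norm(w_ridge):.2f}  "
          f"lasso zeros={np.sum(w_lasso == 0)}/{w_lasso.size}")
```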
##[fit] 3. Lots of features
| Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14.890 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 106.02 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 104.59 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 148.92 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Hispanic | 964 |
| 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |
If the predictor takes only two values, then we create an indicator or dummy variable that takes on two possible numerical values (0 and 1). If it takes N > 2 values, we need N-1 indicator columns:
Ethnicity = {Caucasian, Asian, Hispanic}
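A minimal sketch with pandas (the tiny frame below mimics a few rows of the table above, purely for illustration):

```python
import pandas as pd

# hypothetical slice of the table above
df = pd.DataFrame({
    "Income":    [14.890, 106.02, 104.59, 148.92],
    "Student":   ["No", "Yes", "No", "No"],
    "Ethnicity": ["Caucasian", "Asian", "Asian", "Hispanic"],
})

# a two-level predictor becomes one 0/1 indicator; a k-level predictor becomes k-1 columns
encoded = pd.get_dummies(df, columns=["Student", "Ethnicity"], drop_first=True)
print(encoded.columns.tolist())
# ['Income', 'Student_Yes', 'Ethnicity_Caucasian', 'Ethnicity_Hispanic']
```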
We presented polynomial regression as if it were not linear regression. But it is.
Linearity refers to the coefficients, not the features.
Here is another example: interaction terms with a categorical variable:
Here we interact a continuous predictor with a categorical one, so the continuous predictor's slope can differ across categories (a sketch follows).
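As a hedged example, suppose we interact Income with the Student indicator (this particular pair is just an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame({
    "Income":  [14.890, 106.02, 104.59, 148.92],
    "Student": ["No", "Yes", "No", "No"],
})

# dummy-encode the categorical, then multiply to form the interaction column:
# the slope on Income is now allowed to differ between students and non-students.
df["Student_Yes"] = (df["Student"] == "Yes").astype(int)
df["Income_x_Student"] = df["Income"] * df["Student_Yes"]
print(df[["Income", "Student_Yes", "Income_x_Student"]])
```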
As you can see, the number of features can balloon. In many modern problems (think of a startup with few customers but lots of data on each of them), there are already more predictors than members in your sample.
We then get the curse of dimensionality:
- data is sparser in higher dimensions
- volume moves to the outside: to cover the same fractional volume, you need to go bigger on length in higher dimensions (see the sketch after this list)
- remember dimensionality in our problems refers to the number of features we have
- each feature (or feature combination which we shall just call a new feature) is a dimension
- thus each member of our sample is a point in this feature space
- notions of distance and volume become hard in this high-dimensional space
- indeed, it's easier to find "simple models" in this high-dimensional space
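Two quick back-of-the-envelope illustrations of the volume claims above (pure arithmetic, no data involved):

```python
# Edge length of a sub-cube holding 10% of a unit hypercube's volume:
# edge = 0.10 ** (1/d), which creeps toward 1 as the dimension d grows.
for d in (1, 2, 10, 100):
    print(f"d={d:3d}  edge needed for 10% of the volume = {0.10 ** (1 / d):.3f}")

# Fraction of the unit hypercube's volume within 5% of its surface:
# the interior cube has edge 0.9, so the outer shell holds 1 - 0.9**d of the volume.
for d in (1, 2, 10, 100):
    print(f"d={d:3d}  volume within 5% of the surface = {1 - 0.9 ** d:.3f}")
```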