- Define a linear classifier (logistic regression)
- Define an objective function (likelihood)
- Optimize it with gradient descent to learn parameters
- Predict the class with highest probability under the model
- h(x) = sign(θ^Tx) is not differentiable
- Use a differentiable function instead:
- pθ(y=1|x) = 1 / (1 + exp(-θ^Tx)) = sigmoid(θ^Tx)
- sigmoid(u) = logistic(u) = 1 / (1 + e^(-u))
- Despite its name, logistic regression is a simple linear classifier.
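A minimal NumPy sketch of the sigmoid (logistic) function used as the differentiable surrogate for sign; the function name and the numerically stable form are illustrative choices, not from the notes:

```python
import numpy as np

def sigmoid(u):
    """Logistic function sigma(u) = 1 / (1 + exp(-u)), computed in a stable way."""
    u = np.asarray(u, dtype=float)
    z = np.exp(-np.abs(u))  # always <= 1, so no overflow
    # For u >= 0: 1/(1+exp(-u)); for u < 0: exp(u)/(1+exp(u)) -- same value, stable form.
    return np.where(u >= 0, 1.0 / (1.0 + z), z / (1.0 + z))
```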
- Data: Inputs are continuous vectors of length M. Outputs are discrete.
$D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$ where $x \in \mathbb{R}^M$ and $y \in \{0, 1\}$
- Model: Logistic function applied to dot product of parameters with input vector.
$p_\theta(y=1|x) = \frac{1}{1+\exp(-\theta^T x)}$
- Learning: Finds the parameters that minimize some objective function.
$\theta^* = \operatorname{argmin}_\theta J(\theta)$
- Prediction: Output is the most probable class
$\hat{y} = \operatorname{argmax}_{y \in \{0, 1\}} p_\theta(y|x)$
- Assume: inputs are given feature vectors x ∈ R^M
- Model:
- φ = σ(θ^Tx) where σ(u) = 1 / (1 + exp(-u))
- y ~ Bernoulli(φ)
- p(y|x) =
- σ(θ^Tx) if y = 1
- 1 - σ(θ^Tx) if y = 0
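A small sketch of the model's conditional probability for one example, written directly from the Bernoulli formulation above (assumes the `sigmoid` helper sketched earlier; names are illustrative):

```python
def p_y_given_x(theta, x, y):
    """Bernoulli model: sigma(theta^T x) if y == 1, else 1 - sigma(theta^T x)."""
    phi = sigmoid(theta @ x)          # phi = P(y = 1 | x)
    return phi if y == 1 else 1.0 - phi
```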
- Objective function:
- l(θ) = Σ^N_{i=1} log p(y^(i)|x^(i), θ)
- J(θ) = -(1/N) l(θ) = (1/N) Σ^N_{i=1} -log p(y^(i)|x^(i), θ)
- The negative average conditional log-likelihood for logistic regression is convex
- θ_MLE = argmax l(θ) = argmin -l(θ) = argmin -(1/N) l(θ)
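A sketch of the negative average conditional log-likelihood J(θ) over a dataset; the cross-entropy form below is equivalent to the piecewise definition of p(y|x), and the names and the `eps` guard are illustrative:

```python
def neg_avg_log_likelihood(theta, X, y):
    """J(theta) = -(1/N) sum_i log p(y^(i) | x^(i), theta).

    X: (N, M) array of inputs, y: (N,) array of 0/1 labels.
    """
    phi = sigmoid(X @ theta)                       # P(y = 1 | x^(i)) for each i
    eps = 1e-12                                    # avoid log(0)
    log_lik = y * np.log(phi + eps) + (1 - y) * np.log(1.0 - phi + eps)
    return -np.mean(log_lik)
```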
- dJ^(i)(θ)/dθ_m = d/dθ_m (-log p(y^(i)|x^(i), θ))
- = d/dθ_m -log[σ(θ^T x^(i))] if y^(i) = 1
- = d/dθ_m -log[1 - σ(θ^T x^(i))] if y^(i) = 0
- = -(y^(i) - σ(θ^T x^(i))) x^(i)_m
- = -(truth - predicted probability of y = 1) × (m-th feature)
- ∇J^(i)(θ) = -(y^(i) - σ(θ^T x^(i))) x^(i)
- ∇J(θ) = (1/N) Σ^N_{i=1} ∇J^(i)(θ)
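The gradient formulas above translate directly into a few lines; a sketch under the same naming assumptions as the earlier helpers:

```python
def grad_J_single(theta, x_i, y_i):
    """Per-example gradient: -(y^(i) - sigma(theta^T x^(i))) * x^(i)."""
    return -(y_i - sigmoid(theta @ x_i)) * x_i

def grad_J(theta, X, y):
    """Full gradient: average of the per-example gradients over all N examples."""
    residual = y - sigmoid(X @ theta)    # (truth - predicted probability of y = 1)
    return -(X.T @ residual) / X.shape[0]
```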
- Find θ̂ by gradient descent or SGD
- Predict the most probable class:
- ŷ = argmax_{y ∈ {0,1}} p(y|x)
- = 1 if p(y=1|x) ≥ 0.5
- = 0 otherwise
- Equivalently, ŷ = 1 iff θ^T x ≥ 0 (threshold the linear score at zero)
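The 0.5 threshold on the predicted probability is equivalent to checking the sign of θ^T x, since σ(u) ≥ 0.5 exactly when u ≥ 0; a minimal sketch:

```python
def predict(theta, X):
    """Return 1 where P(y = 1 | x) >= 0.5, i.e. where theta^T x >= 0; else 0."""
    return (X @ theta >= 0).astype(int)
```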
- Learning: finds the parameters that minimize some objective function.
$\theta^* = \operatorname{argmin}_\theta J(\theta)$
- We minimize the negative log conditional likelihood:
$J(\theta) = - \log{\prod_{i=1}^N p_\theta(y^{(i)}|x^{(i)})}$
- Approaches
- Gradient Descent
- take larger, more certain steps opposite the gradient
- Stochastic Gradient Descent
- take many small steps opposite the gradient
- Newton's Method
- use second derivatives to better follow curvature
- Gradient Descent
- Compute true gradient exactly from all N examples
- Stochastic Gradient Descent
- Approximate true gradient by the gradient of one randomly chosen example
- Mini-Batch SGD
- Approximate true gradient by the average gradient of S randomly chosen examples
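A sketch of a mini-batch SGD training loop tying the pieces together; with S = 1 it reduces to plain SGD and with S = N to full gradient descent. The step size, epoch count, batch size, and seed below are illustrative, not from the notes:

```python
def train_logistic_regression(X, y, lr=0.1, epochs=100, batch_size=32, seed=0):
    """Fit theta with mini-batch SGD on the negative average log-likelihood."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            # Approximate the true gradient by the mini-batch average, step opposite it.
            theta = theta - lr * grad_J(theta, X[idx], y[idx])
    return theta
```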
- Ex: Polar Bears, Sea Lions, Whales
- Three hyperplanes:
- θp·x = 0
- θs·x = 0
- θw·x = 0
- P(y = p | x) = exp(θ_p^T x) / Z(x)
- P(y = s | x) = exp(θ_s^T x) / Z(x)
- P(y = w | x) = exp(θ_w^T x) / Z(x)
- Z(x) = exp(θ_p^T x) + exp(θ_s^T x) + exp(θ_w^T x)
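A sketch of the three-class example: one weight vector per class, unnormalized scores exp(θ_k^T x), and the partition function Z(x) that normalizes them. The row ordering of the classes and the max-shift for numerical stability are illustrative choices:

```python
def softmax_probs(Theta, x):
    """P(y = k | x) = exp(theta_k . x) / Z(x), with Z(x) = sum_k exp(theta_k . x).

    Theta: (K, M) matrix with one row per class (e.g. theta_p, theta_s, theta_w); x: (M,) vector.
    """
    scores = Theta @ x                        # theta_k . x for each class k
    scores = scores - scores.max()            # shift for stability; normalized probabilities unchanged
    unnormalized = np.exp(scores)
    return unnormalized / unnormalized.sum()  # divide by Z(x)
```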
- Def: Multiclass Classification
- x ∈ R^M, y ∈ {1, 2, ..., K}
- Model: p(y|x) = exp(θ_y · x) / Σ_{k=1}^{K} exp(θ_k · x)
- where θ is a K × M matrix with rows θ_1, θ_2, ..., θ_K
- φ = [φ1, φ2, ..., φK]^T
- φ_k = P(y = k | x) for k = 1, ..., K
- y ~ Categorical(φ)
- Learning by MLE
- Negative Conditional Log-likelihood
- l(θ) = log[Π^N_{i=1} P(y^(i)|x^(i), θ)] = Σ^N_{i=1} log P(y^(i)|x^(i), θ)
- J(θ) = -(1/N) l(θ), which is convex
- θ* = argmin_θ J(θ); can use GD or SGD
- Compute derivatives of the negative conditional log-likelihood for optimization with SGD
- Gradient: dJ^(i)(θ)/dθ_{km} = d/dθ_{km} (-log p(y^(i)|x^(i), θ)) = -(𝟙(y^(i) = k) - p(y = k | x^(i), θ)) x^(i)_m
- where 𝟙(y^(i) = k) = 1 if y^(i) = k; 0 otherwise
- dJ^(i)(θ)/dθ_k = ∇_{θ_k} J^(i)(θ) = -(𝟙(y^(i) = k) - p(y = k | x^(i), θ)) x^(i)
- = [d·/dθ_{k1}, d·/dθ_{k2}, ..., d·/dθ_{kM}]; compute for each k = 1, ..., K
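A sketch of the per-example gradient for the softmax model: for each class k, subtract the indicator that y^(i) = k from the predicted probability of class k and scale by x^(i). It assumes the `softmax_probs` helper above and 0-based class labels (an indexing convention, not from the notes):

```python
def grad_J_single_multiclass(Theta, x_i, y_i):
    """Per-example gradient dJ^(i)/dTheta: row k is -(1{y^(i)=k} - P(y=k|x^(i))) * x^(i)."""
    probs = softmax_probs(Theta, x_i)         # P(y = k | x^(i)) for each class k
    indicator = np.zeros_like(probs)
    indicator[y_i] = 1.0                      # indicator 1{y^(i) = k}, y_i assumed 0-based
    return -np.outer(indicator - probs, x_i)  # shape (K, M), matching Theta
```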
- Prediction: output the most probable class, ŷ = argmax_{k ∈ {1,...,K}} p(y = k | x)
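A one-line sketch of this prediction rule, under the same assumptions as above:

```python
def predict_multiclass(Theta, x):
    """Return the most probable class: argmax_k P(y = k | x) (0-based index)."""
    return int(np.argmax(softmax_probs(Theta, x)))
```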