- Define a linear classifier (logistic regression)
- Define an objective function (likelihood)
- Optimize it with gradient descent to learn parameters
- Predict the class with highest probability under the model
- h(x) = sign(θ^Tx) is not differentiable
- Use a differentiable function instead:
- pθ(y=1|x) = 1 / (1 + exp(-θ^Tx)) = sigmoid(θ^Tx)
- sigmoid(u) = logistic(u) = 1 / (1 + e^(-u))
- Despite its name, logistic regression is a simple linear classifier.
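A minimal NumPy sketch of the sigmoid (logistic) function used as the differentiable surrogate for sign; the function name and the numerically stable form are illustrative choices, not from the notes:

```python
import numpy as np

def sigmoid(u):
    """Logistic function sigma(u) = 1 / (1 + exp(-u)), computed in a stable way."""
    u = np.asarray(u, dtype=float)
    z = np.exp(-np.abs(u))  # always <= 1, so no overflow
    # For u >= 0: 1/(1+exp(-u)); for u < 0: exp(u)/(1+exp(u)) -- same value, stable form.
    return np.where(u >= 0, 1.0 / (1.0 + z), z / (1.0 + z))
```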
- Data: Inputs are continuous vectors of length M. Outputs are discrete.
$D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$ where $x \in \mathbb{R}^M$ and $y \in \{0, 1\}$
- Model: Logistic function applied to dot product of parameters with input vector.
$p_\theta(y=1|x) = \frac{1}{1+\exp(-\theta^T x)}$
- Learning: Finds the parameters that minimize some objective function.
$\theta^* = \operatorname{argmin}_\theta J(\theta)$
- Prediction: Output is the most probable class
$\hat{y} = \operatorname{argmax}_{y \in \{0, 1\}} p_\theta(y|x)$
- Assume: inputs are given feature vectors x ∈ R^M
- Model:
- φ = σ(θ^Tx) where σ(u) = 1 / (1 + exp(-u))
- y ~ Bernoulli(φ)
- p(y|x) =
- σ(θ^Tx) if y = 1
- 1 - σ(θ^Tx) if y = 0
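A small sketch of the model's conditional probability for one example, written directly from the Bernoulli formulation above (assumes the `sigmoid` helper sketched earlier; names are illustrative):

```python
def p_y_given_x(theta, x, y):
    """Bernoulli model: sigma(theta^T x) if y == 1, else 1 - sigma(theta^T x)."""
    phi = sigmoid(theta @ x)          # phi = P(y = 1 | x)
    return phi if y == 1 else 1.0 - phi
```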
- Objective function:
- l(θ) = Σ^N_{i=1} log p(y^(i)|x^(i), θ)
- J(θ) = -(1/N) l(θ) = (1/N) Σ^N_{i=1} -log p(y^(i)|x^(i), θ)
- The negative average conditional log-likelihood for logistic regression is convex
- θ_MLE = argmax l(θ) = argmin -l(θ) = argmin -(1/N) l(θ)
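A sketch of the negative average conditional log-likelihood J(θ) over a dataset; the cross-entropy form below is equivalent to the piecewise definition of p(y|x), and the names and the `eps` guard are illustrative:

```python
def neg_avg_log_likelihood(theta, X, y):
    """J(theta) = -(1/N) sum_i log p(y^(i) | x^(i), theta).

    X: (N, M) array of inputs, y: (N,) array of 0/1 labels.
    """
    phi = sigmoid(X @ theta)                       # P(y = 1 | x^(i)) for each i
    eps = 1e-12                                    # avoid log(0)
    log_lik = y * np.log(phi + eps) + (1 - y) * np.log(1.0 - phi + eps)
    return -np.mean(log_lik)
```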
- dJ^(i)(θ)/dθ_m = d/dθ_m (-log p(y^(i)|x^(i), θ))
- = d/dθ_m -log[σ(θ^T x^(i))] if y^(i) = 1
- = d/dθ_m -log[1 - σ(θ^T x^(i))] if y^(i) = 0
- = -(y^(i) - σ(θ^T x^(i))) x^(i)_m
- = -(truth - predicted probability of y = 1) × (m-th feature)
- ∇J^(i)(θ) = -(y^(i) - σ(θ^T x^(i))) x^(i)
- ∇J(θ) = (1/N) Σ^N_{i=1} ∇J^(i)(θ)
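The gradient formulas above translate directly into a few lines; a sketch under the same naming assumptions as the earlier helpers:

```python
def grad_J_single(theta, x_i, y_i):
    """Per-example gradient: -(y^(i) - sigma(theta^T x^(i))) * x^(i)."""
    return -(y_i - sigmoid(theta @ x_i)) * x_i

def grad_J(theta, X, y):
    """Full gradient: average of the per-example gradients over all N examples."""
    residual = y - sigmoid(X @ theta)    # (truth - predicted probability of y = 1)
    return -(X.T @ residual) / X.shape[0]
```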
- Find θ̂ by gradient descent or SGD
- Predict the most probable class:
- ŷ = argmax_{y ∈ {0,1}} p(y|x)
- = 1 if p(y=1|x) ≥ 0.5
- = 0 otherwise
- Equivalently, ŷ = 1 iff θ^T x ≥ 0 (threshold the linear score at zero)
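The 0.5 threshold on the predicted probability is equivalent to checking the sign of θ^T x, since σ(u) ≥ 0.5 exactly when u ≥ 0; a minimal sketch:

```python
def predict(theta, X):
    """Return 1 where P(y = 1 | x) >= 0.5, i.e. where theta^T x >= 0; else 0."""
    return (X @ theta >= 0).astype(int)
```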
- Learning: finds the parameters that minimize some objective function.
$\theta^* = \operatorname{argmin}_\theta J(\theta)$
- We minimize the negative log conditional likelihood:
$J(\theta) = - \log{\prod_{i=1}^N p_\theta(y^{(i)}|x^{(i)})}$
- Approaches
- Gradient Descent
- take larger, more certain steps opposite the gradient
- Stochastic Gradient Descent
- take many small steps opposite the gradient
- Newton's Method
- use second derivatives to better follow curvature
- Gradient Descent
- Compute true gradient exactly from all N examples
- Stochastic Gradient Descent
- Approximate true gradient by the gradient of one randomly chosen example
- Mini-Batch SGD
- Approximate true gradient by the average gradient of S randomly chosen examples
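A sketch of a mini-batch SGD training loop tying the pieces together; with S = 1 it reduces to plain SGD and with S = N to full gradient descent. The step size, epoch count, batch size, and seed below are illustrative, not from the notes:

```python
def train_logistic_regression(X, y, lr=0.1, epochs=100, batch_size=32, seed=0):
    """Fit theta with mini-batch SGD on the negative average log-likelihood."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            # Approximate the true gradient by the mini-batch average, step opposite it.
            theta = theta - lr * grad_J(theta, X[idx], y[idx])
    return theta
```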
- Ex: Polar Bears, Sea Lions, Whales
- Three hyperplanes:
- θp·x = 0
- θs·x = 0
- θw·x = 0
- P(y = p | x) = exp(θ_p^T x) / Z(x)
- P(y = s | x) = exp(θ_s^T x) / Z(x)
- P(y = w | x) = exp(θ_w^T x) / Z(x)
- Z(x) = exp(θ_p^T x) + exp(θ_s^T x) + exp(θ_w^T x)
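A sketch of the three-class example: one weight vector per class, unnormalized scores exp(θ_k^T x), and the partition function Z(x) that normalizes them. The row ordering of the classes and the max-shift for numerical stability are illustrative choices:

```python
def softmax_probs(Theta, x):
    """P(y = k | x) = exp(theta_k . x) / Z(x), with Z(x) = sum_k exp(theta_k . x).

    Theta: (K, M) matrix with one row per class (e.g. theta_p, theta_s, theta_w); x: (M,) vector.
    """
    scores = Theta @ x                        # theta_k . x for each class k
    scores = scores - scores.max()            # shift for stability; normalized probabilities unchanged
    unnormalized = np.exp(scores)
    return unnormalized / unnormalized.sum()  # divide by Z(x)
```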
- Def: Multiclass Classification
- x ∈ R^M, y ∈ {1, 2, ..., K}
- Model: p(y|x) = exp(θ_y · x) / Σ_{k=1}^{K} exp(θ_k · x)
- where θ is a K × M matrix with rows θ_1, θ_2, ..., θ_K
- φ = [φ1, φ2, ..., φK]^T
- φ_k = P(y = k | x) for k = 1, ..., K
- y ~ Categorical(φ)
- Learning by MLE
- Negative Conditional Log-likelihood
- l(θ) = log[Π^N_{i=1} P(y^(i)|x^(i), θ)] = Σ^N_{i=1} log P(y^(i)|x^(i), θ)
- J(θ) = -(1/N) l(θ), which is convex
- θ* = argmin_θ J(θ); can use GD or SGD
- Compute derivatives of the negative conditional log-likelihood for optimization with SGD
- Gradient: dJ^(i)(θ)/dθ_{km} = d/dθ_{km} (-log p(y^(i)|x^(i), θ)) = -(𝟙(y^(i) = k) - p(y = k | x^(i), θ)) x^(i)_m
- where 𝟙(y^(i) = k) = 1 if y^(i) = k; 0 otherwise
- dJ^(i)(θ)/dθ_k = ∇_{θ_k} J^(i)(θ) = -(𝟙(y^(i) = k) - p(y = k | x^(i), θ)) x^(i)
- = [d·/dθ_{k1}, d·/dθ_{k2}, ..., d·/dθ_{kM}]; compute for each k = 1, ..., K
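A sketch of the per-example gradient for the softmax model: for each class k, subtract the indicator that y^(i) = k from the predicted probability of class k and scale by x^(i). It assumes the `softmax_probs` helper above and 0-based class labels (an indexing convention, not from the notes):

```python
def grad_J_single_multiclass(Theta, x_i, y_i):
    """Per-example gradient dJ^(i)/dTheta: row k is -(1{y^(i)=k} - P(y=k|x^(i))) * x^(i)."""
    probs = softmax_probs(Theta, x_i)         # P(y = k | x^(i)) for each class k
    indicator = np.zeros_like(probs)
    indicator[y_i] = 1.0                      # indicator 1{y^(i) = k}, y_i assumed 0-based
    return -np.outer(indicator - probs, x_i)  # shape (K, M), matching Theta
```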
- Prediction: output the most probable class, ŷ = argmax_{k ∈ {1,...,K}} p(y = k | x)
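A one-line sketch of this prediction rule, under the same assumptions as above:

```python
def predict_multiclass(Theta, x):
    """Return the most probable class: argmax_k P(y = k | x) (0-based index)."""
    return int(np.argmax(softmax_probs(Theta, x)))
```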