Skip to content

Latest commit

 

History

History
251 lines (185 loc) · 12.6 KB

README.md

File metadata and controls

251 lines (185 loc) · 12.6 KB

Machine Learning

A mix of class notes, online resources and hw assignments from the Machine Learning class at Columbia.

Table of Contents

Online Resources

Matrix Differentiation

Determinant

  • Tells us the size by which the unit vectors will scale or decrease by i.e. be transformed by.
  • The determinant tells us the factor by which areas are stretch or shrunk by.
    • great than one, then the area is growing.
    • less than one, then the area is shrinking.
   i j
  [1 0]
  [0 1]
  [3 1]
  [0 2]

^ has a derminant of 6 i.e. (3*2)-(1*0)
That means 

  [1 0]
  [0 1]

will grow 6 times it's size; the area will be 6 times the original size of the unit vectors.
  • If the determinant is zero, there is no inverse.

Intergral Calculus

Unsupervised Models

Kmeans

K-means is the simplest and most fundamental clustering algorithm.

Input: x1, . . . , xn, where x ∈ Rd.

Treats each x as a vector in Rd

Output: Vector c of cluster assignments, and K mean vectors µ.

  • c = (c1, . . . , cn), ci ∈ {1, . . . , K}
    • If ci = cj = k, then xi and xj are clustered together in cluster k.
    • c is an integer that represents each cluster.
  • µ = (µ1, . . . , µK), µk ∈ Rd (same space as xi)
    • Each µk (called a centroid) is a set of k d-dimensional vectors that's part of some c.
    • µ number of vectors for each cluster.
    • µk defines the cluster for the k's cluster.

The K-means objective function can be written as

kmeans_objective source: https://www.saedsayad.com/clustering_kmeans.htm

Algorithm

We have a double sum, a sum over each data point. For every single data point, xi, there's an additional sum over each cluster µk. Each ith data point will sum over k clusters. We'll determine the euclidean distance of each ith data point to the center of the cluster c i.e. centroid.

This objective is not convex; can't find the optimal µ and c. There are many theories that say you can. All we could do is derive an algorithm for a local minimum.

We can’t optimize the K-means objective function exactly by taking derivatives and setting to zero, so we use an iterative algorithm.

Can't do everything at once with a single algorithm. The algorithm will require iteration to modify values of whatever that is you're trying to learn.

Coordinate Descent

Split the parameters into two unknown sets µ and c. You could split them anyway you want. We're going to split them into two sets µ and c clusters. We'll then observe that even though we can't find the optimal value µ and c together, if we fix one, we could find the best value for the other one. And if we fixed c, we'll be able to find the best µ conditioned on c.

We split the variables into two unknown sets µ and c. We can’t find their best values at the same time to minimize L. However, we will see that

  • Fixing µ we can find the best c exactly.
  • Fixing c we can find the best µ exactly.

This optimization approach is called coordinate descent: Hold one set of parameters fixed, and optimize the other set. Then switch which set is fixed.

  1. Clusters the data into k groups where k is predefined.
  2. Select k points at random as cluster centers.
  3. Assign closest xi data points to their closest cluster center according to the Euclidean distance function.
    • Given µ, find the best value ci ∈ {1, . . . , K} for (x1, . . . , xi).
  4. Calculate the centroid or mean of all objects in each cluster. Out put the updated µ and c.
    • Given c, find the best vector µk ∈ Rd for k = 1, . . . , K.
  5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.

There’s a circular way of thinking about why we need to iterate:

3. Given a particular µ, we may be able to find the best c, but once wechange c we can probably find a better µ.
4. Then find the best µ for the new-and-improved c found in #1, but now that we’ve changed µ, there is probably a better c.

We have to iterate because the values of µ and c depend on each other. This happens very frequently in unsupervised models.

This is not a convex problem. Each time a new µ is initialized, we'll get different results.

kmeans_objective

Because there are only K options for each ci , there are no derivatives. Simply calculate all the possible values for ci and pick the best (smallest) one.

The outline of why this converges is straightforward:

  1. Every update to ci or µk decreases L compared to the previous value.
  2. Therefore, L is monotonically decreasing.
  3. L ≥ 0, so Step 3 converges to some point (but probably not to 0).

When c stops changing, the algorithm has converged to a local optimal solution. This is a result of L not being convex.

Non-convexity means that different initializations will give different results:

  • Often the results will be similar in quality, but no guarantees.
  • In practice, the algorithm can be run multiple times with different initializations. Then use the result with the lowest L.

Maximum Likelihood using EM Algorithm

Expectation/Maximization algorithm. Closely related to variational inference

  • probabilistic objective function
  • discussed for least squares, linear regression (gradient method) and the bayes classifier. That model is nice, because we could find the respective θML analytically by writing an equation and plugging in data to solve.

Markov Chain

Markov chain is a discrete state space with a stochastic process satisfying the Markov property. Alessandro Molina

Discrete state space

  • A sequence of random variables X1, X2, ..., Xn.
  • A Markov chain has either discrete state space (set of possible values of the random variables) or discrete index set (often representing time).
  • State space is a countable set S of Xi and cab be anything: letters, numbers, basketball scores or weather conditions.
  • The state space is a matrix that will be an N x N matrix, such that entry (i, j) is the probability of transitioning from state i to state j.

Stochastic process

  • The transition matrix must be a stochastic matrix, a matrix whose entries in each row must add up to exactly 1.
  • Each row of the matrix represents its own probability distribution.

Markov property

  • The probability of moving to the next state depends only on the present state
  • The distribution on the next position only depends on the current position.
  • The knowledge of the previous state is all that is necessary to determine the probability distribution of the current state, satisfying the rule of conditional independence (or said other way: you only need to know the current state to determine the next state).

So, the model is characterized by a state space, a transition matrix describing the probabilities of particular transitions, and an initial state across the state space, given in the initial distribution. Sejal Jaiswal

A simple Markov chain on the random variable, representing the random variable Weather = {Sunny, Rainy, Snowy} and showing the probability of the random variable switching to other states in the next time instance

The transition matrix represents the same information as in the dictionary, but in a more compact way. For this reason, the transition matrix is the standard way of representing Markov chains.

Markov chains are important mathematical tools that effectively aid the simplification of predicting stochastic processes by viewing the future as independent of the past, given the present state of the process.

Nonnegative Matrix Factorization

NMF is a technique that splits your data up into a set of individual signals and weights to apply to those signals to recreate your original data. It works best when your data is truly a sum of individual signals, but it will still work to some degree when the data isn’t quite like that. There are some caveats when it comes to applying NMF; the data needs to be positive, and laid out in an appropriate matrix, but what is the relationship between the number of observations/columns in the data matrix and the number of individual signals you can decompose the data into?. http://www.billconnelly.net/?p=534

nonnegative-matrix-factorization We use notation and think about the problem slightly differently from PMF

  • Data X has nonnegative entries. None missing, but likely many zeros.
  • The learned factorization W and H also have nonnegative entries.
  • The value Xij ≈ ∑kWikHkj, but we won’t write this with vector notation

What Problems can NMF solve?

  • Text data:
    • Word term frequencies
    • X_ij contains the number of times word i appears in document j.
  • Image data:
    • Face identification data sets.
    • Put each vectorized N×M image of a face on a column of X.
  • Other discrete grouped data:
    • Quantize continuous sets of features using K-means.
    • Xij counts how many times group j uses cluster i.
    • For example: group = song, features = d×n spectral information matrix.

Squared error objective algorithm

  • Multiplicative Update for ‖X−WH‖2 nonnegative-matrix-factorization-squared_error_objective

  • Randomly initialize H and W with nonnegative values.

  • Iterate the following, first for all values in H, then all in W: nonnegative-matrix-factorization-algorithm

    Visual of the Multiplicative Update ‖X−WH‖2
    • Use element-wise multiplication/division across three columns below. nonnegative-matrix-factorization-visual
    • Use matrix multiplication within each outlined box.
    Squared error objective in python
    H = (H * np.dot(W.T, X)) / (np.dot(W.T, W).dot(H) + err)
    W = (W * np.dot(X, H.T)) / (np.dot(W, H).dot(H.T) + err)

Divergence objective algorithm

  • Multiplicative Update for D(X‖WH): nonnegative-matrix-factorization-divergence-objective

  • Randomly initialize H and W with nonnegative values.

  • Iterate the following, first for all values in H, then all in W: nonnegative-matrix-factorization-divergence-penalty

    Visual of the Multiplicative Update D(X‖WH)
    • Visualizing the update for the divergence penalty is more complicated.
    • Use the color-coded definition below. nonnegative-matrix-factorization-divergence-visual
    • "Purple" is the data matrix “dot-divided” by the approximation of it.