Project Status: Beta
Current Version: 2.571
KernelML is a brute-force optimizer that can be used to train machine learning models. The package uses a combination of neuroevolution algorithms, heuristics, and Monte Carlo simulations to optimize a parameter vector with a user-defined loss function.
pip install kernelml --upgrade
Let's take the problem of clustering longitude and latitude coordinates. Clustering methods such as K-means use Euclidean distances to compare observations. However, the Euclidean distances between longitude and latitude data points do not map directly to Haversine distances. That means that if you normalize the coordinates between 0 and 1, the distances won't be accurately represented in the clustering model. A possible solution is to find a projection of latitude and longitude such that the Haversine distance to the centroid of the data points equals the Euclidean distance between the projected coordinates and the projected centroid. Please see kernelml-haversine-to-euclidean.py.
import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = [x*np.pi/180 for x in [lon1, lat1, lon2, lat2]]
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
def euclid_dist_to_centroid(x,y,w):
    # x holds the point and centroid coordinates as [lon1, lat1, lon2, lat2],
    # y holds the Haversine distance between them, and w[0], w[1] are the
    # projection scales for longitude and latitude
    lon1, lat1, lon2, lat2 = x[:,0:1],x[:,1:2],x[:,2:3],x[:,3:4]
    euclid = np.sqrt((w[0]*lon1-w[0]*lon2)**2+(w[1]*lat1-w[1]*lat2)**2)
    haver = y
    return np.sum((euclid-haver)**2)
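As a rough, hedged sketch of how this loss could be passed to the optimizer (the working example is in kernelml-haversine-to-euclidean.py; here X is assumed to hold [lon, lat, centroid_lon, centroid_lat] and y the Haversine distance of each point to its centroid):

# sketch only: argument values are illustrative, not tuned
kml = kernelml.KernelML()
parameters_by_run, loss_by_run = kml.optimize(X, y, euclid_dist_to_centroid,
                                              num_param=2,
                                              total_iterations=100)
best_w = kml.model.get_best_param()  # projection scales for longitude and latitude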
Another, simpler problem is to find the optimal values of non-linear coefficients, i.e., power transformations in a least squares linear model. The reason for doing this is simple: integer power transformations rarely capture the best-fitting transformation. By allowing the power transformation to be any real number, accuracy improves and the model generalizes better to the validation data. Please see kernelml-power-transformation-example.py.
def poly_least_sqs_loss(x,y,w):
    # w[3] is the real-valued exponent of the power transformation
    hypothesis = w[0]*x[:,0:1] + w[1]*(x[:,1:2]) + w[2]*(x[:,1:2])**w[3]
    loss = hypothesis-y
    return np.sum(loss**2)/len(y)
The optimizer returns a history of parameters for every iteration. Each parameter set in the history fits the data slightly differently, and ensembling the predicted values from these parameter sets can improve results. In this example, the phase, the time shift, and the scaling of the cosine term update in that order.
def sin_least_sqs_loss(x,y,w):
    # the phase (w[1]), time shift (w[2]), and scaling (w[3]) of the cosine term
    hypothesis = w[0]*x[:,0:1] + np.cos(x[:,1:2]*w[1]-w[2])*w[3]
    loss = hypothesis-y
    return np.sum(loss**2)/len(y)
The predicted output from each parameter set was used as a feature in a unifying model. The train and validation plots below show how using multiple parameter sets can fit complex shapes.
Please see the example code in kernelml-time-series-example.py. Ensemble model results may vary due to the random nature of parameter updates.
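The snippet below is only a rough sketch of the ensembling idea, assuming X_train, y_train, and X_valid are already defined and that the parameter history is returned as a numpy array with one row per iteration; the complete workflow is in the example file.

# hedged sketch: use the prediction from each recent parameter set as a feature
# and fit a simple least-squares model on top as the unifying model
def sin_predict(x, w):
    return w[0]*x[:,0:1] + np.cos(x[:,1:2]*w[1]-w[2])*w[3]

params = kml.model.get_param_by_iter()   # parameter history (see accessors below)
features = np.hstack([sin_predict(X_train, w) for w in params[-5:]])
coef, *_ = np.linalg.lstsq(features, y_train, rcond=None)
valid_features = np.hstack([sin_predict(X_valid, w) for w in params[-5:]])
ensemble_pred = valid_features.dot(coef)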
KernelML can minimize non-linear loss functions, such as the negative log-likelihood for logistic regression. In addition, the optimizer can fit a distribution to an empirical histogram using the negative log-likelihood as the loss function. Please see kernelml-likelihood-distribution-fitting.py.
def logistic_likelihood_loss(x,y,w):
    hypothesis = x.dot(w)
    hypothesis = 1/(1+np.exp(-1*hypothesis))
    # clip the probabilities away from 0 and 1 so the logs stay finite
    hypothesis = np.clip(hypothesis, 0.00001, 1-0.00001)
    loss = -1*((1-y).T.dot(np.log(1-hypothesis)) + y.T.dot(np.log(hypothesis)))/len(y)
    return loss.flatten()[0]
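The distribution-fitting case is covered in the example file; as a hedged illustration only, a negative log-likelihood loss for fitting a normal distribution to observed samples could look like the following (the choice of a normal distribution and the column layout are assumptions, not part of the package):

# hypothetical loss: fit a normal distribution to the samples in x[:,0]
# w[0] is the mean, w[1] the standard deviation (forced positive);
# y is unused but kept to match the f(x,y,w) signature
def normal_likelihood_loss(x, y, w):
    mu, sigma = w[0], np.abs(w[1]) + 1e-6
    log_pdf = -np.log(sigma*np.sqrt(2*np.pi)) - (x[:,0:1]-mu)**2/(2*sigma**2)
    return -np.sum(log_pdf)/len(x)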
Add a parameter for L2 regularization and allow the alpha parameter to fluctuate around a target value. The added flexibility can improve generalization to the validation data. Please see kernelml-enhanced-ridge-example.py.
def ridge_least_sqs_loss(x,y,w):
    # the last element of w is the alpha regularization parameter
    alpha,w = w[-1][0],w[:-1]
    # penalize alpha for drifting below the target value
    penalty = 0
    value = 1
    if alpha<=value:
        penalty = 10*abs(value-alpha)
    hypothesis = x.dot(w)
    loss = hypothesis-y
    return np.sum(loss**2)/len(y) + alpha*np.sum(w[1:]**2) + penalty*np.sum(w[1:]**2)
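When calling the optimizer with this loss, one extra parameter is needed because alpha rides along as the last element of w; a hedged sketch, assuming X already contains an intercept column:

# sketch: X has p feature columns, so num_param is p+1 (the extra slot is alpha)
parameters_by_run, loss_by_run = kml.optimize(X, y, ridge_least_sqs_loss,
                                              num_param=X.shape[1]+1,
                                              total_iterations=100)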
There are many potential strategies for choosing optimization parameters. However, the choice of settings involves balancing the simulation_factor, mutation_factor, and breed_factor.
- The mutation factor should be inversely proportional to the simulation factor times the number of parameters. The more data points and the fewer parameters, the less likely large mutations are to be successful.
import kernelml

kml = kernelml.KernelML(prior_sampler_fcn=None,
                        sampler_fcn=None,
                        intermediate_sampler_fcn=None,
                        parameter_transform_fcn=None,
                        batch_size=None)
- prior_sampler_fcn: the function defines how the initial parameter set is sampled
- sampler_fcn: the function defines how the parameters are sampled between iterations
- intermediate_sampler_fcn: this function defines how the priors are set between runs
- parameter_transform_fcn: this function transforms the parameter vector before processing
- batch_size: defines the random sample size before each run (default is all samples)
# Begins the optimization process, returns (list of parameters by run),(list of losses by run)
parameters_by_run,loss_by_run = kml.optimize(X,y,loss_function,num_param,args=[],
                                             kml_model=None,
                                             runs=1,
                                             total_iterations=100,
                                             n_parameter_updates=100,
                                             bias=1,
                                             variance=1,
                                             analyze_n_parameters=20,
                                             update_magnitude=100,
                                             convergence_z_score=1,
                                             init_random_sample_num=1000,
                                             random_sample_num=100,
                                             prior_uniform_low=-1,
                                             prior_uniform_high=1,
                                             plot_feedback=False,
                                             print_feedback=False)
- X: input matrix
- y: output vector
- loss_function: f(x,y,w), outputs loss
- num_param: number of parameters in the loss function
- args: list of extra data to be passed to the loss function
- kml_model: transfer learned weights from another kml.model
- runs: number of runs
- total_iterations: number of iterations (+bias)
- n_parameter_updates: the number of parameter updates per iteration (+bias)
The optimizer's behavior can be adjusted with the following parameters:
- simulation_factor: increases the (coefficient) population size
- mutate_factor: increases the amount of coefficient augmentation, increases the search magnitude
- breed_factor: increases the combination of coefficient sets, searches the subspace of simulated coefficients
- analyze_n_parameters: the number of parameters analyzed (+variance)
- update_magnitude: the magnitude of the updates - corresponds to magnitude of loss function (+variance)
- init_random_sample_num: the number of initial simulated parameters (+bias)
- random_sample_num: the number of intermediate simulated parameters (+bias)
- convergence_z_score: the z score - defines when the algorithm converges
- prior_uniform_low: default prior random sampler - uniform distribution - low
- prior_uniform_high: default prior random sampler - uniform distribution - high
- plot_feedback: provides real-time plots of parameters and losses
- print_feedback: real-time feedback of parameters, losses, and convergence
params = kml.model.get_param_by_iter()
errors = kml.model.get_loss_by_iter()
update_history = kml.model.get_parameter_update_history()
best_w = kml.model.get_best_param()
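As a brief, hedged example of how these accessors might be used after a run (assuming the histories come back as numpy arrays and reusing the power-transform hypothesis from above):

best_w = kml.model.get_best_param()
# apply the best parameter vector with the power-transform model
y_pred = best_w[0]*X[:,0:1] + best_w[1]*X[:,1:2] + best_w[2]*X[:,1:2]**best_w[3]
print('lowest recorded loss:', errors.min())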
Initialize the parallel engines, set the direct view block to true, and then import the required libraries to the engines.
from ipyparallel import Client
rc = Client(profile='default')
dview = rc[:]
dview.block = True
with dview.sync_imports():
    import numpy as np
    from scipy import stats
kml.use_ipyparallel(dview)
The model saves the best parameter and user-defined loss after each iteration. The model also records a history of all parameter updates. The question is how to use this data to define convergence. One possible solution is:
#Convergence algorithm
convergence = (best_parameter-np.mean(param_by_iter[-10:,:],axis=0))/(np.std(param_by_iter[-10:,:],axis=0))
if np.all(np.abs(convergence)<1):
    print('converged')
    break
The formula creates a Z-score using the last 10 parameter sets and the best parameter set. If the absolute Z-score for every parameter is less than 1, the algorithm can be said to have converged. This convergence solution works well when there is a theoretical best parameter set.
The default random sampling functions for the prior and posterior distributions can be overridden. User-defined random sampling functions must have the same parameters as the defaults. Please see the default random sampling functions below.
#initial parameter sampler (default)
def prior_sampler_uniform_distribution(weights,num_param,num_samples):
    # self.low and self.high correspond to prior_uniform_low and prior_uniform_high
    return np.random.uniform(low=self.low,high=self.high,size=(num_param,num_samples))
#multivariate normal sampler (default)
def sampler_multivariate_normal_distribution(best_param,
                                             param_by_iter,
                                             error_by_iter,
                                             parameter_update_history,
                                             random_sample_num=100):
    covariance = np.diag(np.var(parameter_update_history[:,:],axis=1))
    best = param_by_iter[np.where(error_by_iter==np.min(error_by_iter))[0]]
    mean = best.flatten()
    try:
        return np.random.multivariate_normal(mean, covariance, (random_sample_num)).T
    except:
        print(best,np.where(error_by_iter==np.min(error_by_iter)))
#intermediate sampler (default)
def intermediate_uniform_distribution(weights,num_param,num_samples):
    result = []
    for i in range(num_param):
        x = np.random.uniform(weights[i]-0.1*weights[i],weights[i]+0.1*weights[i],size=(1,num_samples)).T
        result.append(x)
    result = np.squeeze(np.array(result))
    return result
#mini batch random choice sampler (default)
def mini_batch_random_choice(X,y,batch_size):
    all_samples = np.arange(0,X.shape[0])
    rand_sample = np.random.choice(all_samples,size=batch_size,replace=False)
    X_batch = X[rand_sample]
    y_batch = y[rand_sample]
    return X_batch,y_batch
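Custom samplers are passed to the constructor shown earlier; a hedged sketch, where my_prior_sampler and my_sampler are hypothetical user-defined functions with the same signatures as the defaults above (any argument left as None falls back to the default):

kml = kernelml.KernelML(prior_sampler_fcn=my_prior_sampler,
                        sampler_fcn=my_sampler,
                        batch_size=100)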
The default parameter transform function can be overridden. The parameters can be transformed before the loss calculations and parameter updates, which is useful when some parameters must be integers, positive, or within a certain range; a custom transform is sketched after the default below.
# The default parameter transform returns the parameter set unchanged
def default_parameter_transform(w):
    # rows,columns = (parameter set,iteration)
    return w
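For example, a hedged sketch of a custom transform that forces the first parameter to be a non-negative integer and clips the second to a range (the layout of w is assumed to match the default above):

def custom_parameter_transform(w):
    # assumes the first axis indexes the parameters, as in the default transform
    w = w.copy()
    w[0] = np.round(np.abs(w[0]))   # non-negative integer
    w[1] = np.clip(w[1], -1, 1)     # bounded to [-1, 1]
    return w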