Hanhan_Data_Science_Practice

BIG DATA! - Fantastic

  • Why Spark is great?!

  • How to run Spark through terminal command line

    • Download Spark here: https://spark.apache.org/downloads.html
    • Unpack it somewhere you like. Set an environment variable so you can find it easily later (CSH: setenv SPARK_HOME /home/you/spark-1.5.1-bin-hadoop2.6/, BASH: export SPARK_HOME=/home/you/spark-1.5.1-bin-hadoop2.6/)
    • Then run ${SPARK_HOME}/bin/spark-submit --master local [your code file path] [your large data file path as input; this argument is only needed when your script reads sys.argv[1]] (see the minimal PySpark sketch after this list)
  • Automation
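
To make the spark-submit step concrete, here is a minimal PySpark sketch (the script and file names are hypothetical, not code from this repo) showing how the data file path passed on the command line maps to sys.argv[1]:

```python
# Run with: ${SPARK_HOME}/bin/spark-submit --master local word_count.py my_large_file.txt
# word_count.py and my_large_file.txt are made-up names for illustration.
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="SimpleWordCount")
    input_path = sys.argv[1]          # the data file path given after the script path
    lines = sc.textFile(input_path)   # read the large input file as an RDD
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))            # show a small sample of the word counts
    sc.stop()
```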

R PRACTICE

Note: The Spark R Notebook I am using is the Community Edition; because its R version may be lower, many packages used in R Basics are not supported.

PYTHON PRACTICE

Note: I'm using the Spark Python Notebook because my own machine could not install the right numpy version for pandas; some features there are unique.

  • Multi-Label Problem

  • Factorization Machines

    • Large datasets can be sparse; with factorization, you can extract important or hidden features
    • With a lower-dimensional dense matrix, factorization can represent a similar relationship between the target and the predictors
    • The drawback of linear regression and logistic regression is that they only learn the effect of each feature individually, instead of in combination
    • For example, you have Fields Color, Category, Temperature, and Features Pink, Ice-cream, Cold; each feature has its own values
      • Linear regression: w0 + wPink * xPink + wCold * xCold + wIce-cream * xIce-cream
      • Factorization Machines (FMs): w0 + wPink * xPink + wCold * xCold + wIce-cream * xIce-cream + dot_product(Pink, Cold) + dot_product(Pink, Ice-cream) + dot_product(Cold, Ice-cream)
        • dot-product: a.b = |a|*|b|cosθ; when θ=0, cosθ=1 and the dot product reaches its highest value. In FMs, the dot product is used to measure similarity
        • dot_product(Pink, Cold) = v(Pink1)*v(Cold1) + v(Pink2)*v(Cold2) + v(Pink3)*v(Cold3), here k=3. This is the dot product of the two features' latent vectors of size k=3 (see the numpy sketch at the end of this section)
      • Field-aware factorization Machines (FFMs)
        • I'm not quite sure what the "latent effects" mentioned in the tutorial mean so far, but FFMs are aware of the fields: instead of using dot_product(Pink, Cold) + dot_product(Pink, Ice-cream) + dot_product(Cold, Ice-cream), it uses the fields here, dot_product(Color_Pink, Temperature_Cold) + dot_product(Color_Pink, Category_Ice-cream) + dot_product(Temperature_Cold, Category_Ice-cream), i.e. Color & Temperature, Color & Category, Temperature & Category
    • xLearn library
      • Sample input (has to be this format, libsvm format): https://github.com/aksnzhy/xlearn/blob/master/demo/classification/criteo_ctr/small_train.txt
      • Detailed documentation about parameters, functions: http://xlearn-doc.readthedocs.io/en/latest/python_api.html
      • Personally, I think this library is a little bit funny. First of all, you have to do all the work of converting your data into libsvm format yourself, then ffm will do the work, such as extracting important features and doing the prediction. Not only does it work as a black box, but it also creates many output files during the validation and testing stages. You'd better run everything through the terminal, so that you can see more information during the execution. I was using IPython and totally didn't know what happened.
      • But it's fast! You can also set multi-threading in a very easy way. Check its documentation.
    • My code: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/Factorization_Machines.ipynb
      • My code is better than the reference's
    • Reference: https://www.analyticsvidhya.com/blog/2018/01/factorization-machines/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29
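
To make the FM scoring formula above concrete, here is a minimal numpy sketch of the linear part plus the pairwise interaction part; the latent vectors and feature values are made up for illustration, and this is not the xLearn internals:

```python
# Sketch of the FM example above: w0 + sum_i w_i*x_i + sum_{i<j} dot(v_i, v_j)*x_i*x_j
# The latent vectors v (size k=3), weights w and feature values x are made up.
import numpy as np

v = {
    "Pink":      np.array([0.1, 0.4, -0.2]),
    "Cold":      np.array([0.3, 0.1,  0.5]),
    "Ice-cream": np.array([-0.1, 0.2, 0.3]),
}
x = {"Pink": 1.0, "Cold": 1.0, "Ice-cream": 1.0}   # one-hot style feature values
w = {"Pink": 0.5, "Cold": -0.3, "Ice-cream": 0.2}  # linear weights
w0 = 0.1

linear_part = w0 + sum(w[f] * x[f] for f in x)

features = list(x)
pairwise_part = sum(
    np.dot(v[features[i]], v[features[j]]) * x[features[i]] * x[features[j]]
    for i in range(len(features)) for j in range(i + 1, len(features))
)

print(linear_part + pairwise_part)   # the FM score for this sample
```
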
  • RGF

    • My code: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/try_RGF.ipynb
      • It looks like the evaluation result is quite bad, even with grid search cross validation
    • Reference: https://www.analyticsvidhya.com/blog/2018/02/introductory-guide-regularized-greedy-forests-rgf-python/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29
      • The reference is missing code and lacks an evaluation step
    • RGF vs. Gradient Boosting
      • In each iteration, boosting adds weights to misclassified observations for the next base learner. In each iteration, RGF changes the forest structure by one step to minimize the logloss, and also adjusts the leaf weights of the entire forest to minimize the logloss
      • RGF searches for the optimal structure change
        • The search is within the newly created k trees (default k=1), otherwise the computation can be expensive
        • Also for computational efficiency, only do 2 types of operations:
          • split an existing leaf node
          • create a new tree
        • With the weights of all leaf nodes fixed, it will try all possible structure changes and find the one with lowest logloss
      • Weights Optimization
        • After every 100 new leaf nodes added, the weights for all nodes will be adjusted. k=100 by default
        • When k is very large, it's similar to adjusting the weights once at the end; when k is very small, it can be computationally expensive, since it's similar to adjusting all nodes' weights after each new leaf node is added
      • You don't need to set the tree size, since it is determined automatically through the logloss-minimization process. What you can set is the max number of leaf nodes and the regularization (L1 or L2); see the rgf_python sketch after this section
      • RGF may give a simpler model to train compared with boosting methods, since boosting methods require a small learning rate and a large number of estimators
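
As a minimal usage sketch (assuming the rgf_python package, installed with pip install rgf_python; the dataset and parameter values here are made up for illustration, not the settings used in try_RGF.ipynb):

```python
# RGF through rgf_python's sklearn-style wrapper; parameters are illustrative only.
from rgf.sklearn import RGFClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

rgf = RGFClassifier(
    max_leaf=400,         # max number of leaf nodes; tree size itself is not set
    algorithm="RGF_Sib",  # RGF variant with sibling regularization
    test_interval=100,    # re-optimize all leaf weights after every 100 new leaves
    l2=0.1,               # L2 regularization strength
)
rgf.fit(X_train, y_train)
print(log_loss(y_test, rgf.predict_proba(X_test)))
```
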
  • Regression Spline

    • Still, EXPLORE DATA first. When you want to try regression, check the relationship between the independent variables (features) and the dependent variable (label) first, to see whether there is a linear relationship
    • Linear Regression, a linear formula between X and y, deals with linear relationships; Polynomial Regression converts that linear formula into a polynomial one, and it can deal with non-linear relationships.
    • When we increase the power in polynomial regression, it becomes easier to overfit. Also, with a higher-degree polynomial function, the change of one y value in the training data can affect the fit of data points far away (the non-local problem).
    • Regression Spline (non-linear method)
      • It tries to overcome the problems of polynomial regression. When we apply one polynomial function to the whole dataset, it may impose a global structure on the data, so how about fitting different portions of the data with different functions?
      • It divides the dataset into multiple bins, and fits each bin with a different model
      • The points where the division occurs are called "Knots". The functions used for each bin are known as "Piecewise functions". More knots lead to more flexible piecewise functions. When there are k knots, we will have k+1 piecewise functions.
      • Piecewise Step Functions: the function remains constant within each bin
      • Piecewise Polynomials: each bin is fit with a lower-degree polynomial function. You can consider a Piecewise Step Function as a Piecewise Polynomial with degree 0
      • A piecewise polynomial of degree m with m-1 continuous derivatives is a "spline". This means:
        • Continuous plot at each knot
        • The derivatives at each knot are the same (up to order m-1)
        • Cubic and Natural Cubic Splines
          • You can try a Cubic Spline (polynomial function with degree=3) to add these constraints so that the plot is smoother. A Cubic Spline with k knots has k+4 degrees of freedom (this means there are k+4 parameters free to change)
          • Boundary knots can behave unpredictably; to smooth them out, you can use a Natural Cubic Spline
      • Choose the number and locations of knots
        • Option 1 - Place more knots where we feel the function might vary most rapidly, and fewer knots where it seems more stable
        • Option 2 - cross validation to help decide the number of knots:
          • remove a portion of data
          • fit a spline with x number of knots on the rest of the data
          • predict the removed data with the spline, and choose the number of knots that gives the smallest RMSE
      • Another method to produce splines is called “smoothing splines”. It works similarly to Ridge/Lasso regularization: it minimizes a loss function plus a penalty term that controls smoothness
    • My Code [R Version]: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/learn_splines.R (a rough Python counterpart is sketched below)
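
A rough Python counterpart of the spline exercise (not a translation of learn_splines.R; the data and knot locations are made up for illustration), assuming patsy and statsmodels:

```python
# Fit a cubic spline (degree=3) with 3 hand-picked knots -> 4 piecewise functions.
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(18, 80, 200))                      # e.g. an age-like predictor
y = 0.05 * (x - 50) ** 2 + rng.normal(0, 5, size=x.size)   # non-linear target

spline_formula = "bs(x, knots=(25, 40, 60), degree=3, include_intercept=False)"
basis = dmatrix(spline_formula, {"x": x}, return_type="dataframe")
fit = sm.GLM(y, basis).fit()

# predict on a grid by rebuilding the same basis
x_grid = np.linspace(18, 80, 100)
grid_basis = dmatrix(spline_formula, {"x": x_grid}, return_type="dataframe")
print(fit.predict(grid_basis)[:5])
```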

DIMENSIONAL REDUCTION

DATA PREPROCESSING

TREE BASED MODELS

GRAPH THEORY

Graph Theory with python networkx

Graph Lab for a basic Search Engine

Graph Database with Neo4j

  • How to install
    • Download Neo4j server: https://neo4j.com/download/
      • Extract the tar archive; it's better to use the command line to extract it, otherwise the folder might be damaged and you cannot run the later commands. Type tar -xf neo4j-enterprise-2.3.1-unix.tar.gz, changing the .tar.gz file name to the one you downloaded.
      • Set password for the first time login. Go to extracted folder, in your terminal, type:
        • If it's Mac/Linux, type bin/neo4j-admin set-initial-password [my_initial_password]; for Mac or Linux, you need single quotes around your password here.
          • NOTE: if this step fails, go to the data folder, remove the data/dbms/auth file, then try to set the password again
          • Don't remove the auth.ini file
        • If it's Windows, type bin\neo4j-admin set-initial-password [my_initial_password]; unlike Mac/Linux, you don't need single quotes around your password on Windows.
        • After setting the password, when you open http://localhost:7474/browser/ for the first time, you need to type the password to connect to the host. Otherwise queries won't be executed.
      • Run neo4j console
        • If it's Mac/Linux, type ./bin/neo4j console; if it's Windows, type bin\neo4j console
      • Open Neo4j browser, http://localhost:7474/browser/, and type your initial password to connect
      • At the top of the console, type :play movies and click "Run" button at the right side.
      • Then you will see queries you can try. Click the small "run" button at the left side of a query to run it, and you can learn how to create the dataset, how to do graph search, and other graph operations
      • Ctrl + C to stop
  • How to run Neo4j console after installation
    • In your terminal, cd to your downloaded Neo4j package from above
    • Type ./bin/neo4j console
    • Then go to http://localhost:7474/browser/ in your browser
    • The first time you use it, it will require the password
  • I tried python code here: https://github.com/neo4j-examples/movies-python-bolt
    • I'm not a big fan of this code; it's better to use the Neo4j console to get familiar with the Cypher query language first (a minimal Python driver sketch follows this section)
    • sudo pip install neo4j-driver, if you want to run python
    • pip install Flask, if you want to run python
  • Neo4j Webinar
    • Neo4j has index-free adjacency, which means that, without an index, from one node you can find other nodes through adjacency relationships, just like what we can do in a graph
    • It has a Python sample.
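
A minimal sketch of querying the movie graph from Python with the Bolt driver (assuming the neo4j-driver package installed above and the dataset created via :play movies; older driver releases import from neo4j.v1 instead of neo4j):

```python
# Query who acted in "The Matrix" from the :play movies dataset.
# Replace "my_initial_password" with the password you set earlier.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "my_initial_password"))

with driver.session() as session:
    result = session.run(
        "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: $title}) RETURN p.name AS name",
        title="The Matrix")
    for record in result:
        print(record["name"])

driver.close()
```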

Neptune - AWS Graph Database

ADVANCED TOOLS

TPOT

HungaBunga

Lazy Predict

MLBox

CatBoost

Dask - Machine Learning in Parallel

  • Dask is used for parallel processing; it's similar to Spark, but it copies parts of sklearn, numpy, pandas and Spark rather than having its own libraries like Spark does.
  • My code: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/try_dask.ipynb
    • Cannot say I'm a big fan of Dask now.
    • For data preprocessing, it's no better than using pandas and numpy, since they are much faster than dask and have more functions. I tried; it took me so much time that I finally decided to change back to numpy & pandas.
    • But you can convert the preprocessed data into dask dataframe
    • The parallel processing for machine learning didn't make me feel it's much faster than sklearn.
    • Although dask-ml mentions that it supports both sklearn grid search and dask-ml grid search, when I was using sklearn grid search it gave a large amount of errors and I could not tell what caused them (a minimal dask-ml sketch follows this section).
    • I think for a larger dataset, Spark must be faster if its machine learning library supports the methods. We can also convert a pandas dataframe to a Spark dataframe to overcome the shortage of data preprocessing functions.
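
A minimal sketch of the workflow described above (preprocess in pandas, convert to a dask dataframe, grid-search with dask-ml); the file name, column names and parameters are made up for illustration, not code from try_dask.ipynb:

```python
import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

pdf = pd.read_csv("preprocessed_train.csv")     # hypothetical preprocessed file
ddf = dd.from_pandas(pdf, npartitions=4)        # pandas -> dask dataframe

X = ddf.drop("label", axis=1).values.compute()  # bring arrays back for the fit
y = ddf["label"].values.compute()

param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=3)
search.fit(X, y)                                # dask-ml parallelizes the search
print(search.best_params_)
```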

Featuretools - Basic Auto Feature Engineering

CLOUD for DATA SCIENCE

KAGGLE PRACTICE

-- Notes

  • Dimensional Reduction: I tried the FAMD model first, since it is supposed to handle a mix of categorical and numerical data, but my laptop didn't have enough memory to finish it. Then I changed to PCA, but needed to convert the categorical data into numerical data myself first. After running PCA, it showed that the first 150-180 columns contain the major info of the data.
  • About FAMD: FAMD is a principal component method dedicated to exploring data with both continuous and categorical variables. It can be seen roughly as a mix between PCA and MCA. More precisely, the continuous variables are scaled to unit variance and the categorical variables are transformed into a disjunctive data table (crisp coding) and then scaled using the specific scaling of MCA. This ensures balancing the influence of both continuous and categorical variables in the analysis. It means that both types of variables are on an equal footing to determine the dimensions of variability. This method allows one to study the similarities between individuals taking into account mixed variables and to study the relationships between all the variables. It also provides graphical outputs such as the representation of the individuals, the correlation circle for the continuous variables, representations of the categories of the categorical variables, and also specific graphs to visualize the associations between both types of variables. https://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf
  • The predictive analysis part in the R code is slow for SVM and NN on my laptop (50GB disk space available), even though 150 features have been chosen from the 228 features
  • Spark Python is much faster, but you need to convert the .csv file data into LabeledPoint for the training data and SparseVector for the testing data (see the sketch after this list).
  • In my Spark Python code, I have tried SVM with SGD, Logistic Regression with SGD and Logistic Regression with LBFGS, but when I tuned the parameters for SVM and Logistic Regression with SGD, they always returned an empty list which should show the people who will buy insurance. Logistic Regression with LBFGS gives better results.
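
A minimal sketch of the LabeledPoint / SparseVector conversion mentioned above, using the older Spark MLlib RDD API (file names and column layout are made up for illustration; this is not the actual Kaggle code):

```python
# .csv rows -> LabeledPoint for training, SparseVector for testing.
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="InsurancePrediction")

def to_labeled_point(line):
    # assumes the label is in the last column and features are in the other columns
    values = [float(v) for v in line.split(",")]
    features, label = values[:-1], values[-1]
    nonzero = {i: v for i, v in enumerate(features) if v != 0.0}
    return LabeledPoint(label, SparseVector(len(features), nonzero))

def to_sparse_vector(line):
    values = [float(v) for v in line.split(",")]
    return SparseVector(len(values), {i: v for i, v in enumerate(values) if v != 0.0})

train_rdd = sc.textFile("train.csv").map(to_labeled_point)
model = LogisticRegressionWithLBFGS.train(train_rdd)

test_rdd = sc.textFile("test.csv").map(to_sparse_vector)
print(model.predict(test_rdd).take(10))   # predicted labels for the first 10 test rows
sc.stop()
```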

OTHER