
6. Dataset transformations

scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.

Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
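A minimal sketch of this fit/transform pattern, using StandardScaler as the normalization example (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
X_test = np.array([[2.0, 4.0]])

scaler = StandardScaler()
scaler.fit(X_train)            # learns per-feature mean and standard deviation from the training set
print(scaler.mean_)            # the parameters estimated during fit

X_test_scaled = scaler.transform(X_test)        # applies the learned transformation to unseen data
X_train_scaled = scaler.fit_transform(X_train)  # fits and transforms the training data in one call
```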

Combining such transformers, either in parallel or in series, is covered in Pipelines and composite estimators. Pairwise metrics, Affinities and Kernels covers transforming feature spaces into affinity matrices, while Transforming the prediction target (y) considers transformations of the target space (e.g. categorical labels) for use in scikit-learn.
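As a hedged sketch of chaining transformers in series with a final estimator (the step names and random data are illustrative assumptions, not part of the original page):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = rng.randint(0, 2, size=20)

# Each step's transform output feeds the next step; only the final step predicts.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=2)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X[:3]))
```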

6.1. Pipelines and composite estimators

6.1.1. Pipeline: chaining estimators

6.1.2. Transforming target in regression

6.1.3. FeatureUnion: composite feature spaces

6.1.4. ColumnTransformer for heterogeneous data

6.1.5. Visualizing Composite Estimators

6.2. Feature extraction

6.2.1. Loading features from dicts

6.2.2. Feature hashing

6.2.3. Text feature extraction

6.2.4. Image feature extraction

6.3. Preprocessing data

6.3.1. Standardization, or mean removal and variance scaling -> sergiomora03

6.3.2. Non-linear transformation -> abdala9512

6.3.3. Normalization

6.3.4. Encoding categorical features

6.3.5. Discretization

6.3.6. Imputation of missing values

6.3.7. Generating polynomial features

6.3.8. Custom transformers

6.4. Imputation of missing values

6.4.1. Univariate vs. Multivariate Imputation

6.4.2. Univariate feature imputation

6.4.3. Multivariate feature imputation

6.4.4. References

6.4.5. Nearest neighbors imputation

6.4.6. Marking imputed values

6.5. Unsupervised dimensionality reduction

6.5.1. PCA: principal component analysis

6.5.2. Random projections

6.5.3. Feature agglomeration

6.6. Random Projection

6.6.1. The Johnson-Lindenstrauss lemma

6.6.2. Gaussian random projection

6.6.3. Sparse random projection

6.7. Kernel Approximation

6.7.1. Nystroem Method for Kernel Approximation

6.7.2. Radial Basis Function Kernel

6.7.3. Additive Chi Squared Kernel

6.7.4. Skewed Chi Squared Kernel

6.7.5. Mathematical Details

6.8. Pairwise metrics, Affinities and Kernels

6.8.1. Cosine similarity

6.8.2. Linear kernel

6.8.3. Polynomial kernel

6.8.4. Sigmoid kernel

6.8.5. RBF kernel

6.8.6. Laplacian kernel

6.8.7. Chi-squared kernel

6.9. Transforming the prediction target (y)

6.9.1. Label binarization

6.9.2. Label encoding

Source: https://scikit-learn.org/stable/data_transforms.html
