diff --git a/README.rst b/README.rst
index 07a5a7017..757866897 100644
--- a/README.rst
+++ b/README.rst
@@ -44,22 +44,18 @@ Installation from source
 It is possible to install the latest version of the package, available in the
 develop branch, by cloning this repository and doing a manual installation.
 
-.. code::
+.. code:: bash
 
     git clone https://github.com/GAA-UAM/scikit-fda.git
-    cd scikit-fda/
-    pip install -r requirements.txt # Install dependencies
-    python setup.py install
+    pip install ./scikit-fda
 
 Make sure that your default Python version is currently supported, or change
 the python and pip commands by specifying a version, such as ``python3.6``:
 
-.. code::
+.. code:: bash
 
     git clone https://github.com/GAA-UAM/scikit-fda.git
-    cd scikit-fda/
-    python3.6 -m pip install -r requirements.txt # Install dependencies
-    python3.6 setup.py install
+    python3.6 -m pip install ./scikit-fda
 
 Requirements
 ------------
@@ -88,11 +84,11 @@ The people involved at some point in the development of the package can be
 found in the `contributors file `_.
 
-Citation
-========
-If you find this project useful, please cite:
+.. Citation
+   ========
+   If you find this project useful, please cite:
 
-.. todo:: Include citation to scikit-fda paper.
+   .. todo:: Include citation to scikit-fda paper.
 
 License
 =======
diff --git a/docs/conf.py b/docs/conf.py
index 18a8d6170..177af1566 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -17,9 +17,6 @@
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
 #
-# import os
-# import sys
-# sys.path.insert(0, '/home/miguel/Desktop/fda/fda')
 
 import os
 import sys
@@ -79,7 +76,8 @@
 
 # General information about the project.
 project = 'scikit-fda'
-copyright = '2017, Author'
+copyright = ('2019, Grupo de Aprendizaje Automático - ' +
+             'Universidad Autónoma de Madrid')
 author = 'Author'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
diff --git a/docs/index.rst b/docs/index.rst
index f5f999aff..f451e872b 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -5,18 +5,78 @@
 Welcome to scikit-fda's documentation!
 ======================================
 
+This package provides classes, methods and functions to support Functional
+Data Analysis in Python. It includes a wide range of tools for working with
+functional data and their representation, as well as for exploratory analysis
+and preprocessing, among other tasks such as inference, classification,
+regression or clustering of functional data.
+
+On the `project page `_ hosted by
+GitHub you can find more information related to the development of the package.
+
+
 .. toctree::
-   :includehidden:
-   :maxdepth: 4
+   :maxdepth: 2
    :caption: Contents:
    :titlesonly:
 
    apilist
+
+
+.. toctree::
+   :maxdepth: 1
+   :titlesonly:
+
    auto_examples/index
 
-Indices and tables
-==================
+An exhaustive list of all the contents of the package can be found in the
+:ref:`genindex`.
+
+Installation
+------------
+
+Currently, scikit-fda is available in Python 3.6 and 3.7, regardless of the
+platform. The stable version can be installed via
+`PyPI `_:
+
+.. code-block:: bash
+
+   pip install scikit-fda
+
+
+It is possible to install the latest version of the package, available in
+the develop branch, by cloning this repository and doing a manual installation.
+
+.. code-block:: bash
+
+   git clone https://github.com/GAA-UAM/scikit-fda.git
+   pip install ./scikit-fda
+
+
+For this type of installation, make sure that your default Python version is
+currently supported, or change the python and pip commands by specifying a
+version, such as python3.6.
+
+
+Contributions
+-------------
+
+All contributions are welcome. You can help this project grow in multiple ways,
+from creating an issue, reporting an improvement or a bug, to forking the
+repository and creating a pull request against the development branch.
+
+The people involved at some point in the development of the package can be
+found in the `contributors file `_.
+
+.. Citation
+   --------
+   If you find this project useful, please cite:
+
+   .. todo:: Include citation to scikit-fda paper.
+
+License
+-------
 
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
+The package is licensed under the BSD 3-Clause License. A copy of the
+`license `_
+can be found along with the code or in the project page.
diff --git a/docs/modules/datasets.rst b/docs/modules/datasets.rst
index 6946376c9..4121e988d 100644
--- a/docs/modules/datasets.rst
+++ b/docs/modules/datasets.rst
@@ -17,6 +17,7 @@ The following functions are used to retrieve specific functional datasets:
    skfda.datasets.fetch_medflies
    skfda.datasets.fetch_weather
    skfda.datasets.fetch_aemet
+   skfda.datasets.fetch_octane
 
 Those functions return a dictionary with at least a "data" field containing the
 instance data, and a "target" field containing the class labels or regression values,
diff --git a/docs/modules/exploratory/outliers.rst b/docs/modules/exploratory/outliers.rst
index 290a1e377..ef79c6367 100644
--- a/docs/modules/exploratory/outliers.rst
+++ b/docs/modules/exploratory/outliers.rst
@@ -4,12 +4,15 @@ Outlier detection
 
 Functional outlier detection is the identification of functions that do not seem to behave like the others in the
 dataset. There are several ways in which a function may be different from the others. For example, a function may
 have a different shape than the others, or its values could be more extreme. Thus, outlyingness is difficult to
-categorize exactly as each outlier detection method looks at different features of the functions in order to 
+categorize exactly as each outlier detection method looks at different features of the functions in order to
 identify the outliers.
 
 Each of the outlier detection methods in scikit-fda has the same API as the outlier detection methods of
 `scikit-learn `_.
 
+Interquartile Range Outlier Detector
+------------------------------------
+
 One of the most common ways of outlier detection is given by the functional data boxplot. An observation is marked
 as an outlier if it has points :math:`1.5 \cdot IQR` times outside the region containing the deepest 50% of the curves
 (the central region), where :math:`IQR` is the interquartilic range.
@@ -18,7 +21,11 @@ as an outlier if it has points :math:`1.5 \cdot IQR` times outside the region co
 .. autosummary::
    :toctree: autosummary
 
    skfda.exploratory.outliers.IQROutlierDetector
-   
+
+
+DirectionalOutlierDetector
+--------------------------
+
 Other more novel way of outlier detection takes into account the magnitude and shape of the curves. Curves which have
 a very different shape or magnitude are considered outliers.
 
@@ -26,11 +33,11 @@
    :toctree: autosummary
 
    skfda.exploratory.outliers.DirectionalOutlierDetector
-   
+
 For this method, it is necessary to compute the mean and variation of the
 directional outlyingness, which can be done with the following function.
 
 .. autosummary::
    :toctree: autosummary
 
-   skfda.exploratory.outliers.directional_outlyingness_stats
\ No newline at end of file
+   skfda.exploratory.outliers.directional_outlyingness_stats
diff --git a/skfda/_neighbors/base.py b/skfda/_neighbors/base.py
index 5e73364cd..499d18cb8 100644
--- a/skfda/_neighbors/base.py
+++ b/skfda/_neighbors/base.py
@@ -97,11 +97,11 @@ def multivariate_metric(x, y, _check=False, **kwargs):
 
 class NeighborsBase(ABC, BaseEstimator):
     """Base class for nearest neighbors estimators."""
 
-    @abstractmethod
     def __init__(self, n_neighbors=None, radius=None,
                  weights='uniform', algorithm='auto', leaf_size=30,
                  metric='l2', metric_params=None, n_jobs=None,
                  multivariate_metric=False):
+        """Initialize the nearest neighbors estimator."""
 
         self.n_neighbors = n_neighbors
         self.radius = radius
@@ -166,6 +166,7 @@ def fit(self, X, y=None):
                 metric = lp_distance
             else:
                 metric = self.metric
+
             sklearn_metric = _to_multivariate_metric(metric,
                                                      self._sample_points)
         else:
@@ -203,7 +204,7 @@ def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
                 Indices of the nearest points in the population matrix.
 
         Examples:
-            Firstly, we will create a toy dataset with 2 classes
+            Firstly, we will create a toy dataset.
 
             >>> from skfda.datasets import make_sinusoidal_process
             >>> fd1 = make_sinusoidal_process(phase_std=.25, random_state=0)
@@ -260,7 +261,7 @@ def kneighbors_graph(self, X=None, n_neighbors=None, mode='connectivity'):
                 A[i, j] is assigned the weight of edge that connects i to j.
 
         Examples:
-            Firstly, we will create a toy dataset with 2 classes.
+            Firstly, we will create a toy dataset.
 
             >>> from skfda.datasets import make_sinusoidal_process
             >>> fd1 = make_sinusoidal_process(phase_std=.25, random_state=0)
@@ -329,7 +330,7 @@ def radius_neighbors(self, X=None, radius=None, return_distance=True):
             within a ball of size ``radius`` around the query points.
 
         Examples:
-            Firstly, we will create a toy dataset with 2 classes.
+            Firstly, we will create a toy dataset.
 
             >>> from skfda.datasets import make_sinusoidal_process
             >>> fd1 = make_sinusoidal_process(phase_std=.25, random_state=0)
diff --git a/skfda/_neighbors/classification.py b/skfda/_neighbors/classification.py
index c8f63482d..228ea4e2a 100644
--- a/skfda/_neighbors/classification.py
+++ b/skfda/_neighbors/classification.py
@@ -59,8 +59,9 @@ class KNeighborsClassifier(NeighborsBase, NeighborsMixin, KNeighborsMixin,
         Doesn't affect :meth:`fit` method.
     multivariate_metric : boolean, optional (default = False)
         Indicates if the metric used is a sklearn distance between vectors (see
-        :class:`sklearn.neighbors.DistanceMetric`) or a functional metric of
-        the module :mod:`skfda.misc.metrics`.
+        :class:`~sklearn.neighbors.DistanceMetric`) if ``True``, or a
+        functional metric of the module :mod:`skfda.misc.metrics` if ``False``.
+
 
     Examples
     --------
     Firstly, we will create a toy dataset with 2 classes
@@ -96,6 +97,7 @@ class KNeighborsClassifier(NeighborsBase, NeighborsMixin, KNeighborsMixin,
     :class:`~skfda.ml.regression.KNeighborsRegressor`
     :class:`~skfda.ml.regression.RadiusNeighborsRegressor`
     :class:`~skfda.ml.clustering.NearestNeighbors`
+
 
     Notes
     -----
@@ -254,6 +256,7 @@ class RadiusNeighborsClassifier(NeighborsBase, NeighborsMixin,
     :class:`~skfda.ml.regression.RadiusNeighborsRegressor`
     :class:`~skfda.ml.clustering.NearestNeighbors`
 
+
     Notes
     -----
     See Nearest Neighbors in the sklearn online documentation for a discussion
@@ -358,6 +361,7 @@ class and return a :class:`FData` object with only one sample
     :class:`~skfda.ml.regression.RadiusNeighborsRegressor`
     :class:`~skfda.ml.clustering.NearestNeighbors`
 
+
     """
 
     def __init__(self, metric='l2', mean='mean'):
diff --git a/skfda/_neighbors/outlier.py b/skfda/_neighbors/outlier.py
new file mode 100644
index 000000000..8ce41cb49
--- /dev/null
+++ b/skfda/_neighbors/outlier.py
@@ -0,0 +1,361 @@
+
+
+from sklearn.base import OutlierMixin
+from .base import (NeighborsBase, NeighborsMixin, KNeighborsMixin,
+                   _to_multivariate_metric)
+
+from ..misc.metrics import lp_distance
+
+
+class LocalOutlierFactor(NeighborsBase, NeighborsMixin, KNeighborsMixin,
+                         OutlierMixin):
+    """Unsupervised Outlier Detection.
+
+    Unsupervised Outlier Detection using Local Outlier Factor (LOF).
+
+    The anomaly score of each sample is called Local Outlier Factor.
+    It measures the local deviation of density of a given sample with
+    respect to its neighbors.
+
+    It is local in that the anomaly score depends on how isolated the object
+    is with respect to the surrounding neighborhood.
+
+    More precisely, locality is given by k-nearest neighbors, whose distance
+    is used to estimate the local density.
+
+    By comparing the local density of a sample to the local densities of
+    its neighbors, one can identify samples that have a substantially lower
+    density than their neighbors. These are considered outliers.
+
+    Parameters
+    ----------
+    n_neighbors : int, optional (default=20)
+        Number of neighbors to use by default for :meth:`kneighbors` queries.
+        If n_neighbors is larger than the number of samples provided,
+        all samples will be used.
+    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
+        Algorithm used to compute the nearest neighbors:
+
+        - 'ball_tree' will use :class:`BallTree`
+        - 'kd_tree' will use :class:`KDTree`
+        - 'brute' will use a brute-force search.
+        - 'auto' will attempt to decide the most appropriate algorithm
+          based on the values passed to :meth:`fit` method.
+
+    leaf_size : int, optional (default=30)
+        Leaf size passed to :class:`BallTree` or :class:`KDTree`. This can
+        affect the speed of the construction and query, as well as the memory
+        required to store the tree. The optimal value depends on the
+        nature of the problem.
+    metric : string or callable (default
+        :func:`lp_distance `)
+        The distance metric to use for the tree. The default metric is
+        the L2 distance. See the documentation of the metrics module
+        for a list of available metrics.
+    metric_params : dict, optional (default=None)
+        Additional keyword arguments for the metric function.
+    contamination : float in (0., 0.5), optional (default='auto')
+        The amount of contamination of the data set, i.e. the proportion
+        of outliers in the data set. When fitting, this is used to define the
+        threshold on the decision function. If "auto", the decision function
If "auto", the decision function + threshold is determined as in the original paper [BKNS2000]_. + novelty : boolean, default False + By default, LocalOutlierFactor is only meant to be used for outlier + detection (novelty=False). Set novelty to True if you want to use + LocalOutlierFactor for novelty detection. In this case be aware that + that you should only use predict, decision_function and score_samples + on new unseen data and not on the training set. + n_jobs : int or None, optional (default=None) + The number of parallel jobs to run for neighbors search. + ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. + ``-1`` means using all processors. See :term:`Glossary ` + for more details. + Affects only :meth:`kneighbors` and :meth:`kneighbors_graph` methods. + multivariate_metric : boolean, optional (default = False) + Indicates if the metric used is a sklearn distance between vectors (see + :class:`~sklearn.neighbors.DistanceMetric`) or a functional metric of + the module `skfda.misc.metrics` if ``False``. + + Attributes + ---------- + negative_outlier_factor_ : numpy array, shape (n_samples,) + The opposite LOF of the training samples. The higher, the more normal. + Inliers tend to have a LOF score close to 1 + (``negative_outlier_factor_`` close to -1), while outliers tend to have + a larger LOF score. + The local outlier factor (LOF) of a sample captures its + supposed 'degree of abnormality'. + It is the average of the ratio of the local reachability density of + a sample and those of its k-nearest neighbors. + n_neighbors_ : integer + The actual number of neighbors used for :meth:`kneighbors` queries. + offset_ : float + Offset used to obtain binary labels from the raw scores. + Observations having a negative_outlier_factor smaller than `offset_` + are detected as abnormal. + The offset is set to -1.5 (inliers score around -1), except when a + contamination parameter different than "auto" is provided. In that + case, the offset is defined in such a way we obtain the expected + number of outliers in training. + + Examples: + + **Local Outlier Factor (LOF) for outlier detection**. + + >>> from skfda._neighbors.outlier import LocalOutlierFactor + + Creation of simulated dataset with 2 outliers to be used with LOF. + + >>> from skfda.datasets import make_sinusoidal_process + >>> fd_clean = make_sinusoidal_process(n_samples=25, error_std=0, + ... phase_std=0.1, random_state=0) + >>> fd_outliers = make_sinusoidal_process( + ... n_samples=2, error_std=0, phase_mean=0.5, random_state=5) + >>> fd = fd_outliers.concatenate(fd_clean) # Dataset with 2 outliers + + Detection of outliers with LOF. + + >>> lof = LocalOutlierFactor() + >>> is_outlier = lof.fit_predict(fd) + >>> is_outlier # -1 for anomalies/outliers and +1 for inliers + array([-1, -1, 1, 1, 1, 1, 1, 1, ..., 1, 1, 1, 1]) + + The negative outlier factor stored. + + >>> lof.negative_outlier_factor_.round(2) + array([-7.11, -1.54, -1. , -0.99, ..., -0.97, -1. , -0.99]) + + **Novelty detection with LOF**. + + Creation of a dataset without outliers. + + >>> fd_train = make_sinusoidal_process(n_samples=25, error_std=0, + ... phase_std=0.1, random_state=9) + + Fit of LOF using the dataset without outliers. + + >>> lof = LocalOutlierFactor(novelty=True) + >>> lof.fit(fd_train) + LocalOutlierFactor(algorithm='auto', ..., novelty=True) + + Detection of annomalies for new samples. 
+
+        >>> lof.predict(fd)  # Predict with samples not used in fit
+        array([-1, -1,  1,  1,  1,  1,  1,  1, ...,  1,  1,  1,  1])
+
+
+    References
+    ----------
+    .. [BKNS2000] Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander,
+       J. (2000, May). LOF: identifying density-based local outliers. In ACM
+       SIGMOD Record.
+
+    Notes
+    -----
+    This estimator wraps the scikit-learn class
+    :class:`~sklearn.neighbors.LocalOutlierFactor` employing functional
+    metrics and data instead of the multivariate ones.
+
+    See also
+    --------
+    :class:`~skfda.ml.classification.KNeighborsClassifier`
+    :class:`~skfda.ml.classification.RadiusNeighborsClassifier`
+    :class:`~skfda.ml.classification.NearestCentroids`
+    :class:`~skfda.ml.regression.KNeighborsRegressor`
+    :class:`~skfda.ml.regression.RadiusNeighborsRegressor`
+    :class:`~skfda.ml.clustering.NearestNeighbors`
+    """
+
+    def __init__(self, n_neighbors=20, algorithm='auto',
+                 leaf_size=30, metric='l2', metric_params=None,
+                 contamination='auto', novelty=False,
+                 n_jobs=1, multivariate_metric=False):
+        """Initialize the Local Outlier Factor estimator."""
+
+        super().__init__(n_neighbors=n_neighbors, algorithm=algorithm,
+                         leaf_size=leaf_size, metric=metric,
+                         metric_params=metric_params, n_jobs=n_jobs,
+                         multivariate_metric=multivariate_metric)
+        self.contamination = contamination
+        self.novelty = novelty
+
+    def _init_estimator(self, sklearn_metric):
+        """Initialize the sklearn nearest neighbors estimator.
+
+        Args:
+            sklearn_metric (pyfunc or 'precomputed'): Metric compatible with
+                the sklearn API, or matrix (n_samples, n_samples) with
+                precomputed distances.
+
+        Returns:
+            Initialized sklearn LocalOutlierFactor estimator.
+
+        """
+        from sklearn.neighbors import LocalOutlierFactor as _LocalOutlierFactor
+
+        return _LocalOutlierFactor(
+            n_neighbors=self.n_neighbors, algorithm=self.algorithm,
+            leaf_size=self.leaf_size, metric=sklearn_metric,
+            metric_params=self.metric_params, contamination=self.contamination,
+            novelty=self.novelty, n_jobs=self.n_jobs)
+
+    def _store_fit_data(self):
+        """Store the parameters created during the fit."""
+        self.negative_outlier_factor_ = self.estimator_.negative_outlier_factor_
+        self.n_neighbors_ = self.estimator_.n_neighbors_
+        self.offset_ = self.estimator_.offset_
+
+    def fit(self, X, y=None):
+        """Fit the model using X as training data.
+
+        Parameters
+        ----------
+        X : :class:`~skfda.FDataGrid` or array_like
+            Training data. FDataGrid containing the samples,
+            or array with shape [n_samples, n_samples] if metric='precomputed'.
+        y : Ignored
+            Not used, present for API consistency by convention.
+
+        Returns
+        -------
+        self : object
+        """
+        super().fit(X, y)
+        self._store_fit_data()
+
+        return self
+
+    def predict(self, X=None):
+        """Predict the labels (1 inlier, -1 outlier) of X according to LOF.
+
+        This method generalizes prediction to *new observations* (not
+        in the training set). It is only available for novelty detection
+        (when novelty is set to True).
+
+        If X is None, returns the same as fit_predict(X_train).
+
+        Parameters
+        ----------
+        X : :class:`~skfda.FDataGrid` or array_like
+            FDataGrid containing the query sample or samples to compute the
+            Local Outlier Factor w.r.t. the training samples, or array with
+            the distances to the training samples if metric='precomputed'.
+
+        Returns
+        -------
+        is_inlier : array, shape (n_samples,)
+            Returns -1 for anomalies/outliers and +1 for inliers.
+        """
+        self._check_is_fitted()
+        X_multivariate = self._transform_to_multivariate(X)
+
+        return self.estimator_.predict(X_multivariate)
+
+    def fit_predict(self, X, y=None):
+        """Fit the model to the training set X and return the labels.
+
+        Label is 1 for an inlier and -1 for an outlier according to the LOF
+        score and the contamination parameter.
+
+        Parameters
+        ----------
+        X : :class:`~skfda.FDataGrid` or array_like
+            Training data. FDataGrid containing the samples,
+            or array with shape [n_samples, n_samples] if metric='precomputed'.
+        y : Ignored
+            Not used, present for API consistency by convention.
+
+        Returns
+        -------
+        is_inlier : array, shape (n_samples,)
+            Returns -1 for anomalies/outliers and 1 for inliers.
+        """
+        # In this estimator fit_predict cannot be wrapped as fit().predict()
+
+        if self.metric == 'precomputed':
+            self.estimator_ = self._init_estimator(self.metric)
+            res = self.estimator_.fit_predict(X, y)
+        else:
+            self._sample_points = X.sample_points
+            self._shape = X.data_matrix.shape[1:]
+
+            if not self.multivariate_metric:
+                # Constructs sklearn metric to manage vector
+                if self.metric == 'l2':
+                    metric = lp_distance
+                else:
+                    metric = self.metric
+                sklearn_metric = _to_multivariate_metric(metric,
+                                                         self._sample_points)
+            else:
+                sklearn_metric = self.metric
+
+            self.estimator_ = self._init_estimator(sklearn_metric)
+            X_multivariate = self._transform_to_multivariate(X)
+            res = self.estimator_.fit_predict(X_multivariate, y)
+
+        self._store_fit_data()
+
+        return res
+
+    def decision_function(self, X):
+        """Shifted opposite of the Local Outlier Factor of X.
+
+        Bigger is better, i.e. large values correspond to inliers.
+        The shift offset allows a zero threshold for being an outlier.
+        Only available for novelty detection (when novelty is set to True).
+        The argument X is supposed to contain *new data*: if X contains a
+        point from training, it considers the latter in its own neighborhood.
+        Also, the samples in X are not considered in the neighborhood of any
+        point.
+
+        Parameters
+        ----------
+        X : :class:`~skfda.FDataGrid` or array_like
+            FDataGrid containing the query sample or samples to compute the
+            Local Outlier Factor w.r.t. the training samples.
+
+        Returns
+        -------
+        shifted_opposite_lof_scores : array, shape (n_samples,)
+            The shifted opposite of the Local Outlier Factor of each input
+            sample. The lower, the more abnormal. Negative scores represent
+            outliers, positive scores represent inliers.
+        """
+        self._check_is_fitted()
+        X_multivariate = self._transform_to_multivariate(X)
+
+        return self.estimator_.decision_function(X_multivariate)
+
+    def score_samples(self, X):
+        """Opposite of the Local Outlier Factor of X.
+
+        The opposite is returned so that bigger is better, i.e. large values
+        correspond to inliers.
+
+        Only available for novelty detection (when novelty is set to True).
+        The argument X is supposed to contain *new data*: if X contains a
+        point from training, it considers the latter in its own neighborhood.
+        Also, the samples in X are not considered in the neighborhood of any
+        point.
+
+        The score_samples on training data is available by considering the
+        ``negative_outlier_factor_`` attribute.
+
+        Parameters
+        ----------
+        X : :class:`~skfda.FDataGrid` or array_like
+            FDataGrid containing the query sample or samples to compute the
+            Local Outlier Factor w.r.t. the training samples.
+
+        Returns
+        -------
+        opposite_lof_scores : array, shape (n_samples,)
+            The opposite of the Local Outlier Factor of each input sample.
+            The lower, the more abnormal.
+        """
+        self._check_is_fitted()
+        X_multivariate = self._transform_to_multivariate(X)
+
+        return self.estimator_.score_samples(X_multivariate)
diff --git a/skfda/_neighbors/regression.py b/skfda/_neighbors/regression.py
index 8300215ee..715d87935 100644
--- a/skfda/_neighbors/regression.py
+++ b/skfda/_neighbors/regression.py
@@ -111,6 +111,7 @@ class KNeighborsRegressor(NeighborsBase, NeighborsRegressorMixin,
     :class:`~skfda.ml.regression.RadiusNeighborsRegressor`
     :class:`~skfda.ml.clustering.NearestNeighbors`
 
+
     Notes
     -----
     See Nearest Neighbors in the sklearn online documentation for a discussion
@@ -280,6 +281,7 @@ class RadiusNeighborsRegressor(NeighborsBase, NeighborsRegressorMixin,
     :class:`~skfda.ml.regression.KNeighborsRegressor`
     :class:`~skfda.ml.clustering.NearestNeighbors`
 
+
     Notes
     -----
     See Nearest Neighbors in the sklearn online documentation for a discussion
diff --git a/skfda/_neighbors/unsupervised.py b/skfda/_neighbors/unsupervised.py
index 9e2fbee1a..b786cd425 100644
--- a/skfda/_neighbors/unsupervised.py
+++ b/skfda/_neighbors/unsupervised.py
@@ -88,6 +88,7 @@ class NearestNeighbors(NeighborsBase, NeighborsMixin, KNeighborsMixin,
     :class:`~skfda.ml.regression.KNeighborsRegressor`
     :class:`~skfda.ml.regression.RadiusNeighborsRegressor`
 
+
     Notes
     -----
     See Nearest Neighbors in the sklearn online documentation for a discussion
diff --git a/skfda/datasets/__init__.py b/skfda/datasets/__init__.py
index ec3dcc9ab..c2e84fc5a 100644
--- a/skfda/datasets/__init__.py
+++ b/skfda/datasets/__init__.py
@@ -2,7 +2,8 @@
                              fetch_ucr, fetch_phoneme, fetch_growth,
                              fetch_tecator, fetch_medflies,
-                             fetch_weather, fetch_aemet)
+                             fetch_weather, fetch_aemet,
+                             fetch_octane)
 from ._samples_generators import (make_gaussian_process,
                                   make_sinusoidal_process,
                                   make_multimodal_samples,
diff --git a/skfda/datasets/_real_datasets.py b/skfda/datasets/_real_datasets.py
index ca5767837..d51ded976 100644
--- a/skfda/datasets/_real_datasets.py
+++ b/skfda/datasets/_real_datasets.py
@@ -531,3 +531,71 @@ def fetch_aemet(return_X_y: bool = False):
 
 if hasattr(fetch_aemet, "__doc__"):  # docstrings can be stripped off
     fetch_aemet.__doc__ += _aemet_descr + _param_descr
+
+
+_octane_descr = """
+    Near infrared (NIR) spectra of gasoline samples, with wavelengths ranging
+    from 1102nm to 1552nm with measurements every two nm.
+    This dataset contains six outlying samples, to which ethanol (an additive
+    required in some states) was added. See [RDEH2006]_ and [HuRS2015]_ for
+    further details.
+
+    The data are labeled according to this difference in composition.
+
+    Source:
+        Esbensen K. (2001). Multivariate data analysis in practice. 5th edn.
+        Camo Software, Trondheim, Norway.
+
+    References:
+        .. [RDEH2006] Rousseeuw, Peter & Debruyne, Michiel & Engelen, Sanne &
+            Hubert, Mia. (2006). Robustness and Outlier Detection in
+            Chemometrics. Critical Reviews in Analytical Chemistry. 36.
+            221-242. 10.1080/10408340600969403.
+        .. [HuRS2015] Hubert, Mia & Rousseeuw, Peter & Segaert, Pieter. (2015).
+            Multivariate functional outlier detection. Statistical Methods and
+            Applications. 24. 177-202. 10.1007/s10260-015-0297-8.
+
+"""
+
+
+def fetch_octane(return_X_y: bool = False):
+    """Load near infrared spectra of gasoline samples.
+
+    This function fetches the octane dataset from the R package 'mrfDepth'
+    from CRAN.
+
+    """
+    DESCR = _octane_descr
+
+    # octane file from the mrfDepth R package
+    raw_dataset = fetch_cran("octane", "mrfDepth", version="1.0.11")
+    data = raw_dataset['octane'][..., 0].T
+
+    # The R package only stores the values of the curves, but the paper
+    # describes the rest of the data. According to [RDEH2006], Section 5.4:
+
+    # "wavelengths ranging from 1102nm to 1552nm with measurements every two
+    # nm."
+    sample_points = np.linspace(1102, 1552, 226)
+
+    # "The octane data set contains six outliers (25, 26, 36–39) to which
+    # alcohol was added".
+    target = np.zeros(len(data), dtype=int)
+    target[24] = target[25] = target[35:39] = 1  # Outliers
+
+    axes_labels = ["wavelength (nm)", "absorbances"]
+
+    curves = FDataGrid(data,
+                       sample_points=sample_points,
+                       dataset_label="Octane",
+                       axes_labels=axes_labels)
+
+    if return_X_y:
+        return curves, target
+    else:
+        return {"data": curves,
+                "target": target,
+                "target_names": ['inlier', 'outlier'],
+                "DESCR": DESCR}
+
+
+if hasattr(fetch_octane, "__doc__"):  # docstrings can be stripped off
+    fetch_octane.__doc__ += _octane_descr + _param_descr
diff --git a/tests/test_neighbors.py b/tests/test_neighbors.py
index 98199da0e..60dffc190 100644
--- a/tests/test_neighbors.py
+++ b/tests/test_neighbors.py
@@ -3,7 +3,7 @@
 import unittest
 
 import numpy as np
-from skfda.datasets import make_multimodal_samples
+from skfda.datasets import make_multimodal_samples, make_sinusoidal_process
 from skfda.exploratory.stats import mean as l2_mean
 from skfda.misc.metrics import lp_distance, pairwise_distance
 from skfda.ml.classification import (KNeighborsClassifier,
@@ -11,6 +11,8 @@
                                      NearestCentroids)
 from skfda.ml.clustering import NearestNeighbors
 from skfda.ml.regression import KNeighborsRegressor, RadiusNeighborsRegressor
+# from skfda.exploratory.outliers import LocalOutlierFactor
+from skfda._neighbors.outlier import LocalOutlierFactor  # Pending theory
 from skfda.representation.basis import Fourier
 
 
@@ -41,6 +43,13 @@ def setUp(self):
 
         self.probs = np.array(15 * [[1., 0.]] + 15 * [[0., 1.]])[idx]
 
+        # Dataset with outliers
+        fd_clean = make_sinusoidal_process(n_samples=25, error_std=0,
+                                           phase_std=0.1, random_state=0)
+        fd_outliers = make_sinusoidal_process(n_samples=2, error_std=0,
+                                              phase_mean=0.5, random_state=5)
+        self.fd_lof = fd_outliers.concatenate(fd_clean)
+
     def test_predict_classifier(self):
         """Tests predict for neighbors classifier"""
 
@@ -86,13 +95,16 @@ def test_kneighbors(self):
         nn = NearestNeighbors()
         nn.fit(self.X)
 
+        lof = LocalOutlierFactor(n_neighbors=5)
+        lof.fit(self.X)
+
         knn = KNeighborsClassifier()
         knn.fit(self.X, self.y)
 
         knnr = KNeighborsRegressor()
         knnr.fit(self.X, self.modes_location)
 
-        for neigh in [nn, knn, knnr]:
+        for neigh in [nn, knn, knnr, lof]:
 
             dist, links = neigh.kneighbors(self.X[:4])
 
@@ -101,12 +113,12 @@ def test_kneighbors(self):
                                            [2, 17, 22, 27, 26],
                                            [3, 4, 9, 5, 25]])
 
+            graph = neigh.kneighbors_graph(self.X[:4])
+
             dist_kneigh = lp_distance(self.X[0], self.X[7])
 
             np.testing.assert_array_almost_equal(dist[0, 1], dist_kneigh)
 
-            graph = neigh.kneighbors_graph(self.X[:4])
-
             for i in range(30):
                 self.assertEqual(graph[0, i] == 1.0, i in links[0])
                 self.assertEqual(graph[0, i] == 0.0, i not in links[0])
@@ -324,6 +336,91 @@ def test_multivariate_response_score(self):
 
         with np.testing.assert_raises(ValueError):
             neigh.score(self.X[:5], y)
 
+    def test_lof_fit_predict(self):
+        """Test same results with different ways of calling fit_predict."""
+
+        # Outliers
+        expected = np.ones(len(self.fd_lof))
+        expected[0:2] = -1
+
+        # With default l2 distance
+        lof = LocalOutlierFactor()
+        res = lof.fit_predict(self.fd_lof)
+        np.testing.assert_array_equal(expected, res)
+
+        # With explicit l2 distance
+        lof2 = LocalOutlierFactor(metric=lp_distance)
+        res2 = lof2.fit_predict(self.fd_lof)
+        np.testing.assert_array_equal(expected, res2)
+
+        d = pairwise_distance(lp_distance)
+        distances = d(self.fd_lof, self.fd_lof)
+
+        # With precomputed distances
+        lof3 = LocalOutlierFactor(metric="precomputed")
+        res3 = lof3.fit_predict(distances)
+        np.testing.assert_array_equal(expected, res3)
+
+        # With multivariate sklearn metric
+        lof4 = LocalOutlierFactor(metric="euclidean", multivariate_metric=True)
+        res4 = lof4.fit_predict(self.fd_lof)
+        np.testing.assert_array_equal(expected, res4)
+
+        # Another way of calling fit_predict, undocumented in sklearn
+        lof5 = LocalOutlierFactor(novelty=True)
+        res5 = lof5.fit(self.fd_lof).predict()
+        np.testing.assert_array_equal(expected, res5)
+
+        # Check values of negative outlier factor
+        negative_lof = [-7.1068, -1.5412, -0.9961, -0.9854, -0.9896, -1.0993,
+                        -1.065, -0.9871, -0.9821, -0.9955, -1.0385, -1.0072,
+                        -0.9832, -1.0134, -0.9939, -1.0074, -0.992, -0.992,
+                        -0.9883, -1.0012, -1.1149, -1.002, -0.9994, -0.9869,
+                        -0.9726, -0.9989, -0.9904]
+
+        np.testing.assert_array_almost_equal(
+            lof.negative_outlier_factor_.round(4), negative_lof)
+
+        # Check same negative outlier factor
+        np.testing.assert_array_almost_equal(lof.negative_outlier_factor_,
+                                             lof2.negative_outlier_factor_)
+
+        np.testing.assert_array_almost_equal(lof.negative_outlier_factor_,
+                                             lof3.negative_outlier_factor_)
+
+    def test_lof_decision_function(self):
+        """Test decision_function and score_samples of LOF."""
+
+        lof = LocalOutlierFactor(novelty=True)
+        lof.fit(self.fd_lof[5:])
+
+        score = lof.score_samples(self.fd_lof[:5])
+
+        np.testing.assert_array_almost_equal(
+            score.round(4), [-5.9726, -1.3445, -0.9853, -0.9817, -0.985],
+            err_msg='Error in LocalOutlierFactor.score_samples')
+
+        # Test decision_function = score_samples - offset
+        np.testing.assert_array_almost_equal(
+            lof.decision_function(self.fd_lof[:5]), score - lof.offset_,
+            err_msg='Error in LocalOutlierFactor.decision_function')
+
+    def test_lof_exceptions(self):
+        """Test errors raised due to the novelty attribute."""
+
+        lof = LocalOutlierFactor(novelty=True)
+
+        # Error in fit_predict function
+        with np.testing.assert_raises(AttributeError):
+            lof.fit_predict(self.fd_lof[5:])
+
+        lof.set_params(novelty=False)
+        lof.fit(self.fd_lof[5:])
+
+        # Error in predict function
+        with np.testing.assert_raises(AttributeError):
+            lof.predict(self.fd_lof[5:])
+
 
 if __name__ == '__main__':
     print()
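
A minimal end-to-end sketch tying together the two main additions of this
patch: loading the new octane dataset and running the new functional LOF
wrapper on it. The import path follows the one used in tests/test_neighbors.py
(the commented import there suggests the estimator may later be re-exported
from skfda.exploratory.outliers), and the default parameters are not
guaranteed to flag exactly the six labeled outliers.

.. code-block:: python

   import numpy as np

   from skfda.datasets import fetch_octane
   from skfda._neighbors.outlier import LocalOutlierFactor

   # NIR spectra as an FDataGrid; target is 1 for the six ethanol samples.
   X, y = fetch_octane(return_X_y=True)

   # Fit LOF with the default functional L2 distance and label each curve
   # as an inlier (+1) or an outlier (-1).
   lof = LocalOutlierFactor(n_neighbors=20)
   is_inlier = lof.fit_predict(X)

   print("Detected outliers:", np.nonzero(is_inlier == -1)[0])
   print("Labeled outliers: ", np.nonzero(y == 1)[0])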