From fc41bdd8a95065784135a90492f29e3e23fbe614 Mon Sep 17 00:00:00 2001 From: AryanAgarwal27 <67140930+AryanAgarwal27@users.noreply.github.com> Date: Wed, 1 Jan 2025 20:03:08 -0800 Subject: [PATCH] #1021 Fintech example documentation for Cleanlab implementation (#96) --- 1021_fintech_documentation/Final.ipynb | 2707 ++++++++++++++++++++ 1021_fintech_documentation/Requirement.txt | 4 + 2 files changed, 2711 insertions(+) create mode 100644 1021_fintech_documentation/Final.ipynb create mode 100644 1021_fintech_documentation/Requirement.txt diff --git a/1021_fintech_documentation/Final.ipynb b/1021_fintech_documentation/Final.ipynb new file mode 100644 index 0000000..dd6e17c --- /dev/null +++ b/1021_fintech_documentation/Final.ipynb @@ -0,0 +1,2707 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b2eebf0d-31ff-4ce0-b2b7-4d82ee61150b", + "metadata": { + "id": "b2eebf0d-31ff-4ce0-b2b7-4d82ee61150b" + }, + "source": [ + "# Detecting Data Quality Issues in Credit Card Fraud Detection Using Cleanlab\n", + "\n", + "In this 5-minute quickstart tutorial, we will use **Cleanlab's Datalab** to detect various issues in a tabular dataset commonly encountered in financial applications. This tutorial focuses on the **Credit Card Fraud Detection dataset**, which contains thousands of transaction records labeled as fraudulent or non-fraudulent. 
The dataset includes features such as transaction amount, transaction type, merchant ID, and location.\n", + "\n", + "### Cleanlab Helps Uncover:\n", + "- **Label errors**: Mislabeled transactions, such as fraudulent cases incorrectly marked as non-fraudulent.\n", + "- **Outliers**: Transactions with abnormal patterns that deviate significantly from the rest of the dataset.\n", + "- **Near-duplicates**: Repeated transactions or entries that may distort results or impact model performance.\n", + "\n", + "Using Cleanlab, we automatically identify examples that are likely mislabeled or problematic, so they can be reviewed and corrected before training a fraud detection model. You can adapt this tutorial to detect and correct issues in your own financial tabular datasets.\n" + ] + }, + { + "cell_type": "markdown", + "id": "27fcddca-534f-4851-8c80-688e7cb7ff79", + "metadata": { + "id": "27fcddca-534f-4851-8c80-688e7cb7ff79" + }, + "source": [ + "## Quickstart\n", + "\n", + "Already have (out-of-sample) `pred_probs` from a model trained on your original data labels?\n", + "Have a `knn_graph` computed between dataset examples (reflecting similarity in their feature values)?\n", + "Run the code below to find issues in your dataset.\n" + ] + }, + { + "cell_type": "raw", + "id": "3dc09d4e-499e-4aed-942e-57f1df4deca7", + "metadata": { + "id": "3dc09d4e-499e-4aed-942e-57f1df4deca7" + }, + "source": [ + "from cleanlab import Datalab\n", + "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n", + "lab.find_issues(pred_probs=your_pred_probs, knn_graph=your_knn_graph)\n", + "\n", + "lab.get_issues()" + ] + }, + { + "cell_type": "markdown", + "id": "791717d7-d140-4d85-b516-9a6e6f28c7c0", + "metadata": { + "id": "791717d7-d140-4d85-b516-9a6e6f28c7c0" + }, + "source": [ + "# 1. 
Install Required Dependencies\n", + "\n", + "To get started, install the required packages for this tutorial using pip:\n", + "\n", + "```bash\n", + "pip install \"cleanlab[datalab]\" scikit-learn pandas numpy\n", + "```\n" + ] + }, + { + "cell_type": "code", + "source": [ + "# Install the required libraries, pinned to the versions used in this tutorial\n", + "!pip install \"cleanlab[datalab]\" \"numpy\" \"pandas==1.3.3\" \"scikit-learn==1.0.2\" \"scikit-image==0.18.3\"\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f7TpME1a6-Db", + "outputId": "9f5642dc-ae12-464b-e65f-77d0f57b4ce0" + }, + "id": "f7TpME1a6-Db", + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.22.0)\n", + "Requirement already satisfied: pandas==1.3.3 in /usr/local/lib/python3.10/dist-packages (1.3.3)\n", + "Requirement already satisfied: scikit-learn==1.0.2 in /usr/local/lib/python3.10/dist-packages (1.0.2)\n", + "Requirement already satisfied: scikit-image==0.18.3 in /usr/local/lib/python3.10/dist-packages (0.18.3)\n", + "Requirement already satisfied: cleanlab[datalab] in /usr/local/lib/python3.10/dist-packages (2.5.0)\n", + "Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.10/dist-packages (from pandas==1.3.3) (2.8.2)\n", + "Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.10/dist-packages (from pandas==1.3.3) (2024.2)\n", + "Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.0.2) (1.11.4)\n", + "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.0.2) (1.4.2)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.0.2) (3.5.0)\n", + "Requirement already satisfied: matplotlib!=3.0.0,>=2.0.0 in 
/usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (3.8.0)\n", + "Requirement already satisfied: networkx>=2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (3.4.2)\n", + "Requirement already satisfied: pillow!=7.1.0,!=7.1.1,>=4.3.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (11.0.0)\n", + "Requirement already satisfied: imageio>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (2.36.1)\n", + "Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (2024.9.20)\n", + "Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-image==0.18.3) (1.4.1)\n", + "Requirement already satisfied: tqdm>=4.53.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (4.66.6)\n", + "Requirement already satisfied: termcolor>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.5.0)\n", + "Requirement already satisfied: datasets>=2.7.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (3.2.0)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.16.1)\n", + "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (17.0.0)\n", + "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.3.8)\n", + "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2.32.3)\n", + "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.5.0)\n", + "Requirement already satisfied: multiprocess<0.70.17 in /usr/local/lib/python3.10/dist-packages (from 
datasets>=2.7.0->cleanlab[datalab]) (0.70.16)\n", + "Requirement already satisfied: fsspec<=2024.9.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets>=2.7.0->cleanlab[datalab]) (2024.9.0)\n", + "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.11.10)\n", + "Requirement already satisfied: huggingface-hub>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.26.5)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (24.2)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (6.0.2)\n", + "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (1.2.1)\n", + "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (0.12.1)\n", + "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (4.55.3)\n", + "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (1.4.7)\n", + "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib!=3.0.0,>=2.0.0->scikit-image==0.18.3) (3.2.0)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7.3->pandas==1.3.3) (1.17.0)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (2.4.4)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in 
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.3.1)\n", + "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (4.0.3)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (24.2.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.5.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (6.1.0)\n", + "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (0.2.1)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.18.3)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.23.0->datasets>=2.7.0->cleanlab[datalab]) (4.12.2)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets>=2.7.0->cleanlab[datalab]) (3.4.0)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets>=2.7.0->cleanlab[datalab]) (3.10)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets>=2.7.0->cleanlab[datalab]) (2.2.3)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets>=2.7.0->cleanlab[datalab]) (2024.8.30)\n" + ] + } + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": 
"2af19c9f-970f-4f2e-ab48-392b376c0b98", + "metadata": { + "id": "2af19c9f-970f-4f2e-ab48-392b376c0b98" + }, + "outputs": [], + "source": [ + "import random\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from sklearn.model_selection import cross_val_predict\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.neighbors import NearestNeighbors\n", + "\n", + "\n", + "from cleanlab import Datalab\n", + "\n", + "# Set random seed for reproducibility\n", + "SEED = 42\n", + "np.random.seed(SEED)\n", + "random.seed(SEED)\n" + ] + }, + { + "cell_type": "markdown", + "id": "3ef577c5-b3f1-4d5b-9b26-9324651b12fd", + "metadata": { + "id": "3ef577c5-b3f1-4d5b-9b26-9324651b12fd" + }, + "source": [ + "# 2. Load and Process the Data\n", + "\n", + "We will now load the Credit Card Fraud Detection dataset, which contains features such as the transaction amount, transaction type, and location, along with labels indicating whether each transaction is fraudulent (`1`) or non-fraudulent (`0`).\n", + "\n", + "First, we load the dataset and display the first few rows to get an overview of the data structure.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "ea80ab8d-6461-47d9-ac45-7048540b4650", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "ea80ab8d-6461-47d9-ac45-7048540b4650", + "outputId": "35249fcf-22d4-433c-faa9-5871138a6db0" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " TransactionID TransactionDate Amount MerchantID \\\n", + "0 1 2024-04-03 14:15:35.462794 4189.27 688 \n", + "1 2 2024-03-19 13:20:35.462824 2659.71 109 \n", + "2 3 2024-01-08 10:08:35.462834 784.00 394 \n", + "3 4 2024-04-13 23:50:35.462850 3514.40 944 \n", + "4 5 2024-07-12 18:51:35.462858 369.07 475 \n", + "\n", + " TransactionType Location IsFraud \n", + "0 refund San Antonio 0 \n", + "1 refund Dallas 0 
\n", + "2 purchase New York 0 \n", + "3 purchase Philadelphia 0 \n", + "4 purchase Phoenix 0 " + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "fraud_data", + "summary": "{\n \"name\": \"fraud_data\",\n \"rows\": 100000,\n \"fields\": [\n {\n \"column\": \"TransactionID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 28867,\n \"min\": 1,\n \"max\": 100000,\n \"num_unique_values\": 100000,\n \"samples\": [\n 75722,\n 80185,\n 19865\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionDate\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 100000,\n \"samples\": [\n \"2024-08-18 01:11:35.918051\",\n \"2024-06-09 07:44:35.939541\",\n \"2024-06-10 08:55:35.558368\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1442.4159985963513,\n \"min\": 1.05,\n \"max\": 4999.77,\n \"num_unique_values\": 90621,\n \"samples\": [\n 3273.37,\n 4040.01,\n 4120.55\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MerchantID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 288,\n \"min\": 1,\n \"max\": 1000,\n \"num_unique_values\": 1000,\n \"samples\": [\n 702,\n 152,\n 346\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"purchase\",\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Houston\",\n \"Dallas\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"IsFraud\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n 
}\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 3 + } + ], + "source": [ + "fraud_data = pd.read_csv(\"credit_card_fraud_dataset.csv\")\n", + "fraud_data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "a2e0a535-7ece-4607-ba19-ee6e7eac9374", + "metadata": { + "id": "a2e0a535-7ece-4607-ba19-ee6e7eac9374" + }, + "outputs": [], + "source": [ + "# Select relevant features and labels\n", + "X_raw = fraud_data[[\"Amount\", \"TransactionType\", \"Location\"]]\n", + "y = fraud_data[\"IsFraud\"]" + ] + }, + { + "cell_type": "markdown", + "id": "2600e93b-60d9-4d54-b8a6-092563c19aff", + "metadata": { + "id": "2600e93b-60d9-4d54-b8a6-092563c19aff" + }, + "source": [ + "We will now preprocess the dataset to prepare it for analysis. This involves:\n", + "1. Selecting relevant features (`Amount`, `TransactionType`, `Location`), as done in the cell above.\n", + "2. Encoding categorical variables (`TransactionType` and `Location`) using one-hot encoding.\n", + "3. Standardizing numerical variables (`Amount`) so that all features are on a similar scale.\n", + "\n", + "The preprocessed features are stored in `X_encoded`; the labels (`IsFraud`) are already stored in `y`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "814c39d6-4490-47db-8a0a-ded45fc5a09a", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "814c39d6-4490-47db-8a0a-ded45fc5a09a", + "outputId": "6d8567f0-8396-450c-abc1-9fba07e1c4f5" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + " Amount TransactionType_refund Location_Dallas Location_Houston \\\n", + "0 1.173161 True False False \n", + "1 0.112740 True True False \n", + "2 -1.187661 False False False \n", + "3 0.705284 False False False \n", + "4 -1.475326 False False False \n", + "\n", + " Location_Los Angeles Location_New York Location_Philadelphia \\\n", + "0 False False False \n", + "1 False False False \n", + "2 False True False \n", + "3 False False True \n", + "4 False False False \n", + "\n", + " Location_Phoenix Location_San Antonio Location_San Diego \\\n", + "0 False True False \n", + "1 False False False \n", + "2 False False False \n", + "3 False False False \n", + "4 True False False \n", + "\n", + " Location_San Jose \n", + "0 False \n", + "1 False \n", + "2 False \n", + "3 False \n", + "4 False \n" + ] + } + ], + "source": [ + "# One-hot encode categorical features\n", + "categorical_features = [\"TransactionType\", \"Location\"]\n", + "X_encoded = pd.get_dummies(X_raw, columns=categorical_features, drop_first=True)\n", + "\n", + "# Standardize numerical features\n", + "numeric_features = [\"Amount\"]\n", + "scaler = StandardScaler()\n", + "X_encoded[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])\n", + "\n", + "# Display preprocessed data\n", + "print(X_encoded.head())" + ] + }, + { + "cell_type": "markdown", + "id": "d394b6cd-34fd-4ec0-8f6f-99d13dd481ab", + "metadata": { + "id": "d394b6cd-34fd-4ec0-8f6f-99d13dd481ab" + }, + "source": [ + "# 3. 
Select a Classification Model and Compute Out-of-Sample Predicted Probabilities\n", + "\n", + "To detect potential label errors in the **Credit Card Fraud Detection dataset**, Cleanlab requires **predicted class probabilities** for every data point. However, probabilities predicted for the same data a model was trained on tend to be **overconfident** and unreliable. For accurate results, Cleanlab works best with **out-of-sample** predicted class probabilities, i.e., predictions for data points that were held out from the model during training.\n", + "\n", + "---\n", + "\n", + "### Why Use Out-of-Sample Predictions?\n", + "\n", + "Out-of-sample predictions come from a model that has not seen the corresponding data points during training. This approach:\n", + "- **Avoids overfit predictions**: The model cannot simply reproduce a label it memorized during training.\n", + "- **Improves reliability**: Probabilities are closer to what the model would output on unseen transactions.\n", + "- **Supports Cleanlab's analysis**: Enables Cleanlab to accurately identify mislabeled data and other issues.\n", + "\n", + "---\n", + "\n", + "### How We Generate Out-of-Sample Predictions\n", + "\n", + "We use **K-fold cross-validation**, which:\n", + "1. Splits the dataset into `K` folds.\n", + "2. Trains the model on `K-1` folds and predicts probabilities on the excluded fold.\n", + "3. Repeats this until each of the `K` folds has been held out once.\n", + "\n", + "This ensures every data point has **out-of-sample predicted probabilities**.\n", + "\n", + "---\n", + "\n", + "### Model: Logistic Regression\n", + "\n", + "For this tutorial, we use **Logistic Regression**, a simple and interpretable model commonly used in fraud detection tasks. 
It predicts the probability of each class (`0` for non-fraud, `1` for fraud) based on the input features.\n", + "\n", + "---\n", + "\n", + "### Predicted Probabilities\n", + "\n", + "The output of cross-validation is an array of **predicted probabilities** (`pred_probs`):\n", + "- **Rows** correspond to individual transactions.\n", + "- **Columns** represent the probabilities of each class (`0` and `1`).\n", + "\n", + "For example (illustrative values, not taken from this dataset):\n", + "\n", + "| Transaction ID | Probability (Non-Fraud) | Probability (Fraud) |\n", + "|----------------|--------------------------|----------------------|\n", + "| 1 | 0.92 | 0.08 |\n", + "| 2 | 0.65 | 0.35 |\n", + "| ... | ... | ... |\n", + "\n", + "These probabilities are a critical input for Cleanlab to identify potential label issues in the dataset.\n", + "\n", + "After computing them, we will construct a **K-Nearest Neighbors (KNN) graph** from the feature values, which Cleanlab uses to detect outliers and near-duplicates.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "dd2ccbc6-824a-44cd-bd1d-26e0a39e23a4", + "metadata": { + "id": "dd2ccbc6-824a-44cd-bd1d-26e0a39e23a4" + }, + "outputs": [], + "source": [ + "# Define the classification model\n", + "clf = LogisticRegression(max_iter=1000, random_state=SEED)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "bb9327e7-bc63-45a0-8d2c-7950186c961f", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bb9327e7-bc63-45a0-8d2c-7950186c961f", + "outputId": "1024212d-a062-4a11-8599-f792c7d48892" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Shape of predicted probabilities: (100000, 2)\n" + ] + } + ], + "source": [ + "# Perform K-fold cross-validation to compute out-of-sample predicted probabilities\n", + "num_crossval_folds = 5\n", + "pred_probs = cross_val_predict(\n", + " clf,\n", + " X_encoded, # Preprocessed feature matrix\n", + " y, # Labels\n", + " cv=num_crossval_folds,\n", + " method=\"predict_proba\" # Get predicted 
probabilities\n", + ")\n", + "\n", + "# Display the shape of the predicted probabilities array\n", + "print(\"Shape of predicted probabilities:\", pred_probs.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "2681aad2-d84d-4713-bed8-aa1204223fd5", + "metadata": { + "id": "2681aad2-d84d-4713-bed8-aa1204223fd5" + }, + "source": [ + "# 4. Construct the K-Nearest Neighbors Graph\n", + "\n", + "The **KNN graph** represents the similarity between examples in the dataset. It helps Cleanlab identify issues like:\n", + "- **Outliers**: Data points that are far from others in feature space.\n", + "- **Duplicates or Near-Duplicates**: Examples that are unusually close to each other.\n", + "\n", + "For tabular data, we define similarity using the **Euclidean distance** between feature values.\n", + "\n", + "We use scikit-learn's `NearestNeighbors` class to construct this graph:\n", + "1. Find the nearest neighbors of each example (scikit-learn's default of `n_neighbors=5` is used here).\n", + "2. Represent the graph as a sparse matrix whose stored entries hold the distance from each example to each of its nearest neighbors.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "6c49780c-5d65-4360-ad0b-90e6717ecddb", + "metadata": { + "id": "6c49780c-5d65-4360-ad0b-90e6717ecddb" + }, + "outputs": [], + "source": [ + "# Create a KNN model with Euclidean distance as the metric\n", + "knn = NearestNeighbors(metric=\"euclidean\")\n", + "\n", + "# Fit the KNN model to the preprocessed feature values\n", + "knn.fit(X_encoded.values)\n", + "\n", + "# Construct the KNN graph as a sparse matrix\n", + "knn_graph = knn.kneighbors_graph(mode=\"distance\")" + ] + }, + { + "cell_type": "markdown", + "id": "b27cf2de-e276-438a-8f0a-9a5de3d1757a", + "metadata": { + "id": "b27cf2de-e276-438a-8f0a-9a5de3d1757a" + }, + "source": [ + "# 5. 
Use Cleanlab to Find Dataset Issues\n", + "\n", + "With the given labels, predicted probabilities, and the KNN graph, Cleanlab can help us identify various issues in the **Credit Card Fraud Detection dataset**, such as:\n", + "\n", + "- **Label Issues**: Transactions where the assigned label (fraud or non-fraud) is likely incorrect.\n", + "- **Outliers**: Transactions with anomalous patterns that differ significantly from the rest.\n", + "- **Near-Duplicates**: Transactions that are highly similar or repeated.\n", + "- **Class Imbalance**: Uneven representation of classes in the dataset.\n", + "\n", + "We use Cleanlab's **Datalab** class to audit the dataset for these issues. The process involves:\n", + "1. Wrapping the dataset (preprocessed features and labels) into a dictionary format.\n", + "2. Creating a `Datalab` object to analyze the dataset.\n", + "3. Detecting and reporting various types of data quality issues." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "58ca8740-1e44-4959-8567-ec4ee2535bfa", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "58ca8740-1e44-4959-8567-ec4ee2535bfa", + "outputId": "dc353a68-3607-46e0-fb87-07fcf651c288" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Finding label issues ...\n", + "Finding outlier issues ...\n", + "Finding near_duplicate issues ...\n", + "Finding non_iid issues ...\n", + "Finding class_imbalance issues ...\n", + "Finding underperforming_group issues ...\n", + "\n", + "Audit complete. 
12043 issues found in the dataset.\n" + ] + } + ], + "source": [ + "from cleanlab import Datalab\n", + "# Wrap the dataset into a dictionary\n", + "data = {\"X\": X_encoded.values, \"y\": y}\n", + "\n", + "# Create a Datalab object\n", + "lab = Datalab(data, label_name=\"y\")\n", + "\n", + "# Use Cleanlab to find issues in the dataset\n", + "lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "ef82352b-eb2e-4c7a-91d0-436fc61e16ac", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ef82352b-eb2e-4c7a-91d0-436fc61e16ac", + "outputId": "73358440-6c05-4fb6-9beb-ad5fc35055cc" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Dataset Information: num_examples: 100000, num_classes: 2\n", + "\n", + "Here is a summary of various issues found in your data:\n", + "\n", + " issue_type num_issues\n", + " near_duplicate 8639\n", + " outlier 1797\n", + "class_imbalance 1000\n", + " label 607\n", + "\n", + "Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html\n", + "See which examples in your dataset exhibit each issue via: `datalab.get_issues()`\n", + "\n", + "Data indices corresponding to top examples of each issue are shown below.\n", + "\n", + "\n", + "------------------ near_duplicate issues -------------------\n", + "\n", + "About this issue:\n", + "\tA (near) duplicate issue refers to two or more examples in\n", + " a dataset that are extremely similar to each other, relative\n", + " to the rest of the dataset. The examples flagged with this issue\n", + " may be exactly duplicated, or lie atypically close together when\n", + " represented as vectors (i.e. 
feature embeddings).\n", + " \n", + "\n", + "Number of examples with this issue: 8639\n", + "Overall dataset quality in terms of this issue: 0.5894\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor\n", + "62583 True 0.0 [55080] 0.0\n", + "30333 True 0.0 [13617] 0.0\n", + "12827 True 0.0 [15703] 0.0\n", + "66741 True 0.0 [82920] 0.0\n", + "45125 True 0.0 [95476] 0.0\n", + "\n", + "\n", + "---------------------- outlier issues ----------------------\n", + "\n", + "About this issue:\n", + "\tExamples that are very different from the rest of the dataset \n", + " (i.e. potentially out-of-distribution or rare/anomalous instances).\n", + " \n", + "\n", + "Number of examples with this issue: 1797\n", + "Overall dataset quality in terms of this issue: 0.3784\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_outlier_issue outlier_score\n", + "43484 True 0.003062\n", + "4659 True 0.007290\n", + "67602 True 0.007582\n", + "91994 True 0.007898\n", + "52696 True 0.008608\n", + "\n", + "\n", + "------------------ class_imbalance issues ------------------\n", + "\n", + "About this issue:\n", + "\tExamples belonging to the most under-represented class in the dataset.\n", + "\n", + "Number of examples with this issue: 1000\n", + "Overall dataset quality in terms of this issue: 0.0100\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_class_imbalance_issue class_imbalance_score given_label\n", + "68852 True 0.01 1\n", + "22652 True 0.01 1\n", + "33819 True 0.01 1\n", + "5781 True 0.01 1\n", + "44573 True 0.01 1\n", + "\n", + "Additional Information: \n", + "Rarest Class: 1\n", + "\n", + "\n", + "----------------------- label issues -----------------------\n", + "\n", + "About this issue:\n", + "\tExamples whose given label is estimated to be potentially incorrect\n", + " (e.g. 
due to annotation error) are flagged as having label issues.\n", + " \n", + "\n", + "Number of examples with this issue: 607\n", + "Overall dataset quality in terms of this issue: 0.9939\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_label_issue label_score given_label predicted_label\n", + "6901 True 0.006965 1 0\n", + "7933 True 0.007031 1 0\n", + "13204 True 0.007065 1 0\n", + "16276 True 0.007086 1 0\n", + "7546 True 0.007124 1 0\n" + ] + } + ], + "source": [ + "lab.report()" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Label Issues\n", + "The report indicates that Cleanlab identified several label issues in the dataset. These are transactions whose given label may not match the true label, as estimated by Cleanlab. Each example receives a numeric label score that quantifies how likely its given label is correct (lower scores indicate a higher likelihood of being mislabeled)." + ], + "metadata": { + "id": "qBcATrTFCWqJ" + }, + "id": "qBcATrTFCWqJ" + }, + { + "cell_type": "code", + "source": [ + "# Retrieve label issues\n", + "label_issues = lab.get_issues(\"label\")\n", + "print(label_issues.head())\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pee_lWpiCiIV", + "outputId": "d5bcf570-0051-4b92-df49-20c3479b88b1" + }, + "id": "pee_lWpiCiIV", + "execution_count": 11, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + " is_label_issue label_score given_label predicted_label\n", + "0 False 0.990469 0 0\n", + "1 False 0.991203 0 0\n", + "2 False 0.988302 0 0\n", + "3 False 0.990321 0 0\n", + "4 False 0.991149 0 0\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Keep only the rows flagged as label issues (boolean-mask indexing)\n", + "label_issues_filtered = label_issues[label_issues[\"is_label_issue\"]]\n", + "print(label_issues_filtered.head())\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8pGqVz8RDeoF", + 
"outputId": "322a6eb8-4b2f-4597-9b8f-614c97887a45" + }, + "id": "8pGqVz8RDeoF", + "execution_count": 12, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + " is_label_issue label_score given_label predicted_label\n", + "190 True 0.007187 1 0\n", + "191 True 0.007622 1 0\n", + "208 True 0.007177 1 0\n", + "319 True 0.008984 1 0\n", + "506 True 0.009220 1 0\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Sort the label issues by label_score (lower scores indicate higher likelihood of being mislabeled)\n", + "sorted_issues = label_issues.sort_values(\"label_score\").index\n", + "\n", + "# View the most likely label errors\n", + "X_raw.iloc[sorted_issues].assign(\n", + " given_label=y.iloc[sorted_issues],\n", + " predicted_label=label_issues[\"predicted_label\"].iloc[sorted_issues]\n", + ").head()\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "m1KP2zEWDfaE", + "outputId": "6fc9c1b0-30a0-4c3f-f015-44e42202c166" + }, + "id": "m1KP2zEWDfaE", + "execution_count": 13, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Amount TransactionType Location given_label predicted_label\n", + "6901 346.13 purchase San Jose 1 0\n", + "7933 25.91 refund San Jose 1 0\n", + "13204 963.84 purchase San Jose 1 0\n", + "16276 1093.22 purchase San Jose 1 0\n", + "7546 598.78 refund San Jose 1 0" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \")\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 438.6116871789898,\n \"min\": 25.91,\n \"max\": 1093.22,\n \"num_unique_values\": 5,\n \"samples\": [\n 25.91,\n 598.78,\n 963.84\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"refund\",\n \"purchase\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"San Jose\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"given_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 1,\n \"num_unique_values\": 1,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 13 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "### Example Review of Label Issues\n", + "\n", + "The dataframe below shows the original label (`given_label`) for examples that Cleanlab finds most likely to be mislabeled, as well as an alternative `predicted_label` for each example.\n", + "\n", + "| Amount | TransactionType | Location | given_label | predicted_label |\n", + "|---------|------------------|-----------|-------------|-----------------|\n", + "| 346.13 | purchase | San Jose | 1 | 0 |\n", + "| 25.91 | refund | San Jose | 1 | 0 |\n", + "| 963.84 | purchase | San Jose | 
1 | 0 |\n", + "| 1093.22 | purchase | San Jose | 1 | 0 |\n", + "| 598.78 | refund | San Jose | 1 | 0 |\n", + "\n", + "These examples may have been labeled incorrectly and should be carefully re-examined:\n", + "- **Entry 1**: A purchase of 346.13 labeled as fraudulent (`1`) is predicted to be non-fraudulent (`0`).\n", + "- **Entry 2**: A refund of 25.91 is similarly labeled as fraudulent but predicted as non-fraudulent.\n", + "- **Entry 4**: A purchase of 1093.22 is also labeled as fraudulent but predicted as non-fraudulent.\n", + "\n", + "All five examples are transactions in `San Jose` that are labeled as fraudulent but predicted as non-fraudulent, which suggests a potential mislabeling pattern for this location. They include both purchases and refunds, with amounts ranging from 25.91 to 1093.22, so the pattern does not appear to depend on transaction type or amount. This should be reviewed with additional domain knowledge or transaction metadata for confirmation.\n", + "\n", + "Such insights are crucial for improving the dataset's quality and ensuring the model learns from accurate labels.\n" + ], + "metadata": { + "id": "-ApyX5r6FTmI" + }, + "id": "-ApyX5r6FTmI" + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "### Outlier Issues\n", + "\n", + "According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via the `get_issues` method. We sort the resulting DataFrame by Cleanlab’s outlier quality score to see the most severe outliers in our dataset." 
+ ], + "metadata": { + "id": "_zzPdWl0GFOY" + }, + "id": "_zzPdWl0GFOY" + }, + { + "cell_type": "code", + "source": [ + "outlier_results = lab.get_issues(\"outlier\")\n", + "sorted_outliers = outlier_results.sort_values(\"outlier_score\").index\n", + "\n", + "X_raw.iloc[sorted_outliers].head()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "D7VClp15GIXC", + "outputId": "ec6c23c3-1802-42ba-d2aa-69537251da5d" + }, + "id": "D7VClp15GIXC", + "execution_count": 14, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Amount TransactionType Location\n", + "43484 4999.73 purchase Chicago\n", + "4659 2114.37 refund Philadelphia\n", + "67602 3255.47 purchase San Jose\n", + "91994 1147.93 refund Chicago\n", + "52696 4005.05 purchase San Antonio" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1519.3915375570575,\n \"min\": 1147.93,\n \"max\": 4999.73,\n \"num_unique_values\": 5,\n \"samples\": [\n 2114.37,\n 4005.05,\n 3255.47\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"refund\",\n \"purchase\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"Philadelphia\",\n \"San Antonio\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 14 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "\n", + "\n", + "#### **Key Observations**:\n", + "1. **Entry 1**: A purchase transaction with an unusually high amount of `$4999.73` in Chicago may represent a legitimate but rare high-value transaction or could be indicative of an error.\n", + "2. **Entry 2**: A refund for `$2114.37` in Philadelphia seems unusually high compared to typical refund amounts and should be verified.\n", + "3. 
**Entry 5**: Another high-value purchase transaction of `$4005.05` in San Antonio is rare and should be reviewed for validity.\n", + "\n", + "#### **Next Steps**:\n", + "- **Investigate Outliers**:\n", + " - Validate whether these transactions are legitimate or the result of data errors.\n", + " - Cross-check these entries against metadata such as timestamps, merchants, and customer profiles for better context.\n", + "- **Handle Outliers**:\n", + " - **Retain**: If the transaction is valid, keep it in the dataset for training.\n", + " - **Remove**: If the transaction is deemed erroneous or unrepresentative, exclude it from the dataset to avoid skewing the model's learning.\n", + "\n", + "These steps help ensure that the dataset is representative and free of erroneous entries that could degrade the performance of fraud detection models.\n", + " " + ], + "metadata": { + "id": "qQYl8X5RG9F6" + }, + "id": "qQYl8X5RG9F6" + }, + { + "cell_type": "markdown", + "source": [ + "### Near-Duplicate Issues\n", + "\n", + "According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. 
We sort the resulting DataFrame by Cleanlab’s near-duplicate quality score to see the examples in our dataset that are most nearly duplicated.\n", + "\n", + "\n" + ], + "metadata": { + "id": "STlYZFJRRDtO" + }, + "id": "STlYZFJRRDtO" + }, + { + "cell_type": "code", + "source": [ + "duplicate_results = lab.get_issues(\"near_duplicate\")\n", + "duplicate_results.sort_values(\"near_duplicate_score\").head()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "VHcPnNYbQZ-n", + "outputId": "7dc6f1fe-ac78-4c77-96e5-176c7f3a6a16" + }, + "id": "VHcPnNYbQZ-n", + "execution_count": 15, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " is_near_duplicate_issue near_duplicate_score near_duplicate_sets \\\n", + "62583 True 0.0 [55080] \n", + "30333 True 0.0 [13617] \n", + "12827 True 0.0 [15703] \n", + "66741 True 0.0 [82920] \n", + "45125 True 0.0 [95476] \n", + "\n", + " distance_to_nearest_neighbor \n", + "62583 0.0 \n", + "30333 0.0 \n", + "12827 0.0 \n", + "66741 0.0 \n", + "45125 0.0 " + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"duplicate_results\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"is_near_duplicate_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 0.0,\n \"max\": 0.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_sets\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"distance_to_nearest_neighbor\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 0.0,\n \"max\": 0.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 15 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "The results above show which examples Cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue` is `True`). Let’s view some of these examples alongside their near-duplicate sets to see how similar they are." 
+ ], + "metadata": { + "id": "0FyG5cJtRNGb" + }, + "id": "0FyG5cJtRNGb" + }, + { + "cell_type": "code", + "source": [ + "# Identify the row with the lowest near_duplicate_score\n", + "lowest_scoring_duplicate = duplicate_results[\"near_duplicate_score\"].idxmin()\n", + "\n", + "# Extract the indices of the lowest scoring duplicate and its near duplicate sets\n", + "indices_to_display = [lowest_scoring_duplicate] + duplicate_results.loc[lowest_scoring_duplicate, \"near_duplicate_sets\"].tolist()\n", + "\n", + "# Display the relevant rows from the original dataset\n", + "X_raw.iloc[indices_to_display]\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 143 + }, + "id": "IqgcWEVIROAP", + "outputId": "eb36a8cd-a66e-4f3d-eb68-c7aac6ef27b5" + }, + "id": "IqgcWEVIROAP", + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Amount TransactionType Location\n", + "73 3374.61 refund New York\n", + "19427 3374.61 refund New York\n", + "30450 3374.63 refund New York" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.011547005383782014,\n \"min\": 3374.61,\n \"max\": 3374.63,\n \"num_unique_values\": 2,\n \"samples\": [\n 3374.63,\n 3374.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"New York\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 18 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Rows 73 and 19427 are exact duplicates, and row 30450 differs from them only by 0.02 in `Amount`. 
Perhaps the same information was accidentally recorded multiple times in this data.\n", + "\n", + "Similarly, let’s take a look at another example and the identified near-duplicate sets:" + ], + "metadata": { + "id": "6nhecZHHSuv9" + }, + "id": "6nhecZHHSuv9" + }, + { + "cell_type": "code", + "source": [ + "# Identify the next row not in the previous near duplicate set\n", + "second_lowest_scoring_duplicate = duplicate_results[\"near_duplicate_score\"].drop(indices_to_display).idxmin()\n", + "\n", + "# Extract the indices of the second lowest scoring duplicate and its near duplicate sets\n", + "next_indices_to_display = [second_lowest_scoring_duplicate] + duplicate_results.loc[second_lowest_scoring_duplicate, \"near_duplicate_sets\"].tolist()\n", + "\n", + "# Display the relevant rows from the original dataset\n", + "X_raw.iloc[next_indices_to_display]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "94gQWzVkRW53", + "outputId": "106f3513-d065-4483-dc76-e6c28e614b39" + }, + "id": "94gQWzVkRW53", + "execution_count": 19, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Amount TransactionType Location\n", + "167 1796.39 refund New York\n", + "53564 1796.39 refund New York" + ], + "text/html": [ + "\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"X_raw\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Amount\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 1796.39,\n \"max\": 1796.39,\n \"num_unique_values\": 1,\n \"samples\": [\n 1796.39\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"TransactionType\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"refund\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Location\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"New York\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 19 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "We identified another set of exact duplicates in our dataset! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets. Learn more about handling near duplicates detected in a dataset from the FAQ.\n", + "\n", + "This tutorial highlights a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Cleanlab with any ML model – the better the model, the more accurate the data errors detected by Cleanlab will be!" 
+ ], + "metadata": { + "id": "6vexriCMTCAG" + }, + "id": "6vexriCMTCAG" + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "I56gc8gFTC4l" + }, + "id": "I56gc8gFTC4l", + "execution_count": null, + "outputs": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + }, + "colab": { + "provenance": [] + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/1021_fintech_documentation/Requirement.txt b/1021_fintech_documentation/Requirement.txt new file mode 100644 index 0000000..2758a9a --- /dev/null +++ b/1021_fintech_documentation/Requirement.txt @@ -0,0 +1,4 @@ +numpy==1.22.0 +pandas==1.3.3 +scikit-learn==1.0.2 +scikit-image==0.18.3 \ No newline at end of file
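The notebook's closing note about not splitting near duplicates across training/test sets can be sketched as follows. This is a minimal, standard-library-only illustration and not part of Cleanlab's API: the helper names `duplicate_groups` and `group_split` and the toy `example_sets` mapping are hypothetical. It assumes a mapping from each row index to the indices flagged as its near duplicates, which is the shape of the `near_duplicate_sets` column shown in the notebook.

```python
import random
from collections import defaultdict


def duplicate_groups(near_duplicate_sets):
    """Merge rows into groups so each row shares a group with its near duplicates.

    `near_duplicate_sets` (assumed input) maps a row index to the indices
    flagged as its near duplicates; rows without duplicates map to [].
    """
    parent = {i: i for i in near_duplicate_sets}

    def find(i):
        # Walk up to the group's representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, neighbors in near_duplicate_sets.items():
        for j in neighbors:
            parent.setdefault(j, j)
            parent[find(i)] = find(j)  # union the two groups

    groups = defaultdict(list)
    for i in list(parent):
        groups[find(i)].append(i)
    return list(groups.values())


def group_split(near_duplicate_sets, test_fraction=0.2, seed=0):
    """Split row indices into (train, test) without separating near duplicates."""
    groups = duplicate_groups(near_duplicate_sets)
    random.Random(seed).shuffle(groups)
    n_total = sum(len(group) for group in groups)
    train, test = [], []
    for group in groups:
        # Fill the test side first; every group is assigned as a whole.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(group)
    return sorted(train), sorted(test)


# Toy mapping reusing the duplicate sets displayed in the notebook,
# plus a few made-up rows (5-9) that have no near duplicates.
example_sets = {
    73: [19427, 30450], 19427: [73, 30450], 30450: [73, 19427],
    167: [53564], 53564: [167],
    5: [], 6: [], 7: [], 8: [], 9: [],
}
train_idx, test_idx = group_split(example_sets, test_fraction=0.3)
print("train:", train_idx)
print("test:", test_idx)
```

With the notebook's `duplicate_results`, such a mapping could presumably be built with `duplicate_results["near_duplicate_sets"].apply(list).to_dict()`. Once each row has a group id, scikit-learn's `GroupShuffleSplit` is a ready-made alternative to the hand-rolled split above.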