From a9aade13c61052527369dc964c8ee3aa3084add3 Mon Sep 17 00:00:00 2001
From: Schwartz
Date: Sun, 23 Jun 2024 20:42:44 -0400
Subject: [PATCH] Updated changes based off Elizabeth's comments

---
 python_clustering/python_clustering.md | 256 +++++++++++++++++++++++--
 1 file changed, 236 insertions(+), 20 deletions(-)

diff --git a/python_clustering/python_clustering.md b/python_clustering/python_clustering.md
index 50b22dec..00438ed9 100644
--- a/python_clustering/python_clustering.md
+++ b/python_clustering/python_clustering.md
@@ -2,7 +2,7 @@
 author: Daniel Schwartz
 email: des338@drexel.edu
-version: 0.0.0
+version: 1.0.0
 current_version_description: Initial version
 module_type: standard
 docs_version: 3.0.0
@@ -68,17 +68,37 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md

## Summary of Key Concepts in Clustering

- **Clustering Definition:** Clustering is an unsupervised machine learning technique used to group unlabeled data points into clusters based on their similarity. The goal is to identify groups of data points that are similar to each other and dissimilar to data points in other groups. Common algorithms include K-Means, hierarchical clustering, and Gaussian Mixture Models.

- **Unsupervised vs. Supervised Learning:** Clustering falls under unsupervised learning, where algorithms are trained on unlabeled data to identify patterns and relationships without prior knowledge. Supervised learning, on the other hand, involves training on labeled data to predict labels for new data points.

- **Applications:** Clustering finds applications in various fields such as customer segmentation, biomedical research, drug development, gene expression analysis, medical image analysis, and disease-risk prediction.

- **K-Means Clustering Algorithm:** K-Means works by iteratively assigning data points to clusters based on their distance to cluster centroids. Key steps include choosing the number of clusters (K), initializing centroids, assigning data points, recalculating centroids, and iterating until convergence (see the sketch after this list).

- **Understanding Techniques:** Techniques like normalization, computing distances from cluster centroids, and visualization aid in building accurate clustering models and interpreting results.

- **Challenges and Limitations:** Challenges include sensitivity to initialization, difficulty in choosing the number of clusters, handling outliers, and interpreting results in high-dimensional data.

- **Mitigating Sensitivity:** Techniques like running the algorithm multiple times with different initializations, using robust algorithms, and preprocessing data help mitigate sensitivity to initialization.

- **Conclusion:** Clustering is a powerful tool with diverse applications, but it's essential to understand its limitations and challenges. With foundational knowledge in clustering techniques, one can explore advanced methods and make informed decisions in data analysis and machine learning endeavors.
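To make those steps concrete, here is a minimal from-scratch sketch of the K-Means loop. It is illustrative only (the implementation below uses scikit-learn), the toy `points` array is hypothetical, and edge cases such as empty clusters are skipped:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((100, 2))  # hypothetical toy data: 100 points in 2-D
k = 2

# Step 1: initialize centroids by picking k distinct data points at random
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(100):  # iterate until convergence (or a safety cap)
    # Step 2: assign each point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Step 3: recalculate each centroid as the mean of its assigned points
    new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])

    # Step 4: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # final cluster centers
```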
-### Python Implementation of K-Means Clustering
+## Python Implementation of K-Means Clustering

This dataset contains various clinical attributes of patients, including their age, sex, chest pain type (cp), resting blood pressure (trtbps), serum cholesterol level (chol), fasting blood sugar (fbs) level, resting electrocardiographic results (restecg), maximum heart rate achieved (thalachh), exercise-induced angina (exng), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slp), number of major vessels (caa) colored by fluoroscopy, thalassemia (thall) type, and the presence of heart disease (output). The data relates to the diagnosis of heart disease, with the output variable indicating whether a patient has heart disease (1) or not (0). Each row represents a different patient, with their respective clinical characteristics recorded.

To implement K-Means clustering in Python using Scikit-learn, we can follow these steps:

-1. Import Libraries
+### 1. Import Libraries

**Description:**
This step imports essential libraries needed for data manipulation, analysis, and visualization, as well as the KMeans clustering algorithm.

* **numpy (np):** This library provides tools for numerical operations and working with arrays, which are essential for data manipulation in machine learning.
* **pandas (pd):** Pandas is used for data analysis and manipulation, especially with tabular data. It makes it easy to load, clean, and organize your data.
@@ -87,6 +107,8 @@ To implement k-means clustering in Python using Scikit-learn, we can follow thes
* **sklearn.cluster (KMeans):** This is where the heart of our clustering algorithm lies. KMeans is the specific algorithm we'll use to group our data into clusters.
* **scipy.spatial (distance):** Scipy is a broader scientific computing library. The distance module provides functions to calculate distances between points, which we'll use in our KMeans analysis.

**Why it's important:**
These libraries provide the foundational tools and functions required to perform data preprocessing, clustering, and visualization. Without them, we wouldn't be able to efficiently handle the data or perform the clustering analysis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial import distance
```
@Pyodide.eval

**Output:**
There is no direct output for this step, as it is focused on importing necessary libraries. However, successful execution without errors indicates that the libraries are correctly imported and ready for use.

### 2. Loading the Data

**Description:**
This step involves loading the patient data from a CSV file into a Pandas DataFrame and then examining the structure of the data.

-2. Loading the Data
* `data = pd.read_csv(file)`: This line reads the CSV (Comma-Separated Values) file, which contains the patient data, into a Pandas DataFrame called `data`. DataFrames are like tables, where each row represents a patient, and each column represents a feature (e.g., age, cholesterol).
* `data.info()`: This function gives you a summary of the DataFrame, showing the column names, their data types, and how many non-null values are in each column. This helps you understand the structure of your data.

**Why it's important:**
Understanding the structure of your data is crucial before performing any data manipulation or analysis. It helps identify any missing values, understand data types, and get a general overview of the dataset.
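Since spotting missing values is one of the goals just mentioned, here is a quick sanity check you could run once the DataFrame exists. This is an illustrative sketch, not one of the numbered steps; it assumes `data` has already been created by the loading code in the next block:

```python
# Count missing values per column of the loaded DataFrame `data`
# (all-zero counts mean the dataset has no gaps to clean up first).
print(data.isnull().sum())
```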
```python @Pyodide.exec
import io
from pyodide.http import open_url

url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/heart.csv"
url_contents = open_url(url)
text = url_contents.read()
file = io.StringIO(text)

data = pd.read_csv(file)
data.info()
print(data.head())
```

@@ -124,8 +158,19 @@ data.info()

-3. Visualize Data
-This code generates a scatter plot with `chol` (Cholesterol) on the x-axis and `trtbps` (Resting Blood Pressure) on the y-axis. The data points are colored based on the `output` column, using the `viridis` colormap. Labels and a title are added, and then the plot is displayed.

**Output:**

`data.info()` gives a summary of the DataFrame, including the number of non-null entries for each column and their data types.
`print(data.head())` displays the first few rows of the DataFrame to give learners a feel for what the data looks like.

### 3. Visualize Data

**Description:** This code generates a scatter plot with `chol` (Cholesterol) on the x-axis and `trtbps` (Resting Blood Pressure) on the y-axis. The data points are colored based on the `output` column, using the `viridis` colormap. Labels and a title are added, and then the plot is displayed.

**Why it's important:**
Visualizing the data before clustering reveals how features relate to one another and whether any natural groupings are already visible. Plotting cholesterol against resting blood pressure, colored by the `output` label, gives an early sense of whether the two diagnostic groups separate along these two features.

```python
# Create the scatter plot
data.plot.scatter(x='chol', y='trtbps', c='output', colormap='viridis')
plt.xlabel("Cholesterol")
plt.ylabel("Resting Blood Pressure")
plt.title("Scatter Plot of Cholesterol vs. Blood Pressure")
plt.show()
```
@Pyodide.eval

-4. Normalize DataFrame

**Output:**
The output is a scatter plot of cholesterol versus resting blood pressure, with each point colored according to the `output` column. This gives you a first look at how the two diagnostic groups are distributed across these two features.

### 4. Normalize DataFrame

**Description:**

* The function `normalize(df, features)` is defined to perform min-max normalization of the features listed in `features` within the DataFrame `df`. It creates a copy `result` of the DataFrame and iterates over each feature to scale its values to the range [0, 1]. The normalized DataFrame `result` is returned.
* The `normalize` function is then applied to the `data` DataFrame to normalize all columns, and the results are stored in `normalized_data`.

**Why it's important:**
Normalization is crucial because it scales the data to a common range without distorting differences in the ranges of values. This ensures that no single feature dominates the clustering algorithm due to its scale, leading to more meaningful and comparable results.

```python
# Normalize dataframe
def normalize(df, features):
    result = df.copy()
    for feature_name in features:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

# Call the normalize function with the entire DataFrame 'data' and all its columns.
# Store the result in 'normalized_data'.
normalized_data = normalize(data, data.columns)

# Print the normalized data to see the transformed values
print(normalized_data)
```
@Pyodide.eval

-5. Run KMeans
-* This line creates a KMeans object.

**Output:**
This code performs min-max normalization on the dataset and prints the resulting `normalized_data`. The output will show the scaled values of each feature, ensuring that all values are between 0 and 1. This step is critical for ensuring that the clustering algorithm treats each feature equally.

### 5. Run KMeans

**Description:** This line creates a KMeans object.

**Why it's important:**
The KMeans algorithm is a popular clustering method that partitions the data into distinct groups (clusters) based on feature similarity. By configuring the parameters, we can control the behavior of the algorithm and ensure consistent results.

* `n_clusters = 2` tells KMeans to find two clusters in your data (see the elbow-plot aside after this list for one way to sanity-check this choice).
* `max_iter = 500` sets a maximum of 500 iterations for the algorithm to converge.
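Before committing to two clusters, it can help to check how the inertia (within-cluster sum of squared distances) falls as K grows. The sketch below is an optional aside, not one of the numbered steps; it assumes `normalized_data` from step 4 and uses scikit-learn's `inertia_` attribute:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit KMeans for several candidate values of K and record the inertia.
# The "elbow" where the curve stops dropping sharply suggests a good K.
inertias = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=2)
    km.fit(normalized_data.values)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker='o')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow plot for choosing K")
plt.show()
```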
@@ -176,38 +244,91 @@ normalized_data = normalize(data, data.columns)

```python
-# Run KMeans
-kmeans = KMeans(n_clusters = 2, max_iter = 500, n_init = 40, random_state = 2)
+# Create KMeans object
+kmeans = KMeans(n_clusters=2, max_iter=500, n_init=40, random_state=2)
+print("KMeans object created with the following parameters:")
+print(f"Number of clusters: {kmeans.n_clusters}")
+print(f"Maximum iterations: {kmeans.max_iter}")
+print(f"Number of initializations: {kmeans.n_init}")
+print(f"Random state: {kmeans.random_state}")
```
@Pyodide.eval

-6. Predict Clusters

**Output:**
The print statements simply confirm the configuration of the KMeans object; creating the object does not cluster anything yet. The impact of this step becomes evident in the following steps, where we fit the model and predict clusters.

### 6. Predict Clusters

**Description:**

* `kmeans.fit_predict()` does two things:
  1. It fits the KMeans model to your normalized data, meaning it finds the cluster centers.
  2. It predicts which cluster each data point belongs to, returning an array `identified_clusters` where each element corresponds to the cluster assignment of a data point.
* We create a copy `results` of the `normalized_data` and add a new column `cluster` to it, storing the identified cluster labels.

**Why it's important:**
Fitting the KMeans model to the data and predicting clusters are crucial steps in the clustering process. By assigning each data point to a cluster, we can analyze patterns and group similar data points together. This can reveal underlying structures in the data and help in further analysis or decision-making processes.

```python
# Fit the KMeans model to the normalized data and predict the clusters
identified_clusters = kmeans.fit_predict(normalized_data.values)

# Create a copy of the normalized data to store the results
results = normalized_data.copy()

# Add the identified cluster labels as a new column 'cluster' in the results DataFrame
results['cluster'] = identified_clusters

# Print the results to observe the DataFrame with the cluster assignments
print(results.head())
```
@Pyodide.eval

-7. Compute Distance from Cluster Centroid
-* This line calculates the Euclidean distance between each data point and its assigned cluster centroid. This distance is stored in the list `distance_from_centroid` and added as a new column `dist` in the results DataFrame.

**Output:**
The output will be a preview of the first few rows of the results DataFrame, which now includes the original normalized data along with the new `cluster` column. In this output:

* Each row corresponds to a data point (e.g., a patient's data in a medical dataset).
* The columns represent the normalized features (e.g., age, sex, cp, etc.).
* The `cluster` column indicates the cluster assignment for each data point, with values such as 0 or 1 representing different clusters.
* This output shows how the data points have been grouped into clusters by the KMeans algorithm.

### 7. Compute Distance from Cluster Centroid

**Description:** This line calculates the Euclidean distance between each data point and its assigned cluster centroid. This distance is stored in the list `distance_from_centroid` and added as a new column `dist` in the results DataFrame.

**Why it's important:**
Computing the distance from each data point to its cluster centroid provides insight into how well the data points are clustered around their centroids.
It helps assess the compactness of clusters and can be useful for evaluating the quality of the clustering.

```python
-distance_from_centroid = [distance.euclidean(val[:-1],kmeans.cluster_centers_[int(val[-1])]) for val in results.values]
+# Calculate the Euclidean distance between each data point and its assigned cluster centroid
+distance_from_centroid = [distance.euclidean(val[:-1], kmeans.cluster_centers_[int(val[-1])]) for val in results.values]
+
+# Add the computed distances as a new column 'dist' in the results DataFrame
results['dist'] = distance_from_centroid
+
+# Print the results to observe the DataFrame with the distance values
+print(results.head())
```
@Pyodide.eval

**Output:**
The output will display the first few rows of the results DataFrame with the newly added `dist` column, representing the distances of each data point from its assigned cluster centroid. This output lets you see how the distances are calculated and the effect the clustering has on the data.

-8. Train the clustering model and visualize
-* Creates a scatter plot of `chol` (Cholesterol) against `trtbps` (Resting Blood Pressure), colored by the identified clusters, with marker size proportional to the distance from the cluster centroid.

### 8. Visualize the Clusters

**Description:** Creates a scatter plot of `chol` (Cholesterol) against `trtbps` (Resting Blood Pressure), colored by the identified clusters, with marker size proportional to the distance from the cluster centroid. (The model itself was already fit in step 6; this step visualizes the result.)

**Why it's important:**
Visualization is crucial for understanding clustering results. By plotting the data points with identified clusters, we can visually inspect how well the clustering algorithm has grouped similar data points together. Additionally, using marker size to represent the distance from the cluster centroid provides insights into the compactness of each cluster.

@@ -217,8 +338,42 @@
```python
results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist')
plt.show()
```
@Pyodide.eval

**Output:**
The output is a scatter plot where each data point is represented by a marker. The markers are colored based on the identified clusters, and their sizes vary depending on the distance from the cluster centroid. This visualization lets you inspect how the data points are grouped into clusters and how compact each cluster is.

## Review your knowledge

```python
from sklearn.cluster import KMeans

K = 2  # the desired number of clusters

# Create a KMeans instance with ____ clusters
kmeans = KMeans(____=K)

# Fit the model to the data
kmeans.fit(____)

# Get the cluster centroids
centroids = kmeans.cluster_centers_

# Predict the cluster labels for the data points
labels = kmeans.predict(data)
```

Fill in the blanks to implement the K-Means clustering algorithm in Python:

[( )] `k`, `k`, `X`
[( )] `n_clusters`, `K`, `data`
[(X)] `K`, `n_clusters`, `data`
[( )] `data`, `n_clusters`, `K`
***
This question tests your understanding of how to implement the K-Means clustering algorithm using the scikit-learn library in Python. To answer correctly, you need to identify the correct placeholders for the number of clusters, the constructor parameter that receives it, and the dataset. In the correct option, "`K`, `n_clusters`, `data`," `K` is the number of clusters, `n_clusters` is the `KMeans` parameter it is passed to, and `data` is the dataset passed to `fit()`.
***
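Step 7's centroid distances are one way to judge cluster quality; scikit-learn's silhouette score is another compact summary. The sketch below is an optional aside (it assumes `normalized_data` and `identified_clusters` from the implementation steps above):

```python
from sklearn.metrics import silhouette_score

# Silhouette ranges from -1 to 1: values near 1 mean points sit well
# inside their own cluster, values near 0 mean neighboring clusters overlap.
score = silhouette_score(normalized_data.values, identified_clusters)
print(f"Silhouette score for K=2: {score:.3f}")
```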
@@ -227,6 +382,7 @@ plt.show()

Through this lesson, you've gained a solid foundation in clustering, a cornerstone of unsupervised machine learning. You've learned how the K-Means algorithm works, its strengths and limitations, and most importantly, how to harness it within Python's powerful data science ecosystem.

+### Key Takeaways
Here's a summary of key takeaways to keep in mind:

* **Clustering Unveils Hidden Structures:** K-Means can reveal meaningful groupings within your data that might not be immediately apparent. This is crucial for tasks like customer segmentation, anomaly detection, and even preliminary exploration before applying more complex models.
@@ -235,8 +391,7 @@ Here's a summary of key takeaways to keep in mind:
* **K-Means Isn't Perfect:** Remember that K-Means has its limitations. It assumes clusters are spherical and of equal size, which isn't always the case in real-world data. Additionally, choosing the optimal number of clusters (K) requires careful consideration and experimentation.

-**Looking Ahead: Beyond K-Means**
-
+### Beyond K-Means
While K-Means is a great starting point, the world of clustering is vast. As you progress in your machine learning journey, you'll encounter more sophisticated algorithms like DBSCAN, hierarchical clustering, and Gaussian mixture models. Each has its own strengths and use cases.

Consider exploring these areas to expand your clustering toolkit:
@@ -249,6 +404,67 @@ The knowledge you've gained here equips you to tackle a wide range of data analy

## Additional Resources

### Full Code Implementation

This "Full Code" section consolidates all of the module's code into a single cell block, which makes it easy to copy and paste for anyone who wants to run the entire process at once. While this single block isn't designed as a step-by-step educational tool, it serves as a convenient reference for future use and streamlines the workflow for those already familiar with the concepts. Below is the complete code implementation:

```python
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial import distance
import io
from pyodide.http import open_url

# Load Data
url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/heart.csv"
url_contents = open_url(url)
text = url_contents.read()
file = io.StringIO(text)
data = pd.read_csv(file)
data.info()

# Visualize Data
data.plot.scatter(x='chol', y='trtbps', c='output', colormap='viridis')
plt.xlabel("Cholesterol")
plt.ylabel("Resting Blood Pressure")
plt.title("Scatter Plot of Cholesterol vs. Blood Pressure")
plt.show()

# Normalize DataFrame
def normalize(df, features):
    result = df.copy()
    for feature_name in features:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

normalized_data = normalize(data, data.columns)

# Run KMeans
kmeans = KMeans(n_clusters=2, max_iter=500, n_init=40, random_state=2)

# Predict Clusters
identified_clusters = kmeans.fit_predict(normalized_data.values)
results = normalized_data.copy()
results['cluster'] = identified_clusters

# Compute Distance from Cluster Centroid
distance_from_centroid = [distance.euclidean(val[:-1], kmeans.cluster_centers_[int(val[-1])]) for val in results.values]
results['dist'] = distance_from_centroid

# Visualize Clusters
results.plot.scatter(x='chol', y='trtbps', c='cluster', colormap='viridis', s='dist')
plt.xlabel("Cholesterol")
plt.ylabel("Resting Blood Pressure")
plt.show()
```
@Pyodide.eval

## Feedback

@feedback