Lecture

Introduction to Clustering (K-Means)


Clustering is an unsupervised learning method where the goal is to group similar data points into clusters without using labels.

One of the most popular algorithms for clustering is K-Means.


How K-Means Works

  1. Choose k – the number of clusters.
  2. Initialize k cluster centers randomly.
  3. Assign points to the nearest center.
  4. Update centers to be the mean of their assigned points.
  5. Repeat steps 3–4 until the cluster assignments stop changing.
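
The five steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name kmeans, its parameters (max_iter, seed), the choice to start from randomly sampled data points, and the toy data are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain-Python K-Means sketch; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None

    for _ in range(max_iter):
        # Step 3: assign each point to its nearest center (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 5: stop once no assignment changed.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 4: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]

    return centers, assignments


# Made-up toy data for illustration: two small groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0)]
centers, assignments = kmeans(points, k=2)  # Step 1: k is chosen by the caller
print(centers)
print(assignments)
```

The result depends on the random starting centers, which is why library implementations usually rerun the algorithm from several starts and keep the best result.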

K-Means tries to minimize the within-cluster sum of squares: the total squared Euclidean distance between each point and the center of its assigned cluster.
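
Written as a formula, this objective is commonly expressed as shown below. The symbols are notation introduced here for illustration and do not come from the lecture: C_j is the set of points assigned to cluster j, mu_j is the mean of those points, and x_i is a single data point.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Neither the assignment step nor the update step can increase J, which is why repeating them eventually settles on a stable set of clusters (though not necessarily the best possible one).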


When to Use K-Means

  • You want to group data by similarity without predefined labels.
  • Your dataset has numerical features and a moderate number of dimensions.
  • You suspect the data contains clear, compact groups (K-Means works best when clusters are roughly spherical and similar in size).

Example: Clustering Iris Data

K-Means Example
# Install scikit-learn in Jupyter Lite
import piplite
await piplite.install('scikit-learn')

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load data (only first two features for visualization)
iris = load_iris()
X = iris.data[:, :2]

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Centers')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("K-Means Clustering (Iris)")
plt.legend()
plt.show()

Key Takeaways

  • Unsupervised learning means no labels are provided.
  • K-Means groups data into k clusters by minimizing the squared distances between points and their cluster centers.
  • Choosing the right value of k is critical; a common heuristic is the elbow method.
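
A rough sketch of the elbow method on the same two Iris features used in the example above: fit K-Means for several values of k and compare the inertia (the within-cluster sum of squared distances that scikit-learn exposes as inertia_). The range of k values, n_init=10, and the random seed are choices made for this illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]  # first two features, as in the Iris example

inertias = []
for k in range(1, 7):
    # n_init=10 reruns K-Means from 10 random starts and keeps the best fit.
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(range(1, 7), inertias):
    print(f"k={k}: inertia={inertia:.2f}")
```

Inertia generally keeps falling as k grows, so the lowest value is not the target. The "elbow" is the k after which further increases bring only small improvements; plotting inertia against k usually makes that point easier to spot.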

What’s Next?

In the next lesson, we’ll look at Model Selection and Cross-Validation to ensure our models generalize well to unseen data.

Quiz
K-Means clustering requires labeled data to group similar data points.

True
False
