Implementing k-Means Clustering Algorithm in Python

English
Clustering
Machine Learning
Author

Manuel Solano

Published

July 15, 2024

Clustering is a fundamental technique in data science and machine learning, used to group similar data points together. One of the most popular clustering algorithms is k-Means, which aims to partition a set of points into k clusters by minimizing the variance within each cluster. This algorithm is widely used in various applications, from market segmentation to image compression.
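Concretely, given clusters C_1, …, C_k with centroids (means) μ_1, …, μ_k, k-Means seeks the partition that minimizes the within-cluster sum of squared distances:

```latex
\min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x
```

The standard (Lloyd's) algorithm approaches this objective by alternating between assigning each point to its nearest centroid and recomputing each centroid as the mean of its cluster.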

In this blog post, I will walk you through the process of implementing the k-Means algorithm in Python. We’ll start with a given set of points, a specified number of clusters, initial centroids, and a maximum number of iterations. Our goal is to iteratively assign each point to the nearest centroid and update the centroids until they stabilize or the iteration limit is reached.

The challenge we will address involves:

- Defining a function that takes a list of points, the number of clusters (k), initial centroids, and the maximum number of iterations.
- Iteratively updating the centroids based on the assignments of points.
- Ensuring the function returns the final centroids after convergence or once the maximum number of iterations is reached.
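Before building the full function, the assign-and-update cycle can be sketched in a few lines of NumPy. This is a minimal sketch using the example data from this post; the variable names here are illustrative and not part of the final function:

```python
import numpy as np

# Example data: two well-separated groups and one starting centroid per group
points = np.array([(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)])
centroids = np.array([(1.0, 1.0), (10.0, 1.0)])

# Assignment step: distance from every point to every centroid,
# then the index of the nearest centroid for each point
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(distances, axis=1)

# Update step: move each centroid to the mean of its assigned points
centroids = np.array([points[labels == j].mean(axis=0)
                      for j in range(len(centroids))])
```

On this data a single iteration already moves the centroids to (1, 2) and (10, 2); a full implementation simply repeats these two steps until the centroids stop moving or the iteration limit is reached.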

Example:
        input: points = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)], k = 2, initial_centroids = [(1, 1), (10, 1)], max_iterations = 10
        output: [(1, 2), (10, 2)]
        reasoning: Given the initial centroids and a maximum of 10 iterations,
        each point is assigned to its nearest centroid, and each centroid is
        updated to the mean of its assigned points, yielding final centroids
        that approximate the means of the two clusters. The exact number of
        iterations needed may vary, but the process stops after at most 10.

Input

points = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
k = 2
initial_centroids = [(1, 1), (10, 1)]
max_iterations = 10

Function

Here is the function I created to solve this challenge:

import numpy as np

def k_means_clustering(points: list[tuple[float, float]], k: int, initial_centroids: list[tuple[float, float]], max_iterations: int) -> list[tuple[float, float]]:

    def euc_distance(p1, p2):
        # Euclidean distance between two points
        return np.sqrt(np.sum((np.array(p1) - np.array(p2)) ** 2))

    points = np.array(points)
    centroids = np.array(initial_centroids, dtype=float)

    for _ in range(max_iterations):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.array([[euc_distance(p, c) for c in centroids] for p in points])
        labels = np.argmin(distances, axis=1)

        # Update step: move each centroid to the mean of its assigned points;
        # a centroid with no assigned points keeps its previous position
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop early once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return [tuple(map(float, c)) for c in centroids]

k_means_clustering(points, k, initial_centroids, max_iterations)
[(1.0, 2.0), (10.0, 2.0)]

Visual Representation