Machine learning is building a predictive model using historical data to make predictions on new data. Clustering is also a machine learning technique used to find distinct groups in a dataset. Clustering algorithms group datapoints in clusters in a way that similar datapoints are grouped together.
Since, the output of model is probabilistic, evaluating your predictions becomes a crucial step. In this post, we will explore some of the most popular evaluation metrics for clustering problems.

Measure Clustering Performance

For clustering a dataset with no target to aim for, we need to look for other types of measurement that give us an indication of performance. The most common measurement for cluster performance is the distinctness of the clusters created, this is because the most common goal for clustering is to create clusters that are as unique as possible.

Best Clustering Metrics

The most common ways of measuring the performance of clustering models are to either measure the distinctiveness or similarity between the groups. Given this, there are three common metrics to use.

Silhouette Score

Silhouette Score compares the similarity of datapoints within the same cluster to datapoints in different clusters. It is the mean silhouette coefficient for all clusters, which is calculated using the mean intra-cluster distance and the mean nearest-cluster distance. This score is between -1 and 1, where the higher the score the more well-defined your clusters are. Silhouette Score mathematical formula is calculated as:

Silhouette_Score_Mathematical_Formula
Image source: geeksforgeeks.org

Calinski-Harabaz Index

Calinski-Harabasz Index measures the ratio of between-cluster dispersion to within-cluster dispersion in order to measure the distinctiveness between groups. Like Silhouette score, the higher the score the more well-defined the clusters are. Calinski-Harabasz score has no bound. Meaning there is no acceptable value. Calinski-Harabasz index mathematical formula is calculated as:

Calinski_Harabasz_Index_Mathematical_Formula

Davies-Bouldin Index

Davies-Bouldin Index estimates the average similarity between each cluster and its most comparable cluster. It evaluates the clustering quality by considering both the separation between clusters and their compactness. Davies-Bouldin Index score measures the similarity of your clusters, meaning that the lower the score the better separation there is between your clusters. Davies-Bouldin index mathematical formula is calculated as:

Davies_Bouldin_index_mathematical_formula

Adjusted Rand Index

The adjusted rand index is an evaluation metric that is used to measure the similarity between two clustering by considering all the samples pairs and calculating the counting pairs of the assigned in the same or different clusters in the actual and predicted clustering.
It measures the quality of clustering by comparing how well the clusters align with the true labels. It ranges the comparing result from -1 to 1, where 1 indicates perfect clustering, 0 indicates random clustering, and negative values suggest poor clustering. Adjusted Rand index mathematical formula is calculated as:

Adjusted_Rand_index_mathematical_formula

Mutual Information

Mutual Information is used to assess the agreement between the true labels of datapoints and the cluster assignments produced by a clustering algorithm. It quantifies the shared information and measures how well the clusters capture the underlying structure in dataset. Meaning it is used to check the mutual information in the actual label target versus the predicted label, High values indicate better alignment between clusters and true labels. Mutual Information mathematical formula is calculated as:

Mutual_Information_mathematical_formula

What is the best Clustering Metric?

The most commonly used metric for comparing the performance of a clustering algorithm is the Silhouette Score which has an upper and lower bound, that’s bounded between -1 and 1.

Recommended for you:

Leave a Reply

Your email address will not be published. Required fields are marked *