Understanding Clustering in Unsupervised Learning

Arif Romadhan
4 min read · Oct 4, 2020

A simple explanation of clustering in Unsupervised Learning

Remember Unsupervised Learning?

In the previous article, I explained Unsupervised Learning. Unsupervised Learning means discovering patterns in input data that comes without any labels.

According to Wikipedia:

Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision. In contrast to supervised learning that usually makes use of human-labeled data, unsupervised learning, also known as self-organization, allows for modeling of probability densities over inputs.

[Figure: architecture or framework of unsupervised learning]

There are three kinds of task in Unsupervised Learning:

Clustering, Dimensionality Reduction, and Association Rules

Clustering: grouping data based on similarity patterns

Several methods or algorithms can be used for clustering: K-Means Clustering, Affinity Propagation, Mean Shift, Spectral Clustering, Hierarchical Clustering, DBSCAN, etc.

This article explains only the intuition behind clustering in Unsupervised Learning.

Clustering: Intuition

Clustering data into 1 group based on similarity patterns

Clustering data into 2 groups based on similarity patterns

Clustering data into 3 groups based on similarity patterns

Clustering data into 4 groups based on similarity patterns

How do we know that a point belongs to the same group as another point?

As mentioned above: based on similarity patterns

How do we measure the similarity between two points?

The answer is: based on distance. The smaller the distance between two points, the more similar they are.
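To make the idea concrete, here is a minimal Python sketch that puts each point into the group whose center is closest. The points, the two centers, and the helper name `assign_to_nearest` are hypothetical, chosen only for illustration. The sketch assumes straight-line distance (via `math.dist`, available from Python 3.8) and does not reproduce any particular clustering algorithm.

```python
import math

def assign_to_nearest(points, centers):
    """Return, for each point, the index of the closest center."""
    labels = []
    for point in points:
        # Straight-line distance from this point to every center
        distances = [math.dist(point, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical data: two points near (1, 1.5) and two near (8.5, 9)
points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5)]
centers = [(1, 1.5), (8.5, 9)]

print(assign_to_nearest(points, centers))  # [0, 0, 1, 1]
```

Points that are close to the same center end up with the same label, which is the sense in which distance stands in for similarity.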

How do we measure the distance between two points?

There are several ways to measure distance:

  • Euclidean Distance
  • Manhattan Distance
  • Minkowski Distance
  • Hamming Distance

Euclidean Distance

Euclidean Distance is the straight-line distance between two points, that is, the length of the shortest path between them.

Source: Role of Distance Metrics in Machine Learning

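Mathematically, for two points x = (x_1, ..., x_n) and y = (y_1, ..., y_n), we can write this formula as

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$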

Example case:

In this case, the Euclidean Distance between the points is approximately 6.3
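The calculation can be sketched in a few lines of Python. The two points below are hypothetical, picked so that the result matches the value of about 6.3 quoted above. They are not necessarily the points from the original example, and `euclidean_distance` is an illustrative helper name.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference in each dimension, sum, then take the root
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Hypothetical points: the differences are 2 and 6, so sqrt(4 + 36) = sqrt(40)
point_a = (1, 2)
point_b = (3, 8)

print(round(euclidean_distance(point_a, point_b), 1))  # 6.3
```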

Manhattan Distance

Manhattan Distance is the sum of the absolute differences between the coordinates of two points across all dimensions.

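Mathematically, for two points x = (x_1, ..., x_n) and y = (y_1, ..., y_n), we can write this formula as

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$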

Example case:

In this case, the Manhattan Distance between the points is 8
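A minimal Python sketch of the same calculation follows. The points are hypothetical and chosen so that the result equals the value of 8 quoted above. `manhattan_distance` is an illustrative helper name.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

# Hypothetical points: |1 - 3| + |2 - 8| = 2 + 6
point_a = (1, 2)
point_b = (3, 8)

print(manhattan_distance(point_a, point_b))  # 8
```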

Minkowski Distance

Minkowski Distance is the generalized form of Euclidean and Manhattan Distance.

Mathematically, for two points x and y in n dimensions and an order p ≥ 1, we can write this formula as

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Minkowski distance can behave like Manhattan or Euclidean distance. The chosen value of p determines which one:

  • p = 1: Manhattan distance
  • p = 2: Euclidean distance
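The two special cases can be checked with a short Python sketch. The helper name `minkowski_distance` and the points are hypothetical, and the order parameter is called `p` here.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

# Hypothetical points with coordinate differences 2 and 6
point_a = (1, 2)
point_b = (3, 8)

print(minkowski_distance(point_a, point_b, p=1))            # 8.0, same as Manhattan
print(round(minkowski_distance(point_a, point_b, p=2), 1))  # 6.3, same as Euclidean
```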

Hamming Distance

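Hamming Distance measures how different two strings of the same length are. It is the number of positions at which the corresponding characters differ, so a smaller Hamming Distance means the strings are more similar.

Mathematically, for two strings x and y of length n, we can write this formula as

$$d_H(x, y) = \sum_{i=1}^{n} \delta(x_i, y_i), \quad \delta(x_i, y_i) = \begin{cases} 1 & \text{if } x_i \neq y_i \\ 0 & \text{otherwise} \end{cases}$$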

Example case:
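As a small illustration, the count can be sketched in Python. The two strings below are hypothetical and are not necessarily the ones used in the original example. `hamming_distance` is an illustrative helper name.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    # Each mismatch counts as 1 (True), each match as 0 (False)
    return sum(c1 != c2 for c1, c2 in zip(s, t))

# Hypothetical strings: they differ at positions 3, 4 and 5 (r/t, o/h, l/r)
print(hamming_distance("karolin", "kathrin"))  # 3
```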

Continue Learning

  1. How to implement Unsupervised Learning with K-Means Clustering?
  2. How to implement Unsupervised Learning with DBSCAN Clustering?
  3. How to implement Unsupervised Learning with Gaussian Mixture Models Clustering?
  4. Dimensionality Reduction

About Me

I’m a Data Scientist focusing on Machine Learning and Deep Learning. You can reach me on Medium and LinkedIn.

My website: https://komuternak.com/

References

  1. Machine Learning Explanation : Supervised Learning & Unsupervised Learning
  2. Role of Distance Metrics in Machine Learning
