Clustering

Clustering is used to find groups of similar data points in the data. It is an unsupervised learning task: the goal is to group data points by similarity without any labels telling us which group each point belongs to.

Variable Naming:

  • $m$: Number of training examples
  • $x$: Data point
  • $x^{(i)}$: Data point of the $i$-th training example (one-based index)
  • $K$: Number of clusters
  • $\mu_k$: Centroid of the $k$-th cluster
  • $c^{(i)}$: Index of the cluster centroid closest to $x^{(i)}$
  • $\mu_{c^{(i)}}$: Centroid of the cluster assigned to $x^{(i)}$

K-Means Clustering

K-Means clustering is a popular clustering algorithm that groups data points into K clusters. The algorithm works as follows:

  1. Initialize K centroids randomly.
  2. Assign each data point to the nearest centroid.
  3. Update the centroids by calculating the mean of all data points assigned to that centroid.
  4. Repeat steps 2 and 3 until the centroids do not change significantly.
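The four steps above can be sketched as follows. This is a minimal illustration (function name, random-example initialization, and convergence check are my choices, not prescribed by the algorithm):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch. X is an (m, n) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialize K centroids randomly (here: pick K random training examples).
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each data point to the nearest centroid (this is c^(i)).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # 3. Move each centroid to the mean of the points assigned to it.
        #    (A centroid with no assigned points keeps its old position.)
        new_centroids = np.array([
            X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
            for k in range(K)
        ])
        # 4. Stop once the centroids no longer change significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment with the converged centroids.
    c = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return centroids, c
```

In practice K-means is usually run several times with different random initializations, keeping the run with the lowest cost, since the result depends on where the centroids start.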

Cost Function

The cost function for K-Means clustering is the sum of the squared Euclidean distances between each data point and its assigned centroid:

$$ J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} ||x^{(i)} - \mu_{c^{(i)}}||^2 $$

The cost function is also called the distortion function.
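The distortion can be computed directly from the definition; a small sketch (the function name is my own):

```python
import numpy as np

def distortion(X, c, centroids):
    """Cost J: mean squared distance from each x^(i) to its assigned centroid.

    X is (m, n), c is the length-m array of cluster indices c^(i),
    centroids is (K, n); centroids[c] picks mu_{c^(i)} for each example.
    """
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
```

Each iteration of K-means (both the assignment step and the update step) can only decrease this value, which is why the algorithm converges.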

Number of Clusters

Choosing the right number of clusters is crucial for the performance of the algorithm.

One way to determine the number of clusters is the Elbow Method: plot the cost function $J$ against the number of clusters $K$. The cost always decreases as $K$ grows, so the point where the curve bends, i.e. where the cost starts to decrease much more slowly, suggests a reasonable number of clusters.
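As a toy illustration of reading off the elbow (the cost values and the 10%-of-initial-cost threshold below are invented for illustration, not part of the method):

```python
# Hypothetical distortion values J for K = 1..6 (illustrative numbers only).
costs = [100.0, 40.0, 15.0, 12.0, 10.5, 9.8]

# Decrease in cost gained by going from K clusters to K + 1 clusters.
drops = [costs[k] - costs[k + 1] for k in range(len(costs) - 1)]

# The "elbow": the first K after which adding another cluster barely helps
# (here, arbitrarily: the drop falls below 10% of the initial cost).
elbow_K = next(k + 1 for k, d in enumerate(drops) if d < 0.1 * costs[0])
```

In practice one usually just inspects the plot; automating the threshold, as above, is a heuristic.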

Often, the number of clusters is chosen based on the required downstream task. This should be the most influential factor in determining the number of clusters!

Anomaly Detection

Anomaly detection is the identification of rare items, events, or observations that raise suspicions by differing significantly from the majority of the data.

Gaussian Distribution

Anomaly detection is often done using the Gaussian distribution (or normal distribution). The Gaussian distribution is defined by the mean $\mu$ and the variance $\sigma^2$. The distribution will be bell-shaped and symmetric around the mean.

The probability of $x$ given the Gaussian distribution is:

$$ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

Parameter Estimation

The parameters $\mu$ and $\sigma^2$ can be estimated from the data.

$\mu$ can be calculated as the mean of the data: $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$

$\sigma^2$ can be calculated as the variance of the data: $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu)^2$
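For multiple features, these estimates are computed per feature (per column of the data matrix). A small sketch (the function name is my own; note NumPy's `var` uses the $\frac{1}{m}$ formula by default):

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature MLE for a Gaussian: mean and variance of each column of X."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # ddof=0 by default, matching the 1/m formula above
    return mu, sigma2
```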

Anomaly Detection Algorithm

  1. Choose $n$ features $x_j$ that might be indicative of anomalous examples.
  2. Fit parameters $\mu_1, \ldots, \mu_n, \sigma_1^2, \ldots, \sigma_n^2$.
  3. Calculate the probability $p(x)$ for new example $x$. If $p(x) < \epsilon$, flag this example as an anomaly.

$p(x)$ is calculated as the product of the per-feature probabilities, treating the features as independent:

$$ p(x) = p(x_1; \mu_1, \sigma_1^2) \cdot p(x_2; \mu_2, \sigma_2^2) \cdot \ldots \cdot p(x_n; \mu_n, \sigma_n^2) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) $$
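Putting the pieces together, the algorithm can be sketched as follows (function names and the value of $\epsilon$ are illustrative; the feature-independence assumption is the one stated above):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density, applied elementwise to each feature."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def p(x, mu, sigma2):
    """p(x): product of the per-feature Gaussian densities."""
    return np.prod(gaussian_pdf(x, mu, sigma2))

def is_anomaly(x, mu, sigma2, epsilon=1e-3):
    """Flag x as an anomaly if p(x) falls below the threshold epsilon."""
    return p(x, mu, sigma2) < epsilon
```

The threshold $\epsilon$ is typically tuned on a labeled validation set containing a few known anomalies.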