In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, predetermined number of clusters, non-determinism) is a concern.
hierarchical agglomerative clustering (HAC):
- a bottom-up algorithm: treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents
- visualized as a dendrogram
- The y-coordinate of the horizontal merge line is the similarity of the two clusters that were merged → combination similarity (in a rotated dendrogram it is the x-coordinate instead)
- fundamental assumption in HAC: the merge operation is monotonic
- monotonic: if $s_1, s_2, \ldots, s_{K-1}$ are the combination similarities of the successive merges of an HAC, then $s_1 \ge s_2 \ge \ldots \ge s_{K-1}$ holds
Cutting point criteria if we want a partition of disjoint clusters:
- Cut at a prespecified level of similarity
- Cut the dendrogram where the gap between two successive combination similarities is largest.
- Apply the criterion from the previous chapter: cut at the number of clusters $K$ that minimizes $\text{RSS}(K) + \lambda K$, i.e. residual sum of squares plus a penalty $\lambda$ per additional cluster
- Cut when we reach a prespecified number of clusters K
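A hedged sketch of how these cutting criteria can be applied with SciPy's hierarchy utilities (SciPy works with distances rather than similarities; the random data, the 0.5 threshold, and K = 4 below are illustrative assumptions, not values from the text):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document vectors, purely illustrative.
X = np.random.RandomState(0).rand(20, 5)

# Z is the (N-1) x 4 linkage matrix; Z[:, 2] holds the merge distances
# (distances play the role of negated combination similarities here).
Z = linkage(X, method="single", metric="cosine")

# 1) Cut at a prespecified level of (dis)similarity.
labels_threshold = fcluster(Z, t=0.5, criterion="distance")

# 2) Cut where the gap between successive combination similarities is largest.
gaps = np.diff(Z[:, 2])              # gaps between successive merge distances
cut_height = Z[np.argmax(gaps), 2]   # last merge before the largest jump
labels_gap = fcluster(Z, t=cut_height, criterion="distance")

# 3) Cut when a prespecified number of clusters K is reached.
labels_k = fcluster(Z, t=4, criterion="maxclust")
```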
A simple, naive HAC algorithm:
- compute the N × N similarity matrix C
- execute N − 1 merging steps; in each step:
- merge the two currently most similar clusters
- update the rows and columns of the merged cluster i in C
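A minimal sketch of this naive algorithm, assuming length-normalized document vectors and dot-product (cosine) similarity; the single-link row/column update in the last step is my own choice for concreteness (other linkage criteria would update those entries differently):

```python
import numpy as np

def naive_hac(docs):
    """Naive HAC, Theta(N^3). docs: (N, M) array of length-normalized vectors."""
    N = len(docs)
    C = docs @ docs.T                      # N x N cosine similarity matrix
    np.fill_diagonal(C, -np.inf)           # never merge a cluster with itself
    active = list(range(N))                # indices of clusters still in play
    merges = []
    for _ in range(N - 1):                 # N - 1 merge steps
        # find the currently most similar pair of active clusters
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda p: C[p])
        merges.append((i, j, C[i, j]))
        active.remove(j)                   # merged cluster keeps label i
        # update row/column i of C; shown here with the single-link rule
        C[i, :] = np.maximum(C[i, :], C[j, :])
        C[:, i] = C[i, :]
        C[i, i] = -np.inf
    return merges
```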
The base algorithm of the following variants is the same as naive HAC; they differ only in how the similarity of two clusters is defined.
single-link / single-linkage clustering:
- the similarity of two clusters is the similarity of their most similar members
- the merge criterion is local: it pays attention solely to the area where the two clusters come closest to each other
complete-link / complete-linkage clustering:
- the similarity of two clusters is the similarity of their most dissimilar members, i.e. the algorithm chooses the cluster pair whose merge has the smallest diameter
- the merge criterion is non-local: the entire structure of the clustering can influence merge decisions
These two algorithms reduce the assessment of cluster quality to a single similarity between a pair of documents and therefore both algorithms often produce undesirable clusters.
for example:
- single link problem: a chain of points can be extended for long distances without regard to the overall shape of the emerging cluster → chaining
- complete-link problem: it pays too much attention to outliers, points that do not fit well into the global structure of the cluster.
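To make the two merge criteria concrete, a small sketch (function and parameter names are my own) that computes the similarity of two clusters from a precomputed document-document similarity matrix:

```python
import numpy as np

def cluster_similarity(sim, cluster_a, cluster_b, linkage="single"):
    """sim: N x N document similarity matrix; clusters are lists of document indices."""
    pair_sims = sim[np.ix_(cluster_a, cluster_b)]   # all cross-cluster pairs
    if linkage == "single":      # similarity of the two *most similar* members
        return pair_sims.max()
    if linkage == "complete":    # similarity of the two *most dissimilar* members
        return pair_sims.min()
    raise ValueError(linkage)
```

At every step, HAC merges the pair of clusters for which this value is highest; with the "complete" variant that is exactly the pair whose merge has the smallest diameter.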
time complexity:
- naive algorithm: $\Theta(N^3)$ (each of the N − 1 merge steps scans the whole N × N matrix)
- using priority queues over the rows of C: $\Theta(N^2 \log N)$
- single-link: can be brought down to $\Theta(N^2)$ (see below)
Q: Why can we improve single-link but not the others?
A: single-link is best-merge persistent, i.e. suppose that the best merge candidate for ωk is ωj in single-link clustering; then after ωj is merged with a third cluster ωi != ωk, the merged cluster ωi ∪ ωj is still ωk's best merge candidate.
We can store an array NBM (next best merge) with one entry per cluster; after each merge only the entry of the newly merged cluster has to be recomputed, and no other entries need updating, so the update is O(N) and no priority queue is needed. We cannot do this for the other similarity functions because they are not best-merge persistent.
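A sketch of this idea, assuming a precomputed similarity matrix; variable names such as `nbm` are my own, and the merged cluster simply keeps the smaller index:

```python
import numpy as np

def single_link_nbm(sim):
    """Single-link HAC in Theta(N^2) using a next-best-merge (NBM) array.
    sim: N x N similarity matrix (larger = more similar)."""
    N = sim.shape[0]
    C = np.array(sim, dtype=float)
    np.fill_diagonal(C, -np.inf)
    active = np.ones(N, dtype=bool)
    nbm = C.argmax(axis=1)                 # best merge candidate of each cluster
    merges = []

    for _ in range(N - 1):
        # pick the active cluster whose best merge is most similar
        best = np.where(active, C[np.arange(N), nbm], -np.inf)
        i = int(best.argmax())
        j = int(nbm[i])
        merges.append((i, j, C[i, j]))

        # single-link update: cluster i absorbs j (row-wise maximum), j is retired
        C[i, :] = np.maximum(C[i, :], C[j, :])
        C[:, i] = C[i, :]
        C[i, i] = -np.inf
        active[j] = False
        C[j, :] = -np.inf
        C[:, j] = -np.inf

        # best-merge persistence: clusters that pointed at j now point at i,
        # and only the merged cluster i needs a fresh O(N) scan
        nbm[nbm == j] = i
        nbm[i] = int(C[i, :].argmax())
    return merges
```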
group-average agglomerative clustering (GAAC), aka: group-average clustering, average-link clustering:
Evaluates cluster quality based on all similarities between documents
Since the sum of the individual vector similarities equals the dot product of the vector sums, the group-average similarity can be computed from the sum of the document vectors in the two clusters (only true if similarity is the dot product, i.e. cosine on length-normalized vectors):
$\text{SIM-GA}(\omega_i, \omega_j) = \frac{1}{(N_i + N_j)(N_i + N_j - 1)}\Big[\big(\sum_{d_m \in \omega_i \cup \omega_j} \vec{d}_m\big)^2 - (N_i + N_j)\Big]$
where $N_i$, $N_j$ are the cluster sizes and the subtracted term $(N_i + N_j)$ removes the self-similarities $\vec{d}_m \cdot \vec{d}_m = 1$.
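A small numeric check of this shortcut, assuming length-normalized vectors and dot-product similarity (the random data stand in for the union of the two clusters being evaluated):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((6, 4))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # length-normalize

merged = docs            # the union of the two clusters being evaluated
n = len(merged)

# Direct definition: average dot product over all pairs of distinct documents.
pair_sims = merged @ merged.T
direct = (pair_sims.sum() - n) / (n * (n - 1))   # subtract the n self-similarities (each = 1)

# Shortcut: the same value computed from the sum vector alone.
s = merged.sum(axis=0)
shortcut = (s @ s - n) / (n * (n - 1))

print(np.isclose(direct, shortcut))              # True
```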
centroid clustering: the similarity of two clusters is defined as the similarity of their centroids
- centroid clustering is not monotonic: inversions can occur, i.e. the combination similarity can increase during clustering (with similarity defined as negative distance)
- the other three clustering algorithms are monotonic; in a monotonic HAC algorithm, the combination similarity decreases monotonically from iteration to iteration
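A tiny numeric illustration of an inversion (the three points are my own choice, picked so that the effect shows up): the second merge happens at a smaller distance, i.e. higher similarity, than the first.

```python
import numpy as np

eps = 0.05
d1 = np.array([1.0 + eps, 1.0])
d2 = np.array([5.0, 1.0])
d3 = np.array([3.0, 1.0 + 2.0 * np.sqrt(3.0)])

# First merge: d1 and d2 are the closest pair.
print(np.linalg.norm(d1 - d2))   # ~3.95  (smallest pairwise distance)
print(np.linalg.norm(d1 - d3))   # ~3.98
print(np.linalg.norm(d2 - d3))   # ~4.00

# Second merge: the centroid of {d1, d2} is *closer* to d3 than d1 was to d2.
centroid = (d1 + d2) / 2
print(np.linalg.norm(centroid - d3))   # ~3.46 < 3.95  -> inversion
```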
We define the combination similarity COMB-SIM of a clustering W = {ω1, . . . , ωK} as the smallest combination similarity of any of its K clusters, and we call a clustering optimal if no clustering with the same number of clusters has a higher combination similarity.
- Centroid clustering is not optimal because of inversions
- Complete-link clustering and GAAC are not optimal (counterexamples exist)
- Single-link is optimal (proof by induction, page 431)
GAAC is the best choice for most applications
two exceptions:
- non-vector representation of the documents (GAAC needs vectors to compute sums and centroids)
- the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set
- e.g. first story detection or novelty detection: the task of detecting the first occurrence of an event in a stream of news stories (a single-link-like criterion, comparing each new story with its closest earlier story, fits better here)
divisive clustering / top-down clustering:
We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster.
in comparison to bottom-up:
- more complex
- more efficient if we do not generate a complete hierarchy
- more accurate hierarchies in some circumstances
→ algorithm: Bisecting K-means
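A compact sketch of bisecting K-means; the choice to always split the largest remaining cluster, and the use of scikit-learn's KMeans as the flat subroutine, are my own simplifications:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    """Top-down clustering: repeatedly bisect a cluster until k clusters remain."""
    clusters = [np.arange(len(X))]        # start with all documents in one cluster
    while len(clusters) < k:
        # pick the cluster to split next (here: simply the largest one)
        idx = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        members = clusters.pop(idx)
        # split it with a flat 2-means clustering
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters
```

Stopping at k clusters (rather than recursing down to singletons) is what makes the top-down approach cheaper when a complete hierarchy is not needed.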
Differential cluster labeling: (Discriminative labeling)
Selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters.
- feature selection methods like MI and χ2 can be used
- a differential test + a penalty for rare terms → often gives the best labeling results
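A hedged sketch of differential labeling with a χ² test; the function and parameter names are mine, and the simple document-frequency cut-off stands in for the "penalty for rare terms":

```python
import numpy as np
from sklearn.feature_selection import chi2

def differential_labels(X, labels, cluster, terms, top=5, min_df=5):
    """Label one cluster by the terms whose distribution differs most from the other clusters.
    X: documents x terms count matrix; labels: cluster assignment of each document."""
    in_cluster = (labels == cluster).astype(int)      # binary target: in the cluster or not
    scores, _ = chi2(X, in_cluster)                   # chi^2 of every term vs. cluster membership
    scores = np.nan_to_num(scores)
    df = np.asarray((X > 0).sum(axis=0)).ravel()      # document frequency of each term
    scores[df < min_df] = 0.0                         # crude penalty: ignore very rare terms
    return [terms[i] for i in np.argsort(scores)[::-1][:top]]
```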
Cluster-internal labeling: (Non-discriminative labeling)
Computes a label that solely depends on the cluster itself, not on other clusters.
- e.g. the title of the document closest to the centroid
- e.g. use a list of terms with high weights in the centroid of the cluster
- efficient, but they fail to distinguish terms that are frequent in the collection as a whole from those that are frequent only in the cluster.
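For comparison, a cluster-internal labeling sketch (again with my own names), which only looks at the cluster itself:

```python
import numpy as np

def centroid_labels(X_tfidf, labels, cluster, terms, top=5):
    """Cluster-internal labeling: the highest-weighted terms in the cluster centroid."""
    members = X_tfidf[labels == cluster]
    centroid = np.asarray(members.mean(axis=0)).ravel()
    return [terms[i] for i in np.argsort(centroid)[::-1][:top]]
```

Because it never compares against other clusters, terms that are frequent throughout the collection can still surface unless the term weighting (e.g. idf) suppresses them.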
- Computational savings due to the inverted index are large if there are many zero similarities
- for dense centroids:
- GAAC: $\Theta(MN^2 \log N)$ (M is the size of the vocabulary)
- complete-link: $\Theta(M_{ave}N^2 \log N)$ ($M_{ave}$ is the average size of the vocabulary of a document)
- so complete-link is more efficient
- to get rid of dense centroids (covered in the last chapter):
- truncating
- medoids
- for large sets use HAC with flat clustering:
- e.g. employ an HAC algorithm to compute seeds of high quality for K-means: apply HAC to a subset of $\sqrt{N}$ documents
→ Buckshot algorithm: combines the determinism and higher reliability of HAC with the efficiency of K-means.
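A rough sketch of the Buckshot idea under stated assumptions (SciPy average-link HAC on the sample, cluster means as seeds, scikit-learn's KMeans for the flat pass; none of these specific calls are prescribed by the text, and the sample is assumed to contain at least K documents):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def buckshot(X, k, seed=0):
    """HAC on a sqrt(N) sample to get reliable seeds, then K-means on all documents."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(len(X), size=max(k, int(np.sqrt(len(X)))), replace=False)
    Z = linkage(X[sample], method="average", metric="cosine")    # GAAC-style HAC on the sample
    sample_labels = fcluster(Z, t=k, criterion="maxclust")       # cut to k clusters
    seeds = np.vstack([X[sample][sample_labels == c].mean(axis=0)
                       for c in range(1, k + 1)])                # one seed per sample cluster
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit_predict(X)
```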