How can the labels of AgglomerativeClustering be re-computed?

I am using scikit-learn's AgglomerativeClustering on a large data set.

I would like to modify the distance_threshold after the model has already been computed. Computing the model is slow (quadratic time), but it should easily be possible to re-compute the labels for a new distance_threshold in linear time because the model stores the children_ and distances_ arrays permanently. But how can the labels be re-computed for a different distance_threshold?

It can be assumed that distance_threshold was originally set to 0, i.e. the entire tree was computed.

Topic scikit-learn clustering

Category Data Science


You may calculate the new label using children_ and distances_ recursively and following the below definition from scikit-learn document.

children_array-like of shape (n_samples-1, 2)
The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node n_samples + I
distances_array-like of shape (n_nodes-1,)
Distances between nodes in the corresponding place in children_. Only computed if distance_threshold is used or compute_distances is set to True.

from sklearn.cluster import AgglomerativeClustering
from sklearn import datasets
iris = datasets.load_iris()
x, y = iris.data, iris.target

ac = AgglomerativeClustering(n_clusters=None, affinity='euclidean', linkage='complete', compute_full_tree=True, distance_threshold=0, )
labels = ac.fit_predict(x)

children = ac.children_
distances = ac.distances_
thres = 1.5
n_samples = 150

def get_branch(label):  # Return the "children" Index based on Node
    for idx,child in enumerate(children):
        if label in child:
            return idx

def new_label(label):

    branch = get_branch(label)
    distance = distances[branch]
    while distance < thres:
        parent = branch + n_samples   # As per doc, It's the Parent's Node
        branch = get_branch(parent)   # Get the Index of Parent
        distance = distances[branch]  # Get the Distance of the elements in the Index 

    return label if not parent else parent
    
new_label(100)

One option to speedup computation with different thresholds is caching the results using the memory option.

Something like this:

from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(compute_full_tree=True,
                                     distance_threshold=1.0,
                                     memory='mycachedir', 
                                     n_clusters=None,)

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.