Exploration plots¶

PLSCAN exposes several attributes for inspecting and visualizing its cluster hierarchy.

[9]:

import numpy as np
import matplotlib.pyplot as plt
from fast_plscan import PLSCAN

plt.rcParams["figure.dpi"] = 150
plt.rcParams["figure.figsize"] = (2.75, 0.618 * 2.75)

data = np.load("data/clusterable/sources/clusterable_data.npy")

Condensed tree¶

Like HDBSCAN*, PLSCAN provides a condensed tree that summarizes how clusters merge as mutual reachability distance increases. Each branch represents a cluster lineage, and branch joins indicate merge events.

In PLSCAN visualizations, the tree is shown using distance-based quantities, which makes distance cuts directly interpretable.

[10]:

c = PLSCAN().fit(data)

c.condensed_tree_.plot(select_clusters=True)
plt.show()

The distance_cut() method can be used to extract clusters at a given distance:

[11]:

labels, probs = c.distance_cut(0.015)
plt.scatter(*data.T, c=labels % 10, s=1, alpha=0.4, linewidth=0, cmap="tab10")
plt.axis("off")
plt.show()

Leaf tree¶

PLSCAN does not select final clusters directly from the condensed tree. Instead, it builds a leaf-tree hierarchy that tracks which clusters are leaves (clusters with no child clusters) at each minimum cluster size threshold. The leaf_tree_ attribute stores this hierarchy:

[12]:

c = PLSCAN().fit(data)

c.leaf_tree_.plot(select_clusters=True, leaf_separation=0.1)
plt.show()

The min_cluster_size_cut() method can be used to extract clusters at a given minimum cluster size:

[13]:

labels, probs = c.min_cluster_size_cut(300)
plt.scatter(*data.T, c=labels % 10, s=1, alpha=0.4, linewidth=0, cmap="tab10")
plt.axis("off")
plt.show()

Persistence trace¶

PLSCAN selects clusters from the leaf tree by computing each leaf cluster’s persistence and finding the minimum cluster size threshold with the highest total persistence. The persistence_trace_ attribute visualizes this minimum-cluster-size vs persistence curve.

[14]:

c.persistence_trace_.plot()
plt.show()

_images/using_exploration_plots_11_0.png

By default, PLSCAN selects the minimum cluster size threshold with the highest persistence value. Other robust alternatives can be explored with the cluster_layers method:

[15]:

layers = c.cluster_layers(max_peaks=4)

for i, (size, labels, probs) in enumerate(layers):
    plt.subplot(2, 2, i + 1)
    plt.scatter(*data.T, c=labels % 10, s=1, alpha=0.4, linewidth=0, cmap="tab10")
    plt.title(f"mcs={int(size)}")  # mcs = min cluster size
    plt.axis("off")
plt.subplots_adjust(left=0, right=1, top=1, bottom=0)
plt.show()

_images/using_exploration_plots_13_0.png

These interesting clusterings relate to peaks in the minimum_cluster_size–persistence curve:

[16]:

c.persistence_trace_.plot()
plt.vlines(list(zip(*layers))[0], *plt.ylim(), color="k", linewidth=0.5, zorder=1)
plt.xlim([0, 300])
plt.show()

_images/using_exploration_plots_15_0.png