PLSCAN

class fast_plscan.PLSCAN(*, min_samples=5, space_tree='auto', metric='euclidean', metric_kws=None, min_cluster_size=None, max_cluster_size=inf, persistence_measure='size', num_threads=None)

PLSCAN computes HDBSCAN* leaf-clusters with an optimal minimum cluster size.

The algorithm builds a hierarchy of leaf-clusters, showing which HDBSCAN* [1] clusters are leaves as the minimum cluster size varies (filtration). Then, it computes the total leaf-cluster persistence per minimum cluster size, and picks the minimum cluster size that maximizes that score.
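
The selection step described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the `leaves` intervals are invented toy data, the helper name `total_persistence` is hypothetical, and it assumes "size" persistence, where a cluster is a leaf for minimum cluster sizes in the half-open range from `birth` to `death`.

```python
# Toy leaf-clusters as (birth, death) minimum cluster sizes (invented values).
# Assumption: a cluster is a leaf for sizes s with birth <= s < death.
leaves = [(5, 20), (5, 12), (12, 30), (8, 10)]


def total_persistence(leaves, size):
    """Sum the size-persistence of every cluster that is a leaf at `size`."""
    return sum(death - birth for birth, death in leaves if birth <= size < death)


# Evaluate the score for every candidate minimum cluster size.
trace = {size: total_persistence(leaves, size) for size in range(5, 30)}

# Pick the size with the highest total persistence (first one on ties).
best_size = max(trace, key=trace.get)
print(best_size, trace[best_size])
```

With these toy intervals the long-lived leaves (5, 20) and (12, 30) overlap from size 12 onward, so that is where the total persistence is highest.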

The leaf-cluster hierarchy in leaf_tree_ can be plotted as an alternative to HDBSCAN*’s condensed cluster tree.

Cluster segmentations for other high-persistence minimum cluster sizes can be computed using the cluster_layers method. This method finds the persistence peaks and returns their cluster labels and memberships.
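
To illustrate what finding persistence peaks means, the sketch below locates strict local maxima in a toy persistence trace and ranks them by height. The helper `persistence_peaks` and its data are hypothetical; the peak detection actually used by cluster_layers may differ, for example in how it treats plateaus or boundary values.

```python
def persistence_peaks(sizes, trace, max_peaks=None):
    """Return the sizes at strict local maxima of `trace`, highest peak first."""
    peaks = [
        (trace[i], sizes[i])
        for i in range(1, len(trace) - 1)  # boundary values are never peaks here
        if trace[i - 1] < trace[i] > trace[i + 1]
    ]
    peaks.sort(reverse=True)  # order by total persistence, descending
    ranked = [size for _, size in peaks]
    return ranked if max_peaks is None else ranked[:max_peaks]


# Toy trace: total persistence for each candidate minimum cluster size.
sizes = [5, 6, 7, 8, 9, 10]
trace = [1.0, 3.0, 2.0, 2.5, 4.0, 1.5]

print(persistence_peaks(sizes, trace))               # all peaks
print(persistence_peaks(sizes, trace, max_peaks=1))  # only the highest peak
```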

References

Parameters:
  • min_samples (int, default: 5) – The number of neighbors to use for computing core distances and the mutual reachability distances. Higher values produce smoother density profiles with fewer peaks. When the input is a minimum spanning tree, its edges are assumed to already contain mutual reachability distances and this parameter is ignored.

  • space_tree (str, default: 'auto') – The type of tree to use for the search. Options are “auto”, “kd_tree” and “ball_tree”. If “auto”, a “kd_tree” is used if it supports the selected metric. Space trees are not used when metric is “precomputed”.

  • metric (str, default: 'euclidean') – The distance metric to use. See PLSCAN.VALID_KDTREE_METRICS and PLSCAN.VALID_BALLTREE_METRICS for available options. Use “precomputed” if the input to .fit() contains distances. See sklearn documentation for metric definitions.

  • metric_kws (dict[str, Any] | None, default: None) – Additional keyword arguments for the distance metric. For example, p for the Minkowski distance.

  • min_cluster_size (float | None, default: None) – The minimum size limit for clusters. Defaults to the value of min_samples. Values below min_samples are not allowed, as the leaf-clusters produced by those values can be incomplete and arbitrary.

  • max_cluster_size (float, default: inf) – The maximum size limit for clusters. Defaults to np.inf.

  • persistence_measure (str, default: 'size') – Selects a persistence measure. Valid options are “size”, “distance”, “density”, “size-distance”, and “size-density”. The “size”, “distance”, and “density” options compute persistence as the range of size, distance, or density values over which a cluster is a leaf. The “size-distance” and “size-density” options compute bi-persistence as the area, in the plane spanned by distance (or density) and minimum cluster size, over which a cluster is a leaf. Density is computed as exp(-dist).

  • num_threads (int | None, default: None) – The number of threads to use for parallel computations, value must be positive. If None, OpenMP’s default maximum thread count is used.
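
As a rough illustration of what min_samples controls, the sketch below computes core distances and mutual reachability distances by brute force using the standard HDBSCAN* definitions. It is not the library's code: the helper names are hypothetical, and it assumes that a point does not count as its own neighbor, which may differ from the convention PLSCAN uses.

```python
import math


def core_distances(points, min_samples):
    """Distance from each point to its `min_samples`-th nearest other point."""
    cores = []
    for i, p in enumerate(points):
        # Assumption: the point itself is excluded from its neighbor list.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_samples - 1])
    return cores


def mutual_reachability(points, min_samples):
    """Matrix of max(core(a), core(b), d(a, b)) for every pair of points."""
    cores = core_distances(points, min_samples)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(cores[i], cores[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


points = [(0, 0), (1, 0), (0, 1), (5, 5)]
cores = core_distances(points, min_samples=2)
mreach = mutual_reachability(points, min_samples=2)
print(cores)
print(mreach[0])
```

Raising min_samples increases every core distance, which pushes sparse points further away from the rest and smooths the density profile.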

Attributes

VALID_BALLTREE_METRICS

The distance metrics implemented for use with BallTrees.

VALID_KDTREE_METRICS

The distance metrics implemented for use with KDTrees.

condensed_tree_

The HDBSCAN* condensed cluster tree.

core_distances_

Core distance for each point, shape (n_samples,).

labels_

Cluster label for each point, shape (n_samples,).

leaf_tree_

The leaf-cluster hierarchy across all minimum cluster sizes.

minimum_spanning_tree_

The mutual reachability minimum spanning tree (or forest).

persistence_trace_

The total persistence signal over all minimum cluster sizes.

probabilities_

Cluster membership probability for each point, shape (n_samples,).

selected_clusters_

Leaf-tree node indices of the selected clusters, shape (n_clusters,).

single_linkage_tree_

The full single-linkage dendrogram in scipy linkage format.
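
A minimal usage sketch for reading the fitted attributes listed above. It uses only the constructor, fit, labels_, and probabilities_ as documented on this page. The helper names `fit_plscan` and `cluster_sizes` are hypothetical, and the input `X` is assumed to be an array of shape (n_samples, n_features).

```python
from collections import Counter


def fit_plscan(X, min_samples=5):
    """Fit PLSCAN on X and return per-point labels and membership probabilities."""
    # Imported lazily so cluster_sizes below works without fast_plscan installed.
    from fast_plscan import PLSCAN

    model = PLSCAN(min_samples=min_samples)
    model.fit(X)
    return model.labels_, model.probabilities_


def cluster_sizes(labels):
    """Count how many points carry each cluster label."""
    return dict(Counter(int(label) for label in labels))


# The counting helper applied to a hand-written label vector.
print(cluster_sizes([0, 0, 1, 1, 1, 2]))
```

In practice, `labels, probabilities = fit_plscan(X)` followed by `cluster_sizes(labels)` gives a quick overview of the selected clustering.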

Methods

cluster_layers([max_peaks, min_size, ...])

Return cluster labels and probabilities at each persistence peak.

compute_centroids([labels])

Return the probability-weighted centroid of each cluster.

compute_exemplar_indices([labels])

Return the exemplar point indices for each cluster.

compute_medoid_indices([labels])

Return the index of the medoid point for each cluster.

distance_cut(epsilon)

Return a DBSCAN*-style clustering at a fixed distance threshold.

fit(X[, y, sample_weights])

Compute PLSCAN clusters and hierarchies for the input data.

fit_predict(X[, y])

Perform clustering on X and return cluster labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

min_cluster_size_cut(cut_size)

Return the clustering produced by a specific minimum cluster size.

set_fit_request(*[, sample_weights])

Configure whether metadata should be requested to be passed to the fit method.

set_params(**params)

Set the parameters of this estimator.
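
For readers unfamiliar with the summary points returned by compute_centroids and compute_medoid_indices, the sketch below shows the textbook definitions for a single cluster. The helper names are hypothetical and the details are assumptions; for instance, the library may weight the medoid by membership probability, which this sketch does not.

```python
import math


def weighted_centroid(points, weights):
    """Mean of `points`, with each point weighted by its membership probability."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dim)
    )


def medoid_index(points):
    """Index of the point with the smallest summed distance to all other points."""
    return min(
        range(len(points)),
        key=lambda i: sum(math.dist(points[i], q) for q in points),
    )


# One toy cluster: the last point has probability 0 and sits far away.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
probabilities = [1.0, 1.0, 1.0, 1.0, 0.0]

print(weighted_centroid(points, probabilities))
print(medoid_index(points))
```

Unlike the centroid, the medoid is always one of the input points, which makes it usable with metrics where averaging coordinates has no meaning.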