SpectralClustering#
- class sklearn.cluster.SpectralClustering(n_clusters=8, *, eigen_solver=None, n_components=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol='auto', assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=None, verbose=False)[source]#
- Apply clustering to a projection of the normalized Laplacian. - In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex, or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster, such as when clusters are nested circles on the 2D plane. - If the affinity matrix is the adjacency matrix of a graph, this method can be used to find normalized graph cuts [1], [2]. - When calling - fit, an affinity matrix is constructed using either a kernel function such the Gaussian (aka RBF) kernel with Euclidean distance- d(X, X):- np.exp(-gamma * d(X,X) ** 2) - or a k-nearest neighbors connectivity matrix. - Alternatively, a user-provided affinity matrix can be specified by setting - affinity='precomputed'.- Read more in the User Guide. - Parameters:
- n_clustersint, default=8
- The dimension of the projection subspace. 
- eigen_solver{‘arpack’, ‘lobpcg’, ‘amg’}, default=None
- The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities. If None, then - 'arpack'is used. See [4] for more details regarding- 'lobpcg'.
- n_componentsint, default=None
- Number of eigenvectors to use for the spectral embedding. If None, defaults to - n_clusters.
- random_stateint, RandomState instance, default=None
- A pseudo random number generator used for the initialization of the lobpcg eigenvectors decomposition when - eigen_solver == 'amg', and for the K-Means initialization. Use an int to make the results deterministic across calls (See Glossary).- Note - When using - eigen_solver == 'amg', it is necessary to also fix the global numpy seed with- np.random.seed(int)to get deterministic results. See pyamg/pyamg#139 for further information.
- n_initint, default=10
- Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia. Only used if - assign_labels='kmeans'.
- gammafloat, default=1.0
- Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for - affinity='nearest_neighbors',- affinity='precomputed'or- affinity='precomputed_nearest_neighbors'.
- affinitystr or callable, default=’rbf’
- How to construct the affinity matrix.
- ‘nearest_neighbors’: construct the affinity matrix by computing a graph of nearest neighbors. 
- ‘rbf’: construct the affinity matrix using a radial basis function (RBF) kernel. 
- ‘precomputed’: interpret - Xas a precomputed affinity matrix, where larger values indicate greater similarity between instances.
- ‘precomputed_nearest_neighbors’: interpret - Xas a sparse graph of precomputed distances, and construct a binary affinity matrix from the- n_neighborsnearest neighbors of each instance.
- one of the kernels supported by - pairwise_kernels.
 
 - Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm. 
- n_neighborsint, default=10
- Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for - affinity='rbf'.
- eigen_tolfloat, default=”auto”
- Stopping criterion for eigen decomposition of the Laplacian matrix. If - eigen_tol="auto"then the passed tolerance will depend on the- eigen_solver:- If - eigen_solver="arpack", then- eigen_tol=0.0;
- If - eigen_solver="lobpcg"or- eigen_solver="amg", then- eigen_tol=Nonewhich configures the underlying- lobpcgsolver to automatically resolve the value according to their heuristics. See,- scipy.sparse.linalg.lobpcgfor details.
 - Note that when using - eigen_solver="lobpcg"or- eigen_solver="amg"values of- tol<1e-5may lead to convergence issues and should be avoided.- Added in version 1.2: Added ‘auto’ option. 
- assign_labels{‘kmeans’, ‘discretize’, ‘cluster_qr’}, default=’kmeans’
- The strategy for assigning labels in the embedding space. There are two ways to assign labels after the Laplacian embedding. k-means is a popular choice, but it can be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization [3]. The cluster_qr method [5] directly extract clusters from eigenvectors in spectral clustering. In contrast to k-means and discretization, cluster_qr has no tuning parameters and runs no iterations, yet may outperform k-means and discretization in terms of both quality and speed. - Changed in version 1.1: Added new labeling method ‘cluster_qr’. 
- degreefloat, default=3
- Degree of the polynomial kernel. Ignored by other kernels. 
- coef0float, default=1
- Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels. 
- kernel_paramsdict of str to any, default=None
- Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. 
- n_jobsint, default=None
- The number of parallel jobs to run when - affinity='nearest_neighbors'or- affinity='precomputed_nearest_neighbors'. The neighbors search will be done in parallel.- Nonemeans 1 unless in a- joblib.parallel_backendcontext.- -1means using all processors. See Glossary for more details.
- verbosebool, default=False
- Verbosity mode. - Added in version 0.24. 
 
- Attributes:
- affinity_matrix_array-like of shape (n_samples, n_samples)
- Affinity matrix used for clustering. Available only after calling - fit.
- labels_ndarray of shape (n_samples,)
- Labels of each point 
- n_features_in_int
- Number of features seen during fit. - Added in version 0.24. 
- feature_names_in_ndarray of shape (n_features_in_,)
- Names of features seen during fit. Defined only when - Xhas feature names that are all strings.- Added in version 1.0. 
 
 - See also - sklearn.cluster.KMeans
- K-Means clustering. 
- sklearn.cluster.DBSCAN
- Density-Based Spatial Clustering of Applications with Noise. 
 - Notes - A distance matrix for which 0 indicates identical elements and high values indicate very dissimilar elements can be transformed into an affinity / similarity matrix that is well-suited for the algorithm by applying the Gaussian (aka RBF, heat) kernel: - np.exp(- dist_matrix ** 2 / (2. * delta ** 2)) - where - deltais a free parameter representing the width of the Gaussian kernel.- An alternative is to take a symmetric version of the k-nearest neighbors connectivity matrix of the points. - If the pyamg package is installed, it is used: this greatly speeds up computation. - References [4]- Examples - >>> from sklearn.cluster import SpectralClustering >>> import numpy as np >>> X = np.array([[1, 1], [2, 1], [1, 0], ... [4, 7], [3, 5], [3, 6]]) >>> clustering = SpectralClustering(n_clusters=2, ... assign_labels='discretize', ... random_state=0).fit(X) >>> clustering.labels_ array([1, 1, 1, 0, 0, 0]) >>> clustering SpectralClustering(assign_labels='discretize', n_clusters=2, random_state=0) - For a comparison of Spectral clustering with other clustering algorithms, see Comparing different clustering algorithms on toy datasets - fit(X, y=None)[source]#
- Perform spectral clustering from features, or affinity matrix. - Parameters:
- X{array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
- Training instances to cluster, similarities / affinities between instances if - affinity='precomputed', or distances between instances if- affinity='precomputed_nearest_neighbors. If a sparse matrix is provided in a format other than- csr_matrix,- csc_matrix, or- coo_matrix, it will be converted into a sparse- csr_matrix.
- yIgnored
- Not used, present here for API consistency by convention. 
 
- Returns:
- selfobject
- A fitted instance of the estimator. 
 
 
 - fit_predict(X, y=None)[source]#
- Perform spectral clustering on - Xand return cluster labels.- Parameters:
- X{array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
- Training instances to cluster, similarities / affinities between instances if - affinity='precomputed', or distances between instances if- affinity='precomputed_nearest_neighbors. If a sparse matrix is provided in a format other than- csr_matrix,- csc_matrix, or- coo_matrix, it will be converted into a sparse- csr_matrix.
- yIgnored
- Not used, present here for API consistency by convention. 
 
- Returns:
- labelsndarray of shape (n_samples,)
- Cluster labels. 
 
 
 - get_metadata_routing()[source]#
- Get metadata routing of this object. - Please check User Guide on how the routing mechanism works. - Returns:
- routingMetadataRequest
- A - MetadataRequestencapsulating routing information.
 
 
 - get_params(deep=True)[source]#
- Get parameters for this estimator. - Parameters:
- deepbool, default=True
- If True, will return the parameters for this estimator and contained subobjects that are estimators. 
 
- Returns:
- paramsdict
- Parameter names mapped to their values. 
 
 
 - set_params(**params)[source]#
- Set the parameters of this estimator. - The method works on simple estimators as well as on nested objects (such as - Pipeline). The latter have parameters of the form- <component>__<parameter>so that it’s possible to update each component of a nested object.- Parameters:
- **paramsdict
- Estimator parameters. 
 
- Returns:
- selfestimator instance
- Estimator instance. 
 
 
 
Gallery examples#
 
Comparing different clustering algorithms on toy datasets
