MiniBatchKMeans#

class sklearn.cluster.MiniBatchKMeans(n_clusters=8, *, init='k-means++', max_iter=100, batch_size=1024, verbose=0, compute_labels=True, random_state=None, tol=0.0, max_no_improvement=10, init_size=None, n_init='auto', reassignment_ratio=0.01)[source]#

Mini-Batch K-Means clustering.

See also

KMeans: The classic implementation of the clustering method based on the Lloyd’s algorithm. It consumes the whole set of input data at each iteration.

Notes

See https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

When there are too few points in the dataset, some centers may be duplicated, which means that a proper clustering in terms of the number of requesting clusters and the number of returned clusters will not always match. One solution is to set reassignment_ratio=0, which prevents reassignments of clusters that are too small.

Examples

&gt;&gt;&gt; from sklearn.cluster import MiniBatchKMeans
&gt;&gt;&gt; import numpy as np
&gt;&gt;&gt; X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 0], [4, 4],
...               [4, 5], [0, 1], [2, 2],
...               [3, 2], [5, 5], [1, -1]])
&gt;&gt;&gt; # manually fit on batches
&gt;&gt;&gt; kmeans = MiniBatchKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          n_init="auto")
&gt;&gt;&gt; kmeans = kmeans.partial_fit(X[0:6,:])
&gt;&gt;&gt; kmeans = kmeans.partial_fit(X[6:12,:])
&gt;&gt;&gt; kmeans.cluster_centers_
array([[3.375, 3.  ],
       [0.75 , 0.5 ]])
&gt;&gt;&gt; kmeans.predict([[0, 0], [4, 4]])
array([1, 0], dtype=int32)
&gt;&gt;&gt; # fit on the whole data
&gt;&gt;&gt; kmeans = MiniBatchKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          max_iter=10,
...                          n_init="auto").fit(X)
&gt;&gt;&gt; kmeans.cluster_centers_
array([[3.55102041, 2.48979592],
       [1.06896552, 1.        ]])
&gt;&gt;&gt; kmeans.predict([[0, 0], [4, 4]])
array([1, 0], dtype=int32)

fit(X, y=None, sample_weight=None)[source]#

Compute the centroids on X by chunking it into mini-batches.

Parameters:

X{array-like, sparse matrix} of shape (n_samples, n_features): Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
yIgnored: Not used, present here for API consistency by convention.
sample_weightarray-like of shape (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. sample_weight is not used during initialization if init is a callable or a user provided array.

Added in version 0.20.

Returns:

selfobject: Fitted estimator.

fit_predict(X, y=None, sample_weight=None)[source]#

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters:

X{array-like, sparse matrix} of shape (n_samples, n_features): New data to transform.
yIgnored: Not used, present here for API consistency by convention.
sample_weightarray-like of shape (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

labelsndarray of shape (n_samples,): Index of the cluster each sample belongs to.

fit_transform(X, y=None, sample_weight=None)[source]#

Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters:

X{array-like, sparse matrix} of shape (n_samples, n_features): New data to transform.
yIgnored: Not used, present here for API consistency by convention.
sample_weightarray-like of shape (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

X_newndarray of shape (n_samples, n_clusters): X transformed in the new space.

get_feature_names_out(input_features=None)[source]#

get output feature names for transformation.

The feature names out will prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the feature names out are: ["class_name0", "class_name1", "class_name2"].

Parameters:

input_featuresarray-like of str or None, default=None: Only used to validate feature names with the names seen in fit.

Returns:

feature_names_outndarray of str objects: Transformed feature names.

get_metadata_routing()[source]#

get metadata routing of this object.

Please check User guide on how the routing mechanism works.

Returns:

routingMetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

get parameters for this estimator.

Parameters:

deepbool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

paramsdict: Parameter names mapped to their values.

partial_fit(X, y=None, sample_weight=None)[source]#

Update k means estimate on a single mini-batch X.

Parameters:

X{array-like, sparse matrix} of shape (n_samples, n_features): Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it’s not in CSR format.
yIgnored: Not used, present here for API consistency by convention.
sample_weightarray-like of shape (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight. sample_weight is not used during initialization if init is a callable or a user provided array.

Returns:

selfobject: Return updated estimator.

predict(X)[source]#

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:

X{array-like, sparse matrix} of shape (n_samples, n_features): New data to predict.

Returns:

labelsndarray of shape (n_samples,): Index of the cluster each sample belongs to.

score(X, y=None, sample_weight=None)[source]#

Opposite of the value of X on the K-means objective.

Parameters:

X{array-like, sparse matrix} of shape (n_samples, n_features): New data.
yIgnored: Not used, present here for API consistency by convention.
sample_weightarray-like of shape (n_samples,), default=None: The weights for each observation in X. If None, all observations are assigned equal weight.

Returns:

scorefloat: Opposite of the value of X on the K-means objective.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANgED$') → MiniBatchKMeans[source]#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANgED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANgED: Metadata routing for sample_weight parameter in fit.

Returns:

selfobject: The updated object.

set_output(*, transform=None)[source]#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform{“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

selfestimator instance: Estimator instance.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**paramsdict: Estimator parameters.

Returns:

selfestimator instance: Estimator instance.

set_partial_fit_request(*, sample_weight: bool | None | str = '$UNCHANgED$') → MiniBatchKMeans[source]#

Request metadata passed to the partial_fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANgED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANgED: Metadata routing for sample_weight parameter in partial_fit.

Returns:

selfobject: The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANgED$') → MiniBatchKMeans[source]#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANgED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weightstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANgED: Metadata routing for sample_weight parameter in score.

Returns:

selfobject: The updated object.

transform(X)[source]#

Transform X to a cluster-distance space.

In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.

Parameters:

X{array-like, sparse matrix} of shape (n_samples, n_features): New data to transform.

Returns:

X_newndarray of shape (n_samples, n_clusters): X transformed in the new space.

gallery examples#

Biclustering documents with the Spectral Co-clustering algorithm

Compare BIRCH and MiniBatchKMeans

Comparing different clustering algorithms on toy datasets

Comparison of the K-Means and MiniBatchKMeans clustering algorithms

Empirical evaluation of the impact of k-means initialization

Online learning of a dictionary of parts of faces

Faces dataset decompositions

Clustering text documents using k-means

Online learning of a dictionary of parts of faces

Faces dataset decompositions