Version 1.3
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 1.3.
Legend for changelogs
Major Feature something big that you couldn’t do before.
Feature something that you couldn’t do before.
Efficiency an existing feature now may not require as much computation or memory.
Enhancement a miscellaneous minor improvement.
Fix something that previously didn’t work as documented – or according to reasonable expectations – should now work.
API Change you will need to change your code to have the same effect in the future; or a feature will be removed in the future.
Version 1.3.2
October 2023
Changelog
sklearn.datasets
Fix All dataset fetchers now accept data_home as any object that implements the os.PathLike interface, for instance, pathlib.Path. #27468 by Yao Xiao.
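The new behaviour can be illustrated without any network access: any object implementing the os.PathLike protocol, such as a pathlib.Path, is now a valid data_home value for the fetchers. A minimal sketch (the fetcher call in the comment is illustrative only, to avoid downloading data):

```python
import os
import pathlib

# pathlib.Path implements os.PathLike, so it can now be passed directly
# as data_home to any dataset fetcher, e.g.
# fetch_california_housing(data_home=pathlib.Path("~/scikit_learn_data")).
home = pathlib.Path("~/scikit_learn_data").expanduser()
is_pathlike = isinstance(home, os.PathLike)
print(is_pathlike)  # True
print(os.fspath(home))  # the underlying string path
```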
sklearn.decomposition
Fix Fixes a bug in decomposition.KernelPCA by forcing the output of the internal preprocessing.KernelCenterer to be a default array. When the arpack solver is used, it expects an array with a dtype attribute. #27583 by Guillaume Lemaitre.
sklearn.metrics
Fix Fixes a bug for metrics using zero_division=np.nan (e.g. precision_score) within a parallel loop (e.g. cross_val_score) where the singleton for np.nan will be different in the sub-processes. #27573 by Guillaume Lemaitre.
sklearn.tree
Fix Do not leak data via non-initialized memory in decision tree pickle files and make the generation of those files deterministic. #27580 by Loïc Estève.
Version 1.3.1
September 2023
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
Fix Ridge models with solver='sparse_cg' may have slightly different results with scipy>=1.12, because of an underlying change in the scipy solver (see scipy#18488 for more details). #26814 by Loïc Estève.
Changes impacting all modules
Fix The set_output API correctly works with list input. #27044 by Thomas Fan.
Changelog
sklearn.calibration
Fix calibration.CalibratedClassifierCV can now handle models that produce large prediction scores. Before, it was numerically unstable. #26913 by Omar Salman.
sklearn.cluster
Fix cluster.BisectingKMeans could crash when predicting on data with a different scale than the data used to fit the model. #27167 by Olivier Grisel.
Fix cluster.BisectingKMeans now works with data that has a single feature. #27243 by Jérémie du Boisberranger.
sklearn.cross_decomposition
Fix cross_decomposition.PLSRegression now automatically ravels the output of predict if fitted with one-dimensional y. #26602 by Yao Xiao.
sklearn.ensemble
Fix Fix a bug in ensemble.AdaBoostClassifier with algorithm="SAMME" where the decision function of each weak learner should be symmetric (i.e. the sum of the scores should sum to zero for a sample). #26521 by Guillaume Lemaitre.
sklearn.feature_selection
Fix feature_selection.mutual_info_regression now correctly computes the result when X is of integer dtype. #26748 by Yao Xiao.
sklearn.impute
Fix impute.KNNImputer now correctly adds a missing indicator column in transform when add_indicator is set to True and missing values are observed during fit. #26600 by Shreesha Kumar Bhat.
sklearn.metrics
Fix Scorers used with metrics.get_scorer now properly handle multilabel-indicator matrices. #27002 by Guillaume Lemaitre.
sklearn.mixture
Fix The initialization of mixture.GaussianMixture from user-provided precisions_init for covariance_type of full or tied was not correct, and has been fixed. #26416 by Yang Tao.
sklearn.neighbors
Fix neighbors.KNeighborsClassifier.predict no longer raises an exception for pandas.DataFrame input. #26772 by Jérémie du Boisberranger.
Fix Reintroduce sklearn.neighbors.BallTree.valid_metrics and sklearn.neighbors.KDTree.valid_metrics as public class attributes. #26754 by Julien Jerphanion.
Fix sklearn.model_selection.HalvingRandomSearchCV no longer raises when the input to the param_distributions parameter is a list of dicts. #26893 by Stefanie Senger.
Fix Neighbors-based estimators now correctly work when metric="minkowski" and the metric parameter p is in the range 0 < p < 1, regardless of the dtype of X. #26760 by Shreesha Kumar Bhat.
sklearn.preprocessing
Fix preprocessing.LabelEncoder correctly accepts y as a keyword argument. #26940 by Thomas Fan.
Fix preprocessing.OneHotEncoder shows a more informative error message when sparse_output=True and the output is configured to be pandas. #26931 by Thomas Fan.
sklearn.tree
Fix tree.plot_tree now accepts class_names=True as documented. #26903 by Thomas Roehr.
Fix The feature_names parameter of tree.plot_tree now accepts any kind of array-like instead of just a list. #27292 by Rahil Parikh.
Version 1.3.0
June 2023
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.
Enhancement multiclass.OutputCodeClassifier.predict now uses a more efficient pairwise distance reduction. As a consequence, the tie-breaking strategy is different and thus the predicted labels may be different. #25196 by Guillaume Lemaitre.
Enhancement The fit_transform method of decomposition.DictionaryLearning is more efficient but may produce different results than in previous versions when transform_algorithm is not the same as fit_algorithm and the number of iterations is small. #24871 by Omar Salman.
Enhancement The sample_weight parameter now will be used in centroids initialization for cluster.KMeans, cluster.BisectingKMeans and cluster.MiniBatchKMeans. This change will break backward compatibility, since numbers generated from the same random seeds will be different. #25752 by Hleb Levitski, Jérémie du Boisberranger, Guillaume Lemaitre.
Fix Treat more consistently small values in the W and H matrices during the fit and transform steps of decomposition.NMF and decomposition.MiniBatchNMF, which can produce different results than previous versions. #25438 by Yotam Avidar-Constantini.
Fix decomposition.KernelPCA may produce different results through inverse_transform if gamma is None. Now it will be chosen correctly as 1/n_features of the data that it is fitted on, while previously it might be incorrectly chosen as 1/n_features of the data passed to inverse_transform. A new attribute gamma_ is provided for revealing the actual value of gamma used each time the kernel is called. #26337 by Yao Xiao.
Changed displays
Enhancement model_selection.LearningCurveDisplay displays both the train and test curves by default. You can set score_type="test" to keep the past behaviour. #25120 by Guillaume Lemaitre.
Fix model_selection.ValidationCurveDisplay now accepts passing a list to the param_range parameter. #27311 by Arturo Amor.
Changes impacting all modules
Enhancement The get_feature_names_out method of the following classes now raises a NotFittedError if the instance is not fitted. This ensures the error is consistent in all estimators with the get_feature_names_out method. The NotFittedError displays an informative message asking to fit the instance with the appropriate arguments. #25294, #25308, #25291, #25367, #25402 by John Pangas, Rahil Parikh, and Alex Buzenet.
Enhancement Added a multi-threaded Cython routine to compute squared Euclidean distances (sometimes followed by a fused reduction operation) for a pair of datasets consisting of a sparse CSR matrix and a dense NumPy array. This can improve the performance of a number of functions and estimators. A typical example of this performance improvement happens when passing a sparse CSR matrix to the predict or transform method of estimators that rely on a dense NumPy representation to store their fitted parameters (or the reverse). For instance, sklearn.neighbors.NearestNeighbors.kneighbors is now up to 2 times faster for this case on commonly available laptops.
Enhancement All estimators that internally rely on OpenMP multi-threading (via Cython) now use a number of threads equal to the number of physical (instead of logical) cores by default. In the past, we observed that using as many threads as logical cores on SMT hosts could sometimes cause severe performance problems depending on the algorithms and the shape of the data. Note that it is still possible to manually adjust the number of threads used by OpenMP as documented in Parallelism.
Experimental / Under Development
Major Feature Metadata routing’s related base methods are included in this release. This feature is only available via the enable_metadata_routing feature flag which can be enabled using sklearn.set_config and sklearn.config_context. For now this feature is mostly useful for third party developers to prepare their code base for metadata routing, and we strongly recommend that they also hide it behind the same feature flag, rather than having it enabled by default. #24027 by Adrin Jalali, Benjamin Bossan, and Omar Salman.
Changelog
sklearn
Feature Added a new option skip_parameter_validation to the function sklearn.set_config and context manager sklearn.config_context that allows skipping the validation of the parameters passed to estimators and public functions. This can be useful to speed up the code but should be used with care because it can lead to unexpected behaviors or raise obscure error messages when setting invalid parameters. #25815 by Jérémie du Boisberranger.
sklearn.base
Feature A __sklearn_clone__ protocol is now available to override the default behavior of base.clone. #24568 by Thomas Fan.
Fix base.TransformerMixin now keeps a namedtuple’s class if transform returns a namedtuple. #26121 by Thomas Fan.
sklearn.calibration
Fix calibration.CalibratedClassifierCV now does not enforce sample alignment on fit_params. #25805 by Adrin Jalali.
sklearn.cluster
Major Feature Added cluster.HDBSCAN, a modern hierarchical density-based clustering algorithm. Similarly to cluster.OPTICS, it can be seen as a generalization of cluster.DBSCAN by allowing for hierarchical instead of flat clustering, however it varies in its approach from cluster.OPTICS. This algorithm is very robust with respect to its hyperparameters’ values and can be used on a wide variety of data without much, if any, tuning. This implementation is an adaptation from the original implementation of HDBSCAN in scikit-learn-contrib/hdbscan, by Leland McInnes et al.
Enhancement The sample_weight parameter now will be used in centroids initialization for cluster.KMeans, cluster.BisectingKMeans and cluster.MiniBatchKMeans. This change will break backward compatibility, since numbers generated from the same random seeds will be different. #25752 by Hleb Levitski, Jérémie du Boisberranger, Guillaume Lemaitre.
Fix cluster.KMeans, cluster.MiniBatchKMeans and cluster.k_means now correctly handle the combination of n_init="auto" and init being an array-like, running one initialization in that case. #26657 by Binesh Bannerjee.
API Change The sample_weight parameter in predict for cluster.KMeans.predict and cluster.MiniBatchKMeans.predict is now deprecated and will be removed in v1.5. #25251 by Hleb Levitski.
API Change The Xred argument in cluster.FeatureAgglomeration.inverse_transform is renamed to Xt and will be removed in v1.5. #26503 by Adrin Jalali.
sklearn.compose
Fix compose.ColumnTransformer raises an informative error when the individual transformers of ColumnTransformer output pandas dataframes with indexes that are not consistent with each other and the output is configured to be pandas. #26286 by Thomas Fan.
Fix compose.ColumnTransformer correctly sets the output of the remainder when set_output is called. #26323 by Thomas Fan.
sklearn.covariance
Fix Allows alpha=0 in covariance.GraphicalLasso to be consistent with covariance.graphical_lasso. #26033 by Genesis Valencia.
Fix covariance.empirical_covariance now gives an informative error message when input is not appropriate. #26108 by Quentin Barthélemy.
API Change Deprecates cov_init in covariance.graphical_lasso in 1.3 since the parameter has no effect. It will be removed in 1.5. #26033 by Genesis Valencia.
API Change Adds costs_ fitted attribute in covariance.GraphicalLasso and covariance.GraphicalLassoCV. #26033 by Genesis Valencia.
API Change Adds covariance parameter in covariance.GraphicalLasso. #26033 by Genesis Valencia.
API Change Adds eps parameter in covariance.GraphicalLasso, covariance.graphical_lasso, and covariance.GraphicalLassoCV. #26033 by Genesis Valencia.
sklearn.datasets
Enhancement Allows overwriting the parameters used to open the ARFF file using the parameter read_csv_kwargs in datasets.fetch_openml when using the pandas parser. #26433 by Guillaume Lemaitre.
Fix datasets.fetch_openml returns improved data types when as_frame=True and parser="liac-arff". #26386 by Thomas Fan.
Fix Following the ARFF specs, only the marker "?" is now considered as a missing value when opening ARFF files fetched using datasets.fetch_openml when using the pandas parser. The parameter read_csv_kwargs allows overwriting this behaviour. #26551 by Guillaume Lemaitre.
Fix datasets.fetch_openml will consistently use np.nan as the missing marker with both parsers "pandas" and "liac-arff". #26579 by Guillaume Lemaitre.
API Change The data_transposed argument of datasets.make_sparse_coded_signal is deprecated and will be removed in v1.5. #25784 by Jérémie du Boisberranger.
sklearn.decomposition
Efficiency decomposition.MiniBatchDictionaryLearning and decomposition.MiniBatchSparsePCA are now faster for small batch sizes by avoiding duplicate validations. #25490 by Jérémie du Boisberranger.
Enhancement decomposition.DictionaryLearning now accepts the parameter callback for consistency with the function decomposition.dict_learning. #24871 by Omar Salman.
Fix Treat more consistently small values in the W and H matrices during the fit and transform steps of decomposition.NMF and decomposition.MiniBatchNMF, which can produce different results than previous versions. #25438 by Yotam Avidar-Constantini.
API Change The W argument in decomposition.NMF.inverse_transform and decomposition.MiniBatchNMF.inverse_transform is renamed to Xt and will be removed in v1.5. #26503 by Adrin Jalali.
sklearn.discriminant_analysis
Enhancement discriminant_analysis.LinearDiscriminantAnalysis now supports PyTorch. See Array API support (experimental) for more details. #25956 by Thomas Fan.
sklearn.ensemble
Feature ensemble.HistGradientBoostingRegressor now supports the Gamma deviance loss via loss="gamma". Using the Gamma deviance as loss function comes in handy for modelling skewed, strictly positive valued targets. #22409 by Christian Lorentzen.
Feature Compute a custom out-of-bag score by passing a callable to ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier and ensemble.ExtraTreesRegressor. #25177 by Tim Head.
Feature ensemble.GradientBoostingClassifier now exposes out-of-bag scores via the oob_scores_ or oob_score_ attributes. #24882 by Ashwin Mathur.
Efficiency ensemble.IsolationForest predict time is now faster (typically by a factor of 8 or more). Internally, the estimator now precomputes decision path lengths per tree at fit time. It is therefore not possible to load an estimator trained with scikit-learn 1.2 to make it predict with scikit-learn 1.3: retraining with scikit-learn 1.3 is required. #25186 by Felipe Breve Siola.
Efficiency ensemble.RandomForestClassifier and ensemble.RandomForestRegressor with warm_start=True now only recompute out-of-bag scores when there are actually more n_estimators in subsequent fit calls. #26318 by Joshua Choo Yun Keat.
Enhancement ensemble.BaggingClassifier and ensemble.BaggingRegressor expose the allow_nan tag from the underlying estimator. #25506 by Thomas Fan.
Fix ensemble.RandomForestClassifier.fit sets max_samples = 1 when max_samples is a float and round(n_samples * max_samples) < 1. #25601 by Jan Fidor.
Fix ensemble.IsolationForest.fit no longer warns about missing feature names when called with contamination not "auto" on a pandas dataframe. #25931 by Yao Xiao.
Fix ensemble.HistGradientBoostingRegressor and ensemble.HistGradientBoostingClassifier treat negative values for categorical features consistently as missing values, following LightGBM’s and pandas’ conventions. #25629 by Thomas Fan.
Fix Fix deprecation of base_estimator in ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor that was introduced in #23819. #26242 by Marko Toplak.
sklearn.exceptions
Feature Added exceptions.InconsistentVersionWarning which is raised when a scikit-learn estimator is unpickled with a scikit-learn version that is inconsistent with the scikit-learn version the estimator was pickled with. #25297 by Thomas Fan.
sklearn.feature_extraction
API Change feature_extraction.image.PatchExtractor now follows the transformer API of scikit-learn. This class is defined as a stateless transformer, meaning that it is not required to call fit before calling transform. Parameter validation only happens at fit time. #24230 by Guillaume Lemaitre.
sklearn.feature_selection
Enhancement All selectors in sklearn.feature_selection will preserve a DataFrame’s dtype when transformed. #25102 by Thomas Fan.
Fix feature_selection.SequentialFeatureSelector’s cv parameter now supports generators. #25973 by Yao Xiao.
sklearn.impute
Enhancement Added the parameter fill_value to impute.IterativeImputer. #25232 by Thijs van Weezel.
Fix impute.IterativeImputer now correctly preserves the pandas index when set_config(transform_output="pandas") is used. #26454 by Thomas Fan.
sklearn.inspection
Enhancement Added support for sample_weight in inspection.partial_dependence and inspection.PartialDependenceDisplay.from_estimator. This allows for weighted averaging when aggregating for each value of the grid we are making the inspection on. The option is only available when method is set to brute. #25209 and #26644 by Carlo Lemos.
API Change inspection.partial_dependence returns a utils.Bunch with the new key grid_values. The values key is deprecated in favor of grid_values and will be removed in 1.5. #21809 and #25732 by Thomas Fan.
sklearn.kernel_approximation
Fix kernel_approximation.AdditiveChi2Sampler is now stateless. The sample_interval_ attribute is deprecated and will be removed in 1.5. #25190 by Vincent Maladière.
sklearn.linear_model
Efficiency Avoid data scaling when sample_weight=None and other unnecessary data copies and unexpected dense to sparse data conversion in linear_model.LinearRegression. #26207 by Olivier Grisel.
Enhancement linear_model.SGDClassifier, linear_model.SGDRegressor and linear_model.SGDOneClassSVM now preserve dtype for numpy.float32. #25587 by Omar Salman.
Enhancement The n_iter_ attribute has been included in linear_model.ARDRegression to expose the actual number of iterations required to reach the stopping criterion. #25697 by John Pangas.
Fix Use a more robust criterion to detect convergence of linear_model.LogisticRegression with penalty="l1" and solver="liblinear" on linearly separable problems. #25214 by Tom Dupre la Tour.
Fix Fix a crash when calling fit on linear_model.LogisticRegression with solver="newton-cholesky" and max_iter=0, which failed to inspect the state of the model prior to the first parameter update. #26653 by Olivier Grisel.
API Change Deprecates n_iter in favor of max_iter in linear_model.BayesianRidge and linear_model.ARDRegression. n_iter will be removed in scikit-learn 1.5. This change makes those estimators consistent with the rest of the estimators. #25697 by John Pangas.
sklearn.manifold
Fix manifold.Isomap now correctly preserves the pandas index when set_config(transform_output="pandas") is used. #26454 by Thomas Fan.
sklearn.metrics
Feature Adds zero_division=np.nan to multiple classification metrics: metrics.precision_score, metrics.recall_score, metrics.f1_score, metrics.fbeta_score, metrics.precision_recall_fscore_support, metrics.classification_report. When zero_division=np.nan and there is a zero division, the metric is undefined and is excluded from averaging. When not used for averages, the value returned is np.nan. #25531 by Marc Torrellas Socastro.
Feature metrics.average_precision_score now supports the multiclass case. #17388 by Geoffrey Bolmier and #24769 by Ashwin Mathur.
Efficiency The computation of the expected mutual information in metrics.adjusted_mutual_info_score is now faster when the number of unique labels is large and its memory usage is reduced in general. #25713 by Kshitij Mathur, Guillaume Lemaitre, Omar Salman and Jérémie du Boisberranger.
Enhancement metrics.silhouette_samples now accepts a sparse matrix of pairwise distances between samples, or a feature array. #18723 by Sahil Gupta and #24677 by Ashwin Mathur.
Enhancement A new parameter drop_intermediate was added to metrics.precision_recall_curve, metrics.PrecisionRecallDisplay.from_estimator, metrics.PrecisionRecallDisplay.from_predictions, which drops some suboptimal thresholds to create lighter precision-recall curves. #24668 by @dberenbaum.
Enhancement metrics.RocCurveDisplay.from_estimator and metrics.RocCurveDisplay.from_predictions now accept two new keywords, plot_chance_level and chance_level_kw, to plot the baseline chance level. This line is exposed in the chance_level_ attribute. #25987 by Yao Xiao.
Enhancement metrics.PrecisionRecallDisplay.from_estimator and metrics.PrecisionRecallDisplay.from_predictions now accept two new keywords, plot_chance_level and chance_level_kw, to plot the baseline chance level. This line is exposed in the chance_level_ attribute. #26019 by Yao Xiao.
Fix metrics.pairwise.manhattan_distances now supports readonly sparse datasets. #25432 by Julien Jerphanion.
Fix Fixed metrics.classification_report so that empty input will return np.nan. Previously, “macro avg” and “weighted avg” would return e.g. f1-score=np.nan and f1-score=0.0, being inconsistent. Now, they both return np.nan. #25531 by Marc Torrellas Socastro.
Fix metrics.ndcg_score now gives a meaningful error message for input of length 1. #25672 by Lene Preuss and Wei-Chun Chu.
Fix metrics.log_loss raises a warning if the values of the parameter y_pred are not normalized, instead of actually normalizing them in the metric. Starting from 1.5 this will raise an error. #25299 by Omar Salman.
Fix In metrics.roc_curve, use the threshold value np.inf instead of the arbitrary max(y_score) + 1. This threshold is associated with the ROC curve point tpr=0 and fpr=0. #26194 by Guillaume Lemaitre.
Fix The 'matching' metric has been removed when using SciPy>=1.9 to be consistent with scipy.spatial.distance, which does not support 'matching' anymore. #26264 by Barata T. Onggo.
API Change The eps parameter of metrics.log_loss has been deprecated and will be removed in 1.5. #25299 by Omar Salman.
sklearn.gaussian_process
Fix gaussian_process.GaussianProcessRegressor has a new argument n_targets, which is used to decide the number of outputs when sampling from the prior distributions. #23099 by Zhehao Liu.
sklearn.mixture
Efficiency mixture.GaussianMixture is more efficient now and will bypass unnecessary initialization if the weights, means, and precisions are given by users. #26021 by Jiawei Zhang.
sklearn.model_selection
Major Feature Added the class model_selection.ValidationCurveDisplay that allows easy plotting of validation curves obtained by the function model_selection.validation_curve. #25120 by Guillaume Lemaitre.
API Change The parameter log_scale in the plot method of the class model_selection.LearningCurveDisplay has been deprecated in 1.3 and will be removed in 1.5. The default scale can be overridden by setting it directly on the ax object, and will otherwise be set automatically from the spacing of the data points. #25120 by Guillaume Lemaitre.
Enhancement model_selection.cross_validate accepts a new parameter return_indices to return the train-test indices of each cv split. #25659 by Guillaume Lemaitre.
sklearn.multioutput
Fix getattr on multioutput.MultiOutputRegressor.partial_fit and multioutput.MultiOutputClassifier.partial_fit now correctly raises an AttributeError if done before calling fit. #26333 by Adrin Jalali.
sklearn.naive_bayes
Fix naive_bayes.GaussianNB no longer raises a ZeroDivisionError when the provided sample_weight reduces the problem to a single class in fit. #24140 by Jonathan Ohayon and Chiara Marmo.
sklearn.neighbors
Enhancement The performance of neighbors.KNeighborsClassifier.predict and of neighbors.KNeighborsClassifier.predict_proba has been improved when n_neighbors is large and algorithm="brute" with non-Euclidean metrics. #24076 by Meekail Zain, Julien Jerphanion.
Fix Remove support for KulsinskiDistance in neighbors.BallTree. This dissimilarity is not a metric and cannot be supported by the BallTree. #25417 by Guillaume Lemaitre.
API Change The support for metrics other than euclidean and manhattan and for callables in neighbors.NeighborhoodComponentsAnalysis is deprecated and will be removed in version 1.5. #24083 by Valentin Laurent.
sklearn.neural_network
Fix neural_network.MLPRegressor and neural_network.MLPClassifier report the right n_iter_ when warm_start=True. It corresponds to the number of iterations performed on the current call to fit instead of the total number of iterations performed since the initialization of the estimator. #25443 by Marvin Krawutschke.
sklearn.pipeline
Feature pipeline.FeatureUnion can now use indexing notation (e.g. feature_union["scalar"]) to access transformers by name. #25093 by Thomas Fan.
Feature pipeline.FeatureUnion can now access the feature_names_in_ attribute if the X value seen during .fit has a columns attribute and all columns are strings, e.g. when X is a pandas.DataFrame. #25220 by Ian Thompson.
Fix pipeline.Pipeline.fit_transform now raises an AttributeError if the last step of the pipeline does not support fit_transform. #26325 by Adrin Jalali.
sklearn.preprocessing
Major Feature Introduces preprocessing.TargetEncoder, which is a categorical encoding based on the target mean conditioned on the value of the category. #25334 by Thomas Fan.
Feature preprocessing.OrdinalEncoder now supports grouping infrequent categories into a single feature. Grouping infrequent categories is enabled by specifying how to select infrequent categories with min_frequency or max_categories. #25677 by Thomas Fan.
Enhancement preprocessing.PolynomialFeatures now calculates the number of expanded terms a priori when dealing with sparse csr matrices in order to optimize the choice of dtype for indices and indptr. It can now output csr matrices with np.int32 indices/indptr components when there are few enough elements, and will automatically use np.int64 for sufficiently large matrices. #20524 by niuk-a and #23731 by Meekail Zain.
Enhancement A new parameter sparse_output was added to preprocessing.SplineTransformer, available as of SciPy 1.8. If sparse_output=True, preprocessing.SplineTransformer returns a sparse CSR matrix. #24145 by Christian Lorentzen.
Enhancement Adds a feature_name_combiner parameter to preprocessing.OneHotEncoder. This specifies a custom callable to create feature names to be returned by preprocessing.OneHotEncoder.get_feature_names_out. The callable combines input arguments (input_feature, category) to a string. #22506 by Mario Kostelac.
Enhancement Added support for sample_weight in preprocessing.KBinsDiscretizer. This allows specifying the parameter sample_weight for each sample to be used while fitting. The option is only available when strategy is set to quantile or kmeans. #24935 by Seladus, Guillaume Lemaitre, and Dea María Léon, #25257 by Hleb Levitski.
Enhancement Subsampling through the subsample parameter can now be used in preprocessing.KBinsDiscretizer regardless of the strategy used. #26424 by Jérémie du Boisberranger.
Fix preprocessing.PowerTransformer now correctly preserves the pandas index when set_config(transform_output="pandas") is used. #26454 by Thomas Fan.
Fix preprocessing.PowerTransformer now correctly raises an error when using method="box-cox" on data with a constant np.nan column. #26400 by Yao Xiao.
Fix preprocessing.PowerTransformer with method="yeo-johnson" now leaves constant features unchanged instead of transforming with an arbitrary value for the lambdas_ fitted parameter. #26566 by Jérémie du Boisberranger.
API Change The default value of the subsample parameter of preprocessing.KBinsDiscretizer will change from None to 200_000 in version 1.5 when strategy="kmeans" or strategy="uniform". #26424 by Jérémie du Boisberranger.
sklearn.svm
API Change The dual parameter now accepts the "auto" option for svm.LinearSVC and svm.LinearSVR. #26093 by Hleb Levitski.
sklearn.tree
Major Feature tree.DecisionTreeRegressor and tree.DecisionTreeClassifier support missing values when splitter='best' and criterion is gini, entropy, or log_loss for classification, or squared_error, friedman_mse, or poisson for regression. #23595, #26376 by Thomas Fan.
Enhancement Adds a class_names parameter to tree.export_text. This allows specifying the parameter class_names for each target class in ascending numerical order. #25387 by William M and crispinlogan.
Fix tree.export_graphviz and tree.export_text now accept feature_names and class_names as array-like rather than lists. #26289 by Yao Xiao.
sklearn.utils
Fix Fixes utils.check_array to properly convert pandas extension arrays. #25813 and #26106 by Thomas Fan.
Fix utils.check_array now supports pandas DataFrames with extension arrays and object dtypes by returning an ndarray with object dtype. #25814 by Thomas Fan.
API Change utils.estimator_checks.check_transformers_unfitted_stateless has been introduced to ensure stateless transformers don’t raise NotFittedError during transform with no prior call to fit or fit_transform. #25190 by Vincent Maladière.
API Change A FutureWarning is now raised when instantiating a class which inherits from a deprecated base class (i.e. decorated by utils.deprecated) and which overrides the __init__ method. #25733 by Brigitta Sipőcz and Jérémie du Boisberranger.
sklearn.semi_supervised
Enhancement semi_supervised.LabelSpreading.fit and semi_supervised.LabelPropagation.fit now accept sparse matrices. #19664 by Kaushik Amar Das.
Miscellaneous
Enhancement Replace obsolete exceptions EnvironmentError, IOError and WindowsError. #26466 by Dimitri Papadopoulos Orfanos.
Code and documentation contributors
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.2, including:
2357juan, Abhishek Singh Kushwah, Adam Handke, Adam Kania, Adam Li, adienes, Admir Demiraj, adoublet, Adrin Jalali, A.H.Mansouri, Ahmedbgh, Ala-Na, Alex Buzenet, AlexL, Ali H. El-Kassas, amay, András Simon, André Pedersen, Andrew Wang, Ankur Singh, annegnx, Ansam Zedan, Anthony22-dev, Artur Hermano, Arturo Amor, as-90, ashah002, Ashish Dutt, Ashwin Mathur, AymericBasset, Azaria Gebremichael, Barata Tripramudya Onggo, Benedek Harsanyi, Benjamin Bossan, Bharat Raghunathan, Binesh Bannerjee, Boris Feld, Brendan Lu, Brevin Kunde, cache-missing, Camille Troillard, Carla J, carlo, Carlo Lemos, c-git, Changyao Chen, Chiara Marmo, Christian Lorentzen, Christian Veenhuis, Christine P. Chai, crispinlogan, Da-Lan, DanGonite57, Dave Berenbaum, davidblnc, david-cortes, Dayne, Dea María Léon, Denis, Dimitri Papadopoulos Orfanos, Dimitris Litsidis, Dmitry Nesterov, Dominic Fox, Dominik Prodinger, Edern, Ekaterina Butyugina, Elabonga Atuo, Emir, farhan khan, Felipe Siola, futurewarning, Gael Varoquaux, genvalen, Hleb Levitski, Guillaume Lemaitre, gunesbayir, Haesun Park, hujiahong726, i-aki-y, Ian Thompson, Ido M, Ily, Irene, Jack McIvor, jakirkham, James Dean, JanFidor, Jarrod Millman, JB Mountford, Jérémie du Boisberranger, Jessicakk0711, Jiawei Zhang, Joey Ortiz, JohnathanPi, John Pangas, Joshua Choo Yun Keat, Joshua Hedlund, JuliaSchoepp, Julien Jerphanion, jygerardy, ka00ri, Kaushik Amar Das, Kento Nozawa, Kian Eliasi, Kilian Kluge, Lene Preuss, Linus, Logan Thomas, Loic Esteve, Louis Fouquet, Lucy Liu, Madhura Jayaratne, Marc Torrellas Socastro, Maren Westermann, Mario Kostelac, Mark Harfouche, Marko Toplak, Marvin Krawutschke, Masanori Kanazu, mathurinm, Matt Haberland, Max Halford, maximeSaur, Maxwell Liu, m. 
bou, mdarii, Meekail Zain, Mikhail Iljin, murezzda, Nawazish Alam, Nicola Fanelli, Nightwalkx, Nikolay Petrov, Nishu Choudhary, NNLNR, npache, Olivier Grisel, Omar Salman, ouss1508, PAB, Pandata, partev, Peter Piontek, Phil, pnucci, Pooja M, Pooja Subramaniam, precondition, Quentin Barthélemy, Rafal Wojdyla, Raghuveer Bhat, Rahil Parikh, Ralf Gommers, ram vikram singh, Rushil Desai, Sadra Barikbin, SANJAI_3, Sashka Warner, Scott Gigante, Scott Gustafson, searchforpassion, Seoeun Hong, Shady el Gewily, Shiva chauhan, Shogo Hida, Shreesha Kumar Bhat, sonnivs, Sortofamudkip, Stanislav (Stanley) Modrak, Stefanie Senger, Steven Van Vaerenbergh, Tabea Kossen, Théophile Baranger, Thijs van Weezel, Thomas A Caswell, Thomas Germer, Thomas J. Fan, Tim Head, Tim P, Tom Dupré la Tour, tomiock, tspeng, Valentin Laurent, Veghit, VIGNESH D, Vijeth Moudgalya, Vinayak Mehta, Vincent M, Vincent-violet, Vyom Pathak, William M, windiana42, Xiao Yuan, Yao Xiao, Yaroslav Halchenko, Yotam Avidar-Constantini, Yuchen Zhou, Yusuf Raji, zeeshan lone