Version 0.16
Version 0.16.1
April 14, 2015
Changelog
Bug fixes
- Allow input data larger than block_size in covariance.LedoitWolf, by Andreas Müller.
- Fix a bug in isotonic.IsotonicRegression deduplication that caused unstable results in calibration.CalibratedClassifierCV, by Jan Hendrik Metzen.
- Fix sorting of labels in preprocessing.label_binarize, by Michael Heilman.
- Fix several stability and convergence issues in cross_decomposition.CCA and cross_decomposition.PLSCanonical, by Andreas Müller.
- Fix a bug in cluster.KMeans when precompute_distances=False on Fortran-ordered data.
- Fix a speed regression in ensemble.RandomForestClassifier’s predict and predict_proba, by Andreas Müller.
- Fix a regression where utils.shuffle converted lists and dataframes to arrays, by Olivier Grisel.
Version 0.16
March 26, 2015
Highlights
- Speed improvements (notably in cluster.DBSCAN), reduced memory requirements, bug fixes and better default settings.
- Multinomial logistic regression and a path algorithm in linear_model.LogisticRegressionCV.
- Out-of-core learning of PCA via decomposition.IncrementalPCA.
- Probability calibration of classifiers using calibration.CalibratedClassifierCV.
- cluster.Birch clustering method for large-scale datasets.
- Scalable approximate nearest neighbors search with locality-sensitive hashing forests in neighbors.LSHForest.
- Improved error messages and better validation when using malformed input data.
- More robust integration with pandas dataframes.
Changelog
New features
- The new neighbors.LSHForest implements locality-sensitive hashing for approximate nearest neighbors search. By Maheshakya Wijewardena.
- Added svm.LinearSVR. This class uses the liblinear implementation of Support Vector Regression, which is much faster for large sample sizes than svm.SVR with a linear kernel. By Fabian Pedregosa and Qiang Luo.
- Incremental fit for GaussianNB.
- Added sample_weight support to dummy.DummyClassifier and dummy.DummyRegressor. By Arnaud Joly.
- Added the metrics.label_ranking_average_precision_score metric. By Arnaud Joly.
- Added the metrics.coverage_error metric. By Arnaud Joly.
- Added linear_model.LogisticRegressionCV. By Manoj Kumar, Fabian Pedregosa, Gael Varoquaux and Alexandre Gramfort.
- Added warm_start constructor parameter to make it possible for any trained forest model to grow additional trees incrementally. By Laurent Direr.
- Added sample_weight support to ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor. By Peter Prettenhofer.
- Added decomposition.IncrementalPCA, an implementation of the PCA algorithm that supports out-of-core learning with a partial_fit method. By Kyle Kastner.
- Averaged SGD for SGDClassifier and SGDRegressor. By Danny Sullivan.
- Added cross_val_predict function, which computes cross-validated estimates. By Luis Pedro Coelho.
- Added linear_model.TheilSenRegressor, a robust generalized-median-based estimator. By Florian Wilhelm.
- Added metrics.median_absolute_error, a robust metric. By Gael Varoquaux and Florian Wilhelm.
- Added cluster.Birch, an online clustering algorithm. By Manoj Kumar, Alexandre Gramfort and Joel Nothman.
- Added shrinkage support to discriminant_analysis.LinearDiscriminantAnalysis using two new solvers. By Clemens Brunner and Martin Billinger.
- Added kernel_ridge.KernelRidge, an implementation of kernelized ridge regression. By Mathieu Blondel and Jan Hendrik Metzen.
- All solvers in linear_model.Ridge now support sample_weight. By Mathieu Blondel.
- Added cross_validation.PredefinedSplit cross-validation for fixed user-provided cross-validation folds. By Thomas Unterthiner.
- Added calibration.CalibratedClassifierCV, an approach for calibrating the predicted probabilities of a classifier. By Alexandre Gramfort, Jan Hendrik Metzen, Mathieu Blondel and Balazs Kegl.
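To illustrate what a robust metric such as metrics.median_absolute_error measures, the sketch below computes the median of the absolute residuals in plain Python. The helper name and the sample values are invented for illustration, and this is not the library's implementation.

```python
from statistics import median


def median_absolute_error_sketch(y_true, y_pred):
    """Median of |y_true - y_pred| (hypothetical helper, illustration only)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# The last prediction is wildly wrong. It dominates the mean absolute
# error but barely moves the median, which is what "robust" refers to.
y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 5.0, 4.0, 8.0, 104.0]

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
median_error = median_absolute_error_sketch(y_true, y_pred)
mean_error = sum(abs_errors) / len(abs_errors)
```

With these made-up values the single outlier dominates the mean of the absolute errors, while the median stays at the scale of the typical error.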
Enhancements
- Add option return_distance in hierarchical.ward_tree to return distances between nodes for both structured and unstructured versions of the algorithm. By Matteo Visconti di Oleggio Castello. The same option was added in hierarchical.linkage_tree. By Manoj Kumar.
- Add support for sample weights in scorer objects. Metrics with sample weight support will automatically benefit from it. By Noel Dawe and Vlad Niculae.
- Added newton-cg and lbfgs solver support in linear_model.LogisticRegression. By Manoj Kumar.
- Add selection="random" parameter to implement stochastic coordinate descent for linear_model.Lasso, linear_model.ElasticNet and related. By Manoj Kumar.
- Add sample_weight parameter to metrics.jaccard_similarity_score and metrics.log_loss. By Jatin Shah.
- Support sparse multilabel indicator representation in preprocessing.LabelBinarizer and multiclass.OneVsRestClassifier (by Hamzeh Alsalhi with thanks to Rohit Sivaprasad), as well as evaluation metrics (by Joel Nothman).
- Add sample_weight parameter to metrics.jaccard_similarity_score. By Jatin Shah.
- Add support for multiclass in metrics.hinge_loss. Added labels=None as optional parameter. By Saurabh Jha.
- Add sample_weight parameter to metrics.hinge_loss. By Saurabh Jha.
- Add multi_class="multinomial" option in linear_model.LogisticRegression to implement a Logistic Regression solver that minimizes the cross-entropy or multinomial loss instead of the default One-vs-Rest setting. Supports lbfgs and newton-cg solvers. By Lars Buitinck and Manoj Kumar. Solver option newton-cg by Simon Wu.
- DictVectorizer can now perform fit_transform on an iterable in a single pass, when giving the option sort=False. By Dan Blanchard.
- model_selection.GridSearchCV and model_selection.RandomizedSearchCV can now be configured to work with estimators that may fail and raise errors on individual folds. This option is controlled by the error_score parameter. This does not affect errors raised on re-fit. By Michal Romaniuk.
- Add digits parameter to metrics.classification_report to allow the report to show different precision of floating point numbers. By Ian Gilmore.
- Add a quantile prediction strategy to the dummy.DummyRegressor. By Aaron Staple.
- Add handle_unknown option to preprocessing.OneHotEncoder to handle unknown categorical features more gracefully during transform. By Manoj Kumar.
- Added support for sparse input data to decision trees and their ensembles. By Fares Hedyati and Arnaud Joly.
- Optimized cluster.AffinityPropagation by reducing the number of memory allocations of large temporary data structures. By Antony Lee.
- Parallelization of the computation of feature importances in random forest. By Olivier Grisel and Arnaud Joly.
- Add n_iter_ attribute to estimators that accept a max_iter attribute in their constructor. By Manoj Kumar.
- Added decision function for multiclass.OneVsOneClassifier. By Raghav RV and Kyle Beauchamp.
- neighbors.kneighbors_graph and radius_neighbors_graph support non-Euclidean metrics. By Manoj Kumar.
- Parameter connectivity in cluster.AgglomerativeClustering and family now accepts callables that return a connectivity matrix. By Manoj Kumar.
- Sparse support for metrics.pairwise.paired_distances. By Joel Nothman.
- cluster.DBSCAN now supports sparse input and sample weights and has been optimized: the inner loop has been rewritten in Cython and radius neighbors queries are now computed in batch. By Joel Nothman and Lars Buitinck.
- Add class_weight parameter to automatically weight samples by class frequency for ensemble.RandomForestClassifier, tree.DecisionTreeClassifier, ensemble.ExtraTreesClassifier and tree.ExtraTreeClassifier. By Trevor Stephens.
- grid_search.RandomizedSearchCV now does sampling without replacement if all parameters are given as lists. By Andreas Müller.
- Parallelized calculation of metrics.pairwise_distances is now supported for scipy metrics and custom callables. By Joel Nothman.
- Allow the fitting and scoring of all clustering algorithms in pipeline.Pipeline. By Andreas Müller.
- More robust seeding and improved error messages in cluster.MeanShift, by Andreas Müller.
- Make the stopping criterion for mixture.GMM, mixture.DPGMM and mixture.VBGMM less dependent on the number of samples by thresholding the average log-likelihood change instead of its sum over all samples. By Hervé Bredin.
- The outcome of manifold.spectral_embedding was made deterministic by flipping the sign of eigenvectors. By Hasil Sharma.
- Significant performance and memory usage improvements in preprocessing.PolynomialFeatures. By Eric Martin.
- Numerical stability improvements for preprocessing.StandardScaler and preprocessing.scale. By Nicolas Goix.
- svm.SVC fitted on sparse input now implements decision_function. By Rob Zinkov and Andreas Müller.
- cross_validation.train_test_split now preserves the input type, instead of converting to numpy arrays.
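The idea behind sampling without replacement when every parameter is given as a list can be sketched in plain Python: enumerate the finite grid once, then draw distinct settings from it. The helper name, its arguments and the example grid are assumptions made for illustration; this is not how the library implements it.

```python
import itertools
import random


def sample_param_settings(param_lists, n_iter, seed=0):
    """Draw n_iter distinct settings from a grid given as {name: list}.

    Hypothetical helper for illustration only; not scikit-learn code.
    """
    names = sorted(param_lists)
    # With list-valued parameters the search space is a finite grid.
    grid = list(itertools.product(*(param_lists[name] for name in names)))
    if n_iter > len(grid):
        raise ValueError("n_iter exceeds the number of distinct settings")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no setting repeats.
    return [dict(zip(names, combo)) for combo in rng.sample(grid, n_iter)]


settings = sample_param_settings(
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, n_iter=4
)
```

Because the draw is without replacement, a small grid is never evaluated twice at the same point, which would otherwise waste fits.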
Documentation improvements
- Added example of using pipeline.FeatureUnion for heterogeneous input. By Matt Terry.
- Documentation on scorers was improved, to highlight the handling of loss functions. By Matt Pico.
- A discrepancy between liblinear output and scikit-learn’s wrappers is now noted. By Manoj Kumar.
- Improved documentation generation: examples referring to a class or function are now shown in a gallery on the class/function’s API reference page. By Joel Nothman.
- More explicit documentation of sample generators and of data transformation. By Joel Nothman.
- sklearn.neighbors.BallTree and sklearn.neighbors.KDTree used to point to empty pages stating that they are aliases of BinaryTree. This has been fixed to show the correct class docs. By Manoj Kumar.
- Added silhouette plots for analysis of KMeans clustering using metrics.silhouette_samples and metrics.silhouette_score. See Selecting the number of clusters with silhouette analysis on KMeans clustering.
Bug fixes
- Metaestimators now support ducktyping for the presence of decision_function, predict_proba and other methods. This fixes behavior of grid_search.GridSearchCV, grid_search.RandomizedSearchCV, pipeline.Pipeline, feature_selection.RFE, feature_selection.RFECV when nested. By Joel Nothman.
- The scoring attribute of grid-search and cross-validation methods is no longer ignored when a grid_search.GridSearchCV is given as a base estimator or the base estimator doesn’t have predict.
- The function hierarchical.ward_tree now returns the children in the same order for both the structured and unstructured versions. By Matteo Visconti di Oleggio Castello.
- feature_selection.RFECV now correctly handles cases when step is not equal to 1. By Nikolay Mayorov.
- The decomposition.PCA now undoes whitening in its inverse_transform. Also, its components_ now always have unit length. By Michael Eickenberg.
- Fix incomplete download of the dataset when datasets.download_20newsgroups is called. By Manoj Kumar.
- Various fixes to the Gaussian processes subpackage by Vincent Dubourg and Jan Hendrik Metzen.
- Calling partial_fit with class_weight=='auto' throws an appropriate error message and suggests a workaround. By Danny Sullivan.
- RBFSampler with gamma=g formerly approximated rbf_kernel with gamma=g/2.; the definition of gamma is now consistent, which may substantially change your results if you use a fixed value. (If you cross-validated over gamma, it probably doesn’t matter too much.) By Dougal Sutherland.
- Pipeline object delegates the classes_ attribute to the underlying estimator. This allows, for instance, bagging of a pipeline object. By Arnaud Joly.
- neighbors.NearestCentroid now uses the median as the centroid when metric is set to manhattan. It was using the mean before. By Manoj Kumar.
- Fix numerical stability issues in linear_model.SGDClassifier and linear_model.SGDRegressor by clipping large gradients and ensuring that weight decay rescaling is always positive (for large l2 regularization and large learning rate values). By Olivier Grisel.
- When compute_full_tree was set to “auto”, the full tree was built when n_clusters was high and was stopped early when n_clusters was low, while the behavior should be the reverse in cluster.AgglomerativeClustering (and friends). This has been fixed. By Manoj Kumar.
- Fix lazy centering of data in linear_model.enet_path and linear_model.lasso_path. It was centered around one. It has been changed to be centered around the origin. By Manoj Kumar.
- Fix handling of precomputed affinity matrices in cluster.AgglomerativeClustering when using connectivity constraints. By Cathy Deng.
- Correct partial_fit handling of class_prior for sklearn.naive_bayes.MultinomialNB and sklearn.naive_bayes.BernoulliNB. By Trevor Stephens.
- Fixed a crash in metrics.precision_recall_fscore_support when using unsorted labels in the multi-label setting. By Andreas Müller.
- Avoid skipping the first nearest neighbor in the methods radius_neighbors, kneighbors, kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family, when the query data is not the same as fit data. By Manoj Kumar.
- Fix log-density calculation in the mixture.GMM with tied covariance. By Will Dawson.
- Fixed a scaling error in feature_selection.SelectFdr where a factor n_features was missing. By Andrew Tulloch.
- Fix zero division in neighbors.KNeighborsRegressor and related classes when using distance weighting and having identical data points. By Garrett-R.
- Fixed round-off errors with non-positive-definite covariance matrices in GMM. By Alexis Mignon.
- Fixed an error in the computation of conditional probabilities in naive_bayes.BernoulliNB. By Hanna Wallach.
- Make the method radius_neighbors of neighbors.NearestNeighbors return the samples lying on the boundary for algorithm='brute'. By Yan Yi.
- Flip sign of dual_coef_ of svm.SVC to make it consistent with the documentation and decision_function. By Artem Sobolev.
- Fixed handling of ties in isotonic.IsotonicRegression. We now use the weighted average of targets (secondary method). By Andreas Müller and Michael Bommarito.
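The tie handling described for isotonic.IsotonicRegression, where repeated inputs are replaced by the weighted average of their targets, can be sketched in plain Python as follows. The helper name and the sample data are assumptions for illustration, and the library's actual code may differ in its details.

```python
from collections import defaultdict


def merge_ties(x, y, weights):
    """Collapse repeated x values into one point with the weighted mean of y.

    Hypothetical helper for illustration only; not scikit-learn code.
    Returns the sorted unique x values, the merged targets and the
    summed weights of each merged point.
    """
    total_w = defaultdict(float)
    total_wy = defaultdict(float)
    for xi, yi, wi in zip(x, y, weights):
        total_w[xi] += wi
        total_wy[xi] += wi * yi
    xs = sorted(total_w)
    ys = [total_wy[v] / total_w[v] for v in xs]
    ws = [total_w[v] for v in xs]
    return xs, ys, ws


# x = 1 appears twice with targets 1.0 and 3.0 and weights 1.0 and 3.0,
# so its merged target is (1*1.0 + 3*3.0) / 4 = 2.5 with total weight 4.0.
xs, ys, ws = merge_ties([1, 1, 2], [1.0, 3.0, 5.0], [1.0, 3.0, 1.0])
```

Merging ties this way gives each distinct input a single well-defined target before the monotonic fit is computed.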
API changes summary
- GridSearchCV and cross_val_score and other meta-estimators don’t convert pandas DataFrames into arrays any more, allowing DataFrame-specific operations in custom estimators.
- multiclass.fit_ovr, multiclass.predict_ovr, predict_proba_ovr, multiclass.fit_ovo, multiclass.predict_ovo, multiclass.fit_ecoc and multiclass.predict_ecoc are deprecated. Use the underlying estimators instead.
- Nearest neighbors estimators used to take arbitrary keyword arguments and pass these to their distance metric. This will no longer be supported in scikit-learn 0.18; use the metric_params argument instead.
- n_jobs parameter of the fit method shifted to the constructor of the LinearRegression class.
- The predict_proba method of multiclass.OneVsRestClassifier now returns two probabilities per sample in the multiclass case; this is consistent with other estimators and with the method’s documentation, but previous versions accidentally returned only the positive probability. Fixed by Will Lamond and Lars Buitinck.
- Change default value of precompute in linear_model.ElasticNet and linear_model.Lasso to False. Setting precompute to “auto” was found to be slower when n_samples > n_features, since computing the Gram matrix is expensive and outweighs the benefit of fitting the Gram for just one alpha. precompute="auto" is now deprecated and will be removed in 0.18. By Manoj Kumar.
- Expose positive option in linear_model.enet_path and linear_model.enet_path which constrains coefficients to be positive. By Manoj Kumar.
- Users should now supply an explicit average parameter to sklearn.metrics.f1_score, sklearn.metrics.fbeta_score, sklearn.metrics.recall_score and sklearn.metrics.precision_score when performing multiclass or multilabel (i.e. not binary) classification. By Joel Nothman.
- scoring parameter for cross validation now accepts 'f1_micro', 'f1_macro' or 'f1_weighted'. 'f1' is now for binary classification only. Similar changes apply to 'precision' and 'recall'. By Joel Nothman.
- The fit_intercept, normalize and return_models parameters in linear_model.enet_path and linear_model.lasso_path have been removed. They had been deprecated since 0.14.
- From now onwards, all estimators will uniformly raise NotFittedError when any of the predict-like methods are called before the model is fit. By Raghav RV.
- Input data validation was refactored for more consistent input validation. The check_arrays function was replaced by check_array and check_X_y. By Andreas Müller.
- Allow X=None in the methods radius_neighbors, kneighbors, kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family. If set to None, then for every sample this avoids setting the sample itself as the first nearest neighbor. By Manoj Kumar.
- Add parameter include_self in neighbors.kneighbors_graph and neighbors.radius_neighbors_graph which has to be explicitly set by the user. If set to True, then the sample itself is considered as the first nearest neighbor.
- thresh parameter is deprecated in favor of new tol parameter in GMM, DPGMM and VBGMM. See the Enhancements section for details. By Hervé Bredin.
- Estimators will treat input with dtype object as numeric when possible. By Andreas Müller.
- Estimators now raise ValueError consistently when fitted on empty data (less than 1 sample or less than 1 feature for 2D input). By Olivier Grisel.
- The shuffle option of linear_model.SGDClassifier, linear_model.SGDRegressor, linear_model.Perceptron, linear_model.PassiveAggressiveClassifier and linear_model.PassiveAggressiveRegressor now defaults to True.
- cluster.DBSCAN now uses a deterministic initialization. The random_state parameter is deprecated. By Erich Schubert.
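The requirement to pass an explicit average for multiclass F-scores exists because the averaging choices can give noticeably different numbers on the same predictions. The plain-Python sketch below, with hypothetical helper names and made-up labels, computes micro, macro and weighted F1 by hand to show the difference. It illustrates the standard definitions and is not the library's code.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn), defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_averages(y_true, y_pred):
    """Micro, macro and support-weighted F1 (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {lab: per_class_counts(y_true, y_pred, lab) for lab in labels}
    per_class = {lab: f1_from_counts(*counts[lab]) for lab in labels}
    support = {lab: sum(t == lab for t in y_true) for lab in labels}
    # Macro: unweighted mean of the per-class scores.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: per-class scores weighted by the number of true instances.
    weighted = sum(per_class[lab] * support[lab] for lab in labels) / len(y_true)
    # Micro: pool the counts over all classes, then compute a single F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1_from_counts(tp, fp, fn)
    return {"micro": micro, "macro": macro, "weighted": weighted}


y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
scores = f1_averages(y_true, y_pred)
```

On this toy data the three averages all differ, which is why a silent default would be ambiguous for non-binary problems.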
Code Contributors
A. Flaxman, Aaron Schumacher, Aaron Staple, abhishek thakur, Akshay, akshayah3, Aldrian Obaja, Alexander Fabisch, Alexandre Gramfort, Alexis Mignon, Anders Aagaard, Andreas Mueller, Andreas van Cranenburgh, Andrew Tulloch, Andrew Walker, Antony Lee, Arnaud Joly, banilo, Barmaley.exe, Ben Davies, Benedikt Koehler, bhsu, Boris Feld, Borja Ayerdi, Boyuan Deng, Brent Pedersen, Brian Wignall, Brooke Osborn, Calvin Giles, Cathy Deng, Celeo, cgohlke, chebee7i, Christian Stade-Schuldt, Christof Angermueller, Chyi-Kwei Yau, CJ Carey, Clemens Brunner, Daiki Aminaka, Dan Blanchard, danfrankj, Danny Sullivan, David Fletcher, Dmitrijs Milajevs, Dougal J. Sutherland, Erich Schubert, Fabian Pedregosa, Florian Wilhelm, floydsoft, Félix-Antoine Fortin, Gael Varoquaux, Garrett-R, Gilles Louppe, gpassino, gwulfs, Hampus Bengtsson, Hamzeh Alsalhi, Hanna Wallach, Harry Mavroforakis, Hasil Sharma, Helder, Herve Bredin, Hsiang-Fu Yu, Hugues SALAMIN, Ian Gilmore, Ilambharathi Kanniah, Imran Haque, isms, Jake VanderPlas, Jan Dlabal, Jan Hendrik Metzen, Jatin Shah, Javier López Peña, jdcaballero, Jean Kossaifi, Jeff Hammerbacher, Joel Nothman, Jonathan Helmus, Joseph, Kaicheng Zhang, Kevin Markham, Kyle Beauchamp, Kyle Kastner, Lagacherie Matthieu, Lars Buitinck, Laurent Direr, leepei, Loic Esteve, Luis Pedro Coelho, Lukas Michelbacher, maheshakya, Manoj Kumar, Manuel, Mario Michael Krell, Martin, Martin Billinger, Martin Ku, Mateusz Susik, Mathieu Blondel, Matt Pico, Matt Terry, Matteo Visconti doc, Matti Lyra, Max Linke, Mehdi Cherti, Michael Bommarito, Michael Eickenberg, Michal Romaniuk, MLG, mr.Shu, Nelle Varoquaux, Nicola Montecchio, Nicolas, Nikolay Mayorov, Noel Dawe, Okal Billy, Olivier Grisel, Óscar Nájera, Paolo Puggioni, Peter Prettenhofer, Pratap Vardhan, pvnguyen, queqichao, Rafael Carrascosa, Raghav R V, Rahiel Kasim, Randall Mason, Rob Zinkov, Robert Bradshaw, Saket Choudhary, Sam Nicholls, Samuel Charron, Saurabh Jha, sethdandridge, sinhrks, snuderl, Stefan Otte, Stefan van der Walt, Steve Tjoa, swu, Sylvain Zimmer, tejesh95, terrycojones, Thomas Delteil, Thomas Unterthiner, Tomas Kazmar, trevorstephens, tttthomasssss, Tzu-Ming Kuo, ugurcaliskan, ugurthemaster, Vinayak Mehta, Vincent Dubourg, Vjacheslav Murashkin, Vlad Niculae, wadawson, Wei Xue, Will Lamond, Wu Jiang, x0l, Xinfan Meng, Yan Yi, Yu-Chin
