Version 0.14#
August 7, 2013
Changelog#
- Missing values with sparse and dense matrices can be imputed with the transformer preprocessing.Imputer by Nicolas Trésegnie.
- The core implementation of decision trees has been rewritten from scratch, allowing for faster tree induction and lower memory consumption in all tree-based estimators. By Gilles Louppe.
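As a rough illustration of what mean imputation (the default strategy) does, here is a pure-Python sketch; impute_mean is a hypothetical helper, not the scikit-learn API:

```python
import math

def impute_mean(rows):
    # Hypothetical sketch: replace NaN entries with the mean of
    # their column, mimicking a mean-strategy imputer.
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(observed) / len(observed))
    return [[means[j] if math.isnan(r[j]) else r[j] for j in range(n_cols)]
            for r in rows]

nan = float("nan")
X = [[1.0, 2.0], [nan, 3.0], [7.0, nan]]
print(impute_mean(X))  # column means 4.0 and 2.5 fill the gaps
```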
- Added ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor, by Noel Dawe and Gilles Louppe. See the AdaBoost section of the user guide for details and examples.
- Added grid_search.RandomizedSearchCV and grid_search.ParameterSampler for randomized hyperparameter optimization. By Andreas Müller.
- Added biclustering algorithms (sklearn.cluster.bicluster.SpectralCoclustering and sklearn.cluster.bicluster.SpectralBiclustering), data generation methods (sklearn.datasets.make_biclusters and sklearn.datasets.make_checkerboard), and scoring metrics (sklearn.metrics.consensus_score). By Kemal Eren.
- Added Restricted Boltzmann Machines (neural_network.BernoulliRBM). By Yann Dauphin.
- Python 3 support by Justin Vincent, Lars Buitinck, Subhodeep Moitra and Olivier Grisel. All tests now pass under Python 3.3.
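The idea behind randomized hyperparameter search can be sketched in pure Python; sample_params is a hypothetical helper that draws settings at random from discrete candidate lists, not the scikit-learn API:

```python
import random

def sample_params(param_distributions, n_iter, seed=0):
    # Hypothetical sketch: draw n_iter random parameter settings,
    # the core idea behind randomized hyperparameter search.
    rng = random.Random(seed)
    return [{name: rng.choice(values)
             for name, values in sorted(param_distributions.items())}
            for _ in range(n_iter)]

grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
for params in sample_params(grid, n_iter=3):
    print(params)
```

Unlike exhaustive grid search, the budget (n_iter) is fixed regardless of how many candidate values each parameter has.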
- Ability to pass one penalty (alpha value) per target in linear_model.Ridge, by @eickenberg and Mathieu Blondel.
- Fixed an L2 regularization issue in sklearn.linear_model.stochastic_gradient.py (of minor practical significance). By Norbert Crombach and Mathieu Blondel.
- Added an interactive version of Andreas Müller's Machine Learning Cheat Sheet (for scikit-learn) to the documentation. See Choosing the right estimator. By Jaques Grobler.
- grid_search.GridSearchCV and cross_validation.cross_val_score now support the use of advanced scoring functions such as area under the ROC curve and f-beta scores. See The scoring parameter: defining model evaluation rules for details. By Andreas Müller and Lars Buitinck. Passing a function from sklearn.metrics as score_func is deprecated.
- Multi-label classification output is now supported by metrics.accuracy_score, metrics.zero_one_loss, metrics.f1_score, metrics.fbeta_score, metrics.classification_report, metrics.precision_score and metrics.recall_score, by Arnaud Joly.
- Two new metrics, metrics.hamming_loss and metrics.jaccard_similarity_score, have been added with multi-label support, by Arnaud Joly.
- Speed and memory usage improvements in feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer, by Jochen Wersdörfer and Roman Sinayev.
- The min_df parameter in feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer, which used to be 2, has been reset to 1 to avoid unpleasant surprises (empty vocabularies) for novice users who try it out on tiny document collections. A value of at least 2 is still recommended for practical use.
- svm.LinearSVC, linear_model.SGDClassifier and linear_model.SGDRegressor now have a sparsify method that converts their coef_ into a sparse matrix, meaning stored models trained using these estimators can be made much more compact.
- linear_model.SGDClassifier now produces multiclass probability estimates when trained under log loss or modified Huber loss.
- Hyperlinks to documentation in example code on the website, by Martin Luessi.
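The multi-label Hamming loss mentioned above has a simple definition: the fraction of individual label assignments that disagree. A pure-Python sketch (hamming_loss here is a hypothetical re-implementation, not the metrics function itself):

```python
def hamming_loss(Y_true, Y_pred):
    # Hypothetical sketch: count disagreeing label assignments,
    # averaged over all samples and all labels.
    n_samples = len(Y_true)
    n_labels = len(Y_true[0])
    errors = sum(t != p
                 for yt, yp in zip(Y_true, Y_pred)
                 for t, p in zip(yt, yp))
    return errors / float(n_samples * n_labels)

# label indicator format: one row per sample, one column per label
Y_true = [[0, 1], [1, 1]]
Y_pred = [[0, 0], [1, 1]]
print(hamming_loss(Y_true, Y_pred))  # 1 wrong assignment out of 4 -> 0.25
```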
- Fixed a bug in preprocessing.MinMaxScaler causing incorrect scaling of the features for non-default feature_range settings. By Andreas Müller.
- max_features in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all derived ensemble estimators now supports percentage values. By Gilles Louppe.
- Performance improvements in isotonic.IsotonicRegression, by Nelle Varoquaux.
- metrics.accuracy_score has an option normalize to return the fraction or the number of correctly classified samples, by Arnaud Joly.
- Added metrics.log_loss that computes log loss, aka cross-entropy loss. By Jochen Wersdörfer and Lars Buitinck.
- A bug that caused ensemble.AdaBoostClassifier to output incorrect probabilities has been fixed.
- Feature selectors now share a mixin providing consistent transform, inverse_transform and get_support methods. By Joel Nothman.
- A fitted grid_search.GridSearchCV or grid_search.RandomizedSearchCV can now generally be pickled. By Joel Nothman.
- Refactored and vectorized implementations of metrics.roc_curve and metrics.precision_recall_curve. By Joel Nothman.
- The new estimator sklearn.decomposition.TruncatedSVD performs dimensionality reduction using SVD on sparse matrices, and can be used for latent semantic analysis (LSA). By Lars Buitinck.
- Added a self-contained example of out-of-core learning on text data: Out-of-core classification of text documents. By Eustache Diemert.
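The log loss added in this release is plain cross-entropy; for binary targets it can be sketched in a few lines of pure Python (log_loss here is a hypothetical re-implementation, not the metrics function):

```python
import math

def log_loss(y_true, y_proba, eps=1e-15):
    # Hypothetical sketch: mean cross-entropy between binary labels
    # and predicted probabilities.
    total = 0.0
    for y, p in zip(y_true, y_proba):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(log_loss([1, 0], [0.9, 0.1]))  # equals -ln(0.9), about 0.105
```

Confident predictions that are wrong are penalized heavily, which is why the clipping step matters in practice.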
- The default number of components for sklearn.decomposition.RandomizedPCA is now correctly documented to be n_features. This was the default behavior, so programs using it will continue to work as they did.
- sklearn.cluster.KMeans now fits several orders of magnitude faster on sparse data (the speedup depends on the sparsity). By Lars Buitinck.
- Reduced the memory footprint of FastICA, by Denis Engemann and Alexandre Gramfort.
- Verbose output in sklearn.ensemble.gradient_boosting now uses a column format and prints progress at decreasing frequency. It also shows the remaining time. By Peter Prettenhofer.
- sklearn.ensemble.gradient_boosting provides the out-of-bag improvement oob_improvement_ rather than the OOB score for model selection. An example that shows how to use OOB estimates to select the number of trees was added. By Peter Prettenhofer.
- Most metrics now support string labels for multiclass classification, by Arnaud Joly and Lars Buitinck.
- New OrthogonalMatchingPursuitCV class by Alexandre Gramfort and Vlad Niculae.
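One common recipe for using out-of-bag estimates to pick the number of boosting stages is to stop where the cumulative OOB improvement peaks. A pure-Python sketch with made-up improvement values (best_n_estimators is a hypothetical helper, not part of scikit-learn):

```python
def best_n_estimators(oob_improvement):
    # Hypothetical sketch: pick the boosting iteration where the
    # running sum of out-of-bag improvements is largest.
    best_n, best_total, total = 0, float("-inf"), 0.0
    for i, delta in enumerate(oob_improvement, start=1):
        total += delta
        if total > best_total:
            best_n, best_total = i, total
    return best_n

# made-up oob_improvement_ values: gains shrink, then go negative
oob = [0.50, 0.20, 0.05, -0.01, -0.02]
print(best_n_estimators(oob))  # cumulative sum peaks after 3 iterations -> 3
```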
- Fixed a bug in sklearn.covariance.GraphLassoCV: the alphas parameter now works as expected when given a list of values. By Philippe Gervais.
- Fixed an important bug in sklearn.covariance.GraphLassoCV that prevented all folds provided by a CV object from being used (only the first 3 were used). When providing a CV object, execution time may thus increase significantly compared to the previous version (but results are correct now). By Philippe Gervais.
- cross_validation.cross_val_score and the grid_search module are now tested with multi-output data, by Arnaud Joly.
- datasets.make_multilabel_classification can now return the output in label indicator multilabel format, by Arnaud Joly.
- K-nearest neighbors (neighbors.KNeighborsClassifier and neighbors.KNeighborsRegressor) and radius neighbors (neighbors.RadiusNeighborsClassifier and neighbors.RadiusNeighborsRegressor) support multioutput data, by Arnaud Joly.
- Random state in LibSVM-based estimators (svm.SVC, svm.NuSVC, svm.OneClassSVM, svm.SVR, svm.NuSVR) can now be controlled. This is useful to ensure consistency in the probability estimates for classifiers trained with probability=True. By Vlad Niculae.
- Out-of-core learning support for the discrete naive Bayes classifiers sklearn.naive_bayes.MultinomialNB and sklearn.naive_bayes.BernoulliNB, by adding a partial_fit method. By Olivier Grisel.
- New website design and navigation by Gilles Louppe, Nelle Varoquaux, Vincent Michel and Andreas Müller.
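The essence of partial_fit for count-based naive Bayes models is that each batch only updates running sufficient statistics, so earlier batches never need to be revisited. A minimal pure-Python sketch (CountAccumulator is a hypothetical class, not the scikit-learn implementation):

```python
from collections import defaultdict

class CountAccumulator:
    # Hypothetical sketch of out-of-core learning: each partial_fit
    # call folds a batch into per-class and per-feature counts.
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(float))

    def partial_fit(self, X_batch, y_batch):
        for x, y in zip(X_batch, y_batch):
            self.class_counts[y] += 1
            for j, value in enumerate(x):
                self.feature_counts[y][j] += value
        return self

acc = CountAccumulator()
acc.partial_fit([[1, 0], [0, 2]], ["spam", "ham"])
acc.partial_fit([[3, 1]], ["spam"])
print(dict(acc.class_counts))  # {'spam': 2, 'ham': 1}
```

Because the statistics are additive, memory use stays constant no matter how many batches are streamed in.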
Improved documentation on multi-class, multi-label and multi-output classification by Yannick Schwartz and Arnaud Joly.
- Better input and error handling in the sklearn.metrics module, by Arnaud Joly and Joel Nothman.
- Speed optimization of the hmm module, by Mikhail Korobov.
- Significant speed improvements for sklearn.cluster.DBSCAN, by cleverless.
API changes summary#
- auc_score was renamed metrics.roc_auc_score.
- Testing scikit-learn with sklearn.test() is deprecated. Use nosetests sklearn from the command line.
- Feature importances in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all derived ensemble estimators are now computed on the fly when accessing the feature_importances_ attribute. Setting compute_importances=True is no longer required. By Gilles Louppe.
- linear_model.lasso_path and linear_model.enet_path can return their results in the same format as that of linear_model.lars_path. This is done by setting the return_models parameter to False. By Jaques Grobler and Alexandre Gramfort.
- grid_search.IterGrid was renamed to grid_search.ParameterGrid.
- Fixed a bug in KFold causing imperfect class balance in some cases. By Alexandre Gramfort and Tadej Janež.
- sklearn.neighbors.BallTree has been refactored, and a sklearn.neighbors.KDTree has been added which shares the same interface. The Ball Tree now works with a wide variety of distance metrics. Both classes have many new methods, including single-tree and dual-tree queries, breadth-first and depth-first searching, and more advanced queries such as kernel density estimation and 2-point correlation functions. By Jake Vanderplas.
- Support for scipy.spatial.cKDTree within neighbors queries has been removed, and the functionality has been replaced with the new sklearn.neighbors.KDTree class.
- sklearn.neighbors.KernelDensity has been added, which performs efficient kernel density estimation with a variety of kernels.
- sklearn.decomposition.KernelPCA now always returns output with n_components components, unless the new parameter remove_zero_eig is set to True. This new behavior is consistent with the way kernel PCA was always documented; previously, the removal of components with zero eigenvalues was tacitly performed on all data.
- gcv_mode="auto" no longer tries to perform SVD on a densified sparse matrix in sklearn.linear_model.RidgeCV.
- Sparse matrix support in sklearn.decomposition.RandomizedPCA is now deprecated in favor of the new TruncatedSVD.
- cross_validation.KFold and cross_validation.StratifiedKFold now enforce n_folds >= 2; otherwise a ValueError is raised. By Olivier Grisel.
- datasets.load_files's charset and charset_errors parameters were renamed encoding and decode_errors.
- The attribute oob_score_ in sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier is deprecated and has been replaced by oob_improvement_.
- Attributes in OrthogonalMatchingPursuit have been deprecated (copy_X, Gram, ...) and precompute_gram was renamed precompute for consistency. See #2224.
- sklearn.preprocessing.StandardScaler now converts integer input to float, and raises a warning. Previously it rounded for dense integer input.
- sklearn.multiclass.OneVsRestClassifier now has a decision_function method. This will return the distance of each sample from the decision boundary for each class, as long as the underlying estimators implement the decision_function method. By Kyle Kastner.
- Better input validation: warning on unexpected shapes for y.
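The behavior of grid_search.ParameterGrid (expanding a dict of candidate lists into every combination) can be sketched with itertools.product; parameter_grid here is a hypothetical helper, not the scikit-learn class:

```python
import itertools

def parameter_grid(param_grid):
    # Hypothetical sketch: yield every combination of the candidate
    # values, one dict per parameter setting.
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[n] for n in names)):
        yield dict(zip(names, values))

grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}
combos = list(parameter_grid(grid))
print(len(combos))  # 2 * 2 = 4 combinations
```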
People#
List of contributors for release 0.14 by number of commits.
277 Gilles Louppe
245 Lars Buitinck
187 Andreas Mueller
124 Arnaud Joly
112 Jaques Grobler
109 Gael Varoquaux
107 Olivier Grisel
102 Noel Dawe
99 Kemal Eren
79 Joel Nothman
75 Jake VanderPlas
73 Nelle Varoquaux
71 Vlad Niculae
65 Peter Prettenhofer
64 Alexandre Gramfort
54 Mathieu Blondel
38 Nicolas Trésegnie
35 eustache
27 Denis Engemann
25 Yann N. Dauphin
19 Justin Vincent
17 Robert Layton
15 Doug Coleman
14 Michael Eickenberg
13 Robert Marchman
11 Fabian Pedregosa
11 Philippe Gervais
10 Jim Holmström
10 Tadej Janež
10 syhw
9 Mikhail Korobov
9 Steven De Gryze
8 sergeyf
7 Ben Root
7 Hrishikesh Huilgolkar
6 Kyle Kastner
6 Martin Luessi
6 Rob Speer
5 Federico Vaggi
5 Raul Garreta
5 Rob Zinkov
4 Ken Geis
3 A. Flaxman
3 Denton Cockburn
3 Dougal Sutherland
3 Ian Ozsvald
3 Johannes Schönberger
3 Robert McGibbon
3 Roman Sinayev
3 Szabo Roland
2 Diego Molla
2 Imran Haque
2 Jochen Wersdörfer
2 Sergey Karayev
2 Yannick Schwartz
2 jamestwebber
1 Abhijeet Kolhe
1 Alexander Fabisch
1 Bastiaan van den Berg
1 Benjamin Peterson
1 Daniel Velkov
1 Fazlul Shahriar
1 Felix Brockherde
1 Félix-Antoine Fortin
1 Harikrishnan S
1 Jack Hale
1 JakeMick
1 James McDermott
1 John Benediktsson
1 John Zwinck
1 Joshua Vredevoogd
1 Justin Pati
1 Kevin Hughes
1 Kyle Kelley
1 Matthias Ekman
1 Miroslav Shubernetskiy
1 Naoki Orii
1 Norbert Crombach
1 Rafael Cunha de Almeida
1 Rolando Espinoza La fuente
1 Seamus Abshere
1 Sergey Feldman
1 Sergio Medina
1 Stefano Lattarini
1 Steve Koch
1 Sturla Molden
1 Thomas Jarosch
1 Yaroslav Halchenko