Related Projects

Projects implementing the scikit-learn estimator API are encouraged to use the scikit-learn-contrib template which facilitates best practices for testing and documenting estimators. The scikit-learn-contrib GitHub organization also accepts high-quality contributions of repositories conforming to this template.
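As a rough illustration of what "implementing the scikit-learn estimator API" involves, here is a minimal pure-Python sketch of the main conventions: hyperparameters are stored unchanged in __init__, fit returns self, learned attributes end with an underscore, and get_params/set_params expose the hyperparameters. The class name MeanRegressor and its shrinkage parameter are hypothetical and exist only for this example. A real project would normally inherit from sklearn.base.BaseEstimator (and a mixin such as RegressorMixin), which supply get_params and set_params automatically.

```python
class MeanRegressor:
    """Toy regressor (hypothetical) that predicts the mean of the training targets."""

    def __init__(self, shrinkage=0.0):
        # Convention: __init__ only stores hyperparameters, without validation.
        self.shrinkage = shrinkage

    def get_params(self, deep=True):
        # Hyperparameters are exposed as a plain dict.
        return {"shrinkage": self.shrinkage}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of samples")
        # Convention: attributes learned from data end with an underscore.
        self.mean_ = (1.0 - self.shrinkage) * sum(y) / len(y)
        # Convention: fit returns self so calls can be chained.
        return self

    def predict(self, X):
        if not hasattr(self, "mean_"):
            raise RuntimeError("This estimator is not fitted yet; call fit first")
        return [self.mean_ for _ in X]


model = MeanRegressor().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
predictions = model.predict([[10], [20]])
print(predictions)
```

This sketch deliberately avoids importing scikit-learn, so it only mimics the interface; it does not pass scikit-learn's own estimator checks.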

Below is a list of sister projects, extensions, and domain-specific packages.

Interoperability and framework enhancements

These tools adapt scikit-learn for use with other technologies or otherwise enhance the functionality of scikit-learn’s estimators.

Auto-ML

auto-sklearn An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
autoviml Automatically Build Multiple Machine Learning Models with a Single Line of Code. Designed as a faster way to use scikit-learn models without having to preprocess data.
TPOT An automated machine learning toolkit that optimizes a series of scikit-learn operators to design a machine learning pipeline, including data and feature preprocessors as well as the estimators. Works as a drop-in replacement for a scikit-learn estimator.
Featuretools A framework to perform automated feature engineering. It can be used for transforming temporal and relational datasets into feature matrices for machine learning.
EvalML EvalML is an AutoML library which builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions. It incorporates multiple modeling libraries under one API, and the objects that EvalML creates use an sklearn-compatible API.
MLJAR AutoML Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation.

Experimentation and model registry frameworks

MLflow An open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
Neptune Metadata store for MLOps, built for teams that run a lot of experiments. It gives you a single place to log, store, display, organize, compare, and query all your model building metadata.
Sacred A tool to help you configure, organize, log and reproduce experiments.
Scikit-Learn Laboratory A command-line wrapper around scikit-learn that makes it easy to run machine learning experiments with multiple learners and large feature sets.

Model inspection and visualization

dtreeviz A Python library for decision tree visualization and model interpretation.
sklearn-evaluation Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis. Visual analysis, model selection, evaluation and diagnostics.
yellowbrick A suite of custom matplotlib visualizers for scikit-learn estimators to support visual feature analysis, model selection, evaluation, and diagnostics.

Model export for production

sklearn-onnx Serialization of many scikit-learn pipelines to ONNX for interchange and prediction.
skops.io A persistence model more secure than pickle, which can be used instead of pickle in most common cases.
sklearn2pmml Serialization of a wide variety of scikit-learn estimators and transformers into PMML with the help of the JPMML-SkLearn library.
treelite Compiles tree-based ensemble models into C code for minimizing prediction latency.
emlearn Implements scikit-learn estimators in C99 for embedded devices and microcontrollers. Supports several classifier, regression and outlier detection models.

Model throughput

Intel(R) Extension for scikit-learn Accelerates some scikit-learn models for both training and inference under certain circumstances, mostly on high-end Intel(R) hardware. This project is maintained by Intel(R), and scikit-learn’s maintainers are not involved in its development. Note also that in some cases the tools and estimators under scikit-learn-intelex give different results than scikit-learn itself. If you encounter issues while using this project, report them in the project's own repository.

Interface to R with genomic applications

BiocSklearn Exposes a small number of dimension reduction facilities as an illustration of the basilisk protocol for interfacing Python with R. Intended as a springboard for more complete interoperability.

Other estimators and tasks

Not everything belongs in, or is mature enough for, the central scikit-learn project. The following projects provide interfaces similar to scikit-learn for additional learning algorithms, infrastructures and tasks.

Time series and forecasting

Darts Darts is a Python library for user-friendly forecasting and anomaly detection on time series. It contains a variety of models, from classics such as ARIMA to deep neural networks. The forecasting models can all be used in the same way, using fit() and predict() functions, similar to scikit-learn.
sktime A scikit-learn compatible toolbox for machine learning with time series including time series classification/regression and (supervised/panel) forecasting.
skforecast A Python library that eases using scikit-learn regressors as multi-step forecasters. It also works with any regressor compatible with the scikit-learn API.
tslearn A machine learning library for time series that offers tools for pre-processing and feature extraction as well as dedicated models for clustering, classification and regression.

Gradient (tree) boosting

Note that scikit-learn provides its own modern gradient boosting estimators, HistGradientBoostingClassifier and HistGradientBoostingRegressor.

XGBoost XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
LightGBM LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient.

Structured learning

HMMLearn Implementation of hidden Markov models that was previously part of scikit-learn.
pomegranate Probabilistic modelling for Python, with an emphasis on hidden Markov models.

Deep neural networks etc.

skorch A scikit-learn compatible neural network library that wraps PyTorch.
scikeras A wrapper around Keras that interfaces it with scikit-learn. SciKeras is the successor of tf.keras.wrappers.scikit_learn.

Federated Learning

Flower A friendly federated learning framework with a unified approach that can federate any workload, any ML framework, and any programming language.

Privacy Preserving Machine Learning

Concrete ML A privacy preserving ML framework built on top of Concrete, with bindings to traditional ML frameworks, thanks to fully homomorphic encryption. APIs of so-called Concrete ML built-in models are very close to scikit-learn APIs.

Broad scope

mlxtend Includes a number of additional estimators as well as model visualization utilities.
scikit-lego A number of scikit-learn compatible custom transformers, models and metrics, focusing on solving practical industry tasks.

Other regression and classification

py-earth Multivariate adaptive regression splines.
gplearn Genetic Programming for symbolic regression tasks.
scikit-multilearn Multi-label classification with focus on label space manipulation.

Decomposition and clustering

lda Fast implementation of latent Dirichlet allocation in Cython which uses Gibbs sampling to sample from the true posterior distribution. (scikit-learn’s LatentDirichletAllocation implementation uses variational inference to sample from a tractable approximation of a topic model’s posterior distribution.)
kmodes k-modes clustering algorithm for categorical data, and several of its variations.
hdbscan HDBSCAN and Robust Single Linkage clustering algorithms for robust variable density clustering. As of version 1.3.0, scikit-learn also provides its own HDBSCAN estimator (sklearn.cluster.HDBSCAN).

Pre-processing

categorical-encoding A library of sklearn compatible categorical variable encoders. As of version 1.3.0, scikit-learn also provides its own TargetEncoder (sklearn.preprocessing.TargetEncoder).
imbalanced-learn Various methods to under- and over-sample datasets.
Feature-engine A library of sklearn compatible transformers for missing data imputation, categorical encoding, variable transformation, discretization, outlier handling and more. Feature-engine allows preprocessing steps to be applied to selected groups of variables, and it is fully compatible with the scikit-learn Pipeline.

Topological Data Analysis

giotto-tda A library for Topological Data Analysis aiming to provide a scikit-learn compatible API. It offers tools to transform data inputs (point clouds, graphs, time series, images) into forms suitable for computations of topological summaries, and components dedicated to extracting sets of scalar features of topological origin, which can be used alongside other feature extraction methods in scikit-learn.

Statistical learning with Python

Other packages useful for data analysis and machine learning.

Pandas Tools for working with heterogeneous and columnar data, relational queries, time series and basic statistics.
statsmodels Estimating and analysing statistical models. More focused on statistical tests and less on prediction than scikit-learn.
PyMC Bayesian statistical models and fitting algorithms.
Seaborn Visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
scikit-survival A library implementing models to learn from censored time-to-event data (also called survival analysis). Models are fully compatible with scikit-learn.

Recommendation Engine packages

implicit Library for implicit feedback datasets.
lightfm A Python/Cython implementation of a hybrid recommender system.
Surprise Lib Library for explicit feedback datasets.

Domain-specific packages

scikit-network Machine learning on graphs.
scikit-image Image processing and computer vision in Python.
Natural language toolkit (nltk) Natural language processing and some machine learning.
gensim A library for topic modelling, document indexing and similarity retrieval.
NiLearn Machine learning for neuro-imaging.
AstroML Machine learning for astronomy.

Translations of scikit-learn documentation

Translations aim to make the documentation easier to read and understand in languages other than English, helping people who do not understand English or who have doubts about its interpretation. Some people also simply prefer to read documentation in their native language, but please bear in mind that the only official documentation is the English one [1].

These translation efforts are community initiatives and we have no control over them. If you want to contribute to a translation or report an issue with it, please contact its authors. Some available translations are linked here to improve their dissemination and promote community efforts.

Footnotes