Column Transformer with Mixed Types
This example illustrates how to apply different preprocessing and feature
extraction pipelines to different subsets of features, using
ColumnTransformer. This is particularly handy for the
case of datasets that contain heterogeneous data types, since we may want to
scale the numeric features and one-hot encode the categorical ones.
In this example, the numeric data is standard-scaled after median imputation. The
categorical data is one-hot encoded via OneHotEncoder, which
creates a new category for missing values. We further reduce the dimensionality
by selecting categories using a chi-squared test.
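As a small illustration of the categorical branch described above, the following sketch one-hot encodes a toy column that contains a missing value and then keeps half of the resulting indicator columns with a chi-squared test. The toy values and labels are made up for illustration, and the sketch assumes a scikit-learn version in which OneHotEncoder treats a missing value as its own category.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not the Titanic dataset): one categorical column
# with a missing entry in the third row.
embarked = np.array(
    [["S"], ["C"], [np.nan], ["S"], ["C"], ["S"], ["Q"], ["Q"]], dtype=object
)
survived = np.array([0, 1, 1, 0, 1, 0, 1, 0])  # made-up labels

encoder = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; densify it for inspection.
encoded = encoder.fit_transform(embarked).toarray()
print(encoder.categories_[0])  # the missing value shows up as a category
print(encoded.shape)  # one indicator column per category

# Keep the 50% of indicator columns with the highest chi-squared scores.
selector = SelectPercentile(chi2, percentile=50)
selected = selector.fit_transform(encoded, survived)
print(selected.shape)
```

The row with the missing value still gets exactly one active indicator, so no separate imputation step is needed on the categorical side in this sketch.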
In addition, we show two different ways to dispatch the columns to the particular pre-processor: by column names and by column data types.
Finally, the preprocessing pipeline is integrated in a full prediction pipeline
using Pipeline, together with a simple classification
model.
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
np.random.seed(0)
Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
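# Alternatively, X and y can be obtained from the frame attribute of the
# object returned when return_X_y is left at its default value:
# titanic = fetch_openml("titanic", version=1, as_frame=True)
# X = titanic.frame.drop('survived', axis=1)
# y = titanic.frame['survived']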
Use ColumnTransformer by selecting columns by name
We will train our classifier with the following features:
Numeric Features:
age: float;
fare: float.
Categorical Features:
embarked: categories encoded as strings {'C', 'S', 'Q'};
sex: categories encoded as strings {'female', 'male'};
pclass: ordinal integers {1, 2, 3}.
We create the preprocessing pipelines for both numeric and categorical data.
Note that pclass could either be treated as a categorical or numeric
feature.
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
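To see what such a name-based ColumnTransformer produces, here is a minimal sketch on a hypothetical four-row frame (the values are invented, and the chi-squared selector is left out because it needs a target). It assumes that setting sparse_threshold=0 is acceptable so that the output is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame, not the Titanic data.
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 58.0],
        "fare": [7.25, 71.28, 8.05, 26.55],
        "sex": ["male", "female", "female", "male"],
    }
)

toy_numeric = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", toy_numeric, ["age", "fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy)
# Two scaled numeric columns followed by one indicator column per sex value.
print(transformed.shape)
```

The output columns follow the order of the transformers list: the scaled numeric columns come first, then the one-hot columns.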
Append classifier to preprocessing pipeline. Now we have a full prediction pipeline.
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
model score: 0.798
HTML representation of Pipeline (display diagram)
When the Pipeline is printed out in a Jupyter notebook, an HTML
representation of the estimator is displayed:
clf
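Outside a notebook, the same diagram can be obtained as an HTML string. The sketch below uses sklearn.utils.estimator_html_repr on a small stand-in pipeline (the pipeline itself is hypothetical and simpler than the one in this example); the string could then be written to a file and opened in a browser.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Stand-in pipeline used only to demonstrate the HTML rendering.
toy_pipe = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)

# Returns the diagram as a plain string of HTML; the estimator does not
# need to be fitted first.
html = estimator_html_repr(toy_pipe)
print(len(html) > 0)
```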
Use ColumnTransformer by selecting columns by data type
When dealing with a cleaned dataset, the preprocessing can be automated by
using the data types of the column to decide whether to treat a column as a
numerical or categorical feature.
sklearn.compose.make_column_selector gives this possibility.
First, let’s only select a subset of columns to simplify our
example.
subset_feature = ["embarked", "sex", "pclass", "age", "fare"]
X_train, X_test = X_train[subset_feature], X_test[subset_feature]
Then, we introspect the information regarding each column data type.
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   int64
 3   age       841 non-null    float64
 4   fare      1046 non-null   float64
dtypes: category(2), float64(2), int64(1)
memory usage: 35.0 KB
We can observe that the embarked and sex columns were tagged as
category columns when loading the data with fetch_openml. Therefore, we
can use this information to dispatch the categorical columns to the
categorical_transformer and the remaining columns to the
numeric_transformer.
Note
In practice, you will have to handle the column data types yourself.
If you want some columns to be considered as category, you will have to
convert them into categorical columns. If you are using pandas, you can
refer to their documentation regarding Categorical data.
from sklearn.compose import make_column_selector as selector
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude="category")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
clf
model score: 0.798
The resulting score need not be exactly the same as the one from the previous
pipeline (here both print as 0.798 after rounding), because the dtype-based
selector treats the pclass column as a numeric feature instead of a
categorical feature as previously:
selector(dtype_exclude="category")(X_train)
['pclass', 'age', 'fare']
selector(dtype_include="category")(X_train)
['embarked', 'sex']
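If pclass should go to the categorical branch under dtype-based selection, one option is to convert it to a pandas categorical column before fitting. The following sketch shows the effect on a hypothetical three-row frame (the values are invented); it is one possible approach, not necessarily the one used elsewhere in this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame mimicking the dtypes of the Titanic subset.
toy = pd.DataFrame(
    {
        "sex": pd.Categorical(["male", "female", "female"]),
        "pclass": [1, 3, 2],
        "age": [22.0, np.nan, 35.0],
    }
)

# Before conversion, pclass is an integer column and is selected as numeric.
numeric_before = selector(dtype_exclude="category")(toy)
print(numeric_before)

# After conversion, pclass is picked up by the categorical selector instead.
toy_cat = toy.astype({"pclass": "category"})
categorical_after = selector(dtype_include="category")(toy_cat)
numeric_after = selector(dtype_exclude="category")(toy_cat)
print(categorical_after, numeric_after)
```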
Using the prediction pipeline in a grid search
Grid search can also be performed on the different preprocessing steps
defined in the ColumnTransformer object, together with the classifier’s
hyperparameters as part of the Pipeline.
We will search for both the imputer strategy of the numeric preprocessing
and the regularization parameter of the logistic regression using
RandomizedSearchCV. This
hyperparameter search randomly selects a fixed number of parameter
settings configured by n_iter. Alternatively, one can use
GridSearchCV, but then the full Cartesian product of
the parameter space will be evaluated.
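To make the difference concrete, the sketch below counts the candidates for the parameter space used in this example: an exhaustive grid covers 2 × 4 × 4 = 32 combinations, whereas a randomized search with n_iter=10 evaluates only 10 of them. ParameterGrid and ParameterSampler are used here purely for counting; they are not part of the example's pipeline.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same parameter space as in this example, repeated so the sketch is
# self-contained.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__C": [0.1, 1.0, 10, 100],
}

# Number of candidates an exhaustive grid search would evaluate.
n_grid = len(ParameterGrid(param_grid))

# Candidates a randomized search with n_iter=10 would draw.
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=0))

print(n_grid, len(sampled))
```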
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__C": [0.1, 1.0, 10, 100],
}
search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)
search_cv
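The double-underscore names in the parameter space address nested steps: step name in the outer Pipeline, then transformer name in the ColumnTransformer, then step name in the inner Pipeline, then the parameter itself. The following sketch shows this on synthetic data (the frame, the labels, and the reduced parameter space are all made up for illustration) and then fits a small randomized search.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, not the Titanic dataset.
rng = np.random.RandomState(0)
n_samples = 60
X_toy = pd.DataFrame(
    {
        "age": rng.normal(30, 10, n_samples),
        "fare": rng.exponential(30, n_samples),
    }
)
X_toy.loc[::7, "age"] = np.nan  # inject some missing ages
y_toy = (X_toy["fare"] > X_toy["fare"].median()).astype(int)

toy_clf = Pipeline(
    steps=[
        (
            "preprocessor",
            ColumnTransformer(
                transformers=[
                    (
                        "num",
                        Pipeline(
                            steps=[
                                ("imputer", SimpleImputer()),
                                ("scaler", StandardScaler()),
                            ]
                        ),
                        ["age", "fare"],
                    )
                ]
            ),
        ),
        ("classifier", LogisticRegression()),
    ]
)

# Nested parameters are exposed under their double-underscore names.
print("preprocessor__num__imputer__strategy" in toy_clf.get_params())

toy_space = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10],
}
toy_search = RandomizedSearchCV(toy_clf, toy_space, n_iter=4, cv=3, random_state=0)
toy_search.fit(X_toy, y_toy)
print(toy_search.best_params_)
```

After fitting, best_params_ holds the selected setting and cv_results_ holds one entry per sampled candidate.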