DictVectorizer#
- class sklearn.feature_extraction.DictVectorizer(*, dtype=<class 'numpy.float64'>, separator='=', sparse=True, sort=True)[source]#
- Transforms lists of feature-value mappings to vectors. - This transformer turns lists of mappings (dict-like objects) of feature names to feature values into Numpy arrays or scipy.sparse matrices for use with scikit-learn estimators. - When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on. For instance, a feature “f” that can take on the values “ham” and “spam” will become two features in the output, one signifying “f=ham”, the other “f=spam”. - If a feature value is a sequence or set of strings, this transformer will iterate over the values and will count the occurrences of each string value. - However, note that this transformer will only do a binary one-hot encoding when feature values are of type string. If categorical features are represented as numeric values such as int or iterables of strings, the DictVectorizer can be followed by - OneHotEncoderto complete binary one-hot encoding.- Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix. - For an efficiency comparison of the different feature extractors, see FeatureHasher and DictVectorizer Comparison. - Read more in the User Guide. - Parameters:
- dtypedtype, default=np.float64
- The type of feature values. Passed to Numpy array/scipy.sparse matrix constructors as the dtype argument. 
- separatorstr, default=”=”
- Separator string used when constructing new features for one-hot coding. 
- sparsebool, default=True
- Whether transform should produce scipy.sparse matrices. 
- sortbool, default=True
- Whether - feature_names_and- vocabulary_should be sorted when fitting.
 
- Attributes:
- vocabulary_dict
- A dictionary mapping feature names to feature indices. 
- feature_names_list
- A list of length n_features containing the feature names (e.g., “f=ham” and “f=spam”). 
 
 - See also - FeatureHasher
- Performs vectorization using only a hash function. 
- sklearn.preprocessing.OrdinalEncoder
- Handles nominal/categorical features encoded as columns of arbitrary data types. 
 - Examples - >>> from sklearn.feature_extraction import DictVectorizer >>> v = DictVectorizer(sparse=False) >>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}] >>> X = v.fit_transform(D) >>> X array([[2., 0., 1.], [0., 1., 3.]]) >>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, ... {'baz': 1.0, 'foo': 3.0}] True >>> v.transform({'foo': 4, 'unseen_feature': 3}) array([[0., 0., 4.]]) - fit(X, y=None)[source]#
- Learn a list of feature name -> indices mappings. - Parameters:
- XMapping or iterable over Mappings
- Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype). - Changed in version 0.24: Accepts multiple string values for one categorical feature. 
- y(ignored)
- Ignored parameter. 
 
- Returns:
- selfobject
- DictVectorizer class instance. 
 
 
 - fit_transform(X, y=None)[source]#
- Learn a list of feature name -> indices mappings and transform X. - Like fit(X) followed by transform(X), but does not require materializing X in memory. - Parameters:
- XMapping or iterable over Mappings
- Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype). - Changed in version 0.24: Accepts multiple string values for one categorical feature. 
- y(ignored)
- Ignored parameter. 
 
- Returns:
- Xa{array, sparse matrix}
- Feature vectors; always 2-d. 
 
 
 - get_feature_names_out(input_features=None)[source]#
- Get output feature names for transformation. - Parameters:
- input_featuresarray-like of str or None, default=None
- Not used, present here for API consistency by convention. 
 
- Returns:
- feature_names_outndarray of str objects
- Transformed feature names. 
 
 
 - get_metadata_routing()[source]#
- Get metadata routing of this object. - Please check User Guide on how the routing mechanism works. - Returns:
- routingMetadataRequest
- A - MetadataRequestencapsulating routing information.
 
 
 - get_params(deep=True)[source]#
- Get parameters for this estimator. - Parameters:
- deepbool, default=True
- If True, will return the parameters for this estimator and contained subobjects that are estimators. 
 
- Returns:
- paramsdict
- Parameter names mapped to their values. 
 
 
 - inverse_transform(X, dict_type=<class 'dict'>)[source]#
- Transform array or sparse matrix X back to feature mappings. - X must have been produced by this DictVectorizer’s transform or fit_transform method; it may only have passed through transformers that preserve the number of features and their order. - In the case of one-hot/one-of-K coding, the constructed feature names and values are returned rather than the original ones. - Parameters:
- X{array-like, sparse matrix} of shape (n_samples, n_features)
- Sample matrix. 
- dict_typetype, default=dict
- Constructor for feature mappings. Must conform to the collections.Mapping API. 
 
- Returns:
- X_originallist of dict_type objects of shape (n_samples,)
- Feature mappings for the samples in X. 
 
 
 - restrict(support, indices=False)[source]#
- Restrict the features to those in support using feature selection. - This function modifies the estimator in-place. - Parameters:
- supportarray-like
- Boolean mask or list of indices (as returned by the get_support member of feature selectors). 
- indicesbool, default=False
- Whether support is a list of indices. 
 
- Returns:
- selfobject
- DictVectorizer class instance. 
 
 - Examples - >>> from sklearn.feature_extraction import DictVectorizer >>> from sklearn.feature_selection import SelectKBest, chi2 >>> v = DictVectorizer() >>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}] >>> X = v.fit_transform(D) >>> support = SelectKBest(chi2, k=2).fit(X, [0, 1]) >>> v.get_feature_names_out() array(['bar', 'baz', 'foo'], ...) >>> v.restrict(support.get_support()) DictVectorizer() >>> v.get_feature_names_out() array(['bar', 'foo'], ...) 
 - set_output(*, transform=None)[source]#
- Set output container. - See Introducing the set_output API for an example on how to use the API. - Parameters:
- transform{“default”, “pandas”, “polars”}, default=None
- Configure output of - transformand- fit_transform.- "default": Default output format of a transformer
- "pandas": DataFrame output
- "polars": Polars output
- None: Transform configuration is unchanged
 - Added in version 1.4: - "polars"option was added.
 
- Returns:
- selfestimator instance
- Estimator instance. 
 
 
 - set_params(**params)[source]#
- Set the parameters of this estimator. - The method works on simple estimators as well as on nested objects (such as - Pipeline). The latter have parameters of the form- <component>__<parameter>so that it’s possible to update each component of a nested object.- Parameters:
- **paramsdict
- Estimator parameters. 
 
- Returns:
- selfestimator instance
- Estimator instance. 
 
 
 - transform(X)[source]#
- Transform feature->value dicts to array or sparse matrix. - Named features not encountered during fit or fit_transform will be silently ignored. - Parameters:
- XMapping or iterable over Mappings of shape (n_samples,)
- Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype). 
 
- Returns:
- Xa{array, sparse matrix}
- Feature vectors; always 2-d. 
 
 
 
Gallery examples#
 
Column Transformer with Heterogeneous Data Sources
 
    