chi2
- sklearn.feature_selection.chi2(X, y)
Compute chi-squared stats between each non-negative feature and class.
This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.
Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.
Read more in the User guide.
- Parameters:
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
Sample vectors.
- y : array-like of shape (n_samples,)
Target vector (class labels).
- Returns:
- chi2 : ndarray of shape (n_features,)
Chi2 statistics for each feature.
- p_values : ndarray of shape (n_features,)
P-values for each feature.
See also
f_classif
ANOVA F-value between label/feature for classification tasks.
f_regression
F-value between label/feature for regression tasks.
Notes
Complexity of this algorithm is O(n_classes * n_features).
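To make the class-by-feature shape behind this note concrete, the following is a minimal sketch of how such a statistic can be computed by hand. It is an illustration under stated assumptions, not necessarily the library's internal code: the helper name chi2_sketch is hypothetical, the expected totals assume each feature's mass is split across classes in proportion to class frequency, and the p-values assume n_classes - 1 degrees of freedom. The sketch does not handle features whose column sum is zero.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.feature_selection import chi2


def chi2_sketch(X, y):
    """Illustrative chi-squared scores (hypothetical helper, not library code)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Observed per-class feature totals, shape (n_classes, n_features).
    observed = Y.T @ X
    # Expected totals if every feature were independent of the class.
    class_prob = Y.mean(axis=0)
    feature_total = X.sum(axis=0)
    expected = np.outer(class_prob, feature_total)
    # Sum the per-class contributions for each feature.
    stats = ((observed - expected) ** 2 / expected).sum(axis=0)
    # Assumption: n_classes - 1 degrees of freedom.
    p_values = chi2_dist.sf(stats, df=len(classes) - 1)
    return stats, p_values


X = np.array([[1, 1, 3], [0, 1, 5], [5, 4, 1], [6, 6, 2], [1, 4, 0], [0, 0, 0]])
y = np.array([1, 1, 0, 0, 2, 2])

sketch_stats, sketch_p = chi2_sketch(X, y)
library_stats, library_p = chi2(X, y)
# Compare the hand-rolled values with the library's output.
print(np.allclose(sketch_stats, library_stats), np.allclose(sketch_p, library_p))
```

The observed and expected tables each have n_classes * n_features entries, which is where that term in the complexity note shows up.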
Examples
>>> import numpy as np
>>> from sklearn.feature_selection import chi2
>>> X = np.array([[1, 1, 3],
...               [0, 1, 5],
...               [5, 4, 1],
...               [6, 6, 2],
...               [1, 4, 0],
...               [0, 0, 0]])
>>> y = np.array([1, 1, 0, 0, 2, 2])
>>> chi2_stats, p_values = chi2(X, y)
>>> chi2_stats
array([15.3..., 6.5 , 8.9...])
>>> p_values
array([0.0004..., 0.0387..., 0.0116... ])
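Since the score is meant for picking the highest-scoring features, a short sketch of that use may help. It passes chi2 as the score function to SelectKBest on the same toy data; the choice of k=2 is arbitrary and only for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([[1, 1, 3],
              [0, 1, 5],
              [5, 4, 1],
              [6, 6, 2],
              [1, 4, 0],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 2, 2])

# Keep the two features with the largest chi-squared statistics.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

# Boolean mask over the original columns: True marks a kept feature.
print(selector.get_support())
# The reduced matrix keeps all samples but only k columns.
print(X_selected.shape)
```

After fitting, selector.scores_ and selector.pvalues_ hold the same arrays that a direct call to chi2(X, y) returns.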
Gallery examples
Column Transformer with Mixed Types