fetch_20newsgroups_vectorized

sklearn.datasets.fetch_20newsgroups_vectorized(*, subset='train', remove=(), data_home=None, download_if_missing=True, return_X_y=False, normalize=True, as_frame=False, n_retries=3, delay=1.0)

Load and vectorize the 20 newsgroups dataset (classification).

Download it if necessary.

This is a convenience function; the transformation is done using the default settings for CountVectorizer. For more advanced usage (stopword filtering, n-gram extraction, etc.), combine fetch_20newsgroups with a custom CountVectorizer, HashingVectorizer, TfidfTransformer or TfidfVectorizer.
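As a rough sketch of that more advanced route, the snippet below pairs fetch_20newsgroups with a TfidfVectorizer configured for English stop-word filtering and unigram-plus-bigram features. The helper names build_vectorizer and load_custom_vectorized, and the particular settings chosen, are illustrative assumptions and not part of scikit-learn.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer():
    """Hypothetical helper: TF-IDF with stop-word filtering and 1-2 grams."""
    return TfidfVectorizer(stop_words="english", ngram_range=(1, 2))


def load_custom_vectorized(subset="train", remove=()):
    """Hypothetical helper: fetch the raw posts, then vectorize them.

    Calling this downloads the dataset if it is not cached locally.
    """
    raw = fetch_20newsgroups(subset=subset, remove=remove)
    vectorizer = build_vectorizer()
    X = vectorizer.fit_transform(raw.data)  # sparse matrix, one row per post
    return X, raw.target, vectorizer


# The vectorizer settings can be tried on a toy corpus without any download.
toy_posts = ["the rocket reached orbit", "the goalie saved the puck"]
toy_vectorizer = build_vectorizer()
toy_X = toy_vectorizer.fit_transform(toy_posts)
print(toy_X.shape)
```

When vectorizing by hand like this, fit the vectorizer on the training posts and reuse it with transform on the test posts, so that both matrices share one vocabulary.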

The resulting counts are normalized using sklearn.preprocessing.normalize unless normalize is set to False.
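To illustrate what this normalization does, here is a minimal sketch on a toy corpus (the two example strings are made up for illustration). It relies on the defaults of sklearn.preprocessing.normalize, which scale each row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

toy_posts = ["space space shuttle", "hockey game tonight"]

# Raw term counts, as produced by a default CountVectorizer.
counts = CountVectorizer().fit_transform(toy_posts)

# normalize() defaults to norm="l2" and axis=1, so each row (document)
# is divided by its Euclidean length.
unit_rows = normalize(counts.astype(np.float64))

# Recompute the row lengths; each should be 1.0 up to rounding.
row_norms = np.sqrt(unit_rows.multiply(unit_rows).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

Passing normalize=False to fetch_20newsgroups_vectorized skips this step and leaves the raw counts in place.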

Classes	20
Samples total	18846
Dimensionality	130107
Features	real

Read more in the User Guide.

Parameters:

subset : {‘train’, ‘test’, ‘all’}, default=‘train’

Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering.

remove : tuple, default=()

May contain any subset of (‘headers’, ‘footers’, ‘quotes’). Each of these is a kind of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata.

‘headers’ removes newsgroup headers, ‘footers’ removes blocks at the ends of posts that look like signatures, and ‘quotes’ removes lines that appear to be quoting another post.

data_home : str or path-like, default=None

Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

download_if_missing : bool, default=True

If False, raise an OSError if the data is not locally available instead of trying to download the data from the source site.
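A minimal sketch of how this might be used to avoid network access: the hypothetical helper load_cached_only (not part of scikit-learn) requests the data with download_if_missing=False and treats the OSError as "not cached".

```python
import tempfile

from sklearn.datasets import fetch_20newsgroups_vectorized


def load_cached_only(data_home=None):
    """Hypothetical helper: return the dataset only if already cached."""
    try:
        return fetch_20newsgroups_vectorized(
            subset="train",
            data_home=data_home,
            download_if_missing=False,
        )
    except OSError:
        # Raised when the data is not available locally.
        return None


# An empty temporary folder holds no cached copy, so nothing is loaded
# and no download is attempted.
with tempfile.TemporaryDirectory() as empty_home:
    result = load_cached_only(data_home=empty_home)
print(result is None)
```

Pointing data_home at a folder that already holds the cached files would return the usual Bunch instead.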

return_X_y : bool, default=False

If True, returns (data.data, data.target) instead of a Bunch object.

Added in version 0.20.

normalize : bool, default=True

If True, normalizes each document’s feature vector to unit norm using sklearn.preprocessing.normalize.

Added in version 0.22.

as_frame : bool, default=False

If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string, or categorical). The target is a pandas Series.

Added in version 0.24.

n_retries : int, default=3

Number of retries when HTTP errors are encountered.

Added in version 1.5.

delay : float, default=1.0

Number of seconds between retries.

Added in version 1.5.

Returns:

bunch : Bunch

Dictionary-like object, with the following attributes.

data: {sparse matrix, dataframe} of shape (n_samples, n_features): The input data matrix. If as_frame is True, data is a pandas DataFrame with sparse columns.
target: {ndarray, series} of shape (n_samples,): The target labels. If as_frame is True, target is a pandas Series.
target_names: list of shape (n_classes,): The names of target classes.
DESCR: str: The full description of the dataset.
frame: dataframe of shape (n_samples, n_features + 1): Only present when as_frame=True. Pandas DataFrame with data and target.

Added in version 0.24.

(data, target) : tuple if return_X_y is True

data and target are in the format described for the Bunch attributes above.

Added in version 0.20.
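As a sketch of the tuple form, the hypothetical helper below (load_train_test is an illustrative name, not part of scikit-learn) unpacks both subsets directly into feature matrices and label arrays. Calling it downloads the dataset if it is not cached locally, so it is only defined here, not run.

```python
from sklearn.datasets import fetch_20newsgroups_vectorized


def load_train_test(remove=()):
    """Hypothetical helper: load both subsets as (X, y) pairs."""
    X_train, y_train = fetch_20newsgroups_vectorized(
        subset="train", remove=remove, return_X_y=True
    )
    X_test, y_test = fetch_20newsgroups_vectorized(
        subset="test", remove=remove, return_X_y=True
    )
    return X_train, y_train, X_test, y_test
```

Passing the same remove value for both subsets keeps the training and test matrices comparable.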

Examples

>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> newsgroups_vectorized = fetch_20newsgroups_vectorized(subset='test')
>>> newsgroups_vectorized.data.shape
(7532, 130107)
>>> newsgroups_vectorized.target.shape
(7532,)

Gallery examples

Model Complexity Influence

Multiclass sparse logistic regression on 20newgroups

The Johnson-Lindenstrauss bound for embedding with random projections