Tutorial
Tokenizing text in Python
Use tokenizers from the Python NLTK to complete a standard text normalization technique
Tokenization is a preprocessing technique in natural language processing (NLP). It breaks down unstructured text data into smaller units called tokens. A single token can range from a single character or individual word to much larger textual units.
Essentially, electronic text is nothing more than a sequence of characters. NLP tools, however, generally process text in terms of linguistic units, such as words, clauses, sentences, and paragraphs. Thus, NLP algorithms must first segment text data into separate tokens that can be processed by NLP tools.
Tokenization is one stage in text mining pipelines that converts raw text data into a structured format for machine processing. It is a prerequisite for other preprocessing techniques and is therefore often one of the first steps in an NLP pipeline. For example, stemming and lemmatization reduce morphological variants to one base word form (for example, running, runs, and runner become run). These text normalization techniques only work on tokenized text because they need some method for identifying individual words.
Readily implemented and conceptually simple, tokenization is a crucial step in preparing text data sets for neural network and transformer architectures. Tokenization is used in building deep learning models such as large language models (LLMs), as well as in conducting various NLP tasks, such as sentiment analysis and generating word embeddings. For example, GPT uses a tokenization method called byte-pair encoding (BPE).
The different types of tokenization essentially denote varying levels of granularity in the tokenization process. Word tokenization is the most common type used in introductions to tokenization, and it divides raw text into word-level units. Subword tokenization delimits text beneath the word level; wordpiece tokenization breaks text into partial word units (for example, starlight becomes star and light), and character tokenization divides raw text into individual characters (for example, letters, digits, and punctuation marks). Other tokenization methods, such as sentence tokenization, divide text above the word level.
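To make these granularity levels concrete, the following minimal sketch splits a short sample sentence at the word, subword, and character levels using plain Python string operations. The sample text and the hard-coded subword split are illustrative assumptions, not the output of any particular tokenizer.
# illustrative only: granularity levels of tokenization on a sample sentence
sample = "The starlight faded."
# word level: split on whitespace (real word tokenizers also separate punctuation)
word_level = sample.split()          # ['The', 'starlight', 'faded.']
# subword level: a wordpiece-style split of one word (hard-coded for illustration)
subword_level = ["star", "light"]    # 'starlight' -> 'star' + 'light'
# character level: every character, including spaces and punctuation, is a token
character_level = list(sample)       # ['T', 'h', 'e', ' ', 's', ...]
print(word_level, subword_level, character_level[:5], sep="\n")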
In this tutorial, you use the Python Natural Language Toolkit (NLTK) to walk through tokenizing .txt files at different levels of granularity, using an open-access Asian religious texts file that is sourced largely from Project Gutenberg. You focus on tokenization as a means to prepare raw text data for use in machine learning models and NLP tasks. Other libraries and packages, such as Keras and Gensim, also come with tokenization algorithms, and transformer architectures such as BERT implement their own tokenizers. However, this tutorial focuses on tokenization with Python NLTK.
Prerequisites
You need an IBM Cloud account to create a watsonx.ai project.
If you want to execute these instructions without the need to download, install, and configure tools, you can use the hands-on guided project, "LLM Foundations: Get started with tokenization."
Steps
Step 1. Set up your environment
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook. Jupyter Notebooks are widely used in data science to combine code, text, images, and data visualizations into a single, well-documented analysis.
Log in to watsonx.ai using your IBM Cloud account.
Create a watsonx.ai project.
Create a Jupyter Notebook.
A notebook environment opens for you to load your data set and copy code from this tutorial to tackle a simple single-file text tokenization task. To view how each block of code affects the text file, each step’s code block is best inserted as a separate cell of code in your watsonx.ai project notebook.
Step 2. Install and import relevant libraries
You need a few libraries for this tutorial. Make sure to import the necessary Python libraries. If they're not installed, you can resolve this with a quick pip install (included at the top of the following code).
# download necessary libraries and packages for tokenization
!pip install nltk -U
!pip install spacy -U
import nltk
import re
import spacy
from nltk.tokenize import word_tokenize
# download the NLTK tokenizer models that word_tokenize and sent_tokenize rely on
nltk.download('punkt')
# download and install the spacy language model
!python3 -m spacy download en_core_web_sm
sp = spacy.load('en_core_web_sm')
Step 3. Read and load the data
In this tutorial, you use the Asian religious texts data set from the UCI Machine Learning Repository. You focus on tokenizing one text file to highlight the additional steps involved in creating high-quality tokenized text data, and you also explore different types of tokenization. Note that many of these steps can be combined under one Python function to iterate through a text corpus.
- Download the Asian religious text data from the UCI Machine Learning Repository.
- Unzip the file and open the resulting folder.
- Select the Complete_data.txt file to upload from your local system to your notebook in watsonx.ai.
- Read the data into the project by selecting the </> icon in the upper right menu, and then selecting Read data.
- Select Upload a data file.
- Drag your data set over the prompt, Drop data files here or browse for files to upload.
- Return to the </> menu, and select Read data again. Click Select data from project.
- In the pop-up menu, click Data asset on the left-side menu. Select your data file (for example, Complete_data.txt), and then click the lower right-side blue button Select.
- Select Insert code to cell, or the copy to clipboard icon to manually inject the data into your notebook.
Because watsonx imports the file as a StreamingBody object, you must convert it into a text string. You also run the text through Python’s regular expressions (RegEx) library to remove line breaks and extra whitespace. These RegEx commands are not strictly necessary for tokenization, but they create a cleaner output at this stage.
# convert loaded streamingbody file into text string for preprocessing
# note "streaming_body_1" in the following line may need to be changed to whatever your file imports as
raw_bytes = streaming_body_1.read()
working_txt = raw_bytes.decode("utf-8", errors="ignore")
# clean text by removing successive white space and line breaks
clean_txt = re.sub(r"\n", " ", working_txt)
clean_txt = re.sub(r"\s+", " ", clean_txt)
clean_txt = clean_txt.strip()
print(clean_txt)
Step 4. Tokenize the text
The NLTK library comes with functions to tokenize text at various degrees of granularity. For this first task, you tokenize at the word level. You can pass your cleaned text string through the word_tokenize() function.
tokens = word_tokenize(clean_txt)
print(tokens)
Step 5. Remove noisy data
The first four tokens of the tokenization output reveal much about NLTK’s tokenizer:
“0.1” “1.The” “Buddha” “:”
In tokenization, a delimiter is the character or sequence by which the tokenizer divides tokens. The NLTK word_tokenize() function’s primary delimiter is whitespace. The function can also separate words from adjacent punctuation, as evidenced by the distinct output tokens for "Buddha" and its adjacent colon. Still, the tokenizer is not infallible: it does not recognize "1." and "The" as separate semantic units, probably because its internal rules treat a period that follows a digit with no subsequent whitespace as part of a decimal.
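To see this delimiter behavior in isolation, the following one-line check passes a short made-up string through word_tokenize(). The input is purely illustrative, and the comment reflects the behavior observed in the output above.
# based on the output above, the colon and period are split off as separate tokens,
# while "1.The" remains fused because of the digit-period-letter sequence
print(word_tokenize("1.The Buddha : spoke."))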
Overall, your tokenized output contains a lot of noise. There are tokens that consist of nothing but ellipses or colons, and others that combine digits and alphabetic characters. This creates problems if you want to use the tokenized data for training a classifier or for word embeddings. You can remove non-alphabetic tokens using the following command.
# remove non-alphabetic tokens
filtered_tokens_alpha = [word for word in tokens if word.isalpha()]
print(filtered_tokens_alpha)
Unfortunately, because this command removes all tokens that contain non-alphabetic characters, you also lose tokens that contain actual words, such as the “1.The” token. Of course, a token consisting only of “The” might be removed anyway if later steps in your text preprocessing pipeline apply a stopword list before tasks like stemming.
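As a brief illustration of that point, the following sketch filters English stopwords out of the alphabetic tokens with NLTK's stopword list. The stopwords corpus must be downloaded first, and the variable names follow the ones used earlier in this tutorial.
# download NLTK's English stopword list (safe to run more than once)
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# keep only alphabetic tokens that are not English stopwords
content_tokens = [word for word in filtered_tokens_alpha if word.lower() not in stop_words]
print(content_tokens[:50])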
For now, let’s assume that you want every word from the initial text in your tokenized output. To account for cases such as “1.The,” you must remove non-alphabetic characters before tokenization. You can modify the same RegEx commands that you previously used to remove whitespace and linebreaks from the raw text. Because some words are separated only by punctuation marks without white space, you will replace all non-alphabetic characters with a single space, then remove successive, leading, and trailing spaces.
# replace non-alphabetic characters with single whitespace
reg_txt = re.sub(r'[^a-zA-Z\s]', ' ', clean_txt)
# remove any whitespace that appears in sequence
reg_txt = re.sub(r"\s+", " ", reg_txt)
# remove any new leading and trailing whitespace
reg_txt = reg_txt.strip()
print(reg_txt)
Now, you can tokenize the regularized text.
# tokenize regularized text
reg_tokens = word_tokenize(reg_txt)
print(reg_tokens)
You now have a tokenized output with far less non-alphabetic noise. While it is not necessary to remove non-alphabetic characters from text before tokenization, doing so makes additional preprocessing (for example, stemming and lemmatization) easier and produces more meaningful results for NLP tasks, such as sentiment analysis.
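To show how the cleaned tokens feed into that additional preprocessing, here is a minimal sketch that stems the regularized word tokens with NLTK's PorterStemmer. The choice of stemmer is an assumption for illustration; lemmatization would follow a similar pattern.
# stem the regularized word tokens so that morphological variants share one base form
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in reg_tokens]
# for example, 'running' and 'runs' both reduce to 'run'
print(stemmed_tokens[:50])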
Other tokenization methods
Word-level tokenization is one of the most common types of tokenization in preparing for NLP tasks. However, it is not the only granular level for tokenizing text.
Character tokenization
One issue that can arise when using a word-level tokenizer with a pretrained vocabulary is unknown word tokens. Out-of-vocabulary (OOV) terms, that is, words the tokenizer does not recognize, might be returned as unknown tokens ([UNK]). Character tokenization is one way to solve for this. Because it tokenizes at the character level, the chance of encountering OOV terms is minuscule. However, character tokens by themselves might not provide meaningful or helpful data for NLP tasks that focus on word-level units, such as word embedding models.
The NLTK regular expression tokenizer requires a specified pattern to split text into characters. The pattern defined in the following code separates alphabetic characters, digits, punctuation marks, and spaces as individual tokens.
# import NLTK regular expression tokenizer
from nltk.tokenize import regexp_tokenize
# tokenize text at character level
pattern = r"\S|\s"
character_tokens = regexp_tokenize(clean_txt, pattern)
# print first 100 character tokens
print(character_tokens[:100])
You can also apply RegEx commands (similar to those used with the word tokenizer) before character tokenization to remove digits, punctuation, and whitespace, if required. This cleanup removes whitespace tokens, which is useful if you only care about the alphabetic characters themselves.
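As a small example of that cleanup, the following sketch simply filters the existing character tokens down to alphabetic characters after tokenization; running the RegEx commands on clean_txt before tokenizing would achieve a similar result.
# keep only alphabetic character tokens, dropping digits, punctuation, and whitespace
alpha_character_tokens = [ch for ch in character_tokens if ch.isalpha()]
print(alpha_character_tokens[:100])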
Sentence tokenization
Sentence tokenization has several use cases, such as sentiment analysis tasks or machine translation. For example, with regard to machine translation, a word’s significance or meaning in another language cannot always be determined in isolation from its context. In this case, you might prefer a sentence tokenization algorithm as opposed to word-level tokenization.
# import sentence NLTK sentence tokenizer
from nltk.tokenize import sent_tokenize
# tokenize text at sentence level
sentence_tokens = sent_tokenize(clean_txt)
# print first 10 sentence tokens
print(sentence_tokens[:10])
Like the word tokenizer, the NLTK sentence tokenizer is not infallible. The second token in the code’s output is:
"Rahula:The Buddha: "In the same way, Rahula, bodily acts, verbal acts, & mental acts are to be done with repeated reflection.The Buddha:"Whenever you want to perform a bodily act, you should reflect on it: 'This bodily act I want to perform would it lead to self-affliction, to the affliction of others, or to both?
This token contains several sentences that should have been divided into separate tokens. The tokenizer clumps them together, most likely because of the original text file's inconsistent formatting, such as missing whitespace before and after punctuation marks. There are ways to correct for these inconsistencies, but they require a more involved regularization process that is beyond the scope of this tutorial.
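As one small, partial step in that direction, the following sketch inserts a space wherever sentence-ending punctuation is immediately followed by a capital letter (as in "reflection.The" in the output above) and then re-runs the sentence tokenizer. This is only an illustrative assumption about the file's formatting, not a full regularization process.
# partial fix: add a space after ., !, or ? when a capital letter follows immediately
spaced_txt = re.sub(r'([.!?])([A-Z])', r'\1 \2', clean_txt)
resplit_sentences = sent_tokenize(spaced_txt)
print(resplit_sentences[:10])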
Summary and next steps
In this tutorial, you used tokenizers from the Python NLTK to complete a standard text normalization technique. Although this tutorial describes how tokenization works with regard to a single text, the same commands and techniques can be deployed on a corpus of text files by combining the formatting and tokenization commands under one Python function and iterating the files through that function.
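As a sketch of that approach, the following function wraps the cleaning and tokenization commands from this tutorial and applies them to every .txt file in a directory. The directory name and file encoding are assumptions that you would adapt to your own corpus.
import os
import re
from nltk.tokenize import word_tokenize
def tokenize_file(path):
    # read one text file, strip non-alphabetic characters, normalize whitespace, and tokenize
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # replace non-alphabetic characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse successive whitespace
    return word_tokenize(text)
# apply the function to every .txt file in an assumed corpus folder
corpus_dir = "corpus"  # hypothetical directory containing your .txt files
corpus_tokens = {name: tokenize_file(os.path.join(corpus_dir, name))
                 for name in os.listdir(corpus_dir) if name.endswith(".txt")}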
The NLTK library primarily supports work with English-language texts, along with some other Latin-script languages. Languages such as Chinese, Japanese, and Arabic require other preprocessing tools.
Try watsonx for free
Build an AI strategy for your business on one collaborative AI and data platform called IBM watsonx, which brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With watsonx.ai, you can train, validate, tune, and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data.
Try watsonx.ai, the next-generation studio for AI builders.
Next steps
Explore more articles and tutorials about watsonx on IBM Developer.
To continue learning about tokenization, see the following IBM Developer articles and tutorials: