Tutorial
Extract text from documents using watsonx.ai TextExtraction API
Quickly make your documents searchable, accessible, and analyzable
Text extraction is essential for making documents searchable, accessible, and analyzable. It enables automation and seamless integration with business workflows by converting unstructured text into structured data.
Using the watsonx.ai TextExtraction API, you can efficiently extract text from PDFs and other highly structured documents and convert it to a file format that is easier to work with programmatically, such as Markdown or JSON.
To see supported file types and languages, see the IBM watsonx documentation.
To use the watsonx.ai TextExtraction API, you must store your documents or files in an IBM Cloud Object Storage bucket and then connect watsonx.ai to that bucket.
In this tutorial, you'll learn two approaches:
- User interface
- Python code
You will learn how to:
- Connect IBM Cloud Object Storage with watsonx.ai
- Upload and download files from the Cloud Object Storage bucket
- Extract text
Prerequisites
To complete this tutorial, you require an IBM Cloud account with the following service instances:
- Cloud Object Storage
- watsonx.ai Runtime
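If you plan to run the Python examples, you also need the SDK packages that the code imports. Assuming the standard PyPI package names for these imports, you can install them with:

pip install ibm-watsonx-ai ibm-cos-sdk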
Approach 1: User interface
In this section, you'll learn how to complete the following tasks:
- Upload files using the UI
- Connect watsonx.ai and Cloud Object Storage using the UI
- Extract text using Python code
- Download files using the UI
Step 1: Create a task credential
A task credential is an API key that is used to authenticate long-running jobs that are started by steps that you will complete in this section of the tutorial.
See Managing task credentials in the IBM watsonx documentation.
Step 2: Create a bucket and upload files
- From the IBM Cloud Resource list, click Storage and then click to open your IBM Cloud Object Storage (COS) service instance.
- Click Create Bucket. You can create any type of bucket according to your requirements, but for an easy start, click Create a Custom Bucket.
- Give your bucket a unique name, keep all the default configurations, and click Create Bucket.
- When the bucket is created successfully, from the Objects tab of the bucket, click Upload or drag and drop to upload your files. When the upload is successfully completed, you will see a "success" message.
Step 3: Create an IBM Cloud Object Storage connection with watsonx.ai
- In the watsonx platform, create a new project or use an existing one.
- From your project, click New Asset > Connect to a data source. Choose IBM Cloud Object Storage and click Next.
Complete all the required fields in the form.
- Integrated instance [Optional]: Select a service instance to automatically fill in the connection form.
- Name [required]: Provide your choice of connection name.
- Bucket: Use the bucket name that you created in step 2.
- Login URL [required]: To get the login URL, open the storage bucket in IBM Cloud Object Storage, click the Configurations tab, and then copy the public endpoint URL.
- Credential setting: Shared [recommended]
- Authentication method [required]: Set the authentication method to Access Key and Secret Key. From the main service page for your IBM Cloud Object Storage service instance, click the Service credentials tab, find the key that you want to use, and expand it to see the key details. The key must include an HMAC credential; if you don't have an existing key, create one with an HMAC credential. Copy the following values into the corresponding fields on the connection asset details page:
- Access key - access_key_id
- Secret key - secret_access_key
When you have provided all details, click Test Connection. If you have successfully connected, you will see the message: "The test was successful. Click Create to save the connection information."
- Click Create. (You can ignore any warning messages.)
- Important: From the Assets page of your project, open this connection asset. In the URL of the current page, copy the connection ID that appears after the /connections/ segment. You will use this ID later in the tutorial.
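For example, in a project URL of roughly the following shape (the exact host and query parameters can vary), the connection ID is the <connection-id> segment:

https://dataplatform.cloud.ibm.com/connections/<connection-id>?project_id=<project-id>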
Note: If the test connection fails, try creating another connection with only the connection name and save the connection by clicking Create. You can then edit the connection by adding the login URL and authentication method and test again.
Step 4: Extract text
Import all the required Python packages:

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.helpers import DataConnection, S3Location
import ibm_boto3
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames
import time

Initialize the watsonx.ai client:

credentials = Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<your_cloud_env_api_key>")
project_id = "<your_wx.ai_project_id>"
wx_client = APIClient(credentials=credentials, project_id=project_id)

Initialize connection_asset_id and bucketname:

connection_asset_id = "<your_connection_asset_id>"
bucketname = "<your_bucket_name>"

Initialize the Cloud Object Storage client. Note: You can get the access_key_id and secret_access_key values from the HMAC credential that you used in step 3, above:

CloudObjectStorage_client = ibm_boto3.client(
    service_name='s3',
    aws_access_key_id='<access_key_id>',
    aws_secret_access_key='<secret_access_key>',
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud'
)

Run the text extraction code:

response = CloudObjectStorage_client.list_objects_v2(Bucket=bucketname)
if "Contents" in response:
    for obj in response["Contents"]:
        print(obj["Key"])
        local_source_file_name = obj["Key"]
        source_file_name = local_source_file_name
        results_file_name = source_file_name.replace("pdf", "json")
        # create a data connection object for the source PDF file
        document_reference = DataConnection(connection_asset_id=connection_asset_id,
                                            location=S3Location(bucket=bucketname, path=source_file_name))
        # create a data connection object for the results file
        results_reference = DataConnection(connection_asset_id=connection_asset_id,
                                           location=S3Location(bucket=bucketname, path=results_file_name))
        extraction = TextExtractions(api_client=wx_client, project_id=project_id)
        steps = {TextExtractionsMetaNames.OCR: {'languages_list': ['en']},
                 TextExtractionsMetaNames.TABLE_PROCESSING: {'enabled': True}}
        # run the text extraction job
        extraction.run_job(document_reference=document_reference,
                           results_reference=results_reference,
                           steps=steps)
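Note that run_job starts each extraction asynchronously; it does not block until the job finishes. If you want to wait for a job before moving on, capture the job ID from run_job's return value and poll its status, as in this minimal sketch (the same get_id and get_job_details calls are used in Approach 2, below):

details = extraction.run_job(document_reference=document_reference,
                             results_reference=results_reference,
                             steps=steps)
extraction_job_id = extraction.get_id(extraction_details=details)
while True:
    # check the job status every 5 seconds until it completes or fails
    status = extraction.get_job_details(extraction_id=extraction_job_id)['entity']['results']['status']
    print(status)
    if status in ("completed", "failed"):
        break
    time.sleep(5)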
Step 5: Download extracted files
When the text extraction is complete, you can download .json files with extracted text from the bucket.
Note: To download the files locally, you can run the code in the text_extraction function in Approach 2, below.
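Alternatively, here is a minimal sketch that downloads every .json result file directly, reusing the CloudObjectStorage_client initialized in step 4 (saving each file locally under its object key is an assumption for illustration):

response = CloudObjectStorage_client.list_objects_v2(Bucket=bucketname)
for obj in response.get("Contents", []):
    key = obj["Key"]
    if key.endswith(".json"):
        # download each extracted result, keeping the object key as the local file name
        CloudObjectStorage_client.download_file(Bucket=bucketname, Key=key, Filename=key)
        print("downloaded", key)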
Approach 2: Python code
In this section, you will complete the following tasks:
- Connect watsonx.ai and Cloud Object Storage using Python code
- Upload multiple files from your local system to your Cloud Object Storage bucket
- Extract text using Python code
- Download extracted text files from the bucket to your local system
Step 1: Create a task credential
A task credential is an API key that is used to authenticate long-running jobs that are started by steps that you will complete in this section of the tutorial.
See Managing task credentials in the IBM watsonx documentation.
Step 2: Create a Cloud Object Storage bucket
Create a bucket using a Cloud Object Storage service instance, as detailed in Step 2: Create a bucket and upload files, above.
Step 3: Create a Cloud Object Storage connection with watsonx.ai
Import all necessary Python packages:
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.helpers import DataConnection, S3Location
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames
import time
import os
import ibm_boto3

Initialize the watsonx.ai client:

credentials = Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<your_cloud_env_api_key>")
project_id = "<your_wx.ai_project_id>"
wx_client = APIClient(credentials=credentials, project_id=project_id)

Initialize datasource_name and bucketname:

datasource_name = 'bluemixcloudobjectstorage'
bucketname = "<your_bucket_name>"

Initialize the Cloud Object Storage credentials:

cos_credentials = {
    "endpoint_url": "<endpoint url>",
    "apikey": "<apikey>",
    "access_key_id": "<access_key_id>",
    "secret_access_key": "<secret_access_key>"
}

Initialize the connection properties:

conn_meta_props = {
    wx_client.connections.ConfigurationMetaNames.NAME: f"Connection to Database - {datasource_name}",
    wx_client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: wx_client.connections.get_datasource_type_id_by_name(datasource_name),
    wx_client.connections.ConfigurationMetaNames.DESCRIPTION: "Connection to external Database",
    wx_client.connections.ConfigurationMetaNames.PROPERTIES: {
        'bucket': bucketname,
        'access_key': cos_credentials['access_key_id'],
        'secret_key': cos_credentials['secret_access_key'],
        'iam_url': 'https://iam.cloud.ibm.com/identity/token',
        'url': cos_credentials['endpoint_url']
    }
}

Create the connection:

conn_details = wx_client.connections.create(meta_props=conn_meta_props)
connection_asset_id = wx_client.connections.get_id(conn_details)
print(connection_asset_id)

Note: Print and save this connection_asset_id to use in a later step.
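If you lose the ID, you can list the existing connection assets in the project with the same connections API used above:

wx_client.connections.list()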
Step 4: Extract text
In this approach, you choose files from your local system and upload them to the Cloud Object Storage bucket for text extraction. When the extraction job finishes, you download the extracted text as a Markdown file to your local system. (Note: This example extracts text from PDF files.)
def text_extraction(file_paths, extraction, steps, results_path):
for path in file_paths:
        local_source_file_path = str(path)  # convert the Path object to a string
        source_file_name = local_source_file_path.split("/")[-1]
        if source_file_name == '.DS_Store':  # skip macOS metadata files
            continue
results_file_name = "text_extracted_" + source_file_name.replace("pdf", "json")
connection_asset_id = "<your_connection_asset_id>"
remote_document_reference = DataConnection(connection_asset_id=connection_asset_id, # object for remote data connection to bucket
location=S3Location(bucket = bucketname, path = "."))
remote_document_reference.set_client(wx_client)
try:
remote_document_reference.write(data=local_source_file_path, remote_name=source_file_name) # uploading file to bucket
document_reference = DataConnection(connection_asset_id=connection_asset_id, # creating object for that pdf file
location=S3Location(bucket=bucketname,
path=source_file_name))
results_reference = DataConnection(connection_asset_id=connection_asset_id, # creating object for resultant file
location=S3Location(bucket=bucketname,
path=results_file_name))
# for each file extraction job is created
details = extraction.run_job(document_reference=document_reference,
results_reference=results_reference,
steps=steps,
results_format="markdown")
extraction_job_id = extraction.get_id(extraction_details=details)
print("\n" + source_file_name + " - " + extraction_job_id)
            # Poll the job status every 5 seconds while the extraction runs.
            # Large files take longer to upload to the bucket and then extract, so keep checking.
            # If the job completes, the code continues to the download step;
            # if the job fails, the result file is not downloaded.
while True:
status_json = extraction.get_job_details(extraction_id=extraction_job_id)
status = status_json['entity']['results']['status']
print(status)
if status=="failed":
print(status_json)
break
if status!="completed":
time.sleep(5)
else:
break
if status=="completed":
results_reference = extraction.get_results_reference(extraction_id=extraction_job_id)
filename = source_file_name.replace("pdf", "md")
                results_reference.download(results_path + "/" + filename)  # download the result file with the extracted text to your local system
print("saved as " + filename)
        except Exception as e:
            print("error:", e)
return "done"
# input pdfs folder path and output results path
input_folder_path = "./input" # your input files folder path
results_path = "./output" # folder in which you want to save the text extracted files
from pathlib import Path
file_paths = [file for file in Path(input_folder_path).iterdir() if file.is_file()]
# calling text_extraction function
extraction = TextExtractions(api_client=wx_client, project_id=project_id)
steps = {TextExtractionsMetaNames.OCR: {'languages_list': ['en']},
TextExtractionsMetaNames.TABLE_PROCESSING: {'enabled': True}}
text_extraction(file_paths, extraction, steps, results_path)
Step 5: Clean the Cloud Object Storage bucket
Note: The following step is optional.
When the text extraction is complete, the uploaded source files and the result files remain in the bucket. If you want to clean up the bucket after the extraction, run the following code:
# Initialize the Cloud Object Storage client
CloudObjectStorage_client = ibm_boto3.client(
service_name='s3',
aws_access_key_id=os.getenv('ACCESS_KEY_ID'),
aws_secret_access_key=os.getenv('SECRET_ACCESS_KEY'),
endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud'
)
# delete all objects in bucket
def delete_bucket_objects(bucketname):
try:
# List all objects in the bucket
response = CloudObjectStorage_client.list_objects_v2(Bucket=bucketname)
if 'Contents' in response:
for obj in response['Contents']:
print(f"Deleting {obj['Key']}...")
CloudObjectStorage_client.delete_object(Bucket=bucketname, Key=obj['Key'])
print("All objects deleted successfully.")
else:
print("Bucket is already empty.")
except Exception as e:
print(f"Error: {e}")
# calling the delete_bucket_objects function
delete_bucket_objects(bucketname)
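If you also want to remove the now-empty bucket itself, the COS client exposes a delete_bucket operation; a one-line sketch:

# Optional: delete the bucket after all of its objects have been removed
CloudObjectStorage_client.delete_bucket(Bucket=bucketname)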
Summary and next steps
In this tutorial, you've learned how to use the TextExtractions Manager to submit text extraction requests, monitor the job status, and download the resulting file. Check out our IBM Watson Machine Learning library for more samples, tutorials, documentation, how-tos, and blog posts.
Simplifying your business documents in this manner is particularly beneficial for retrieval-augmented generation (RAG) tasks, where the goal is to retrieve relevant information in response to a user query and provide it as input to a foundation model. Supplying accurate contextual data helps the model generate outputs that are more factual and up to date.
To continue building your knowledge and skills, see Retrieval-augmented generation in the IBM watsonx documentation.