This is a cache of https://developer.ibm.com/tutorials/extracting-text-watsonx-ai-textextraction-api/. It is a snapshot of the page as it appeared on 2025-11-14T13:48:39.265+0000.
Extract text from documents using watsonx.ai textExtraction API - IBM Developer
Text extraction is essential for making documents searchable, accessible, and analyzable. It enables automation and seamless integration with business workflows by converting unstructured text into structured data.
Using watsonx.ai textExtraction API, you can efficiently extract text from PDFs and other highly structured documents. You can extract and convert to a file format that is easier to work with programmatically, such as Markdown or JSON.
To use the watsonx.ai textExtraction API, you must store your documents or files in an IBM Cloud Object Storage bucket and then connect watsonx.ai to that bucket.
In this tutorial, you'll learn two approaches:
User interface
Python code
You will learn how to:
Connect IBM Cloud Object Storage with watsonx.ai
Upload and download files from the Cloud Object Storage bucket
Extract text
Prerequisites
To complete this tutorial, you require an IBM Cloud account with the following service instances:
Cloud Object Storage
watsonx.ai Runtime
Approach 1: User interface
In this section, you'll learn how to complete the following tasks:
Upload files using the UI
Connect watsonx.ai and Cloud Object Storage using the UI
Extract text using Python code
Download files using the UI
Step 1: Create a task credential
A task credential is an API key that is used to authenticate long-running jobs that are started by steps that you will complete in this section of the tutorial.
Step 2: Create a bucket and upload files
From the IBM Cloud Resource list, click Storage and then click to open your IBM Cloud Object Storage (COS) service instance.
Click Create Bucket. You can choose to create any bucket according to your requirements, but for an easy start, click Create a Custom Bucket.
Give your bucket a unique name, keep all the default configurations, and click Create Bucket.
When the bucket is created successfully, from the Objects tab of the bucket, click Upload or drag and drop to upload your files. When the upload is successfully completed, you will see a "success" message.
Step 3: Create an IBM Cloud Object Storage connection with watsonx.ai
In the watsonx platform, create a new project or use an existing one.
From your project, click New Asset > Connect to a data source. Choose IBM Cloud Object Storage and click Next.
Complete all the required fields in the form.
Integrated instance [Optional]: Select a service instance to automatically fill in the connection form.
Name [required]: Provide your choice of connection name.
Bucket: Use the bucket name that you created in step 2.
Login URL [required]: To get the login URL, open the storage bucket in IBM Cloud Object Storage, click the Configurations tab, and then copy the public endpoint URL.
Credential setting: Shared [recommended]
Authentication method [required]: Set the authentication method as Access Key and Secret Key. From the main service page for your IBM Cloud Object Storage service instance, click the Service credentials tab, and then find the key that you want to use and expand it so you can see the key details. The key that you use must include an HMAC credential. If you don't have an existing key, create one with an HMAC credential. Copy the following values to paste into the corresponding fields in the connection asset details page:
Access key - access_key_id
Secret key - secret_access_key
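As a rough illustration of where those two values live: a service credential created with HMAC enabled typically carries them under a cos_hmac_keys object in the credential JSON. The sketch below assumes that you pasted the credential JSON into a string. The helper name hmac_keys_from_credential and the sample values are hypothetical placeholders, not real keys.

```python
import json


def hmac_keys_from_credential(credential_json: str) -> tuple[str, str]:
    """Return (access_key_id, secret_access_key) from a COS service credential."""
    credential = json.loads(credential_json)
    hmac = credential.get("cos_hmac_keys")
    if not hmac:
        # Credentials created without the HMAC option have no such section.
        raise ValueError("Credential has no HMAC keys; create one with HMAC enabled.")
    return hmac["access_key_id"], hmac["secret_access_key"]


# Placeholder credential for illustration only.
sample = (
    '{"apikey": "xxx", '
    '"cos_hmac_keys": {"access_key_id": "AKID", "secret_access_key": "SECRET"}}'
)
access_key, secret_key = hmac_keys_from_credential(sample)
print(access_key)
```

The first value goes into the Access key field and the second into the Secret key field of the connection form.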
When you have provided all details, click Test Connection. If you have successfully connected, you will see the message: "The test was successful. Click Create to save the connection information."
Click Create. (You can ignore any warning messages.)
Important: From the Assets page of your project, open this connection asset. In the URL for the current page, copy the connection ID that is displayed after the /connections/ segment of the URL. You will use this later in the tutorial.
Note: If the test connection fails, try creating another connection with only the connection name and save the connection by clicking Create. You can then edit the connection by adding the login URL and authentication method and test again.
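If you prefer not to pick the connection ID out of the address bar by hand, a small helper can do it. This is only a sketch: the function name connection_id_from_url and the example URL are hypothetical, and the only assumption taken from the tutorial is that the ID is the path segment directly after /connections/.

```python
from urllib.parse import urlparse


def connection_id_from_url(page_url: str) -> str:
    """Return the path segment that follows /connections/ in a page URL."""
    segments = [s for s in urlparse(page_url).path.split("/") if s]
    if "connections" not in segments:
        raise ValueError("URL has no /connections/ segment")
    index = segments.index("connections")
    if index + 1 >= len(segments):
        raise ValueError("No ID found after /connections/")
    return segments[index + 1]


# Hypothetical URL shape, for illustration only.
example_url = (
    "https://example.com/connections/"
    "1234abcd-0000-1111-2222-333344445555?project_id=abcd"
)
connection_asset_id = connection_id_from_url(example_url)
print(connection_asset_id)
```

Query parameters such as the project ID are ignored because only the URL path is inspected.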
Step 4: Extract text
Import all the required Python packages:
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.helpers import DataConnection, S3Location
import ibm_boto3
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames
import time
# Assumes CloudObjectStorage_client, bucketname, connection_asset_id,
# wx_client, and project_id are already defined.
response = CloudObjectStorage_client.list_objects_v2(Bucket=bucketname)
if "Contents" in response:
    for obj in response["Contents"]:
        print(obj["Key"])
        local_source_file_name = obj["Key"]
        source_file_name = local_source_file_name
        results_file_name = source_file_name.replace("pdf", "json")
        # creating object for that pdf file
        document_reference = DataConnection(connection_asset_id=connection_asset_id,
                                            location=S3Location(bucket=bucketname,
                                                                path=source_file_name))
        # creating object for resultant file
        results_reference = DataConnection(connection_asset_id=connection_asset_id,
                                           location=S3Location(bucket=bucketname,
                                                               path=results_file_name))
        extraction = TextExtractions(api_client=wx_client,
                                     project_id=project_id)
        steps = {TextExtractionsMetaNames.OCR: {'language_list': ['en']},
                 TextExtractionsMetaNames.TABLE_PROCESSING: {'enabled': True}}
        # Running job to extract text
        extraction.run_job(document_reference=document_reference,
                           results_reference=results_reference,
                           steps=steps)
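One thing to be aware of in the code above: str.replace("pdf", "json") rewrites every occurrence of pdf in the object key, not only the extension, so a key such as pdf_report.pdf would become json_report.json. A hedged alternative is to swap only the final suffix. The helper name results_name_for is hypothetical and is not part of the watsonx.ai SDK.

```python
from pathlib import PurePosixPath


def results_name_for(source_key: str, new_suffix: str = ".json") -> str:
    """Swap only the final extension of an object key, keeping any folder prefix."""
    return str(PurePosixPath(source_key).with_suffix(new_suffix))


print(results_name_for("pdf_report.pdf"))
print(results_name_for("reports/2024/summary.pdf", ".md"))
```

PurePosixPath is used because object keys use forward slashes regardless of the operating system that runs the script.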
Step 5: Download extracted files
When the text extraction is complete, you can download .json files with extracted text from the bucket.
Note: To download the files locally, run the code in the text extraction function, below.
Approach 2: Python code
In this section, you will complete the following tasks:
Connect watsonx.ai and Cloud Object Storage using Python code
Upload multiple files from your local system to your Cloud Object Storage bucket
Extract text using Python code
Download extracted text files from the bucket to your local system
Step 1: Create a task credential
A task credential is an API key that is used to authenticate long-running jobs that are started by steps that you will complete in this section of the tutorial.
Step 2: Create a bucket
Create a bucket using a Cloud Object Storage service instance, as detailed in Step 2: Create a bucket and upload files, above.
Step 3: Create a Cloud Object Storage connection with watsonx.ai
Import all necessary Python packages:
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.helpers import DataConnection, S3Location
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames
import time
import os
import ibm_boto3
Note: Print and save this connection_asset_id to use in a later step.
Step 4: Extract text
We are choosing files from our local system to upload to the Cloud Object Storage bucket for text extraction. When the extraction job is finished, we will download the extracted text Markdown file to our local system. (Note: We are extracting text from PDF files.)
def text_extraction(file_paths, extraction, steps, results_path):
    for path in file_paths:
        # str() lets this accept both plain strings and pathlib.Path objects
        local_source_file_path = str(path)
        source_file_name = os.path.basename(local_source_file_path)
        if source_file_name == '.DS_Store':
            continue
        results_file_name = "text_extracted_" + source_file_name.replace("pdf", "json")
        connection_asset_id = "<your_connection_asset_id>"
        # object for remote data connection to bucket
        remote_document_reference = DataConnection(connection_asset_id=connection_asset_id,
                                                   location=S3Location(bucket=bucketname, path="."))
        remote_document_reference.set_client(wx_client)
        try:
            # uploading file to bucket
            remote_document_reference.write(data=local_source_file_path, remote_name=source_file_name)
            # creating object for that pdf file
            document_reference = DataConnection(connection_asset_id=connection_asset_id,
                                                location=S3Location(bucket=bucketname,
                                                                    path=source_file_name))
            # creating object for resultant file
            results_reference = DataConnection(connection_asset_id=connection_asset_id,
                                               location=S3Location(bucket=bucketname,
                                                                   path=results_file_name))
            # an extraction job is created for each file
            details = extraction.run_job(document_reference=document_reference,
                                         results_reference=results_reference,
                                         steps=steps,
                                         results_format="markdown")
            extraction_job_id = extraction.get_id(extraction_details=details)
            print("\n" + source_file_name + " - " + extraction_job_id)
            # The loop below checks the job status every 5 seconds while extraction is running.
            # Large files take longer to upload and extract, so it is better to keep the check.
            # If the job completes, the code moves on to the download step.
            # If the job fails, the download step is skipped.
            while True:
                status_json = extraction.get_job_details(extraction_id=extraction_job_id)
                status = status_json['entity']['results']['status']
                print(status)
                if status == "failed":
                    print(status_json)
                    break
                if status != "completed":
                    time.sleep(5)
                else:
                    break
            if status == "completed":
                results_reference = extraction.get_results_reference(extraction_id=extraction_job_id)
                filename = source_file_name.replace("pdf", "md")
                # download the resultant file with extracted text to the local system
                results_reference.download(results_path + "/" + filename)
                print("saved as " + filename)
        except Exception as e:
            # results_reference may not exist yet if the upload failed, so report the file name instead
            print("error : ", e, source_file_name)
    return "done"
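The status loop in the function above keeps polling until the job reports completed or failed, with no upper bound on the wait. The following sketch shows the same polling pattern with a timeout added. It takes the status lookup as a callable so that it can be tried without a watsonx.ai connection. The names wait_for_job and get_status and the timeout values are hypothetical.

```python
import time
from typing import Callable


def wait_for_job(get_status: Callable[[], str],
                 interval: float = 5.0,
                 timeout: float = 600.0) -> str:
    """Poll get_status() until it returns 'completed' or 'failed', or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        time.sleep(interval)


# Stand-in for the real status lookup: yields a fixed sequence of states.
fake_states = iter(["submitted", "running", "completed"])
final_status = wait_for_job(lambda: next(fake_states), interval=0.01, timeout=5)
print(final_status)
```

In the tutorial's function, the callable could wrap the existing lookup, for example lambda: extraction.get_job_details(extraction_id=extraction_job_id)['entity']['results']['status'].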
from pathlib import Path

# input pdfs folder path and output results path
input_folder_path = "./input"  # your input files folder path
results_path = "./output"  # folder in which you want to save the text extracted files
file_paths = [file for file in Path(input_folder_path).iterdir() if file.is_file()]
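The list comprehension above picks up every file in the input folder, which is why the extraction function has to skip .DS_Store explicitly. A hedged variant is to filter by extension when you build the list and to return plain strings. The helper name collect_pdf_paths is hypothetical, and the temporary folder exists only to make the example runnable on its own.

```python
import tempfile
from pathlib import Path


def collect_pdf_paths(folder: str) -> list[str]:
    """Return sorted paths (as strings) of the .pdf files directly inside folder."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demonstration with throwaway files instead of a real ./input folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "B.PDF", ".DS_Store", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in collect_pdf_paths(tmp)]
    print(found)
```

The suffix comparison is case-insensitive, so files named with .PDF are included as well.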
When the text extraction is completed, .json files are also saved in the bucket. If you want to clean the bucket after the extraction, run the following code:
# delete all objects in bucket
def delete_bucket_objects(bucketname):
    try:
        # List all objects in the bucket
        response = CloudObjectStorage_client.list_objects_v2(Bucket=bucketname)
        if 'Contents' in response:
            for obj in response['Contents']:
                print(f"Deleting {obj['Key']}...")
                CloudObjectStorage_client.delete_object(Bucket=bucketname, Key=obj['Key'])
            print("All objects deleted successfully.")
        else:
            print("Bucket is already empty.")
    except Exception as e:
        print(f"Error: {e}")
# calling the delete_bucket_objects function
delete_bucket_objects(bucketname)
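S3-style list_objects_v2 calls return results in pages, so a single call as used above might not see every object in a large bucket. The sketch below assumes that ibm_boto3 follows the standard S3 pagination fields (IsTruncated, NextContinuationToken, and the ContinuationToken argument). The function delete_all_objects and the FakeClient class are hypothetical. The fake client stands in for a real client so that the example can run without cloud access.

```python
def delete_all_objects(client, bucket: str) -> int:
    """Delete every object in bucket, following list_objects_v2 pagination."""
    deleted = 0
    kwargs = {"Bucket": bucket}
    while True:
        response = client.list_objects_v2(**kwargs)
        for obj in response.get("Contents", []):
            client.delete_object(Bucket=bucket, Key=obj["Key"])
            deleted += 1
        if not response.get("IsTruncated"):
            return deleted
        kwargs["ContinuationToken"] = response["NextContinuationToken"]


class FakeClient:
    """In-memory stand-in for an S3-style client that returns two keys per page."""

    def __init__(self, keys):
        self.keys = list(keys)

    def list_objects_v2(self, Bucket, ContinuationToken=None):
        # Simplification: the token is ignored and the next page is always
        # the first two keys that have not been deleted yet.
        page = self.keys[:2]
        truncated = len(self.keys) > 2
        response = {"IsTruncated": truncated}
        if page:
            response["Contents"] = [{"Key": key} for key in page]
        if truncated:
            response["NextContinuationToken"] = "next"
        return response

    def delete_object(self, Bucket, Key):
        self.keys.remove(Key)


fake = FakeClient(["a.pdf", "a.json", "b.pdf", "b.json", "c.pdf"])
count = delete_all_objects(fake, "my-bucket")
print(count, fake.keys)
```

With a real client, you would pass CloudObjectStorage_client and your bucket name in place of the fake client and the placeholder bucket.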
Summary and next steps
In this tutorial, you've learned how to use the TextExtractions manager to submit text extraction requests, monitor the job status, and download the resulting file. Check out our IBM Watson Machine Learning library for more samples, tutorials, documentation, how-tos, and blog posts.
Simplifying your business documents in this manner is particularly beneficial for retrieval-augmented generation (RAG) tasks, where the goal is to retrieve relevant information in response to a user query and provide it as input to a foundation model. Supplying accurate contextual data helps the model generate outputs that are more factual and up to date.