Tutorial
Extract text from documents using watsonx.ai TextExtraction API
Quickly make your documents searchable, accessible, and analyzable
Text extraction is essential for making documents searchable, accessible, and analyzable. It enables automation and seamless integration with business workflows by converting unstructured text into structured data.
Using the watsonx.ai TextExtraction API, you can efficiently extract text from PDFs and other highly structured documents and convert it to a file format that is easier to work with programmatically, such as Markdown or JSON.
To see supported file types and languages, see the IBM watsonx documentation.
To use the watsonx.ai TextExtraction API, you must store your documents or files in an IBM Cloud Object Storage bucket and then connect watsonx.ai to that bucket.
In this tutorial, you'll learn two approaches:
- User interface
- Python code
You will learn how to:
- Connect IBM Cloud Object Storage with watsonx.ai
- Upload and download files from the Cloud Object Storage bucket
- Extract text
Prerequisites
To complete this tutorial, you require an IBM Cloud account with the following service instances:
- Cloud Object Storage
- watsonx.ai Runtime
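If you plan to run the Python examples, you also need the SDK packages that the code imports. Assuming the standard PyPI package names for these imports, you can install them with:

pip install ibm-watsonx-ai ibm-cos-sdk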
Approach 1: User interface
In this section, you'll learn how to complete the following tasks:
- Upload files using the UI
- Connect watsonx.ai and Cloud Object Storage using the UI
- Extract text using Python code
- Download files using the UI
Step 1: Create a task credential
A task credential is an API key that is used to authenticate long-running jobs that are started by steps that you will complete in this section of the tutorial.
See Managing task credentials in the IBM watsonx documentation.
Step 2: Create a bucket and upload files
- From the IBM Cloud Resource list, click Storage and then click to open your IBM Cloud Object Storage (COS) service instance.
- Click Create Bucket. You can create any type of bucket according to your requirements, but for an easy start, click Create a Custom Bucket.
- Give your bucket a unique name, keep all the default configurations, and click Create Bucket.
- When the bucket is created successfully, from the Objects tab of the bucket, click Upload or drag and drop to upload your files. When the upload is successfully completed, you will see a "success" message.
Step 3: Create an IBM Cloud Object Storage connection with watsonx.ai
- In the watsonx platform, create a new project or use an existing one.
- From your project, click New Asset > Connect to a data source. Choose IBM Cloud Object Storage and click Next.
Complete all the required fields in the form.
- Integrated instance [Optional]: Select a service instance to automatically fill in the connection form.
- Name [required]: Provide your choice of connection name.
- Bucket: Use the bucket name that you created in step 2.
- Login URL [required]: To get the login URL, open the storage bucket in IBM Cloud Object Storage, click the Configurations tab, and then copy the public endpoint URL.
- Credential setting: Shared [recommended]
- Authentication method [required]: Set the authentication method to Access Key and Secret Key. From the main service page for your IBM Cloud Object Storage service instance, click the Service credentials tab, find the key that you want to use, and expand it to see the key details. The key must include an HMAC credential; if you don't have an existing key, create one with an HMAC credential. Copy the following values into the corresponding fields on the connection asset details page:
- Access key - access_key_id
- Secret key - secret_access_key
When you have provided all details, click Test Connection. If you have successfully connected, you will see the message: "The test was successful. Click Create to save the connection information."
- Click Create. (You can ignore any warning messages.)
- Important: From the Assets page of your project, open this connection asset. In the URL of the current page, copy the connection ID that appears after the /connections/ segment. You will use this ID later in the tutorial.
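For example, in a project URL of roughly the following shape (the exact host and query parameters can vary), the connection ID is the <connection-id> segment:

https://dataplatform.cloud.ibm.com/connections/<connection-id>?project_id=<project-id>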
Note: If the test connection fails, try creating another connection with only the connection name and save the connection by clicking Create. You can then edit the connection by adding the login URL and authentication method and test again.
Step 4: Extract text
Import all the required Python packages:

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.helpers import DataConnection, S3Location
import ibm_boto3
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames
import time

Initialize the watsonx.ai client:

credentials = Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<your_cloud_env_api_key>")
project_id = "<your_wx.ai_project_id>"
wx_client = APIClient(credentials=credentials, project_id=project_id)

Initialize connection_asset_id and bucketname:

connection_asset_id = "<your_connection_asset_id>"
bucketname = "<your_bucket_name>"

Initialize the Cloud Object Storage client. Note: You can get the access_key_id and secret_access_key values from the HMAC credential that you used in step 3, above:

CloudObjectStorage_client = ibm_boto3.client(
    service_name='s3',
    aws_access_key_id='<access_key_id>',
    aws_secret_access_key='<secret_access_key>',
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud'
)

Run the text extraction code:

response = CloudObjectStorage_client.list_objects_v2(Bucket=bucketname)
if "Contents" in response:
    for obj in response["Contents"]:
        print(obj["Key"])
        local_source_file_name = obj["Key"]
        source_file_name = local_source_file_name
        results_file_name = source_file_name.replace("pdf", "json")
        # create a data connection object for the source PDF file
        document_reference = DataConnection(connection_asset_id=connection_asset_id,
                                            location=S3Location(bucket=bucketname, path=source_file_name))
        # create a data connection object for the results file
        results_reference = DataConnection(connection_asset_id=connection_asset_id,
                                           location=S3Location(bucket=bucketname, path=results_file_name))
        extraction = TextExtractions(api_client=wx_client, project_id=project_id)
        steps = {TextExtractionsMetaNames.OCR: {'languages_list': ['en']},
                 TextExtractionsMetaNames.TABLE_PROCESSING: {'enabled': True}}
        # run the text extraction job
        extraction.run_job(document_reference=document_reference,
                           results_reference=results_reference,
                           steps=steps)
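Note that run_job starts each extraction asynchronously; it does not block until the job finishes. If you want to wait for a job before moving on, capture the job ID from run_job's return value and poll its status, as in this minimal sketch (the same get_id and get_job_details calls are used in Approach 2, below):

details = extraction.run_job(document_reference=document_reference,
                             results_reference=results_reference,
                             steps=steps)
extraction_job_id = extraction.get_id(extraction_details=details)
while True:
    # check the job status every 5 seconds until it completes or fails
    status = extraction.get_job_details(extraction_id=extraction_job_id)['entity']['results']['status']
    print(status)
    if status in ("completed", "failed"):
        break
    time.sleep(5)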
Step 5: Download extracted files
When the text extraction is complete, you can download .json files with extracted text from the bucket.
Note: To download the files locally, you can run the code in the text_extraction function in Approach 2, below.
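Alternatively, here is a minimal sketch that downloads every .json result file directly, reusing the CloudObjectStorage_client initialized in step 4 (saving each file locally under its object key is an assumption for illustration):

response = CloudObjectStorage_client.list_objects_v2(Bucket=bucketname)
for obj in response.get("Contents", []):
    key = obj["Key"]
    if key.endswith(".json"):
        # download each extracted result, keeping the object key as the local file name
        CloudObjectStorage_client.download_file(Bucket=bucketname, Key=key, Filename=key)
        print("downloaded", key)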
Approach 2: Python code
In this section, you will complete the following tasks:
- Connect watsonx.ai and Cloud Object Storage using Python code
- Upload multiple files from your local system to your Cloud Object Storage bucket
- Extract text using Python code
- Download extracted text files from the bucket to your local system
Step 1: Create a task credential
A task credential is an API key that is used to authenticate long-running jobs that are started by steps that you will complete in this section of the tutorial.
See Managing task credentials in the IBM watsonx documentation.
Step 2: Create a Cloud Object Storage bucket
Create a bucket using a Cloud Object Storage service instance, as detailed in Step 2: Create a bucket and upload files, above.
Step 3: Create a Cloud Object Storage connection with watsonx.ai
Import all necessary Python packages:
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.helpers import DataConnection, S3Location
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai.foundation_models.extractions import TextExtractions
from ibm_watsonx_ai.metanames import TextExtractionsMetaNames
import time
import os
import ibm_boto3

Initialize the watsonx.ai client:

credentials = Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<your_cloud_env_api_key>")
project_id = "<your_wx.ai_project_id>"
wx_client = APIClient(credentials=credentials, project_id=project_id)

Initialize datasource_name and bucketname:

datasource_name = 'bluemixcloudobjectstorage'
bucketname = "<your_bucket_name>"

Initialize the Cloud Object Storage credentials:

cos_credentials = {
    "endpoint_url": "<endpoint url>",
    "apikey": "<apikey>",
    "access_key_id": "<access_key_id>",
    "secret_access_key": "<secret_access_key>"
}

Initialize the connection properties:

conn_meta_props = {
    wx_client.connections.ConfigurationMetaNames.NAME: f"Connection to Database - {datasource_name}",
    wx_client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: wx_client.connections.get_datasource_type_id_by_name(datasource_name),
    wx_client.connections.ConfigurationMetaNames.DESCRIPTION: "Connection to external Database",
    wx_client.connections.ConfigurationMetaNames.PROPERTIES: {
        'bucket': bucketname,
        'access_key': cos_credentials['access_key_id'],
        'secret_key': cos_credentials['secret_access_key'],
        'iam_url': 'https://iam.cloud.ibm.com/identity/token',
        'url': cos_credentials['endpoint_url']
    }
}

Create the connection:

conn_details = wx_client.connections.create(meta_props=conn_meta_props)
connection_asset_id = wx_client.connections.get_id(conn_details)
print(connection_asset_id)

Note: Print and save this connection_asset_id to use in a later step.
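If you lose the ID, you can list the existing connection assets in the project with the same connections API used above:

wx_client.connections.list()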
Step 4: Extract text
In this approach, you choose files from your local system and upload them to the Cloud Object Storage bucket for text extraction. When the extraction job finishes, you download the extracted text as a Markdown file to your local system. (Note: This example extracts text from PDF files.)
def text_extraction(file_paths, extraction, steps, results_path):
for path in file_paths:
        local_source_file_path = str(path)  # convert the Path object to a string
        source_file_name = local_source_file_path.split("/")[-1]
        if source_file_name == '.DS_Store':  # skip macOS metadata files
            continue
results_file_name = "text_extracted_" + source_file_name.replace("pdf", "json")
connection_asset_id = "<your_connection_asset_id>"
remote_document_reference = DataConnection(connection_asset_id=connection_asset_id, # object for remote data connection to bucket
location=S3Location(bucket = bucketname, path = "."))
remote_document_reference.set_client(wx_client)
try:
remote_document_reference.write(data=local_source_file_path, remote_name=source_file_name) # uploading file to bucket
document_reference = DataConnection(connection_asset_id=connection_asset_id, # creating object for that pdf file
location=S3Location(bucket=bucketname,
path=source_file_name))
results_reference = DataConnection(connection_asset_id=connection_asset_id, # creating object for resultant file
location=S3Location(bucket=bucketname,
path=results_file_name))
# for each file extraction job is created
details = extraction.run_job(document_reference=document_reference,
results_reference=results_reference,
steps=steps,
results_format="markdown")
extraction_job_id = extraction.get_id(extraction_details=details)
print("\n" + source_file_name + " - " + extraction_job_id)
            # Poll the job status every 5 seconds while the extraction runs.
            # Large files take longer to upload to the bucket and then extract, so keep checking.
            # If the job completes, the code continues to the download step;
            # if the job fails, the result file is not downloaded.
while True:
status_json = extraction.get_job_details(extraction_id=extraction_job_id)
status = status_json['entity']['results']['status']
print(status)
if status=="failed":
print(status_json)
break
if status!="completed":
time.sleep(5)
else:
break
if status=="completed":
results_reference = extraction.get_results_reference(extraction_id=extraction_job_id)
filename = source_file_name.replace("pdf", "md")
                results_reference.download(results_path + "/" + filename)  # download the result file with the extracted text to your local system
print("saved as " + filename)
        except Exception as e:
            print("error:", e)
return "done"
# input pdfs folder path and output results path
input_folder_path = "./input" # your input files folder path
results_path = "./output" # folder in which you want to save the text extracted files
from pathlib import Path
file_paths = [file for file in Path(input_folder_path).iterdir() if file.is_file()]
# calling text_extraction function
extraction = TextExtractions(api_client=wx_client, project_id=project_id)
steps = {TextExtractionsMetaNames.OCR: {'languages_list': ['en']},
TextExtractionsMetaNames.TABLE_PROCESSING: {'enabled': True}}
text_extraction(file_paths, extraction, steps, results_path)
Step 5: Clean the Cloud Object Storage bucket
Note: The following step is optional.
When the text extraction is complete, the uploaded source files and the result files remain in the bucket. If you want to clean up the bucket after the extraction, run the following code:
# Initialize the Cloud Object Storage client
CloudObjectStorage_client = ibm_boto3.client(
service_name='s3',
aws_access_key_id=os.getenv('ACCESS_KEY_ID'),
aws_secret_access_key=os.getenv('SECRET_ACCESS_KEY'),
endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud'
)
# delete all objects in bucket
def delete_bucket_objects(bucketname):
try:
# List all objects in the bucket
response = CloudObjectStorage_client.list_objects_v2(Bucket=bucketname)
if 'Contents' in response:
for obj in response['Contents']:
print(f"Deleting {obj['Key']}...")
CloudObjectStorage_client.delete_object(Bucket=bucketname, Key=obj['Key'])
print("All objects deleted successfully.")
else:
print("Bucket is already empty.")
except Exception as e:
print(f"Error: {e}")
# calling the delete_bucket_objects function
delete_bucket_objects(bucketname)
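If you also want to remove the now-empty bucket itself, the COS client exposes a delete_bucket operation; a one-line sketch:

# Optional: delete the bucket after all of its objects have been removed
CloudObjectStorage_client.delete_bucket(Bucket=bucketname)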
Summary and next steps
In this tutorial, you've learned how to use the TextExtractions Manager to submit text extraction requests, monitor the job status, and download the resulting file. Check out our IBM Watson Machine Learning library for more samples, tutorials, documentation, how-tos, and blog posts.
Simplifying your business documents in this manner is particularly beneficial for retrieval-augmented generation (RAG) tasks, where the goal is to retrieve relevant information in response to a user query and provide it as input to a foundation model. Supplying accurate contextual data helps the model generate outputs that are more factual and up to date.
To continue building your knowledge and skills, see Retrieval-augmented generation in the IBM watsonx documentation.