Tutorial
Test data ingestion to watsonx.data using local Spark
Quickly set up a local Spark engine to test and refine your data ingestion
Apache Spark excels in handling large-scale data processing and seamlessly integrates with various frameworks. IBM watsonx.data offers an open, hybrid, and governed data store with its data lakehouse architecture, enabling storage and retrieval of enterprise data efficiently. By combining Apache Spark's processing capabilities with watsonx.data's architecture, organizations can scale and process massive data sets, empowering them to derive actionable insights with enhanced effectiveness and efficiency.
There are three main ways to configure a Spark engine for parallel processing. The first two are managed service plans from IBM: you pay for the service while you use it, and you receive full support for any issues the services might have.
- Use the included Spark runtime environment notebooks from IBM Watson Studio.
- Use a serverless IBM Analytics Engine instance, which offers consumption-based pricing. The Analytics Engine handles setting up and tearing down Spark clusters so that they're ready when you need them, which saves time and money by using resources only when necessary.
- Use your own local Spark setup that you manage. You can use the local setup to test whether it meets your needs. This approach is what's covered in this tutorial.
The tutorial explains how to quickly set up a Spark engine locally so that you can try it out on your own system and dive into the lakehouse world. You can download a sample Python file to use with the tutorial.
Note: In this tutorial, the local Spark installation is done on macOS. In your installation, the truststore (cacerts) path should be based on your local machine's OpenJDK path.
Prerequisites
To follow this tutorial, you need:
- An IBM watsonx.data instance
- A catalog with an IBM Cloud Object Storage bucket
- IBM Cloud Object Storage bucket credentials
- A Hive Metastore (HMS) URL from watsonx.data
- HMS credentials from watsonx.data
Estimated time
It should take you approximately 30 minutes to complete this tutorial.
Steps
Step 1. Install Spark on your local machine
Download Spark to your local machine, selecting the version that you want to use.

After you download and extract Spark, you can see the jars folder in the Spark installation path.

Because Spark applications are used for parallel processing, you can use Spark to perform all of the compute-intensive tasks in watsonx.data. To connect your Spark engine with watsonx.data, you must download all of the required JAR files and place them in the jars folder shown previously so that Spark can connect to the watsonx.data Hive Metastore. The Hive exec, common, and metastore JAR files are modified JAR files that work with watsonx.data. Run the following commands to download the Hive Metastore, aws-java-sdk-bundle, and hadoop-aws JAR files to your system.
wget https://github.com/IBM-Cloud/IBM-Analytics-Engine/raw/master/wxd-connectors/hms-connector/hive-exec-2.3.9-core.jar
wget https://github.com/IBM-Cloud/IBM-Analytics-Engine/raw/master/wxd-connectors/hms-connector/hive-common-2.3.9.jar
wget https://github.com/IBM-Cloud/IBM-Analytics-Engine/raw/master/wxd-connectors/hms-connector/hive-metastore-2.3.9.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.648/aws-java-sdk-bundle-1.12.648.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
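To confirm that the JAR files landed in the right place, you can list the contents of the jars folder. The following is a minimal sketch, assuming the SPARK_HOME environment variable points to your Spark installation:

import os

# Assumes SPARK_HOME points to your Spark installation directory
jars_dir = os.path.join(os.environ["SPARK_HOME"], "jars")

# Print the Hive, AWS SDK, and Hadoop-AWS JARs downloaded above
for name in sorted(os.listdir(jars_dir)):
    if any(key in name for key in ("hive-", "aws-java-sdk", "hadoop-aws")):
        print(name)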
After you have all of the JAR files in the jars folder, you can look at the parameters that the Python file needs to submit your data to the lakehouse.
Step 2. Configure IBM Cloud Object Storage bucket
Watsonx.data stores its data and metadata in an object storage bucket. This tutorial uses IBM Cloud Object Storage; if your data resides in an Amazon S3 bucket, you must provide the credentials for that bucket instead.
from pyspark.sql import SparkSession

def init_spark():
    # Configure one Spark session with credentials for both the
    # watsonx.data lakehouse bucket and the source data bucket
    spark = SparkSession.builder \
        .appName("lh-hms-cloud") \
        .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.endpoint", "s3.direct.us-south.cloud-object-storage.appdomain.cloud") \
        .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.access.key", "<lakehouse-bucket-access-key>") \
        .config("spark.hadoop.fs.s3a.bucket.lakehouse-bucket.secret.key", "<lakehouse-bucket-secret-key>") \
        .config("spark.hadoop.fs.s3a.bucket.source-bucket.endpoint", "s3.direct.us-south.cloud-object-storage.appdomain.cloud") \
        .config("spark.hadoop.fs.s3a.bucket.source-bucket.access.key", "<source-bucket-access-key>") \
        .config("spark.hadoop.fs.s3a.bucket.source-bucket.secret.key", "<source-bucket-secret-key>") \
        .enableHiveSupport() \
        .getOrCreate()
    return spark
In your Python code:

- Initialize Spark with the bucket endpoints, access keys, and secret keys. See Getting endpoints for your IBM Cloud Object Storage bucket.
- Create the HMAC credentials for your source and target IBM Cloud Object Storage buckets by following the steps in the IBM Cloud Object Storage documentation.
- After you've initialized Spark, check whether the connection is established by creating a database.
def create_database(spark):
    # Create a database in the lakehouse catalog
    spark.sql("create database if not exists lakehouse.localsparkdb LOCATION 's3a://localsparkbucket/'")
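With the connection verified, you can complete the ingestion flow by reading data from the source bucket and writing it into an Iceberg table in the lakehouse catalog. The following is a minimal sketch that could form the body of the localspark-lakehouse.py file submitted in the next step; the CSV file name (data.csv) and table name (sample_table) are hypothetical placeholders for your own data:

def ingest_from_source(spark):
    # Read a sample CSV file from the source bucket (hypothetical file name)
    df = spark.read.option("header", "true").csv("s3a://source-bucket/data.csv")
    # Write the data into an Iceberg table in the lakehouse catalog
    df.writeTo("lakehouse.localsparkdb.sample_table").createOrReplace()

if __name__ == "__main__":
    spark = init_spark()
    create_database(spark)
    ingest_from_source(spark)
    spark.stop()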
Step 3. Use spark-submit to run the final application file
Run the spark-submit script. The following code shows an example.

spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.driver.extraClassPath=/Users/poojaholkar/GSI/SPARKINSTALL/hive-common-2.3.9.jar:/Users/poojaholkar/GSI/SPARKINSTALL/hive-metastore-2.3.9.jar:/Users/poojaholkar/GSI/SPARKINSTALL/hive-exec-2.3.9-core.jar \
  --conf spark.executor.extraClassPath=/Users/poojaholkar/GSI/SPARKINSTALL/hive-common-2.3.9.jar:/Users/poojaholkar/GSI/SPARKINSTALL/hive-metastore-2.3.9.jar:/Users/poojaholkar/GSI/SPARKINSTALL/hive-exec-2.3.9-core.jar \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.iceberg.vectorization.enabled=false \
  --conf spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.lakehouse.type=hive \
  --conf spark.hive.metastore.uris=<thriftmetastoreurl> \
  --conf spark.hive.metastore.client.auth.mode=PLAIN \
  --conf spark.hive.metastore.client.plain.username=ibmlhapikey \
  --conf spark.hive.metastore.client.plain.password=<password> \
  --conf spark.hive.metastore.use.SSL=true \
  --conf spark.hive.metastore.truststore.type=JKS \
  --conf spark.hive.metastore.truststore.path=file:/Library/Java/JavaVirtualMachines/openlogic-openjdk-11.jre/Contents/Home/lib/security/cacerts \
  --conf spark.hive.metastore.truststore.password=changeit \
  localspark-lakehouse.py

You get the metastore.uris and metastore.client.plain.password parameters from the prerequisites steps. The metastore.truststore.path is the location of the cacerts file of your local Java installation, which varies between systems. After you've run the script, you should see the database in your watsonx.data Data manager.
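You can also verify from the Spark session itself that the database was registered. A quick check, assuming the lakehouse catalog name configured in the spark-submit command above:

def list_lakehouse_databases(spark):
    # List the namespaces (databases) registered in the lakehouse catalog
    spark.sql("show namespaces in lakehouse").show()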

To see what your Python code pushes into watsonx.data from the console, you must associate the bucket in watsonx.data. Add a bucket-catalog pair to register your bucket and see the data and metadata files that are created. Then, you can view the database that was created under the associated catalog in the Data manager.
Summary
In this tutorial, you set up a Spark engine locally from the ground up, a crucial step that lets you test data ingestion on your own system. The hands-on experience helps you grasp the details of lakehouse technology more effectively and explore its potential applications within your specific context.
Explore more articles and tutorials about watsonx on IBM Developer. You can also start your free watsonx.data trial.