Apache Spark - IBM Developer
IBM Developer

Apache Spark

An open-source analytics engine for large-scale data analytics

Simplify the challenging and computationally intensive task of processing high volumes of real-time data.

20 January 2020

Tutorial

Getting started with PySpark

This tutorial covers big data processing with PySpark, the Python package for Spark programming. It explains SparkContext through the map and filter methods with Python lambda functions. It then creates RDDs from objects and external files, applies transformations and actions to RDDs and pair RDDs, and introduces SparkSession and PySpark DataFrames built from RDDs and external files. It also runs SQL queries against DataFrames by using the Spark SQL module, and it finishes with machine learning using the PySpark MLlib library.
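The map-and-filter-with-lambda pattern that the tutorial describes can be sketched in plain Python. This is only an illustrative stand-in, not code from the tutorial: the sample list and variable names are assumptions, and the sketch uses Python's built-in `map` and `filter` so that it runs without a Spark installation. In PySpark the same lambdas would typically be passed to `map` and `filter` on an RDD created with `sc.parallelize`.

```python
# Plain-Python stand-in for the RDD pattern; no Spark installation is needed.
# The data and names here are hypothetical, chosen only for illustration.
numbers = [1, 2, 3, 4, 5, 6]

# Transformation 1: square each element.
# PySpark analogue: rdd.map(lambda x: x * x)
squares = map(lambda x: x * x, numbers)

# Transformation 2: keep only the even squares.
# PySpark analogue: rdd.filter(lambda x: x % 2 == 0)
even_squares = filter(lambda x: x % 2 == 0, squares)

# Built-in map and filter are lazy, much as RDD transformations are.
# list() forces evaluation, playing the role of an action such as rdd.collect().
result = list(even_squares)
print(result)
```

Nothing is computed until `list()` is called, which loosely mirrors how Spark defers work until an action runs.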
