Anatomy of Apache Spark Application

#spark #data-engineering

If you've spent any time wrangling data at scale, you've probably heard of Apache Spark. Maybe you've even cursed at it once or twice — don't worry, you're in good company. Spark has become the go-to framework for big data processing, and for good reason: it's fast, versatile, and (once you get the hang of it) surprisingly elegant. But mastering it? That's a whole other story.

Spark is packed with features and an architecture that feels simple on the surface but gets deep real quick. If you've ever struggled with runaway stages, weird partitioning issues, or mysterious memory errors, you know exactly what I mean.

That's why I put together this series: to help you get past the basics and into the real nuts and bolts of how Spark works — and how to make it work for you.

Whole series:

Architecture Essentials

A Spark application consists of several critical components, each playing a unique role in distributed data processing:

Driver: The central coordinator managing tasks and their execution.
Application Master: Negotiates resources and monitors tasks.
Spark Context: The entry point for Spark's functionality, managing cluster interactions.
Cluster Resource Manager(aka Cluster Manager or Resource Manager): Allocates resources to executors and tasks.
Executors: Perform distributed computations and store intermediate results.

At its core, Spark operates on a master/slave architecture, with the Driver acting as the master node that directs the operations of executors distributed across the cluster. This architecture is designed to handle massive datasets efficiently, leveraging parallelism and data locality to optimize performance.

Spark yarn architecture

Each component plays an important role in transforming high-level user code into a series of tasks that are executed on distributed nodes. Let's dive deeper into these components to understand their roles.

Spark Driver

The Driver is the central coordinator of a Spark application. Its primary job is to convert user code into smaller execution units - tasks - and schedule them across executors via the cluster manager. The Driver orchestrates the execution process and provides users with results or status updates.

The Driver contains several core components that work together to execute a Spark job:

DAGScheduler: Breaks down the logical DAG (Directed Acyclic Graph) into physical stages, optimizing the workflow for parallel execution.
TaskScheduler: Schedules individual tasks within each stage, assigning them to executors based on data locality and available resources.
BackendScheduler: Communicates with the cluster manager (like YARN, Mesos, Kubernetes) to manage resource allocation.
BlockManager: Manages storage and retrieval of intermediate RDDs, ensuring efficient data sharing between tasks and executors.

The Driver's responsibilities extend beyond task scheduling. It maintains detailed metadata about RDDs, partitions, and stages, optimizing the execution process by combining narrow transformations into single stages and creating physical execution plans from logical DAGs. Additionally, it generates a Spark Web UI, giving users real-time insights into the job's progress, including resource utilization, task breakdowns, and execution times.

Running in its own JVM, the Driver can be hosted on a dedicated node for fault tolerance or co-located on a worker node for high availability. Regardless of its location, the Driver remains the brain of a Spark application, ensuring smooth and efficient execution.

Application Master

The Application Master is a framework-specific process that manages resources and orchestrates application tasks in a distributed environment. It acts as the bridge between the Spark Driver and the Resource Manager, ensuring that the application gets the resources it needs to execute efficiently. Every application running on a cluster has its own dedicated Application Master instance.

In cluster mode, the Application Master and the Driver are launched together on the same node when a Spark application is submitted. The Driver communicates its requirements, such as the number of executors and their resource specifications, to the Application Master. The Application Master then negotiates with the Resource Manager to secure these resources. Once the resources are allocated, the Application Master collaborates with Node Managers to deploy executors and monitor their progress throughout the application's lifecycle.

For setups where the application does not rely on external resource managers (in standalone mode or local setups), the Spark Master can assume the responsibilities of a Cluster Manager, bypassing the need for an Application Master. While simpler, this configuration is primarily suited for smaller or offline workloads.

Spark Context

The Spark Context serves as the primary entry point for all Spark operations, acting as a bridge between the Spark Driver and the cluster's resources. Through the Spark Context, the Driver communicates with the Resource Manager, requests executors, and coordinates the execution of distributed computations. It's also the mechanism for creating RDDs, managing shared variables like accumulators and broadcast variables, and tracking the status of executors via regular heartbeat messages.

Every Spark application operates with its own dedicated Spark Context, instantiated by the Driver when the application is submitted. This context remains active throughout the application's lifecycle, serving as the glue that holds together the distributed components. Once the application completes, the context is terminated, releasing the associated resources.

A critical limitation of Spark Context is that only one active context is allowed per JVM. If you need to initialize a new Spark Context, you must explicitly call stop() on the existing one. This design ensures efficient resource management and prevents conflicts within the application.

Cluster Resource Manager

In a distributed Spark application, the Resource Manager is the central authority responsible for managing compute resources across the cluster. It reserves these resources in the form of containers at the request of the Application Master, which acts as the intermediary between the Driver and the cluster. Once the resources are allocated, the Application Master coordinates with Node Managers to deploy the containers. These containers are then handed over to the Spark Driver, which orchestrates the execution of tasks across the cluster.

The Spark Context acts as the interface between the Driver and the cluster's resource management system. It can seamlessly connect to a variety of Resource Managers, including industry-standard options like YARN, Mesos, and Kubernetes. Spark also supports Nomad for flexible orchestration and includes its own standalone cluster manager for simpler setups.

Executors

Executors are the backbone of a Spark application, running on worker nodes to execute tasks assigned by the Driver. These processes handle the actual data processing and data storage, ensuring distributed processing is efficient and scalable.

By default, executors are allocated statically, meaning their number remains fixed for the duration of the application. However, Spark also supports dynamic allocation, where executors can be added or removed to adapt to workload changes. While this flexibility can optimize resource utilization, it may affect other running application and introduce contention with other applications sharing the same cluster.

In addition to task execution, executors manage data storage. They use the BlockManager to store intermediate RDD data, which can be cached in-memory for quick access or spilled to disk when memory is insufficient (e.g., using localCheckpoint).

Spark Application Running Steps

To understand how a Spark application runs, let's break down the steps involved, using the example of a simple word count application.

from pyspark.sql import SparkSession

# Initialize Spark session
conf = SparkConf().setAppName(appName).setMaster(master) 
sc = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .config(conf=conf)\
        .getOrCreate()

# Read data from HDFS, creating an RDD of lines
linesRDD = sc.textFile("hdfs://...")

# Transform lines into lists of words
wordsRDD = linesRDD.flatMap(lambda line: line.split(" "))

# Map words to (word, 1) tuples
wordCountRDD = wordsRDD.map(lambda word: (word, 1))

# Combine elements with the same word key
resultRDD = wordCountRDD.reduceByKey(lambda a, b: a + b)

# Save results back to HDFS
resultRDD.saveAsTextFile("hdfs://...")
spark.stop()

Step-by-Step Execution

Submitting the Application: When the spark-submit utility is used in cluster mode, it communicates with the Cluster Resource Manager to initialize the Application Master for the job.
Application Master Initialization: The Resource Manager selects an available container and instructs a Node Manager to launch the Application Master in that container.
Application Master Registration: The Application Master registers itself with the Resource Manager, allowing the Spark client to communicate directly with it. This setup allows the client to monitor the application and retrieve status updates.
Driver Initialization: In cluster mode, the Spark Driver runs in the same container as the Application Master. It interprets the user code and translates the transformations and actions into a logical execution plan — a Directed Acyclic Graph (DAG). RDDs are created at this stage but remain inactive until an action is triggered.
Logical Plan Optimization: The Driver optimizes the logical DAG by combining narrow transformations into single stages and determining data locality for efficient execution.

Spark word count example

Physical Plan Generation: The optimized DAG is converted into a physical execution plan, where the Driver breaks down stages into tasks, each associated with a data partition.
Executor Launch: The Application Master negotiates with the Cluster Resource Manager for additional containers. These containers host the executors, which register with the Driver, enabling the Driver to orchestrate task execution.
Task Scheduling: The Driver assigns tasks to executors, ensuring that tasks are as close as possible to the data they need to process, minimizing network overhead.
Parallel Data Loading: The first RDD reads data from HDFS into partitions distributed across the cluster. Each partition is processed in parallel by the executors.
Transformation Execution: The first map transformation splits lines into words, followed by another map that converts words into (word, 1) tuples. These transformations are narrow, meaning they don't require data shuffling and can execute within a single stage.
Wide Transformation: The reduceByKey operation groups and aggregates data by key. This is a wide transformation that requires shuffling, triggering an additional stage. Data is redistributed across partitions to group records with the same key.
Action Execution: The saveAsTextFile action writes the final results back to HDFS. This triggers the full DAG execution, and all transformations are processed to produce the output.
Application Monitoring: During execution, the client can communicate with the Application Master to track progress, view metrics, and check the status of tasks.
Completion and Cleanup: Once the application finishes, the Application Master deregisters itself from the Resource Manager, releases its resources, and shuts down. The executors are also terminated, freeing their containers for other applications.

This workflow demonstrates how Spark coordinates between its components to distribute and execute tasks efficiently, from logical planning to physical execution and resource cleanup.