If you've spent any time wrangling data at scale, you've probably heard of Apache Spark. Maybe you've even cursed at it once or twice — don't worry, you're in good company. Spark has become the go-to framework for big data processing, and for good reason: it's fast, versatile, and (once you get the hang of it) surprisingly elegant. But mastering it? That's a whole other story.
Spark is packed with features and an architecture that feels simple on the surface but gets deep real quick. If you've ever struggled with runaway stages, weird partitioning issues, or mysterious memory errors, you know exactly what I mean.
That's why I put together this series: to help you get past the basics and into the real nuts and bolts of how Spark works — and how to make it work for you.
Whole series:
- Hadoop and Yarn for Apache Spark Developer
- Spark Core Concepts Explained
- Anatomy of Apache Spark Application
- Spark Partitions
- Deep Dive into Spark Memory Management
- Explaining the mechanics of Spark caching
Apache Spark Key Abstractions
The Spark architecture relies on two key abstractions:
- Resilient Distributed Dataset (RDD)
- Directed Acyclic Graph (DAG)
These concepts form the backbone of Spark's functionality and are what make it so powerful. Let's break them down.
RDD — The Core of Spark
At the heart of Apache Spark lies the RDD — Resilient Distributed Dataset. Think of it as a distributed collection of objects, spread across cluster nodes for parallel processing. RDDs are Spark's original abstraction and remain central to its design, even as newer APIs like DataFrames and Datasets have emerged.
Physically, RDDs are stored as objects in the JVM, with data referenced from external storage (like HDFS, Cassandra, or HBase), cached in memory, or materialized on disk. They come with built-in metadata that enables fault tolerance and efficient execution:
- Partitions — the chunks of data that make up the RDD, distributed across the nodes of the cluster. A partition is the smallest unit of data that can be processed by a single node.
- Dependencies — lineage information, a list of parent RDDs involved in the calculation, the so-called lineage graph. This lineage graph allows RDDs to be recomputed if needed.
- Computation — a function that computes a child RDD from its parent RDDs.
- Preferred Locations — where partitions are stored, enabling data-local execution.
- Partitioner — the logic for splitting data into partitions (a HashPartitioner by default).
RDDs are immutable, meaning you can't modify an RDD after it's created. Instead, any operation (like filtering or mapping) creates a new RDD. This immutability is key to Spark's fault tolerance: each RDD knows its origin and how it was created, thanks to the lineage graph. If a partition is lost, Spark can regenerate it by replaying the transformations on the original data source.
Example:
>>> rdd = sc.parallelize(range(20)) # create RDD
>>> rdd
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195
>>> rdd.collect() # collect data on driver and show
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
RDDs can be cached to optimize repeated computations. For example, if you process the same RDD several times in a pipeline, caching avoids replaying the entire lineage graph on each pass. You can also manually control partitioning to ensure balanced workloads. Smaller, more numerous partitions often improve performance by distributing data more evenly across executors, though partitions that are too small add scheduling overhead.
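As a quick illustration, here is a minimal caching sketch (cachedRDD is just an illustrative name): cache() only marks the RDD for in-memory storage, and the first action populates the cache.
>>> cachedRDD = rdd.map(lambda x: x * 2).cache() # mark the RDD for in-memory caching
>>> cachedRDD.count() # first action computes the partitions and stores them in memory
20
>>> cachedRDD.count() # later actions reuse the cached partitions instead of replaying the lineage
20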
Let's inspect the number of partitions and their contents:
>>> rdd.getNumPartitions() # get current number of partitions
4
>>> rdd.glom().collect() # collect data on driver based on partitions
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
Smaller, evenly distributed partitions lead to better parallelism, especially during shuffle-heavy operations like reduceByKey or groupBy.
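If the default split doesn't suit your workload, you can change it explicitly. The sketch below (variable names are illustrative) uses repartition(), which reshuffles the data into the requested number of partitions, and coalesce(), which merges partitions without a full shuffle:
>>> repartitionedRDD = rdd.repartition(8) # increase the number of partitions (triggers a shuffle)
>>> repartitionedRDD.getNumPartitions()
8
>>> coalescedRDD = rdd.coalesce(2) # reduce the number of partitions without a full shuffle
>>> coalescedRDD.getNumPartitions()
2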
All the interesting work in Spark happens through RDD operations, which generally take the following form: creating an RDD (for example, by reading a file from HDFS), transforming it (map, reduce, join, groupBy, aggregate), and doing something with the result (such as writing it back to HDFS).
RDDs support two types of operations, and all data processing comes down to sequences of them: transformations and actions.
Transformations
Transformations are the backbone of Spark's RDD processing. They allow you to create a new RDD by transforming elements from an existing one. Transformations are lazy — nothing is actually computed when you call them. Instead, Spark records the operations (via DAG, more on that later), deferring execution until an action is triggered.
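To see this laziness in action, consider a small sketch (illustrative only): a transformation that is guaranteed to fail raises no error when it is defined, because nothing runs until an action forces execution.
>>> brokenRDD = rdd.map(lambda x: x / 0) # no computation happens yet, so no error is raised
>>> brokenRDD.collect() # the action triggers execution, and only now does the division error surface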
There are two main types of transformations in Spark:
- Narrow transformations: These transformations, like map and filter, don't require data to move between partitions. Each partition can be processed independently, making these operations fast and highly parallelizable.
- Wide transformations: These transformations, like groupByKey and reduceByKey, require data shuffling — moving records across partitions to meet the transformation's requirements. Wide transformations introduce more complexity and often result in a new stage.
Narrow transformations operate entirely within a single partition, avoiding the overhead of data shuffling. For example:
# Filter elements greater than 10
>>> filteredRDD = rdd.filter(lambda x: x > 10)
>>> print(filteredRDD.toDebugString()) # Observe the execution graph
(4) PythonRDD[1] at RDD at PythonRDD.scala:53 []
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []
>>> filteredRDD.collect()
[11, 12, 13, 14, 15, 16, 17, 18, 19]
No shuffle occurs here; each partition processes its data independently. Spark can optimize and stack multiple narrow transformations into the same stage, making them highly efficient.
Wide transformations involve dependencies across multiple partitions, requiring Spark to redistribute data (shuffle). This is necessary for operations like grouping or aggregating records spread across partitions. For example:
>>> groupedRDD = filteredRDD.groupBy(lambda x: x % 2) # group data based on mod
>>> print(groupedRDD.toDebugString()) # two separate stages are created, because of the shuffle
(4) PythonRDD[6] at RDD at PythonRDD.scala:53 []
| MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:133 []
| ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 []
+-(4) PairwiseRDD[3] at groupBy at <ipython-input-5-a92aa13dcb83>:1 []
| PythonRDD[2] at groupBy at <ipython-input-5-a92aa13dcb83>:1 []
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []
Here, Spark reshuffles data across partitions, creating a new stage. Shuffling is expensive, as it involves network I/O, data serialization, and disk writes. Minimizing unnecessary shuffling is key to optimizing Spark jobs.
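One common way to keep shuffle costs down is to prefer aggregations that combine values inside each partition before shuffling. A minimal sketch (variable names are illustrative; output ordering may vary): groupByKey ships every record across the network, while reduceByKey pre-aggregates within each partition and shuffles far less data.
>>> pairs = filteredRDD.map(lambda x: (x % 2, x)) # build (key, value) pairs
>>> pairs.groupByKey().mapValues(sum).collect() # shuffles every record, then sums on the reduce side
[(0, 60), (1, 75)]
>>> pairs.reduceByKey(lambda a, b: a + b).collect() # combines within each partition first
[(0, 60), (1, 75)]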
Actions
While transformations define how data should be manipulated, actions trigger the actual computation. When you call an action, Spark materializes the result, executing the DAG and producing the final output: saving the data to disk, writing it to a database, or printing part of it to the console.
Common actions include:
- collect: Gathers all data to the driver (use cautiously with large datasets).
- reduce: Aggregates data using a specified function.
- count: Counts the number of elements in the RDD.
>>> filteredRDD.reduce(lambda a, b: a + b)
135
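count behaves the same way, forcing execution and returning the number of elements (filteredRDD holds the nine values from 11 to 19):
>>> filteredRDD.count()
9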
Actions are not lazy: they force Spark to process the pending transformations, execute the tasks, and return results.
Beyond RDDs: DataFrames and Datasets
While RDDs form the foundation of Spark, modern APIs like DataFrames and Datasets are preferred for handling structured data due to their optimization and simplicity. Here's a quick comparison:
- RDDs: Provide low-level control over distributed data but lack query optimizations. Best for unstructured data and fine-grained transformations.
- DataFrames: Represent data in a table-like format with named columns (like a SQL table). They leverage Spark's Catalyst Optimizer for efficient query planning and execution.
- Datasets: Offer the benefits of DataFrames with type safety, making them ideal for statically typed languages like Scala and Java.
In general, for structured data, DataFrames and Datasets are preferred over RDDs due to Spark's internal optimizations and simplified syntax. While RDDs give more control, DataFrames and Datasets reduce boilerplate code and improve performance for ETL and analytical tasks.
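As a rough illustration of the DataFrame API (a minimal sketch, assuming a SparkSession is available as spark, as it is in the PySpark shell), the same filter-and-aggregate logic from the RDD examples can be expressed declaratively and left to the Catalyst Optimizer to plan:
>>> df = spark.range(20) # DataFrame with a single column named `id`
>>> result = df.filter(df.id > 10).groupBy((df.id % 2).alias("parity")).sum("id")
>>> result.explain() # show the optimized physical plan produced by Catalyst
>>> result.show() # triggers execution, much like an RDD action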
Directed Acyclic Graph
One of the foundational differences between Spark and Hadoop MapReduce is Spark's use of the Directed Acyclic Graph (DAG) to manage task execution. The DAG is essentially Spark's blueprint for processing: a graph where nodes represent operations (like map, filter, reduce) and edges represent dependencies between these steps.
Once a Spark application is submitted, the DAG is what transforms your code into an optimized, executable plan, ensuring that data is processed efficiently and that redundant computations are minimized.
So, how does this DAG mechanism actually work, and why should you care? Let's break it down.
Building the DAG
When you write transformations in Spark (like map, filter, join), Spark doesn't immediately execute them. Instead, it records these transformations and their dependencies as a logical DAG, which represents your job's entire workflow. This plan shows Spark how data flows from one stage to the next, allowing it to:
- Optimize the Execution: Spark chains together operations that can be performed in a single stage (e.g., multiple map or filter transformations) to reduce per-operation overhead (see the short sketch after this list).
- Minimize Data Shuffling: By carefully analyzing dependencies, Spark reduces costly shuffle operations, one of the biggest performance bottlenecks in distributed systems.
- Recover from Failures: Each RDD tracks its lineage (its creation history), allowing Spark to recompute lost partitions without restarting the entire job.
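You can see this chaining directly in an RDD's debug string; here is a minimal sketch (the exact output depends on your Spark version):
>>> chainedRDD = rdd.map(lambda x: x * 2).filter(lambda x: x > 10).map(lambda x: x + 1)
>>> print(chainedRDD.toDebugString()) # the map/filter/map chain is pipelined into a single stage (no shuffle boundary)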
The logical DAG doesn't actually execute anything. It's just a plan, waiting for an action (like collect or saveAsTextFile) to kickstart execution. When an action is triggered, Spark translates the logical DAG into a physical execution plan.
DAGScheduler and Task Execution
Here's what happens when the DAG gets executed:
- Stage Creation: The DAGScheduler breaks the logical DAG into stages based on shuffle points. Stages consist of narrow transformations that don't require shuffling data between partitions. For instance, a map followed by a filter would belong to the same stage, while a reduceByKey (which requires shuffling) would start a new stage.
- Task Creation: Each stage is further subdivided into tasks, where each task processes a single partition of data. This fine granularity enables high parallelism and efficient cluster utilization.
- Data Locality Optimization: Spark tries to execute tasks on nodes where the data resides (or as close as possible), minimizing costly network I/O.
The TaskScheduler then takes over, assigning tasks to cluster nodes (executors) via a cluster manager like Yarn, Kubernetes, or Mesos. While the DAGScheduler focuses on planning and dependencies, the TaskScheduler handles task execution, delegation, and monitoring.
Benefits of DAG-Driven Execution
The DAG architecture is the backbone of Spark's performance and reliability, providing several key benefits:
- Fault Tolerance: Each RDD stores lineage information, so Spark can recompute only the lost partitions instead of re-executing the entire job. This granularity ensures minimal disruption during failures.
- Resource Efficiency: By chaining narrow transformations and optimizing shuffles, Spark uses resources only when needed, reducing the overhead of unnecessary computations.
- Optimized Workflows: The DAG allows Spark to analyze your job holistically, identifying opportunities to consolidate operations and minimize redundant tasks.
The DAG-based execution model is a big reason why Spark can handle complex workloads across massive datasets while minimizing the overhead typically associated with distributed processing. Without the DAG, Spark would lose much of its flexibility, speed, and fault tolerance.
Additional materials
- Spark: The Definitive Guide by Bill Chambers, Matei Zaharia
- Learning Spark by Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee