Mastering Apache Spark

From core concepts to production tuning: everything you need to understand and optimize Spark.

16 posts · 153 min total read

Foundations

Apache Spark Core Concepts Explained

Let's take a deep dive into Apache Spark's core abstractions

Why Is the Apache Spark RDD Immutable?

Let's discuss Apache Spark's RDD immutability, a key to its efficiency in distributed data processing

Anatomy of an Apache Spark Application

Apache Spark job autopsy

Cluster Managers for Apache Spark: from YARN to Kubernetes

A deep dive into the machinery that orchestrates Spark
Internals

Spark Partitions

How partitioning shapes Spark performance, and what to do when it doesn't

Deep Dive into Spark Memory Management

The real reason your Spark cluster is burning money
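
The money is usually lost in how the executor heap is carved up. As a minimal sketch, here is the arithmetic behind Spark's unified memory model, using the documented defaults (`spark.memory.fraction=0.6`, `spark.memory.storageFraction=0.5`, a fixed ~300 MB reserved region); the 4 GB heap is just an assumed example figure:

```python
# Sketch of Spark's unified memory sizing using the documented defaults:
# spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5,
# and a fixed ~300 MB reserved region. The 4 GB heap is an example value.

RESERVED_MB = 300

def unified_memory_mb(executor_heap_mb,
                      memory_fraction=0.6,
                      storage_fraction=0.5):
    """Return (unified execution+storage pool, protected storage share) in MB."""
    usable = executor_heap_mb - RESERVED_MB
    unified = usable * memory_fraction    # shared execution + storage pool
    storage = unified * storage_fraction  # portion cached blocks are protected in
    return unified, storage

unified, storage = unified_memory_mb(4096)
print(f"unified pool: {unified:.0f} MB, protected storage: {storage:.0f} MB")
# unified pool: 2278 MB, protected storage: 1139 MB
```

So of a 4 GB executor, only about 2.2 GB is available to execution and caching combined, which is why "just give it more memory" rarely fixes spills by itself.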

Spark Caching Explained: What Really Happens Under the Hood

You called .cache(). Spark said "maybe". Let's talk about what actually happens.

Spark History Server Deep Dive

To tune performance and debug finished Spark jobs, engineers need information, and the History Server provides it from recorded Spark events
Spark Tips

Spark Tips. Use DataFrame API

Apache Spark is a major topic of discussion in Big Data, capable of boosting performance 10-100x over comparable tools. But what about Pyspark?

Spark Tips. Partition Tuning

Data partitioning is critical to data processing performance especially for large volumes of data processing in Spark. Here are some partitioning tips
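
One of those tips reduces to simple arithmetic. A common rule of thumb (an assumption, not a Spark API) is to size partitions at roughly 100-200 MB each:

```python
# Rough heuristic: pick a partition count so each partition holds roughly
# 128 MB of data. The 128 MB target is a common rule of thumb, not a Spark API.

import math

def suggested_partitions(dataset_bytes, target_partition_mb=128):
    """Suggest a partition count from total data size and a target partition size."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(1, math.ceil(dataset_bytes / target_bytes))

# e.g. a 50 GB shuffle at ~128 MB per partition:
print(suggested_partitions(50 * 1024**3))  # 400
```

A number like this is what you would feed into `spark.sql.shuffle.partitions` or a `repartition()` call instead of accepting the defaults.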

Spark Tips. Caching

Another batch of Apache Spark tips, this time about caching and checkpointing data

Spark Tips. Don't collect data on driver

Apache Spark is a major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. But those speeds are only achievable if you avoid pitfalls like this one.
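
A back-of-envelope check (with hypothetical numbers) shows why `collect()` is risky: it pulls every row of a distributed result onto the single driver JVM:

```python
# Back-of-envelope estimate of what df.collect() would pull onto the driver.
# The row count, row width, and driver heap below are hypothetical examples.

def collect_size_gb(row_count, avg_row_bytes):
    """Estimated in-memory size of a collected result, in GB."""
    return row_count * avg_row_bytes / 1024**3

rows = 200_000_000                       # 200M rows, an example figure
est = collect_size_gb(rows, avg_row_bytes=120)
driver_heap_gb = 8                       # an assumed driver heap size
print(f"~{est:.1f} GB collected vs {driver_heap_gb} GB driver heap")
# ~22.4 GB collected vs 8 GB driver heap
```

When you only need a sample, `df.take(n)` or `df.limit(n)` bounds what reaches the driver; when you need the full result, write it to storage instead of collecting it.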

Spark Tips. Optimizing JDBC data source reads

Spark supports loading data from tables over JDBC, but engineers often use this interface blindly, not realizing that it is not optimized by default
PySpark in Practice

Introduction to Pyspark join types

DataFrames and the Spark SQL API are the future of the Spark world. Here, I will extend your Pyspark SQL knowledge to the different types of joins
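
To ground the terminology, here is a minimal hash-join sketch in plain Python showing what an inner join computes; this is an illustration of the idea, not Spark's implementation, though Spark's broadcast hash join follows the same build-then-probe pattern:

```python
# Plain-Python inner hash join: build a hash table on one side, probe with
# the other. Illustrative only; Spark does this per partition in the JVM.

def inner_join(left, right, key):
    # Build phase: index the (ideally smaller) right side by join key.
    table = {}
    for row in right:
        table.setdefault(row[key], []).append(row)
    # Probe phase: look up each left row; emit merged rows on a match.
    out = []
    for row in left:
        for match in table.get(row[key], []):
            merged = {**row, **{k: v for k, v in match.items() if k != key}}
            out.append(merged)
    return out

orders = [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 15}]
users = [{"user_id": 1, "name": "Ada"}]
print(inner_join(orders, users, "user_id"))
# [{'user_id': 1, 'amount': 30, 'name': 'Ada'}]
```

Note that the unmatched order (user_id 2) is dropped; a left join would keep it with a null name, which is exactly the distinction the post walks through.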

The 5-minute guide to using bucketing in Pyspark

A guide to Pyspark bucketing, an optimization technique that uses buckets to determine data partitioning and avoid data shuffles.
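
The core idea fits in a few lines: rows whose keys hash to the same bucket land in the same file, so two tables bucketed identically can be joined without a shuffle. A sketch with integer keys (Spark actually uses Murmur3 hashing; plain modulo here is a stand-in):

```python
# Bucketing sketch: assign each row to a bucket by key. Spark uses
# Murmur3 hash(key) % numBuckets; simple modulo is a stand-in here.

NUM_BUCKETS = 4

def bucket_for(key: int) -> int:
    """Stand-in for Spark's hash(key) % numBuckets bucket assignment."""
    return key % NUM_BUCKETS

keys = [3, 7, 8, 12, 14]
print([bucket_for(k) for k in keys])  # [3, 3, 0, 0, 2]
```

In Pyspark this corresponds to writing with `df.write.bucketBy(4, "key").saveAsTable("t")`: any table bucketed the same way on the same key can then join bucket-to-bucket.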

How to Speed Up Spark Jobs on Small Test Datasets

Effective strategies for speeding up your Apache Spark jobs on small test datasets of under 1 million records