Mastering Apache Spark
From core concepts to production tuning — everything you need to understand and optimize Spark.
16 posts · 153 min total read
Foundations
Apache Spark Core Concepts Explained
Let's take a deep dive into Apache Spark's core abstractions
Why Is the Apache Spark RDD Immutable?
Let's discuss Apache Spark's RDD immutability, a key to its efficiency in distributed data processing
Anatomy of an Apache Spark Application
An autopsy of an Apache Spark job
Cluster Managers for Apache Spark: from YARN to Kubernetes
A deep dive into the machinery that orchestrates Spark
Internals
Spark Partitions
How partitioning shapes Spark performance, and what to do when it doesn't
Deep Dive into Spark Memory Management
The real reason your Spark cluster is burning money
Spark Caching Explained: What Really Happens Under the Hood
You called .cache(). Spark said "maybe". Let's talk about what actually happens.
Spark History Server Deep Dive
To tune the performance of Spark jobs and debug finished tasks, engineers need information, and the Spark events stored on the History Server provide it
Spark Tips
Spark Tips. Use DataFrame API
Apache Spark is the main talking point in Big Data, boasting performance 10-100x faster than comparable tools. But what about Pyspark?
Spark Tips. Partition Tuning
Data partitioning is critical to processing performance in Spark, especially for large data volumes. Here are some partitioning tips
Spark Tips. Caching
Another batch of Apache Spark tips, this time about caching and checkpointing data
Spark Tips. Don't collect data on driver
Apache Spark is the major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. These speeds are achievable using the tips described here.
Spark Tips. Optimizing JDBC data source reads
Spark supports loading data from tables via JDBC, but engineers sometimes use this interface blindly, not realizing that it is not optimized by default
PySpark in Practice
Introduction to Pyspark join types
The DataFrame and Spark SQL APIs are the wave of the future in the Spark world. Here, I will push your Pyspark SQL knowledge further by using different types of joins
The 5-minute guide to using bucketing in Pyspark
A guide to Pyspark bucketing — an optimization technique that uses buckets to determine data partitioning and avoid data shuffles.
How to Speed Up Spark Jobs on Small Test Datasets
Effective strategies for speeding up your Apache Spark jobs on small test datasets of under 1 million records