Mastering Apache Spark
From core concepts to production tuning — everything you need to understand and optimize Spark.
16 posts · 153 min total read
Foundations
Apache Spark Core Concepts Explained
Let's take a deep dive into Apache Spark's core abstractions
Why Is the Apache Spark RDD Immutable?
Let's discuss Apache Spark's RDD immutability, a key to its efficiency in distributed data processing
Anatomy of an Apache Spark Application
An autopsy of an Apache Spark job
Cluster Managers for Apache Spark: from YARN to Kubernetes
A deep dive into the machinery that orchestrates Spark
Internals
Spark Partitions
How partitioning shapes Spark performance, and what to do when it doesn't
Deep Dive into Spark Memory Management
The real reason your Spark cluster is burning money
Spark Caching Explained: What Really Happens Under the Hood
You called .cache(). Spark said "maybe". Let's talk about what actually happens.
Spark History Server Deep Dive
To tune the performance of Spark jobs and debug finished tasks, engineers need information, and the Spark events stored on the History Server provide it
Spark Tips
Spark Tips. Use DataFrame API
Apache Spark is the main talking point in Big Data, boasting performance 10-100x faster than comparable tools. But what about Pyspark?
Spark Tips. Partition Tuning
Data partitioning is critical to processing performance in Spark, especially for large data volumes. Here are some partitioning tips
Spark Tips. Caching
Another batch of Apache Spark tips, this time about caching and checkpointing data
Spark Tips. Don't collect data on driver
Apache Spark is the major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. These speeds are achievable using the tips described here.
Spark Tips. Optimizing JDBC data source reads
Spark supports loading data from tables via JDBC, but engineers sometimes use this interface blindly, not realizing that it is not optimized by default
PySpark in Practice
Introduction to Pyspark join types
The DataFrame and Spark SQL APIs are the wave of the future in the Spark world. Here, I will push your Pyspark SQL knowledge further by using different types of joins
The 5-minute guide to using bucketing in Pyspark
A guide to Pyspark bucketing — an optimization technique that uses buckets to determine data partitioning and avoid data shuffles.
How to Speed Up Spark Jobs on Small Test Datasets
Effective strategies for speeding up your Apache Spark jobs on small test datasets of under 1 million records