Choosing the Right Compression Codec

Last updated Jul 28, 2025

#data-engineering #premium

Let me tell you about the time I shot myself in the foot with Gzip.

It started innocently enough. We were building a new data pipeline: daily ingestion of CSV files from an upstream team, landing in Azure Blob Storage and then feeding into an Apache Spark job downstream. You've probably built something like it.

And because I thought I was being smart (saving costs, shaving off network transfer time, keeping the files tidy; to be fair, I don't remember the exact "why"), I told the data producer team, "Hey, compress the files with Gzip before dropping them in object storage." They said, "Cool." I said, "Great."

Fast forward a couple of weeks, and things started to smell.

Jobs were slowing down. Stages were stalling — half the tasks were stuck at "0%", while others finished in seconds. Executors would randomly spin up, read one file, and then sit there doing nothing.

At first, I thought I'd messed up the partitioning. Maybe I needed more shuffle memory. Maybe autoscaling was drunk again. You know how it goes: blame literally everything except the obvious. Isn't that the first rule of software engineering?
