Choosing the Right Compression Codec

Let me tell you about the time I shot myself in the foot with Gzip.

It started innocently enough. We were building a new data pipeline: daily ingestion of CSV files from an upstream team, landing in Azure Blob Storage and then feeding into an Apache Spark job downstream. You've probably built something like it.

And because I thought I was being smart, whether by saving costs, shaving off network transfer time, or keeping the files tidy (I don't remember the exact "why", to be fair), I told the data producer team, "Hey, compress the files with Gzip before dropping them in the object storage." They said, "Cool." I said, "Great."

Fast forward a couple of weeks, and things started to smell.

Jobs were slowing down. Stages were stalling: half the tasks were stuck at "0%", while others finished in seconds. Executors would randomly spin up, read one file, and then sit there doing nothing.

At first, I thought I'd possibly messed up the partitioning. Maybe I needed more shuffle memory. Maybe autoscaling was drunk again. You know how it goes: blame literally everything except the obvious. Isn't that the first rule of software engineering?
