I recently stumbled across an interesting interview question: "Which is harder for a database to handle — INSERT or UPDATE?"
At first, you might think, "Does it even matter? They serve different purposes, right?" INSERT adds new rows, and UPDATE modifies existing ones — case closed. But when it comes to data ingestion, these two operations often overlap. When you're loading data into a system — syncing external sources, inserting records, or updating analytics tables — choosing between INSERT and UPDATE can significantly affect your database's performance and scalability.
In this blog post, I want to explore how INSERT and UPDATE work under the hood (we will focus on the data ingestion context, but the concepts should apply to other scenarios as well).
The INSERT Operation
At first glance, INSERT seems straightforward — add a new row into the system. But if you look under the hood, things are more complicated than they may seem.
Indexes
When you insert a row, it's not just about adding data to the table. The database engine must also update all associated indexes. If you think of indexes as catalogs, every new entry must be carefully slotted into the right position, ensuring the catalog stays sorted and useful.
More indexes mean more overhead. Each additional index requires extra effort to calculate, locate, and store the new data. While a single-column index is relatively quick to update, composite or unique indexes (like (user_id, created_at)) add complexity and latency.
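To make the index overhead tangible, here is a small sketch using Python's built-in sqlite3 module (the `events` table and index names are made up for illustration): it times the same bulk insert with and without secondary indexes. SQLite is a simpler engine than a client-server database, but the pattern, every index being one more structure each INSERT must maintain, is the same.

```python
import sqlite3
import time

def time_inserts(with_indexes: bool, n: int = 20_000) -> float:
    # In-memory database; "events" and its columns are illustrative names.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, created_at TEXT, payload TEXT)")
    if with_indexes:
        # Each extra index is one more structure every INSERT must maintain.
        conn.execute("CREATE INDEX idx_user ON events (user_id)")
        conn.execute("CREATE UNIQUE INDEX idx_user_time ON events (user_id, created_at)")
    start = time.perf_counter()
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        ((i, str(i), "x" * 50) for i in range(n)),
    )
    conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

print(f"no indexes:  {time_inserts(False):.3f}s")
print(f"two indexes: {time_inserts(True):.3f}s")  # typically noticeably slower
```

The exact ratio depends on the engine and hardware, but adding indexes should consistently make the indexed run slower.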
Constraints and Triggers
Before new data officially lands in a table, it must pass several integrity checks. These constraints include foreign keys to enforce relationships, UNIQUE constraints to prevent duplicate entries, and NOT NULL requirements to ensure data quality. If the schema includes computed columns, the database must calculate their values on the fly.
Additionally, triggers can execute custom logic during insertion, like logging or cascading updates (or whatever custom logic you might have). While useful, they can significantly increase the workload, especially for complex data schemas.
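Here is a sketch of these checks in action, again with sqlite3 (the `users`/`orders`/`audit` schema is invented for the example). One insert passes every constraint and fires an audit trigger; a second is rejected by the foreign key before it ever lands:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        sku TEXT NOT NULL UNIQUE
    );
    CREATE TABLE audit (msg TEXT);
    -- Trigger: extra work the engine performs on every successful INSERT.
    CREATE TRIGGER log_order AFTER INSERT ON orders
    BEGIN
        INSERT INTO audit VALUES ('order ' || NEW.id || ' inserted');
    END;
""")
conn.execute("INSERT INTO users VALUES (1)")
conn.execute("INSERT INTO orders VALUES (1, 1, 'ABC-123')")  # passes checks, fires trigger

try:
    conn.execute("INSERT INTO orders VALUES (2, 99, 'XYZ-999')")  # no such user
except sqlite3.IntegrityError as e:
    print("rejected:", e)

print(conn.execute("SELECT msg FROM audit").fetchall())
```

Every one of those checks and the trigger body run inside the INSERT's critical path, which is exactly why a "simple" insert can cost more than it looks.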
Concurrency
INSERT operations tend to play nice with locks, using minimal locking mechanisms like row or page locks. However, concurrency can still cause headaches in high-load systems. If multiple threads attempt to insert data into the same table or partition concurrently, lock contention can occur. Or worse, deadlocks — where two transactions block each other indefinitely.
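Lock contention is easy to reproduce in miniature. A caveat on the example below: SQLite takes a database-wide write lock rather than the row or page locks of a server database, so this is a deliberately coarse analogy, but the shape of the problem (a second writer stalled behind the first) is the same:

```python
import sqlite3
import os
import tempfile

# Two connections to the same database file, with a short lock timeout
# so contention surfaces quickly instead of silently waiting.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer1 = sqlite3.connect(path, timeout=0.1)
writer2 = sqlite3.connect(path, timeout=0.1)
writer1.execute("CREATE TABLE t (x INTEGER)")
writer1.commit()

writer1.execute("BEGIN IMMEDIATE")            # take the write lock and hold it
writer1.execute("INSERT INTO t VALUES (1)")
try:
    writer2.execute("INSERT INTO t VALUES (2)")   # contends for the same lock
except sqlite3.OperationalError as e:
    print("second writer blocked:", e)            # "database is locked"
writer1.commit()                                  # release the lock
writer2.execute("INSERT INTO t VALUES (2)")       # now succeeds
writer2.commit()
```

In a server database with row-level locking, two inserts into different rows would usually not collide like this; the contention shows up instead on hot pages, unique indexes, or the same partition.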
Storage Bloat
Bulk inserts or frequent small inserts can cause a table's physical size to grow over time, even though the amount of useful data in it may not increase much:
- Fragmentation: Frequent inserts can fragment the underlying storage, reducing performance for subsequent reads and writes.
- Bloat: Tables without proper maintenance (e.g., periodic purging or archiving) can grow unnecessarily large, eating into storage and degrading performance.
The UPDATE Operation
While INSERT introduces new data, UPDATE modifies what already exists — and this is where things get tricky. UPDATE operations often involve multiple subsystems within the database, making it resource-intensive.
Finding the Rows
Every UPDATE starts with a search. The database engine must locate the exact rows that match the WHERE clause.
If the WHERE clause targets indexed columns, this can be relatively efficient. However, if the query filters on non-indexed columns, the engine may have to scan the entire table. On small datasets, this is manageable, but as the table grows, the lookup process can become painfully slow.
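You can see this difference without running the update at all, by asking the engine for its plan. A sketch with sqlite3 (table and column names invented), using EXPLAIN QUERY PLAN: the indexed filter becomes a SEARCH, the non-indexed one a full SCAN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_email ON users (email)")

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN reports how SQLite will locate the matching rows;
    # the fourth column of the result row holds the human-readable detail.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Indexed WHERE: the engine can jump straight to the rows.
print(plan("UPDATE users SET country = 'x' WHERE email = 'a@b.c'"))
# Non-indexed WHERE: the engine must scan the whole table first.
print(plan("UPDATE users SET country = 'x' WHERE country = 'NL'"))
```

Server databases expose the same information through their own EXPLAIN variants, and checking the plan before shipping a large UPDATE is a cheap habit.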
Indexed Columns
While indexes are great for speeding up searches, they add significant overhead when modified. The database must not only update the row but also revise every associated index. Hence, updating indexed columns can be slow.
For composite indexes, the complexity multiplies. Updating a single value can lead to recalculating and repositioning the index entry. The more indexes tied to a column, the more time the database spends shuffling data, slowing down the UPDATE.
Concurrency
Unlike the relatively concurrency-friendly INSERT, UPDATE operations can be lock-heavy. To ensure data integrity, the database often locks the rows it's updating, preventing other operations from modifying them simultaneously. On small, low-traffic tables, this is fine. But in high-concurrency environments, these locks can snowball into the same two suspects: lock contention or deadlocks.
Transaction Logs
Databases maintain transaction logs to ensure consistency and support rollbacks. While both INSERT and UPDATE write to the transaction log, UPDATE tends to be more verbose.
This is especially true in databases like PostgreSQL, which use Multi-Version Concurrency Control (MVCC). Here, UPDATE doesn't overwrite the existing row; instead, it creates a new version of the row and marks the old one as "dead". This process ensures consistency and supports concurrent reads, but can lead to fragmentation over time and necessitate regular VACUUM operations to reclaim storage.
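SQLite is not MVCC in the PostgreSQL sense, but it shows the same symptom in miniature, so we can sketch the "dead space until VACUUM" effect with sqlite3: space freed by deletes (or consumed by row rewrites) stays inside the file until a VACUUM rebuilds it:

```python
import sqlite3
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bloat.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (data TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", (("x" * 500,) for _ in range(10_000)))
conn.commit()
before = os.path.getsize(path)

conn.execute("DELETE FROM t")            # logically empty...
conn.commit()
after_delete = os.path.getsize(path)     # ...but the file is still big:
                                         # freed pages sit on a freelist

conn.execute("VACUUM")                   # rewrite the file, reclaiming dead space
reclaimed = os.path.getsize(path)

print(before, after_delete, reclaimed)
```

In PostgreSQL the mechanics differ (dead row versions rather than freelist pages, and autovacuum usually handles this for you), but the lesson carries over: heavy UPDATE traffic quietly accumulates storage that only maintenance gives back.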
Which One's Harder?
If you're adding new rows to a table with few or no indexes and minimal constraints, INSERT will usually be the easier option, especially when the data is fresh and the table isn't heavily indexed. But once indexes, triggers, and high concurrency enter the mix, INSERT too can start to strain the database's resources.
UPDATE, on the other hand, often demands more from the database right from the start. From locating the rows, updating data, managing indexes, to locking resources — it's a layered process, and each step can lead to performance bottlenecks. With multiple indexes or high transaction volume, UPDATE can become a costly operation, both in terms of time and database performance.
Wrapping it Up
Choosing between INSERT and UPDATE depends heavily on your use case and your database's structure. For instance, if you're dealing with high volumes of updates, you might want to consider alternative strategies, like inserting a new row and deleting the old one rather than performing a heavy update, especially if updates risk fragmenting the table or impacting multiple indexes.
Another tactic is to avoid large-scale updates altogether when possible. Instead, leverage optimized methods like INSERT ... ON DUPLICATE KEY UPDATE in MySQL/MariaDB, INSERT ... ON CONFLICT DO UPDATE in PostgreSQL, or MERGE in SQL Server, which can handle both insertions and updates in a single statement. These commands are designed to reduce overhead, making the database's job easier in many cases.
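SQLite happens to share PostgreSQL's ON CONFLICT syntax, so here's a small upsert sketch via sqlite3 (the `stats` table is an invented example): one statement either inserts a fresh row or bumps the existing one, with no separate "check then update" round trip:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (user_id INTEGER PRIMARY KEY, visits INTEGER)")

# One statement covers both cases: insert on first sight,
# update in place on a primary-key conflict.
upsert = """
    INSERT INTO stats (user_id, visits) VALUES (?, 1)
    ON CONFLICT (user_id) DO UPDATE SET visits = visits + 1
"""
conn.execute(upsert, (42,))   # first time: plain insert
conn.execute(upsert, (42,))   # second time: takes the update path
conn.execute(upsert, (7,))

print(conn.execute("SELECT user_id, visits FROM stats ORDER BY user_id").fetchall())
# → [(7, 1), (42, 2)]
```

Because the engine resolves the conflict internally, the upsert also sidesteps the classic race where two clients both check for a row, find nothing, and both try to insert it.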
Additional materials
- Database Internals by Alex Petrov — an underrated book that deserves more attention
- Seven Databases in Seven Weeks by Luc Perkins, Eric Redmond, Jim Wilson