I recently stumbled across an interesting interview question: "Which is harder for a database to handle — INSERT or UPDATE?"
At first, you might think, "Does it even matter? They serve different purposes, right?" INSERT adds new rows, and UPDATE modifies existing ones — case closed. But when it comes to data ingestion, these two operations often collide. When you're loading data into a system — syncing external sources, streaming events, or updating analytics tables — choosing between INSERT and UPDATE can significantly affect your database's performance and scalability.
So, let's explore how INSERT and UPDATE work under the hood (we'll focus on the data ingestion context, but the concepts apply to other scenarios as well). We'll dig into the challenges they create, the trade-offs you should consider, and when you might use one over the other — or a hybrid approach like UPSERT.
The INSERT Operation
At first glance, INSERT seems straightforward — add a new row into the system. However, behind the scenes, it's far from trivial. Let's dive into the mechanics.
Indexes
When you insert a row, it's not just about adding data to the table. The database engine must also update all associated indexes. If you think of indexes as catalogs, every new entry needs to be filed in exactly the right spot, ensuring the catalog stays sorted and useful.
More indexes mean more overhead. Each additional index requires extra effort to calculate, locate, and store the new data. While a single-column index is relatively quick to update, composite or unique indexes (like (user_id, created_at)) add complexity and latency.
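Here's a minimal, PostgreSQL-flavored sketch of that multiplication effect (the events table and its index names are hypothetical): every insert below has to maintain three index structures, not just the table itself.

```sql
-- Hypothetical ingestion table with three index structures to maintain.
CREATE TABLE events (
    id         BIGSERIAL PRIMARY KEY,              -- index #1 (implicit)
    user_id    BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload    TEXT
);

CREATE INDEX idx_events_user ON events (user_id);  -- index #2

-- index #3: composite AND unique, so each insert also checks for duplicates.
CREATE UNIQUE INDEX idx_events_user_time ON events (user_id, created_at);

-- One logical row, but three index entries to compute and place.
INSERT INTO events (user_id, payload) VALUES (42, '{"type": "click"}');
```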
Constraints and Triggers
Before new data officially lands in a table, it must pass several integrity checks. These constraints include foreign keys to enforce relationships, UNIQUE constraints to prevent duplicate entries, and NOT NULL requirements to ensure data quality. If the schema includes computed columns, the database must calculate their values on the fly.
Additionally, triggers can execute custom logic during insertion, like logging or cascading updates (or whatever else your application needs). While useful, they can significantly increase the workload, especially for complex data schemas.
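As a sketch, here's what a single INSERT might have to get through in PostgreSQL before the row lands (all names are hypothetical, and the users and order_audit tables are assumed to exist):

```sql
CREATE TABLE orders (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL REFERENCES users (id), -- FK: lookup in users
    external_id TEXT NOT NULL UNIQUE,                  -- uniqueness check
    amount      NUMERIC NOT NULL CHECK (amount >= 0)   -- value validation
);

-- A trigger bolts custom logic onto every insert, e.g. an audit trail.
CREATE FUNCTION log_order() RETURNS trigger AS $$
BEGIN
    INSERT INTO order_audit (order_id, logged_at) VALUES (NEW.id, now());
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_audit AFTER INSERT ON orders
    FOR EACH ROW EXECUTE FUNCTION log_order();

-- This one statement now implies: an FK check, a unique check, a CHECK
-- constraint, NOT NULL checks, plus a second insert via the trigger.
INSERT INTO orders (user_id, external_id, amount) VALUES (42, 'ord-001', 99.90);
```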
Concurrency
INSERT operations tend to play nice with locks, using minimal locking mechanisms like row or page locks. However, concurrency can still cause headaches in high-load systems.
If multiple threads attempt to insert data into the same table or partition concurrently, lock contention can occur. Or worse, deadlocks — where two transactions block each other indefinitely.
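One common mitigation in ingestion pipelines is batching: fewer, larger statements mean less per-statement overhead and fewer lock acquisitions. A sketch against the hypothetical events table from above:

```sql
-- Three single-row inserts: three statements, each with its own
-- parsing, index-maintenance pass, and lock acquisition.
INSERT INTO events (user_id, payload) VALUES (1, 'a');
INSERT INTO events (user_id, payload) VALUES (2, 'b');
INSERT INTO events (user_id, payload) VALUES (3, 'c');

-- The same rows as one multi-row insert: one statement, one pass.
INSERT INTO events (user_id, payload) VALUES
    (1, 'a'),
    (2, 'b'),
    (3, 'c');
```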
Storage Bloat
Bulk inserts or frequent small inserts can lead to storage inefficiencies over time:
- Fragmentation: Frequent inserts can fragment the underlying storage, reducing performance for subsequent reads and writes.
- Bloat: Tables without proper maintenance (e.g., periodic purging or archiving) can grow unnecessarily large, eating into storage and degrading performance; see the maintenance sketch below.
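What that maintenance looks like depends on the engine. A PostgreSQL-flavored sketch (index name hypothetical; MySQL has analogues such as OPTIMIZE TABLE):

```sql
-- Reclaim space from dead tuples and refresh planner statistics.
VACUUM ANALYZE events;

-- Rebuild a fragmented index without blocking concurrent writes.
REINDEX INDEX CONCURRENTLY idx_events_user;
```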
The UPDATE Operation
While INSERT introduces new data, UPDATE modifies what already exists — and this is where things get tricky. UPDATE operations often involve multiple subsystems within the database, making them resource-intensive.
Finding the Rows
Every UPDATE starts with a search. The database engine must locate the exact rows that match the WHERE clause.
If the WHERE clause targets indexed columns, this can be relatively efficient. However, if the query filters on non-indexed columns, the engine may have to scan the entire table. On small datasets, this is manageable, but as the table grows, the lookup process can become painfully slow.
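You can see the difference in the query plan. A PostgreSQL sketch against the hypothetical events table (plan output abridged and illustrative):

```sql
-- Filter on an indexed column: the planner can jump straight to the rows.
EXPLAIN UPDATE events SET payload = '{}' WHERE user_id = 42;
-- Update on events
--   ->  Index Scan using idx_events_user on events ...

-- Filter on a non-indexed column: the planner has to read every row.
EXPLAIN UPDATE events SET payload = '{}' WHERE payload LIKE '%click%';
-- Update on events
--   ->  Seq Scan on events ...
```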
Indexed Columns
While indexes are great for speeding up searches, they add significant overhead when modified. The database must not only update the row but also revise every associated index. Hence, updating indexed columns can be slow.
For composite indexes, the complexity multiplies. Updating a single value can lead to recalculating and repositioning the index entry. The more indexes tied to a column, the more time the database spends shuffling data, slowing down the UPDATE.
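PostgreSQL makes this cost visible: an UPDATE that touches no indexed columns can often complete as a heap-only tuple (HOT) update and skip index maintenance entirely, while an UPDATE to an indexed column cannot. A sketch using the same hypothetical schema:

```sql
-- payload is not indexed: PostgreSQL can often do a HOT update
-- and leave all three indexes untouched.
UPDATE events SET payload = '{"type": "view"}' WHERE id = 7;

-- user_id appears in idx_events_user and idx_events_user_time, so this
-- update must also write new entries into both indexes.
UPDATE events SET user_id = 43 WHERE id = 7;
```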
Locking
Unlike the relatively lock-friendly INSERT, UPDATE operations can be lock-heavy. To ensure data integrity, the database often locks the rows it's updating, preventing other operations from modifying them simultaneously. On small, low-traffic tables, this is fine. But in high-concurrency environments, these locks can snowball into the same two suspects: lock contention or deadlocks.
Transaction Logs
Databases maintain transaction logs to ensure consistency and support rollbacks. While both INSERT and UPDATE write to the log, UPDATE tends to be more verbose.
This is especially true in databases like PostgreSQL, which use Multi-Version Concurrency Control (MVCC). Here, UPDATE doesn't overwrite the existing row. Instead, it creates a new version of the row and marks the old one as "dead". This process ensures consistency and supports concurrent reads, but can lead to fragmentation over time and necessitate regular VACUUM operations to reclaim storage. Inserts are generally faster because they simply append new tuples, but high concurrency on indexed tables can still create bottlenecks.
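You can watch those dead row versions accumulate through PostgreSQL's statistics views. A sketch (table name assumed):

```sql
-- Each UPDATE leaves a dead row version behind under MVCC.
UPDATE events SET payload = '{}' WHERE user_id = 42;

-- pg_stat_user_tables tracks live vs. dead tuples per table.
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'events';

-- VACUUM reclaims the space held by dead row versions.
VACUUM events;
```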
Which One's Harder?
So, given all of this, which operation is actually tougher on the database?
If you're adding new rows with few or no indexes and minimal constraints, INSERT will usually be the easier option. It's generally the lighter operation when the data is fresh and the table isn't heavily indexed. But once indexes, triggers, and high concurrency enter the mix, INSERT can start to strain the database's resources.
UPDATE, on the other hand, often demands more from the database right from the start. From locating the rows, updating data, managing indexes, to locking resources — it's a layered process, and each step can lead to performance bottlenecks. With multiple indexes or high transaction volume, UPDATE can become a costly operation, both in terms of time and database performance.
Wrapping it Up
Choosing between INSERT and UPDATE depends heavily on your use case and your database's structure. For instance, if you're dealing with high volumes of updates, you might want to consider alternative strategies, like inserting a new row and deleting the old one rather than performing a heavy update, especially if updates risk fragmenting the table or impacting multiple indexes.
Another tactic is to avoid large-scale updates altogether when possible. Instead, leverage optimized methods like INSERT ... ON DUPLICATE KEY UPDATE in MySQL/MariaDB, INSERT ... ON CONFLICT DO UPDATE in PostgreSQL, or MERGE in SQL Server, which can handle both insertions and updates more efficiently. These commands are designed to reduce overhead, making the database's job easier in many cases.
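As an illustration, here's what an upsert looks like in two of those dialects (the sensor_readings table and its columns are hypothetical; both variants assume a unique key on (sensor_id, reading_day)):

```sql
-- PostgreSQL: insert the row, or update it in place on a key conflict.
INSERT INTO sensor_readings (sensor_id, reading_day, value)
VALUES (17, '2024-01-15', 22.4)
ON CONFLICT (sensor_id, reading_day)
DO UPDATE SET value = EXCLUDED.value;

-- MySQL/MariaDB equivalent:
INSERT INTO sensor_readings (sensor_id, reading_day, value)
VALUES (17, '2024-01-15', 22.4)
ON DUPLICATE KEY UPDATE value = VALUES(value);
```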
Additional materials
- Database Internals by Alex Petrov - a book underrated by the community
- Seven Databases in Seven Weeks by Luc Perkins, Eric Redmond, Jim Wilson