I recently stumbled across an interesting interview question: "Which is harder for a database to handle — INSERT or UPDATE?"
At first, you might think, "Does it even matter? They serve different purposes, right?" INSERT adds new rows, and UPDATE modifies existing ones — case closed. But when it comes to data ingestion, these two operations often overlap. When you're loading data into a system — syncing external sources, inserting records, or updating analytics tables — choosing between INSERT and UPDATE can significantly affect your database's performance and scalability.
In this blog post, I want to explore how INSERT and UPDATE work under the hood (we will focus on the data ingestion context, but the concepts should apply to other scenarios as well).
The INSERT Operation
At first glance, INSERT seems straightforward — add a new row into the system. But if you look under the hood, things are more complicated than they may seem.
Indexes
When you insert a row, it's not just about adding data to the table. The database engine must also update all associated indexes. If you think of indexes as catalogs, every new entry must be carefully slotted into the right position, ensuring the catalog stays sorted and useful.
More indexes mean more overhead. Each additional index requires extra effort to calculate, locate, and store the new data. While a single-column index is relatively quick to update, composite or unique indexes (like (user_id, created_at)) add complexity and latency.
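To make the index overhead tangible, here is a small sketch using Python's built-in sqlite3 module (the `events` table and index names are made up for illustration): it times the same bulk insert with and without secondary indexes. SQLite is a simpler engine than a client-server database, but the pattern, every index being one more structure each INSERT must maintain, is the same.

```python
import sqlite3
import time

def time_inserts(with_indexes: bool, n: int = 20_000) -> float:
    # In-memory database; "events" and its columns are illustrative names.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, created_at TEXT, payload TEXT)")
    if with_indexes:
        # Each extra index is one more structure every INSERT must maintain.
        conn.execute("CREATE INDEX idx_user ON events (user_id)")
        conn.execute("CREATE UNIQUE INDEX idx_user_time ON events (user_id, created_at)")
    start = time.perf_counter()
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        ((i, str(i), "x" * 50) for i in range(n)),
    )
    conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

print(f"no indexes:  {time_inserts(False):.3f}s")
print(f"two indexes: {time_inserts(True):.3f}s")  # typically noticeably slower
```

The exact ratio depends on the engine and hardware, but adding indexes should consistently make the indexed run slower.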
Constraints and Triggers
Before new data officially lands in a table, it must pass several integrity checks. These constraints include foreign keys to enforce relationships, UNIQUE constraints to prevent duplicate entries, and NOT NULL requirements to ensure data quality. If the schema includes computed columns, the database must calculate their values on the fly.
Additionally, triggers can execute custom logic during insertion, like logging or cascading updates (or whatever custom logic you might have). While useful, they can significantly increase the workload, especially for complex data schemas.
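Here is a sketch of these checks in action, again with sqlite3 (the `users`/`orders`/`audit` schema is invented for the example). One insert passes every constraint and fires an audit trigger; a second is rejected by the foreign key before it ever lands:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        sku TEXT NOT NULL UNIQUE
    );
    CREATE TABLE audit (msg TEXT);
    -- Trigger: extra work the engine performs on every successful INSERT.
    CREATE TRIGGER log_order AFTER INSERT ON orders
    BEGIN
        INSERT INTO audit VALUES ('order ' || NEW.id || ' inserted');
    END;
""")
conn.execute("INSERT INTO users VALUES (1)")
conn.execute("INSERT INTO orders VALUES (1, 1, 'ABC-123')")  # passes checks, fires trigger

try:
    conn.execute("INSERT INTO orders VALUES (2, 99, 'XYZ-999')")  # no such user
except sqlite3.IntegrityError as e:
    print("rejected:", e)

print(conn.execute("SELECT msg FROM audit").fetchall())
```

Every one of those checks and the trigger body run inside the INSERT's critical path, which is exactly why a "simple" insert can cost more than it looks.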
Concurrency
INSERT operations tend to play nice with locks, using minimal locking mechanisms like row or page locks. However, concurrency can still cause headaches in high-load systems. If multiple threads attempt to insert data into the same table or partition concurrently, lock contention can occur. Or worse, deadlocks — where two transactions block each other indefinitely.
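Lock contention is easy to reproduce in miniature. A caveat on the example below: SQLite takes a database-wide write lock rather than the row or page locks of a server database, so this is a deliberately coarse analogy, but the shape of the problem (a second writer stalled behind the first) is the same:

```python
import sqlite3
import os
import tempfile

# Two connections to the same database file, with a short lock timeout
# so contention surfaces quickly instead of silently waiting.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer1 = sqlite3.connect(path, timeout=0.1)
writer2 = sqlite3.connect(path, timeout=0.1)
writer1.execute("CREATE TABLE t (x INTEGER)")
writer1.commit()

writer1.execute("BEGIN IMMEDIATE")            # take the write lock and hold it
writer1.execute("INSERT INTO t VALUES (1)")
try:
    writer2.execute("INSERT INTO t VALUES (2)")   # contends for the same lock
except sqlite3.OperationalError as e:
    print("second writer blocked:", e)            # "database is locked"
writer1.commit()                                  # release the lock
writer2.execute("INSERT INTO t VALUES (2)")       # now succeeds
writer2.commit()
```

In a server database with row-level locking, two inserts into different rows would usually not collide like this; the contention shows up instead on hot pages, unique indexes, or the same partition.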
Storage Bloat
Bulk inserts or frequent small inserts can cause a table's physical size to grow over time, even though the amount of useful data in it may not increase much:
- Fragmentation: Frequent inserts can fragment the underlying storage, reducing performance for subsequent reads and writes.
- Bloat: Tables without proper maintenance (e.g., periodic purging or archiving) can grow unnecessarily large, eating into storage and degrading performance.
The UPDATE Operation
While INSERT introduces new data, UPDATE modifies what already exists — and this is where things get tricky. UPDATE operations often involve multiple subsystems within the database, making it resource-intensive.
Finding the Rows
Every UPDATE starts with a search. The database engine must locate the exact rows that match the WHERE clause.
If the WHERE clause targets indexed columns, this can be relatively efficient. However, if the query filters on non-indexed columns, the engine may have to scan the entire table. On small datasets, this is manageable, but as the table grows, the lookup process can become painfully slow.
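You can see this difference without running the update at all, by asking the engine for its plan. A sketch with sqlite3 (table and column names invented), using EXPLAIN QUERY PLAN: the indexed filter becomes a SEARCH, the non-indexed one a full SCAN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_email ON users (email)")

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN reports how SQLite will locate the matching rows;
    # the fourth column of the result row holds the human-readable detail.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Indexed WHERE: the engine can jump straight to the rows.
print(plan("UPDATE users SET country = 'x' WHERE email = 'a@b.c'"))
# Non-indexed WHERE: the engine must scan the whole table first.
print(plan("UPDATE users SET country = 'x' WHERE country = 'NL'"))
```

Server databases expose the same information through their own EXPLAIN variants, and checking the plan before shipping a large UPDATE is a cheap habit.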
Indexed Columns
While indexes are great for speeding up searches, they add significant overhead when modified. The database must not only update the row but also revise every associated index. Hence, updating indexed columns can be slow.
For composite indexes, the complexity multiplies. Updating a single value can lead to recalculating and repositioning the index entry. The more indexes tied to a column, the more time the database spends shuffling data, slowing down the UPDATE.
Concurrency
Unlike the relatively concurrency-friendly INSERT, UPDATE operations can be lock-heavy. To ensure data integrity, the database often locks the rows it's updating, preventing other operations from modifying them simultaneously. On small, low-traffic tables, this is fine. But in high-concurrency environments, these locks can snowball into the same two suspects: lock contention or deadlocks.
Transaction Logs
Databases maintain transaction logs to ensure consistency and support rollbacks. While both INSERT and UPDATE write to the transaction log, UPDATE tends to be more verbose.
This is especially true in databases like PostgreSQL, which use Multi-Version Concurrency Control (MVCC). Here, UPDATE doesn't overwrite the existing row; instead, it creates a new version of the row and marks the old one as "dead". This process ensures consistency and supports concurrent reads, but can lead to fragmentation over time and necessitate regular VACUUM operations to reclaim storage.
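SQLite is not MVCC in the PostgreSQL sense, but it shows the same symptom in miniature, so we can sketch the "dead space until VACUUM" effect with sqlite3: space freed by deletes (or consumed by row rewrites) stays inside the file until a VACUUM rebuilds it:

```python
import sqlite3
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bloat.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (data TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", (("x" * 500,) for _ in range(10_000)))
conn.commit()
before = os.path.getsize(path)

conn.execute("DELETE FROM t")            # logically empty...
conn.commit()
after_delete = os.path.getsize(path)     # ...but the file is still big:
                                         # freed pages sit on a freelist

conn.execute("VACUUM")                   # rewrite the file, reclaiming dead space
reclaimed = os.path.getsize(path)

print(before, after_delete, reclaimed)
```

In PostgreSQL the mechanics differ (dead row versions rather than freelist pages, and autovacuum usually handles this for you), but the lesson carries over: heavy UPDATE traffic quietly accumulates storage that only maintenance gives back.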
Which One's Harder?
If you're adding new rows to a table with few or no indexes and minimal constraints, INSERT will usually be the easier option, especially when the data is fresh and the table isn't heavily indexed. But once indexes, triggers, and high concurrency enter the mix, INSERT too can start to strain the database's resources.
UPDATE, on the other hand, often demands more from the database right from the start. From locating the rows, updating data, managing indexes, to locking resources — it's a layered process, and each step can lead to performance bottlenecks. With multiple indexes or high transaction volume, UPDATE can become a costly operation, both in terms of time and database performance.
Wrapping it Up
Choosing between INSERT and UPDATE depends heavily on your use case and your database's structure. For instance, if you're dealing with high volumes of updates, you might want to consider alternative strategies, like inserting a new row and deleting the old one rather than performing a heavy update, especially if updates risk fragmenting the table or impacting multiple indexes.
Another tactic is to avoid large-scale updates altogether when possible. Instead, leverage optimized methods like INSERT ... ON DUPLICATE KEY UPDATE in MySQL/MariaDB, INSERT ... ON CONFLICT DO UPDATE in PostgreSQL, or MERGE in SQL Server, which can handle both insertions and updates in a single statement. These commands are designed to reduce overhead, making the database's job easier in many cases.
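SQLite happens to share PostgreSQL's ON CONFLICT syntax, so here's a small upsert sketch via sqlite3 (the `stats` table is an invented example): one statement either inserts a fresh row or bumps the existing one, with no separate "check then update" round trip:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (user_id INTEGER PRIMARY KEY, visits INTEGER)")

# One statement covers both cases: insert on first sight,
# update in place on a primary-key conflict.
upsert = """
    INSERT INTO stats (user_id, visits) VALUES (?, 1)
    ON CONFLICT (user_id) DO UPDATE SET visits = visits + 1
"""
conn.execute(upsert, (42,))   # first time: plain insert
conn.execute(upsert, (42,))   # second time: takes the update path
conn.execute(upsert, (7,))

print(conn.execute("SELECT user_id, visits FROM stats ORDER BY user_id").fetchall())
# → [(7, 1), (42, 2)]
```

Because the engine resolves the conflict internally, the upsert also sidesteps the classic race where two clients both check for a row, find nothing, and both try to insert it.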
Additional materials
- Database Internals by Alex Petrov — an underrated book that deserves more attention
- Seven Databases in Seven Weeks by Luc Perkins, Eric Redmond, Jim Wilson