bulk_update
Date: 2024-10-26
Django introduced QuerySet.bulk_update
in Django 2.2. It’s a great way to efficiently update a large number of
rows in a single, atomic update.
I’ve encountered numerous situations in which I needed bulk updates,
but Django’s QuerySet.bulk_update
would’ve made the process less efficient and racy because it only
accepts model instances.
Suppose we have the following model:
```python
class Listing(models.Model):
    title = models.TextField()
    score = models.IntegerField()
```

Our goal is to increment the score for a subset of the listings by a specific number for each listing.
```python
listing_ids = [455, 732, 9312, 134, ...]

with transaction.atomic():
    listings = Listing.objects.filter(id__in=listing_ids)
    for listing in listings:
        listing.score += compute_score_increment(listing.id)
    Listing.objects.bulk_update(listings, fields=["score"])
```

There are a couple of problems with this approach:
It’s inefficient. We have to fetch all the affected
Listing objects from the database using a
SELECT query before we can update them.
It suffers from a race condition. In between fetching the
Listing objects from the database and executing the update, the
score might have been modified. We end up incrementing the old
score.
We can fix the second problem using a lock:
```python
with transaction.atomic():
    listings = (
        Listing
        .objects
        .filter(id__in=listing_ids)
        .select_for_update()
    )
    for listing in listings:
        listing.score += compute_score_increment(listing.id)
    Listing.objects.bulk_update(listings, fields=["score"])
```

This ensures that the listings aren’t modified while we’re performing the update, but it’s still highly inefficient. Not only do we have to do two round trips to the database, we’re also keeping the rows locked while we’re performing the update. This prevents other transactions from writing to the same rows and, depending on the scale, can reduce throughput.
For a larger table, we could make this slightly more efficient by
using QuerySet.only
to reduce the number of fields we fetch.
Django’s QuerySet.bulk_update
works by crafting an UPDATE statement that utilizes conditional
expressions.
Our queries actually end up looking something like this:

```python
(
    Listing
    .objects
    .filter(id__in=listing_ids)
    .update(
        score=Case(
            When(id=455, then=Value(40)),
            When(id=732, then=Value(12)),
            When(id=9312, then=Value(28)),
            When(id=134, then=Value(8)),
            default=F("score"),
        ),
    )
)
```

We can avoid the pitfalls of Django’s QuerySet.bulk_update
by crafting such a query ourselves:
```python
cases = [
    When(
        id=listing_id,
        then=F("score") + compute_score_increment(listing_id),
    )
    for listing_id in listing_ids
]
(
    Listing
    .objects
    .filter(id__in=listing_ids)
    .update(
        score=Case(
            *cases,
            default=F("score"),
        ),
    )
)
```

This solves all our problems:
It’s efficient. It does not require an additional round trip to the database.
It’s race free. We’re using an expression to add the increment to the score as the database sees it. The database will correctly lock the affected rows, perform the increment, update the row and release the lock.
In raw SQL, the query above would look like this:

```sql
UPDATE
    myapp_listing
SET score = CASE
    WHEN id = 455 THEN score + 40
    WHEN id = 732 THEN score + 12
    WHEN id = 9312 THEN score + 28
    WHEN id = 134 THEN score + 8
    ELSE score
END
WHERE id IN (455, 732, 9312, 134);
```

The Django docs for QuerySet.bulk_update
warn that such queries can grow very large and recommend carefully tuning
the batch size:
- If updating a large number of columns in a large number of rows, the SQL generated can be very large. Avoid this by specifying a suitable
batch_size.
With our improved version, we no longer have a
batch_size parameter. We should batch the updates ourselves
so that individual queries don’t grow to a ridiculous size. You’ll most
likely see degraded performance with exceptionally large batch
sizes, and beyond that you’ll likely hit the
following limits: