bulk_update
Date: 2024-10-26
Django introduced QuerySet.bulk_update
in Django 2.2. It’s a great way to efficiently update a large number of
rows in a single, atomic update.
I’ve encountered numerous situations in which I needed bulk updates,
but Django’s QuerySet.bulk_update
would’ve made the process less efficient and racy because it only
accepts model instances.
Suppose we have the following model:
```python
class Listing(models.Model):
    title = models.TextField()
    score = models.IntegerField()
```

Our goal is to increment the score for a subset of the listings by a specific number for each listing.
```python
listing_ids = [455, 732, 9312, 134, ...]

with transaction.atomic():
    listings = Listing.objects.filter(id__in=listing_ids)
    for listing in listings:
        listing.score += compute_score_increment(listing.id)
    Listing.objects.bulk_update(listings, fields=["score"])
```

There are a couple of problems with this approach:
It’s inefficient. We have to fetch all the affected
Listing objects from the database using a
SELECT query before we can update them.
It suffers from a race condition. In between fetching the
Listing objects from the database and executing the update, the
score might have been modified. We end up incrementing the old
score.
We can fix the second problem using a lock:
```python
with transaction.atomic():
    listings = (
        Listing
        .objects
        .filter(id__in=listing_ids)
        .select_for_update()
    )
    for listing in listings:
        listing.score += compute_score_increment(listing.id)
    Listing.objects.bulk_update(listings, fields=["score"])
```

This ensures that the listings aren’t modified while we’re performing the update, but it’s still highly inefficient. Not only do we have to do two round trips to the database, we’re also keeping the rows locked while we’re performing the update. This prevents other transactions from writing to the same rows and, depending on the scale, can reduce throughput.
For a larger table, we could make this slightly more efficient by
using QuerySet.only
to reduce the number of fields we fetch.
Django’s QuerySet.bulk_update
works by crafting an UPDATE statement that utilizes conditional
expressions.
Our queries actually end up looking something like this:

```python
(
    Listing
    .objects
    .filter(id__in=listing_ids)
    .update(
        score=Case(
            When(id=455, then=Value(40)),
            When(id=732, then=Value(12)),
            When(id=9312, then=Value(28)),
            When(id=134, then=Value(8)),
            default=F("score"),
        ),
    )
)
```

We can avoid the pitfalls of Django’s QuerySet.bulk_update
by crafting such a query ourselves:
```python
cases = [
    When(
        id=listing_id,
        then=F("score") + compute_score_increment(listing_id),
    )
    for listing_id in listing_ids
]
(
    Listing
    .objects
    .filter(id__in=listing_ids)
    .update(
        score=Case(
            *cases,
            default=F("score"),
        ),
    )
)
```

This solves all our problems:
It’s efficient. It does not require an additional round trip to the database.
It’s race free. We’re using an expression to add the increment to the score as the database sees it. The database will correctly lock the affected rows, perform the increment, update the row and release the lock.
In raw SQL, the query above would look like this:

```sql
UPDATE
    myapp_listing
SET score = CASE
    WHEN id = 455 THEN score + 40
    WHEN id = 732 THEN score + 12
    WHEN id = 9312 THEN score + 28
    WHEN id = 134 THEN score + 8
    ELSE score
END
WHERE id IN (455, 732, 9312, 134);
```

The Django docs for QuerySet.bulk_update
warn that such queries can grow very large and recommend carefully tuning
the batch size:
- If updating a large number of columns in a large number of rows, the SQL generated can be very large. Avoid this by specifying a suitable
batch_size.
With our improved version, we no longer have a
batch_size parameter. We should batch the updates ourselves
so that individual queries don’t grow to a ridiculous size. You’ll most
likely see degraded performance with exceptionally large batch
sizes, and beyond that you’ll likely hit the
following limits: