I’ve run into the same problem in an open data project (Brasil.IO), where I’ve made lots of database optimizations, and since I already have lots of indexes, I’d like to delete all the unnecessary ones. The problem does not affect only django.contrib.auth, but any app that uses ManyToManyField. If you have a huge m2m table, then this unnecessary index will use a lot of space and will never be used!
Understanding the Problem
The source of the problem is the way ManyToManyField creates the intermediate model that stores the from/to IDs:
- It adds two ForeignKeys (one for the “from” field and another for the “to” field), each one getting its own index (this is automatic for ForeignKey)
- It adds Meta.unique_together = (from_, to), which creates a compound unique index on these two fields
If you’re querying for a specific “from” id inside the m2m table, the compound index can be used instead of the FK index on the “from” field, so the latter index could be removed (see the sketch below).
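To make that concrete, here is roughly what the auto-created through model corresponds to if it were written by hand. This sketch is only illustrative: FromModel and ToModel are placeholder names, not the actual code Django generates.
# Illustrative hand-written equivalent of the auto-created m2m "through" model.
# FromModel and ToModel are placeholders for the two related models.
from django.db import models

class FromModelToModel(models.Model):
    from_model = models.ForeignKey("FromModel", on_delete=models.CASCADE)  # gets its own index automatically
    to_model = models.ForeignKey("ToModel", on_delete=models.CASCADE)      # gets its own index automatically

    class Meta:
        unique_together = (("from_model", "to_model"),)  # compound unique index (from, to)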
Reproducing
Let’s write a simple app to get some metrics:
$ django-admin startapp testapp
# testapp/models.py
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)


class Book(models.Model):
    title = models.CharField(max_length=256)
    author = models.ManyToManyField("Author")
After running python manage.py makemigrations and python manage.py migrate, we can inspect the created m2m table in PostgreSQL:
mydb=# \d testapp_book_author
Table "public.testapp_book_author"
Column | Type | Collation | Nullable | Default
-----------+---------+-----------+----------+-------------------------------------------------
id | integer | | not null | nextval('testapp_book_author_id_seq'::regclass)
book_id | integer | | not null |
author_id | integer | | not null |
Indexes:
"testapp_book_author_pkey" PRIMARY KEY, btree (id)
"testapp_book_author_author_id_657c9727" btree (author_id)
"testapp_book_author_book_id_author_id_ae92da66_uniq" UNIQUE CONSTRAINT, btree (book_id, author_id)
"testapp_book_author_book_id_e2a2a45a" btree (book_id)
Foreign-key constraints:
"testapp_book_author_author_id_657c9727_fk_testapp_author_id" FOREIGN KEY (author_id) REFERENCES testapp_author(id) DEFERRABLE INITIALLY DEFERRED
"testapp_book_author_book_id_e2a2a45a_fk_testapp_book_id" FOREIGN KEY (book_id) REFERENCES testapp_book(id) DEFERRABLE INITIALLY DEFERRED
Since testapp_book_author_book_id_author_id_ae92da66_uniq is a compound index that has book_id as its first field, the index testapp_book_author_book_id_e2a2a45a is completely useless (more info). Let’s verify this claim using EXPLAIN:
If we query for both author_id and book_id, the compound index is used:
mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1 AND author_id = 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Scan using testapp_book_author_book_id_author_id_ae92da66_uniq on testapp_book_author (cost=0.15..8.17 rows=1 width=12)
Index Cond: ((book_id = 1) AND (author_id = 1))
If we query for author_id, the FK index on author_id is used:
mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE author_id = 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on testapp_book_author (cost=4.23..14.79 rows=10 width=12)
Recheck Cond: (author_id = 1)
-> Bitmap Index Scan on testapp_book_author_author_id_657c9727 (cost=0.00..4.23 rows=10 width=0)
Index Cond: (author_id = 1)
(4 rows)
If we query for book_id, the FK index on book_id is used:
mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on testapp_book_author (cost=4.23..14.79 rows=10 width=12)
Recheck Cond: (book_id = 1)
-> Bitmap Index Scan on testapp_book_author_book_id_e2a2a45a (cost=0.00..4.23 rows=10 width=0)
Index Cond: (book_id = 1)
But if we disable the FK index on book_id and query for book_id, the compound index is used at the same cost, which shows the single-column index is unnecessary:
mydb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'testapp_book_author_book_id_e2a2a45a'::regclass;
UPDATE 1
mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on testapp_book_author (cost=4.23..14.79 rows=10 width=12)
Recheck Cond: (book_id = 1)
-> Bitmap Index Scan on testapp_book_author_book_id_author_id_ae92da66_uniq (cost=0.00..4.23 rows=10 width=0)
Index Cond: (book_id = 1)
Wasted Database Space
Database space is gold. Let’s check how this behavior affects the whole database size. First, let’s re-enable the index:
mydb=# UPDATE pg_index SET indisvalid = TRUE WHERE indexrelid = 'testapp_book_author_book_id_e2a2a45a'::regclass;
UPDATE 1
And populate the database with a few million rows:
import math

from testapp.models import Author, Book

total_authors = 100_000
books_per_pair = 10
bulk_size = 100
total_iters = math.ceil(total_authors / bulk_size)

for number in range(1, total_iters + 1):
    authors = Author.objects.bulk_create([
        Author(first_name=f"John #{number * (bulk_size - 1) + n}", last_name="Doe")
        for n in range(bulk_size)
    ])
    for author_1, author_2 in zip(authors, authors[1:]):
        # Create new books
        books = Book.objects.bulk_create([
            Book(title=f"Book #{n} by #{author_1.id} + #{author_2.id}")
            for n in range(books_per_pair)
        ])
        # Assign these books to this pair of authors
        Book.author.through.objects.bulk_create(
            [
                Book.author.through(book_id=book.id, author_id=author_1.id)
                for book in books
            ]
            +
            [
                Book.author.through(book_id=book.id, author_id=author_2.id)
                for book in books
            ]
        )
    print(f"\r{number}/{total_iters}", flush=True, end="")
print()
(this will take a few minutes to run)
I had to stop the script shortly before it finished, but it had already inserted more than 100k authors, almost 1M books, and almost 2M m2m rows:
In [1]: Author.objects.count()
Out[1]: 100100
In [2]: Book.objects.count()
Out[2]: 990200
In [3]: Book.author.through.objects.count()
Out[3]: 1980400
Now, we can finally check the total sizes of testapp’s indexes:
mydb=# SELECT indexrelid::regclass AS index, pg_size_pretty(pg_relation_size(indexrelid)) AS size FROM pg_index WHERE indexrelid::regclass::text LIKE 'testapp_%';
index | size
-----------------------------------------------------+---------
testapp_author_pkey | 2208 kB
testapp_book_pkey | 21 MB
testapp_book_author_pkey | 42 MB
testapp_book_author_book_id_author_id_ae92da66_uniq | 42 MB
testapp_book_author_author_id_657c9727 | 42 MB
testapp_book_author_book_id_e2a2a45a | 42 MB
So, for 2M relationships, we have 42MB of wasted space in testapp_book_author_book_id_e2a2a45a.
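As an aside (this snippet is mine, not part of the measurements above), a quick way to see which other tables in a project are affected is to list every auto-created ManyToManyField through model, since each of them gets the index layout described above:
# Rough sketch, assuming it runs inside "python manage.py shell": print the
# database tables of all auto-created ManyToManyField through models.
from django.apps import apps

for model in apps.get_models():
    for m2m_field in model._meta.many_to_many:
        through = m2m_field.remote_field.through
        if through._meta.auto_created:
            print(f"{model._meta.label}.{m2m_field.name} -> {through._meta.db_table}")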
I’ve been working on projects with tables 10x this size in row count. The cost per GB of database storage on Heroku PostgreSQL, for instance, goes from 1.28 USD to 2.34 USD, depending on the plan. With this optimization, not only would the cost be lower, but also the insert time (I have some tasks bulk inserting millions of rows per day).
Fixing the problem (Optimizing ManyToManyField)
I’ve read the related Django code and think I have a simple solution. I’m willing to create a pull request optimizing this, but I don’t know if and how I should proceed. Should I create a ticket on code.djangoproject.com first, or open the pull request directly?
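In the meantime, a project-level workaround seems to be declaring an explicit through model and suppressing the redundant single-column index with db_index=False. Here is a minimal sketch of that idea using the models from this post; it is just a workaround sketch, not the change I have in mind for Django itself:
# Sketch of a possible user-code workaround (not the proposed Django change):
# declare the through model explicitly and skip the single-column index on the
# "from" side, relying on the compound unique index instead.
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)


class Book(models.Model):
    title = models.CharField(max_length=256)
    author = models.ManyToManyField("Author", through="BookAuthor")


class BookAuthor(models.Model):
    # db_index=False prevents the extra index on book_id; the unique_together
    # index below already starts with book_id, so lookups by book_id still use it.
    book = models.ForeignKey("Book", on_delete=models.CASCADE, db_index=False)
    author = models.ForeignKey("Author", on_delete=models.CASCADE)

    class Meta:
        unique_together = (("book", "author"),)
With this layout, queries filtering on book_id hit the compound (book_id, author_id) index, exactly as the EXPLAIN output above showed.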