PgHero - Duplicate Indexes

Hi, I am using PgHero as a monitoring tool for my Django project's PostgreSQL database, and it reports a very big number of duplicate indexes (69). A good example is the auth app:

I would just like to ask whether this is a normal state. Does it have any impact on performance? Is it possible to make Django smart enough not to create redundant indexes? Thank you for your attention :slight_smile:


This is very normal - there’s an index for each way of accessing the table. It’ll slightly hurt your write performance and on-disk size, but not enough that I would worry about it in any way; generally, unless you know you’re going to be write-constrained (which is rare and limited to things like rapid trading), the more indexes the better.

Specifically, here it looks like every foreign key has its own implicit index in the many-to-many join table, along with one for the (a, b) column pair. I'd say this is quite normal and actually implicit in the FOREIGN KEY statement if I remember right; I'm not even sure how we could easily turn this off.

I’ve run into the same problem in an open data project (Brasil.IO), in which I’ve made lots of database optimizations. Since I already have lots of indexes, I’d like to delete all the unnecessary ones. The problem does not affect only django.contrib.auth, but any app which uses ManyToManyField. If you have a huge m2m table, then this unnecessary index will use a lot of space despite never being used!

Understanding the Problem

The source of the problem is the way ManyToManyField creates the hidden through model that stores the from/to IDs (see the sketch after this list):

  • It adds two ForeignKeys (one pointing to the “from” model and another to the “to” model), each with its own single-column index (Django creates one automatically for every ForeignKey)
  • It adds Meta.unique_together = (from_, to), which creates a compound unique index on these two fields
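
For reference, what ManyToManyField generates behind the scenes is roughly equivalent to declaring the through model by hand. A minimal sketch (the model and field names here are illustrative, not Django’s internal ones):

from django.db import models


class ToModel(models.Model):
    pass


class FromModel(models.Model):
    # Declaring this m2m field is roughly equivalent to the explicit
    # through model below
    targets = models.ManyToManyField(ToModel, through="FromModelTargets")


class FromModelTargets(models.Model):
    # Each ForeignKey gets its own single-column index
    # (db_index defaults to True)
    frommodel = models.ForeignKey(FromModel, on_delete=models.CASCADE)
    tomodel = models.ForeignKey(ToModel, on_delete=models.CASCADE)

    class Meta:
        # Backed by a compound unique index on (frommodel_id, tomodel_id)
        unique_together = (("frommodel", "tomodel"),)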

If you’re querying for a specific “from” id in the m2m table, the compound index can be used instead of the single-column FK index on the “from” field, so that FK index could be removed.
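
In ORM terms, that single-column lookup is exactly what ordinary m2m traversal produces. A minimal sketch, reusing the illustrative models above (the SQL in the comments is approximate):

# Reading the relation from the "from" side filters the join table by the
# "from" id only:
obj = FromModel.objects.get(pk=1)
list(obj.targets.all())
# Roughly:
#   SELECT ... FROM <tomodel table>
#   INNER JOIN <through table> ON <tomodel table>.id = <through table>.tomodel_id
#   WHERE <through table>.frommodel_id = 1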

Reproducing

Let’s write a simple app to get some metrics:

$ django-admin startapp testapp
# testapp/models.py
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)


class Book(models.Model):
    title = models.CharField(max_length=256)
    author = models.ManyToManyField("Author")

After running python manage.py makemigrations and python manage.py migrate, we can inspect the created m2m table in PostgreSQL:

mydb=# \d testapp_book_author
                              Table "public.testapp_book_author"
  Column   |  Type   | Collation | Nullable |                     Default                     
-----------+---------+-----------+----------+-------------------------------------------------
 id        | integer |           | not null | nextval('testapp_book_author_id_seq'::regclass)
 book_id   | integer |           | not null | 
 author_id | integer |           | not null | 
Indexes:
    "testapp_book_author_pkey" PRIMARY KEY, btree (id)
    "testapp_book_author_author_id_657c9727" btree (author_id)
    "testapp_book_author_book_id_author_id_ae92da66_uniq" UNIQUE CONSTRAINT, btree (book_id, author_id)
    "testapp_book_author_book_id_e2a2a45a" btree (book_id)
Foreign-key constraints:
    "testapp_book_author_author_id_657c9727_fk_testapp_author_id" FOREIGN KEY (author_id) REFERENCES testapp_author(id) DEFERRABLE INITIALLY DEFERRED
    "testapp_book_author_book_id_e2a2a45a_fk_testapp_book_id" FOREIGN KEY (book_id) REFERENCES testapp_book(id) DEFERRABLE INITIALLY DEFERRED

Since testapp_book_author_book_id_author_id_ae92da66_uniq is a compound index that has book_id as its first column, the single-column index testapp_book_author_book_id_e2a2a45a is completely redundant (more info). Let’s verify this claim using EXPLAIN:

If we query for both author_id and book_id, the compound index is used:

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1 AND author_id = 1;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using testapp_book_author_book_id_author_id_ae92da66_uniq on testapp_book_author  (cost=0.15..8.17 rows=1 width=12)
   Index Cond: ((book_id = 1) AND (author_id = 1))

If we query for author_id, the FK index on author_id is used:

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE author_id = 1;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (author_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_author_id_657c9727  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (author_id = 1)
(4 rows)

If we query for book_id, the FK index on book_id is used:

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1;
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (book_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_book_id_e2a2a45a  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (book_id = 1)

But if we disable the FK index on book_id and run the same query, the compound index is used at the same cost, which shows that the single-column index is unnecessary:

mydb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'testapp_book_author_book_id_e2a2a45a'::regclass;
UPDATE 1

mydb=# EXPLAIN SELECT * FROM testapp_book_author WHERE book_id = 1;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on testapp_book_author  (cost=4.23..14.79 rows=10 width=12)
   Recheck Cond: (book_id = 1)
   ->  Bitmap Index Scan on testapp_book_author_book_id_author_id_ae92da66_uniq  (cost=0.00..4.23 rows=10 width=0)
         Index Cond: (book_id = 1)

Wasted Database Space

Database space is gold. Let’s check how this behavior affects the whole database size. First, let’s re-enable the index:

mydb=# UPDATE pg_index SET indisvalid = TRUE WHERE indexrelid = 'testapp_book_author_book_id_e2a2a45a'::regclass;
UPDATE 1

And populate the database with a few million rows (from python manage.py shell, for instance):

import math

from testapp.models import Author, Book

total_authors = 100_000
books_per_pair = 10
bulk_size = 100

total_iters = math.ceil(total_authors / bulk_size)
for number in range(1, total_iters + 1):
    authors = Author.objects.bulk_create([
        Author(first_name=f"John #{number * (bulk_size - 1) + n}", last_name="Doe")
        for n in range(bulk_size)
    ])
    for author_1, author_2 in zip(authors, authors[1:]):
        # Create new books
        books = Book.objects.bulk_create([
            Book(title=f"Book #{n} by #{author_1.id} + #{author_2.id}")
            for n in range(books_per_pair)
        ])
        # Assign these books to this pair of authors
        Book.author.through.objects.bulk_create(
            [
                Book.author.through(book_id=book.id, author_id=author_1.id)
                for book in books
            ]
            +
            [
                Book.author.through(book_id=book.id, author_id=author_2.id)
                for book in books
            ]
        )
    print(f"\r{number}/{total_iters}", flush=True, end="")
print()

(this will take a few minutes to run)

I had to stop the script before it finished, but it had already inserted more than 100k authors, almost 1M books, and almost 2M m2m rows:

In [1]: Author.objects.count()
Out[1]: 100100

In [2]: Book.objects.count()
Out[2]: 990200

In [3]: Book.author.through.objects.count()
Out[3]: 1980400

Now, we can finally check the total sizes of testapp’s indexes:

mydb=# SELECT indexrelid::regclass AS index, pg_size_pretty(pg_relation_size(indexrelid)) AS size FROM pg_index WHERE indexrelid::regclass::text LIKE 'testapp_%';
                        index                        |  size   
-----------------------------------------------------+---------
 testapp_author_pkey                                 | 2208 kB
 testapp_book_pkey                                   | 21 MB
 testapp_book_author_pkey                            | 42 MB
 testapp_book_author_book_id_author_id_ae92da66_uniq | 42 MB
 testapp_book_author_author_id_657c9727              | 42 MB
 testapp_book_author_book_id_e2a2a45a                | 42 MB

So, for ~2M relationships, we have 42 MB of wasted space in testapp_book_author_book_id_e2a2a45a.

I’ve been working on projects with tables 10x this size in number of rows. The cost per database GB on Heroku PostgreSQL, for instance, goes from 1.28 USD to 2.34 USD, depending on the plan. With this optimization, not only will the cost be lower, but also the insert time (I have tasks that bulk insert millions of rows per day).
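
In the meantime, for projects that can’t wait for a change in Django, one possible workaround (a sketch only, assuming an explicit through model is acceptable for your app; BookAuthor is just an illustrative name) is to declare the through model yourself, turn off the single-column index on the “from” side, and let the unique_together compound index serve those lookups:

# testapp/models.py
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)


class Book(models.Model):
    title = models.CharField(max_length=256)
    author = models.ManyToManyField("Author", through="BookAuthor")


class BookAuthor(models.Model):
    # db_index=False suppresses the redundant single-column index; the
    # compound index below still covers lookups by book_id because book
    # is its leading column
    book = models.ForeignKey("Book", on_delete=models.CASCADE, db_index=False)
    author = models.ForeignKey("Author", on_delete=models.CASCADE)

    class Meta:
        unique_together = (("book", "author"),)

Keep in mind that an explicit through model changes how the relation can be used (if I remember right, manager methods like .add() and .set() are restricted on older Django versions), so this is a trade-off rather than a drop-in replacement.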

Fixing the problem (Optimizing ManyToManyField)

I’ve read the related Django code and think I have a simple solution. I’m willing to create a pull request for this optimization, but I don’t know if and how I should proceed. Should I create a ticket on code.djangoproject.com first, or open the pull request directly?
