How to optimize this query without raw?

Hey folks,

I’m in the process of optimizing some filter queries that join span several tables and noticed a repeating pattern of including “unnecessary” tables among joins. I found no obvious way other than using raw(), but I’d very much like to keep using filter() if possible. Does anyone better versed in Django queries have suggestions? Thanks!

class Person(models.Model):
    pass


class Dog(models.Model):
    owner = models.ForeignKey(Person, on_delete=models.DO_NOTHING)


class Cat(models.Model):
    owner = models.ForeignKey(Person, on_delete=models.DO_NOTHING)

person = Person.objects.create()
someone_else = Person.objects.create()
cat = Cat.objects.create(owner=person)
cat_of_someone_else = Cat.objects.create(owner=someone_else)
dog = Dog.objects.create(owner=person)
dog_of_someone_else = Dog.objects.create(owner=someone_else)

# Let's find dogs owned by the same owner as the owner of this cat

self.assertEqual(
    [dog],
    list(
        Dog.objects.filter(
            owner__cat=cat
        )
    )
)

# Generates a JOIN that includes owners
#
# SELECT "my_app_dog"."id", "my_app_dog"."owner_id"
# FROM "my_app_dog"
#          INNER JOIN "my_app_person"
#                     ON ("my_app_dog"."owner_id" = "my_app_person"."id")
#          INNER JOIN "my_app_cat"
#                     ON ("my_app_person"."id" = "my_app_cat"."owner_id")
# WHERE "my_app_cat"."id" = 1

# But the same can be achieved without joining owners
self.assertEqual(
    [dog],
    list(
        Dog.objects.raw("""
            SELECT * 
            FROM my_app_dog
            INNER JOIN my_app_cat
                ON my_app_dog.owner_id = my_app_cat.owner_id 
            WHERE
                my_app_cat.id = %s  
        """, [cat.id])
    )
)

# Is there a way to achieve this without raw?
# Can I set up a model relation that allows me to do
# something like this?
# Dog.objects.filter(
#     cats_by_the_same_owner__id=cat.id
# )

Hi,

I think you can try:

dog = Dog.objects.filter(onwer_id=cat.owner_id)
1 Like

Thanks, I may have oversimplified the example. The actual filter is a long chain of JOIN, in the middle of which dogs are joined with cats via owners, like this:

.filter(
            some__more__stuff__dog__owner__cat__even_more___relations__id=id
        )

…which I would like to turn into something that does not result in JOIN person:

.filter(
            some__more__stuff__dog__cats_by_the_same_owner__cat__even_more___relations__id=id
        )

…or back to my original example refined a bit: given a cat_id, I want to get dogs without JOIN person:

        Dog.objects.filter(
            owner__cat__id=cat_id
        )

To directly answer your question, no. The Django ORM does not provide a facility to do joins between tables based on a non-related field.

From a practical perspective, have you done an explain analyze on both of your queries to actually prove that one is better than the other? I wouldn’t second-guess the query planner without some solid evidence that the query being executed is actually using all the resources you think it’s using.

I have done some explain analyze and although this kind of optimization does not usually come first, below are benchmarks from a real query I’m working on (not the dogs/cats example). The original Django-generated query joins 7 tables, the optimized raw query only joins 4:

Before:

Explain analyze in production:
Planning time: 7-10ms
Execution time: 25-70ms, avg ~35ms
pgbench with 1000 repeats, non-production system with a clone of production data:
latency average = 15.315 ms
tps = 65.295739 (without initial connection time)

After:

Explain analyze in production:
Planning time: 2ms
Execution time: 18-40ms, avg ~22ms
pgbench with 1000 repeats, non-production system with a clone of production data:
latency average = 2.580 ms
tps = 387.647384 (without initial connection time)

That seems like a significant gain. The results might of course still be different in the wild and if there was an easy way of keeping a dozen lines of Python code and not turning it into ~80 or more lines of raw SQL I’d go for it without thinking much (and would come to a final conclusion after having collected performance monitoring data).

As there seems to be no Django-way to doing this, I might look elsewhere for optimization opportunities.

Thanks for the insights everyone!

1 Like