Lesson learned--Check your queries' SQL

I want to share with the community something I learned today, after spending a lot of time (inefficiently) chasing a bug.

My app implements an online school system that allows students to take tests made up of multiple questions. I’m working, among the others, with these models (I redacted most of the fields that weren’t relevant to the issue):

class Question(models.Model):
    text = models.TextField()
    exam = models.ForeignKey(
        Exam, null=True, on_delete=models.SET_NULL, related_name="questions"
    )

class ExamProgress(models.Model):
    user = models.ForeignKey(
        settings.AUTH_USER_MODEL,
        related_name="exams_progress",
        on_delete=models.CASCADE,
    )
    exam = models.ForeignKey(
        Exam, on_delete=models.CASCADE, related_name="participations"
    )
    questions = models.ManyToManyField(
        Question,
        through="ExamProgressQuestionsThroughModel",
        related_name="question_assigned_in_exams",
        blank=True,
    )

    current_item_cursor = models.PositiveIntegerField(
        default=0
    )  # current item being displayed


class ExamProgressQuestionsThroughModel(models.Model):
    ordering = models.PositiveIntegerField()
    exam_progress = models.ForeignKey(ExamProgress, on_delete=models.CASCADE)
    question = models.ForeignKey(Question, on_delete=models.CASCADE)

    class Meta:
        ordering = ["ordering"]

The model ExamProgress represents an instance of a student’s session during an exam. Different students will see different questions in general, because they are randomized, and only one question is shown at once. Therefore, we have a m2m field that holds the questions assigned to the user, and we use a through model to keep track of the ordering. To access a question, we get the one that has the field ordering equal to the value of current_item_cursor in ExamProgress.

After the exam, my app generates pdf reports for the participants, showing the questions they were assigned and the answers they gave.

Let’s say I get a participant’s questions like this:

# this is inside an ExamProgress method, so `self` is an instance of ExamProgress
questions = (
        self.exam.questions.filter(pk__in=self.questions.all())
        .prefetch_related("answers")
        .prefetch_related("given_answers")
)

This is mostly fine, except I want to retain the order in which they were assigned to the user, i.e. I want to order by the through model’s ordering field.

Fine, let’s add:

questions = (
        self.exam.questions.filter(pk__in=self.questions.all())
        .order_by("examprogressquestionsthroughmodel__ordering") # <-----
        .prefetch_related("answers")
        .prefetch_related("given_answers")
)

Now, fast forward to this morning–the exam went by, at the end I hit the button to generate the pdf and, lo and behold, I got a bunch of pdf files with about 200 questions per user! Note that the exam only contained 14 questions per user.

I was extremely confused and didn’t really know what was going on. All I knew was that the right amount of questions was returned if I omitted the order_by method. Another clue I eventually picked up on was that the correct number of questions was also returned if only one user had taken the exam.

It was at this time that I had the idea: let’s check the generated SQL by the query with and without the order by clause.

This is the generated SQL without it:

SELECT "jsplatform_question"."id",
    "jsplatform_question"."text",
    "jsplatform_question"."exam_id",
FROM "jsplatform_question"
WHERE (
        "jsplatform_question"."exam_id" = 10
        AND "jsplatform_question"."id" IN (
            SELECT U0."id"
            FROM "jsplatform_question" U0
                INNER JOIN "jsplatform_examprogressquestionsthroughmodel" U1 ON (U0."id" = U1."question_id")
            WHERE U1."exam_progress_id" = 294
        )
    )
ORDER BY "jsplatform_question"."id" ASC

Now this is the generated SQL with the order_by:

SELECT "jsplatform_question"."id",
    "jsplatform_question"."text",
    "jsplatform_question"."exam_id",
FROM "jsplatform_question"
    LEFT OUTER JOIN "jsplatform_examprogressquestionsthroughmodel" ON (
        "jsplatform_question"."id" = "jsplatform_examprogressquestionsthroughmodel"."question_id"
    )
WHERE (
        "jsplatform_question"."exam_id" = 10
        AND "jsplatform_question"."id" IN (
            SELECT U0."id"
            FROM "jsplatform_question" U0
                INNER JOIN "jsplatform_examprogressquestionsthroughmodel" U1 ON (U0."id" = U1."question_id")
            WHERE U1."exam_progress_id" = 294
        )
    )
ORDER BY "jsplatform_examprogressquestionsthroughmodel"."ordering" ASC

After inspecting the second query’s SQL, I realized what was really going on: ExamProgress was being left-joined with the through model and the results were correctly filtered by exam, but then all the rows of this joined table were being returned, meaning I would get an entry in my resulting queryset for each time a question assigned to that ExamProgress had been assigned to any ExamProgress. What really needed to happen was filtering on the results of the joined table too.

After gaining this knowledge thanks to inspecting the generated SQL, I went back to my python code and added this:

questions = (
            self.exam.questions.filter(pk__in=self.questions.all())
            .order_by("examprogressquestionsthroughmodel__ordering")
            .filter(
                examprogressquestionsthroughmodel__exam_progress=self
            )  # <---------
            .prefetch_related("answers")
            .prefetch_related("given_answers")
)

which in turn added a condition to the outermost WHERE clause:

...
WHERE (
        "jsplatform_question"."exam_id" = 10
        AND "jsplatform_question"."id" IN (
            SELECT U0."id"
            FROM "jsplatform_question" U0
                INNER JOIN "jsplatform_examprogressquestionsthroughmodel" U1 ON (U0."id" = U1."question_id")
            WHERE U1."exam_progress_id" = 294
        )
        AND "jsplatform_examprogressquestionsthroughmodel"."exam_progress_id" = 294
    )

So today I learned two lessons that might come in handy to some of you in the future:

  • when you query a many-to-many relationship and you order by a field that is in the through model, rather than the target model of the relationship, what you get back from the db is actually the rows of a joined table between the target model and the through model (as opposed to, say, the joined table only being used to get the ordering and being discarded later)
  • when your queries aren’t behaving like you expect them to, a valuable option is to check the resulting SQL and see if it gives you any clues.

Thoughts and comments are appreciated :slight_smile: I’m putting this out here just in case anybody with a similar issue bumps into this in the future.

2 Likes

Hi Samuele,

This was a nice write up and a great example of needing to have an understanding of what the ORM is doing. I don’t think I’ve been bitten by this exact issue, but plenty of others that are similar.

I’d like to plug the Django Debug Toolbar as a utility that can be useful in determining SQL inefficiencies in the flow of a web app. It doesn’t work as well with background processes so I’m not sure if you could have actually used it for this particular issue.

-Tim

1 Like