Complex filtering condition on m2m relation

Let’s say I have the following two models:

class Item(models.Model):
    name = models.TextField(...)
    tags = models.ManyToManyField("Tag", related_name="items")

class Tag(models.Model):
    name = models.TextField(unique=True)

I want to query items using conditions that look like this:
Give me items that are tagged with one among [tag1, tag2, tag3] AND one among [tag4, tag5, tag6]

What I’ve been trying is the following:


but it produces the following SQL

SELECT "item"."id"
FROM "item"
    INNER JOIN "item_tag" ON ("item"."id" = "item_tag"."item_id")
WHERE (
        ("item_tag"."tag_id" IN (1) OR "item_tag"."tag_id" IN (2))
    AND ("item_tag"."tag_id" IN (3) OR "item_tag"."tag_id" IN (4))
)
Which, understandably, always results in an empty queryset: it looks for a single row of the joined table that satisfies both clauses of the AND condition at once, which is impossible.

What’s a way I can achieve this? Is it even possible to make such a query at database level using the Django ORM?

Turns out there is a way to do this:


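(The code was again dropped from the post; judging from the two-join SQL below and the equivalent query confirmed later in the thread, it was chained `filter()` calls. A sketch, with a function name of my own invention:)

```python
def tagged_with_one_of_each(qs, group_a, group_b):
    # Each chained .filter() on a many-to-many field gets its own
    # join (hence the T4 alias in the SQL below), so the two
    # conditions are checked against independent joined rows.
    return qs.filter(tags__in=group_a).filter(tags__in=group_b).distinct()
```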
Which generates this SQL

SELECT DISTINCT "item"."id"
FROM "item"
    INNER JOIN "item_tag" ON ("item"."id" = "item_tag"."item_id")
    INNER JOIN "item_tag" T4 ON ("item"."id" = T4."item_id")
WHERE (
        ("item_tag"."tag_id" IN (1) OR "item_tag"."tag_id" IN (2))
    AND (T4."tag_id" IN (3) OR T4."tag_id" IN (4))
)

It makes another join for each AND clause of the condition, and looks for a row in which at least one tag matches the id in the corresponding column.

This should work, but it looks like it could get terribly inefficient: my most common use case has 2–3 conditions AND’ed together, and I will frequently need to ask for something like “give me 3 items that satisfy this condition, then 5 that satisfy this one,” and so on, with all conditions having the same form as the one described above.

The problem is that the first part of the query (that is, the inner joins) will be thrown away after each query and re-calculated, whereas sometimes I might have been able to re-use the same result in a subsequent query.

Is there any way to make this more efficient, given that I’ll be working in the order of thousands of items and hundreds of tags?

Is your test that it must be tagged by exactly one of tag1, tag2, tag3, or is it that it must be tagged by at least one of tag1, tag2, tag3? (Yes, in addition to being tagged by one of tag4, tag5, tag6.)

At least one from each group
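(In plain set terms, this semantics can be restated as: an item matches when its tag set intersects every group. A pure-Python illustration, not code from the thread:)

```python
def item_matches(item_tags, tag_groups):
    # "At least one from each group": the item's tags must
    # intersect every AND'ed group of tags.
    tags = set(item_tags)
    return all(tags & set(group) for group in tag_groups)
```

For example, `item_matches({"tag1", "tag5"}, [["tag1", "tag2", "tag3"], ["tag4", "tag5", "tag6"]])` is `True`, while an item tagged only from the first group fails.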

Have you tried:

Item.objects.filter(tags__in=[tag1, tag2, tag3]).filter(tags__in=[tag4, tag5, tag6])

This creates a query with two inner joins on the tags table and satisfies your conditions as I understand them.

If this is not the correct query, please explain why not - or at least identify a condition for which it does not return the desired results.

It looks like it’s equivalent to mine, so correctness isn’t the issue. What I’m now questioning (I realize I wasn’t very clear about this) is efficiency.

Say I have to make, one after the other, the following queries:

Item.objects.filter(tags__in=[tag1, tag2, tag3]).filter(tags__in=[tag4, tag5, tag6]).order_by('?')[:3]
Item.objects.filter(tags__in=[tag4, tag2]).order_by('?')[:5]
Item.objects.filter(tags__in=[tag1, tag2, tag3, tag4]).filter(tags__in=[tag5, tag6]).filter(tags__in=[tag7, tag8]).filter(tags__in=[tag9]).order_by('?')[:1]

and then return the combined queryset to get a list of items that satisfy all the conditions. The conditions are arbitrary and dynamic, so I’ll need to implement a function that takes those conditions, expressed in some sort of syntax, and returns a queryset.

My issue with this is: each of those queries could potentially re-use joins of the other ones; instead, here, I’m throwing everything away and re-joining everything on the next query. Is there a way to cache the “first part” of the queries (i.e. the SELECT and INNER JOINs)?

One issue I see with this is that the number of joins directly depends on the conditions that are AND’ed—there’s one join per “group” of tags.

First, using order_by('?') could have a larger effect on performance than anything else, depending upon the size of the Item table. It has to read the entire result set, shuffle it, then return the desired subset.

Second, this is Python. Never lose sight of the fact that:
... .filter(tags__in=[tag1, tag2, tag3]) ...

is functionally identical to:

tag_list = [tag1, tag2, tag3]
query_parm = {'tags__in': tag_list}
... .filter(**query_parm) ...

That equivalence makes it trivial to dynamically construct queries.
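(Following that pattern, a whole chain of per-group filters can be built from data. A sketch, where `filter_by_tag_groups` is a made-up helper name:)

```python
def filter_by_tag_groups(qs, tag_groups):
    # One chained .filter() per AND'ed group, each built with the
    # **kwargs form shown above; every call adds its own join.
    for group in tag_groups:
        qs = qs.filter(**{"tags__in": group})
    return qs
```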

Third, don’t forget that querysets are lazy. They’re not actually evaluated until a resultset is requested.

Finally, give the PostgreSQL query planner / optimizer a chance to work for you:

sub_q_1 = Item.objects.filter(tags__in=[tag1, tag2, tag3]).filter(tags__in=[tag4, tag5, tag6]).order_by('?').values('id')[:3]
sub_q_2 = Item.objects.filter(tags__in=[tag4, tag2]).order_by('?').values('id')[:5]
sub_q_3 = Item.objects.filter(tags__in=[tag1, tag2, tag3, tag4]).filter(tags__in=[tag5, tag6]).filter(tags__in=[tag7, tag8]).filter(tags__in=[tag9]).order_by('?').values('id')[:1]
resultset = Item.objects.filter(Q(id__in=sub_q_1) | Q(id__in=sub_q_2) | Q(id__in=sub_q_3))

Are you aware of a more efficient way of getting random items from the db?
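(One commonly suggested alternative, not proposed in this thread: pull only the matching ids, sample in Python, then fetch the chosen rows by id, rather than having the database shuffle full rows. Names are illustrative.)

```python
import random

def sample_matching_ids(qs, k):
    # Fetch just the ids of the matching rows, then sample k of
    # them client-side; the final rows can be fetched afterwards
    # with a plain id__in filter.
    ids = list(qs.values_list("id", flat=True))
    return random.sample(ids, min(k, len(ids)))
```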

Interesting. So let’s say I have my two models Tag and Item, and the following two additional models:

class ItemChoiceRule(models.Model):
    amount = models.PositiveIntegerField()

class ItemChoiceRuleClause(models.Model):
    rule = models.ForeignKey(ItemChoiceRule, on_delete=models.CASCADE, related_name="clauses")
    tags = models.ManyToManyField(Tag)

Those are used to express the conditions we talked about so far. Every ItemChoiceRuleClause is a list of tags, and all the clauses are AND’ed when filtering.

If I want to build a function that takes in a list of rules and returns a queryset that follows those rules, I figured it could look like this:

items = Item.objects.all()
condition = Q()
for rule in rules:
    rule_qs = items
    for clause in rule.clauses.all():
        rule_qs = rule_qs.filter(tags__in=clause.tags.all())
    rule_qs = rule_qs.distinct().order_by('?').values('id')[:rule.amount]
    condition |= Q(id__in=rule_qs)

return Item.objects.filter(condition).distinct()

Does this look like it might hit the db more than necessary? I tried to simply generalize what we came up with in the above posts.

It really depends upon how many Item are going to be selected and the frequency at which these queries are (likely) being made. If you’re talking about 100 results from which the random selection is to be made, I wouldn’t worry about it. At 1000 I’d get a little concerned. At 10,000, I’m worried. At 100,000, I’m definitely looking for other ways to handle it.

Querysets are lazy. Everything you’re generating here is only going to be one database query. (Granted, one containing ‘n’+1 select statements, but it’s still only one query being submitted to the DB - which is going to be more efficient than ‘n’ separate select statements.)

We’re talking at least a couple thousand items to shuffle per query. (*) Do you have any resources to point me to so I can read more in depth about how to do this more efficiently?

(*) I just realized that, since the shuffling happens after all the filters, it’s probably going to be much less than that in fact.

Not the way you’ve got it written. You’re doing a random selection from within each of the subqueries. So if you’ve got 3 subqueries, that’s three separate randomization processes.

Not specific references, no. I suspect that any targeted answer would need to be dependent upon more specific information regarding the exact data involved, the number and type of these subqueries needing to be generated, how “dynamic” this data is relative to the number of entries being queried and the number and types of tags involved, how frequently this query is being run, etc, etc. (It’s even possible that to achieve reasonable and consistent performance that you might end up needing to redesign your data schema.)

Yeah, but the randomization is being done on a much smaller subset of the items, i.e. those that meet the condition(s).

I’m on my phone right now so it’s hard to type a lot, but essentially I’m making an elearning app (it’s an extension of the app I’ve already talked about in other threads), in which there’s a certain number of questions (which up to now I’ve called “items”), which are tagged with one or more tags. Teachers and students can create personalized exam mock ups by specifying randomization rules. Here’s what it looks like in practice:

  • give me 3 questions which are tagged “integrals” AND one among “easy” and “medium difficulty”;
  • give me 5 questions which are tagged “limits” AND “hard”;
  • give me 2 questions which are tagged one among “derivatives” and “differential calculus”.

And so on. The system then spits out a series of questions accordingly.

Right now I am working on an app that implements some of these concepts, but in a more specific fashion, which limits what can be done. Essentially, questions have a topic (fk to a Topic model) and a difficulty (integer) and a “randomization rule” simply has an fk to a topic and a series of fields “amount_easy,” “amount_hard”… sai_training/ at 6c5ce1778ebc3218688161cef05986278e648679 · samul-1/sai_training · GitHub

Here’s what the creation of an exam looks like with the app I already made RPReplay_Final1634590978.MP4 - Google Drive (sorry about the UI texts in Italian, hopefully it’s clear enough what’s going on). As you can see, you’re pretty limited in what you can select: a topic and a difficulty distribution.

What I want to do is generalize the concepts of both topics and difficulty level into the concept of tags, to allow for more flexible conditions like the one I exemplified above.

I can share the data model I’ve come up with, if you’d like to help me see whether I can come up with something better; that would be much appreciated. I really want to get this right, as I truly believe in the project I’m working on.

I wouldn’t consider investing a whole lot of time without knowing there’s a real problem with your current design.

You don’t actually need real data to run some tests. Load up your database with faked data and see what your performance profiles look like.

After that you can then evaluate alternatives.

But keep in mind, nothing is ever really etched in stone. It’s not a mistake to get something working first, then come back later to improve it or address unforeseen limitations.

I got you. I’m not claiming there’s an issue with my design. In fact, this is probably (going to be) the best work I’ve done so far, as I’m striving to apply the lessons learned from the mistakes in my previous projects. I was asking because I haven’t written a single line of code for this project yet: I’m first gathering a comprehensive list of all the possible use cases for the system. One of the lessons I learned over the past 9 months, working on a couple of projects, is that the main cause of issues isn’t so much a wrong choice made while coding as much as not getting the user requirements right, or having them change throughout development.

What I’m trying to come up with is a general enough solution that will allow most of the features that could eventually be requested after the system has been created to be implemented as “particular cases” of the already available models and combination of existing features, without having to yank off half of the code base and rewrite from scratch.

Sure, but this time around I want to avoid getting far into development only to find myself in trouble when I need to make a change because I rushed the code, when a little more careful planning at the beginning could easily have prevented the issue(s).

Thank you for the – as always – valuable advice.

One of the lessons that I’ve learned across 40+ years of software development is that you never have a complete set of user-requirements at the start and that there will always be changes throughout development. It’s the nature of the beast and the reason why the various agile-style development methodologies have evolved. (And also why waterfall methodologies have usually proven to be failures - they all presume that it’s possible to know and define requirements up-front, which isn’t true except in a small, well-defined subset of cases.)
