INNER JOIN "item_tag" ON ("item"."id" = "item_tag"."item_id")
WHERE (("item_tag"."tag_id" IN (1) OR "item_tag"."tag_id" IN (2))
  AND ("item_tag"."tag_id" IN (3) OR "item_tag"."tag_id" IN (4)))
Which, understandably, always results in an empty queryset: it looks for a single row of the joined table that satisfies both clauses of the AND condition, which isn’t possible.
What’s a way I can achieve this? Is it even possible to make such a query at database level using the Django ORM?
SELECT DISTINCT "item"."id", ...
FROM "item"
INNER JOIN "item_tag" ON ("item"."id" = "item_tag"."item_id")
INNER JOIN "item_tag" T4 ON ("item"."id" = T4."item_id")
WHERE (("item_tag"."tag_id" IN (1) OR "item_tag"."tag_id" IN (2))
  AND (T4."tag_id" IN (3) OR T4."tag_id" IN (4)))
It makes another join for each AND clause of the condition, and looks for a row in which at least one tag matches the id in the corresponding column.
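The difference between the two queries can be sketched in plain Python (hypothetical item and tag ids, not the real schema). In ORM terms the second query is what chaining two separate `.filter(tags__in=...)` calls produces, since each chained filter on a many-to-many relation adds its own join:

```python
# Hypothetical tag assignments: item id -> set of tag ids.
item_tags = {
    1: {1, 3},  # one tag from group A (1) and one from group B (3)
    2: {1, 2},  # only tags from group A
    3: {4},     # only a tag from group B
}

group_a = {1, 2}
group_b = {3, 4}

# Single-join semantics (one Q object AND'ing both groups): each joined
# row carries exactly ONE tag id, so that one id would have to be in
# both groups at once -- impossible when the groups are disjoint.
single_join = [
    item for item, tags in item_tags.items()
    if any(t in group_a and t in group_b for t in tags)
]

# Chained-filter semantics (one join per group): the item matches if
# SOME tag is in group A and SOME (possibly different) tag is in group B.
chained = [
    item for item, tags in item_tags.items()
    if tags & group_a and tags & group_b
]

print(single_join)  # []
print(chained)      # [1]
```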
This should work, but it looks like it could get terribly inefficient: my most common use case is to have 2-3 conditions AND’ed together, and I will frequently need to ask for something like: Give me 3 items that satisfy this condition, then 5 that satisfy this one, etc. with all conditions having the same form as the one described above.
The problem is that the first part of the query (that is, the inner joins) will be thrown away after each query and re-calculated, whereas sometimes I might have been able to re-use the same result in a subsequent query.
Is there any way to make this more efficient, given that I’ll be working in the order of thousands of items and hundreds of tags?
Is your test that it must be tagged by exactly one of tag1, tag2, tag3, or is it that it must be tagged by at least one of tag1, tag2, tag3? (Yes, in addition to being tagged by one of tag4, tag5, tag6.)
and then return the combined queryset to get a list of items that satisfy all the conditions. The conditions are arbitrary and dynamic, so I’ll need to implement a function that takes those conditions expressed in some sort of syntax and returns a queryset.
My issue with this is: each of those queries could potentially re-use joins of the other ones; instead, here, I’m throwing everything away and re-joining everything on the next query. Is there a way to cache the “first part” of the queries (i.e. the SELECT and INNER JOINs)?
One issue I see with this is that the number of joins depends directly on the conditions being AND’ed: there’s one join per “group” of tags.
First, using order_by('?') could have a larger effect on performance than anything else, depending upon the size of the Item table. It has to read the entire result set, shuffle it, then return the desired subset.
Second, this is Python. Never lose sight of the fact that: ... .filter(tags__in=[tag1, tag2, tag3]) ...
Are you aware of a more efficient way of getting random items from the db?
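One common alternative (a sketch, not necessarily the best fit here) is to skip `ORDER BY RANDOM()` entirely: pull only the matching ids from the database and sample them client-side with `random.sample`, then fetch just the chosen rows. The Django calls in the comments are illustrative; the runnable part below only demonstrates the sampling step on plain ids:

```python
import random

# With Django this would look roughly like:
#   ids = list(rule_qs.values_list('id', flat=True))
#   chosen = random.sample(ids, min(amount, len(ids)))
#   items = Item.objects.filter(id__in=chosen)
# Here we demonstrate the sampling step on a plain list of ids.

matching_ids = [10, 11, 12, 13, 14]  # ids returned by the filtered query
amount = 3

# random.sample picks `k` distinct elements without replacement.
chosen = random.sample(matching_ids, min(amount, len(matching_ids)))
print(len(chosen))  # 3
```

This trades one big sort in the database for transferring the id column, which is usually cheap at the scale of thousands of rows.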
Interesting. So let’s say I have my two model Tag and Item, and I have the two following additional models:
class ItemChoiceRule(models.Model):
    amount = models.PositiveIntegerField()

class ItemChoiceRuleClause(models.Model):
    rule = models.ForeignKey(ItemChoiceRule, on_delete=models.CASCADE, related_name="clauses")
    tags = models.ManyToManyField(Tag)
Those are used to express the conditions we talked about so far. Every ItemChoiceRuleClause is a list of tags, and all the clauses are AND’ed when filtering.
If I want to build a function that takes in a list of rules and returns a queryset that follows those rules, I figured it could look like this:
items = Item.objects.all()
condition = Q()
for rule in rules:
    rule_qs = items
    for clause in rule.clauses.all():
        rule_qs = rule_qs.filter(tags__in=clause.tags.all())
    rule_qs = rule_qs.distinct().order_by('?').values('id')[:rule.amount]
    condition |= Q(id__in=rule_qs)
selected = Item.objects.filter(condition)
Does this look like it might hit the db more than necessary? I tried to simply generalize what we came up with in the above posts.
It really depends upon how many Item are going to be selected and the frequency at which these queries are (likely) being made. If you’re talking about 100 results from which the random selection is to be made, I wouldn’t worry about it. At 1000 I’d get a little concerned. At 10,000, I’m worried. At 100,000, I’m definitely looking for other ways to handle it.
Querysets are lazy. Everything you’re generating here is only going to be one database query. (Granted, one containing ‘n’+1 select statements, but it’s still only one query being submitted to the DB - which is going to be more efficient than ‘n’ separate select statements.)
Not the way you’ve got it written. You’re doing a random selection from within each of the subqueries. So if you’ve got 3 subqueries, that’s three separate randomization processes.
Not specific references, no. I suspect that any targeted answer would need to be dependent upon more specific information regarding the exact data involved, the number and type of these subqueries needing to be generated, how “dynamic” this data is relative to the number of entries being queried and the number and types of tags involved, how frequently this query is being run, etc, etc. (It’s even possible that to achieve reasonable and consistent performance that you might end up needing to redesign your data schema.)
Yeah, but the randomization is being done on a much smaller subset of the items, i.e. Those who meet the condition(s).
I’m on my phone right now so it’s hard to type a lot, but essentially I’m making an elearning app (it’s an extension of the app I’ve already talked about in other threads), in which there’s a certain number of questions (which up to now I’ve called “items”), which are tagged with one or more tags. Teachers and students can create personalized exam mock ups by specifying randomization rules. Here’s what it looks like in practice:
give me 3 questions which are tagged “integrals” AND at least one among “easy” and “medium difficulty”;
give me 5 questions which are tagged “limits” AND “hard”;
give me 2 questions which are tagged with at least one among “derivatives” and “differential calculus.”
And so on. The system then spits out a series of questions accordingly.
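The rule semantics described above can be sketched in plain Python (toy data and a toy rule format, not the actual models): each rule is an amount plus a list of tag groups, and a question matches a rule when, for every group, it carries at least one of that group’s tags.

```python
import random

# Hypothetical questions: id -> set of tags.
questions = {
    "q1": {"integrals", "easy"},
    "q2": {"integrals", "medium difficulty"},
    "q3": {"integrals", "hard"},
    "q4": {"limits", "hard"},
    "q5": {"derivatives"},
}

# Each rule: (amount, list of AND'ed tag groups).
rules = [
    (2, [{"integrals"}, {"easy", "medium difficulty"}]),
    (1, [{"limits"}, {"hard"}]),
    (1, [{"derivatives", "differential calculus"}]),
]

def pick(questions, rules, rng):
    exam = []
    for amount, groups in rules:
        # A question matches if every group contributes at least one tag.
        matching = [q for q, tags in questions.items()
                    if all(tags & group for group in groups)]
        exam += rng.sample(matching, min(amount, len(matching)))
    return exam

exam = pick(questions, rules, random.Random(0))
print(sorted(exam))  # ['q1', 'q2', 'q4', 'q5']
```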
Here’s what the creation of an exam looks like with the app I already made RPReplay_Final1634590978.MP4 - Google Drive (sorry about the UI texts in Italian, hopefully it’s clear enough what’s going on). As you can see, you’re pretty limited in what you can select: a topic and a difficulty distribution.
What I want to do is generalize the concepts of both topics and difficulty level into the concept of tags, to allow for more flexible conditions like the one I exemplified above.
I can share the draft of the data model I’ve come up with; if you feel like helping me see whether I can do better, that would be much appreciated. I really want to get this right, and I truly believe in this project that I’m working on.
I got you. I am not claiming there’s an issue with my design. In fact, this is probably (going to be) the best work I’ve done so far, as I’m striving to use the lessons learned from the mistakes I made in my previous projects to come up with something better. I was asking about this because I haven’t written a single line of code for this project yet—I’m first gathering a comprehensive list of all the possible use cases for the system, because one of the lessons I learned in the past 9 months working on a couple of projects is that one of the main causes for issues isn’t so much a wrong choice made in the coding part as much as not getting the user-requirements right, or having them change throughout the development.
What I’m trying to come up with is a general enough solution that will allow most of the features that could eventually be requested after the system has been created to be implemented as “particular cases” of the already available models and combination of existing features, without having to yank off half of the code base and rewrite from scratch.
Sure, but this time around I want to avoid getting so far into development that I find myself in trouble when I need to make a change because I rushed into writing code, when a little more careful planning at the beginning could easily have prevented the issue(s).
One of the lessons that I’ve learned across 40+ years of software development is that you never have a complete set of user-requirements at the start and that there will always be changes throughout development. It’s the nature of the beast and the reason why the various agile-style development methodologies have evolved. (And also why waterfall methodologies have usually proven to be failures - they all presume that it’s possible to know and define requirements up-front, which isn’t true except in a small, well-defined subset of cases.)