How to write this raw sql subquery as native django query?

I have a model called Loans, containing (among others) the following fields: LibAccountId, Title, StartDate, EndDate and RemoteId.
So, some records could look like this:

LibAccountId,Title,StartDate,EndDate,RemoteId
1, Book title A, 2021-01-01, 2021-02-05, 1234
1, Book title A, 2021-01-01, 2021-01-21, 1234
1, Book title A, 2021-01-01, 2021-01-14, 1234
5, Book title B, 2021-05-05, 2021-06-01, 5678

I want to make a ModelManager that returns a queryset, where for a unique record (based on LibAccountId, Title, StartDate and - if present - RemoteId), only the ones with the most recent Enddate are returned.
So, this would return:

1, Book title A, 2021-01-01, 2021-02-05, 1234  #  2021-02-05 is most recent
5, Book title B, 2021-05-05, 2021-06-01, 5678

Iā€™ve managed to write native python code to filter this (returning a list of Loan objects), and (my SQL is a bit rusty) a raw SQL query, also returning the desired result.
However, with both approaches, it doesnā€™t give me a Queryset that I can further apply Django ORM methods like filter(), etc.

So, how can I write the following raw SQL query with Djangoā€™s native querylanguage?

class UniqueLoansManager(models.Manager):
    """Make loans that have been extended, distinct to 
       one single loan (the last one)
    """

    def get_queryset(self):
        return super().get_queryset().raw("""
            SELECT *
            FROM Loans
            INNER JOIN (
                SELECT LibAccountId as LAI, Title AS T,StartDate as SD, RemoteId AS RI, MAX(EndDate) AS max_enddate
                From Loans
                GROUP BY LibAccountId, Title, StartDate, RemoteId
            ) last_by_enddate ON
                Loans.LibAccountId = LAI
                AND Loans.Title = T
                AND Loans.StartDate = SD
                AND  (Loans.RemoteId= RI  OR (Loans.RemoteId IS NULL AND RI IS NULL))
                AND Loans.EndDate = last_by_enddate.max_enddate
            ORDER BY Loans.EndDate DESC
            """
        )
      # Problem: return value is a RawQuerySet 

I have worked on SQL for 3 years when I used .Net technology, And I liked SQL command, Really itā€™s magnificent. But I gave me chance to penetrate ORM you will get more robust and secure query by it.

I think this query is not difficult to be a result of ORM,
The Left Outer Join is more difficult but I can make it by ORM.
I have no ORM query for your SQL query now.
But If you think well, and give yourself another chance to try ORM you will have benefits.
You can use(filter, values, values_list, select_related, preftch_related, annotate and aggregate with order_by and distinct) then you get what you want .
Try to print the ORM query to the console by (orm_qs.query) to see the SQL equivalent,
Give us your models and you will get what you want ā€¦
Also when using ORM you can use(Case, When and etc) of SQL commands
Take a look at

Review the docs at Aggregation | Django documentation | Django

Itā€™s not a precise SQL translation, but I think Iā€™ve captured the essence of what youā€™re trying to do with:
Loan.objects.values('lib_account_id', 'title', 'start_date', 'remote_id').annotate(max_enddate=Max('end_date'))

(Yes, Iā€™ve changed your variable names to match Django standard coding conventions. I would suggest you adopt them - it really does make things easier overall.)

3 Likes

Hi Ken,

Thanks a lot for your feedback.
The pointer to the Aggregation documentation is a good read; I was, and still am, a bit overwhelmed by all the different Django query methods and helper functions. But that clears up at least some things.

And indeed, your statement returns the same amount of records (3147) as my original query. It does return a Queryset, so I can further chain filter(...) as I see fit. However, the values(...) method returns a dictionaries instead of Loan objects. I understand values()is needed for the grouping (as the documentation explains: " ā€¦ However [ā€¦] Instead of returning an annotated result for each result in the original QuerySet , the original results are grouped according to the unique combinations of the fields specified in the values() clause. An annotation is then provided for each unique group; the annotation is computed over all members of the group.")

Would there be a way to get ā€˜realā€™ Loan objects returned?
(a workaround would be to do, after the query, a Loan.objects.get(...) on a unique set of the 4 fields, for each of the returned items in the queryset; but that feels hacky-ish)

Finally, I am aware (and using) standard coding conventions for the model fields. However, this is a legacy database (with managed=False), and I copied the SQL query directly as I used it on the real table and field names). But thanks for pointing out.

Not using that query. You could probably do something with a subquery using a conditional expression to annotate a value to each element of the queryset and then use that in a filter to identify which rows to keep. I have no idea what sort of SQL statement would be generated though and youā€™d probably want to run an explain on it to see how bad it might be. (Itā€™s the same basic idea as your get ā€œhackā€ except being performed in a single query, so Iā€™d guess it would be better than that - but I wouldnā€™t know unless I tried it.

Iā€™m not sure what other filters youā€™re looking at adding to this, but it actually might not be too bad to just perform those subsequent filters in Python on that queryset list-of-dict objects. List comprehensions and dict value retrievals are fast enough that you might not see a difference between this ā€œhybridā€ solution and a full SQL query.

Thanks again for your useful feedback.

I wanted to expose the queryset via a UniqueLoansManager model manager, to use it for whatever i see fit. (At this point I am mainly targeting some statistics on e.g. unique loans histograms per year)
It would then be part of my Loan model:

class Loan(model.Models):
    ...
    objects = models.Manager()           # all loans, including extensions
    current_loans = CurrentLoansManager()
    unique_loans = UniqueLoansManager()  #  intermediate loan extensions filtered out

However, your feedback gave me a good headstart; Iā€™ll try to figure it out from here.

Out of my own curiosity, would applying those other filters (e.g. ā€œcurrent_loansā€) first be valid? In other words, could you apply the aggregation to the results of current_loans.all()?

No, that wouldnā€™t make sense in my use cases.

The ā€œcurrent loansā€ represent the most recent items (5 or 20 or so) of loans that are currently borrowed on a library account. I already have that operational for some time, as simple as (to further satisfy your curiosity :slight_smile: ):

class CurrentLoansManager(models.Manager):
    def get_queryset(self):
        return super().get_queryset().filter(libaccountid__lastcheck=F("lastseen"))

libaccountid is a foreign key reference.

On the other hand, the unique_loans represents all loans ever borrowed, excluding intermediate extensions (each loan extension is track as a separate, new loan record where the endddate is different; the common startdate and some other fields, allow to know whether this is really a bunch of extended loans.)

1 Like

As far as I know, reproducing the exact same query with the ORM will be a bit difficult as GROUP BY queries with the Django ORM wonā€™t allow you to select extra fields (a unique identifier of each loan record).

Otherwise you could use the __in field lookup to filter the Loan.objects queryset and select only the records with the idā€™s that match your GROUP BY query.

Fortunately there are more roads leading to Rome with the Django ORM: You could consider using a Exists Subquery query expression like so:

# First define a GROUP BY query to select only the loan records 
# with the most recent end dates
group_by_qs = (
    Loan.objects.values("LibAccountId", "Title", "StartDate", "RemoteId")
    .annotate(max_enddate=Max("EndDate"))
    .order_by()
)

# Now query for the Loans which match GROUP BY query,
# based on LibAccountId, Title, StartDate, RemoteId and max_enddate
qs = Loan.objects.filter(
    Exists(
        group_by_qs.filter(
            LibAccountId=OuterRef("LibAccountId"),
            Title=OuterRef("Title"),
            StartDate=OuterRef("StartDate"),
            RemoteId=OuterRef("RemoteId"),
            max_enddate=OuterRef("EndDate"),
        )
    )
).order_by("-EndDate")

Hi Maerteijn

thank you for your suggestions. I already had a look it it, but ran into a problem with the following line:

It raised an error that EndData field was not in the table (quoting it here from my memory), which disappeared if I remove that line. Still need to look into it further, but havenā€™t found the time yet.

The EndDate field Iā€™ve got from your SQL query above. Maybe you could also post your model definition here as well?

Yes, I understand. I adapted the names to the ones of my model, that didnā€™t fix it (I also checked for typos).
My model looks like this (itā€™s a legacy, imported sqlite table; I am in the processing of going to managed=True and cleaning it up further):

class Loan(models.Model):
    # fmt:off
    loanid = models.AutoField(db_column='LoanId', primary_key=True)  # Field name made lowercase.
    # libaccountid = models.IntegerField(db_column='LibAccountId', blank=True, null=True)  # Field name made lowercase.
    libaccountid = models.ForeignKey(Libraryaccount, on_delete=models.CASCADE, default=1, db_column='LibAccountId')
    lastseen = models.DateTimeField(db_column='LastSeen', blank=True, null=True)  # Field name made lowercase.
    title = models.TextField(db_column='Title', blank=True, null=True)  # Field name made lowercase.
    author = models.TextField(db_column='Author', blank=True, null=True)  # Field name made lowercase.
    startdate = models.DateField(db_column='StartDate', blank=True, null=True)  # Field name made lowercase.
    enddate = models.DateField(db_column='EndDate', blank=True, null=True)  # Field name made lowercase.
    type = models.TextField(db_column='Type', blank=True, null=True)  # Field name made lowercase.
    extendable = models.IntegerField(db_column='Extendable', blank=True, null=True)  # Field name made lowercase.
    branchname = models.TextField(db_column='BranchName', blank=True, null=True)  # Field name made lowercase.
    remoteid = models.TextField(db_column='RemoteId', blank=True, null=True)  # Field name made lowercase.
    # link = models.TextField(db_column='Link', blank=True, null=True)  # Field name made lowercase.
    link = models.URLField(db_column='Link', blank=True, null=True)  # Field name made lowercase.
    cover_link = models.URLField(db_column='CoverLink', blank=True, null=True)
    # fmt:on

    objects = models.Manager()  # The default manager.
    current_loans = CurrentLoansManager()
    unique_loans = UniqueLoansManager()

    class Meta:
        managed = False
        db_table = "Loans"

Side note: You donā€™t need to change the underlying table and column names in the database. The names of Models and Fields donā€™t need to match them. (See the db_table and db_column settings)

1 Like

Tested it with SQLite now and SQLIte is not liking the query that is generated with the Exists keyword.

Maybe a more elegant solution could be:

# Create a subquery which sorts the loans on DateEnd, based on the
# LibAccountId, Title, StartDate and RemoteId fields.
subquery = (
    Loan.objects.filter(
        LibAccountId=OuterRef("LibAccountId"),
        Title=OuterRef("Title"),
        StartDate=OuterRef("StartDate"),
        RemoteId=OuterRef("RemoteId")
    )
    .order_by("-EndDate")
    .values("pk")
)

# Query for the loans which matches the pk of the Loan of the latest EndDate only
Loan.objects.filter(pk=Subquery(subquery[:1]))

Generated SQL:

SELECT "my_project_loan"."id",
       "my_project_loan"."LibAccountId",
       "my_project_loan"."Title",
       "my_project_loan"."StartDate",
       "my_project_loan"."EndDate",
       "my_project_loan"."RemoteId"
FROM "my_project_loan"
WHERE "my_project_loan"."id" =
    (SELECT U0."id"
     FROM "my_project_loan" U0
     WHERE (U0."LibAccountId" = "my_project_loan"."LibAccountId"
            AND U0."RemoteId" = "my_project_loan"."RemoteId"
            AND U0."StartDate" = "my_project_loan"."StartDate"
            AND U0."Title" = "my_project_loan"."Title")
     ORDER BY U0."EndDate" DESC
     LIMIT 1)

(Note I still used your uppercase names)

Great. This really works well (a bit slower than raw sql query, but thatā€™s ok).
Iā€™ve fine-tuned it a bit further using Q() expressions on the remoteid field, as it can also be null, and the RemoteId=OuterRef("RemoteId") doesnā€™t consider these.

So I turned the subquery into:

        subquery = (
            Loan.objects.filter(
                Q(remoteid=OuterRef("remoteid")) | Q(remoteid__isnull=True),
                libaccountid=OuterRef("libaccountid"),
                title=OuterRef("title"),
                startdate=OuterRef("startdate"),
            )
            .order_by("-enddate")
            .values("pk")
        )

Thankā€™s to both of you for your input, effort and persistence :slight_smile:

1 Like

@rabarberski
Hope you are satisfing with ORM query
After your point of view changed