Database instrumentation at a higher layer

I was wondering if anyone had any ideas on instrumenting the ORM at the point where the QuerySet is converted to an SQL statement to be issued to the database.
(Tracking, timing, logging, categorizing)

I write a lot of management commands to perform data transformations and migrations. I’m looking for a way to gather information about my processes over the lifespan of the command.

Ideally, I’d like to gather information at the point in time where the QuerySet is converted to SQL to be issued, so that I still have the original QuerySet available to aid in the categorization process. (I really don’t want to parse SQL.)

I have read both Database Instrumentation and How to Make Always-Installed Django Database Instrumentation (Thanks @adamchainz!) and I think I understand what they’re doing - but they’re doing it at a lower layer than what I’m looking for.
(I’d also prefer to not have to go into every command and make changes to every command file. I’d really like to find / do something at the system level.)

I’m thinking I might be able to write a custom subclass of django.db.models.sql.compiler.SQLCompiler with my desired hooks, and create a custom engine subclassed from the standard PostgreSQL engine to use it, but I’m just starting to read through the code and I don’t quite understand how the pieces all fit together yet.

I’ve checked djangopackages.org and pypi.org and haven’t found any candidates.

Anyone have any thoughts on this?

It kind of sounds like you want to set up application tracing.

Paid services like Datadog have libraries that you can pip install, as well as daemon processes that push the data to Datadog. By default, if you hook it up to your Django application, it will “just work” and instrument your middleware, SQL, cache, and maybe some other things I’m forgetting.

The “free” tracing will only be so good and ultimately you’ll have to go in and find the best points in your code to wrap for tracing.

There are also open-source alternatives such as OpenTracing and OpenTelemetry.
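If you go that route, the Django auto-instrumentation is roughly this (just a sketch, assuming the opentelemetry-instrumentation-django package is installed; call it early, e.g. in manage.py, before Django starts handling requests):

from opentelemetry.instrumentation.django import DjangoInstrumentor

# Patches Django so each handled request produces a trace span.
DjangoInstrumentor().instrument()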

I could be way off with this suggestion.

Hope it helps!

Thanks for the suggestions, but no, that’s not quite the direction I’m going. APM tools don’t provide the specificity that I’m trying to obtain.

The typical monitoring / profiling tools will show me where the time is spent in different functions, which is fine as far as it goes. But we do a lot with dynamic queryset construction and selection, so what I’m looking to do is count and categorize the querysets being used. That’s why I need to find (or create) a hook that lets me analyze querysets at the point at which they’re used to generate a real SQL statement.

Yes, a custom compiler class sounds like the kind of thing you would want. Definitely treading into undocumented/murky waters.

What I can remember is: the SQLCompiler object receives the Query, which is the internal state of a QuerySet. From this you can read some details like the model, and many other attributes (source). The names are a bit confusing because they bridge between the ORM and SQL, so it might be a little hard to categorize querysets as you want.
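To make that concrete, here’s a rough sketch of poking at the Query behind a queryset (Book is a hypothetical model, not something from your project):

from myapp.models import Book  # hypothetical model, for illustration only

qs = Book.objects.filter(published=True)
query = qs.query                    # the Query object a compiler receives
print(query.model)                  # the model class behind the queryset
print(query.get_meta().db_table)    # the table the query targets
print(str(query))                   # approximate SQL, for inspection only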

Good luck and let us know how you get on!

Thanks for the affirmation.

The Query object is exactly what I’m looking for - I’ve already futzed enough with them to know it has enough of what I want for this purpose. My sticking point has been finding the right place (and time) to grab it.

I’m going to dive into this rabbit hole to see how far it goes - and I will report back once I get some results (either way).

Thanks again!

So it turns out to be easier than I thought. I’ve “cargo-culted” a bunch of stuff, but this fits into the category of “it works for me”, so I’m not terribly inclined to worry about it too much.

At the risk of boring you to tears…

  • Created a new app named “apm” and added it to the settings.
  • Changed the database engine to apm.backends (settings sketch below)
  • Created directory apm/backends
  • Created the following files in apm/backends
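For the first two steps, the settings changes look roughly like this (a sketch assuming an otherwise-default PostgreSQL setup; you’ll also need an empty apm/backends/__init__.py so the package imports):

INSTALLED_APPS = [
    # ...
    "apm",
]

DATABASES = {
    "default": {
        "ENGINE": "apm.backends",  # was "django.db.backends.postgresql"
        # NAME, USER, PASSWORD, etc. stay as they were
    }
}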

base.py

from django.db.backends.postgresql import base
from apm.backends.operations import DatabaseOperations

class DatabaseWrapper(base.DatabaseWrapper):
    # Reuse the stock PostgreSQL backend wholesale; only the operations
    # class (and through it, the compiler module) is swapped out.
    ops_class = DatabaseOperations
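For what it’s worth, ops_class is the hook BaseDatabaseWrapper uses to build connection.ops, so overriding it is (as far as I can tell) all that’s needed to redirect compiler lookups.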

compiler.py
The execute_sql method is where the magic happens. For this demo, I’m just printing some information; in reality, this will write to a Redis queue for out-of-band processing and statistics gathering.
The addition of the other compiler classes is me whacking this with a brick to make it all work. No idea whether or not there’s a cleaner way to do this, but quite honestly, very little concern about it.

from django.db.models.sql.compiler import SQLCompiler as SQLC
from django.db.models.sql.compiler import SQLInsertCompiler as SQLIC
from django.db.models.sql.compiler import SQLDeleteCompiler as SQLDC
from django.db.models.sql.compiler import SQLUpdateCompiler as SQLUC
from django.db.models.sql.compiler import SQLAggregateCompiler as SQLAC
from django.db.models.sql.constants import (
    GET_ITERATOR_CHUNK_SIZE, MULTI,
)

class SQLCompiler(SQLC):
    def execute_sql(self, result_type=MULTI, chunked_fetch=False, chunk_size=GET_ITERATOR_CHUNK_SIZE):
        # Run the query as normal, then report on it. self.query is the
        # Query object, so the original queryset details are still available.
        retval = super().execute_sql(result_type=result_type, chunked_fetch=chunked_fetch, chunk_size=chunk_size)
        print("===== Executing query =====")
        print(self.query.base_table)
        print(self.query.table_map)
        print(self.query)
        return retval

# Pass-through subclasses: Django looks compiler classes up by name in
# the module named by compiler_module, so all of these names must exist here.
class SQLInsertCompiler(SQLIC):
    pass

class SQLDeleteCompiler(SQLDC):
    pass

class SQLUpdateCompiler(SQLUC):
    pass

class SQLAggregateCompiler(SQLAC):
    pass

operations.py

from django.db.backends.postgresql.operations import DatabaseOperations as DO

class DatabaseOperations(DO):
    # Point the ORM at our compiler module instead of
    # django.db.models.sql.compiler.
    compiler_module = "apm.backends.compiler"

Ken

Oh great. This post is probably the best documentation on the topic of creating a custom DB backend now :smile:

You can slightly simplify this, since you don’t need your own pass-through subclasses: Django looks the compiler classes up by name in the module named by compiler_module, so re-importing them is enough. The only thing to do is silence flake8’s warning about unused imports with # noqa: F401:

from django.db.models.sql.compiler import SQLCompiler as SQLC
from django.db.models.sql.compiler import (  # noqa: F401
    SQLInsertCompiler,
    SQLDeleteCompiler,
    SQLUpdateCompiler,
    SQLAggregateCompiler,
)
from django.db.models.sql.constants import (
    GET_ITERATOR_CHUNK_SIZE, MULTI,
)

class SQLCompiler(SQLC):
    def execute_sql(self, result_type=MULTI, chunked_fetch=False, chunk_size=GET_ITERATOR_CHUNK_SIZE):
        retval = super().execute_sql(result_type=result_type, chunked_fetch=chunked_fetch, chunk_size=chunk_size)
        print("===== Executing query =====")
        print(self.query.base_table)
        print(self.query.table_map)
        print(self.query)
        return retval
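With that in place, any queryset evaluation goes through the custom compiler. A quick smoke test (Book is a stand-in for whatever model you have handy):

from myapp.models import Book  # hypothetical model; use any model you have

books = list(Book.objects.filter(published=True))
# execute_sql() prints the base table, the table map, and the SQL
# for the query that just ran.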