I have a query which returns up to 200k rows with a Postgres JSON field. The query logic itself has been optimized, but I suspect that performance can be further improved. I am looking to measure how much time is spent parsing the JSON into Python. How would I go about profiling that?
The easiest way would be to use the shell to manually load the data, and then use the Python timeit module to run some experiments.
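For example, a quick sketch from `manage.py shell` (the app, model, and field names below are just placeholders):

```python
# Run inside `python manage.py shell`
import timeit

from myapp.models import Report  # placeholder app/model with a JSONField named "payload"

# Build the queryset inside the lambda so every run hits the database
# (an already-evaluated queryset would just return its cached results).
elapsed = timeit.timeit(lambda: list(Report.objects.all()), number=5)
print(f"full query + conversion to model instances: {elapsed / 5:.3f}s per run")
```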
Ken
Is there a way to drill down into how much time is spent querying the database vs the time spent parsing JSON? Would I need to subclass the field and measure the from_python method?
I’d take a first stab in a different direction. I’d use `values()` to return the data as plain data rather than as model instances, and factor that out of the overall timings.
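Something like this (model and field names are placeholders) shows the difference side by side:

```python
import timeit

from myapp.models import Report  # placeholder

# values() returns plain dicts, skipping model instantiation entirely.
t_values = timeit.timeit(
    lambda: list(Report.objects.values("id", "payload")), number=5
)

# The normal queryset builds full model instances.
t_models = timeit.timeit(lambda: list(Report.objects.all()), number=5)

print(f"values(): {t_values / 5:.3f}s   model instances: {t_models / 5:.3f}s")
```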
Cool, thanks for the idea.
You could also go the raw query route, where you time the SQL itself rather than any of the layers in the ORM.
Either way, once you retrieve the data, you can then run separate timings to see how quickly they’re deserialized into object instances.
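A rough sketch of that split, using placeholder table and model names:

```python
import time

from django.db import connection
from myapp.models import Report  # placeholder

# 1. Time the SQL round trip on its own.
with connection.cursor() as cursor:
    start = time.perf_counter()
    cursor.execute("SELECT id, payload FROM myapp_report")  # placeholder table/columns
    rows = cursor.fetchall()
    print(f"execute + fetch: {time.perf_counter() - start:.3f}s")
    # Note: for a json column, psycopg2 has already parsed the values by this point.

# 2. Separately time turning those rows into model instances.
start = time.perf_counter()
instances = [Report(id=row[0], payload=row[1]) for row in rows]
print(f"model instantiation: {time.perf_counter() - start:.3f}s")
```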
For production data timing, you can also install an APM solution (Scout, New Relic, etc.) that constantly tracks performance of queries across your site, so you always have data.
We have that in place; the slow query logs are surfaced from Aurora to Datadog and AWS QuerySite.
@KenWhitesell Don’t `values` and `values_list` both parse the JSON into Python, just not as an ORM instance? I don’t have to call `json.loads` on a `values_list`.
Could be. We’ve gone beyond my scope of knowledge now; everything past this point is conjecture on my part. I seem to remember that it’s psycopg2 that is doing the raw conversion rather than anything in the ORM.
(From the psycopg2 docs on JSON: Reading from the database, `json` and `jsonb` values will be automatically converted to Python objects.)
It goes on to say that you can bypass this process by casting the column to a `text` field.
So if you need to do more fine-grained analysis of the timing, you might want to run the raw queries with and without casting the column. (You could then also run the raw query through psql to factor out the remaining overhead introduced by psycopg2 and Python.)
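For example, a sketch along these lines (placeholder table/column names) isolates the conversion cost:

```python
import json
import time

from django.db import connection

def timed_fetch(sql):
    with connection.cursor() as cursor:
        start = time.perf_counter()
        cursor.execute(sql)
        rows = cursor.fetchall()
        return rows, time.perf_counter() - start

# Default json column: psycopg2 converts each value to Python objects as it fetches.
_, with_conversion = timed_fetch("SELECT payload FROM myapp_report")

# Cast to text: the driver hands back raw strings and skips the conversion.
raw_rows, without_conversion = timed_fetch("SELECT payload::text FROM myapp_report")

# Time json.loads separately to see the parsing cost in isolation.
start = time.perf_counter()
parsed = [json.loads(row[0]) for row in raw_rows]
loads_time = time.perf_counter() - start

print(f"json: {with_conversion:.3f}s  text: {without_conversion:.3f}s  json.loads: {loads_time:.3f}s")
```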
If the total amount of data is large enough to warrant concern, you might even want to evaluate the network overhead involved if you’re not running this on the same system as the database. (And if you are on the same system, you might want to look at the difference between using a TCP port through the loopback address vs a unix socket, if you’re not already using one.)
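If you do want to try the socket route, the settings change is small; a sketch with made-up paths and credentials:

```python
# settings.py
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        "USER": "me",
        # A directory path here makes psycopg2 connect over the local unix
        # socket instead of TCP through the loopback address.
        "HOST": "/var/run/postgresql",
    }
}
```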
Ken
I would first attempt to determine whether the big problem is in the database or not.
In the past, I have found these articles helpful to try and figure that out:
In some recent work, using `queryset.filter().values('field1')` resulted in big improvements for me. This selects less data from the database.
@KenWhitesell I now see the docs are also explicit that the conversion happens at a lower level than Django.
I executed the query directly against the `psycopg2` client with json (the default) and again with the json cast as a string (`json_field::text`). Without the json -> Python conversion, the query is ~3x faster than with the Python conversion.
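Roughly, that comparison against `psycopg2` directly looks like this (connection string and table name are made up):

```python
import time

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # made-up connection string
cur = conn.cursor()

for sql in (
    "SELECT json_field FROM my_table",        # json values parsed into Python objects
    "SELECT json_field::text FROM my_table",  # returned as raw strings, no parsing
):
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()
    print(f"{sql}: {time.perf_counter() - start:.3f}s")

cur.close()
conn.close()
```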
Thanks for your help.
I’m going to look into different serialization options that might have better performance.
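One option is swapping the function psycopg2 uses to decode `json`/`jsonb` values, e.g. (a sketch, assuming a third-party parser such as `orjson` is acceptable):

```python
import orjson
import psycopg2.extras

# Register a faster loads() globally so json/jsonb values coming back from
# the database are decoded with orjson instead of the stdlib json module.
psycopg2.extras.register_default_json(globally=True, loads=orjson.loads)
psycopg2.extras.register_default_jsonb(globally=True, loads=orjson.loads)
```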
Yep, I’m selecting only what I need with `values_list`, and being specific within the JSON field (`field__key`) as well.
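For reference, that looks something like this (placeholder names):

```python
from myapp.models import Report  # placeholder

# Pull back a single key from each JSON document rather than the whole blob.
totals = list(Report.objects.values_list("payload__total", flat=True))
```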
Thanks for the links.