In Ticket 34521 I proposed adding Python’s __slots__ to many classes inside the template engine. That PR was not merged, for good reason. Using slots prevents adding extra attributes to objects, and at least two cases were known where other packages add attributes to template internals:

- Django Debug Toolbar patches RequestContext to track context processor output.
- django-template-partials patches the Parser to add tracking of partials.
Today, I’d like to propose a more restricted version: we add slots to Node, all its subclasses within Django, and NodeList. I think this strikes the right balance between extensibility and performance, for these reasons:

- These are the most numerous objects within the template system; a typical full page render uses maybe a thousand nodes.
- Nodes are “more internal” than RequestContext, Parser, and others, so they are less likely to have extra attributes attached.
- Complex custom template tags normally create Node subclasses. Python’s __slots__ behaviour means they won’t be slotted without their own explicit __slots__ definition, so they will continue to work without change (illustrated in the sketch below).
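To illustrate that inheritance behaviour, here’s a minimal sketch, not using Django’s actual Node class: a subclass that doesn’t define its own __slots__ still gets a per-instance __dict__, so arbitrary attributes can still be attached to it.

class SlottedBase:
    __slots__ = ("token",)

    def __init__(self, token):
        self.token = token


class CustomNode(SlottedBase):
    # No __slots__ here, so instances get a __dict__ as usual.
    pass


node = CustomNode("hello")
node.extra = "still allowed"  # fine: CustomNode instances have a __dict__

base = SlottedBase("hello")
try:
    base.extra = "not allowed"
except AttributeError as exc:
    print(exc)  # only the slotted base class rejects new attributes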
I have made a draft PR with my proposed change. I’ve measured it and found a ~20% memory saving on real-world templates and a ~6% speedup in rendering on a benchmark.
This speedup is more than the 1% I originally measured in Ticket 34521. I think this is because:
- The proposed PR affects all Node subclasses in Django, rather than just the base class.
- I did a better benchmark.
More details on the benchmarking follow.
First, to measure total memory usage, I used the script below to load all templates under tracemalloc.
tracemalloc script
import os
import time
import tracemalloc
import warnings
from pathlib import Path

from django.template import Context, engines

# Ignore all warnings as some templates trigger them
warnings.simplefilter("ignore")

engine = engines["django"]

tracemalloc.start()

templates = {}
for dir_ in engine.template_dirs:
    dir_ = Path(dir_)
    for root, _, files in os.walk(dir_):
        root = Path(root)
        for file in files:
            template_name = str((root / file).relative_to(dir_))
            if template_name in templates:
                continue
            try:
                templates[template_name] = engine.get_template(str(template_name))
            except Exception:  # some TemplateSyntaxErrors
                pass

print(f"{len(templates)} templates loaded")

snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

total_bytes = sum(stat.size for stat in snapshot.statistics("lineno"))
print(f"Total memory allocation: {total_bytes / 1024 / 1024:.2f}MiB")
Invoked like:
$ ./manage.py shell -c 'import example'
601 templates loaded
Total memory allocation: 14.29MiB
On a real-world client project with 601 templates, I got these results:
- Before: 14.29 MiB
- After: 11.51 MiB (-19%)
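For intuition about where the saving comes from, here’s a rough sketch, separate from the measurement above and using made-up class names: a slotted class stores its attributes in fixed slots rather than a per-instance __dict__, so each instance is smaller. Exact byte counts vary by Python version.

import sys


class UnslottedNode:
    def __init__(self, token, origin):
        self.token = token
        self.origin = origin


class SlottedNode:
    __slots__ = ("token", "origin")

    def __init__(self, token, origin):
        self.token = token
        self.origin = origin


unslotted = UnslottedNode("token", "origin")
slotted = SlottedNode("token", "origin")

# The unslotted instance pays for a per-instance __dict__ on top of the object itself.
print(sys.getsizeof(unslotted) + sys.getsizeof(unslotted.__dict__))
# The slotted instance stores both attributes directly in the object, with no __dict__.
print(sys.getsizeof(slotted))

Multiplied across the thousands of node instances in loaded templates, that per-instance difference is where the overall saving comes from.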
Second, to benchmark rendering speed, I used pyperf for its robust running and comparison capabilities. Because it’s so thorough, runs take a long time, so I only had time to run a small benchmark. I ran:
$ python -m pyperf timeit \
    --setup 'import django
django.setup()
from django.template import Template, Context
template = Template("it is {{ x }}\n" * 100_000)
context = Context({"x": "X"})' \
    'template.render(context)' \
    --inherit-environ DJANGO_SETTINGS_MODULE \
    --rigorous \
    --duplicate 10
This template has a lot of nodes, which will bias the result a bit toward showing improvements from slots, because each render has to read so many node objects from memory. But I found I needed quite a large template to get a stable benchmark with a long enough execution time.
I ran the command with --output before.json on Django’s main branch and --output after.json on my modified branch (about 9 minutes each run), then compared the two results with:
$ python -m pyperf compare_to before.json after.json --table
+-----------+--------+----------------------+
| Benchmark | before | after |
+===========+========+======================+
| timeit | 183 ms | 172 ms: 1.06x faster |
+-----------+--------+----------------------+
The compare_to subcommand checks the difference for statistical significance, so this 6% speedup isn’t just noise.
So, what do we think?