Add __slots__ to template Node classes (only)

In Ticket 34521 I proposed adding Python’s __slots__ to many classes inside the template engine.

That PR was not merged, for good reason. Using slots prevents adding extra attributes to objects, and at least two third-party packages were known to add attributes to template internals:

  1. Django Debug Toolbar patches RequestContext to track context processor output.
  2. django-template-partials patches the Parser to add tracking of partials.
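
To illustrate the problem, here is a minimal sketch of how slots block this kind of patching. PlainNode and SlottedNode are hypothetical stand-ins, not Django classes, and debug_info is an invented attribute name:

```python
class PlainNode:
    """Hypothetical unslotted class: instances carry a __dict__."""

    def __init__(self, token):
        self.token = token


class SlottedNode:
    """Hypothetical slotted class: only the names in __slots__ can be set."""

    __slots__ = ("token",)

    def __init__(self, token):
        self.token = token


plain = PlainNode("x")
plain.debug_info = {"seen": True}  # works: stored in plain.__dict__

slotted = SlottedNode("x")
try:
    slotted.debug_info = {"seen": True}  # no __dict__ to store this in
    patched = True
except AttributeError:
    patched = False

print(f"Extra attribute accepted on slotted instance: {patched}")
```

Any tool that attaches its own attributes to instances this way would break if the class it patches became slotted.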

Today, I’d like to propose a more restricted version: we add slots to Node, all its subclasses within Django, and NodeList. I think this strikes a balance between extensibility and performance, for these reasons:

  1. These are the most numerous objects within the template system, with a typical full page render using maybe a thousand nodes.
  2. Nodes are “more internal” than RequestContext, Parser, and others, so less likely to have extra attributes attached.
  3. Complex custom template tags normally create Node subclasses. Python’s __slots__ behaviour means they won’t be slotted without their own explicit __slots__ definition, so they will continue to work without change.
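
Point 3 can be checked with a small sketch. BaseNode, ThirdPartyNode, and SlottedSubNode are hypothetical stand-ins, not the real Django classes, and the attribute names are invented for illustration:

```python
class BaseNode:
    """Hypothetical slotted base class, standing in for a slotted Node."""

    __slots__ = ("token",)


class ThirdPartyNode(BaseNode):
    """Subclass without __slots__: Python gives its instances a __dict__."""

    def __init__(self, name):
        self.name = name  # stored in the instance __dict__


class SlottedSubNode(BaseNode):
    """Subclass that opts in to slots by declaring its own __slots__."""

    __slots__ = ("name",)

    def __init__(self, name):
        self.name = name


third_party = ThirdPartyNode("partial")
third_party.extra = 1  # still allowed: the subclass is not slotted

slotted_sub = SlottedSubNode("partial")

print(hasattr(third_party, "__dict__"), hasattr(slotted_sub, "__dict__"))
```

So a subclass only loses its instance __dict__ if it declares __slots__ itself, which is why existing custom nodes should be unaffected.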

I have made a draft PR with my proposed change. I’ve measured it and found a ~20% memory saving on real-world templates and a ~6% speedup in rendering on a benchmark.

This speedup is more than the 1% I originally measured in Ticket 34521. I think this is because:

  1. The proposed PR affects all Node subclasses in Django rather than just the base class.
  2. I did a better benchmark.

More details on the benchmarking follow.

First, to measure total memory usage, I used the script below to load all templates under tracemalloc.

tracemalloc script
import os
import time
import tracemalloc
import warnings
from pathlib import Path

from django.template import Context, engines

# Ignore all warnings as some templates trigger them
warnings.simplefilter("ignore")

engine = engines["django"]

tracemalloc.start()

templates = {}
for dir_ in engine.template_dirs:
    dir_ = Path(dir_)
    for root, _, files in os.walk(dir_):
        root = Path(root)
        for file in files:
            template_name = str((root / file).relative_to(dir_))
            if template_name in templates:
                continue

            try:
                # template_name is already a str, so no conversion is needed
                templates[template_name] = engine.get_template(template_name)
            except Exception:  # some TemplateSyntaxErrors
                pass

print(f"{len(templates)} templates loaded")

snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()
total_bytes = sum(
    stat.size for stat in snapshot.statistics("lineno")
)
print(f"Total memory allocation: {total_bytes / 1024 / 1024:.2f}MiB")

Invoked like:

$ ./manage.py shell -c 'import example'
601 templates loaded
Total memory allocation: 14.29MiB

On a real-world client project with 601 templates, I got these results:

  • Before: 14.29 MiB
  • After: 11.51 MiB (-19%)
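
For a rough feel of where a saving like this can come from, here is a toy comparison, separate from the measurement above. PlainNode and SlottedNode are hypothetical two-attribute classes, and the exact byte counts will vary by Python version:

```python
import tracemalloc


class PlainNode:
    """Hypothetical unslotted class with two attributes."""

    def __init__(self, a, b):
        self.a = a
        self.b = b


class SlottedNode:
    """Same attributes, but stored in slots instead of a per-instance dict."""

    __slots__ = ("a", "b")

    def __init__(self, a, b):
        self.a = a
        self.b = b


def allocated_bytes(cls, count=10_000):
    """Return the bytes traced while `count` instances of cls are alive."""
    tracemalloc.start()
    objs = [cls("a", "b") for _ in range(count)]
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del objs
    return current


plain_bytes = allocated_bytes(PlainNode)
slotted_bytes = allocated_bytes(SlottedNode)
print(f"plain: {plain_bytes} bytes, slotted: {slotted_bytes} bytes")
```

This only measures bare instances, so the ratio it prints is not comparable to the 19% figure, which covers everything allocated while loading real templates.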

Second, to benchmark rendering speed, I used pyperf for its robust running and comparison capabilities. Because it’s so thorough, I only had time to run a smaller benchmark. I ran:

$ python -m pyperf timeit \
    --setup 'import django
django.setup()
from django.template import Template, Context
template = Template("it is {{ x }}\n" * 100_000)
context = Context({"x": "X"})' \
    'template.render(context)' \
    --inherit-environ DJANGO_SETTINGS_MODULE \
    --rigorous \
    --duplicate 10

This template has a lot of nodes, which biases the result a bit toward showing improvements with slots, because so many objects need to be read from memory in each render. But I found I needed quite a large template to get a stable benchmark with a long enough execution time.

I ran the command with --output before.json on Django’s main branch and --output after.json on my modified branch (about 9 minutes each run), and then compared the two results with:

$ python -m pyperf compare_to before.json after.json --table
+-----------+--------+----------------------+
| Benchmark | before | after                |
+===========+========+======================+
| timeit    | 183 ms | 172 ms: 1.06x faster |
+-----------+--------+----------------------+

This 6% speedup is statistically validated by the compare_to subcommand.

So, what do we think?

Thanks for picking this up @adamchainz. The proposal sounds about right. Let me have a proper look but :+1:

That’s a nice speedup. The changes seem localized and unproblematic. I don’t think any of my ugly project-specific monkey patches would suffer, and if they did, it’s probably fine.

Regarding the debug toolbar, we could maybe save our data in our own data structures instead of monkey patching to make way for further enhancements. I have a feeling that we’re not the only ones patching the context object though, since it seems like a good place to stash additional data. That being said, everyone could just use the context dictionaries instead for this…!
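
A minimal sketch of the "own data structures" idea, assuming a side table keyed by the object. FakeContext, record, and lookup are hypothetical names for illustration, not debug toolbar or Django APIs:

```python
import weakref


class FakeContext:
    """Hypothetical stand-in for an object a tool wants to annotate."""


# Side table: entries disappear when the annotated object is garbage collected.
_extra_data = weakref.WeakKeyDictionary()


def record(obj, key, value):
    """Store a value for obj without setting any attribute on obj."""
    _extra_data.setdefault(obj, {})[key] = value


def lookup(obj, key, default=None):
    """Read back a value stored with record(), or return default."""
    return _extra_data.get(obj, {}).get(key, default)


ctx = FakeContext()
record(ctx, "processors", ["auth"])
print(lookup(ctx, "processors"))
```

This only works for objects that are hashable and weak-referenceable, which I have not checked for the real context classes. Note that a slotted class needs "__weakref__" in its __slots__ for its instances to support weak references.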

I’ve finally picked this up, reopening the ticket and updating the PR. Please take a look, if you can: "Fixed #34521 -- Added __slots__ to several template classes." by adamchainz (Pull Request #18649, django/django on GitHub).
