Prefetching relations to arbitrary depth

I’m wondering if we can make it simpler to prefetch relations recursively.


I have trees modeled with a self-referring foreign key (“adjacency list” trees), like this model from the Django test suite, which I’ve edited to have a related_name of “children”:

class Node(models.Model):
    num = models.IntegerField(unique=True)
    parent = models.ForeignKey(
        "self", models.SET_NULL, to_field="num", null=True, related_name="children"
    )

I’ve been prefetching the child relationships like this:

Node.objects.prefetch_related("children__children__children")

which works great for a known depth of 3. But when the depth of the tree isn’t known, you can specify a depth (e.g. 50) that is well past what you’re likely to need, and Django is smart enough to stop issuing queries once the tree is exhausted:

def test_recursive_fk_prefetch_related(self):
    node1 = Node.objects.create(num=42)
    node2 = Node.objects.create(num=43, parent=node1)

    with self.assertNumQueries(3):
        self.assertSequenceEqual(
            Node.objects.prefetch_related("__".join(["children"] * 50)).all(),
            [node1, node2],
        )

That works great if a depth of 10 is the biggest likely depth, and 50 would be vanishingly rare.

But what if 100 becomes the biggest likely depth, and 1000 would be vanishingly rare? Use 1000 as the magic number?
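In the meantime, the magic number can at least be factored out into a small helper (prefetch_children is a name I’m making up here, not a Django API):

```python
def prefetch_children(queryset, max_depth):
    # Build "children__children__...__children" up to max_depth levels and
    # hand it to prefetch_related; Django stops querying once the tree ends.
    lookup = "__".join(["children"] * max_depth)
    return queryset.prefetch_related(lookup)
```

so that `Node.objects.prefetch_related("__".join(["children"] * 50))` becomes `prefetch_children(Node.objects.all(), 50)` — but the magic number is still there, just hidden.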

I’m wondering if we couldn’t add something to Django to be able to replace:

Node.objects.prefetch_related("__".join(["children"] * 50))

with

Node.objects.prefetch_related("children*")

Of course, this could be problematic for users who have cycles in their trees (graphs?). But if this is opt-in behavior, I’m not so worried about that.
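For what it’s worth, an opt-in recursive mode could guard against cycles with a visited set. A plain-Python sketch of the idea (collect and children_of are illustrative names, not a proposed API):

```python
def collect(node_id, children_of, seen=None):
    # Walk the "tree" depth-first, refusing to revisit a node, so a cycle
    # terminates instead of recursing forever.
    seen = set() if seen is None else seen
    if node_id in seen:
        return []
    seen.add(node_id)
    result = [node_id]
    for child_id in children_of.get(node_id, []):
        result += collect(child_id, children_of, seen)
    return result

collect(1, {1: [2], 2: [1]})  # a cycle: returns [1, 2] and stops
```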

Doing the magic number feels like a trick, and it makes it look like Django is doing 50 queries.

I also have cases where I need to do things like:

Node.objects.prefetch_related(
    "children__other",
    "children__other__children__other",
)

So the ability to do even this would be helpful:

Node.objects.prefetch_related(
    models.Prefetch("children__other", recursive=True),
)

How are folks prefetching their self-referring foreign key relationships? I haven’t looked in detail at the various libraries for tree querying in Django, but core Django is already meeting my needs, with the caveat that I’d like to be able to prefetch children to arbitrary depth.

If that kind of depth is rare, wouldn’t it also mean that just hitting the N+1 queries for those rare cases would be about as efficient as doing the prefetch anyway?

I don’t think prefetch_related is the tool to use here. The SQL feature for recursive self-joins, recursive CTEs (common table expressions), has an open ticket. If we could add that, it would allow selecting all levels of children in one query. CTEs are already generated by some tree packages.
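For anyone curious what that would look like, here’s the general shape of a recursive CTE, run against a plain sqlite3 table that mirrors the Node model (the table and column names are invented for this sketch; the ORM API is still an open ticket):

```python
import sqlite3

# Because parent uses to_field="num", the FK column stores num values, so the
# recursive join is on parent_id = num.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (num INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES node(num));
    INSERT INTO node VALUES (42, NULL), (43, 42), (44, 43);
""")
rows = conn.execute(
    """
    WITH RECURSIVE tree(num, parent_id, depth) AS (
        SELECT num, parent_id, 0 FROM node WHERE num = ?
        UNION ALL
        SELECT n.num, n.parent_id, t.depth + 1
        FROM node n JOIN tree t ON n.parent_id = t.num
    )
    SELECT num, depth FROM tree ORDER BY depth
    """,
    (42,),
).fetchall()
# rows -> [(42, 0), (43, 1), (44, 2)]: every level back in one query
```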

Additionally, check out my open PR on model field fetching modes. The “peer-fetching” mode allows for a kind of “post-fetch related” behaviour. Using that mode with a tree structure would provide “one query per layer” behaviour without any annotations on the QuerySet. It wouldn’t be quite as good as CTE support, but as good as your “unlimited depth prefetch” idea.

You can already try this “fetch peers” behaviour by using django-auto-prefetch, which I help maintain with its creator @tolomea.

Help trialling and reviewing my Django PR is very welcome!

If you have a tree that large and/or deep, then you should be looking at a structure other than an adjacency list. Even CTEs have potential negative performance characteristics. (Note: I just learned this morning, while researching the current status of this for my reply here, that PostgreSQL 17 includes a potentially significant improvement to the query planner for CTEs. See PostgreSQL 17 - A Major Step Forward in Performance, Logical Replication and More. I’m looking forward to doing some experiments with this!)

Our largest trees have the usage profile where queries need to be optimized first, followed by inserts, and then deletes. As a result, we settled on using a Materialized Path tree. (Technically, there are limits to the sizes of the trees that can be created, so in a truly unbounded situation, it is possible to break the tree under any given configuration. But our trees fit well within those limits.)
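For readers who haven’t used one, here’s a toy illustration of the materialized-path idea (not our actual schema): each node’s key encodes its full ancestry, so “all descendants” becomes a single prefix match.

```python
# Paths like "001.001.001" spell out the route from the root to the node.
nodes = {
    "001": "root",
    "001.001": "child a",
    "001.001.001": "grandchild",
    "001.002": "child b",
}

def descendants(path):
    # One prefix scan replaces the recursive walk; in SQL this is a single
    # LIKE 'path.%' (or ltree) query, regardless of depth.
    prefix = path + "."
    return sorted(p for p in nodes if p.startswith(prefix))

descendants("001.001")  # -> ["001.001.001"]
```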

This is precisely the reason to be looking at an alternative structure.

An approach we’ve used at work to avoid relying on N = max_depth queries is to always fetch trees from their root with prefetch_related("offspring"), and then build the parent relationships in memory, which is what a CTE-backed solution would have to do anyway.

class Node(models.Model):
    root = models.ForeignKey(
        "self", models.SET_NULL, null=True, related_name="offspring"
    )
    parent = models.ForeignKey(
        "self", models.SET_NULL, null=True, related_name="children"
    )

from collections import defaultdict

roots = Node.objects.filter(id__in=root_ids).prefetch_related("offspring")
# The logic below can be encapsulated in a custom `Prefetch` class that makes
# use of QuerySet._iterable_class.
for root in roots:
    children = defaultdict(list)
    for node in root.offspring.all():
        children[node.parent_id].append(node)
    # Seed each node's prefetch cache; leaves get an empty list so that
    # accessing their .children.all() doesn't fall back to a query.
    for node in root.offspring.all():
        node._prefetched_objects_cache = getattr(
            node, "_prefetched_objects_cache", {}
        )
        node._prefetched_objects_cache["children"] = children.get(node.id, [])
    # The root was prefetched, so its cache dict already exists.
    root._prefetched_objects_cache["children"] = children.get(root.id, [])

If your use case requires that you’re able to fetch a node at any arbitrary depth (not always from the root), along with all of its offspring, your data structure needs revisiting, as @KenWhitesell pointed out.

I’d look into using solutions like django-treebeard’s materialized path or simply writing your own by relying on Postgres int[] or ltree to denormalize your tree for efficient reads.

Great stuff, thanks all.

We’re using CTEs in places where we haven’t been using the ORM very much. Future ORM support for CTEs is exciting, thanks to everyone working on it.

Investigating other data structures is worthwhile, but I’m working on an open-source framework with mature deployments, and changing the guts of the data structure is pretty much out of the question. I’m doing the equivalent of adding a contrib app for the framework that makes it easier to express transformations of this data using the ORM. Going from ??? queries to N = max_depth + 1 queries has struck me as good enough – we have other low-hanging fruit to pick first – but I appreciate all these resources!

The “peer-fetching” mode allows for a kind of “post-fetch related” behaviour. Using that mode with a tree structure would provide “one query per layer” behaviour without any annotations on the QuerySet.

Thanks, I’ve been meaning to check out that PR! I can see how that takes care of children__children..., but I’m interested to try out children__other__children__other... also.

You have surely long considered this, but just in case: If the total tree size was small enough, maybe you could load the entire tree at once and then build the parent-child relationships yourself / in your code?

In our project we have a tree that has roughly 300 nodes. I’ve had good success and performance with always loading everything. That works especially well as we have to traverse large portions of the tree with every access anyways. (And as our nodes change only very infrequently, I consider caching the entire tree in memory, but have only recently started looking into the best caching approach.)
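A plain-Python sketch of that build step, assuming (id, parent_id) rows for the whole tree:

```python
from collections import defaultdict

# Load everything, then link parents to children in one pass in memory.
rows = [(1, None), (2, 1), (3, 1), (4, 2)]

children = defaultdict(list)
for node_id, parent_id in rows:
    children[parent_id].append(node_id)

children[1]  # -> [2, 3]
```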

Something else that might help is exploiting the features of depth-first or breadth-first numbering of the nodes in order to quickly limit what has to be loaded?
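As a toy example of such numbering (nested sets, the depth-first scheme behind MPTT-style libraries): give each node (lft, rght) bounds from a depth-first walk, and any subtree becomes a single range filter.

```python
# Depth-first numbering: a node's descendants are exactly the nodes whose
# lft falls strictly inside the node's (lft, rght) interval.
nodes = {"A": (1, 8), "B": (2, 5), "C": (3, 4), "D": (6, 7)}

def descendants(name):
    lft, rght = nodes[name]
    return sorted(n for n, (l, _) in nodes.items() if lft < l < rght)

descendants("A")  # -> ["B", "C", "D"]
```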

Using django-mptt and its get_descendants queryset method
