Hello, I am trying to build multiple potentially very deep tree structures with unknown length.
What I am trying to achieve is to build trees of convesation where node are reference base on the similarity.
For instance “i love cat” is the child of “i like going to zoos”.
In fact I managed to pull this of with django Mptt and vector search.
However after some reading into Sql rabbit hole, nested set is not a correct way to deal with deep and frequently updated tree.
I also found that closure table is a potential solution.
If anyone have experience with big and deep tree, can you please give me your thought.
I am also want to know about graph db but they seem to be too expensive in term of money to setup and maintain.
Thank you
Best regards,
You are correct, Nested Set is not the best solution for highly dynamic tree structures.
In this case, the “Materialized Path” structure is generally better.
We use the Django treebeard library for this.
1 Like
Hello,
Thank you for your speedy support,
However, materialized path requires a fix max length for path which I dont know before hand and it is likely to go big.
Is that a bad idea to override treebeard implementation with textfield?
Also what is the cost of having extreme length path
“Big” and “deep” are relative terms. If you can’t quantify these, at least to provide an order-of-magnitude estimate, it’s going to be difficult to offer tangible suggestions.
Nope, it’s an extremely good idea.
Quoting directly from the PostgreSQL docs on this issue:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n
) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n
) is usually the slowest of the three because of its additional storage costs. In most situations text
or character varying
should be used instead.
The bigger constraint with the treebeard implementation is that you’ve got a hard upper limit on the number of nodes under a single parent. If that quantity is truly unbounded, then the treebeard library itself may not be suitable for your purposes, and you could be better off looking at different alternative implementations.
1 Like
Hello thank you for your reply,
Whenever there is a new message, I calculate the similarity with the prev one and plug it somewhere on the tree where the score is highest. The tree can be big if user say many different things and become deep if user say the same thing multiple times.
I found the limit of treebeard is quite generous:
Attribute that defines the length of each step in the path
of a node. The default value of 4 allows a maximum of 1679615 children per node. Increase this value if you plan to store large trees (a steplen
of 5 allows more than 60M children per node). Note that increasing this value, while increasing the number of children per node, will decrease the max depth
of the tree (by default: 63 ). To increase the max depth
, increase the max_length attribute of the path
field in your model.
Thank you again, i will check what does such variables mean.
It seems like you are looking for a hierarchy of comments to comments… but this is a really difficult task.
In my case, I thought I could solve it with a foreign key, but I found the parent of the parent comment, and then found the parent of that comment… and experienced an infinite loop.
finally, I implemented it by creating n trees and limiting replies to n-1 trees. This is also a very cumbersome task, so I do not use it now.
It may seem cumbersome, but I think it is a way to express the tree structure explicitly.
Here’s an example:
# tree{n} receives the pk of the parent object.
class model:
class Meta:
ordering = [tree1, tree2, tree3, pk]
tree1 = int
tree2 = int
tree3 = int
...
1 Like
Yes, the implementation of a hierarchical structure does present some challenges. Fortunately, it’s a well-studied area.
I recommend reading Managing Hierarchical Data in MySQL — Mike Hillyer's Personal Webspace as an introduction to the topic. While the author specifically addresses implementation in MySQL, the principles apply regardless of the database being used. The Django treebeard library provides implementations of all three fundamental algorithms.
I can also recommend Joe Celko’s SQL for Smarties: Advanced SQL Programming (The Morgan Kaufmann Series in Data Management Systems). It looks like the current edition (5th) still has a complete chapter on this topic. Anyone seriously interested in relational data modeling should include this in their reading list.
1 Like
I never said I needed that…
For those who are on this quest, I found this guy who do benchmark for mptt, treebeard (3 methods) and his own implementation.
https://github.com/BertrandBordage/django-tree/blob/master/benchmark/results/results.rst
The conclusion is that holding a big and deep tree in SQL that needs to be read and write quickly is generally a bad idea. (we can see that most solution derails at around 3000 records).
However, if you need to write quick and figure out how to rebuild a tree later for data analysis purpose (offline), treebread AL is a way to go.
That conclusion does not match our experience in practice.
We have trees exceeding 40,000 entries that are still quite performant using Treebeard materialized path trees. I wouldn’t recommend it as much as I do if we hadn’t had such good success with it. (I find the author’s off-hand comment about it being brittle to be somewhat specious. We’ve never found it to be so in practice.)
Hey,
Thank you, I have thought about it a little bit I found out that my code only require to query ancestors for a node which is extremely convenient with MP since all ancestors already presented in the path.
I think I will go with MP and figure out how to limit the dynamic depth, since super deep tree does not very intuitive for human brain either.
1 Like
For those who want to pull this off in with treebread MP, here is my attempt.
I overwrite the charfield of path to textfield because I thought that textfield does not have max_length
. However, treebread still search for the max_length variable internally so you can stay with Charfield. I choose the max_length
to be arbitrarily large which may be a problem in the future.
I also overwrite get_ancestors()
of treebread to include the node itself which is a feature of MTTP that is somehow missed for unknown reason. Therefore, this implimentation may blow up in the future, but now it works and manage to deal with 8 level of depth with few dozen nodes.
class MemoryTreeMP(MP_Node, AbstractPromptResponse):
name = models.CharField(max_length=255, unique=True)
steplen = 5
is_session_start_node = models.BooleanField(default=False)
path = models.TextField(max_length= 9999999, unique=True)
alphabet = string.digits + string.ascii_uppercase + string.ascii_lowercase
def __str__(self):
return 'Memory Node: {}'.format(self.name)
def get_ancestors(self, include_self: bool):
"""
:returns: Override A queryset containing the current node object's ancestors,
starting by the root node and descending to itself.
"""
if self.is_root():
return get_result_class(self.__class__).objects.none()
if include_self:
paths = [
self.path[0:pos]
for pos in range(0, len(self.path) + self.steplen, self.steplen)[1:]
]
else:
paths = [
self.path[0:pos]
for pos in range(0, len(self.path), self.steplen)[1:]
]
return get_result_class(self.__class__).objects.filter(
path__in=paths ).order_by('depth')
1 Like
Update,
The max size for index in postgres is about 2700 bytes therefore too large path length will indeed blow up your tree. Keep the max_length below 220000 char (compressed down to 2300 bytes)
1 Like