Some modelling advice

Ok, so this isn’t purely Django, but we do model our relational databases using Django (what a tremendous feature it is!).

I’m deliberating as to how to model a particular concept. The concept is quite simple. A test, for example a blood test has a category and it may have a sub-category. A sub-category always has a category.

One possible way I could this is:

class Category(models.Model):
    pass

class SubCategory(models.Model):
    category = models.Foreignkey(Category, on_delete=models.CASCADE)

class Test(models.Model):
    category = models.Foreignkey(Category, on_delete=models.CASCADE)
    sub_cat = models.Foreignkey(SubCategory, null=True, on_delete=models.CASCADE)

This is functional, but something about it irks me, namely that you could have a Test record that has a relation to both a SubCategory and a Category when you should really just be able to access the record by test.subcategory.category

Another way of doing it is to drop category from the Test model and create a SubCategory record which is for all intents and purpose a record to indicate that the test doesn’t have a sub-category. But this also doesn’t seem correct to me. It feels like a bit of a hack. As an example of what I mean:

>>>> test = Test.objects.get(my_test)
>>>> test.sub_category

<SubCategory: I am a placeholder category>

>>>> test.sub_category.category

<Category: Acidity Tests>

The third option which has popped into my head is to create two Test models, CategoryTest and SubCategory test. This however doesn’t seem very normalised to me, as both the CategoryTest and SubCategory models will be identical.

The fourth option, and the one I am using now, is to have a many-to-many relation which is illustrated in the code below:

class Test(models.Model):
    pass

class Category(models.Model):
    tests = models.ManyToManyField(Test, on_delete=models.CASCADE)

class SubCategory(models.Model):
    tests = models.ManyToManyField(Test, on_delete=models.CASCADE)

The first issue with this is that the relationship is not a many-to-many, it is in fact a 1 to many relationship, so with the above, I have unnecessary tables and complexity.

As you can probably tell, I’m a little unsure as to which way I’m going to go, so perhaps there is someone here who can look at this problem with a fresh pair of eyes.

If you are absolutely and positively sure that you’re ever only going to have the Category/SubCategory pair, my answer is going to be slightly different than if you only have Category and SubCategory now, but may add a third level later.

It’s also going to make a difference whether or not there’s a behavioral difference between a Category and SubCategory. Is this purely a labeling / display issue? Or are there stats and reporting that is going to matter whether something is in a particular category vs being a sub-category?

Finally, is it the case that if a Category has SubCategories associated with it, that all Tests associated with that Category will have a SubCategory? Or is it possible that Test A is in Category K, SubCategory X, while Test B is in Category K but no SubCategory?

These are domain issues that can affect your modeling choice. (At least they allow or deny different ideas that I have about this particular design choice.)

Ken

Whilst I can’t be 100% sure that there will not be an additional category added to today’s Category and SubCategory, I’m reasonably certain that that will not be the case. The reason I can’t be 100% is that the system I’m designing is primarily modelling things with which I am not familiar, but I am doing my best to extract all the requirements I can.

We will have reporting showing things like which test was most commonly requested, which category contained the most requested test so on and so forth, but I do not see a use case for defining other attributes to the Category and SubCategory models other than their names.

In the frontend, both Category and SubCategory are both primarily for labelling/grouping. That is, I can return an object like this:

"categories": [
	{
	    "name": "Imaging",
	    "tests": [There could be tests here, too],
	    "sub_categories": [
	        {
	            "name": "CT",
	            "tests": [ the tests ]
	        },
	        {
	        	"name": "Radiography",
	            "tests": [ the tests ]
	        }
	},
]

A test is exclusively in a single category, and whilst it isn’t the case today, there is no reason why a Category cannot contain both its own tests and sub-categories with their own tests.

Thank you for taking the time to look at this, Ken.

Cheers,

Conor

You could have each category just have a parent category.

This gives you infinite depth.

You need to make the Parent able to be Null/None as this gives you your top level categories.

Given the relative “simplicity” of the modeling requirements and that the categories are just text labels, I’d be inclined to create my first version of this as just a single table with columns for category and subcategory. For those categories without a subcategory, the subcategory column would be blank. (In other words, consider the category/subcategory pair as a logical tuple.)

That gets you off the ground and running with what appears to be the simplest solution that works - without locking you into something that you won’t be able to work out of later.

Ken

Hi Ken, Andy,

Thank you both for chiming in. Always nice to have some good input from the outside world. I’ll let you know what I end up going with after I’ve had another chat with the folks building the real-world requirements.

Again, cheers!

Conor

I know it has been a while since I started this thread, but some things got in the way of refactoring this particular part of my app. Since you both were kind enough to give me some advice, I thought it only fair to give you an update should you be interested.

Categories and sub-categories will have a couple of attributes so I think it is now best to make them their own models. Given the requirements that a test will either solely have a category or sub-category I have decided to build a test as such:

class Category(models.Model):
  name = 
  attr2 = 
  
class SubCategory(models.Model):
  name = 
  category = FK(Category)
  attr2 = 
  
class Test(models.Model):
  name = 
  category = models.ForeignKey(Category, null=True, blank=True, on_delete=models.CASCADE)
  sub_category = models.ForeignKey(SubCategory, null=True, blank=True, on_delete=models.CASCADE)
  

I can then have a nested serializer which should return all tests and subcategories given a specific category

class SubCategorySerializer(serializers.ModelSerializer):
    tests = TestSerializer(many=True, read_only=True)

    class Meta:
        model = SubCategory
        fields = ("id", "name", "attr2")


class CategorySerializer(serializers.ModelSerializer):
    tests = TestSerializer(many=True, read_only=True)
    sub_categories = SubCategorySerializer(many=True, read_only=True)

    class Meta:
        model = Category
        fields = ("id", "name", "tests", "sub_categories", "attr2")

Django and Django Rest Framework are bloody wonderful!

1 Like

From a data-modelling perspective, I would not include an FK to Category in Test.

  • It’s redundant data, since category can be retrieved through sub_category. (And with appropriate use of select_related and prefetch_related, there’s not going to be a performance-hit.)
  • There’s nothing in the model that enforces that category is the same category as sub_category.category leading to potential data-integrity errors.
  • (Minor point) It doesn’t even appear like you’re using category in CategorySerializer

Ken

Hi Ken,

Thanks for taking the time to respond and share your thoughts, it always appreciated. Looks like I didn’t do a terrific job of explaining how category and sub-category could work. Not every test must be in a sub-category. Some tests can be only in a category (bit annoying, but these are the requirements) and some categories will not have any sub-categories For example.

Category Penguin
      test 1 - Am_I_a_Gentoo

Category Blood
   SubCategory Haematology
           test 1 - something_to_do_with_blood
           test 2 - is_blood_red
   SubCategory BioChemistry
         test 1 - so_on_and_so_forth

I see your point about integrity and I’ve been thinking about this, and I must admit, I don’t have an answer I’m happy with.

I was thinking, and I know, it’s ugly, that if a test had a sub-category then I would leave category null. Likewise, if a test only had a category, then sub-category would be null.

Regarding the serialisers, it was a typo, I’ve fixed it now. I just whipped up the code in the forum’s editor. But yes, you’re spot on!

Cheers,

Conor

No, you did - I didn’t read it closely enough the first time.

I don’t think there’s a perfect answer for this, just different tradeoffs depending upon the types of queries being run on the data and the total volume that will exist in these tables.

Some ideas:

  • category becomes a GFK to either Category or SubCategory

or

  • Adding a trigger in the database to ensure that precisely one of category and sub_category is null.

or

  • Change Test to only have an FK to Category, where Category might look something like this:
class Category():
    name = ...
    parent = FK('Category', null=True, on_delete=models.PROTECT)

class CategoryAttributes():
    category = OneToOne(...)
    attr2 = ...

class SubCategoryAttributes():
    category = OneToOne(...)
    attr2 = ...

Wow, two things I didn’t know exists/could be done.

First, GFK is new to me. I haven’t read more than the highlights but it could be interesting but I am curious/concerned how the database is cobbled together. It sounds rather magical, for better or for worse.

I didn’t know that you could have an FK to self. That is really neat and I really like your idea with your Category and CategoryAttributes. I think that could work really very well.

You’ve given me plenty to chew on Ken, thank you! I’m going to go away and do some testing and have a play. I’ll let you know how I get along.

And please Ken, ask a question on the forum where I can help you for a change!

C

P.S. I’ve very excited about the FK to self. Very!

There are a number of different ways to store a “tree structure” or “hierarchy” within a database. There’s a package called django-treebeard that we use for this purpose. It implements three of the most common algorithms. It’s an interesting topic of study in that each of the three have different performance characteristics regarding searching, inserting, and deleting nodes within the tree. (This particular implementation where you have an FK to self is the Adjacency List.)

For some other reading on the topic, I can suggest Managing Hierarchical Data in MySQL — Mike Hillyer's Personal Webspace
(I thought I had one more link, but I can’t find the reference at the moment.)

I’ve been back and forth, up and down, and circled the wagons any number of times about GFKs. I’ve both recommended for and recommended against them at nearly the same time. I’m of the general mindset that they do have value and can be used quite effectively, but the number of such cases where they’re really useful is a lot smaller than you might think. (See Avoid Django’s GenericForeignKey - lukeplant.me.uk for some reasons why.)

Cheers!

1 Like

Ah trees! I’m getting flashbacks to a C subject from uni which had a textbook named something along the lines of “Abstract Data Types in C”. It’s been about 17 years since I looked at the topic, so maybe it is time to dive in!

Thanks for all the help, Ken. It’s been wonderfully helpful.

Cheers!

Conor