Enforce Constraint on Aggregated Data in a Many-to-Many Relationship

Hello, I am posting here because I have the feeling I might be doing something wrong.

I want to track computer files that can be copied to several data storage devices.
I would like to enforce, on the model level, that you can’t add a file which would exceed the device’s capacity.

As I understand it, you can’t use a CheckConstraint across models. So I am using a through model for the ManyToManyField and validating this constraint manually in its save()/clean() methods.

Here’s the gist of what I came up with:

from django.core.exceptions import ValidationError
from django.db import models
from django.db.models import Sum


class StorageDevice(models.Model):
    name = models.CharField(max_length=256)
    capacity = models.PositiveBigIntegerField()

    def __str__(self):
        return self.name

    def used_capacity(self):
        return self.file_set.aggregate(Sum("size"))["size__sum"] or 0


class File(models.Model):
    name = models.CharField(max_length=256)
    size = models.PositiveBigIntegerField()

    storage_devices = models.ManyToManyField(StorageDevice,
                                             through="FileStorage",
                                             blank=True)

    def __str__(self):
        return self.name


class FileStorage(models.Model):
    file = models.ForeignKey(File, on_delete=models.CASCADE)
    storage_device = models.ForeignKey(StorageDevice, on_delete=models.CASCADE)

    class Meta:
        unique_together = ("file", "storage_device")

    def __str__(self):
        return f"'{self.file.name}' stored in '{self.storage_device.name}'"

    def save(self, *args, **kwargs):
        self.full_clean()  # Also calls clean().
        super().save(*args, **kwargs)

    def clean(self):
        """Check if adding this file would exceed the device's capacity."""
        current_total = self.storage_device.used_capacity()
        if self.file.size + current_total > self.storage_device.capacity:
            err = (f"Cannot store file '{self.file.name}' "
                    "on device '{self.storage_device.name}': capacity exceeded.")
            raise ValidationError(err)

This seems to work. Doing this results in the expected ValidationError:

storage = StorageDevice.objects.create(name="device 1", capacity=100)
file = File.objects.create(name="file 1", size=200)
FileStorage.objects.create(file=file, storage_device=storage)

But doing this does not:

file.storage_devices.add(storage)

As far as I know, this happens because add() inserts the through-table rows directly at the database level, so FileStorage.save() (and therefore full_clean()) is never called. I have to admit that I don’t understand why Django uses two different approaches here.
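
One workaround I am considering (but have not battle-tested) is an m2m_changed receiver that repeats the capacity check before the rows are inserted; as far as I can tell the signal is still sent when a through model is involved. The receiver name is mine, and it duplicates the logic from FileStorage.clean():

from django.core.exceptions import ValidationError
from django.db.models.signals import m2m_changed
from django.dispatch import receiver


@receiver(m2m_changed, sender=File.storage_devices.through)
def check_capacity_before_add(sender, instance, action, reverse, model, pk_set, **kwargs):
    if action != "pre_add" or not pk_set:
        return
    if not reverse:
        # file.storage_devices.add(...): instance is the File,
        # pk_set holds the StorageDevice primary keys being added.
        for device in StorageDevice.objects.filter(pk__in=pk_set):
            if device.used_capacity() + instance.size > device.capacity:
                raise ValidationError(
                    f"Cannot store file '{instance.name}' on device "
                    f"'{device.name}': capacity exceeded."
                )
    else:
        # device.file_set.add(...): instance is the StorageDevice,
        # pk_set holds the File primary keys being added.
        incoming = sum(
            File.objects.filter(pk__in=pk_set).values_list("size", flat=True)
        )
        if instance.used_capacity() + incoming > instance.capacity:
            raise ValidationError(
                f"Capacity of device '{instance.name}' would be exceeded."
            )

Direct FileStorage.objects.create() calls would still be covered by save()/full_clean(), but neither path is safe against concurrent additions without locking.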

My questions:

How would you model such a relationship?

Would it make sense to use a custom ManyToManyDescriptor and override its add() method? I guess I would also have to deal with set() and remove() in that case.

Where does Django use the add() method? Will I get away with simply not calling it myself? How would you prevent code from calling it accidentally?

I think the default admin forms use add(), but since I am using a through model, I will have to write a custom admin form anyway.

Additional thoughts:

My approach isn’t reliable if the file size or the device’s capacity changes after the connection has been established. I will probably have to override their save() methods too, right?
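
For the file-size case, something along these lines is what I have in mind (untested; it re-reads the old size from the database so that only the growth is checked against each connected device):

class File(models.Model):
    # name, size and storage_devices as above.

    def save(self, *args, **kwargs):
        self.full_clean()
        super().save(*args, **kwargs)

    def clean(self):
        """Re-check every connected device when the file grows."""
        if self.pk is None:
            return  # A new file is not on any device yet.
        old_size = File.objects.get(pk=self.pk).size
        growth = self.size - old_size
        if growth <= 0:
            return
        for device in self.storage_devices.all():
            # used_capacity() still reflects the old size, so add the growth.
            if device.used_capacity() + growth > device.capacity:
                raise ValidationError(
                    f"Resizing '{self.name}' would exceed the capacity "
                    f"of device '{device.name}'."
                )

A similar clean() on StorageDevice could refuse to lower capacity below used_capacity().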

Looking forward to your advice.

Thanks.

<opinion>
This check doesn’t belong in the join table between File and StorageDevice. It appears to me to be solely a concern of StorageDevice.

As a result, I would organize the code so that management of the many-to-many field is done by the StorageDevice model.

In other words, I would create a method in the StorageDevice manager to add or remove File instances from that storage device, so that you do not directly use the ManyToManyField methods like add, remove, set, etc.

</opinion>

(You might be able to subclass, replace, or modify the related object manager so that the manager performs these functions, but I’m not sure what the details would be or what side effects you might encounter.)
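
To make the idea concrete, here is one possible shape for it. I’ve put the methods on the StorageDevice model itself rather than on a custom manager, since the check needs a specific device instance; the method names are purely illustrative:

class StorageDevice(models.Model):
    # name, capacity and used_capacity() as in the original post.

    def add_file(self, file):
        """Attach a file to this device, enforcing the capacity limit."""
        if self.used_capacity() + file.size > self.capacity:
            raise ValidationError(
                f"Cannot store '{file.name}' on '{self.name}': capacity exceeded."
            )
        return FileStorage.objects.create(file=file, storage_device=self)

    def remove_file(self, file):
        """Detach a file from this device."""
        FileStorage.objects.filter(file=file, storage_device=self).delete()

Callers would then use device.add_file(file) instead of file.storage_devices.add(device); the raw add()/remove()/set() methods are still reachable, but all of your own code goes through a single choke point.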

The other factor that you may need to address is the difference between the storage allocation unit (e.g. the “block size”) and the size of the files themselves: each file occupies whole blocks, so the space it actually consumes is its size rounded up to the next block. This discrepancy becomes much more significant as the number of files increases.
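
For example, with an assumed 4 KiB block size, the space a file actually consumes is its size rounded up to whole blocks:

import math

BLOCK_SIZE = 4096  # illustrative; the real value depends on the filesystem

def allocated_size(file_size, block_size=BLOCK_SIZE):
    """Space a file actually occupies: its size rounded up to whole blocks."""
    return math.ceil(file_size / block_size) * block_size

allocated_size(1)      # 4096
allocated_size(5000)   # 8192

A capacity check based on raw file sizes will therefore be optimistic by up to one block per file.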

If I were doing this, I wouldn’t track the free space on the unit myself at all - I would query the device for free space for each operation. It’s going to be a lot more reliable over time.
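
For example, if each StorageDevice corresponds to a mounted filesystem (the mount_path field here is hypothetical, not part of the models above), the standard library can report the real free space:

import shutil

def free_space(mount_path):
    """Ask the filesystem how many bytes are actually free."""
    return shutil.disk_usage(mount_path).free

# e.g. before attaching a file:
# if file.size > free_space(device.mount_path):
#     raise ValidationError("Not enough space left on the device.")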

Thanks for the insight, Ken. I’m just jumping in here, but your point about handling the logic on the StorageDevice side makes a lot of sense. Centralizing the control feels cleaner and less error-prone than relying on the join table. I appreciate the tip on block size too; that’s definitely something I hadn’t considered.