Supporting collation settings on multiple database backends?

I work on Open edX, an open source project that for historical reasons uses MySQL for its production/development backend, and SQLite for running tests. I’m curious if there are established patterns for simultaneously supporting collations for multiple database types in Django code. For instance:

  • Applying specific character encoding and collation types on an app+database basis, e.g. “always use utf8mb4 character encoding for my app’s models when running in MySQL”.
  • Normalizing index/sorting case sensitivity behavior between MySQL and PostgreSQL/SQLite. I want some fields to always be case sensitive, regardless of database, like institution-assigned content identifiers. I want other fields to always be case insensitive, like titles.

I know that there’s a db_collation setting on CharFields now, which seems like it’d do exactly what I would want for one database backend, but wouldn’t work for more than one platform at a time. I know I can force case insensitivity in sorts and comparisons with things like iexact or using Lower, but my understanding is that would have severe performance penalties (some of our models have to scale out to billions of rows).

We currently do some ad hoc detection and switching of behavior in migrations files using db.connection.vendor. Does it make sense to try to make new subclasses like CaseSensitiveCharField and CaseInsensitiveCharField? A separate utility function to call to manually set the collation in the migration after the normal CharField has been created? Are there existing patterns for this kind of thing?
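For context, the ad hoc vendor switching looks roughly like this. It's a sketch in plain Python (so the dispatch logic is visible outside a migration), and the table, column, and collation names are hypothetical examples, not our real ones:

```python
# Sketch of the per-vendor dispatch we do ad hoc in migrations.
# Table, column, and collation names are hypothetical examples.

def collation_ddl(vendor, table, column, max_length, collations):
    """Return an ALTER statement for this vendor, or None if not applicable."""
    collation = collations.get(vendor)
    if collation is None:
        return None
    if vendor == "mysql":
        return (
            f"ALTER TABLE {table} MODIFY {column} "
            f"VARCHAR({max_length}) COLLATE {collation}"
        )
    # SQLite can't change a column's collation via ALTER TABLE, so there
    # we rely on the collation chosen at table-creation time instead.
    return None


sql = collation_ddl(
    "mysql", "myapp_item", "key", 255, {"mysql": "utf8mb4_bin"}
)
```

In a real migration this would be driven off `schema_editor.connection.vendor`, which is what makes it feel ad hoc: every migration that touches one of these columns has to repeat the same branching.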

Thank you.

Dredging this old thread up because I actually got around to trying to implement this.

I created a subclass of CharField like:

class MultiCollationCharField(models.CharField):
    def __init__(self, *args, db_collations=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.db_collations = db_collations or {}

    def db_parameters(self, connection):
        # Call Field.db_parameters directly, bypassing CharField's version
        # (which would apply its single db_collation attribute).
        db_params = models.Field.db_parameters(self, connection)

        # Now determine collation based on DB vendor (e.g. 'sqlite', 'mysql')
        if connection.vendor in self.db_collations:
            db_params["collation"] = self.db_collations[connection.vendor]

        return db_params

    def deconstruct(self):
        name, path, args, kwargs = super().deconstruct()
        if self.db_collations:
            kwargs['db_collations'] = self.db_collations
        return name, path, args, kwargs

I then invoke it in my model like:

    key = MultiCollationCharField(
        max_length=255,
        blank=False,
        null=False,
        db_collations={
            'sqlite': 'BINARY',
            'mysql': 'utf8mb4_bin',
        }
    )

And that seems to actually do what I want in Django 4.2. The migration file is generated, and running the sqlmigrate management command on it shows the column being created like this on SQLite: "key" varchar(255) COLLATE BINARY NOT NULL

But in Django 3.2, column_sql doesn’t check db_params for the collation. It instead goes straight to querying the field for its db_collation attribute.

Unfortunately, I still need to support Django 3.2 for another half year or so. :frowning_face: I thought I might make a db_collation property on my field and have it pick the right thing to send back based on the current connection, but I don’t know how to get at that connection object using only the field itself:

@property
def db_collation(self):
    # I have the dict of vendors to collations here (self.db_collations),
    # but no connection, and so no vendor, so I don't know which one to
    # send back.

I could do something really hacky and set some vendor attribute on my field as a side-effect when the db_parameters method is called. That seems… bad… but I don’t know how else to make this work. (Edit: Verified that this hack does “work”.)
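Concretely, the hack looks something like this. It's sketched without Django so it runs standalone (the real field subclasses models.CharField as above); the class name and fake connection are just for illustration:

```python
class MultiCollationFieldSketch:
    """Standalone sketch of the side-effect hack (no Django imports).

    Shows the control flow that makes a db_collation property work on
    Django 3.2, where column_sql reads field.db_collation directly
    without passing the field a connection.
    """

    def __init__(self, db_collations=None):
        self.db_collations = db_collations or {}
        self._vendor = None  # stashed by db_parameters() as a side effect

    def db_parameters(self, connection):
        # Side effect: remember the vendor, because db_collation (below)
        # will later be read without access to the connection.
        self._vendor = connection.vendor
        return {"collation": self.db_collations.get(connection.vendor)}

    @property
    def db_collation(self):
        # Only correct because db_parameters() ran first for this connection.
        return self.db_collations.get(self._vendor)


class FakeConnection:
    vendor = "mysql"


field = MultiCollationFieldSketch({"mysql": "utf8mb4_bin", "sqlite": "BINARY"})
field.db_parameters(FakeConnection())
# field.db_collation now resolves to the MySQL collation.
```

The fragility is visible in the sketch: if anything reads db_collation before db_parameters has run for the right connection, it gets None (or a stale vendor's value).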

Anyone have any advice?

Edit: In retrospect, I should have put this thread in the “Using the ORM” section, but I don’t think I can move it. Apologies to the moderators.

Moved. (No worries, no apologies necessary.)

Okay, more of me talking with myself (but maybe someone will find it useful one day):

I realized we do have a corner of the codebase that already does something like this: it sets a module-level constant based on settings, and then uses that constant in the field definitions.

One place where it does act weirdly is the migrations themselves, since running makemigrations would “bake” the actual value of DB_COLLATION into the migration files based on whatever was running at the time the migration was created. But that can be fixed by editing the migration to use the constant instead.

It’s also weird because it doesn’t actually bake the history of the field in. I don’t think there’s any way to detect “it used to be encoding A in migration 0002, and was then changed to encoding B in migration 0003”, because it’s just pulling that value from the constant each time.
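The constant-based pattern can be sketched like this. The function name, the engine strings, and the collation choices are hypothetical; the real code derives the value from Django's settings rather than a hard-coded engine string:

```python
# Sketch: pick one collation at import time from the configured DB engine.
# Engine strings and collations here are hypothetical examples; real code
# would inspect settings.DATABASES["default"]["ENGINE"].

def collation_for_engine(engine):
    if "mysql" in engine:
        return "utf8mb4_bin"
    return "BINARY"  # SQLite's built-in binary collating sequence

DB_COLLATION = collation_for_engine("django.db.backends.mysql")

# A hand-edited migration then imports DB_COLLATION and passes
# db_collation=DB_COLLATION instead of the baked-in literal string --
# which is exactly why the migration history can't record a change
# from one collation to another: the constant is resolved at runtime.
```

This makes the history-tracking limitation concrete: since every migration re-reads the constant, there is no per-migration record of what the collation used to be.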