Supporting collation settings on multiple database backends?

I work on Open edX, an open source project that for historical reasons uses MySQL for its production/development backend, and SQLite for running tests. I’m curious if there are established patterns for simultaneously supporting collations for multiple database types in Django code. For instance:

  • Applying specific character encoding and collation types on an app+database basis, e.g. “always use utf8mb4 character encoding for my app’s models when running in MySQL”.
  • Normalizing index/sorting case sensitivity behavior between MySQL and PostgreSQL/SQLite. I want some fields to always be case sensitive, regardless of database, like institution-assigned content identifiers. I want other fields to always be case insensitive, like titles.

I know that there’s a db_collation setting on CharFields now, which seems like it’d do exactly what I would want for one database backend, but wouldn’t work for more than one platform at a time. I know I can force case insensitivity in sorts and comparisons with things like iexact or using Lower, but my understanding is that would have severe performance penalties (some of our models have to scale out to billions of rows).

We currently do some ad hoc detection and switching of behavior in migrations files using db.connection.vendor. Does it make sense to try to make new subclasses like CaseSensitiveCharField and CaseInsensitiveCharField? A separate utility function to call to manually set the collation in the migration after the normal CharField has been created? Are there existing patterns for this kind of thing?
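For context, the ad hoc vendor switching looks roughly like this. It's a sketch in plain Python (so the dispatch logic is visible outside a migration), and the table, column, and collation names are hypothetical examples, not our real ones:

```python
# Sketch of the per-vendor dispatch we do ad hoc in migrations.
# Table, column, and collation names are hypothetical examples.

def collation_ddl(vendor, table, column, max_length, collations):
    """Return an ALTER statement for this vendor, or None if not applicable."""
    collation = collations.get(vendor)
    if collation is None:
        return None
    if vendor == "mysql":
        return (
            f"ALTER TABLE {table} MODIFY {column} "
            f"VARCHAR({max_length}) COLLATE {collation}"
        )
    # SQLite can't change a column's collation via ALTER TABLE, so there
    # we rely on the collation chosen at table-creation time instead.
    return None


sql = collation_ddl(
    "mysql", "myapp_item", "key", 255, {"mysql": "utf8mb4_bin"}
)
```

In a real migration this would be driven off `schema_editor.connection.vendor`, which is what makes it feel ad hoc: every migration that touches one of these columns has to repeat the same branching.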

Thank you.

Dredging this old thread up because I actually got around to trying to implement this.

I created a subclass of CharField like:

class MultiCollationCharField(models.CharField):
    def __init__(self, *args, db_collations=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.db_collations = db_collations or {}

    def db_parameters(self, connection):
        # Call Field.db_parameters directly, bypassing CharField's version
        # (which would apply its single db_collation attribute).
        db_params = models.Field.db_parameters(self, connection)

        # Now determine collation based on DB vendor (e.g. 'sqlite', 'mysql')
        if connection.vendor in self.db_collations:
            db_params["collation"] = self.db_collations[connection.vendor]

        return db_params

    def deconstruct(self):
        name, path, args, kwargs = super().deconstruct()
        if self.db_collations:
            kwargs['db_collations'] = self.db_collations
        return name, path, args, kwargs

I then invoke it in my model like:

    key = MultiCollationCharField(
        max_length=255,
        blank=False,
        null=False,
        db_collations={
            'sqlite': 'BINARY',
            'mysql': 'utf8mb4_bin',
        }
    )

And that seems to actually do what I want in Django 4.2. The migration file is generated, and running the sqlmigrate management command on it shows the column being created like this on SQLite: "key" varchar(255) COLLATE BINARY NOT NULL

But in Django 3.2, column_sql doesn’t check db_params for the collation. It instead goes straight to querying the field for its db_collation attribute.

Unfortunately, I still need to support Django 3.2 for another half year or so. :frowning_face: I thought I might make a db_collation property on my field and have it pick the right thing to send back based on the current connection, but I don’t know how to get at that connection object using only the field itself:

@property
def db_collation(self):
    # I have the dict of vendors to collations here (self.db_collations),
    # but no connection, and so no vendor, so I don't know which one to
    # send back.

I could do something really hacky and set some vendor attribute on my field as a side-effect when the db_parameters method is called. That seems… bad… but I don’t know how else to make this work. (Edit: Verified that this hack does “work”.)
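Concretely, the hack looks something like this. It's sketched without Django so it runs standalone (the real field subclasses models.CharField as above); the class name and fake connection are just for illustration:

```python
class MultiCollationFieldSketch:
    """Standalone sketch of the side-effect hack (no Django imports).

    Shows the control flow that makes a db_collation property work on
    Django 3.2, where column_sql reads field.db_collation directly
    without passing the field a connection.
    """

    def __init__(self, db_collations=None):
        self.db_collations = db_collations or {}
        self._vendor = None  # stashed by db_parameters() as a side effect

    def db_parameters(self, connection):
        # Side effect: remember the vendor, because db_collation (below)
        # will later be read without access to the connection.
        self._vendor = connection.vendor
        return {"collation": self.db_collations.get(connection.vendor)}

    @property
    def db_collation(self):
        # Only correct because db_parameters() ran first for this connection.
        return self.db_collations.get(self._vendor)


class FakeConnection:
    vendor = "mysql"


field = MultiCollationFieldSketch({"mysql": "utf8mb4_bin", "sqlite": "BINARY"})
field.db_parameters(FakeConnection())
# field.db_collation now resolves to the MySQL collation.
```

The fragility is visible in the sketch: if anything reads db_collation before db_parameters has run for the right connection, it gets None (or a stale vendor's value).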

Anyone have any advice?

Edit: In retrospect, I should have put this thread in the “Using the ORM” section, but I don’t think I can move it. Apologies to the moderators.

Moved. (No worries, no apologies necessary.)

Okay, more of me talking with myself (but maybe someone will find it useful one day):

I realized we do have a corner of the codebase that already does something like this: it sets a module-level constant based on settings, and then uses that constant in the field definitions.

One place where it does act weirdly is the migrations themselves, since running makemigrations would “bake” the actual value of DB_COLLATION into the migration files based on whatever was running at the time the migration was created. But that can be fixed by editing the migration to use the constant instead.

It’s also weird because it doesn’t actually bake the history of the field in. I don’t think there’s any way to detect “it used to be encoding A in migration 0002, and was then changed to encoding B in migration 0003”, because it’s just pulling that value from the constant each time.
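The constant-based pattern can be sketched like this. The function name, the engine strings, and the collation choices are hypothetical; the real code derives the value from Django's settings rather than a hard-coded engine string:

```python
# Sketch: pick one collation at import time from the configured DB engine.
# Engine strings and collations here are hypothetical examples; real code
# would inspect settings.DATABASES["default"]["ENGINE"].

def collation_for_engine(engine):
    if "mysql" in engine:
        return "utf8mb4_bin"
    return "BINARY"  # SQLite's built-in binary collating sequence

DB_COLLATION = collation_for_engine("django.db.backends.mysql")

# A hand-edited migration then imports DB_COLLATION and passes
# db_collation=DB_COLLATION instead of the baked-in literal string --
# which is exactly why the migration history can't record a change
# from one collation to another: the constant is resolved at runtime.
```

This makes the history-tracking limitation concrete: since every migration re-reads the constant, there is no per-migration record of what the collation used to be.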