Ticket 18392 and MySQL utf8mb4

Django’s minimum required version of MySQL is 8.0.11. Since MySQL 8.0, the utf8mb4 charset has been recommended instead of the old utf8/utf8mb3, which MySQL has deprecated.

Currently, Django defaults to using the deprecated “utf8”/“utf8mb3” character set.

If a user wants to use a different character set, they can add the “charset” and “collation” options to their DATABASES configuration.
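For readers who have not set these options before, a minimal sketch of what such a configuration might look like follows. The database name and credentials are hypothetical placeholders, and the collation shown is only one possible utf8mb4 collation, not a recommendation from this thread.

```python
# settings.py (fragment). NAME, USER and PASSWORD are hypothetical placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mydb",
        "USER": "myuser",
        "PASSWORD": "change-me",
        "OPTIONS": {
            # Character set used for the client connection.
            "charset": "utf8mb4",
            # Must belong to the charset above; chosen here for illustration only.
            "collation": "utf8mb4_unicode_ci",
        },
    }
}
```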

What do you think of changing Django defaults to “utf8mb4” for MySQL, with notes for the users that they can use the DATABASES options to stay on “utf8mb3” if they have a legacy database that isn’t fully ready for “utf8mb4”?

(See the ticket for past discussion.)

Wow. Yes. I always assumed that it was MySQL that was defaulting to “utf8”, not Django. I have "OPTIONS": {"charset": "utf8mb4"} set in all of my database settings and I 100% agree this should be the default in Django; otherwise you get emoji issues. With “utf8” aka “utf8mb3” being deprecated, I think it makes it even more clear.

As far as I know, the old 191 index-length concern is no longer an issue in modern MySQL versions, which I think would have been the main reason for keeping “utf8” as the default (besides backward compatibility).

I also think that at some point we’ll have to go forward and change the default.

One method would be to change the default in the next Django version and prominently document it.

Another possible option could be passing through an (accelerated?) deprecation step: obtain the default database charset and warn projects to set an explicit 'OPTIONS': {'charset': 'utf8'} if they want to keep using the legacy encoding, as the default will change in the next Django version.
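To make the deprecation idea concrete, here is a rough sketch of the decision logic only. The function name and its arguments are hypothetical and not part of Django; in particular, how the database's default charset would be obtained is left to the caller, which is exactly the tricky part of the approach.

```python
import warnings

# Legacy 3-byte charset names; "utf8" is MySQL's alias for "utf8mb3".
LEGACY_CHARSETS = {"utf8", "utf8mb3"}


def warn_if_charset_implicit(db_settings, database_charset):
    """Hypothetical helper: warn when a project relies on the implicit default.

    db_settings is one entry of DATABASES. database_charset is the default
    charset of the database, which the caller must look up separately.
    Returns True if a warning was emitted.
    """
    options = db_settings.get("OPTIONS", {})
    if "charset" in options:
        # The project made an explicit choice, so there is nothing to warn about.
        return False
    if database_charset not in LEGACY_CHARSETS:
        # The database is not on the legacy charset; a new default is harmless.
        return False
    warnings.warn(
        "The default MySQL charset will change to utf8mb4. Set "
        "'OPTIONS': {'charset': 'utf8'} to keep the legacy encoding.",
        DeprecationWarning,
        stacklevel=2,
    )
    return True
```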

Thanks. I think the index length could still be limited if someone switched to utf8mb4, but kept the old row format. I would tend to just list that in the documentation and suggest they stay on the old utf8mb3 if they can’t update the row format.
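For context, the 191 figure falls out of simple arithmetic. The sketch below assumes the InnoDB index key prefix limits from the MySQL documentation (767 bytes for the older REDUNDANT and COMPACT row formats, 3072 bytes for DYNAMIC and COMPRESSED with the default page size); those numbers are not stated in this thread, and the helper name is made up for illustration.

```python
# Assumed InnoDB index key prefix limits in bytes, per row format.
INDEX_PREFIX_LIMIT_BYTES = {
    "REDUNDANT": 767,
    "COMPACT": 767,
    "DYNAMIC": 3072,
    "COMPRESSED": 3072,
}

# Worst-case storage per character for each charset.
MAX_BYTES_PER_CHAR = {"utf8mb3": 3, "utf8mb4": 4}


def max_indexable_chars(row_format, charset):
    """Longest fully indexable string column, in characters (hypothetical helper)."""
    return INDEX_PREFIX_LIMIT_BYTES[row_format] // MAX_BYTES_PER_CHAR[charset]


print(max_indexable_chars("COMPACT", "utf8mb3"))  # 255
print(max_indexable_chars("COMPACT", "utf8mb4"))  # 191
print(max_indexable_chars("DYNAMIC", "utf8mb4"))  # 768
```

With an old row format, a 255-character indexed column fits under utf8mb3 but not under utf8mb4, which is why updating the row format matters.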

Thanks. My preference would be to change the default and prominently document it, but if we need to go through a deprecation step, we could.

If we do a deprecation step, I wonder if we could just mark the default utf8mb3 as deprecated, without adding code to check the database that Django is connecting to.

Maybe, but I’m not certain it’s possible to determine the database collation based on the connection setting only. A bit tricky (that’s why the ticket is currently rotting). Breaking the compatibility is really the simplest way, but is not very in line with the traditional compatibility policy of Django.

I guess since all a user has to do to stay on utf8mb3 is put the charset & collation in the databases options, I don’t think it’s too bad of an incompatibility. And, the change is in line with what MySQL is doing - MySQL has deprecated the current Django default, so currently Django is defaulting to a deprecated (and many would argue, a broken) character set.

And, the current default is already causing problems - users like @collinanderson who want real UTF8 currently have to know about the issue and take an extra step to support it.

@claudep Do you think a Steering Council decision would be required to change the default?

Do you think a Steering Council decision would be required to change the default?

Not necessarily. The idea of a forum thread is to be able to collect several opinions and see if some clear path emerges. For example, I would love to get the opinion of @adamchainz, as the maintainer of django-mysql.

We saw how well that went with the storage changes, where users lost their files because they didn’t update the configuration. On the plus side, I don’t think that changing to utf8mb4 as a default would result in losing data. But it can be really annoying if the default gets changed, a new table gets created with utf8mb4, and all of a sudden you can no longer join tables because the charsets don’t match, etc.

All in all it would be great to change the default.

Any more thoughts on this issue? @adamchainz @nessita @sarahboyce others?

Here’s a sample PR for the change.

Hello @benc, thank you for your patience. From my reading of the linked docs, I think your proposal makes sense. As a very isolated reference point, I asked the Grafana On Call team (which uses MySQL for its setup) which encoding they use, and they confirmed they use utf8mb4 by default.

I would like some certainty that changing the default does not lose data, as Florian (@apollo13) suggests: with that confirmation, I think it makes sense to change the default.

@charettes would you have an opinion on this proposal?

Here’s one thing that could happen if we change the default: if a user has a utf8 collation (eg. “utf8mb3_general_ci”), but is using the default django charset (ie. “utf8”), then if we change the default django charset to “utf8mb4”, an exception is thrown:
“COLLATION ‘utf8mb3_general_ci’ is not valid for CHARACTER SET ‘utf8mb4’”
To fix this, the user would need to explicitly set the charset to “utf8mb3”, instead of depending on the default.
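The mismatch described above can be spotted from the names alone, since MySQL collation names start with the name of the charset they belong to. The following is a small illustrative check, not anything Django does; the helper name is hypothetical, and the alias table only covers the "utf8" alias for "utf8mb3".

```python
# "utf8" is an alias for "utf8mb3" in MySQL.
CHARSET_ALIASES = {"utf8": "utf8mb3"}


def collation_matches_charset(collation, charset):
    """Return True if the collation name belongs to the given charset."""
    charset = CHARSET_ALIASES.get(charset, charset)
    # Collation names look like "<charset>_<rest>", e.g. "utf8mb3_general_ci".
    prefix = collation.split("_", 1)[0]
    prefix = CHARSET_ALIASES.get(prefix, prefix)
    return prefix == charset


# A project that sets only the collation breaks once the default charset changes:
print(collation_matches_charset("utf8mb3_general_ci", "utf8"))     # True
print(collation_matches_charset("utf8mb3_general_ci", "utf8mb4"))  # False

# Pinning both options explicitly keeps the legacy behaviour.
LEGACY_OPTIONS = {"charset": "utf8mb3", "collation": "utf8mb3_general_ci"}
```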

From my understanding, setting the charset option of the mysqlclient library results in the mysql_set_character_set C call, which I assume is the equivalent of the SET CHARACTER SET statement. That statement maps all strings sent between the server and the current client with the given mapping by setting three session system variables: character_set_client, character_set_results, and character_set_connection.

Assuming the above is correct, let’s look at these three settings in isolation and consider what impact moving from utf8 to utf8mb4 would have.

Changing character_set_client seems like it will only make things more permissive, as it will allow string literals requiring 4-byte encoding to be transmitted from the client to the server. Ultimately, if the table has columns defined with a charset of utf8mb3, they will still be unable to store some byte sequences, but that’s already the case today unless I’m missing something.

Changing character_set_results also seems like a non-issue because utf8mb4 is a superset of utf8mb3 so Python will be able to decode both using the equivalent of bytes.decode('utf8') when transmitted from the server.
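The superset relationship is easy to see from the Python side. In this sketch, the helper name is hypothetical; it treats a string as storable in utf8mb3 when every character needs at most three bytes in UTF-8, which is the distinction between the two MySQL charsets.

```python
def fits_in_utf8mb3(text):
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


dolphin = "\U0001F42C"  # the dolphin emoji, outside the Basic Multilingual Plane

print(len("e".encode("utf-8")))       # 1
print(len("\u03bc".encode("utf-8")))  # 2 (Greek small letter mu)
print(len(dolphin.encode("utf-8")))   # 4

print(fits_in_utf8mb3("caf\u00e9"))   # True
print(fits_in_utf8mb3(dolphin))       # False

# Bytes from either charset decode with the same Python codec.
assert dolphin.encode("utf-8").decode("utf8") == dolphin
```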

Lastly it has no direct effect on character_set_connection since it sets it to a server defined value which passing utf8 also did.


To summarize, I don’t think that changing the charset as proposed here poses risks regarding indexing or interoperability at query time, as SET CHARSET won’t affect the default character set used to define entities storing text (that’s something controlled by character_set_database, which is inherited from character_set_server).

In other words, from my understanding, all this configuration does is increase interoperability for string literals exchange between the client and the server and has no effect on character_set_database which is implicitly used for tables, columns, and indices creation.

Otherwise (neither CHARACTER SET nor COLLATE is specified), the database character set and collation are used.

The problem of tables created with utf8mb3 columns and indices (they can’t store characters like emojis) and the limitations of utf8mb4 with regard to indexing (limited to 191 characters) have little to do with this particular flag and more to do with the MySQL server configuration. Django never explicitly specified which charset and collation should be used on schema changes (it always defaulted to the configured database character set, which SET CHARSET has no effect on).

If you’d like a visual confirmation of that you can refer to this MySQL 8.0.36 fiddle or the

Fiddle
select version();
version()
8.0.36
SHOW VARIABLES LIKE 'character_set%';
Variable_name Value
character_set_client utf8mb4
character_set_connection utf8mb4
character_set_database utf8mb4
character_set_filesystem binary
character_set_results utf8mb4
character_set_server utf8mb4
character_set_system utf8mb3
character_sets_dir /usr/share/mysql-8.0/charsets/
SELECT '🐬';
?
🐬
CREATE TABLE before_table (field text);
SHOW CREATE TABLE before_table;
Table Create Table
before_table CREATE TABLE `before_table` (
`field` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
INSERT INTO before_table (field) VALUES ('🐬');
SELECT * FROM before_table;
field
🐬
SET CHARSET 'utf8'; -- This is currently what Django does
SHOW VARIABLES LIKE 'character_set%';
Variable_name Value
character_set_client utf8mb3
character_set_connection utf8mb4
character_set_database utf8mb4
character_set_filesystem binary
character_set_results utf8mb3
character_set_server utf8mb4
character_set_system utf8mb3
character_sets_dir /usr/share/mysql-8.0/charsets/
SELECT '🐬';
???
???
CREATE TABLE after_table (field text);
SHOW CREATE TABLE after_table;
Table Create Table
after_table CREATE TABLE `after_table` (
`field` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
INSERT INTO after_table (field) VALUES ('🐬');
SELECT * FROM after_table;
field
???

The only risk this change poses is that passing Unicode character literals that are 4 bytes wide to compare them against utf8mb3 columns will now loudly error out instead of silently corrupting them. I feel like it could be more valuable for users to be made aware of this problem, and of the need to rebuild their tables with a more adequate charset, than to keep hiding the problem from them.

Thank you @charettes, your reply is very educational and complete. I deeply appreciate the time you invested in writing it! I’ll circle back to @benc’s PR.

Thank you all - the default has now been changed to “utf8mb4”.

@benc Hi! I’ve noticed an error in our Jenkins CI instance, for MariaDB and Python 3.12, where we get the following error:

django.db.utils.OperationalError: (1918, "Encountered illegal value 'μg/mL' when converting to latin1")

This definitely seems related to this change, so I was hoping you would have some time to debug and evaluate.

Synchronizing apps without migrations:
  Creating tables...
    Creating table django_content_type
    Creating table auth_permission
    ...
    Creating table constraints_product
Traceback (most recent call last):
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/utils.py", line 103, in _execute
    return self.cursor.execute(sql)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/mysql/base.py", line 76, in execute
    return self.cursor.execute(query, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/tests/.env/lib/python3.12/site-packages/MySQLdb/cursors.py", line 179, in execute
    res = self._query(mogrified_query)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/tests/.env/lib/python3.12/site-packages/MySQLdb/cursors.py", line 330, in _query
    db.query(q)
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/tests/.env/lib/python3.12/site-packages/MySQLdb/connections.py", line 265, in query
    _mysql.connection.query(self, query)
MySQLdb.OperationalError: (1918, "Encountered illegal value 'μg/mL' when converting to latin1")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/tests/./runtests.py", line 787, in <module>
    failures = django_tests(
               ^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/tests/./runtests.py", line 425, in django_tests
    failures = test_runner.run_tests(test_labels)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/test/runner.py", line 1089, in run_tests
    old_config = self.setup_databases(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/test/runner.py", line 987, in setup_databases
    return _setup_databases(
           ^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/test/utils.py", line 206, in setup_databases
    connection.creation.create_test_db(
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/base/creation.py", line 78, in create_test_db
    call_command(
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/core/management/__init__.py", line 194, in call_command
    return command.execute(*args, **defaults)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/core/management/base.py", line 459, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/core/management/base.py", line 107, in wrapper
    res = handle_func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/core/management/commands/migrate.py", line 323, in handle
    self.sync_apps(connection, executor.loader.unmigrated_apps)
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/core/management/commands/migrate.py", line 485, in sync_apps
    editor.create_model(model)
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/base/schema.py", line 505, in create_model
    self.execute(sql, params or None)
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/base/schema.py", line 202, in execute
    cursor.execute(sql, params)
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/utils.py", line 79, in execute
    return self._execute_with_wrappers(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/utils.py", line 92, in _execute_with_wrappers
    return executor(sql, params, many, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/utils.py", line 100, in _execute
    with self.db.wrap_database_errors:
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/utils.py", line 103, in _execute
    return self.cursor.execute(sql)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/django/db/backends/mysql/base.py", line 76, in execute
    return self.cursor.execute(query, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/tests/.env/lib/python3.12/site-packages/MySQLdb/cursors.py", line 179, in execute
    res = self._query(mogrified_query)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/tests/.env/lib/python3.12/site-packages/MySQLdb/cursors.py", line 330, in _query
    db.query(q)
  File "/home/jenkins/workspace/django-mariadb/database/mysql/label/mariadb/python/python3.12/tests/.env/lib/python3.12/site-packages/MySQLdb/connections.py", line 265, in query
    _mysql.connection.query(self, query)
django.db.utils.OperationalError: (1918, "Encountered illegal value 'μg/mL' when converting to latin1")
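As a loose Python analogy for the message in this traceback (not a diagnosis of the CI failure): assuming the first character of 'μg/mL' is U+03BC, the Greek small letter mu, it has no representation in Latin-1, whereas the visually similar micro sign U+00B5 does. The helper name below is made up for illustration, and Python's "latin-1" codec is only an approximation of the server-side latin1 charset.

```python
def can_encode(text, encoding):
    """Return True if text can be represented in the given Python codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


GREEK_MU = "\u03bc"    # assumed to be the character in the error message
MICRO_SIGN = "\u00b5"  # look-alike character that Latin-1 does contain

print(can_encode(GREEK_MU + "g/mL", "latin-1"))    # False
print(can_encode(MICRO_SIGN + "g/mL", "latin-1"))  # True
print(can_encode(GREEK_MU + "g/mL", "utf-8"))      # True
```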

@nessita Sure, I should be able to look at it some this afternoon. Are you seeing that error intermittently across various PRs, or is this the first time?

Thanks @benc. This has been failing consistently since November 16th:

Hmm, I’m trying to look around on djangoci.com, but I’m getting “Bad Gateway” errors now.