maybe we can retrieve the ids for objects created by bulk_create when using MySQL.
the original was posted on Bulk Create doesn't update instance references in runtime · Issue #306 · doableware/djongo · GitHub
but i will post it again here.
let’s first take a look at how MySQL handles auto_increment, and then at how django loses the ids after bulk_create.
types of inserts in MySQL
https://dev.mysql.com/doc/refman/8.0/en/innodb-auto-increment-handling.html
in MySQL, there are a few types of inserts, but two main ones: simple inserts and bulk inserts.
please do not confuse those with the bulk_create operation in django.
in MySQL, simple inserts are
Statements for which the number of rows to be inserted can be determined in advance.
and
This includes single-row and multiple-row INSERT statements.
and bulk inserts are
Statements for which the number of rows to be inserted (and the number of required auto-increment values) is not known in advance.
This includes INSERT … SELECT, REPLACE … SELECT, and LOAD DATA statements, but not plain INSERT.
and clearly, calling save and bulk_create will generate simple inserts in the context of MySQL.
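to make that concrete, here is a small sketch that captures the SQL bulk_create actually sends; the Record model, its name field, and the myapp app are hypothetical names used only for illustration:

# sketch: capture the SQL emitted by bulk_create (Record/myapp are made up)
from django.db import connection
from django.test.utils import CaptureQueriesContext

from myapp.models import Record  # hypothetical model with a name CharField

with CaptureQueriesContext(connection) as ctx:
    Record.objects.bulk_create([Record(name=str(i)) for i in range(3)])

# one multi-row INSERT, i.e. a "simple insert" in MySQL's terms:
# INSERT INTO `myapp_record` (`name`) VALUES ('0'), ('1'), ('2')
print(ctx.captured_queries[-1]["sql"])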
how does MySQL handle simple INSERT statements?
InnoDB assigns new values for the AUTO_INCREMENT column one at a time as each row is processed.
but how exactly?
there’s a lock for allocating values of the auto_increment column, and there are several lock modes for it.
the AUTO_INCREMENT lock modes used to generate auto-increment values
The auto-increment lock mode is configured at startup using the innodb_autoinc_lock_mode variable.
There are three possible settings for the innodb_autoinc_lock_mode variable. The settings are 0, 1, or 2, for “traditional”, “consecutive”, or “interleaved” lock mode, respectively.
depending on your MySQL version, the default is consecutive mode (MySQL 5.7) or interleaved mode (MySQL 8.0).
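if you are not sure which mode your server runs with, you can check the variable from django; a quick sketch:

# check which AUTO_INCREMENT lock mode the MySQL server is using
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode'")
    print(cursor.fetchone())  # e.g. ('innodb_autoinc_lock_mode', '2') for interleaved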
the consecutive mode
under the consecutive mode, bulk inserts will lock the table!
In this mode, “bulk inserts” use the special AUTO-INC table-level lock and hold it until the end of the statement.
but for simple inserts, no!
“Simple inserts” (for which the number of rows to be inserted is known in advance) avoid table-level AUTO-INC locks by obtaining the required number of auto-increment values under the control of a mutex (a light-weight lock) that is only held for the duration of the allocation process, not until the statement completes.
as we can see, consecutive mode forces the ids of a bulk insert to be allocated in sequence via the table-level lock.
if you want your bulk inserts to have consecutive ids, activate the consecutive mode (innodb_autoinc_lock_mode = 1, set at server startup).
that’s because in some cases MySQL has to wait for a subquery to be executed in order to know how many rows will be inserted, so it locks the table for the whole statement:
bulk insert -> lock the table
-> execute the subquery to calculate the number of rows, suppose there are 10 rows
-> insert 10 rows, making sure their ids are consecutive
-> release the lock
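for illustration, this is the kind of statement MySQL calls a bulk insert; the row count depends on a SELECT, so it is not known in advance (the table and column names below are made up):

# a hypothetical "bulk insert" in MySQL's sense, issued through a raw cursor
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "INSERT INTO archive_record (payload) "
        "SELECT payload FROM live_record WHERE created_at < NOW() - INTERVAL 30 DAY"
    )

under consecutive mode, this statement holds the table-level AUTO-INC lock until it finishes.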
but for simple inserts, MySQL just acquires that light-weight lock (the mutex) to allocate a run of consecutive ids for the rows, and that’s it.
so the ids within the same simple insert will be consecutive.
the interleaved mode
under the interleaved mode, bulk inserts may get non-consecutive ids, but simple inserts still get consecutive ones.
In this lock mode, no “INSERT-like” statements use the table-level AUTO-INC lock, and multiple statements can execute at the same time.
why? as the name suggests, concurrently executing statements may interleave their id allocations, but within one plain insert statement the ids are still consecutive.
In this lock mode, auto-increment values are guaranteed to be unique and monotonically increasing across all concurrently executing “INSERT-like” statements.
conclusion.
so the id allocation of a simple insert can be viewed as “atomic”, and the ids are always consecutive within one insert statement.
whether to turn on the consecutive mode really just depends on what you want to do with bulk inserts.
how does django lose the ids after calling bulk_create?
let’s see how the save method retrieves the id for the object.
in short, after sending the insert statement, django will call last_insert_id to fetch the id.
# django/db/models/sql/compiler.py, SQLInsertCompiler.execute_sql (around L1813)
def execute_sql(self, returning_fields=None):
    assert not (
        returning_fields
        and len(self.query.objs) != 1
        and not self.connection.features.can_return_rows_from_bulk_insert
    )
    opts = self.query.get_meta()
    self.returning_fields = returning_fields
    cols = []
    with self.connection.cursor() as cursor:
        for sql, params in self.as_sql():
            cursor.execute(sql, params)
        if not self.returning_fields:
            return []
        if (
            self.connection.features.can_return_rows_from_bulk_insert
            and len(self.query.objs) > 1
        ):
            rows = self.connection.ops.fetch_returned_insert_rows(cursor)
            cols = [field.get_col(opts.db_table) for field in self.returning_fields]
        elif self.connection.features.can_return_columns_from_insert:
            assert len(self.query.objs) == 1
            rows = [
                self.connection.ops.fetch_returned_insert_columns(
                    cursor,
                    self.returning_params,
                )
            ]
            cols = [field.get_col(opts.db_table) for field in self.returning_fields]
        else:
            cols = [opts.pk.get_col(opts.db_table)]
            rows = [
                (
                    self.connection.ops.last_insert_id(
                        cursor,
                        opts.db_table,
                        opts.pk.column,
                    ),
                )
            ]
in this execute_sql, the id field is passed in returning_fields, so on MySQL we fall into the last else branch, which calls self.connection.ops.last_insert_id.
that gives the first id among the newly inserted rows.
say last_insert_id returns x and you are inserting 10 rows; then their ids are x, x+1, x+2, …
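you can see this behaviour of LAST_INSERT_ID() with a raw multi-row insert; a sketch, with a hypothetical table and column:

# LAST_INSERT_ID() after a multi-row simple insert returns the id of the FIRST row
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "INSERT INTO myapp_record (name) VALUES (%s), (%s), (%s)",
        ["a", "b", "c"],
    )
    cursor.execute("SELECT LAST_INSERT_ID()")
    first_id = cursor.fetchone()[0]

# the three rows got first_id, first_id + 1, first_id + 2
print(first_id, first_id + 1, first_id + 2)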
however, when calling the bulk_create method, the parameter returning_fields will be None, so execute_sql just returns early without calling last_insert_id.
why?
it turns out that django omits returning_fields (so it defaults to None) if the database cannot return rows from a bulk insert.
# django/db/models/query.py, QuerySet._batched_insert (around L1852)
def _batched_insert(
    self,
    objs,
    fields,
    batch_size,
    on_conflict=None,
    update_fields=None,
    unique_fields=None,
):
    bulk_return = connection.features.can_return_rows_from_bulk_insert
    for item in [objs[i : i + batch_size] for i in range(0, len(objs), batch_size)]:
        if bulk_return and (
            on_conflict is None or on_conflict == OnConflict.UPDATE
        ):
            inserted_rows.extend(
                self._insert(
                    item,
                    fields=fields,
                    using=self.db,
                    on_conflict=on_conflict,
                    update_fields=update_fields,
                    unique_fields=unique_fields,
                    returning_fields=self.model._meta.db_returning_fields,
                )
            )
        else:
            self._insert(
                item,
                fields=fields,
                using=self.db,
                on_conflict=on_conflict,
                update_fields=update_fields,
                unique_fields=unique_fields,
            )
if you are using MySQL, connection.features.can_return_rows_from_bulk_insert will be False, so the else branch runs and returning_fields is never passed, i.e. it stays None.
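you can verify the flags on your own connection; a quick sketch:

# both flags are False for the MySQL backend, so bulk_create cannot get ids back
from django.db import connection

print(connection.features.can_return_rows_from_bulk_insert)
print(connection.features.can_return_columns_from_insert)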
solution.
as we saw above, the ids of a simple insert are always consecutive, so in theory we can retrieve the ids of the objects created by bulk_create: ask MySQL for the id of the first inserted row and count upward.
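here is a minimal sketch of that idea, not production code: it assumes a hypothetical Record model, that all objects fit in a single batch (so MySQL sees exactly one multi-row simple insert), and a plain auto_increment primary key.

# sketch only: backfill pks after bulk_create using LAST_INSERT_ID()
from django.db import connection

from myapp.models import Record  # hypothetical model


def bulk_create_with_ids(objs):
    # assume everything fits in one batch, i.e. one multi-row simple insert
    Record.objects.bulk_create(objs)
    with connection.cursor() as cursor:
        # LAST_INSERT_ID() is per-connection and returns the id of the first
        # row of the last INSERT issued on this connection
        cursor.execute("SELECT LAST_INSERT_ID()")
        first_id = cursor.fetchone()[0]
    # simple inserts allocate consecutive ids, so backfill them in order
    for offset, obj in enumerate(objs):
        obj.pk = first_id + offset
    return objs

usage would look like objs = bulk_create_with_ids([Record(name=str(i)) for i in range(10)]); the guarantee rests entirely on the consecutive-id behaviour of simple inserts described above.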
another solution is to open a transaction and insert the rows one by one; each insert shifts the auto_increment counter one step forward, which is a bit like reserving a spot with that light-weight lock, and then you commit all of your inserts at once.
why?
because until you commit, the insert statements just advance the auto_increment counter and nothing else is visible to anyone else.
you will not see the records in the table in MySQL Workbench, and if you select from the table in another session, you will not get those uncommitted rows.
from django.db import transaction

objs_list = []
with transaction.atomic():
    for _ in range(10):
        objs_list.append(Record.objects.create())
        # if someone selects the table here at the same time, they will not get those newly inserted Records.
the main difference from a single bulk_create is how many times we acquire that light-weight lock to increment the auto_increment counter: once per row instead of once per statement.