maybe we can retrieve the ids for objects created by bulk_create when using MySQL.
the original was posted on Bulk Create doesn't update instance references in runtime · Issue #306 · doableware/djongo · GitHub
but i will post it again here.
let’s first take a look at how MySQL handles auto_increment, and then at how django loses the ids after bulk_create.
types of inserts in MySQL
https://dev.mysql.com/doc/refman/8.0/en/innodb-auto-increment-handling.html
in MySQL, there are a few types of inserts, but two main ones: simple inserts and bulk inserts.
please do not confuse those with the bulk_create operation in django.
in MySQL, simple inserts are
Statements for which the number of rows to be inserted can be determined in advance.
and
This includes single-row and multiple-row INSERT statements.
and bulk inserts are
Statements for which the number of rows to be inserted (and the number of required auto-increment values) is not known in advance.
This includes INSERT … SELECT, REPLACE … SELECT, and LOAD DATA statements, but not plain INSERT.
and clearly, calling save and bulk_create will generate simple inserts in the context of MySQL.
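to make that concrete, here is a small sketch that captures the SQL bulk_create actually sends; the Record model, its name field, and the myapp app are hypothetical names used only for illustration:

# sketch: capture the SQL emitted by bulk_create (Record/myapp are made up)
from django.db import connection
from django.test.utils import CaptureQueriesContext

from myapp.models import Record  # hypothetical model with a name CharField

with CaptureQueriesContext(connection) as ctx:
    Record.objects.bulk_create([Record(name=str(i)) for i in range(3)])

# one multi-row INSERT, i.e. a "simple insert" in MySQL's terms:
# INSERT INTO `myapp_record` (`name`) VALUES ('0'), ('1'), ('2')
print(ctx.captured_queries[-1]["sql"])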
how does MySQL handle simple INSERT statements?
InnoDB assigns new values for the AUTO_INCREMENT column one at a time as each row is processed.
but how exactly?
there’s a lock for allocating values of the auto_increment column, and there are several lock modes for it.
the AUTO_INCREMENT lock modes used to generate auto-increment values
The auto-increment lock mode is configured at startup using the innodb_autoinc_lock_mode variable.
There are three possible settings for the innodb_autoinc_lock_mode variable. The settings are 0, 1, or 2, for “traditional”, “consecutive”, or “interleaved” lock mode, respectively.
depending on your MySQL version, the default is consecutive mode (MySQL 5.7) or interleaved mode (MySQL 8.0).
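if you are not sure which mode your server runs with, you can check the variable from django; a quick sketch:

# check which AUTO_INCREMENT lock mode the MySQL server is using
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode'")
    print(cursor.fetchone())  # e.g. ('innodb_autoinc_lock_mode', '2') for interleaved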
the consecutive mode
under the consecutive mode, bulk inserts will lock the table!
In this mode, “bulk inserts” use the special AUTO-INC table-level lock and hold it until the end of the statement.
but for simple inserts, no!
“Simple inserts” (for which the number of rows to be inserted is known in advance) avoid table-level AUTO-INC locks by obtaining the required number of auto-increment values under the control of a mutex (a light-weight lock) that is only held for the duration of the allocation process, not until the statement completes.
as we can see, consecutive mode forces the ids of a bulk insert to be allocated in sequence via the table-level lock.
if you want your bulk inserts to have consecutive ids, activate the consecutive mode (innodb_autoinc_lock_mode = 1, set at server startup).
that’s because in some cases MySQL has to wait for a subquery to be executed in order to know how many rows will be inserted, so it locks the table for the whole statement:
bulk insert -> lock the table
-> execute the subquery to calculate the number of rows, suppose there are 10 rows
-> insert 10 rows, making sure their ids are consecutive
-> release the lock
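for illustration, this is the kind of statement MySQL calls a bulk insert; the row count depends on a SELECT, so it is not known in advance (the table and column names below are made up):

# a hypothetical "bulk insert" in MySQL's sense, issued through a raw cursor
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "INSERT INTO archive_record (payload) "
        "SELECT payload FROM live_record WHERE created_at < NOW() - INTERVAL 30 DAY"
    )

under consecutive mode, this statement holds the table-level AUTO-INC lock until it finishes.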
but for simple inserts, MySQL just acquires that light-weight lock (the mutex) to allocate a run of consecutive ids for the rows, and that’s it.
so the ids within the same simple insert will be consecutive.
the interleaved mode
under the interleaved mode, bulk inserts may get non-consecutive ids, but simple inserts still get consecutive ones.
In this lock mode, no “INSERT-like” statements use the table-level AUTO-INC lock, and multiple statements can execute at the same time.
why? as the name suggests, concurrently executing statements may interleave their id allocations, but within one plain insert statement the ids are still consecutive.
In this lock mode, auto-increment values are guaranteed to be unique and monotonically increasing across all concurrently executing “INSERT-like” statements.
conclusion.
so the id allocation of a simple insert can be viewed as “atomic”, and the ids are always consecutive within one insert statement.
whether to turn on the consecutive mode really just depends on what you want to do with bulk inserts.
how does django lose the ids after calling bulk_create?
let’s see how the save method retrieves the id for the object.
in short, after sending the insert statement, django will call last_insert_id to fetch the id.
# django/db/models/sql/compiler.py, SQLInsertCompiler.execute_sql (around L1813)
def execute_sql(self, returning_fields=None):
    assert not (
        returning_fields
        and len(self.query.objs) != 1
        and not self.connection.features.can_return_rows_from_bulk_insert
    )
    opts = self.query.get_meta()
    self.returning_fields = returning_fields
    cols = []
    with self.connection.cursor() as cursor:
        for sql, params in self.as_sql():
            cursor.execute(sql, params)
        if not self.returning_fields:
            return []
        if (
            self.connection.features.can_return_rows_from_bulk_insert
            and len(self.query.objs) > 1
        ):
            rows = self.connection.ops.fetch_returned_insert_rows(cursor)
            cols = [field.get_col(opts.db_table) for field in self.returning_fields]
        elif self.connection.features.can_return_columns_from_insert:
            assert len(self.query.objs) == 1
            rows = [
                self.connection.ops.fetch_returned_insert_columns(
                    cursor,
                    self.returning_params,
                )
            ]
            cols = [field.get_col(opts.db_table) for field in self.returning_fields]
        else:
            cols = [opts.pk.get_col(opts.db_table)]
            rows = [
                (
                    self.connection.ops.last_insert_id(
                        cursor,
                        opts.db_table,
                        opts.pk.column,
                    ),
                )
            ]
in this execute_sql, the id field is passed in returning_fields, so on MySQL we fall into the last else branch, which calls self.connection.ops.last_insert_id.
that gives the first id among the newly inserted rows.
say last_insert_id returns x and you are inserting 10 rows; then their ids are x, x+1, x+2, …
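you can see this behaviour of LAST_INSERT_ID() with a raw multi-row insert; a sketch, with a hypothetical table and column:

# LAST_INSERT_ID() after a multi-row simple insert returns the id of the FIRST row
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "INSERT INTO myapp_record (name) VALUES (%s), (%s), (%s)",
        ["a", "b", "c"],
    )
    cursor.execute("SELECT LAST_INSERT_ID()")
    first_id = cursor.fetchone()[0]

# the three rows got first_id, first_id + 1, first_id + 2
print(first_id, first_id + 1, first_id + 2)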
however, when calling the bulk_create method, the parameter returning_fields will be None, so execute_sql just returns early without calling last_insert_id.
why?
it turns out that django omits returning_fields (so it defaults to None) if the database cannot return rows from a bulk insert.
# django/db/models/query.py, QuerySet._batched_insert (around L1852)
def _batched_insert(
    self,
    objs,
    fields,
    batch_size,
    on_conflict=None,
    update_fields=None,
    unique_fields=None,
):
    bulk_return = connection.features.can_return_rows_from_bulk_insert
    for item in [objs[i : i + batch_size] for i in range(0, len(objs), batch_size)]:
        if bulk_return and (
            on_conflict is None or on_conflict == OnConflict.UPDATE
        ):
            inserted_rows.extend(
                self._insert(
                    item,
                    fields=fields,
                    using=self.db,
                    on_conflict=on_conflict,
                    update_fields=update_fields,
                    unique_fields=unique_fields,
                    returning_fields=self.model._meta.db_returning_fields,
                )
            )
        else:
            self._insert(
                item,
                fields=fields,
                using=self.db,
                on_conflict=on_conflict,
                update_fields=update_fields,
                unique_fields=unique_fields,
            )
if you are using MySQL, connection.features.can_return_rows_from_bulk_insert will be False, so the else branch runs and returning_fields is never passed, i.e. it stays None.
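you can verify the flags on your own connection; a quick sketch:

# both flags are False for the MySQL backend, so bulk_create cannot get ids back
from django.db import connection

print(connection.features.can_return_rows_from_bulk_insert)
print(connection.features.can_return_columns_from_insert)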
solution.
as we saw above, the ids of a simple insert are always consecutive, so in theory we can retrieve the ids of the objects created by bulk_create: ask MySQL for the id of the first inserted row and count upward.
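here is a minimal sketch of that idea, not production code: it assumes a hypothetical Record model, that all objects fit in a single batch (so MySQL sees exactly one multi-row simple insert), and a plain auto_increment primary key.

# sketch only: backfill pks after bulk_create using LAST_INSERT_ID()
from django.db import connection

from myapp.models import Record  # hypothetical model


def bulk_create_with_ids(objs):
    # assume everything fits in one batch, i.e. one multi-row simple insert
    Record.objects.bulk_create(objs)
    with connection.cursor() as cursor:
        # LAST_INSERT_ID() is per-connection and returns the id of the first
        # row of the last INSERT issued on this connection
        cursor.execute("SELECT LAST_INSERT_ID()")
        first_id = cursor.fetchone()[0]
    # simple inserts allocate consecutive ids, so backfill them in order
    for offset, obj in enumerate(objs):
        obj.pk = first_id + offset
    return objs

usage would look like objs = bulk_create_with_ids([Record(name=str(i)) for i in range(10)]); the guarantee rests entirely on the consecutive-id behaviour of simple inserts described above.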
another solution is to open a transaction and insert the rows one by one; each insert shifts the auto_increment counter one step forward, which is a bit like reserving a spot with that light-weight lock, and then you commit all of your inserts at once.
why?
because until you commit, the insert statements just advance the auto_increment counter and nothing else is visible to anyone else.
you will not see the records in the table in MySQL Workbench, and if you select from the table in another session, you will not get those uncommitted rows.
from django.db import transaction

objs_list = []
with transaction.atomic():
    for _ in range(10):
        objs_list.append(Record.objects.create())
        # if someone selects the table here at the same time, they will not get those newly inserted Records.
the main difference from a single bulk_create is how many times we acquire that light-weight lock to increment the auto_increment counter: once per row instead of once per statement.