How to avoid the overhead of model instances in bulk_create

Hi all, this is my first post here so let me know if I’m doing something wrong.

I’m working on a file importer in Django that requires that we import each column, row and cell in the database, as the data itself can be anything and has to be handled in different ways later down the line. This importer is expected to receive potentially very big files, maybe up to millions of rows. I was testing the performance of what I had built with 100k rows and 18 columns, and I was consistently hitting import times around 45 seconds. Now this isn’t terrible, but considering I expect some files to be upwards of 10 times that size, it’s unacceptable. And before you ask, everything is in bulk_creates, with a batch_size of 2000 and the file is walked, not loaded all in memory at once.

After profiling the code, I realized that most of the time is actually spent in manager methods, and then it clicked. The bottleneck for my code is the creation of the model objects. Importing the cells (but keeping the columns and rows in regular Django ORM code) through means of RawSQL made the same import take less than 15 seconds, so a 3x improvement.

I am not a fan of having to write RawSQL queries however, so I was wondering if there was a way to insert through the manager without building the Model instances at any point? I really do not need them, except maybe for getting the ID of the columns and rows as these are foreign keys in the Cell model.

Any other tips you guys might have for improving the performance of the import would be greatly appreciated. I think 15 seconds for 1.8 million data points might be acceptable, but considering it could go up to 10 million (10 columns with 1 million rows), this is still too slow to my liking.

Thanks all!

Use the RawSQL. It’s your best-bet in this situation.

ORM = Object Relational Mapping. By definition it’s an interface layer between an object structure and a relational DB. It doesn’t really make sense to talk about using an “Object Manager” if there aren’t any objects involved.

1 Like

It’s very hard to help you beyond what @KenWhitesell suggested about using RawSQL due to the lack of any profiling data.

Asking for suggestion about how to make something faster without providing traces of the experienced slowness is analogous to asking for help for an exception message without providing the traceback or the code that caused it.

All I can offer with the details you’ve provided is to create model instances using *args instead of **kwargs as it make a significant difference in the amount of work required to create them.

For example, if you have a model of the form

class Product(models.Model):
    field1 = models.TextField()
    ...
    field17 = models.IntegerField()

Then doing

# None is for the implicit id primary key in this case and
# should be usable for any field that have a database generated
# default.
Product(None, field1, ..., field17)

Will perform better than

Product(field1=field1, ..., field17=field17)

The difference is not noticeable when dealing with a few model instances but it would add up when dealing with millions of them. That’s the reason why the *args approach is used by QuerySet over **kwargs when creating model instances.

Making sure that you don’t have any pre_init and post_init signals registered for your application is also something I would look into.

2 Likes

That makes sense, I’ve always tried to avoid stepping out of the ORM, but you’re right that in this case it does make a lot of sense. Especially since every microsecond counts when working with millions of records.

One thing I was wondering is if it was possible to pass a list of dicts with the data that would be passed to the Model instances instead of the Model instance directly. This would basically tell the Manager, here is the data, please store it in the database, but I do not care about getting the object back, so bypass that part. I feel like that would be a very useful thing to have, since it happens pretty often that objects are instantiated just because we need them for the creation of the data, not really because we need the object.

Note that I am not sure what this would look like in the code, or if it even really makes sense from a process point of view. I also don’t know if the overhead is major in enough cases that it warrants a whole new creation process, but I think that for large scale applications, even saving 5ms (and a bit of memory saved by not instatiating models) per database call can make quite the difference on the bill at the end of the year.

Thanks for the reply, I wasn’t aware there was a difference in performance between args and kwargs.

No signals are used on these tables, and I will definitely make sure to keep it that way.

And here is the profiling for the RawSQL:

1568092 function calls (1567831 primitive calls) in 1.009 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.009    1.009 local.py:181(__call__)
        1    0.000    0.000    1.009    1.009 task.py:407(__call__)
        1    0.000    0.000    1.008    1.008 tasks.py:13(import_file_task)
        1    0.143    0.143    1.007    1.007 csv.py:12(start_process)
       18    0.314    0.017    0.770    0.043 base.py:49(_insert_cells)
       39    0.000    0.000    0.252    0.006 utils.py:66(execute)
       39    0.000    0.000    0.252    0.006 utils.py:76(_execute_with_wrappers)
       39    0.000    0.000    0.252    0.006 utils.py:82(_execute)
       39    0.000    0.000    0.252    0.006 base.py:322(execute)
       39    0.181    0.005    0.181    0.005 {function SQLiteCursorWrapper.execute at 0x000002313380FEC0}
   193894    0.063    0.000    0.079    0.000 base.py:653(_get_pk_val)
    75/57    0.031    0.000    0.072    0.001 {method 'join' of 'str' objects}
       39    0.000    0.000    0.071    0.002 base.py:341(convert_query)
       39    0.070    0.002    0.070    0.002 {method 'sub' of 're.Pattern' objects}
561416/561413    0.068    0.000    0.068    0.000 {built-in method builtins.isinstance}
   183654    0.040    0.000    0.040    0.000 base.py:59(<genexpr>)
   183654    0.021    0.000    0.021    0.000 {method 'strip' of 'str' objects}
   193926    0.020    0.000    0.020    0.000 {method 'append' of 'list' objects}
       18    0.003    0.000    0.017    0.001 base.py:71(_insert_rows)
194404/194360    0.016    0.000    0.017    0.000 {built-in method builtins.getattr}
    10204    0.006    0.000    0.017    0.000 base.py:75(__iter__)
    10222    0.004    0.000    0.006    0.000 base.py:154(endswith_lf)
        1    0.000    0.000    0.005    0.005 files.py:76(open)
        1    0.000    0.000    0.005    0.005 base.py:20(open)
        1    0.000    0.000    0.005    0.005 filesystem.py:67(_open)
        1    0.005    0.005    0.005    0.005 {built-in method _io.open}
       34    0.005    0.000    0.005    0.000 {method 'clear' of 'list' objects}
       21    0.000    0.000    0.003    0.000 base.py:48(chunks)
       21    0.002    0.000    0.003    0.000 {method 'read' of '_io.TextIOWrapper' objects}
       21    0.000    0.000    0.002    0.000 utils.py:96(inner)
       19    0.002    0.000    0.002    0.000 {method 'fetchall' of 'sqlite3.Cursor' objects}
        3    0.000    0.000    0.002    0.001 manager.py:85(manager_method)
    10202    0.002    0.000    0.002    0.000 base.py:16(__init__)
    10242    0.002    0.000    0.002    0.000 {method 'endswith' of 'str' objects}
       38    0.000    0.000    0.002    0.000 cp1252.py:22(decode)
    10220    0.002    0.000    0.002    0.000 base.py:79(<genexpr>)
       38    0.002    0.000    0.002    0.000 {built-in method _codecs.charmap_decode}
       18    0.001    0.000    0.001    0.000 base.py:22(set_ids)
        1    0.000    0.000    0.001    0.001 query.py:725(bulk_create)
        1    0.000    0.000    0.001    0.001 query.py:613(get)
       20    0.001    0.000    0.001    0.000 {method 'splitlines' of 'str' objects}
    78/39    0.000    0.000    0.001    0.000 asyncio.py:15(inner)
       39    0.000    0.000    0.001    0.000 base.py:327(cursor)
       44    0.000    0.000    0.001    0.000 connection.py:56(__getitem__)
       39    0.000    0.000    0.001    0.000 base.py:304(_cursor)
       36    0.000    0.000    0.001    0.000 connection.py:14(__getattr__)
       44    0.000    0.000    0.001    0.000 local.py:99(__getattr__)
        2    0.000    0.000    0.001    0.000 compiler.py:1532(execute_sql)
        1    0.000    0.000    0.001    0.001 query.py:1810(_batched_insert)
        1    0.000    0.000    0.001    0.001 base.py:767(save)
   107/99    0.000    0.000    0.001    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.001    0.001 query.py:1780(_insert)
        1    0.000    0.000    0.001    0.001 query.py:379(__len__)
        1    0.000    0.000    0.001    0.001 query.py:1879(_fetch_all)
        1    0.000    0.000    0.001    0.001 base.py:835(save_base)
        2    0.000    0.000    0.001    0.000 query.py:85(__iter__)
        1    0.000    0.000    0.001    0.001 compiler.py:1812(execute_sql)
       44    0.000    0.000    0.001    0.000 local.py:80(_get_storage)
        1    0.000    0.000    0.001    0.001 base.py:938(_save_table)
       44    0.000    0.000    0.001    0.000 local.py:44(_get_context_id)
        1    0.000    0.000    0.001    0.001 base.py:1028(_do_update)
        1    0.000    0.000    0.000    0.000 base.py:34(_build_columns)
       43    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
        1    0.000    0.000    0.000    0.000 compiler.py:1732(as_sql)
       19    0.000    0.000    0.000    0.000 base.py:460(__init__)
        2    0.000    0.000    0.000    0.000 query.py:1430(filter)
        2    0.000    0.000    0.000    0.000 query.py:1446(_filter_or_exclude)
       39    0.000    0.000    0.000    0.000 base.py:293(_prepare_cursor)
        1    0.000    0.000    0.000    0.000 testcases.py:1398(__get__)
     79/1    0.000    0.000    0.000    0.000 copy.py:118(deepcopy)
        2    0.000    0.000    0.000    0.000 query.py:1457(_filter_or_exclude_inplace)
        2    0.000    0.000    0.000    0.000 query.py:1531(add_q)
     12/4    0.000    0.000    0.000    0.000 copy.py:247(_reconstruct)
        1    0.000    0.000    0.000    0.000 compiler.py:725(as_sql)
        2    0.000    0.000    0.000    0.000 query.py:1556(_add_q)
        1    0.000    0.000    0.000    0.000 query.py:1217(_update)
      122    0.000    0.000    0.000    0.000 functional.py:291(__getattribute__)
      6/1    0.000    0.000    0.000    0.000 copy.py:217(_deepcopy_dict)
        1    0.000    0.000    0.000    0.000 compiler.py:1977(execute_sql)
        2    0.000    0.000    0.000    0.000 query.py:1361(build_filter)
    14/10    0.000    0.000    0.000    0.000 compiler.py:541(compile)
        1    0.000    0.000    0.000    0.000 query.py:664(_prepare_for_bulk_create)
        1    0.000    0.000    0.000    0.000 compiler.py:78(pre_sql_setup)
   117/99    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
        1    0.000    0.000    0.000    0.000 compiler.py:1913(as_sql)
       18    0.000    0.000    0.000    0.000 related_descriptors.py:252(__set__)
        2    0.000    0.000    0.000    0.000 where.py:114(as_sql)
       36    0.000    0.000    0.000    0.000 compiler.py:1659(prepare_value)
    42/21    0.000    0.000    0.000    0.000 utils.py:42(<lambda>)
       19    0.000    0.000    0.000    0.000 base.py:1069(_prepare_related_fields_for_save)
       37    0.000    0.000    0.000    0.000 __init__.py:950(get_db_prep_save)
        2    0.000    0.000    0.000    0.000 lookups.py:343(as_sql)
        1    0.000    0.000    0.000    0.000 compiler.py:70(setup_query)
        2    0.000    0.000    0.000    0.000 lookups.py:223(as_sql)
       44    0.000    0.000    0.000    0.000 sync.py:545(get_current_task)
       39    0.000    0.000    0.000    0.000 base.py:183(queries_logged)
       61    0.000    0.000    0.000    0.000 utils.py:22(__getattr__)
        1    0.000    0.000    0.000    0.000 compiler.py:229(get_select)
        1    0.000    0.000    0.000    0.000 compiler.py:1697(assemble_as_sql)
       40    0.000    0.000    0.000    0.000 functional.py:264(inner)
       37    0.000    0.000    0.000    0.000 utils.py:36(__exit__)
       19    0.000    0.000    0.000    0.000 compiler.py:1721(<genexpr>)
       39    0.000    0.000    0.000    0.000 base.py:647(validate_thread_sharing)
        2    0.000    0.000    0.000    0.000 lookups.py:213(process_lhs)
       22    0.000    0.000    0.000    0.000 files.py:45(_get_file)
        4    0.000    0.000    0.000    0.000 query.py:1849(_chain)
        4    0.000    0.000    0.000    0.000 query.py:1860(_clone)
        5    0.000    0.000    0.000    0.000 query.py:359(chain)
    20/14    0.000    0.000    0.000    0.000 copy.py:252(<genexpr>)
        3    0.000    0.000    0.000    0.000 query.py:296(get_compiler)
       18    0.000    0.000    0.000    0.000 related.py:1119(get_db_prep_save)
       43    0.000    0.000    0.000    0.000 <frozen abc>:117(__instancecheck__)
     10/8    0.000    0.000    0.000    0.000 {method '__reduce_ex__' of 'object' objects}
       54    0.000    0.000    0.000    0.000 compiler.py:1715(<genexpr>)
        2    0.000    0.000    0.000    0.000 query.py:1293(build_lookup)
        2    0.000    0.000    0.000    0.000 lookups.py:98(process_lhs)
       39    0.000    0.000    0.000    0.000 base.py:690(make_cursor)
        2    0.000    0.000    0.000    0.000 compiler.py:1483(get_converters)
        5    0.000    0.000    0.000    0.000 query.py:314(clone)
       39    0.000    0.000    0.000    0.000 base.py:189(create_cursor)
        4    0.000    0.000    0.000    0.000 copy.py:61(copy)
        1    0.000    0.000    0.000    0.000 compiler.py:1504(results_iter)
        1    0.000    0.000    0.000    0.000 __init__.py:1914(__del__)
        1    0.000    0.000    0.000    0.000 cProfile.py:119(__exit__)
       44    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:645(parent)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       23    0.000    0.000    0.000    0.000 files.py:39(_require_file)
       19    0.000    0.000    0.000    0.000 related_descriptors.py:86(__set__)
        9    0.000    0.000    0.000    0.000 expressions.py:1138(as_sql)
       39    0.000    0.000    0.000    0.000 base.py:630(allow_thread_sharing)
        2    0.000    0.000    0.000    0.000 base.py:608(__reduce__)
       44    0.000    0.000    0.000    0.000 threading.py:1463(current_thread)
       36    0.000    0.000    0.000    0.000 compiler.py:1627(field_as_sql)
       44    0.000    0.000    0.000    0.000 {built-in method _asyncio.current_task}
        2    0.000    0.000    0.000    0.000 expressions.py:265(resolve_expression)
        1    0.000    0.000    0.000    0.000 filesystem.py:178(path)
      263    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
       43    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
        1    0.000    0.000    0.000    0.000 __init__.py:1918(close)
        2    0.000    0.000    0.000    0.000 base.py:614(__getstate__)
       36    0.000    0.000    0.000    0.000 compiler.py:1688(pre_save_val)
        1    0.000    0.000    0.000    0.000 _os.py:9(safe_join)
       94    0.000    0.000    0.000    0.000 options.py:660(get_field)
       20    0.000    0.000    0.000    0.000 __init__.py:2740(get_db_prep_value)
       21    0.000    0.000    0.000    0.000 compiler.py:519(quote_name_unless_alias)
       18    0.000    0.000    0.000    0.000 __init__.py:940(get_db_prep_value)
        1    0.000    0.000    0.000    0.000 compiler.py:950(get_default_columns)
       38    0.000    0.000    0.000    0.000 mixins.py:21(is_cached)
       39    0.000    0.000    0.000    0.000 {method 'cursor' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.000    0.000 __init__.py:2034(_fpclose)
        2    0.000    0.000    0.000    0.000 expressions.py:421(copy)
        1    0.000    0.000    0.000    0.000 task.py:1111(push_request)
        1    0.000    0.000    0.000    0.000 __init__.py:1604(get_db_prep_value)
       25    0.000    0.000    0.000    0.000 query.py:1769(db)
       43    0.000    0.000    0.000    0.000 utils.py:218(_route_db)
      257    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
      6/4    0.000    0.000    0.000    0.000 functional.py:49(__get__)
        2    0.000    0.000    0.000    0.000 query.py:1222(solve_lookup_type)
        1    0.000    0.000    0.000    0.000 {method 'close' of '_io.BufferedReader' objects}
        4    0.000    0.000    0.000    0.000 query.py:1018(get_initial_alias)
       99    0.000    0.000    0.000    0.000 utils.py:70(__exit__)
        1    0.000    0.000    0.000    0.000 where.py:38(split_having_qualify)
       18    0.000    0.000    0.000    0.000 __init__.py:1229(get_prep_value)
        3    0.000    0.000    0.000    0.000 manager.py:150(get_queryset)
        1    0.000    0.000    0.000    0.000 task.py:99(__init__)
        2    0.000    0.000    0.000    0.000 compiler.py:2087(cursor_iter)
        1    0.000    0.000    0.000    0.000 query.py:1738(using)
        1    0.000    0.000    0.000    0.000 base.py:574(from_db)
        2    0.000    0.000    0.000    0.000 lookups.py:25(__init__)
       20    0.000    0.000    0.000    0.000 __init__.py:2048(get_prep_value)
        8    0.000    0.000    0.000    0.000 expressions.py:1154(get_db_converters)
        1    0.000    0.000    0.000    0.000 local.py:96(_get_current_object)
       37    0.000    0.000    0.000    0.000 __init__.py:930(pre_save)
        7    0.000    0.000    0.000    0.000 query.py:294(__init__)
        1    0.000    0.000    0.000    0.000 query.py:1653(order_by)
       40    0.000    0.000    0.000    0.000 dispatcher.py:152(send)
        1    0.000    0.000    0.000    0.000 functional.py:455(partition)
        2    0.000    0.000    0.000    0.000 compiler.py:2093(<lambda>)
       18    0.000    0.000    0.000    0.000 related_descriptors.py:201(__get__)
       19    0.000    0.000    0.000    0.000 base.py:149(endswith_cr)
        1    0.000    0.000    0.000    0.000 query.py:671(_check_bulk_create_options)
        2    0.000    0.000    0.000    0.000 expressions.py:386(get_lookup)
        1    0.000    0.000    0.000    0.000 task.py:104(_get_custom_headers)
        1    0.000    0.000    0.000    0.000 __init__.py:66(task_by_cons)
        2    0.000    0.000    0.000    0.000 compiler.py:1493(apply_converters)
        2    0.000    0.000    0.000    0.000 lookups.py:108(process_rhs)
       56    0.000    0.000    0.000    0.000 query_utils.py:164(__get__)
        1    0.000    0.000    0.000    0.000 operations.py:362(bulk_insert_sql)
        2    0.000    0.000    0.000    0.000 query.py:1768(setup_joins)
      4/2    0.000    0.000    0.000    0.000 {built-in method builtins.any}
       36    0.000    0.000    0.000    0.000 __init__.py:960(get_default)
       10    0.000    0.000    0.000    0.000 __init__.py:492(get_col)
        8    0.000    0.000    0.000    0.000 operations.py:294(get_db_converters)
        2    0.000    0.000    0.000    0.000 query_utils.py:232(get_lookup)
        2    0.000    0.000    0.000    0.000 lookups.py:328(get_prep_lookup)
       78    0.000    0.000    0.000    0.000 {built-in method _asyncio.get_running_loop}
        4    0.000    0.000    0.000    0.000 query.py:1652(names_to_path)
        2    0.000    0.000    0.000    0.000 transaction.py:18(get_connection)
       23    0.000    0.000    0.000    0.000 base.py:25(__bool__)
        2    0.000    0.000    0.000    0.000 __init__.py:798(db_type)
       44    0.000    0.000    0.000    0.000 {method 'rpartition' of 'str' objects}
        2    0.000    0.000    0.000    0.000 query.py:2640(update_join_types)
       39    0.000    0.000    0.000    0.000 base.py:588(close_if_health_check_failed)
        2    0.000    0.000    0.000    0.000 base.py:2521(model_unpickle)
        5    0.000    0.000    0.000    0.000 where.py:209(clone)
       18    0.000    0.000    0.000    0.000 base.py:657(_set_pk_val)
       18    0.000    0.000    0.000    0.000 __init__.py:688(get_pk_value_on_save)
        2    0.000    0.000    0.000    0.000 lookups.py:242(get_db_prep_lookup)
      114    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        4    0.000    0.000    0.000    0.000 query.py:231(__init__)
       39    0.000    0.000    0.000    0.000 __init__.py:934(get_prep_value)
        1    0.000    0.000    0.000    0.000 operations.py:260(adapt_datetimefield_value)
       99    0.000    0.000    0.000    0.000 utils.py:67(__enter__)
        1    0.000    0.000    0.000    0.000 operations.py:80(fetch_returned_insert_rows)
        1    0.000    0.000    0.000    0.000 __init__.py:12(get_importer)
        1    0.000    0.000    0.000    0.000 where.py:242(contains_aggregate)
       16    0.000    0.000    0.000    0.000 operations.py:187(quote_name)
        1    0.000    0.000    0.000    0.000 __init__.py:1584(get_prep_value)
       11    0.000    0.000    0.000    0.000 __init__.py:615(__eq__)
        2    0.000    0.000    0.000    0.000 copy.py:200(_deepcopy_tuple)
        3    0.000    0.000    0.000    0.000 operations.py:311(convert_datetimefield_value)
       21    0.000    0.000    0.000    0.000 utils.py:93(__call__)
      2/1    0.000    0.000    0.000    0.000 where.py:236(_contains_aggregate)
       19    0.000    0.000    0.000    0.000 operations.py:364(<genexpr>)
        2    0.000    0.000    0.000    0.000 query.py:2612(__init__)
        2    0.000    0.000    0.000    0.000 threads.py:204(push)
       83    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000    0.000    0.000 transaction.py:182(__enter__)
        1    0.000    0.000    0.000    0.000 <frozen ntpath>:107(join)
       18    0.000    0.000    0.000    0.000 query.py:784(<lambda>)
        2    0.000    0.000    0.000    0.000 registry.py:188(get_model)
        2    0.000    0.000    0.000    0.000 query_utils.py:213(_get_lookup)
        2    0.000    0.000    0.000    0.000 files.py:166(__get__)
        1    0.000    0.000    0.000    0.000 related.py:1184(get_col)
       39    0.000    0.000    0.000    0.000 base.py:284(ensure_connection)
       19    0.000    0.000    0.000    0.000 compiler.py:1714(<genexpr>)
        1    0.000    0.000    0.000    0.000 task.py:1114(pop_request)
        2    0.000    0.000    0.000    0.000 threads.py:214(pop)
        2    0.000    0.000    0.000    0.000 lookups.py:80(get_prep_lookup)
        2    0.000    0.000    0.000    0.000 __init__.py:781(db_type_parameters)
       18    0.000    0.000    0.000    0.000 mixins.py:24(set_cached_value)
       39    0.000    0.000    0.000    0.000 base.py:529(validate_no_broken_transaction)
        1    0.000    0.000    0.000    0.000 compiler.py:1998(pre_sql_setup)
        1    0.000    0.000    0.000    0.000 transaction.py:224(__exit__)
       18    0.000    0.000    0.000    0.000 mixins.py:12(get_cached_value)
        2    0.000    0.000    0.000    0.000 {method 'fetchmany' of 'sqlite3.Cursor' objects}
       13    0.000    0.000    0.000    0.000 copy.py:231(_keep_alive)
        2    0.000    0.000    0.000    0.000 where.py:239(<genexpr>)
       74    0.000    0.000    0.000    0.000 related.py:509(get_cache_name)
        2    0.000    0.000    0.000    0.000 query.py:379(_get_col)
        1    0.000    0.000    0.000    0.000 options.py:704(get_parent_list)
        2    0.000    0.000    0.000    0.000 <frozen ntpath>:596(abspath)
       45    0.000    0.000    0.000    0.000 {method '__exit__' of '_thread.RLock' objects}
        1    0.000    0.000    0.000    0.000 compiler.py:448(get_order_by)
       15    0.000    0.000    0.000    0.000 tree.py:22(__init__)
        1    0.000    0.000    0.000    0.000 where.py:252(contains_over_clause)
        2    0.000    0.000    0.000    0.000 query.py:1040(join)
       51    0.000    0.000    0.000    0.000 {method 'copy' of 'dict' objects}
       38    0.000    0.000    0.000    0.000 <frozen codecs>:281(getstate)
      109    0.000    0.000    0.000    0.000 {built-in method builtins.id}
        2    0.000    0.000    0.000    0.000 __init__.py:595(__init__)
        2    0.000    0.000    0.000    0.000 query_utils.py:48(__init__)
       39    0.000    0.000    0.000    0.000 utils.py:16(__init__)
        2    0.000    0.000    0.000    0.000 query.py:2633(add_votes)
        4    0.000    0.000    0.000    0.000 __init__.py:668(update)
        6    0.000    0.000    0.000    0.000 copyreg.py:98(__newobj__)
      2/1    0.000    0.000    0.000    0.000 where.py:246(_contains_over_clause)
        3    0.000    0.000    0.000    0.000 __init__.py:2780(__instancecheck__)
        5    0.000    0.000    0.000    0.000 tree.py:28(create)
        4    0.000    0.000    0.000    0.000 threads.py:134(__getattr__)
        3    0.000    0.000    0.000    0.000 compiler.py:46(__init__)
        1    0.000    0.000    0.000    0.000 base.py:515(gen_task_name)
        7    0.000    0.000    0.000    0.000 expressions.py:470(select_format)
       39    0.000    0.000    0.000    0.000 {method '__exit__' of '_thread.lock' objects}
        1    0.000    0.000    0.000    0.000 {method 'seek' of '_io.TextIOWrapper' objects}
        1    0.000    0.000    0.000    0.000 base.py:31(<lambda>)
        2    0.000    0.000    0.000    0.000 datastructures.py:267(__init__)
       62    0.000    0.000    0.000    0.000 copy.py:172(_deepcopy_atomic)
        1    0.000    0.000    0.000    0.000 timezone.py:272(make_aware)
        5    0.000    0.000    0.000    0.000 deconstruct.py:15(__new__)
       19    0.000    0.000    0.000    0.000 operations.py:363(<genexpr>)
      2/1    0.000    0.000    0.000    0.000 utils.py:46(<lambda>)
        1    0.000    0.000    0.000    0.000 __init__.py:1573(pre_save)
        2    0.000    0.000    0.000    0.000 <frozen ntpath>:51(normcase)
        1    0.000    0.000    0.000    0.000 objects.py:36(__get__)
       37    0.000    0.000    0.000    0.000 related.py:1057(target_field)
        1    0.000    0.000    0.000    0.000 expressions.py:239(contains_aggregate)
       20    0.000    0.000    0.000    0.000 base.py:440(__get__)
        4    0.000    0.000    0.000    0.000 tree.py:89(add)
       18    0.000    0.000    0.000    0.000 __init__.py:1224(to_python)
        2    0.000    0.000    0.000    0.000 where.py:249(<genexpr>)
       39    0.000    0.000    0.000    0.000 {method 'close' of 'sqlite3.Cursor' objects}
        1    0.000    0.000    0.000    0.000 compiler.py:1106(get_from_clause)
        1    0.000    0.000    0.000    0.000 __init__.py:1462(get_prep_value)
       37    0.000    0.000    0.000    0.000 utils.py:33(__enter__)
        1    0.000    0.000    0.000    0.000 subqueries.py:145(__init__)
        1    0.000    0.000    0.000    0.000 timezone.py:296(make_naive)
       18    0.000    0.000    0.000    0.000 utils.py:240(allow_relation)
        1    0.000    0.000    0.000    0.000 operations.py:250(limit_offset_sql)
        2    0.000    0.000    0.000    0.000 manager.py:184(__get__)
        1    0.000    0.000    0.000    0.000 imports.py:126(gen_task_name)
        4    0.000    0.000    0.000    0.000 query.py:890(demote_joins)
        1    0.000    0.000    0.000    0.000 contextlib.py:132(__enter__)
        1    0.000    0.000    0.000    0.000 contextlib.py:287(helper)
       36    0.000    0.000    0.000    0.000 __init__.py:108(return_None)
        2    0.000    0.000    0.000    0.000 base.py:11(__init__)
        1    0.000    0.000    0.000    0.000 testcases.py:1390(get_memo)
        2    0.000    0.000    0.000    0.000 <frozen ntpath>:179(splitroot)
       10    0.000    0.000    0.000    0.000 {method 'update' of 'dict' objects}
        9    0.000    0.000    0.000    0.000 __init__.py:860(get_db_converters)
       36    0.000    0.000    0.000    0.000 operations.py:683(modify_insert_params)
        1    0.000    0.000    0.000    0.000 copyreg.py:107(_slotnames)
       12    0.000    0.000    0.000    0.000 query.py:2482(extra_select)
       22    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 config.py:224(get_model)
        1    0.000    0.000    0.000    0.000 dispatcher.py:234(_live_receivers)
        1    0.000    0.000    0.000    0.000 operations.py:401(return_insert_columns)
        2    0.000    0.000    0.000    0.000 <frozen ntpath>:564(normpath)
       20    0.000    0.000    0.000    0.000 query.py:310(query)
       17    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {built-in method _csv.reader}
        3    0.000    0.000    0.000    0.000 operations.py:363(compiler)
        1    0.000    0.000    0.000    0.000 files.py:18(__init__)
        2    0.000    0.000    0.000    0.000 {method 'replace' of 'datetime.datetime' objects}
        1    0.000    0.000    0.000    0.000 expressions.py:245(contains_over_clause)
        4    0.000    0.000    0.000    0.000 query.py:1280(check_filterable)
       10    0.000    0.000    0.000    0.000 query.py:305(get_meta)
       13    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x00007FFCDF8413D0}
        1    0.000    0.000    0.000    0.000 base.py:665(get_deferred_fields)
        3    0.000    0.000    0.000    0.000 compiler.py:1742(<genexpr>)
        2    0.000    0.000    0.000    0.000 base.py:631(__setstate__)
        2    0.000    0.000    0.000    0.000 threads.py:94(release_local)
        2    0.000    0.000    0.000    0.000 query_utils.py:223(get_instance_lookups)
        1    0.000    0.000    0.000    0.000 operations.py:396(insert_statement)
        1    0.000    0.000    0.000    0.000 query.py:909(reset_refcounts)
        2    0.000    0.000    0.000    0.000 {built-in method builtins.next}
        2    0.000    0.000    0.000    0.000 expressions.py:185(__getstate__)
       21    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 contextlib.py:141(__exit__)
        2    0.000    0.000    0.000    0.000 {built-in method nt._getfullpathname}
        1    0.000    0.000    0.000    0.000 transaction.py:316(atomic)
        1    0.000    0.000    0.000    0.000 {built-in method now}
        2    0.000    0.000    0.000    0.000 lookups.py:69(get_source_expressions)
        2    0.000    0.000    0.000    0.000 query.py:818(table_alias)
        6    0.000    0.000    0.000    0.000 query.py:2101(is_sliced)
       19    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}
       19    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
        3    0.000    0.000    0.000    0.000 timezone.py:246(is_aware)
        2    0.000    0.000    0.000    0.000 query.py:1893(trim_joins)
        1    0.000    0.000    0.000    0.000 __init__.py:1436(pre_save)
        2    0.000    0.000    0.000    0.000 <string>:1(<lambda>)
       18    0.000    0.000    0.000    0.000 operations.py:502(validate_autopk_value)
        2    0.000    0.000    0.000    0.000 timezone.py:259(is_naive)
        1    0.000    0.000    0.000    0.000 datastructures.py:10(__init__)
        2    0.000    0.000    0.000    0.000 threads.py:140(__setattr__)
        2    0.000    0.000    0.000    0.000 query.py:1202(resolve_lookup_value)
        8    0.000    0.000    0.000    0.000 {built-in method greenlet._greenlet.getcurrent}
        5    0.000    0.000    0.000    0.000 {method 'utcoffset' of 'datetime.datetime' objects}
        1    0.000    0.000    0.000    0.000 contextlib.py:104(__init__)
        1    0.000    0.000    0.000    0.000 related.py:1178(get_db_converters)
        1    0.000    0.000    0.000    0.000 query.py:1032(count_active_tables)

... Cut short because the message is too long.

The insertion of the data is still the slowest part, but it does make sense that it takes some time as there is a lot of it. It’s also currently used with a fairly small batch_size, since I’m using Sqlite at the moment (I plan on switching to Postgres soon, but the project is still in very early development and I like to get started with a lighter stack).

If you add the time spent in _insert_cells and the time spent in the SQLiteCursorWrapper, that is half the time though.

Here is the code for the actual insertion, in case you can quickly see at a glance something I could improve:

    def _insert_cells(self, cells: list[tuple[MakeshiftRow, Column, str]]) -> None:
        with connection.cursor() as cursor:
            cursor.execute(
                f"""
                INSERT INTO {Cell._meta.db_table} (
                    {Cell._meta.get_field('row').column},
                    {Cell._meta.get_field('column').column},
                    {Cell._meta.get_field('value').column}
                )
                VALUES
                    {','.join(
                        "(%s, %s, %s)"
                        for _ in range(len(cells))
                    )}
                """,
                [
                    v.pk if isinstance(v, MakeshiftRow | Column) else v  # type: ignore[misc]
                    for cell_values in cells
                    for v in cell_values
                ],
            )

For context, MakeshiftRow is a class I made that serves to track the object IDs the same way the Model usually would. It is defined like this:

class MakeshiftRow:
    def __init__(self, import_file_id: int) -> None:
        self.import_file_id = import_file_id
        self.pk: int | None = None

I basically assign use this in place of the Row model as a very lightweight holder until I get the ID upon creation of the rows in the database:

    def _insert_rows(self, rows: MakeshiftManager) -> None:
        with connection.cursor() as cursor:
            cursor.execute(
                f"""
                INSERT INTO {Row._meta.db_table} (
                    {Row._meta.get_field('import_file').column}
                )
                VALUES
                    {','.join("(%s)" for _ in range(len(rows)))}
                RETURNING {Row._meta.get_field('id').column}
                """,
                [row.import_file_id for row in rows],
            )
            ids = cursor.fetchall()
            rows.set_ids([id_[0] for id_ in ids])

And MakeshiftManager is just a list with an added set_ids method for convenience:

class MakeshiftManager(list):
    def set_ids(self, ids: list[int]) -> None:
        for row, id_ in zip(self, ids):
            row.pk = id_

If you see something I missed, I’ll be happy to hear it! Thanks for taking the time to respond, and at the very least I’ll have learned something about kwargs and args for Model instantiation!

1 Like