When loading a fixture (or other data) that uses natural keys in relations, Django performs a SELECT query for every reference to a related model. If a fixture contains 1000 references to the same natural key, this leads to 1000 SELECT queries for the same record!
Is there a way to avoid such repetitive and useless operations while keeping the natural-key behaviour?
Has anyone ever requested such an optimization in fixture loading?
My goal was to avoid keeping IDs in fixtures whenever possible, to avoid problems caused by different ID assignments in different environments (i.e. the natural keys are the same in both environments but the IDs are different).
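To get a rough idea of how many lookups are redundant in a given fixture, something like the following sketch could be used. It works on the already-parsed fixture (a list of dicts) and does not touch Django at all. The function name `count_natural_key_references`, the model label `myapp.member` and the field name `group` are hypothetical, and the caller has to say which fields are foreign keys expressed as natural keys.

```python
from collections import Counter


def count_natural_key_references(objects, fk_fields):
    """Count occurrences of each (model, field, natural key) in a parsed fixture.

    `objects` is the list of dicts obtained by parsing a JSON/YAML fixture.
    `fk_fields` names the fields assumed to hold FK natural keys (lists).
    """
    counts = Counter()
    for obj in objects:
        fields = obj.get("fields", {})
        for name in fk_fields:
            value = fields.get(name)
            if isinstance(value, (list, tuple)):
                counts[(obj["model"], name, tuple(value))] += 1
    return counts


# Hypothetical fixture: 1000 rows referencing the same group by natural key.
fixture = [
    {"model": "myapp.member", "fields": {"name": f"user{i}", "group": ["editors"]}}
    for i in range(1000)
]

counts = count_natural_key_references(fixture, fk_fields={"group"})
# Every reference after the first one to the same key is a repeated lookup.
redundant = sum(n - 1 for n in counts.values())
```

Here `redundant` is the number of SELECT queries that a per-load cache could save, assuming one query per reference as described above.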
Hi @carltongibson, I was thinking about this, but I do not feel comfortable using a cache at that level, because then I would also have to handle cache invalidation due to an update from another host/process/thread.
Also, when the fixture contains models whose source I do not control (e.g. auth.Group or some auxiliary package), it is not possible to add an effective cache without some monkey-patching, which feels like a bad option and may lead to other issues.
I believe that, for the purpose of data loading, temporary caching in deserialize_m2m_values and deserialize_fk_value could be a good solution (I also noticed that there is some similar code in the two functions; before checking the code I believed that the first would call the latter).
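A minimal sketch of what such temporary caching could look like, written without Django so that it runs on its own. `CountingManager` is a hypothetical stand-in for a model manager that only counts simulated SELECTs, and `resolve_references` is an illustrative helper, not Django's actual deserialize_fk_value. As a simplification, the stand-in returns a pk directly, whereas a real `get_by_natural_key` returns a model instance.

```python
class CountingManager:
    """Hypothetical stand-in for a Django manager; counts simulated SELECTs."""

    def __init__(self, rows):
        self._rows = rows  # maps natural-key tuple -> pk
        self.queries = 0

    def get_by_natural_key(self, *key):
        self.queries += 1  # each call stands for one SELECT
        return self._rows[key]


def resolve_references(manager, natural_keys, cache=None):
    """Resolve natural keys to pks, memoizing in `cache` when one is given."""
    resolved = []
    for key in natural_keys:
        key = tuple(key)  # lists are not hashable, tuples are
        if cache is None:
            resolved.append(manager.get_by_natural_key(*key))
            continue
        if key not in cache:
            cache[key] = manager.get_by_natural_key(*key)
        resolved.append(cache[key])
    return resolved


references = [["editors"]] * 1000

uncached = CountingManager({("editors",): 7})
resolve_references(uncached, references)

cached = CountingManager({("editors",): 7})
# The dict lives only for this call, so nothing has to be invalidated later.
pks = resolve_references(cached, references, cache={})
```

Because the cache is created per load and discarded afterwards, the invalidation concern from other hosts/processes/threads is limited to changes that happen while the load is running.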
However, I understand that this kind of change could only come in a future release of Django, while I would like to find something that may help me in the near future to reduce the overhead of fixture loading while keeping the readability and portability of natural keys.
That is interesting; however, I feel that it still suffers from the same problem: there is no caching for FK/M2M, so it would perform X SELECT queries if there are X references to the same natural key.
To work around this for my use case, I had to override the default serializers that I use in my project (yaml/json) with SERIALIZATION_MODULES. I had to intercept the data stream before it is passed to django.core.serializers.python.Deserializer and put a simple yet effective cache layer there, mapping model/natural-key/pk. This layer is only valid for a single fixture load and could be improved for better usage within the loaddata management command.
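For reference, the override itself is a settings entry. The snippet below is a sketch that assumes the custom modules live in a package named `myserializers` importable from the project; the commented-out yaml entry is a hypothetical sibling module that is not shown in this comment.

```python
# settings.py
# Maps a serialization format name to the module that implements it.
SERIALIZATION_MODULES = {
    "json": "myserializers.json",
    # "yaml": "myserializers.yaml",  # hypothetical, would follow the same pattern
}
```

With this in place, any code that asks Django for the "json" format, including loaddata, should pick up the custom Deserializer.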
Here are the JSON module and the cache layer:
"""myserializers/json.py"""
import json

from django.core.serializers.base import DeserializationError
# Re-exported: a serialization module must provide both Serializer and Deserializer.
from django.core.serializers.json import Serializer
from django.core.serializers.python import Deserializer as PythonDeserializer
from django.db import DEFAULT_DB_ALIAS

from .cache import cached_replace_natural_keys


def Deserializer(stream_or_string, **options):
    """Deserialize a stream or string of JSON data."""
    if not isinstance(stream_or_string, (bytes, str)):
        stream_or_string = stream_or_string.read()
    if isinstance(stream_or_string, bytes):
        stream_or_string = stream_or_string.decode()
    try:
        objects = json.loads(stream_or_string)
        # Resolve natural keys through the per-load cache before Django sees them.
        using = options.get('using', DEFAULT_DB_ALIAS)
        yield from PythonDeserializer(cached_replace_natural_keys(objects, using), **options)
    except (GeneratorExit, DeserializationError):
        raise
    except Exception as exc:
        raise DeserializationError() from exc

"""myserializers/cache.py"""
from collections import defaultdict
from collections.abc import Iterable

from django.apps import apps
from django.core.serializers import base
from django.db import models


def _is_natural_key(manager, field_value):
    return (
        hasattr(manager, "get_by_natural_key")
        and hasattr(field_value, "__iter__")
        and not isinstance(field_value, str)
    )


def cached_replace_natural_keys(object_list: Iterable[dict], using: str) -> Iterable[dict]:
    # remote model: {natural key: pk} (assumes FKs target the primary key)
    natural_key_cache = defaultdict(dict)
    related_fields_cache = {}  # Model: <set of relation fields>
    for data in object_list:
        # Look up the model whose related fields should be resolved.
        try:
            Model = apps.get_model(data["model"])  # pylint: disable=invalid-name
        except (LookupError, TypeError):
            # let it be handled by Django's internal deserializer
            yield data
            continue
        if Model not in related_fields_cache:
            related_fields_cache[Model] = {f for f in Model._meta.get_fields() if f.remote_field}
        # get and populate related fields when there are natural keys
        for field in related_fields_cache[Model]:
            if field.name in data['fields'] and (field_value := data['fields'][field.name]) is not None:
                remote_model = field.remote_field.model
                default_manager = remote_model._default_manager
                # handle FK fields
                if (
                    isinstance(field.remote_field, models.ManyToOneRel)
                    and _is_natural_key(default_manager, field_value)
                ):
                    field_value = tuple(field_value)
                    try:
                        data['fields'][field.name] = natural_key_cache[remote_model][field_value]
                    except KeyError:
                        # populate the cache whenever the related object can be found
                        # use handle_forward_references=True to avoid raising other exceptions
                        value = base.deserialize_fk_value(field, field_value, using, True)
                        if value is not base.DEFER_FIELD:
                            data['fields'][field.name] = natural_key_cache[remote_model][field_value] = value
                # handle M2M fields
                elif (
                    isinstance(field.remote_field, models.ManyToManyRel)
                    and field_value
                    and hasattr(default_manager, 'get_by_natural_key')
                ):
                    new_value = []
                    for elem in field_value:
                        if _is_natural_key(default_manager, elem):
                            key = tuple(elem)
                            try:
                                elem = natural_key_cache[remote_model][key]
                            except KeyError:
                                # populate the cache whenever the related object is found
                                # use handle_forward_references=True to avoid raising other exceptions
                                value = base.deserialize_m2m_values(field, [elem], using, True)
                                if value is not base.DEFER_FIELD:
                                    # get the first element since value is a list
                                    elem = natural_key_cache[remote_model][key] = value[0]
                        # unresolved natural keys are passed through unchanged
                        new_value.append(elem)
                    # replace the field with the (partially) resolved values
                    data['fields'][field.name] = new_value
        yield data