De-serialization natural-key cache

When loading a fixture (or other data) with natural keys in relations, Django performs a SELECT operation for each reference to a related model. If I have a fixture in which there are 1000 references to the same natural key, this leads to 1000 SELECT queries for the same record!

Is there a way to avoid such repetitive and useless operations while keeping the natural-key behaviour?
Has anyone ever requested such an optimization in fixture loading?

My goal is to avoid keeping IDs in fixtures whenever possible, to avoid problems caused by different ID assignment in different environments (i.e. natural keys are the same in both environments but IDs are different).

Hi @sevdog

I guess you could implement get_by_natural_key() to cache its lookups? :thinking:

Hi @carltongibson, I was thinking about this, but I do not feel comfortable using a cache at that level, because then I would also have to handle cache invalidation due to an update from another host/process/thread.
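The invalidation concern can be illustrated with a small Django-free sketch. Everything here is a hypothetical stand-in: `TABLE` plays the role of a database table and `get_pk_by_natural_key` plays the role of a memoized `get_by_natural_key()`; it is not how Django itself behaves.

```python
from functools import lru_cache

# Hypothetical stand-in for a database table: natural key -> primary key.
TABLE = {"editors": 1}


@lru_cache(maxsize=None)
def get_pk_by_natural_key(name: str) -> int:
    """Process-wide memoized lookup (stands in for a cached get_by_natural_key)."""
    return TABLE[name]


first = get_pk_by_natural_key("editors")  # real lookup, result is now cached

# Simulate another process deleting and recreating the row with a new pk.
TABLE["editors"] = 42

stale = get_pk_by_natural_key("editors")  # served from the cache, not the table
get_pk_by_natural_key.cache_clear()       # explicit invalidation is required
fresh = get_pk_by_natural_key("editors")  # looked up again after clearing
```

A cache that outlives a single load therefore needs some invalidation strategy, which is the extra work mentioned above.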

Also, when the fixture contains models whose source I do not control (e.g. auth.Group or some auxiliary package), it is not possible to add an effective cache without some monkey-patching, which feels like a bad option and may lead to other issues.

I believe that, for the purpose of data loading, temporary caching in deserialize_m2m_values and deserialize_fk_value could be a good solution. (I also noticed that there is some similar code in the two functions; before checking the code I believed that the first would call the latter.)
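As a rough illustration of what a temporary, per-load cache means, here is a minimal sketch that runs without Django. `NaturalKeyCache`, `fake_select` and `FAKE_TABLE` are hypothetical names made up for this example; in a real loader the lookup callable would wrap something like a manager's `get_by_natural_key`.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class NaturalKeyCache:
    """Memoize natural-key -> pk lookups for the lifetime of this object."""

    def __init__(self, lookup: Callable[..., Any]) -> None:
        self._lookup = lookup
        self._seen: Dict[Tuple[Hashable, ...], Any] = {}
        self.misses = 0  # number of real lookups performed

    def get(self, *natural_key: Hashable) -> Any:
        if natural_key not in self._seen:
            self.misses += 1
            self._seen[natural_key] = self._lookup(*natural_key)
        return self._seen[natural_key]


# Hypothetical stand-in for a table of groups keyed by name.
FAKE_TABLE = {("editors",): 1, ("admins",): 2}


def fake_select(*natural_key: Hashable) -> int:
    """Pretend to run one SELECT per call."""
    return FAKE_TABLE[natural_key]


cache = NaturalKeyCache(fake_select)  # create one cache per fixture load
pks = [cache.get("editors") for _ in range(1000)]
```

Because the cache object is discarded when the load finishes, no cross-process invalidation is needed: 1000 references to the same key cost a single lookup.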

However, I understand that this kind of change could only come in a future release of Django, while I would like to find something that can help me in the near term to reduce the overhead of fixture loading while keeping the readability and portability of natural keys.

Yes, quick changes aren’t something we can do, so you’ll need to use some workaround if you need it now.

Worth checking out django-import-export, which allows a bit more sophistication…

https://django-import-export.readthedocs.io/en/latest/

That is interesting, however I feel that it still suffers from the same problem: there is no caching for FK/M2M, so it would perform X SELECT queries if there are X references to the same natural key.

To work around this for my use case I had to override the default serializers that I use in my project (yaml/json) with SERIALIZATION_MODULES. I had to intercept the data stream before it is passed to django.core.serializers.python.Deserializer and put a simple yet effective cache layer there, mapping model/natural-key/pk. This layer is only valid for a single fixture load and could be improved for better usage within the loaddata management command.
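For reference, registering such overrides in the project settings might look like the fragment below. The dotted path `myserializers.json` is inferred from the file name used in this post, and `myserializers.yaml` is an assumed sibling module that is not shown here.

```python
# settings.py (fragment)
# Map serialization format names to the modules that implement them.
# Each module is expected to expose a Serializer class and a Deserializer callable.
SERIALIZATION_MODULES = {
    "json": "myserializers.json",
    "yaml": "myserializers.yaml",  # assumed module, built the same way as the JSON one
}
```

With this in place, loaddata should pick up the custom deserializer for fixtures in those formats.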

Here are the JSON module and the cache layer:

"""myserializers/json.py"""
import json
from django.core.serializers.base import DeserializationError
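# Re-exported unchanged: a serialization module must provide Serializer as well as Deserializer.
from django.core.serializers.json import Serializer  # noqa: F401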
from django.core.serializers.python import Deserializer as PythonDeserializer
from django.db import DEFAULT_DB_ALIAS
from .cache import cached_replace_natural_keys


def Deserializer(stream_or_string, **options):
    """Deserialize a stream or string of JSON data."""
    if not isinstance(stream_or_string, (bytes, str)):
        stream_or_string = stream_or_string.read()
    if isinstance(stream_or_string, bytes):
        stream_or_string = stream_or_string.decode()
    try:
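        objects = json.loads(stream_or_string)
        # Resolve repeated natural keys to pks once, before the Python deserializer sees them.
        using = options.get('using', DEFAULT_DB_ALIAS)
        yield from PythonDeserializer(cached_replace_natural_keys(objects, using), **options)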
    except (GeneratorExit, DeserializationError):
        raise
    except Exception as exc:
        raise DeserializationError() from exc


"""myserializers/cache.py"""
from collections import defaultdict
from collections.abc import Iterable
from django.apps import apps
from django.core.serializers import base
from django.db import models


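def _is_natural_key(manager, field_value):
    """Return True if field_value looks like a natural key (a non-string iterable)
    and the manager is able to resolve natural keys."""
    return (
        hasattr(manager, "get_by_natural_key")
        and hasattr(field_value, "__iter__")
        and not isinstance(field_value, str)
    )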


def cached_replace_natural_keys(object_list: Iterable[dict], using: str) -> Iterable[dict]:
    natural_key_cache = defaultdict(dict)
    related_fields_cache = {}  # Model: <set of relational field objects>

    for data in object_list:
        # Look up the model for this entry.
        try:
            Model = apps.get_model(data["model"])  # pylint: disable=invalid-name
        except (LookupError, TypeError):
            # let it be handled by django internal serializer
            yield data
            continue

        if Model not in related_fields_cache:
            related_fields_cache[Model] = {f for f in Model._meta.get_fields() if f.remote_field}
        # get and populate related fields when there are natural keys
        for field in related_fields_cache[Model]:
            if field.name in data['fields'] and (field_value := data['fields'][field.name]) is not None:
                remote_model = field.remote_field.model
                default_manager = remote_model._default_manager
                # handle FK fields
                if isinstance(field.remote_field, models.ManyToOneRel) and _is_natural_key(default_manager, field_value):
                    field_value = tuple(field_value)
                    try:
                        data['fields'][field.name] = natural_key_cache[remote_model][field_value]
                    except KeyError:
                        # populate cache whenever the field can be found
                        # use handle_forward_references=True to avoid raising other exceptions
                        value = base.deserialize_fk_value(field, field_value, using, True)
                        if value is not base.DEFER_FIELD:
                            data['fields'][field.name] = natural_key_cache[remote_model][field_value] = value
                # handle M2M fields
                elif isinstance(field.remote_field, models.ManyToManyRel) and field_value and hasattr(default_manager, 'get_by_natural_key'):
                    new_value = []
                    for elem in field_value:
                        if _is_natural_key(default_manager, elem):
                            key = tuple(elem)
                            try:
                                elem = natural_key_cache[remote_model][key]
                            except KeyError:
                                # populate cache whenever the field is found
                                # use handle_forward_references=True to avoid raising other exceptions
                                value = base.deserialize_m2m_values(field, [elem], using, True)
                                if value is not base.DEFER_FIELD:
                                    # get first element since value is a list
                                    elem = natural_key_cache[remote_model][key] = value[0]

                        new_value.append(elem)
                    # replace field
                    data['fields'][field.name] = new_value

        yield data