A legacy system’s report table is slow because it contains customized columns with EAV (entity attribute value). The current system behaviour is to:
- With a set of transactions referred to as the entity, we query the
eav_value
andeav_attribute
tables for the relevant EAV data. - When initializing the Python serializer, the program loops through the EAV QuerySet and builds a dictionary for random access with the
entity_id
as the key. - Overwrite the serializer’s
to_representation()
function to change its default behaviour. The input parameterinstance
is one single transaction, also called one entity, and the transaction’s ID is theentity_id
of EAV. Therefore, we look up the entity’s attributes and values in the above dictionary.
Our system currently uses the original django-eav, not Django EAV 2, although we have fixed compatibility issues and upgraded our forked repository to Python 3.
Despite the arguments about the Pro and Con of EAV, we still hope to improve the system performance within the EAV framework for stability if doable.
We suspect storing many EAV records in a randomly accessed dictionary is inefficient. Usually, the database can more efficiently handle records than an in-memory dictionary. So, we want to improve it by querying EAV data in two steps:
- The first database query reads in the metadata: the report lists transactions of the same product, so we retrieve the attributes related to this product and the data types.
- Then, make an adaptive QuerySet corresponding to the metadata, query the database for the entity data and append the related EAV data by
left outer join
.
As a POC (proof of concept), we made a hand-written SQL query with recursive CTEs and tested it on MariaDB v11.5. Please find the sample source code in the Technical Details section with comments as explanations, and let us know for any questions.
The POC has the same time complexity as the old solution. Given page size M and N attributes, the task takes O(M x N). For space complexity, I’m not sure about the dictionary’s implementation space-wise, including hashing, pseudorandom, loading factor, etc., while the SQL query should still be O(M x N).
- We hope the database handles records more efficiently than an in-memory dictionary so that when the page size M gets to a big number, e.g. tens of thousands, the program still runs reasonably fast with the help of database indexing. In contrast, the dictionary can hit a memory bottleneck, so the task takes longer.
- We are also considering global searching for the customized columns, so EAV may provide more convenience than wrapping up everything into a text field of JSON.
- We also need a generic solution within Django’s ecosystem, not the raw SQL query for demonstration.
Our Question:
Besides the raw SQL query above, is it possible to implement the same by Django’s existing functions?
We are open to Dango-EAV2 and other alternative solutions, and we highly appreciate any hints and suggestions.
Technical Details:
1. The relevant part of the Python serializer:
...
class ReportSerializer(serializers.ModelSerializer):
eav_set = None
...
def __init__(self, *args, **kwargs):
...
value_qs = Value.objects.filter(entity_ct=content_type,
entity_id__in=trans_ids).select_related('attribute')
eav_set = {}
for value in value_qs:
eav_set_value = eav_set.get(value.entity_id)
if eav_set_value:
eav_set[value.entity_id].append(value)
else:
eav_set[value.entity_id] = [value]
self.eav_set = eav_set
def to_representation(self, instance):
...
if self.eav_set:
eav_tran_set = self.eav_set.get(instance.id)
if eav_tran_set:
for value in eav_tran_set:
ret_val[value.attribute.slug] = value.value
else:
val_dict = instance.eav.get_values_dict()
for key, val in six.iteritems(val_dict):
ret_val[key] = val
return ret_val
...
2. Proof of Concept with a hand-written SQL query:
We referred to Claire’s online article Flattening an EAV model in SQL, the DRY way aiming at a specific scenario. Our query is only a shallow example to demonstrate the thoughts, though. And we can make it more efficient if you have questions, e.g.:
- whether we can do it without CTE,
- what CTE’s system behaviour is
- what is the query execution plan
- etc. So, for anything, just let us know.
with recursive
-- This CTE simulates a partial set of transactions with pagination,
-- with page size five and offset ten, we are querying for the 3rd page.
--
cte_transaction as (
select
`app_transaction`.`id`
, `app_transaction`.`create_ts`
from
`app_transaction`
inner join `app_product` on
(`app_transaction`.`product_id` = `app_product`.`id`)
where
(not `app_transaction`.`rec_del`
and `app_transaction`.`product_id` = __PRODUCT_ID__)
order by
`app_transaction`.`id` desc
limit 5 OFFSET 10
)
, cte_meta_attribute as (
select
''
, eava.slug
, eava.datatype
from
app_product ep
inner join app_producttemplate ept
on ep.product_template_id = ept.id
inner join app_transactionattribute eta
on ept.id = eta.product_template_id
inner join eav_attribute eava
on eta.attribute_id = eava.id
where
ep.id = __PRODUCT_ID__
)
--
-- |slug |datatype |
-- +---------------+---------+
-- |birthdate |date |
-- |phone |text |
-- |firstname |text |
-- |lastname |text |
-- |passtype |text |
-- |postalcode |text |
-- |reuse_photos |text |
-- |clubname |text |
-- |firstnamewaiver|text |
-- |lastnamewaiver |text |
-- |countrycode |text |
-- |waiver |text |
--
, cte_eav_set as (
select
cte_transaction.id as transaction_id
, eav_attribute.datatype
, eav_attribute.slug
, eav_value.*
from
cte_transaction
inner join eav_value
on cte_transaction.id = eav_value.entity_id
inner join eav_attribute
on eav_value.attribute_id = eav_attribute.id
)
, cte_eav_set_postalcode as (
select
cte_eav_set.transaction_id
, cte_eav_set.value_text
from
cte_eav_set
where
cte_eav_set.slug = 'postalcode'
)
, cte_eav_set_birthdate as (
select
cte_eav_set.transaction_id
, cte_eav_set.value_date
from
cte_eav_set
where
cte_eav_set.slug = 'birthdate'
)
--
-- There are more EAV attributes and CTEs, and we can use the
-- metadata, cte_meta_attribute, to generate the SQL queries automatically, e.g.:
--
-- cte_eav_set_birthdate
-- cte_eav_set_phone
-- cte_eav_set_firstname
-- cte_eav_set_lastname
-- cte_eav_set_passtype
-- cte_eav_set_postalcode
-- cte_eav_set_reuse_photos
-- cte_eav_set_clubname
-- cte_eav_set_firstnamewaiver
-- cte_eav_set_lastnamewaiver
-- cte_eav_set_countrycode
-- cte_eav_set_waiver
--
select
cte_transaction.*
, cte_eav_set_postalcode.value_text as 'postalcode'
, cte_eav_set_birthdate.value_date as 'birthdate'
from
cte_transaction
left outer join cte_eav_set_postalcode
on cte_transaction.id = cte_eav_set_postalcode.transaction_id
left outer join cte_eav_set_birthdate
on cte_transaction.id = cte_eav_set_birthdate.transaction_id
--
-- ... more left outer join's for the other attributes, e.g.
-- left outer join cte_eav_set_phone ...
-- ...
;
-- Final Results:
-- id |create_ts |postalcode|birthdate ||Others... |
-- ------+-----------------------+----------+----------++----------+
-- xxxxxx|xxxx-xx-xx xx:xx:xx.xxx|XXXXXX |xxxx-xx-xx||... ... |
-- xxxxxx|xxxx-xx-xx xx:xx:xx.xxx|XXXXXX |xxxx-xx-xx||... ... |
-- xxxxxx|xxxx-xx-xx xx:xx:xx.xxx|XXXXXX |xxxx-xx-xx||... ... |
-- xxxxxx|xxxx-xx-xx xx:xx:xx.xxx|XXXXXX |xxxx-xx-xx||... ... |
-- xxxxxx|xxxx-xx-xx xx:xx:xx.xxx|XXXXXX |xxxx-xx-xx||... ... |
--