Deduplicate rows but prefer rows where another field is not null

My object data_model.fields has 3 relevant columns:

name identifier tenant
customer customer_table#customer AAAA
customer generic_customer_table#customer null
location generic_customer_table#location null
rebate customer_table#rebate AAAA
rebate customer_table#rebate BBBB

Note that

  • for the same name the identifier might be different
  • name might be duplicated
  • tenant might be null, or multiple different tenants for the same name or/and identifier

I need to query (identifier, name) with the following two conditions:

  1. For a single query I am only interested in rows for a target_tenant (e.g. AAAA) or where tenant is null
  2. I want one row per unique name
  3. There is a preference behavior how to choose the identifier
    3.1 If there is only one row for a unique name with tenant==null use it.
    3.2 If there is one row for a unique name with tenant==target_target use it
    3.3 If there are two entries with the same name but one has tenant==null and one has tenant==target_tenant then prefer the one which has tenant set

An additional thought to it: Since I need the identifier and that might be different in each row, I can’t do an easy group by.

Some queries I came up with in SQL:

Query 1

SELECT identifier, name, tenant
FROM mine_dataobject
WHERE (name, tenant) IN (
    SELECT name, MAX(tenant)
    FROM mine_dataobject 
    GROUP BY name
) OR (tenant IS NULL AND name IN (
    SELECT name
    FROM mine_dataobject
    GROUP BY name
    HAVING MAX(tenant) IS NULL
));

Query 2

WITH first AS (select name, MAX(tenant) as correct_tenant
               from mine_datafield
               group by name)

select * from mine_datafield
         join first a on public.mine_datafield.name = a.name
WHERE tenant = correct_tenant or coalesce(tenant, correct_tenant) is null

(some more options, but you get the iea)

Since then I have struggled painfully to translate this into Django ORM.
Mainly ChatGPT came up with this monster, but I am not quite sure about it.
It seems to work, but also be a little convoluted.

        max_tenant_subquery = (
            data_model.fields
            .values('name')
            .filter(Q(tenant__isnull=True) | Q(tenant=self.targettenant))
            .annotate(max_tenant=Max('tenant'))
            .values('name', 'max_tenant')
        )

        # Subquery to get names where MAX(tenant) is NULL
        names_with_null_tenant_subquery = (
            data_model.fields
            .values('name')
            .filter(Q(tenant__isnull=True) | Q(tenant=self.target_tenant))
            .annotate(max_tenant=Max('tenant'))
            .filter(max_tenant__isnull=True)
            .values('name')
        )

        # Main query
        result = (
            data_model.fields
            .filter(Q(tenant__isnull=True) | Q(tenant=self.target_tenant))
            .filter(
                Q(
                    Exists(
                        max_tenant_subquery.filter(
                            name=OuterRef('name'),
                            max_tenant=OuterRef('tenant')
                        )
                    )
                ) | Q(
                    tenant__isnull=True,
                    name__in=names_with_null_tenant_subquery
                )
            )
            .values('identifier', 'name')
        )
        return result

1. Preliminary assumptions:

  • For a simple working example I removed “.filter(Q(tenant__isnull=True) | Q(tenant=self.target_tenant))” since it makes the example much easier to understand and does not change the important logic of deduplicating:
max_tenant_subquery = (
    data_model.fields
    .values('name')  # select name and group by name
    .annotate(max_tenant=Max('tenant'))  # aggragate function max(tenant)
    .values('name', 'max_tenant')  # returned fieldnames
)
print(max_tenant_subquery)
# Main query
result = (
    data_model.fields
    .filter(
        Q(
            Exists(
                max_tenant_subquery.filter(
                    name=OuterRef('name'),
                    max_tenant=OuterRef('tenant')
                )
            )
        ) | Q(
            tenant__isnull=True,
            name__in=names_with_max_tenant_null_subquery
        )
    )
    .values('identifier', 'name', 'tenant')
)
  • Assuming the ORM is configured in a way that data_model.fields yields “objects” in the classical sense (class MineDataObject(models.Model) and followingly data_model.fields = MineDataObject.objects) the following holds:

While the query does yield the results you want, you might consider a simpler approach by defining two subqueries beforehand, strictly separating the cases in which there is a tenant available (and MAX(tenant) yields a result) from the cases in which there is no tenant (and MAX(tenant) yields null). Then the first Q filter is very similar to the second one, but without the Exists and the extra subquery:

A. Approach 1 (simpler)

A.1 Subquery to get names where MAX(tenant) is NULL

names_with_max_tenant_null_subquery = (
    data_model.fields
    .values('name')
    .annotate(max_tenant=Max('tenant'))
    .filter(max_tenant__isnull=True)
    .values('name')
)
print(names_with_max_tenant_null_subquery)

A.2 Subquery to get names where MAX(tenant) is NOT NULL

names_with_max_tenant_not_null_subquery = (
    data_model
    .values('name')
    .annotate(max_tenant=Max('tenant'))
    .filter(max_tenant__isnull=False)
    .values('name')
)
print(names_with_max_tenant_not_null_subquery)

A.3 Filter using the previous subqueries and select identifier as well!

result = (
    data_model.fields
    .filter(
        Q(
            tenant__isnull=False,
            name__in=names_with_max_tenant_not_null_subquery
        ) | Q(
            tenant__isnull=True,
            name__in=names_with_max_tenant_null_subquery
        )
    )
    .values('identifier', 'name', 'tenant')
)
print(result)

Evaluated into SQL (print(result.query)) this approach yields:

SELECT "identifier",
    "name",
    "tenant"
FROM "mine_dataobject"
WHERE (
        (
            "name" IN (
                SELECT MDO_SUB."name"
                FROM "mine_dataobject" MDO_SUB
                GROUP BY MDO_SUB."name"
                HAVING MAX(MDO_SUB."tenant") IS NOT NULL
            )
            AND "tenant" IS NOT NULL
        )
        OR (
            "name" IN (
                SELECT MDO_SUB."name"
                FROM "MDO_SUB" U_INNER
                GROUP BY MDO_SUB."name"
                HAVING MAX(MDO_SUB."tenant") IS NULL
            )
            AND "tenant" IS NULL
        )
    )

NOTE 1: I changed django’s alias “U0” to “MDO_SUB” to emphasize “mine_dataobject of the subquery”. And removed the cluttering “mine_dataobject” in front of every other column from the main table.
NOTE 2: This approach is robust against sqlite3 AND postgres it seems. Very interestingly your approach seems to work with postgres, but NOT with sqlite!

B. Your approach with django to sql translation

Consider the three “seemingly identical approaches” (two identical correct results for postgres one incorrect, but three different results for sqlite3, only one correct) - hint → look at the “outer” table / alias for it:

a. Direct django to sql translation → no alias for main query:

SELECT "mine_dataobject"."identifier",
    "mine_dataobject"."name",
    "mine_dataobject"."tenant"
FROM "mine_dataobject"
WHERE (
        EXISTS(
            SELECT 1 AS "a"
            FROM "mine_dataobject" MDO_SUB
            WHERE MDO_SUB."name" = ("mine_dataobject"."name")
            GROUP BY MDO_SUB."name",
                MDO_SUB."tenant"
            HAVING MAX(MDO_SUB."tenant") = ("mine_dataobject"."tenant")
            LIMIT 1
        )
        OR (
            "mine_dataobject"."name" IN (
                SELECT MDO_SUB."name"
                FROM "mine_dataobject" MDO_SUB
                GROUP BY MDO_SUB."name"
                HAVING MAX(MDO_SUB."tenant") IS NULL
            )
            AND "mine_dataobject"."tenant" IS NULL
        )
    )

→ SUCCESSFUL FOR POSTGRES
→ fails with SQLITE_ERROR: no such column: minedataobject.tenant

b. Removing the outer table information, no alias:

SELECT "identifier",
    "name",
    "tenant"
FROM "mine_dataobject"
WHERE (
        EXISTS(
            SELECT 1 AS "a"
            FROM "mine_dataobject" MDO_SUB
            WHERE MDO_SUB."name" = ("name")
            GROUP BY MDO_SUB."name",
                MDO_SUB."tenant"
            HAVING MAX(MDO_SUB."tenant") = ("tenant")
            LIMIT 1
        )
        OR (
            "name" IN (
                SELECT MDO_SUB."name"
                FROM "mine_dataobject" MDO_SUB
                GROUP BY MDO_SUB."name"
                HAVING MAX(MDO_SUB."tenant") IS NULL
            )
            AND "tenant" IS NULL
        )
    )

→ INCORRECT RESULT for POSTGRES AND SQLITE (it seems that all data is returned, the duplicate rows are not filtered out)

c. Alias for main table:

SELECT U_OUTER."identifier",
    U_OUTER."name",
    U_OUTER."tenant"
FROM "mine_dataobject" U_OUTER
WHERE (
        EXISTS(
            SELECT 1 AS "a"
            FROM "mine_dataobject" MDO_SUB
            WHERE MDO_SUB."name" = (U_OUTER."name")
            GROUP BY MDO_SUB."name",
                MDO_SUB."tenant"
            HAVING MAX(MDO_SUB."tenant") = (U_OUTER."tenant")
            LIMIT 1
        )
        OR (
            U_OUTER."name" IN (
                SELECT MDO_SUB."name"
                FROM "mine_dataobject" MDO_SUB
                GROUP BY MDO_SUB."name"
                HAVING MAX(MDO_SUB."tenant") IS NULL
            )
            AND U_OUTER."tenant" IS NULL
        )
    )

→ Both POSTGRES AND SQLITE yield CORRECT result (filtering the duplicates)

I only became aware of this because I tried a quick and dirty setup with django using sqlite3, and receiving the error mentioned above… I hope this helps, even though I cannot explain what exactly goes wrong on driver level when using Exists() in this context …

Note: I was using

sqlite3 - version 2.6.0

PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit"