LEFT JOIN... WHERE IS NULL clause

I know that it is strongly encouraged to use foreign key to execute such equivalent query, but in this case, I cannot.

I have an EncumbranceImport table that gets data on a daily basis through a csv file and its content has to be cleaned up before further processing. One of the cleaning process it to verify if there are any costcenter in the EncumbranceImport that do not currently exists in the CostCenters table. IF so, a list has to be made and presented to the user for further action.

I cooked up this function but clearly, the query generated does not look at all what would be expected. Tried with ~Q too. Being reasonably familiar with SQL, I wonder if I should not simply use raw query. Or that will be another source of problems?

    def missing_costcenters(self):
        qs_cc = EncumbranceImport.objects.all().values_list("costcenter")
        cc_set = set(map(lambda e: e[0], qs_cc))
        q_list = [Q(costcenter=cc) for cc in cc_set]
        print(q_list)
        qset = Costcenters.objects.filter(reduce(operator.or_, q_list))
        print(qset)
        print(qset.query)

Print output:

[<Q: (AND: ('costcenter', '8486C2'))>, <Q: (AND: ('costcenter', '8486C1'))>, <Q: (AND: ('costcenter', '8486B1'))>]
<QuerySet [<Costcenters: 8486C1 - KITCHEN>]>
SELECT "costcenters_costcenters"."id", "costcenters_costcenters"."costcenter", "costcenters_costcenters"."shortname", "costcenters_costcenters"."fund_id", "costcenters_costcenters"."source_id", "costcenters_costcenters"."isforecastable", "costcenters_costcenters"."isupdatable", "costcenters_costcenters"."note", "costcenters_costcenters"."parent_id" FROM "costcenters_costcenters" WHERE ("costcenters_costcenters"."costcenter" = 8486C2 OR "costcenters_costcenters"."costcenter" = 8486C1 OR "costcenters_costcenters"."costcenter" = 8486B1)

Let me see if I understand you correctly.

You’re reading a file to be imported into EncumbranceImport.

You want to find all costcenter values in EncumbranceImport that are not in the CostCenters table.

If I am correct, do you need any other information from the EncumbranceImport data beyond those missing Costcenter objects?

What is the size of this Costcenter model? (Roughly how many rows?) If it’s not too huge (> 1M or so) I’d be looking at something like this:

import_cc = set(EncumbranceImport.objects.values_list("costcenter", flat=True))
costcenters = set(Costcenters.objects.values_list('pk', flat=True))
missing = import_cc.difference(costcenters)

Yes, you do understand exactly the situation.
In the EncumbranceImport, there is more to be done. What is done for costcenters has to be done for funds too. If unknown cost centers are found, then the process must stop and action taken before doing anything else. There are also defaults values the need to be set in other tables according to some other fields in EncumbranceImport. Exemple, if balance reaches 0, then forecast must be set to working plan. etc.

CostCenter model is les than 1000 lines, EncumbranceImport not more than 50k lines.

While waiting for someone to answer, I came up with this

        qs_cc_import = EncumbranceImport.objects.all().values_list("costcenter")
        cc_set_import = set(map(lambda e: e[0], qs_cc_import))
        qs_cc = Costcenters.objects.all().values_list("costcenter")
        cc_set = set(map(lambda e: e[0], qs_cc))
        return cc_set_import.difference(cc_set)

I don’t know what flat=True is but looks promising.
Thanks again.

Side note: Beyond the point that flat=True makes either of these unnecessary, there’s a better alternative this:

would be this:

cc_set_import = {
   e.costcenter for e in EncumbranceImport.objects.distinct('costcenter').only('costcenter')
}

Good side note. I had seen object.distinct, but sqlite3 does not support this capability. It is a bit annoying because when I am in production with postgresql it wont be an issue.

The distinct function in this case is really just a “micro-optimization “. Removing it does not affect the fundamental functionality of the statement.