Trouble with django.db connection and Python Strings in SQL?

I have an endpoint where I’m passing in an array of ints thus:

I get it out of my endpoint request and turn it into a string like this.

list_of_ids = request.data['list_of_ids']
list_of_ids = "','".join(str(e) for e in list_of_ids)

It looks like this in debug mode, even though it prints out looking fine.

'7690\',\'10964\',\'10973\',\'10977\',\'10960\',\'10958\',\'10961\',\'10571'

I then try to punch it into my SQL statement like this (see “list_of_ids” below in the cursor.execute()):

# This is a custom query, no Django Models involved
        query = f"""
                    UPDATE staffing_change_report
                    {set_string}
                    WHERE payroll_code IN (%s)
                    AND business_process_type IN ('Hire',  'Contract Contingent Worker');
                    """
        try:
            with connection.cursor() as cursor:
            cursor.execute(query, [boolean_to_set, boolean_to_set, list_of_ids])

But the query fails without errors and when I grab it from the cursor in debug mode it looks like this with the escape characters all in place:

UPDATE staffing_change_report
                    SET hire_process_is_completed = 0, hire_init_process_is_completed = 0
                    WHERE payroll_code IN ('7690\',\'10964\',\'10973\',\'10977\',\'10960\',\'10958\',\'10961\',\'10571')
                    AND business_process_type IN ('Hire',  'Contract Contingent Worker');

I assume (but don’t know for sure) that the presence of those escape characters in the cursor’s copy of the SQL statement is the reason the query is not working right. I wonder if, because this call is also supposed to “sanitize” the inputs, the library is somehow refusing to handle the python string as expected?

I’ve tried everything I can think of to “clear” those escape characters, but I believe this is a feature of Python…? In other words, there’s no way to avoid the escape character of “\” if you have quotations in your string…

Is there a way to make this work? If I hardcode the “IN()” in the SQL, it works fine, but I cannot find a way to pass in a string or a list or a tuple that works with cursor.execute.

By the way, this is the cursor from: from django.db import connection

PS. I did find this, which does work:

                for value in my_tuple:
                    cursor.execute(query, [boolean_to_set, boolean_to_set, value])

And I’ll live with it if I have to, but this turns a query with N number of items in the IN() clause into N number of queries - something that does NOT scale well…

Thanks in advance!

First, if you look at what you’re passing:

Notice that the first and last quotes aren’t escaped. So what you’re sending is one string that starts with 7690','109....

Now, my gut reaction is that you shouldn’t be sending a list of strings anyway, if those IDs are integers. I would think that the desired line would look more like:
WHERE payroll_code IN (7690, 10964, 10973, ...)

In this case, I’d try replacing that string on the join function with ",".join(...

Thanks Ken.

That’s an interesting observation about the first and last quotes… I think I had it where they were there, but I’ve messed with it so many times I’m not sure. It’s certainly possible that what you mention is messing it up… I might go back and verify with your suggested join statement… And you’re right. I do NOT need quotes anyway! Thanks!

However, I DID come up with a solution just now that uses a Tuple and works beautifully AND does not hit the database multiple times.

My code is currently full of experiments and commented-out cruft. I will post the complete code here soon (once I’ve cleaned it up and made it readable) in case it helps someone else later on… It’s a bit tricky, but makes sense once you think it through…

OK. Here is my working example.

First, the serializer that does not use a Model:

class FixStaffingChangeReportHireFlagsSerializer(JSONSerializer):
    affected_fields = serializers.CharField(max_length=10)
    boolean_to_set = serializers.BooleanField(allow_null=False)
    list_of_ids = serializers.ListField(allow_empty=False,
                                        child=serializers.IntegerField(min_value=0, max_value=10000))

Next, my urls.py entry

url(r'^matching/ztest/', views.StaffingChangeReportFixFlagsViewSet.as_view({'post': 'post'}))

Now, the endpoint code from views.py
Note that I get the count of id’s from the list_of_ids array BEFORE I add the booleans for use by the execute so that I have the right number of %s’s ending up in the query’s IN() clause.

class StaffingChangeReportFixFlagsViewSet(viewsets.GenericViewSet):
    """
    Sets the hire_process_is_completed and or hire_init_process_is_completed fields in the staffing change report to the value in boolean_to_set
    for the payroll_codes where business_process_type is
    in ('Hire',  'Contract Contingent Worker')

    Use the following values in the JSON you submit depending on your needs.
        affected_fields options: hire, hire_init, both
        boolean_to_set options: true, false
        list_of_ids: comma delimited list of payroll_code(s) from staffing change report
    """
    parser_classes = (parsers.JSONParser,)
    serializer_class = serializers.FixStaffingChangeReportHireFlagsSerializer
    authentication_classes = []
    permission_classes = []

    def post(self, request, *args, **kwargs):
        print(request.data)

        # get the data from the request.data JSON
        affected_field = request.data['affected_fields']
        boolean_to_set = request.data['boolean_to_set']
        list_of_ids = request.data['list_of_ids']

        # do some validation
        if affected_field not in ['hire', 'hire_init', 'both']:
            return Response(status=404, data={'Error': 'affected_field must be one of [hire, hire_init, both]'})

        if boolean_to_set not in [True, False]:
            return Response(status=404, data={'Error': 'boolean_to_set must be one of [true, false]'})

        if list_of_ids is None or len(list_of_ids) == 0:
            return Response(status=404, data={'Error': 'list_of_ids cannot be empty'})

        # not letting users pass in actual column names.  Setting up dict for columns and %s in the query
        field_name_dict = {"both": 'SET hire_process_is_completed = %s, hire_init_process_is_completed = %s',
                           'hire': 'SET hire_process_is_completed = %s',
                           'hire_init': 'SET hire_init_process_is_completed = %s'}
        # set_string is used in the actual query - derived from passed-in value for affected_field
        set_string = field_name_dict[affected_field]

        # number_of_ids is used to set up the correct number of %s entries for the IN() clause in the query below
        number_of_ids = len(list_of_ids)
        # we will always need at least one boolean_to_set for the %s in set_string
        list_of_ids.insert(0, boolean_to_set)
        # add second boolean_to_set for the second %s in set_string IF we're modifying both fields in the table
        if affected_field == 'both':
            list_of_ids.insert(0, boolean_to_set)
        # set up the tuple (after adding the booleans) to be used in the execute statement
        # this may be unnecessary as the execute would probably just take the list without complaint
        tuple_for_execute_method = tuple(item for item in list_of_ids)

        logging.info(
            f' StaffingChangeReportFixFlagsViewSet:post called.  Setting the field(s) {affected_field} and setting to {boolean_to_set} '
            f'for list of payroll_code(s) {list_of_ids}')

        # credit where it's due: https://stackoverflow.com/questions/54296971/inserting-multiple-mysql-records-using-python-error-python-tuple-cannot-be
        # This is a custom query, no Django Models involved
        query = f"""
                            UPDATE staffing_change_report
                            {set_string}
                            WHERE payroll_code IN (""" + ", ".join(["%s"] * number_of_ids) + """)
                            AND business_process_type IN ('Hire',  'Contract Contingent Worker');
                            """
        try:
            with connection.cursor() as cursor:
                # According to https://www.stackhawk.com/blog/sql-injection-prevention-django/
                # This is how to prevent SQL Injection on a "raw" query into the database using external input.
                # Django "sanitizes" the variables passed in the array on the 2nd parameter (array)
                # in the cursor.execute(string, array) statement.
                # This is done in order, using the %s "placeholders" in the query string.
                # Obviously, you have to get the variables in the right order in the array.
                # cursor.execute(query, [set_string, list_of_ids])
                # cursor.execute(query, list_of_args)
                # cursor.executemany(query, [boolean_to_set, boolean_to_set, foo])
                cursor.execute(query, tuple_for_execute_method)
                # below is the alternate method, but note that it makes a new query to the DB for every id in the list
                # for value in foo:
                #     cursor.execute(query, [boolean_to_set, boolean_to_set, value])

        except Exception as e:
            return Response(
                data={'Error': f'Unexpected error! Error is: [ {e} ]',
                      }, status=500)

        return Response(status=200, data={'Success': 'Foobar'})

Here is what the query looks like when the “join” has run to add the right number of %s’s to the query before it’s processed by execute()

Here is what the query looks like in the cursor AFTER it’s been processed by the execute()…

And here is what the Swagger page looks like:

The “magic” happens in the query where I ‘flatten’ the list into a bunch of %s’s using the join and the count from the list BEFORE I added the one or two booleans that are ultimately necessary to make the execute work. If you get that out of order, your number of arguments in the IN() will be wrong and the execute will complain.

Now, assuming there is the correct number of entries in the tuple, the execute will apply all the values from the tuple to the %s’s in order and the query ends up looking right and sending ALL the id’s to the DB in one query.

Note - the execute CAN take a data structure, if you do it right, so the exception you may get from the execute() about not being able to handle a “list” or a “tuple” is misleading.

Also note the credit to the StackOverflow page where I found the data that allowed me to get to my solution here.

I hope this saves someone some pain - this took me far too long to suss out…

This is example #1 in the category of “How to create the possibility of an SQL Injection vulnerability” in your application.

This would never pass a code-review here.

Yes, this may be safe in this specific situation as it’s being used, but I wouldn’t want to trust my system to it.

This is a perfect example of why I warn people away from SO as being a dangerous resource. Please, do yourself a favor and get rid of this.

Please help me understand why this is a problem?

I assume you mean that it is a problem in spite of the fact that I’m sending it through the execute(), which according to the site I mentioned in the comment is the Django way of sanitizing before sending to the DB?

And, perhaps, a suggestion of a better or more elegant approach?

This is all that is happening to the query string at the point where you commented - I even went to the slight extra effort of grabbing a variable with the count of the array rather than using the len(array) call…

The base problem is that you are interpolating variables into the string of the query itself and not just supplying data as parameters. Anytime you are modifying the query with data (as opposed to supplying variables as parameters), you are creating the potential for a problem.

Again, you may be safe here, because of the other facilities being used. But in the general case, this is bad. (Note, using the execute call here does not “sanitize” this, again, because you’re modifying the query.) It’s simply not a technique that you should ever get used to using.

If this table is in your current database, I see no reason not to use a model and the ORM for this.

This confuses me. Do you mean that ANY direct query (as opposed to something from the ORM) carries this risk no matter what you do?

The article on the stackhawk web page seems to clearly state that using this approach does use parameters… but maybe you and they mean different things by that word?

I take your point about using the model and ORM.

As for why not to use the ORM - because I may have use cases where users send 20k id’s and I thought that making one call to the database per model would be too costly in time for the process I need to run instead of doing some kind of ‘paging’ for 5k ids at a time…

It may be my ignorance, but I’m unaware of a way to use Django ORM for updating something like WHERE id IN(x,x,x,x,x,x,x)… Is that possible?

No, the issue is that you’re modifying the query string based on data coming from outside.

The problem is not using the query string. The potential for problems are from {set_string} and """ + ", ".join(["%s"] * number_of_ids) + """)

The ORM allows the construction of queries using the in operator. You’re still going to be creating one query with it. See QuerySet API reference | Django documentation | Django

Ok. Text is hard and I appreciate you taking the time. I’m not trying to be obtuse, I promise.

What I think you’re saying is that when stackhawk says the method they discuss “sanitizes” the things coming in on the second argument of the execute(query, someTuple) call, that is just not the case.

in other words, your position is that cursor.execute(query, someTuple) will NOT, in fact, find SQL Injection type “things” in someTuple?

I’m looking up that link you sent now…

It’s not the second parameter that is the problem, it’s the first.

The issue is that you are dynamically modifying the query parameter.

I’m completely lost as to why that can cause a problem.

Please help me understand by providing an example of what could go wrong in my instance? I’m doing all the changes in code and not including any external data so I’m lost here…

Do a search on SQL Injection. You’ll find many pages that talk about it.

And I’ll repeat myself again - while you may not be creating a vulnerability here, in the general case it’s a bad practice and should be avoided.

I did a search and came up with the StackHawk website which I thought made it clear that I had a path to not having a problem.

I will ask a DBA buddy of mine. Thanks

And thanks for that link!

That is pretty darn straightforward.

Thanks again.

One more question…

I came up with this - it seems very efficient…

        StaffingChangeReport.objects.filter(payroll_code__in=list_of_ids, business_process_type__in=["Hire", "Contract Contingent Worker"]).update(hire_process_is_completed=boolean_to_set, hire_init_process_is_completed=boolean_to_set)

However, I’m just a bit concerned that what actually happens is not as efficient as I would hope for my process…

Do you know what this does under the covers?

Does it pull back all the objects that fit the “IN” criteria, update the fields in the Update and then save them all back to the DB individually?

OR

Does it somehow do something similar to my custom query and just:

UPDATE someTable WHERE id IN(x,x,x,x)

Nevermind… I see it.

Finally, realize that update() does an update at the SQL level and, thus, does not call any save() methods on your models, nor does it emit the [pre_save] or [post_save] signals (which are a consequence of calling [Model.save()]).