Escaping inputs for raw queries… and related problems

I am using raw queries in a project, and I am concerned that the inputs are not being escaped. Here’s an example:

…
myquery = """select * from identifiers \
			join books on identifiers.bookid = books.id \
			left join covers on identifiers.identifier = covers.idvalue \
			where identifiers.identifier = ""%(barcode)s"" ;"""
thevalues =  {'barcode': barcode}
theset = Books.objects.raw(myquery,  thevalues)
print('theset.query: ', theset.query)
…

When I print theset.query the characters that should be escaped in barcode are not, ie:

select * from identifiers 
join books on identifiers.bookid = books.id 
left join covers on identifiers.identifier = covers.idvalue 
where identifiers.identifier = ""978150360"922;8"" ;

Where I think it should look like this:

select * from identifiers 
join books on identifiers.bookid = books.id 
left join covers on identifiers.identifier = covers.idvalue 
where identifiers.identifier = ""978150360\"922\;8"" ;

In a similar vein I am trying LIKE queries:

myquery = """select * from identifiers \
			join books on identifiers.bookid = books.id \
			left join covers on identifiers.identifier = covers.idvalue \
			where identifiers.identifier LIKE ""%%%(barcode)s%%"" ;"""
thevalues =  {'barcode': barcode}
theset = Books.objects.raw(myquery,  thevalues)
print('theset.query: ', theset.query)

In this case the query runs in Django ok, but finds no records, whereas when I directly use the output from theset.query I get two hits.

I suspect in the former case it is being escaped, but not showing me for some reason, and the letter I’m just not quite getting the formatting right – but would appreciate some pointers.

Someone is probably wondering why the raw queries? I tried building a model for this, with standard Django queries, but kept getting complaints that some of the fields I was trying to build joins between did not have a unique attribute - which in the case of this set of records makes absolutely no sense.

Couple quick distinct thoughts here:

  • If you’re having problems building standard models for these, it might be an indication that you haven’t modeled your entities properly.

  • You never want to interpolate data directly into an SQL statement. Always use one of the parameter forms to pass data into a query. What you’re doing is a SQL-injection vulnerability in the making. (Read the Warning box in that section.)

If you’re indicating that I might have my data set-up incorrectly in my databases, and thus the relationships I want to build are all askew… I kind of doubt that. I spent a fair amount of time working it out, and I think it is about optimised. I’m getting allmost instantaneous results from tables with millions of records, one of them has tens of millions (and I expect the others to get to that stage too, potentially the identiiers table could be over 100 million).

The problem I had, building a standard model, was Django complaining that it wouldn’t let me build a model where the relate fields were not unique. I’m pretty sure that the entire basis for a many-to-many relationship is exactly non-unique keys. So, this is a bit puzzling.

It could be that I need to do some further research on handling that in Django – but I did a fair bit and ended up with the raw SQL query, because Django seemed to be presenting roadblocks.

Whatever the right way to do it, it should be fairly easy to switch between the two, because the raw method is still presenting a queryset.

I understand that putting raw data into an SQL query presents the opportunity for SQL injection, that’s exactly why I am trying to undrstand why I am seeing the raw data, with the problematic characters, in theset.query.

Again, I did a fair amount of research on this to try to understand how to do this safely. Here’s an eample doing exactly as I think my code is:

qs = MyModel.objects.raw(sql, [param1, param2])

I think the only difference with my code is that it is using named parameters. Or are you indicating that method is not supported? If it isn’t, then I’m kind of wondering how the parameter gets into the query.

Are you indicating that the barcode element is not being escaped? This is essentially my question - is it being esacaped, or not, and how can I tell?


Edited a typo


I literally wrote the suggestion that the warning box suggested against.

Does the following not work?

myquery = """select * from identifiers \
			join books on identifiers.bookid = books.id \
			left join covers on identifiers.identifier = covers.idvalue \
			where identifiers.identifier = %(barcode)s;"""
theset = Books.objects.raw(myquery,  {"barcode": barcode})

A many-to-many relationship is modeled via an intermediate join table that connects the primary (unique) keys of two tables, not by creating a reference to a non-unique field from one to the other.

In a pseduo-code style of description:

Table a
    column pk (unique)
    data columns

Table b
    column pk (unique)
    data columns

Table join
    column pk (unique)
    fk-to-a
    fk-to-b
    constraint unique(a,b)

That’s the archetype for implementing a many-to-many relationship in a relational DB, regardless of any application framework being layered on it.
And yes, Django models this relationship quite well through the use of the ManyToManyField.

The SQL standard does not specify that the double quotes are allowed for constants - it requires single quotes. (See the PostgreSQL docs for a reference on this.)
But you don’t want to enclose your passed parameter in quotes anyway, because of the aforementioned Warning in the docs.

You’ve also got unnecessary slashes at the end of your lines. The triple-quote delimiter in Python includes linefeed characters in the string, and multiline SQL statements are perfectly valid. (See the example in Mapping query fields to model fields as a demonstration of a multi-line query. Note that I’m not saying it’s wrong to have them, they’re just unnecessary.)

Tim, it works as far as I get a successful query with just a field = value type, but I also want to figure out how to do a field like '%value%'.

However, that is the minor point. The major point is that the parameter values do not seem to be escaped. When I print theset.query in the terminal I get the output noted above, and, as a further example, what appears to be the epitome of SQL injection:

select * from identifiers 
join books on identifiers.bookid = books.id 
left join covers on identifiers.identifier = covers.idvalue 
where identifiers.identifier = 9781503609228; select * from identifiers where identifier like  "%9781503609228%" ;

Where I have input:

9781503609228; select * from identifiers where identifier like  "%9781503609228%"

in the form’s input field.

And then Ken seems to be saying that I’m doing it wrong (despite the example) and that this method invites an SQL injection attack.

So, overall a bit confused.

Ok, well, I got part of the answer - I turned on general_log in MySQL, and that shows the actual query submitted:

select * from identifiers 
join books on identifiers.bookid = books.id
left join covers on identifiers.identifier = covers.idvalue 
where identifiers.identifier = ""'978150360922\";8'""

Which shows single quotes around the barcode input, and the inserted " being escaped. This tends to show that theset.query does not actually provide very useful output (ie it is not actually showing what query was used).

I would still appreciate some input on doing LIKE queries.

What I’ve done in that situation is to modify the parameter before passing it to the query.
e.g. theset = Books.objects.raw(myquery, {"barcode": '%'+barcode+'%'})

It’s very useful information under many circumstances, however, as you’ve noted, it’s not the actual query being used - it’s the internal representation of the query before it gets “translated” to the database-specific representation issued as a query. Django doesn’t know what the SQL statement is going to be until it knows what database engine is going to execute it.

In my experience, print(queryset.query) doesn’t escape all values. When I do that, I have to wrap string values in single quotes.

Edit: Late to the party again.

Tim: “When I do that”

I’m probably being dim, but I’m not sure what “that” is.

Wrapping the parameter value in “%” does execute a LIKE query appropriately.

While it may be of some use that you can see queryset.query it is of very limited utility when it does not show you the actual query used, particularly if you are trying to debug a query’s fine detail. I don’t think I’ve ever come across a situation like that previously, and would go as far as suggesting that it is a significant bug.

Arguing that this merely illustrates the abstraction of the Django code from the database (which I think is Ken’s point) is a fair defence to some extent (ie if you wanted to see the bones of the query before it was used), but I don’t really see it stands up very well, when, in this case the query has been executed and Django obviously knows the target database.

So, I believe that queryset.query is not fully fit for purpose.

As to the problem of relationships between tables and Django models, I realised (upon introspection) that the table set-up isn’t classically many-to-many - that’s because two of the tables don’t actually “know” anything about the third (hence the left join). So rather than the example that Ken gives, the relationships are more like:

Table a
pk, data columns

Table b
pk, fk-a, data-target, data+

Table c
pk, (data-target), data, data columns

So, while (a) and (b) are known entities, and can rely upon b.fk-a = a.pk, to get to (c) we build a (potential) left join b.data-target = c.data-target where we cannot rely upon b.data-target or c.data-target being unique, or of c.data-target even existing (but still wanting the data from (a) and (b) if ther’s not successful join to (c).

I think this is the problem that Django was complaining about “I can’t build joins to non-unique fields/columns”.

The central problem being that I’m using data in in my tables that has come from other entities, there is lots of it, and that data is inherently messy – it’s more practical, for various reasons, to attempt to work around the mess, than to clean it up.

On this point, I don’t really want to find the perfect model for this solution in this thread, I just wanted to explain the problem a little further. Perhaps I’ll start another thread to dissect it further another time.

I meant when I try to use print(queryset.query) and use the output in a postgres shell. I should have clarified that. My typical use case is to run it through EXPLAIN ANALYZE and I always get caught by the datetimes and string values not being wrapped.

On this point, I don’t really want to find the perfect model for this solution in this thread, I just wanted to explain the problem a little further. Perhaps I’ll start another thread to dissect it further another time.

If you change your mind, let us know. Sometimes all it takes is fresh eyes.

I agree that one might think that that’s how an ORM would work, but one would be wrong in thinking that Django’s ORM works that way.
It’s a bit opaque, I know, but when you’ve gotten a result set from a queryset, the queryset object itself is still a queryset. That you have passed it along to the database engine for execution / resolution does not change the nature of that queryset, and does not prevent you from subsequently executing that same queryset on a different database engine.
So no, the queryset does not “know” the target database. That information is not returned to the queryset.

I can understand the desire to get access to the actual SQL statement issued - See the thread at Database instrumentation at a higher layer for the solution I implemented for this.

Yep, I deal with that type of situation a lot - and it’s one of Django’s limitations. Django does not fare well with databases that are not “classically normalized” - and there are many reasons why databases might be designed that way.