Should "reverse" urlencode its args/kwargs?

Hello there!
The reverse function is normally used for deriving urls simply by referencing the url’s name.
If those urls have any parameters to them, we can pass them to reverse.

However, in its current state, reverse does not on its own encode the parameters. This means it may in some cases provide you with unusable urls.

I propose some change on how this is handled. Below are some ideas i have for this change. Please note that when i refer to the “user” i mean the developer which is building an application using Django. I am not referring to the administrator or any end-user which may run or use a project.

Why change the behaviour at all?

I’m not saying everyone is going to benefit from this change, users may actually not have noticed that they should (or even need to) encode their parameters, not all url parameters are created equal. When you only ever generate urls with integer based primary keys or slug-ified strings, this is pretty much a non issue.

My issue is that the problem may actually never come up in testing:
It may only show itself once deployed and in production. When not encoding the input properly, we generate useless url’s for seemingly no reason. Django already takes care of decoding the client’s requested url address. Its clear that when i call reverse, that i want a url pointing to a urlpattern/view. No matter what arguments/values i passed (as long as the mandatory ones are present).

Different approaches and problems

Default encoding

If the default behavior is to properly encode all parameters for the url, old code may break as the encoding was previously handled by the user.
To combat the breaking API change, view the section “Optional encoding”.

Optional encoding

Give the user a simple boolean argument for reverse, set to False by default.
This doesn’t break the API and gives users an easy way to make sure they are generating valid urls. No matter the value(s) passed.
However, this isn’t perfect either. The user has to be aware of this functionality, and actually use it. Encoding parameters properly is a rather simple to do, so it may actually sit unused, even in cases where it shouldn’t.
This could also later on be changed to a breaking change by changing the default value of our argument.

Documentation!

Instead of changing the implementation, it should be clearly communicated that reverse does not handle encoding its parameters and not doing it may result in broken urls.

By far the simplest approach, but we aren’t doing anything for already existing implementations.


Should reverse be changed? What do you think?


Best regards
JaK

1 Like

I think there’s a bug here. The documentation says:

The string returned by reverse() is already urlquoted. For example:

reverse("cities", args=["Orléans"]) '.../Orl%C3%A9ans/'

Applying further encoding (such as urllib.parse.quote()) to the output of reverse() may produce undesirable results.

If I’m understanding the docs correctly, your proposal should already be the case.

Not entirely, it only occurred to me now, but it is encoding the values.

The problem isn’t with url encoding in general, just that it doesn’t handle slashes properly.

I briefly tested it with just randomly generated text, that text was long enough for a slash to slip in there pretty much every time :sweat_smile:.

The encoding is done when the url has already been completely built. All the interesting stuff gets done in “_reverse_with_prefix”, this is where the url is encoded. It clearly encodes the entire url as one.

Lets change this topic up from should reverse urlencode, to how should slashes be handled in parameters. The answer in that case should be pretty clear in my opinion:
Encode the parameters before attaching them to the entire url.

Ah! That makes a lot more sense. I agree that the current behaviour is counter-intuitive. I suspect this can just be raised as a bug report, but in case it needs discussion here, I’m +1 to fixing this.

Hmmm… I don’t disagree that the current behavior is counter-intuitive, but there is a use-case for it.

(Whether there should be, is a different question.)

At a minimum, I might suggest this be approached as a change in API and not a bug fix since to the extent that I see in the docs, it’s working as documented - even though the implications of this are not obvious.

Encoding the arguments to reverse individually and removing slashes from the safe parameter list might not be ideal. As far as Django is concerned the url is currently unquoted before throwing it against resolve ( #15718 (Django unquotes urls and not able to distinguish %2F and /) – Django ). So this would mean that it most likely would not roundtrip as expected. Even worse I think webservers might already unquote slashes sometimes.

EDIT:// Edited to include a missing “not” in roundtripping.

I’d also want to look very carefully here about why the design of the existing path converters is to exclude / from matches (unless you actively use the path converter). (I’d wager this points in this area were already considered, and shouldn’t be forgotten.)

I’d actually argue that the use-case should be considered a bad practice, even when it would be appropriate.
So, yes. There might be a use-case, i suggest there shouldn’t be.

Passing arguments for building a URL highly suggests that those arguments will be treated as such, and not be interpreted as a separate endpoint. The documentation - in my opinion - further emphazizes this point:

The string returned by reverse() is already urlquoted . For example:

reverse("cities", args=["Orléans"]) '.../Orl%C3%A9ans/'

Applying further encoding (such as urllib.parse.quote()) to the output of reverse() may produce undesirable results.

This example clearly suggests that whatever argument(s) is passed will be encoded/escaped.

I’d refer to this as undocumented behaviour - a bug.

I don’t disagree here, but as I said…

I describe it as a different question because I do disagree with this:

Because it is documented here:
Any special characters in the resulting path will be encoded using iri_to_uri().

which then links to:

This is the algorithm from section 3.1 of RFC 3987#section-3.1, slightly simplified since the input is assumed to be a string rather than an arbitrary byte stream.

which then contains the specification of what’s going to happen in that function.

Beyond that, in principle, not all undocumented behaviors are bugs. It’s not possible to document the results of every interaction between any arbitrary set of components within the system. There should be sufficient information such that you can draw a conclusion from the information available, but that’s about it - and even so, even if there isn’t sufficient information to do that, then the appropriate response may simply be to document that particular case.

Yep, i can definitely see how my last statement is false.
Of course, not all undocumented behavior is a bug, sorry about that one :slight_smile:

Just because it confused me:
Could you please elaborate a bit more on what iri_to_uri() has to do with the reverse() function?
Yes, they are both for doing something with urls. I went through the entire call graph for reverse() and iri_to_uri() doesn’t appear anywhere. Was that just for reference to not all undocumented behavior being considered a bug, or maybe something else?

Sorry, bad edit on my part.

I cut out the url links in my reply.

In the Note box in the docs for reverse, it says:

The string returned by reverse() is already urlquoted. For example:

The word urlquoted above is a link to the docs that reference that function. However, I recognize that those docs aren’t specifically saying that reverse uses that specific function, but that the same algorithm is being used.

Also, related to that is the url template tag.
Quoting directly from the docs at: url

Returns an absolute path reference (a URL without the domain name) matching a given view and optional parameters. Any special characters in the resulting path will be encoded using iri_to_uri().

(I’m going solely by what I’m finding in the docs. I’ve not tried to dig my way through the source on this.)

So I agree that I don’t see specifically where it says that reverse uses the iri_to_uri function, but I would definitely draw the conclusion that it’s using code that effectively does the same thing.

Alright then, reverse() doesn’t use iri_to_uri() (im pretty sure on that one, as per generated call graph and just looking at the function).
But yes, you are right! It’s definitely doing something similar (Line).

It quotes the views full url (view base url + arg parameters in) using urllib.parse.quote.
Additionally some ‘safe’ characters are set: “!$&'()*+,;=/~:@”. These will not be escaped.

How is the full url built?

The full url is built with two variables, candidate_pat and text_candidate_subs.

The former, candidate_pat will be the view url, it uses stringf-style formatting (it’ll look something like this for example: ‘/test/%(parameter)s/’)

The latter, text_candidate_subs will be the parameters passed to reverse (not quoted).

Alas, the final url is built by first combining our two variables using the aforementioned stringf-style formatting followed by quoting. Before this string is returned, its leading slashes (if any) will also be encoded in escape_leading_slashes(). Here is the docstring for that function which explains why that step is necessary:

If redirecting to an absolute path (two leading slashes), a slash must be
escaped to prevent browsers from handling the path as schemaless and
redirecting to another host.

My proposal

This is precisely where i would propose a change:
Right after candidate_pat is built, we can go over all parameters and urlencode them.

Problems/Issues

Views will be given an escaped version of all parameters passed. This would, of course, break a lot of stuff. The moment a view expects a parameter to have some string with spaces in it, the entire thing breaks down the moment that patch rolls around.

Lets fix that problem!

Django will run the BaseHandler’s _get_response method to, well, get the response (good naming going on there :slight_smile: )
Thats precisely where we can jump in and decode the kwargs which will be passed down to the request and the appropriate view.
That should probably be done here, right before the callback is built.

1 Like