Using Google re2 instead of re in Django

shaib · June 8, 2024, 10:55am

Following my presentation earlier this week at DjangoCon Europe, I would like to propose that we replace as much as we can of our use of the stdlib’s re module with google-re2, in order to avoid ReDoS vulnerabilities.

This would entail several code changes, other than adding the dependency and changing the imports, because re2 is not a drop-in replacement for re; besides some differences in features, there are differences in the API – e.g. re2 does not have the flag constants (re.I, re.DOTALL etc); it uses keyword arguments for some of them (re2.compile(pattern, case_sensitive=False) etc.) and multiline mode can only be set within the pattern (i.e. the only equivalent of re.M is to add (?m) in the pattern).
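To illustrate the flag difference, here is a minimal sketch that uses only the stdlib's re: the same pattern is compiled once with flag constants and once with inline flags. RE2's documented syntax accepts the (?i), (?m) and (?s) inline flags, so the inline form is the one that would be expected to carry over. The helper name with_inline_flags is hypothetical and is not part of either library.

```python
import re

# Map stdlib flag constants to the inline letters accepted by re (and by RE2 syntax).
_INLINE = {re.I: "i", re.M: "m", re.S: "s"}

def with_inline_flags(pattern, flags=0):
    """Return `pattern` prefixed with inline flags equivalent to `flags` (hypothetical helper)."""
    letters = "".join(letter for flag, letter in _INLINE.items() if flags & flag)
    return f"(?{letters}){pattern}" if letters else pattern

text = "first\nSecond\nthird"
by_constant = re.compile(r"^second$", re.I | re.M)
by_inline = re.compile(with_inline_flags(r"^second$", re.I | re.M))

print(by_inline.pattern)
print(by_constant.search(text).group(), by_inline.search(text).group())
```

Both compiled patterns find the same line, which is the property a port would need to preserve.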

In terms of non-supported features:

In spite of what I said in the talk, re2 does support word boundary checks
There are 18 uses of look-around assertions, in 12 patterns in the Django codebase. These will have to be changed, probably from just-regex to regex-with-some-more-code
I could find no use of backreferences or conditionals in searches; there are references to group matches in substitutions (re.sub() calls), which re2 seems to support.
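As a sketch of what "regex-with-some-more-code" could look like, the example below takes a hypothetical pattern with a negative lookahead (not one from the Django codebase) and splits it into a lookaround-free regex plus a plain Python check. It uses the stdlib's re so that both versions can be compared side by side.

```python
import re

# Original style: a negative lookahead rejects names starting with "admin".
WITH_LOOKAHEAD = re.compile(r"(?!admin)[a-z0-9_]+")

# Lookaround-free regex; the rejected prefix is checked in Python instead.
PLAIN = re.compile(r"[a-z0-9_]+")

def is_valid_lookahead(name):
    return WITH_LOOKAHEAD.fullmatch(name) is not None

def is_valid_plain(name):
    return PLAIN.fullmatch(name) is not None and not name.startswith("admin")

for name in ("alice", "admin", "administrator", "bad name", "sadmin"):
    # The two implementations should agree on every input.
    assert is_valid_lookahead(name) == is_valid_plain(name)
    print(name, is_valid_plain(name))
```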

In terms of other effects:

This is expected to make some of the regex searches and matches a little slower.

Further notes:

re2 is written in C++, but Python bindings are part of the main project (bindings for other languages are provided by third parties). It is maintained and updated.
There has been an attempt to smooth over the differences, and provide a wrapper of re2 that is a drop-in replacement for re (up to the patterns themselves, of course); that attempt seems to have been abandoned 8 years ago, but still holds the name re2 on PyPI. Personally, I think that would have been the wrong way to go about it anyway.
Apache Airflow, a project comparable to Django in magnitude and profile, made this change last year – it first introduced re2 into its code base and then replaced almost all uses of the stdlib’s re with re2

Next steps:

First of all, I would like to get some feedback – do we want this, or is it just me?
Presuming we do want it, do you think this is a DEP-requiring change?

Looking forward to your feedback!

carltongibson · June 9, 2024, 8:00am

Thanks for proposing this @shaib. Avoiding a (seemingly) constant stream of (small) redos security issues would be a big win.

I guess there’ll be some concern about the slowdown, but it’s going to be a very small hit in general, right? (I’m imagining the difference to be in the microseconds range, with computers being what they are…) Unless it were somehow devastating, the greater security here would justify a small hit.

I wouldn’t say it’s a major change requiring a DEP.

theorangeone · June 9, 2024, 9:45am

Just to play devil’s advocate, is there a way to make this change without porting to re2 (and would that be useful)? I’m wondering if we can mitigate much of the risk without adding an additional dependency and migrating a reasonably large number of patterns. Might there be a notable performance difference between “re2 + code” vs “Simplified re + code” (assuming the simplified use of re avoids the ReDoS patterns)?

Also, is there a value in keeping the simpler patterns (ie those we can say with confidence don’t backtrack) using re, and only porting the more “interesting” patterns to use re2? That might give a best-of-both-worlds outcome - mitigate ReDoS whilst having some performance benefit. I don’t know how easy it is to determine (statically or otherwise) whether a given pattern backtracks or could have other ReDoS implications.

shaib · June 9, 2024, 11:43am

The point of moving to re2 is to stop looking over our shoulders. I mean, for all I know, in the current main branch (and maintained stable branches), there are no catastrophic backtracks, at least not where end-users could invoke them. Getting to a point where all of our expressions are safe is probably doable with re. The hard part is keeping it that way – when new regexes are added, or when old regexes get exposed through changes in functionality.
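For readers who have not seen a catastrophic backtrack, here is a minimal sketch with a textbook pattern (not taken from Django). With the stdlib's backtracking engine, the time for a failed match is expected to grow roughly exponentially with the input length, so the sizes are kept small; exact timings depend on the machine and the Python version.

```python
import re
import time

# Nested quantifiers: many ways to split a run of "a"s between the inner and outer "+".
NESTED = re.compile(r"(a+)+$")

def time_failed_match(n):
    """Time NESTED.match() on n "a"s followed by a "b", which can never match."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = NESTED.match(text)
    return result, time.perf_counter() - start

for n in (12, 15, 18):
    result, elapsed = time_failed_match(n)
    print(f"n={n} matched={result is not None} seconds={elapsed:.4f}")
```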

I don’t know of a tool that can statically analyze regexes and declare them safe, and even if there was one, some regexes are safe for matching but dangerous for searching.
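One illustration of the matching-versus-searching distinction, as a sketch with a hypothetical pattern: \s+$ fails after a single backtracking pass when anchored at the start by match(), but search() retries from every position, so on a long run of whitespace followed by a non-space character the work is expected to grow quadratically on a backtracking engine. Timings are machine-dependent, so none are shown.

```python
import re
import time

TRAILING_WS = re.compile(r"\s+$")

def timed(func, text):
    """Call func(text) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(text)
    return result, time.perf_counter() - start

# Whitespace that is NOT at the end of the string: neither call can succeed.
text = " " * 2000 + "x"

match_result, match_seconds = timed(TRAILING_WS.match, text)
search_result, search_seconds = timed(TRAILING_WS.search, text)

print(f"match:  found={match_result is not None} seconds={match_seconds:.5f}")
print(f"search: found={search_result is not None} seconds={search_seconds:.5f}")
```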

I should note in passing that the use of features which aren’t supported by re2 does not, in itself, imply that the regex is dangerous (nor does not using them imply safety).

tom · June 9, 2024, 11:51am

This sounds like a good change to me, would like to see how the implementation changes the existing code.

apollo13 · June 9, 2024, 7:27pm

I think that re2 would be nice from a security standpoint. It is just one thing less to worry about – and it’s not the first time that we missed a backtracking issue and it won’t be the last.

I see that the pypi project has wheels for most (all?) platforms. That is certainly something to double check and confirm because we really do not want people to compile that manually.

Do you have any numbers on how much slower and under which circumstances?

Cheers,
Florian

benc · June 10, 2024, 1:37pm

How soon in the release process did Google re2 support Python 3.11 and Python 3.12? Does it support beta Python 3.13 yet?

Edit: I checked on PyPI, and it looks like Google re2 does not have wheels for Python 3.13 yet. So I assume that testing Django on Python 3.13 will be on hold, until Google re2 supports it (or someone wants to build Google re2 locally for Python 3.13, assuming that works)?

shaib · June 16, 2024, 8:47pm

No, not quite. I had the idea to try to do a simplistic replacement in the Django code base and see how that affects the timing of running relevant tests, but I hadn’t gotten to it yet.
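A simplistic comparison along those lines could start from something like the sketch below. The helper time_engine is hypothetical; it accepts any module that exposes compile() returning an object with search(). Only the stdlib's re is exercised here; passing google-re2's module in its place is an assumption that has not been verified, and no results are implied.

```python
import re
import timeit

def time_engine(engine, pattern, text, number=1000):
    """Return seconds per search() call for `pattern` compiled by `engine`."""
    compiled = engine.compile(pattern)
    total = timeit.timeit(lambda: compiled.search(text), number=number)
    return total / number

# Hypothetical sample input and pattern, chosen only for illustration.
sample = "GET /articles/2024/06/some-slug/ HTTP/1.1"
per_call = time_engine(re, r"/(\d{4})/(\d{2})/([\w-]+)/", sample)
print(f"stdlib re: {per_call * 1e6:.2f} microseconds per search")
```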

RE2’s own documentation includes some benchmarks from 2010, comparing it to PCRE – Perl’s regex library; I don’t know how PCRE compares with SRE, the Python implementation, and I’m not sure what happened in re2 since then.

Regular Expression Matching in the Wild (look for “Performance”).

There are more benchmarks out there, showing re2 to be reasonable and competitive with other engines; e.g. Performance comparison of regular expression engines (from 2015) – still not including SRE.

Agreed.

shaib · June 16, 2024, 9:16pm

Those are good questions I hadn’t considered. Thanks for bringing them up.

I’m not sure about releases – I see no proper release history log.

In the code, the Python 3.12 release was added on October 25, 2023 – that’s after the final Python 3.12 release (on October 2, 2023). And it seems the first commit to support 3.11 in CI was only in May 2023, more than 6 months after that Python release.

So, yes, if we do this, we’ll probably need to take care of testing with future Python versions ourselves.

theorangeone · June 24, 2024, 7:30am

Django depending on re2 adds another interesting feature - allowing downstream users to depend on it, too.
Many libraries (and applications for that matter) might not want to depend on re2 themselves for the extra maintenance burden (and additional dependency), but if it comes with Django, they might be more willing to use it, thus benefiting the wider Django ecosystem.

adamchainz · July 1, 2024, 10:38pm

Hi Shai. Thanks again for your talk.

I like the proposal to move to re2, and would certainly like to see the end of ReDoS security reports. But… the lack of timely Python version support in re2 makes me a -1 right now.

re2 is a complex C++ extension, so not something Django fellows/contributors could easily fork if needed. We’d be reliant on Google, which doesn’t have a stellar reputation for keeping products alive…

We have been “burned” by similar reliance in the past. The autoreloader integration with Facebook’s Watchman stopped working because Watchman didn’t support Python 3.10 for years after its release. This meant no efficient reloading for anyone during that period. (Even now I have found it buggy on 3.12.)

I think we also need to have higher standards for C-level extensions updating reliably. It is a time of great change in Python’s C API:

Python 3.13 has many new “better” functions and deprecations/removals of old ones.
The ongoing “free-threaded” Python project (GIL removal) will probably require extensions to change to support it. I’d hope we can set up Django projects to take advantage of free-threaded Python shortly after it’s stable.
Subinterpreters, a multiprocessing alternative, need C API changes to ensure extension isolation between them.

From my experience optimizing some string-related functions in Django, for security issues and otherwise, I have found that regex-with-some-more-code can be much slower, sometimes 100x. Python is not generally a fast language for string searching and manipulation, mostly because many string operations need to copy data and build a new string object. So, each of these changes would come with a risk of adding a DoS vector.
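One way to check a specific case before committing to it is to time both variants on the same input. The sketch below uses a hypothetical task (extracting runs of digits) and hypothetical function names; it only shows the measurement approach, and the ratio will vary with the task, the input and the Python version.

```python
import re
import timeit

DIGITS = re.compile(r"[0-9]+")

def digit_runs_regex(text):
    return DIGITS.findall(text)

def digit_runs_loop(text):
    """Same result as digit_runs_regex, built character by character."""
    runs, current = [], []
    for char in text:
        if "0" <= char <= "9":
            current.append(char)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

text = "order 66 shipped 2024-06-08, items: 3, total 1999 " * 200
assert digit_runs_regex(text) == digit_runs_loop(text)

for func in (digit_runs_regex, digit_runs_loop):
    seconds = timeit.timeit(lambda: func(text), number=50)
    print(f"{func.__name__}: {seconds:.4f}s for 50 runs")
```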

theorangeone · August 12, 2024, 4:34pm

Somewhat relevant to this discussion - a Python forum thread on adding better NFA support to Python itself:
