Using Google re2 instead of re in Django

Following my presentation earlier this week at DjangoCon Europe, I would like to propose that we replace as much as we can of our use of the stdlib’s re module with google-re2, in order to avoid ReDoS vulnerabilities.

This would entail several code changes beyond adding the dependency and changing the imports, because re2 is not a drop-in replacement for re. Besides some differences in features, there are differences in the API: for example, re2 does not have the flag constants (re.I, re.DOTALL etc.); it uses keyword arguments for some of them (re2.compile(pattern, case_sensitive=False) etc.), and multiline mode can only be set within the pattern (i.e. the only equivalent of re.M is to add (?m) to the pattern).
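To illustrate the flag point, here is a minimal sketch using only the stdlib re module (re2 itself is not imported, and I have not verified how it behaves here). It shows that flags written inline in the pattern give the same result as the flag constants, which is the form that should survive a port. The sample text and pattern are made up for the example.

```python
import re

TEXT = "first line\nSecond line\nthird LINE"

# Flag constants, as used with the stdlib re module today.
with_constants = re.compile(r"^\w+ line$", re.I | re.M)

# The same behaviour expressed inside the pattern itself:
# (?i) = ignore case, (?m) = multiline. Inline flags go at the very start.
with_inline = re.compile(r"(?im)^\w+ line$")

constants_result = with_constants.findall(TEXT)
inline_result = with_inline.findall(TEXT)

# Both spellings should find the same lines.
print(constants_result == inline_result)
```

Moving flags into the pattern first, while still on re, would keep that step of a migration small and testable on its own.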

In terms of non-supported features:

  • In spite of what I said in the talk, re2 does support word boundary checks
  • There are 18 uses of look-around assertions, in 12 patterns in the Django codebase. These will have to be changed, probably from just-regex to regex-with-some-more-code
  • I could find no use of backreferences or conditionals in searches; there are references to group matches in substitutions (re.sub() calls), which re2 seems to support.
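As a sketch of what "regex-with-some-more-code" could look like for the look-around case: the pattern below is hypothetical (it is not one of the 12 Django patterns) and the function names are invented for the example. It accepts a label of lowercase letters, digits and hyphens that neither starts nor ends with a hyphen, first with look-around and then with a plain pattern plus ordinary string checks. Only the stdlib re module is used.

```python
import re

# Hypothetical look-around version: the negative lookahead rejects a leading
# hyphen, the negative lookbehind rejects a trailing one.
WITH_LOOKAROUND = re.compile(r"(?!-)[a-z0-9-]+(?<!-)")

# Look-around-free version: the pattern only checks the character set, and
# the boundary conditions move into Python code.
PLAIN = re.compile(r"[a-z0-9-]+")


def is_label_lookaround(value):
    return WITH_LOOKAROUND.fullmatch(value) is not None


def is_label_plain(value):
    if value.startswith("-") or value.endswith("-"):
        return False
    return PLAIN.fullmatch(value) is not None


for sample in ["abc", "a-b", "-abc", "abc-", ""]:
    print(repr(sample), is_label_lookaround(sample), is_label_plain(sample))
```

Both functions use fullmatch() rather than ^...$ so that a trailing newline cannot slip through, which keeps the two versions equivalent.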

In terms of other effects:

  • This is expected to make some of the regex searches and matches a little slower.

Further notes:

Next steps:

  • First of all, I would like to get some feedback – do we want this, or is it just me?
  • Presuming we do want it, do you think this is a DEP-requiring change?

Looking forward to your feedback!

Thanks for proposing this @shaib. Avoiding a (seemingly) constant stream of (small) redos security issues would be a big win.

I guess there’ll be some concern about the slowdown, but it’s going to be a very small hit in general, right? (I’m imagining a difference in the microseconds range, with computers being what they are.) Unless it were somehow devastating, the greater security here would justify a small hit.

I wouldn’t say it’s a major change requiring a DEP.

Just to play devil’s advocate, is there a way to do this (and is it useful to do so) without porting to re2? I’m wondering if we can mitigate much of the risk without adding an additional dependency and migrating a reasonably large number of patterns. Might there be a notable performance difference between “re2 + code” vs “Simplified re + code” (assuming the simplified use of re avoids the ReDoS patterns)?

Also, is there value in keeping the simpler patterns (i.e. those we can say with confidence don’t backtrack) using re, and only porting the more “interesting” patterns to use re2? That might give a best-of-both-worlds outcome: mitigate ReDoS whilst having some performance benefit. I don’t know how easy it is to determine (statically or otherwise) whether a given pattern backtracks or could have other ReDoS implications.

The point of moving to re2 is to stop looking over our shoulders. I mean, for all I know, in the current main branch (and maintained stable branches), there are no catastrophic backtracks, at least not where end-users could invoke them. Getting to a point where all of our expressions are safe is probably doable with re. The hard part is keeping it that way – when new regexes are added, or when old regexes get exposed through changes in functionality.

I don’t know of a tool that can statically analyze regexes and declare them safe, and even if there were one, some regexes are safe for matching but dangerous for searching.

I should note in passing that the use of features which aren’t supported by re2 does not, in itself, imply that the regex is dangerous (nor does not using them imply safety).
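For readers who have not seen a catastrophic backtrack in practice, here is a small sketch with the stdlib re module. The nested-quantifier pattern is a textbook example, not something taken from Django, and the input sizes are kept small so the script finishes quickly; the helper name is invented for the example.

```python
import re
import time

# Classic nested-quantifier pattern (illustrative only, not from Django).
EVIL = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return the seconds spent failing to match n 'a's followed by a 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = EVIL.match(text)
    elapsed = time.perf_counter() - start
    # The trailing 'b' makes a match impossible, so the engine has to try
    # every way of splitting the run of 'a's before giving up.
    assert result is None
    return elapsed


timings = {n: time_failed_match(n) for n in (10, 14, 18)}
for n, seconds in timings.items():
    print(f"n={n:2d}: {seconds:.6f}s")
```

With a backtracking engine, the time should grow roughly exponentially with n, so each additional character about doubles the work. The exact numbers depend on the machine and the Python version.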

This sounds like a good change to me; I would like to see how the implementation changes the existing code.

I think that re2 would be nice from a security standpoint. It is just one thing less to worry about – and it’s not the first time that we missed a backtracking issue and it won’t be the last.

I see that the PyPI project has wheels for most (all?) platforms. That is certainly something to double-check and confirm because we really do not want people to compile that manually.

Do you have any numbers on how much slower and under which circumstances?

Cheers,
Florian

How soon in the release process did Google re2 support Python 3.11 and Python 3.12? Does it support beta Python 3.13 yet?

Edit: I checked on PyPI, and it looks like Google re2 does not have wheels for Python 3.13 yet. So I assume that testing Django on Python 3.13 will be on hold, until Google re2 supports it (or someone wants to build Google re2 locally for Python 3.13, assuming that works)?

No, not quite. I had the idea to try a simplistic replacement in the Django code base and see how that affects the timing of running the relevant tests, but I haven’t gotten to it yet.
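In the meantime, a rough micro-benchmark along these lines could give a first impression. This is only a sketch: the pattern and text are made up, the helper name is invented, and it assumes that the google-re2 package exposes a re2 module with compile() and search() that mirror re. If re2 is not installed, only re is timed.

```python
import re
import timeit

try:
    import re2  # assumed to be provided by the google-re2 package
except ImportError:
    re2 = None

# Hypothetical sample pattern and input, not taken from Django.
PATTERN = r"[\w.+-]+@[\w-]+\.[\w.-]+"
TEXT = "contact: someone@example.com, " * 50


def seconds_per_search(engine, repeat=1000):
    """Average time of one search() call with the given regex module."""
    compiled = engine.compile(PATTERN)
    total = timeit.timeit(lambda: compiled.search(TEXT), number=repeat)
    return total / repeat


results = {"re": seconds_per_search(re)}
if re2 is not None:
    results["re2"] = seconds_per_search(re2)

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.2f} us per search")
```

A single pattern on a single input says little on its own; timing the relevant Django tests before and after a replacement would be the more meaningful measurement.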

RE2’s own documentation includes some benchmarks from 2010, comparing it to PCRE, the Perl-compatible regex library; I don’t know how PCRE compares with SRE, the Python implementation, and I’m not sure what has happened in re2 since then.

Regular Expression Matching in the Wild (look for “Performance”).

There are more benchmarks out there, showing re2 to be reasonable and competitive with other engines; e.g. Performance comparison of regular expression engines (from 2015) – still not including SRE.

Agreed.

Those are good questions I hadn’t considered. Thanks for bringing them up.

I’m not sure about releases – I see no proper release history log.

In the code, support for Python 3.12 was added on October 25, 2023, which is after the final Python 3.12 release (on October 2, 2023). And it seems the first commit to support 3.11 in CI was only in May 2023, more than 6 months after the Python release.

So, yes, if we do this, we’ll probably need to take care of testing with future Python versions ourselves.

Django depending on re2 adds another interesting feature - allowing downstream users to depend on it, too.
Many libraries (and applications, for that matter) might not want to depend on re2 themselves because of the extra maintenance burden (and additional dependency), but if it comes with Django, they might be more willing to use it, thus benefiting the wider Django ecosystem.
