Don't know if I implemented nh3.clean() properly in my Django project

Hi,

I ended up implementing nh3 in my Django project. It works nicely: it removes “offending” tags from the live HTML, which is great. But the offending tags are still present in the textarea of my edit_post.html. Is that what is supposed to happen?

This is how I use it in my project:

# boards/forms.py

import nh3
from django import forms

from .models import Post

class SanitizedTextareaField(forms.CharField):
    def clean(self, value):
        value = super().clean(value)
        return nh3.clean(value, tags={
            "a",
            "abbr",
            "acronym",
            "b",
            "blockquote",
            "code",
            "em",
            "i",
            "li",
            "ol",
            "strong",
            "ul",
        },
        attributes={
            "a": {"href", "title"},
            "abbr": {"title"},
            "acronym": {"title"},
        },
        url_schemes={"https"},
        link_rel=None,)

class PostForm(forms.ModelForm):
    message = SanitizedTextareaField(widget=forms.Textarea)

    class Meta:
        model = Post
        fields = ['message', ]
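
# boards/models.py

import nh3
from django.db import models
from django.urls import reverse
from django.utils.safestring import mark_safe
from markdown import markdown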

class Post(models.Model):
    message = models.TextField()
    topic = models.ForeignKey(Topic, on_delete=models.CASCADE, related_name="posts")
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(null=True)
    created_by = models.ForeignKey(User, on_delete=models.CASCADE, related_name="posts")
    updated_by = models.ForeignKey(
        User, on_delete=models.CASCADE, null=True, related_name="+"
    )
    likes = models.ManyToManyField(User, blank=True, related_name="post_likes")

    def total_likes(self):
        return self.likes.count()

    def __str__(self):
        # truncated_message = Truncator(self.message)
        # return truncated_message.chars(30)
        return self.message

    def get_absolute_url(self):
        return reverse("post_detail", kwargs={"pk": self.pk})

    def get_message_as_markdown(self):
        clean_content = nh3.clean(self.message, tags={
            "a",
            "abbr",
            "acronym",
            "b",
            "blockquote",
            "code",
            "em",
            "i",
            "li",
            "ol",
            "strong",
            "ul",
        },
        attributes={
            "a": {"href", "title"},
            "abbr": {"title"},
            "acronym": {"title"},

        },
        url_schemes={"https"},
        link_rel=None,)
        rendered_content = markdown(clean_content, extensions=['fenced_code', 'codehilite'])
        return mark_safe(rendered_content)

nh3.clean() removes any HTML element that is not listed in tags, but the offending tags still appear in edit_post.html. Like the script tag, for example:

[screenshot]

But in the post_detail.html view, it does not appear:

[screenshot]

And the same goes for the live HTML:

[screenshot]

My implementation of nh3.clean() might be a bit of overkill. The behavior is the same whether I implement it in both forms.py and models.py or in just one of them. I have never done this before in Python/Django, and I don’t want to be relaying information or code that does not safeguard my users and site from bad actors (e.g., XSS attacks).

And btw, as indicated in the code above, I also have implemented fenced code and code highlighting in my markdown.

Lastly, if I do not include anchor elements in my nh3 allowed tags, I still am able to create them successfully in the markdown. Why is that? So I am wondering if there are other tags which are overlooked by nh3 even though they are not included in tags. Thanks in advance for any feedback!

I am not finding my answers in any nh3 documentation or anywhere else. And bleach is deprecated and has vulnerability issues. No longer maintained.


It’s been a while since I did anything like this with custom form fields but, having read this example of using nh3 with a Django field, maybe you should follow that and put it in the to_python() method?

Assuming you don’t already have any potentially-unsafe HTML saved in that model field in the database, I don’t think there’s any point also using nh3 in the model – you want to sanitise the HTML on the way in to the database.

If there’s any other way HTML can be saved to that field, aside from this form – like an API – then you’d need to also clean the input from that.

Hi @philgyford, thanks for the response. I don’t use any APIs, and I don’t have anything unsafe in the database since I haven’t even pushed it live yet; I’m basically just testing things out. I’ll check out the post you shared and see if it helps. Thanks!

I do know the post. It did not quite answer my questions, so I am still left in the dark. I also know another post, by Daniel Roy Greenfeld, entitled Converting from bleach to nh3. I am still left in the dark with that one too. Thanks anyway!

What are you in the dark about? Have you tried implementing it in the way shown in the post? What happened?

Yes, I implemented what was shown in the post. It is still implemented. And I stated my questions here, which were not addressed in either post. I even provided screenshots. That’s ok. I will look elsewhere for the answers. And the project will not go live until I get those answers, if I ever get them. I really do appreciate the help you gave me.

I kept the nh3.clean() code in the form, and removed it from the model. Same difference:

# boards/models.py
class Post(models.Model):
    message = models.TextField()
    topic = models.ForeignKey(Topic, on_delete=models.CASCADE, related_name="posts")
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(null=True)
    created_by = models.ForeignKey(User, on_delete=models.CASCADE, related_name="posts")
    updated_by = models.ForeignKey(
        User, on_delete=models.CASCADE, null=True, related_name="+"
    )
    likes = models.ManyToManyField(User, blank=True, related_name="post_likes")

    def total_likes(self):
        return self.likes.count()

    def __str__(self):
        # truncated_message = Truncator(self.message)
        # return truncated_message.chars(30)
        return self.message

    def get_absolute_url(self):
        return reverse("post_detail", kwargs={"pk": self.pk})

    def get_message_as_markdown(self):
        rendered_content = markdown(self.message, extensions=['fenced_code', 'codehilite'])
        return mark_safe(rendered_content)

And this is what the form looks like:

# boards/forms.py

class SanitizedTextareaField(forms.CharField):
    def clean(self, value):
        value = super().clean(value)
        return nh3.clean(value, tags={
            "a",
            "abbr",
            "acronym",
            "b",
            "blockquote",
            "code",
            "em",
            "i",
            "li",
            "ol",
            "strong",
            "ul",
        },
        attributes={
            "a": {"href", "title"},
            "abbr": {"title"},
            "acronym": {"title"},
        },
        url_schemes={"https"},
        link_rel=None,)

class PostForm(forms.ModelForm):
    message = SanitizedTextareaField(widget=forms.Textarea)

    class Meta:
        model = Post
        fields = ['message', ]

I just added the tags to match what bleach did as per the other post.

You say you’ve done what’s in the post, but your code still shows that you have nh3.clean() in the Form’s clean() method, not in the to_python() method which the post uses. Have you tried that?

I’ve copied your model and form and it works for me as you have it – if I create a post then tags like <script> are removed from the message field.

So I’m not sure how you’re getting them to appear in the edit_post.html.

If you look in the database at the message field of Posts, do they appear as you would expect? i.e. with the “bad” HTML tags stripped out?

I don’t know either, because I don’t do anything there. That’s the mystery. The other thing is that if I just put the code in the PostForm, the script tag etc. still appears in the live HTML. I double-checked and cleared the cache. When I cleared the cache, the script reappeared. So for me, it only works in the model. And I don’t understand why the offending tags appear in edit_post.html.

I’m wondering if the following might have something to do with the offending tags appearing in edit_post.html:

# boards/views.py

@login_required
def reply_topic(request, pk, topic_pk):
    topic = get_object_or_404(Topic, board__pk=pk, pk=topic_pk)
    if request.method == 'POST':
        form = PostForm(request.POST)
        if form.is_valid():
            post = form.save(commit=False)
            post.topic = topic
            post.created_by = request.user
            post.save()

            topic.last_updated = timezone.now()
            topic.save()

            topic_url = reverse('topic_posts', kwargs={'pk': pk, 'topic_pk': topic_pk})
            topic_post_url = '{url}?page={page}#{id}'.format(
                url=topic_url,
                id=post.pk,
                page=topic.get_page_count()
            )

            return redirect(topic_post_url)
    else:
        form = PostForm()
    return render(request, 'reply_topic.html', {'topic': topic, 'form': form})

@method_decorator(login_required, name='dispatch')
class PostUpdateView(UpdateView):
    model = Post
    fields = ('message', )
    template_name = 'edit_post.html'
    pk_url_kwarg = 'post_pk'
    context_object_name = 'post'

    def get_queryset(self):
        queryset = super().get_queryset()
        return queryset.filter(created_by=self.request.user)

    def form_valid(self, form):
        post = form.save(commit=False)
        post.updated_by = self.request.user
        post.updated_at = timezone.now()
        post.save()
        return redirect('topic_posts', pk=post.topic.board.pk, topic_pk=post.topic.pk)

I use post = form.save(commit=False) in reply_topic and PostUpdateView. Could that have something to do with it I wonder?

I’m getting closer to finding my answer. nh3.clean() is a function that sanitizes HTML input by removing potentially dangerous tags and attributes, but it doesn’t directly interact with your Django database.

How nh3.clean() works:

  • It takes HTML as input.
  • It removes tags and attributes that are not considered safe, helping to prevent Cross-Site Scripting (XSS) attacks.
  • It returns the sanitized HTML.

When to sanitize:

  • Sanitize on display: If you didn’t sanitize the data when saving it, you can sanitize it before you display it in a template. However, it’s generally better to sanitize the data before saving it. Either way, the safe filter only tells Django not to escape the value; it does not sanitize anything itself. I.e.:

{{ content|safe }}

  • Database integrity: Sanitizing the data before saving it ensures that your database remains clean and free of potentially harmful code.
  • Performance: Sanitizing data before saving it is generally more efficient than sanitizing it every time you display it.
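
To make the difference between escaping and marking as safe concrete, here is a small sketch using only the standard library. It assumes that html.escape() behaves roughly like Django's default template auto-escaping; the variable names are illustrative:

```python
from html import escape

user_input = '<b>bold</b><script>alert("x")</script>'

# Escaping: every tag becomes inert text. This is roughly what a Django
# template does by default for {{ content }}.
escaped = escape(user_input)

# Marking as safe: the string is emitted unchanged, so the <script> tag
# would reach the browser. This is the effect of {{ content|safe }}.
marked_safe = user_input

print(escaped)
print(marked_safe)
```

Sanitizing with nh3.clean() is a third option: it removes disallowed tags but keeps the allowed ones as real HTML, which is why the safe filter is only appropriate for content that has already been sanitized.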

Now I’m going to see how I might be able to achieve this.

I got nh3.clean() to remove offending tags when creating a reply_topic. However, when I want to update that reply, and test nh3.clean() by adding offending tag(s), they are not removed. They remain. I believe this is a flaw in nh3. I have reached out to the maintainers on GitHub about this.

This is what I have right now:

# boards/forms.py

class HtmlSanitizedCharField(forms.CharField):
    def to_python(self, value):
        value = super().to_python(value)
        if value not in self.empty_values:
            value = nh3.clean(
                value,
                # Allow only tags and attributes from our rich text editor
                tags={
                    "a",
                    "abbr",
                    "acronym",
                    "b",
                    "blockquote",
                    "code",
                    "em",
                    "i",
                    "li",
                    "ol",
                    "strong",
                    "ul",
                },
                attributes={
                    "a": {"href", "title"},
                    "abbr": {"title"},
                    "acronym": {"title"},
                },
                url_schemes={"https"},
                link_rel=None,)
        return value

class PostForm(forms.ModelForm):
    message = HtmlSanitizedCharField(widget=forms.Textarea)
    class Meta:
        model = Post
        fields = ['message', ]

# boards/models.py

class Post(models.Model):
    message = models.TextField()
    topic = models.ForeignKey(Topic, on_delete=models.CASCADE, related_name="posts")
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(null=True)
    created_by = models.ForeignKey(User, on_delete=models.CASCADE, related_name="posts")
    updated_by = models.ForeignKey(
        User, on_delete=models.CASCADE, null=True, related_name="+"
    )
    likes = models.ManyToManyField(User, blank=True, related_name="post_likes")

    def total_likes(self):
        return self.likes.count()

    def __str__(self):
        # truncated_message = Truncator(self.message)
        # return truncated_message.chars(30)
        return self.message

    def get_absolute_url(self):
        return reverse("post_detail", kwargs={"pk": self.pk})

    def get_message_as_markdown(self):
        rendered_content = markdown(self.message, extensions=['fenced_code', 'codehilite'])
        return rendered_content

# reply_topic.html

 <div class="mb-2 mt-3" method="post">{{ post.get_message_as_markdown|safe|truncatewords:5 }}</div>

# post_detail.html

<div class="post-message">{{ post.get_message_as_markdown|safe }}</div>

# edit_post.html

{% extends "base.html" %}
{% load static %}
{% block title %}
  Edit post
{% endblock title %}
{% block breadcrumb %}
  <li class="breadcrumb-item">
    <a href="{% url 'index' %}">Boards</a>
  </li>
  <li class="breadcrumb-item">
    <a href="{% url 'board_topics' post.topic.board.pk %}">{{ post.topic.board.name }}</a>
  </li>
  <li class="breadcrumb-item">
    <a href="{% url 'topic_posts' post.topic.board.pk post.topic.pk %}">{{ post.topic.subject }}</a>
  </li>
  <li class="breadcrumb-item active">Edit post</li>
{% endblock breadcrumb %}
{% block content %}
  <form method="post" class="mb-4" novalidate>
    {% csrf_token %}
    {% include "includes/form.html" %}
    <button type="submit" class="btn btn-success">Save changes</button>
    <a href="{% url 'topic_posts' post.topic.board.pk post.topic.pk %}"
       class="btn btn-outline-secondary"
       role="button">Cancel</a>
  </form>
{% endblock content %}

I succeeded in removing offending tags from the edit_post view. This is the code I have now and that works as it should:

# boards/forms.py

import nh3

class HtmlSanitizedCharField(forms.CharField):
    def to_python(self, value):
        value = super().to_python(value)
        if value not in self.empty_values:
            value = nh3.clean(
                value,
                # Allow only tags and attributes from our rich text editor
                tags={
                    "a",
                    "abbr",
                    "acronym",
                    "b",
                    "blockquote",
                    "code",
                    "em",
                    "i",
                    "li",
                    "ol",
                    "strong",
                    "ul",
                    "s",
                    "sup",
                    "sub",
                },
                attributes={
                    "a": {"href"},
                    "abbr": {"title"},
                    "acronym": {"title"},
                },
                url_schemes={"https"},
                link_rel=None,)
        return value

class PostForm(forms.ModelForm):
    message = HtmlSanitizedCharField(widget=forms.Textarea)
    class Meta:
        model = Post
        fields = ['message', ]

Then:

# boards/models.py

class Post(models.Model):
    message = models.TextField()

    def get_message_as_markdown(self):
        clean_content = nh3.clean(self.message, tags={
            "a",
            "abbr",
            "acronym",
            "b",
            "blockquote",
            "code",
            "em",
            "i",
            "li",
            "ol",
            "strong",
            "ul",
            "s",
            "sup",
            "sub",
        },
        attributes={
            "a": {"href"},
            "abbr": {"title"},
            "acronym": {"title"},
        },
        url_schemes={"http", "https", "mailto"},
        link_rel=None,)
        rendered_content = markdown(clean_content, extensions=['fenced_code', 'codehilite'])
        return mark_safe(rendered_content)

Then:

# boards/views.py

@method_decorator(login_required, name='dispatch')
class PostUpdateView(UpdateView):
    model = Post
    fields = ('message', )
    template_name = 'edit_post.html'
    pk_url_kwarg = 'post_pk'
    context_object_name = 'post'
    success_url = "/"

    def get_queryset(self):
        queryset = super().get_queryset()
        return queryset.filter(created_by=self.request.user)

    def form_valid(self, form):
        if form:
            form.instance.message = nh3.clean(form.instance.message, # Allow only tags and attributes from our rich text editor
                tags={
                    "a",
                    "abbr",
                    "acronym",
                    "b",
                    "blockquote",
                    "code",
                    "em",
                    "i",
                    "li",
                    "ol",
                    "strong",
                    "ul",
                    "s",
                    "sup",
                    "sub",
                },
                attributes={
                    "a": {"href"},
                    "abbr": {"title"},
                    "acronym": {"title"},
                },
                url_schemes={"https"},
                link_rel=None,)
            super().form_valid(form)
        post = form.save(commit=False)
        post.updated_by = self.request.user
        post.updated_at = timezone.now()
        post.save()
        print(post.save, 'save the updated data')
        return redirect('topic_posts', pk=post.topic.board.pk, topic_pk=post.topic.pk)

The templates remain the same.

Well yes. It’s a Python package but has nothing at all to do with Django (unless you want to use it in a Django project).

You seem to be making this much too complicated – it’s a sign something’s wrong if you’re having to do the nh3.clean() in three separate places! That’s also a potential for future errors – you might update the tags you want to allow in one or two places but forget the other.

I’ve created a simple test version which works. It doesn’t have any Topics or require a login, but the basics all work.

# models.py
from django.db import models

class Post(models.Model):
    message = models.TextField()
    updated_at = models.DateTimeField(null=True, blank=True)

# forms.py
import nh3
from django import forms
from .models import Post

class SanitizedTextareaField(forms.CharField):
    def clean(self, value):
        value = super().clean(value)
        return nh3.clean(
            value,
            tags={
                "a",
                "abbr",
                "acronym",
                "b",
                "blockquote",
                "code",
                "em",
                "i",
                "li",
                "ol",
                "strong",
                "ul",
            },
            attributes={
                "a": {"href", "title"},
                "abbr": {"title"},
                "acronym": {"title"},
            },
            url_schemes={"https"},
            link_rel=None,
        )

class PostForm(forms.ModelForm):
    message = SanitizedTextareaField(widget=forms.Textarea)

    class Meta:
        model = Post
        fields = ["message"]

# urls.py
from django.urls import path
from . import views

urlpatterns = [
    path("", views.PostCreateView.as_view(), name="post_create"),
    path("<int:pk>/", views.PostUpdateView.as_view(), name="post_update"),
]

# views.py
from django.views.generic import CreateView, UpdateView
from django.shortcuts import redirect
from django.urls import reverse
from django.utils import timezone

from .forms import PostForm
from .models import Post

class PostCreateView(CreateView):
    template_name = "post_form.html"
    form_class = PostForm

    def form_valid(self, form):
        self.object = form.save(commit=False)
        self.object.updated_at = timezone.now()
        self.object.save()
        return redirect(self.get_success_url())

    def get_success_url(self):
        return reverse("post_update", args=[self.object.pk])

class PostUpdateView(UpdateView):
    template_name = "post_form.html"
    form_class = PostForm
    model = Post

    def form_valid(self, form):
        self.object = form.save(commit=False)
        self.object.updated_at = timezone.now()
        self.object.save()
        return redirect(self.get_success_url())

    def get_success_url(self):
        return reverse("post_update", args=[self.object.pk])

post_form.html:

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Document</title>
</head>
<body>
	<h1>Form</h1>
	<form method="post" action="">
		{% csrf_token %}
		{{ form.as_p }}
		<input type="submit">
	</form>
</body>
</html>

It does this:

  1. Displays a page with a form for entering a Post’s message
  2. The form sanitizes the message using nh3.clean()
  3. If the form is valid the PostCreateView() adds the updated_at value and saves the Post.
  4. Redirects to the post update page, displaying the Post, with the message sanitized (which is how it is in the database).
  5. Saving the update form sanitizes the message, updates the updated_at value, saves the Post, and refreshes the page.