Django Sitemap Best Practice

I have about 200k URLs. I set the per-page limit to 5000 and cache each page with cache_page for 86400 seconds (24 hours). The sitemap is generated dynamically through a sitemap index (see "The sitemap framework" in the Django documentation). It is super helpful.

One problem: requesting example.com/sitemap.xml sometimes crashes the server. The AWS EC2 console shows the services are OK, but I am unable to SSH into the server and the web app is down.

I am still testing and not 100% sure about the root cause. It may be:

  1. An internal factor: EC2, nginx, gunicorn, or Django
  2. An external factor: web crawlers (Googlebot)?

How do you ensure a sitemap performs well? What are the best practices if you have millions of URLs? Or is the only solution to increase the server specs?

I would like to hear your experience.

Sitemap items

from django.contrib.sitemaps import Sitemap
from blog.models import Entry

class BlogSitemap(Sitemap):
    changefreq = "never"
    priority = 0.5

    def items(self):
        return Entry.objects.filter(is_draft=False)

    def lastmod(self, obj):
        return obj.pub_date

Is there anything we can do on the queryset to improve performance, such as selecting only specific fields (URL, last modified)? Or, since it's lazy, is it already optimized?

Cache

I am using cache_page. Are there any tips that would make a difference?

Compress sitemap

One technique is to create the sitemap with a .gz extension.

If we create the .gz files ourselves, how do we integrate that with the sitemap index?

Cached vs compressed sitemap performance

Is a compressed sitemap more performant than a cached sitemap?

What are the pros and cons?

That indicates something seriously wrong with your configuration.

In the absence of more detailed information, my best guess for this is that this is being caused by using too much memory.

What’s the actual size of the sitemap file(s)?

Does it come back eventually, or do you need to reboot the instance?

What size instance is this? Do you have a swap file allocated?

Are you running the database in that same instance?

Are you using something like cgroups or docker to set an upper limit on memory allocation by your Python processes? (I know that uwsgi has some settings to help facilitate that, I don’t know about gunicorn.)

Have you run this in a test environment such that you can determine a memory usage profile from it?

Is it possible for you to divide this large number of urls into sections?

I will get back with more details.

Setup

One server hosts both the Django web app and the PostgreSQL database.

I use filesystem caching. Yes, Redis / Memcached perform better, as they use memory instead of the filesystem. I will move to one of them eventually. I am using the filesystem for now since storage is cheaper than memory.

Server specs

When I was using an AWS EC2 t3a.micro (1GB RAM) and executed ping_google, the server would hang and go offline.

Once I changed it to a t3a.small (2GB RAM), I could run ping_google with no problem. Overnight it crashed: the EC2 instance failed 1/2 instance checks. This is not conclusive; I still need a few more days to observe.

If I delete the Django cache and regenerate it manually, the server doesn't crash.

I am not using Docker; it is a direct deployment with gunicorn + uvicorn.

Sitemap

It is broken into 2 parts: static (<10 URLs) and section1 (200k+ URLs, 5k per page).

sitemap.xml
  sitemap-static.xml
  sitemap-section1.xml?p=1
  ...
  sitemap-section1.xml?p=47

Thoughts

I will load the data locally to do memory profiling.

I haven't done it yet since my local PC has 128GB of memory, so I am unlikely to see a performance issue, and it's tricky to test gunicorn / uvicorn / nginx locally. I need to think about how to do it.

I think the t3a.small (2GB RAM) may be the issue: the memory is not enough to share between PostgreSQL, Django, and the Linux services. But my monitoring shows that it is not a problem.

Eventually, I will increase the sitemap to roughly 500M URLs. Even if I set 50k per page, that will be 10k pages. That is why I need to think about sitemap optimization, on top of increasing server memory.

Actually, that makes it easier. Just track the memory utilization of your processes. You’ve got enough “headroom” that you don’t need to worry about an out-of-memory situation.

And I’m not sure why you think it’s tricky to test locally - what operating system are you using for your development environment?

I am using Linux and can use any Linux distro, so the environment isn't an issue for me.

It is tricky since my PROD is the only real one. My local setup to date only tests code and features; I don't load it with the same data. I understand this is bad practice, and I will start doing it properly.

Or, if you want to replicate the production environment more closely, you could use VirtualBox to create a constrained environment that more closely resembles the EC2 instance. That also provides the isolation you may want between this productionish system and your normal development environment.

I will set it up and do memory profiling.
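As a starting point for that profiling, here is a minimal sketch using only the standard library's tracemalloc module. It is not the Django sitemap code: render_all_at_once, render_streaming, and peak_bytes are hypothetical helpers, and the example.com URLs are made up. It only illustrates how to compare the peak Python-level memory of building a whole page in memory against producing it one entry at a time.

```python
import tracemalloc
from xml.sax.saxutils import escape


def make_entry(i):
    """Return one <url> element for a made-up entry URL (illustrative only)."""
    loc = escape(f"https://example.com/entry/{i}/")
    return f"<url><loc>{loc}</loc></url>"


def render_all_at_once(n):
    """Collect every entry in a list, then join: the whole page is held in memory."""
    rows = [make_entry(i) for i in range(n)]
    return len("\n".join(rows))


def render_streaming(n):
    """Produce one entry at a time and only count characters, as a file writer would."""
    total = 0
    for i in range(n):
        total += len(make_entry(i)) + 1  # +1 for the newline a writer would add
    return total


def peak_bytes(func, *args):
    """Return the peak number of bytes allocated by Python code while func runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Calling peak_bytes(render_all_at_once, 5000) and peak_bytes(render_streaming, 5000) gives two numbers to compare. Note that tracemalloc only sees allocations made through Python's allocator, so it will not match the VM-level or EC2-level memory figures, and it does not include PostgreSQL's own usage.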

What really confuses me is the stats. It doesn't look like it's struggling.

The maximum CPU usage is 50%, so CPU shouldn't be the bottleneck.

Memory usage, maybe.

Is reading a sitemap CPU or memory intensive?

What about generating the filesystem cache: is that memory intensive?

Most crude method

My local PC runs Unraid OS. I have multiple VMs and Docker containers.

VM1 only hosts the Django web app. PostgreSQL is hosted outside of VM1.

When I open VM1ip/sitemap.xml in a browser on another VM (VM2), memory usage jumps from 4.8GB to 7.4GB, an increase of about 2.6GB.

I guess that on my AWS EC2 t3a.small (2GB), when sitemap.xml is requested for the first time, it will probably use 2.6GB (Django web app) plus PostgreSQL's memory usage (not recorded).

Is this a challenge? If 200k URLs need 3-4 GB RAM, then 500M URLs need 7500-10000 GB RAM?

Quite possibly.

I can’t imagine any valid reason for creating a sitemap file that large. I think you might want to rethink the site design if you have 500M non-parameterized unique URLs.

Yes, I can't afford that kind of server.

I have that many URLs because I am using Django to build a search engine.

I guess I shouldn't add them all to the sitemap; I need to be selective.

No, that’s not the purpose of a sitemap. You need to fundamentally rethink your application's architecture.

I think you are right, other “search engines” don't put all their URLs in their sitemaps.

Crunchbase has 11.3M URLs; that's roughly 169.5-226 GB RAM. I wonder if they dedicate a server just to the sitemap.
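The figures quoted in this thread can be reproduced with a short calculation. This is only a sketch of the arithmetic: it assumes memory grows linearly with the URL count from the 200k URLs / 3-4 GB estimate above, which nothing in the thread has verified, and extrapolate_gb and page_count are hypothetical helper names.

```python
def extrapolate_gb(target_urls, base_urls=200_000, base_gb=(3, 4)):
    """Scale the (low, high) GB estimate linearly from base_urls to target_urls.

    Linear scaling is an assumption made for illustration, not a measurement.
    """
    scale = target_urls / base_urls
    return tuple(gb * scale for gb in base_gb)


def page_count(total_urls, per_page):
    """Number of sitemap pages needed, rounding up to cover a partial last page."""
    return -(-total_urls // per_page)


print(extrapolate_gb(500_000_000))      # the 500M URL case
print(extrapolate_gb(11_300_000))       # the 11.3M URL case
print(page_count(500_000_000, 50_000))  # pages at 50k URLs per page
```

If the sitemap pages were generated as a stream rather than held in memory, the per-request cost would depend on the page size rather than the total URL count, so the linear extrapolation is best read as a worst case.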

Why do you think that those sites actually create a “sitemap” file? What you’re showing doesn’t in any way imply that they create an xml file with this data.

Crunchbase: https://www.crunchbase.com/www-sitemaps/sitemap-index.xml

The sitemap pattern is very similar to the Django sitemap index, except that they use compressed files (.gz) instead of ?p=.

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd">
<sitemap>
<loc>https://www.crunchbase.com/www-sitemaps/sitemap-acquisitions-0.xml.gz</loc>
<lastmod>2023-01-08T00:08:38.000Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.crunchbase.com/www-sitemaps/sitemap-acquisitions-1.xml.gz</loc>
<lastmod>2023-01-08T00:08:41.000Z</lastmod>
</sitemap>
<sitemap>
<loc>https://www.crunchbase.com/www-sitemaps/sitemap-acquisitions-2.xml.gz</loc>
<lastmod>2023-01-08T00:08:44.000Z</lastmod>
</sitemap>
...
</sitemapindex>

I think Crunchbase uses a cron job to generate the .gz files, and maybe they don't update older .gz files.

If I follow this pattern, I shouldn't use the Django sitemap index. I should create the .gz files myself, then somehow have the sitemap index reference them.

I will close this since it is more generic and not Django-specific.
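A minimal sketch of that pattern, using only the Python standard library, could look like the following. It is not how Crunchbase generates its files (the thread does not establish that) and it does not use the Django sitemap framework. The function name write_sitemaps, the sitemap-section1-N.xml.gz naming, and the base_url argument are all assumptions made for illustration. The input is any iterable of (loc, lastmod) pairs, so rows can be streamed from the database instead of loaded at once.

```python
import gzip
from datetime import datetime, timezone
from pathlib import Path
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # sitemap protocol limit per sitemap file


def write_sitemaps(urls, out_dir, base_url, per_file=MAX_URLS_PER_FILE):
    """Stream (loc, lastmod) pairs into numbered .xml.gz files plus an index.

    Only one entry is held in memory at a time. Returns the file names written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = []
    handle = None
    count = 0
    for loc, lastmod in urls:
        if handle is None or count == per_file:
            if handle is not None:
                # Current file is full: close it before starting the next one.
                handle.write("</urlset>\n")
                handle.close()
            name = f"sitemap-section1-{len(names)}.xml.gz"
            names.append(name)
            handle = gzip.open(out_dir / name, "wt", encoding="utf-8")
            handle.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            handle.write(f'<urlset xmlns="{SITEMAP_NS}">\n')
            count = 0
        handle.write(
            f"<url><loc>{escape(loc)}</loc><lastmod>{lastmod}</lastmod></url>\n"
        )
        count += 1
    if handle is not None:
        handle.write("</urlset>\n")
        handle.close()

    # The index lists one <sitemap> entry per generated file.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<sitemapindex xmlns="{SITEMAP_NS}">',
    ]
    for name in names:
        lines.append(
            f"<sitemap><loc>{escape(base_url + name)}</loc>"
            f"<lastmod>{now}</lastmod></sitemap>"
        )
    lines.append("</sitemapindex>")
    index_path = out_dir / "sitemap-index.xml"
    index_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return names
```

A script like this could be run from a cron job, with nginx serving the output directory as static files so that crawler requests never reach gunicorn. This version rewrites every file on each run; skipping files that already exist would be a further change.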

The fact that crunchbase.com sends you a file from a url named sitemap-index.xml does not mean that they generate and store that as an xml file. It could just as easily be generated on the fly using the equivalent of a Django view from data stored in their database.

Good point. I am doing more research; maybe it can be done with a Django view like you say.

EC2 hung again :sleeping:. CPU usage suddenly spikes right before that.

I am quite sure it is caused by bots and the sitemap.

I will find a proper solution soon.