Hello!
We are running a Django app with Nginx, PostgreSQL and Daphne. There is also a redis-server and a channel layer with groups. The app is used by about 300 users. 2-3 hours after launch, CPU load climbs to 100 percent and the app stops responding to requests. The Django and Redis logs show no errors.
After a restart the load drops to 5-10 percent, and after another 2-3 hours it all repeats.
What can I do to find the root of the problem?
I think the first step is to identify where the CPU is being consumed (which PID) and whether this is correlated with a large increase in memory utilization by that process. Your next steps would then depend on what is causing the issue.
If it’s your Daphne process, then you may want to add some code to show internal memory utilization. You may also want to check your database for long-running queries or hanging connections or something like that.
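As a rough illustration of "code to show internal memory utilization", here is a minimal sketch using only the standard library. The function name `resource_snapshot` is made up for this example, and it assumes a Unix host (the `resource` module is not available on Windows). Note that `tracemalloc` only sees allocations made after it was started and adds some overhead, so treat it as a temporary diagnostic.

```python
import resource
import time
import tracemalloc


def resource_snapshot(top_n=5):
    """Return CPU time, peak RSS and the largest Python allocation sites.

    Hypothetical helper for illustration; call it from inside the process
    you want to inspect (for example, from a periodic task) and log the result.
    """
    # Start tracing on first use; earlier allocations are not visible.
    if not tracemalloc.is_tracing():
        tracemalloc.start()

    usage = resource.getrusage(resource.RUSAGE_SELF)
    top = tracemalloc.take_snapshot().statistics("lineno")[:top_n]

    return {
        # CPU seconds (user + system) consumed by this process so far.
        "cpu_seconds": time.process_time(),
        # Peak resident set size: kilobytes on Linux, bytes on macOS.
        "max_rss": usage.ru_maxrss,
        # Source lines that currently hold the most traced memory.
        "top_allocations": [str(stat) for stat in top],
    }
```

Logging this every minute or so would let you see whether memory or CPU time climbs steadily during the 2-3 hours before the spike, or jumps suddenly.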
But in general, the first step is to get a more definitive handle on what’s going wrong.
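For the database check mentioned above, one possible approach is to query PostgreSQL's `pg_stat_activity` view. The sketch below assumes a DB-API cursor that uses `%s` placeholders (as psycopg and Django's `connection.cursor()` do); the helper name `find_long_running` and the 30-second threshold are arbitrary choices for this example.

```python
# Sessions that are not idle and whose current (or last) query started
# more than N seconds ago. "idle in transaction" sessions are included,
# which is often what a hanging connection looks like.
LONG_RUNNING_SQL = """
SELECT pid, state, now() - query_start AS running_for, left(query, 200) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start IS NOT NULL
  AND now() - query_start > make_interval(secs => %s)
ORDER BY running_for DESC
"""


def find_long_running(cursor, min_seconds=30):
    """Return rows for sessions busy for longer than min_seconds.

    Hypothetical helper: `cursor` is any DB-API cursor connected to the
    PostgreSQL server you want to inspect.
    """
    cursor.execute(LONG_RUNNING_SQL, [min_seconds])
    return cursor.fetchall()
```

Running this a few times while the CPU is climbing should show whether any queries or open transactions are piling up.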
Thank you for the reply! I checked with htop which process is consuming the CPU. It was our manage.py runserver:
CPU: 1100% and 10% memory (24 GB total) for that one process.
This happens after 2-3 hours of user activity.
Sometimes 5 seconds after the load on the CPU spikes, the load drops to the usual 10-15%, but sometimes the server stops responding until we restart it.
We had problems with hanging database connections (ASGI Django - PostgreSQL). These problems showed up in the logs, and we fixed them by setting CONN_MAX_AGE to 0.
Can I ask you for some advice?
I’ve read a lot of articles and threads on this subject, but I couldn’t come to any specific solution.
We tried this configuration.
and everything works fine. HTTP and ws requests are processed successfully.
But I’m not sure if we’re doing everything correctly
How correct is this configuration? Are there any nuances?
There’s not really any way for me to tell from this - at least not without digging into a number of other areas. But the best indicator would be that it’s working for you - and my experience has been, once you get it working, it stays working.
When you have an issue you don’t know how to solve, the first step is to gather more information.
This goes back to one of my earlier recommendations here:
Other things that you might want to do:
Capture the network traffic between nginx and daphne using something like Wireshark or tcpdump - see if you can identify what traffic is being passed before this happens.
Check all the system logs - everything from all the docker containers and the host OS, along with the database and redis logs for any apparently-related messages.
Re-examine your code for anything that might be doing a “busy-wait”.
Increase the logging level of those various processes.
Try to recreate this in a test environment.
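To make the “busy-wait” item above concrete, here is a small, made-up asyncio example of the kind of pattern to look for. None of these names come from your project; it only contrasts a polling loop, which keeps a core at 100% while it waits, with waiting on an `asyncio.Event`, which uses no CPU until it is set.

```python
import asyncio


async def wait_busy(state):
    # Anti-pattern: polls a flag without ever really sleeping.
    # The loop yields to other tasks, but it still spins a CPU core
    # for as long as the flag stays False.
    while not state["ready"]:
        await asyncio.sleep(0)
    return "done"


async def wait_blocking(event):
    # Preferred: the task is suspended until event.set() is called.
    await event.wait()
    return "done"


async def demo():
    # Start a waiter, then signal it shortly afterwards.
    event = asyncio.Event()
    waiter = asyncio.create_task(wait_blocking(event))
    await asyncio.sleep(0.01)
    event.set()
    return await waiter
```

In a consumer, the same problem can hide in any `while` loop that checks a condition and retries immediately, for example a retry around a Redis or database call with no delay between attempts.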
Are you also servicing regular Django requests using daphne in addition to handling the websocket traffic? If this is the case, then also check for what is happening on the Django side right before this happens.
Is your Django app async? If so, it might be worth trying to isolate the problem by running your Django app in a different instance of Daphne than your websocket handler.
If your Django app is a traditional synchronous project, then I’d definitely suggest giving it its own wsgi server - running it in either uwsgi (that’s what we do) or gunicorn.