Django app stops responding after 2-3 hours of use

Hello!
We are running a Django app behind Nginx, with PostgreSQL, Daphne, a redis-server, and a channel layer with groups. The app serves about 300 users. Two to three hours after launch, CPU usage climbs to 100 percent and the app stops responding to requests; the Django and Redis logs show no errors.
After a restart the load drops to 5-10 percent, and after another 2-3 hours
it all repeats.
What can I do to find the root of the problem?

docker-compose.yaml


version: '3.9'

services:

  nginx-docker:
    container_name: nginx-docker
    hostname: nginx
    image: repo.saber3d.net/docker/library/nginx:latest
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /etc/localtime:/etc/localtime
      - /etc/timezone:/etc/timezone
      - /opt/docker/data/etc/nginx:/etc/nginx
      - /etc/ssl/saber3d.net:/etc/ssl/saber3d.net
      - /opt/docker/data/nginx-log:/var/log/nginx
      - /var/www:/var/www:ro
      - /mnt/temp:/mnt/temp
    depends_on:
     - my-app-backend
     - my-app-frontend
    tmpfs:
      - /tmp

  my-app-backend:
    container_name: my-app-backend
    image: path.to/my-app-backend
    restart: always
    environment:
      FIX_WWW_DATA: false
    volumes:
      - /etc/localtime:/etc/localtime
      - /etc/timezone:/etc/timezone
      - /mnt/tmp:/mnt/tmp
      - /opt/docker/data/main:/var/lib/postgresql/12/main
      - /opt/docker/data/config.yaml:/app/api/config.yaml
      - /opt/docker/data/users_uploads:/app/api/static/uploads/users_uploads
    depends_on:
     - redis-docker
    links:
     - redis-docker
    tmpfs:
      - /tmp
      - /var/lib/php/sessions

  redis-docker:
    container_name: redis-docker
    image: redis:6.2-alpine
    restart: always
    ports:
      - '6379:6379'
    command: redis-server --save 20 1 --loglevel warning
    volumes: 
      - /opt/docker/data/redis-log:/data

some settings.py

CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("redis-docker", 6379)],
        },
    },
}
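As a side note on the channel layer itself: channels_redis also accepts `capacity` and `expiry` options in `CONFIG`, which bound how many messages can queue per channel and how long unread messages live. The values below are illustrative assumptions, not recommendations:

```python
# Illustrative only: the "capacity" and "expiry" numbers are assumptions;
# tune them against your own traffic.
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("redis-docker", 6379)],
            # max messages queued per channel before sends raise ChannelFull
            "capacity": 1500,
            # seconds an undelivered message may sit in a channel before expiring
            "expiry": 10,
        },
    },
}
```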

consumer.py

import json

from channels.generic.websocket import AsyncWebsocketConsumer


class MyAppConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        self.room_name = self.scope["url_route"]["kwargs"]["some_id"]
        self.room_group_name = "some_%s" % self.room_name
        # Join room group
        await self.channel_layer.group_add(self.room_group_name, self.channel_name)
        await self.accept()

    async def disconnect(self, close_code):
        # Leave room group
        await self.channel_layer.group_discard(self.room_group_name, self.channel_name)

    async def send_info(self, event):
        info = event["some"]

        # Send message to WebSocket
        new_text = json.dumps({"some": info})
        await self.send(text_data=new_text)

    async def receive(self, text_data):
        text_data_json = json.loads(text_data)
        ...some func
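For context on how `send_info` gets invoked: when other code calls `channel_layer.group_send(group, {"type": "send.info", ...})`, channels dispatches the message to the consumer method named by the `type` key, with dots replaced by underscores. A minimal sketch of that dispatch rule (`FakeConsumer` and `dispatch` are stand-ins of mine, not the real channels machinery):

```python
import asyncio

class FakeConsumer:
    # stands in for MyAppConsumer; only the dispatch rule is illustrated
    async def send_info(self, event):
        return {"some": event["some"]}

async def dispatch(consumer, message):
    # channels routes a group_send message to the consumer method named by
    # its "type" key, with "." replaced by "_"
    handler = getattr(consumer, message["type"].replace(".", "_"))
    return await handler(message)

result = asyncio.run(dispatch(FakeConsumer(), {"type": "send.info", "some": "hello"}))
```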

I think the first step is to identify where the CPU is being consumed (which PID) and whether this is correlated to an extensive increase in memory utilization by that process. Your next steps would then depend upon what it is that is causing the issue.

If it’s your Daphne process, then you may want to add some code to show internal memory utilization. You may also want to check your database for long-running queries or hanging connections or something like that.

But in general, the first step is to get a more definitive handle on what’s going wrong.
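On the "show internal memory utilization" point: a stdlib-only sketch you could call from a periodic task or log line (the `memory_report` name is mine, not part of any library):

```python
import resource
import tracemalloc

# start tracing Python-level allocations once, at process startup
tracemalloc.start()

def memory_report():
    # peak resident set size of this process
    # (kilobytes on Linux, bytes on macOS)
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # bytes currently held by Python objects, and the traced peak
    current, peak = tracemalloc.get_traced_memory()
    return {"peak_rss": peak_rss, "py_current": current, "py_peak": peak}
```

Logging this every minute makes it easy to see whether the CPU spike correlates with a steadily growing heap or appears out of nowhere.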


Thank you for the reply! I checked with htop which process is consuming the CPU. It was our manage.py runserver:
CPU: 1100% and 10% memory (of 24 GB total) by one process.
This happens after 2-3 hours of user activity.
Sometimes the load drops back to the usual 10-15% about 5 seconds after the CPU spikes, but sometimes the server stops responding until we restart it.
We had problems with hanging database connections (ASGI Django - PostgreSQL); those did show up in the logs, and we fixed them by setting CONN_MAX_AGE to 0.
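For reference, the CONN_MAX_AGE fix mentioned above corresponds to a DATABASES entry like this (every value except CONN_MAX_AGE is a placeholder):

```python
# Placeholder values except CONN_MAX_AGE; 0 closes the database connection
# at the end of each request, avoiding hung persistent connections under ASGI.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "my_app",
        "HOST": "db-host",
        "PORT": 5432,
        "CONN_MAX_AGE": 0,
    }
}
```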

our req file

Django==5.0.2
sqlparse==0.4.1
django-chartjs==2.2.1
django-mptt==0.13.4
requests==2.26.0
psycopg2==2.9.9
APScheduler==3.9.1
django-debug-toolbar==3.4.0
django-apscheduler==0.6.2
djangorestframework==3.15.1
django-cors-headers==3.13.0
channels~=4.0.0
asgiref~=3.8.1
daphne~=4.1.0
cmake~=3.25.0
channels_redis~=4.2.0
transliterate~=1.10.2

Can you give us advice on how to get a more definitive handle on what’s going wrong, please?

That’s one major problem right there.

Quoting directly from the docs for runserver:

**DO NOT USE THIS SERVER IN A PRODUCTION SETTING.**

Under no circumstances should you be relying upon this to run consistently for hours or days.

Yeah, runserver is for local development purposes only.

Install daphne as a service (via the OS’s systemctl or similar) or under supervisor (or an equivalent process manager).
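For the systemd route, a unit file might look like the sketch below. Every path, user, and port here is an assumption; adapt it to your environment (or, in a container, simply make daphne the entrypoint process as shown later in this thread):

```ini
# Sketch only: paths, user, and port are assumptions, not your real layout.
[Unit]
Description=Daphne ASGI server for my_app
After=network.target

[Service]
User=www-data
WorkingDirectory=/opt/my_app
ExecStart=/opt/my_app/venv/bin/daphne -b 127.0.0.1 -p 8000 my_app.asgi:application
Restart=always

[Install]
WantedBy=multi-user.target
```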

Only the size of our gratitude is greater than the scale of the stupidity of our problem. Thank you so much!

Thank you too!
Could you share a link to information about setting up Django with Daphne and Nginx? We want to use ASGI for both websockets and HTTP responses.

Some of the specifics are going to depend upon your particular environment.

You can find a number of sample configurations in this section of the forum showing different configurations that people are using.
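Beyond the forum samples, the nginx detail that most often bites people proxying to Daphne is the WebSocket upgrade headers. A hedged sketch (upstream name, server_name, certificate paths, and ports are all assumptions):

```nginx
# Sketch only; adapt names, paths, and ports to your environment.
upstream daphne_upstream {
    server my-app-backend:8000;
}

server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate     /etc/ssl/example.com/fullchain.pem;
    ssl_certificate_key /etc/ssl/example.com/privkey.pem;

    location / {
        proxy_pass http://daphne_upstream;
        proxy_http_version 1.1;
        # these two headers are what let the WebSocket upgrade pass through
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```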

Can I ask you for some advice?
I’ve read a lot of articles and forum threads about this, but I couldn’t settle on a specific solution.
We tried this configuration.

docker-compose.yaml

version: '3.9'

services:

  nginx-docker:
    container_name: nginx-docker
    hostname: nginx
    image: repo.ourapp.net/docker/library/nginx:latest
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /etc/localtime:/etc/localtime
      - /etc/timezone:/etc/timezone
      - /opt/docker/data/etc/nginx:/etc/nginx
      - /etc/ssl/ourapp.net:/etc/ssl/ourapp.net
      - /opt/docker/data/nginx-log:/var/log/nginx
      - /var/www:/var/www:ro
      - /mnt/temp:/mnt/temp
    depends_on:
     - my-app-backend
     - my-app-frontend
    tmpfs:
      - /tmp

  my-app-backend:
    container_name: my-app-backend
    image: path.to/my-app-backend
    restart: always
    environment:
      FIX_WWW_DATA: false
    volumes:
      - /etc/localtime:/etc/localtime
      - /etc/timezone:/etc/timezone
      - /mnt/tmp:/mnt/tmp
      - /opt/docker/data/main:/var/lib/postgresql/12/main
      - /opt/docker/data/config.yaml:/app/api/config.yaml
      - /opt/docker/data/users_uploads:/app/api/static/uploads/users_uploads
    depends_on:
     - redis-docker
    links:
     - redis-docker
    tmpfs:
      - /tmp
      - /var/lib/php/sessions

  redis-docker:
    container_name: redis-docker
    image: redis:6.2-alpine
    restart: always
    ports:
      - '6379:6379'
    command: redis-server --save 20 1 --loglevel warning
    volumes: 
      - /opt/docker/data/redis-log:/data

asgi.py

import os

from channels.auth import AuthMiddlewareStack
from channels.routing import ProtocolTypeRouter, URLRouter
from channels.security.websocket import AllowedHostsOriginValidator
from django.core.asgi import get_asgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "my_app.settings")
django_asgi_app = get_asgi_application()

from my_app import routing

application = ProtocolTypeRouter(
    {
        # "http" covers both HTTP and HTTPS requests; ProtocolTypeRouter
        # has no separate "https" protocol type
        "http": django_asgi_app,
        "websocket": AllowedHostsOriginValidator(
            AuthMiddlewareStack(URLRouter(routing.websocket_urlpatterns))
        ),
    }
)

and the backend container’s docker-entrypoint.sh

..db start
..collectstatic
..migrate
daphne -b 0.0.0.0 -p 8000 my_app.asgi:application

and everything works fine: HTTP and WebSocket requests are processed successfully.
But I’m not sure we’re doing everything correctly.
How correct is this configuration? Are there any nuances?

There’s not really any way for me to tell from this - at least not without digging into a number of other areas. But the best indicator would be that it’s working for you - and my experience has been, once you get it working, it stays working.

Thank you very much for your help and advice!

Hello!
Unfortunately, the problem is not resolved.
The CPU is again at 1100% after 2 hours of user activity.
But now the process is

/usr/local/bin/python /usr/local/bin/daphne -b 0.0.0.0 -p 8000 my_app.asgi:application

One minute before the spike, the CPU load was about 5-10% per core.

And I still have no idea where to look to solve the problem.

What kind of requests/responses do you “handle” in daphne?

When you have an issue you don’t know how to solve, the first step is to gather more information.

This goes back to one of my earlier recommendations here:

Other things that you might want to do:

  • Capture the network traffic between nginx and daphne using something like Wireshark or tcpdump - see if you can identify what traffic is being passed before this happens.

  • Check all the system logs - everything from all the docker containers and the host OS, along with the database and redis logs for any apparently-related messages.

  • Re-examine your code for anything that might be doing a “busy-wait”.

  • Increase the logging level of those various processes.

  • Try to recreate this in a test environment.

  • Are you also servicing regular Django requests using daphne in addition to handling the websocket traffic? If this is the case, then also check for what is happening on the Django side right before this happens.

    • Is your Django app async? If so, it might be worth trying to isolate the problem by running your Django app in a different instance of Daphne than your websocket handler.
    • If your Django app is a traditional synchronous project, then I’d definitely suggest giving it its own wsgi server - running it in either uwsgi (that’s what we do) or gunicorn.
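One cheap way to see exactly where a spinning Daphne process is stuck, assuming you can add a couple of lines near the top of your asgi.py: register a faulthandler signal, then poke the process during a spike. The `current_stacks` helper is my own addition, not a library API:

```python
import faulthandler
import signal
import sys
import traceback

# Register once at startup. After this, `kill -USR1 <daphne pid>` makes the
# process print every thread's current Python stack to stderr without
# stopping it; dump a few times during a CPU spike, and the frames that
# keep reappearing are the hot spot.
faulthandler.register(signal.SIGUSR1)

def current_stacks() -> str:
    # the same information as a string, if you'd rather log it yourself
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append(f"Thread {thread_id}:\n" + "".join(traceback.format_stack(frame)))
    return "\n".join(chunks)
```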

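On the “busy-wait” point: in async code the classic culprit is a loop that polls without awaiting anything, which pins a core exactly the way described in this thread. A toy contrast (function names are mine):

```python
import asyncio

async def poll_busily(queue: asyncio.Queue):
    # anti-pattern: never awaits, so it spins the event loop at ~100% CPU
    # whenever the queue is empty
    while True:
        if not queue.empty():
            return queue.get_nowait()

async def poll_politely(queue: asyncio.Queue):
    # yields to the event loop; the process stays idle until an item arrives
    return await queue.get()

async def demo():
    q = asyncio.Queue()
    q.put_nowait("item")
    return await poll_politely(q)

result = asyncio.run(demo())
```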
Thank you very much again! I will be back with more info.