A few weeks ago I ran into a problem with large requests generated by some users who have a lot of data. The problem was on a dashboard page that allows the user to select many rows from a table and generates a request with the information from those rows.
The first problem was the request size (~30 MB). The quick solution: increase the maximum payload size on the web server. That was my first fix, but it was quite obvious it was not the right one, since a huge payload limit leaves the server wide open to DDoS attacks.
A second problem was parsing that JSON data. Python takes several minutes just to parse all the JSON in the payload, so, as if that were not enough, the maximum time limit for a request also had to be increased (further fueling the possibility of a DDoS).
Finally, I was able to reduce the size of the request by compressing it before it was shipped to the backend (final request ~3 MB), but the parsing problem still persists.
I know that a web server must avoid three types of tasks:
- Those that take an indeterminate amount of time to finish (for example, checking an email)
- Those limited by CPU (processing, calculations, etc.)
- Those limited by RAM
I am aware that my situation falls mainly into the last two categories, and I know that a task-queuing mechanism like Celery could help me solve both problems, but I am not sure what the flow should be, or whether there is a “standard” for this kind of situation; in particular, I am referring to situations where you have a large payload that has to be consumed by a background process (run with the help of Celery).
I have thought about this solution (rough sketches of each step follow the list):
- Generate the compressed request on the client, but never send it to the web server; instead, upload it to a service like S3 through a presigned URL.
- Notify Django, or have a Lambda that is triggered by the insertion of the object into S3 and that enqueues a task in Celery.
- When the task is executed on the Celery worker, decompress the payload and do all the heavy parsing and calculations (I can split this part into two stages, which makes it simpler to constrain the decompression step in resources and time).
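For the first step, this is roughly what I have in mind: a minimal sketch, assuming boto3 credentials are available in the environment, with a placeholder bucket name (`upload-bucket`) and key layout:

```python
import uuid

import boto3
from django.http import JsonResponse

s3 = boto3.client("s3")

def create_upload_url(request):
    """Return a presigned PUT URL so the browser uploads the compressed
    payload straight to S3 and the body never hits the web server."""
    key = f"payloads/{request.user.id}/{uuid.uuid4()}.json.gz"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "upload-bucket", "Key": key},
        ExpiresIn=300,  # the URL is only valid for 5 minutes
    )
    return JsonResponse({"upload_url": url, "key": key})
```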
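For the second step (the Lambda variant), a sketch assuming the function can reach the Celery broker, whose URL is passed in a `BROKER_URL` environment variable, and that the worker registers a task named `payloads.process_payload` (both names are placeholders):

```python
import os
from urllib.parse import unquote_plus

from celery import Celery

# Only a broker is needed here; the Lambda just enqueues, it never runs tasks.
app = Celery(broker=os.environ["BROKER_URL"])

def handler(event, context):
    """Triggered by an s3:ObjectCreated:* notification: enqueue one Celery
    task per uploaded object, passing only the bucket/key, never the payload."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # S3 URL-encodes keys in events
        app.send_task("payloads.process_payload", args=[bucket, key])
```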
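And for the third step, the task on the worker side; `process_rows` is a placeholder for the real parsing/calculation logic, and `app` is the project's Celery instance:

```python
import gzip
import json

import boto3

from myproject.celery import app  # the project's Celery instance (placeholder path)

s3 = boto3.client("s3")

@app.task(name="payloads.process_payload")
def process_payload(bucket, key):
    """Download the compressed payload, decompress it, parse the JSON and
    run the heavy calculations entirely outside the request/response cycle."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    rows = json.loads(gzip.decompress(body))
    process_rows(rows)  # placeholder for the expensive parsing/calculation step
```

Splitting this into two chained tasks would just mean having the first task handle the download/decompression (under its own resource and time limits) and passing its result, or a pointer to it, to the second.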
It seems to me an acceptable and scalable solution, but I have no experience with this kind of flow; do you know a better way to do it? Clearly the client could receive feedback on the progress of the process through a WebSocket or HTTP polling (a minimal polling endpoint is sketched at the end of this post), but I still have some doubts about the implications of this flow:
- Pipelines like this, generated internally in my infrastructure, can cause problems if an intermediate step breaks and the subsequent steps never run. I would really like to have a record of what failed and what was executed, and the possibility of restarting a failed process.
One solution (ideal, but maybe not necessary for this simple case) is to use something like Airflow, but I have a feeling that for this case everything could be done with Celery; is that right? (See the retry sketch at the end of this post.)
- What things could I be missing? For example (my current ideas for these are sketched at the end of the post):
- I guess making each step idempotent is a necessary property.
- Perhaps a periodic process would be needed to delete or archive the compressed requests?
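Regarding the progress feedback mentioned above, a minimal HTTP-polling endpoint, assuming a Celery result backend is configured and the client somehow knows the task id (since the Lambda is the one that enqueues, I would still need to surface that id, perhaps by storing it against the S3 key):

```python
from celery.result import AsyncResult
from django.http import JsonResponse

def task_status(request, task_id):
    """The client polls this until the state is SUCCESS or FAILURE."""
    result = AsyncResult(task_id)
    return JsonResponse({"state": result.state, "info": str(result.info)})
```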
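On the “what failed and can I restart it” doubt, my impression is that plain Celery covers a lot of it if a result backend (django-celery-results, Redis, etc.) is enabled: failed tasks are recorded with their traceback, and since the input lives in S3, a failed run can simply be re-enqueued with the same bucket/key. Roughly the options I would add to the task above (the values are just examples):

```python
@app.task(
    name="payloads.process_payload",
    bind=True,
    autoretry_for=(ConnectionError, TimeoutError),  # retry transient network/S3 errors
    retry_backoff=True,   # exponential backoff between retries
    max_retries=5,
    acks_late=True,       # re-deliver the message if the worker dies mid-task
)
def process_payload(self, bucket, key):
    ...  # same body as in the sketch above
```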
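And for idempotency, the simplest thing I can think of is a small Django model (ProcessedPayload here is hypothetical) keyed by the S3 object key, so retries and manual restarts of the same payload become no-ops; the task would call `claim(key)` first and return early when it gets `False`. For the cleanup question, an S3 lifecycle rule that expires objects under the `payloads/` prefix after a few days would avoid having to run a periodic deletion task myself.

```python
from django.db import models

class ProcessedPayload(models.Model):
    """One row per S3 object that has already been fully processed."""
    key = models.CharField(max_length=512, unique=True)
    processed_at = models.DateTimeField(auto_now_add=True)

def claim(key):
    """Return True only for the first worker that claims this key."""
    _, created = ProcessedPayload.objects.get_or_create(key=key)
    return created
```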
Thank you for your response