Ingesting a large JSON through my Django endpoint

I need to implement a Django endpoint which is able to receive a large unsorted JSON payload, sort it and then return it. I was thinking about:

  • ijson streams over JSON arrays, yielding items without loading the whole file.

  • Each chunk of items is sorted in memory and written to its own temporary file.

  • heapq.merge then merges the sorted files, as in an external merge sort.

  • The merged data is returned using StreamingHttpResponse.
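The steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not a finished implementation: `external_sort`, `stream_json_array`, and the `chunk_size` parameter are hypothetical names, the items are assumed to be mutually comparable (or a `key` is supplied), and every run file stays open during the merge, so a very large number of runs could hit the OS limit on open files.

```python
import heapq
import json
import tempfile
from itertools import islice


def _write_run(sorted_items):
    """Write one sorted run to a temp file, one JSON document per line."""
    run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    for item in sorted_items:
        run.write(json.dumps(item) + "\n")
    run.seek(0)
    return run


def _read_run(run):
    """Lazily yield the items of one run back in their stored order."""
    for line in run:
        yield json.loads(line)


def external_sort(items, chunk_size=10_000, key=None):
    """Yield `items` in sorted order, holding at most `chunk_size` of them
    in memory while sorting (plus one item per run during the merge)."""
    items = iter(items)
    runs = []
    try:
        while True:
            chunk = list(islice(items, chunk_size))
            if not chunk:
                break
            chunk.sort(key=key)
            runs.append(_write_run(chunk))
        # heapq.merge pulls lazily from each run, so the merge streams too.
        yield from heapq.merge(*(_read_run(r) for r in runs), key=key)
    finally:
        for run in runs:
            run.close()  # closing a TemporaryFile also deletes it


def stream_json_array(items):
    """Yield the text of a JSON array piece by piece."""
    yield "["
    for i, item in enumerate(items):
        yield ("," if i else "") + json.dumps(item)
    yield "]"
```

In a view, the input iterable could come from something like `ijson.items(request, "item")`, and the output generator could be passed to `StreamingHttpResponse(..., content_type="application/json")`. One thing to watch for: ijson yields non-integer numbers as `Decimal` by default, which `json.dumps` does not serialize out of the box.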

But I’m currently stuck on getting the data in. I’m using the Django dev server, and I think the issue is that it buffers the entire request body before passing it to Django. If so, incoming chunks are not available incrementally during the request, and a large JSON payload is fully loaded into memory before the view can process it.

So my questions are: is this a viable idea, and do I need something like gunicorn for it? I’m not looking to build a production-grade system, just a working PoC.

The task is homework: sort and return a “large” JSON under a memory constraint. So, the way I see it, I don’t have much choice other than to use disk space in the form of temporary files. I don’t want to get bogged down in infrastructural complexity.

Thanks in advance. I’d be very grateful for any tips, ideas or just being pointed in the right direction.

I think you should look into Celery. You can run a queued task without overloading your application, and you can even be notified when the task finishes.

(What do you consider large? Anything less than 10 MB shouldn’t be a problem. Anything larger than 100 MB really shouldn’t be using HTTP.)

If it is required that you accept the input as an HTTP POST or PUT, then you don’t really have a lot of choice, because something is going to be holding that HTTP request until it is complete.

Having said that, if you have the JSON submitted as a file upload, Django will give you the ability to process the file pieces as individual chunks. (See Upload handlers)
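As a rough illustration of chunk-wise handling, here is a minimal sketch that copies an upload to a temporary file one chunk at a time. The helper name `spool_upload` is hypothetical, and `uploaded` is assumed to be a Django `UploadedFile` (for example `request.FILES["file"]`, where the field name is also an assumption), whose `chunks()` method yields the content as bytes. Any object with a compatible `chunks()` method works.

```python
import tempfile


def spool_upload(uploaded, chunk_size=64 * 1024):
    """Copy an upload to a temporary file chunk by chunk and return the file,
    rewound to the start, so a streaming parser can read it from disk."""
    spool = tempfile.TemporaryFile()  # binary mode, deleted on close
    for chunk in uploaded.chunks(chunk_size):
        # Only one chunk is held in memory at a time.
        spool.write(chunk)
    spool.seek(0)
    return spool
```

With Django's default upload handlers, uploads above the `FILE_UPLOAD_MAX_MEMORY_SIZE` threshold are already written to a temporary file, so in practice the loop body is where per-chunk processing of your own would go.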

But in real terms, this is a case where you’d be better off not using HTTP as the transport protocol. There are much better ways of handling this, but then you’d need to coordinate them with the sending process.

Any real recommendations should be based on a complete understanding of the requirements and the environment.