Bulk uploading a big Excel sheet

Hey, I'm working on a job that uploads a big Excel sheet into a database. I'm building an API for it with rest_framework, and I'm using pandas to read the uploaded file. My problem is that it takes far longer than it should: more than 20 minutes to process even half of the sheet. I know the time complexity is bad, but what can I actually do about these large Excel uploads? I also need to run some other database queries while creating the model instances. I'm using a loop to create one model instance per row of the sheet, roughly like the sketch below.
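For reference, this is roughly the shape of my current code (a minimal sketch; the `Item` model and the column names are placeholders, not my real schema):

```python
import pandas as pd
from rest_framework.views import APIView
from rest_framework.response import Response

from myapp.models import Item  # hypothetical app and model


class ExcelUploadView(APIView):
    def post(self, request):
        df = pd.read_excel(request.FILES["file"])
        for row in df.itertuples():
            # One INSERT (plus any extra lookup queries) per row:
            # this loop is where all the time goes on a large sheet.
            Item.objects.create(name=row.name, price=row.price)
        return Response({"created": len(df)})
```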

I've tried multiprocessing and parallel jobs with joblib, but it's still not working. I really need some help here. Can anyone help me solve this?

Have you tried using Celery?

Not yet. Instead of reaching for Celery, could we use batch processing?

What do you mean by that?

Multiprocessing and parallel programming.

That's actually what Celery does for you (plus a number of other things as well).
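For example, here is a minimal sketch of fanning the rows out to Celery workers; the task, the `Item` model, and the chunk size are assumptions, not anything from your code. Each worker picks up one slice of the sheet, which gives you multiprocessing-style parallelism without managing the processes yourself:

```python
import pandas as pd
from celery import shared_task

from myapp.models import Item  # hypothetical model


@shared_task
def import_rows(records):
    # records: a list of per-row dicts (values must be JSON-serializable)
    for rec in records:
        Item.objects.create(**rec)


def dispatch_import(path, chunk_size=500):
    df = pd.read_excel(path)
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # Each .delay() queues one chunk; concurrent workers process them in parallel.
        import_rows.delay(chunk.to_dict("records"))
```

A call like `dispatch_import("/tmp/upload.xlsx")` returns as soon as the chunks are queued, and the workers do the slow part in the background.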

@rohitbrleaf, which method did you use in the end that produced the best performance?


  1. I'm using a simple for loop over the pandas dataframe.
  2. I'm currently uploading the data through the Django ORM.
  3. Each object has 9 other foreign keys, 1 many-to-many field, and 1 image upload.
  4. My local PC and the AWS EC2 instance running Django + Postgres are on opposite sides of the world.
  5. My internet download and upload speed is ~900 Mbps (measured by Google), likely faster than my AWS EC2 network speed.


Originally, each object creation took an average of 25 seconds (latency due to distance; if I write to a local Django + Postgres, it's blazing fast).

After I replaced update_or_create with create, that dropped to roughly 16 seconds.
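The change was essentially this (with `Item` standing in for my real model); update_or_create costs an extra SELECT round-trip per row, which is expensive at this latency:

```python
# before: SELECT + (INSERT or UPDATE), i.e. two round-trips per row
Item.objects.update_or_create(code=row.code, defaults={"price": row.price})

# after: a single INSERT per row
Item.objects.create(code=row.code, price=row.price)
```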

At the current performance it will take me months to fully upload all the data.


  1. bulk_create: since my objects have so many relationships, I think I shouldn't use this, based on the caveats (QuerySet API reference | Django documentation | Django); see the sketch after this list.
  2. acreate: I don't think async will help with object creation, since the bottleneck is my simple for loop?
  3. multiprocessing: seems promising; unlike bulk_create, I can picture each process handling one object and its relations in a single flow. Even if I use 23 processes (23 threads), I think Django should be able to handle 23 connections in parallel?
  4. celery: I wonder if this is overkill, because this data upload is a one-time job, and the source data is sent from my local PC to the remote server.
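For reference, a sketch of what option 1 could still look like despite the relations (`Item`, `Category`, the `tags` M2M, and `parse_tag_ids` are all hypothetical stand-ins): prefetch the foreign-key targets once, bulk-insert the objects, then insert the many-to-many rows via the through model, since bulk_create does not populate M2M fields.

```python
from django.db import transaction

from myapp.models import Category, Item  # hypothetical models


def bulk_import(df):
    # Prefetch the FK targets once so the row loop issues no extra queries.
    categories = {c.name: c for c in Category.objects.all()}

    items = [
        Item(name=row.name, category=categories[row.category])
        for row in df.itertuples()
    ]
    with transaction.atomic():
        # On PostgreSQL, bulk_create returns the objects with their PKs set.
        created = Item.objects.bulk_create(items, batch_size=500)

        # bulk_create does not populate M2M fields, so insert the
        # through-table rows directly.
        through_rows = [
            Item.tags.through(item_id=item.pk, tag_id=tag_id)
            for item, row in zip(created, df.itertuples())
            for tag_id in parse_tag_ids(row.tags)  # hypothetical helper
        ]
        Item.tags.through.objects.bulk_create(through_rows)
```

This replaces thousands of per-row round-trips with a handful of batched inserts, which is what matters when latency rather than CPU is the bottleneck.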

@KenWhitesell, how would you do it? Use Celery?