A big bulk upload..!

@rohitbrleaf, which method did you end up using that gave the best performance?

Context

  1. Using a simple for loop over the pandas dataframe (rough shape sketched after this list)
  2. I am currently uploading the data through the Django ORM
  3. Each object has 9 foreign keys to other models, 1 many-to-many and 1 image upload
  4. My local PC and the AWS EC2 instance running Django + Postgres are on opposite sides of the world
  5. My internet download and upload speed is ~900 Mbps (measured by Google), likely faster than my AWS EC2 network speed
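
For reference, the current loop is roughly shaped like this (a minimal sketch: Product, Category, Tag and the field names are made up, not my real schema):

```python
from django.core.files.base import ContentFile

for row in df.itertuples():
    # Each iteration costs several round trips: the FK lookups, the INSERT,
    # the many-to-many update and the image upload.
    obj = Product.objects.create(
        name=row.name,
        category=Category.objects.get(code=row.category_code),  # 1 of the 9 FK lookups
        # ... the other FK fields ...
    )
    obj.tags.set(Tag.objects.filter(code__in=row.tag_codes))      # many-to-many
    obj.image.save(row.image_name, ContentFile(row.image_bytes))  # image upload
```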

Performance

Originally, each object creation took an average of 25 seconds (the latency is due to distance; if I write to a local Django + Postgres, it's blazing fast).

After replacing update_or_create with create, that dropped to roughly 16 seconds.

At the current rate it will take me months to upload all the data.

Approaches

  1. bulk_create: since my objects have so many relationships, I think I shouldn't use this, based on the caveats (QuerySet API reference | Django documentation | Django) (a possible workaround is the first sketch after this list)
  2. acreate: I don't think async will help with object creation, since it's bottlenecked by my simple sequential for loop?
  3. multiprocessing: seems promising; unlike bulk_create, I can picture each process handling an object and its relations in one flow (second sketch after this list). Even if I use 23 processes, I think Django should be able to handle 23 database connections in parallel?
  4. celery: I wonder if this is overkill, because this data upload is a one-time job, and the source data is sent from my local PC to the remote server
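
If I did try bulk_create, I imagine the workaround would look something like this (same made-up model names as above): set the foreign keys through their *_id attributes, and since Postgres returns primary keys from bulk_create, bulk-insert the many-to-many rows through the auto-generated through model afterwards. The image upload would still need its own per-object step.

```python
# Assumes the FK targets were pre-fetched into code -> id dicts, e.g.
# category_ids = dict(Category.objects.values_list("code", "id"))
rows = list(df.itertuples())

products = Product.objects.bulk_create(
    [
        Product(
            name=row.name,
            category_id=category_ids[row.category_code],  # FK set by id, no extra query
            # ... the other *_id fields ...
        )
        for row in rows
    ],
    batch_size=500,
)

# On Postgres, bulk_create returns the objects with their pks set, so the
# many-to-many rows can be bulk-inserted via the auto-generated through model.
Product.tags.through.objects.bulk_create(
    [
        Product.tags.through(product_id=p.pk, tag_id=tag_ids[code])
        for p, row in zip(products, rows)
        for code in row.tag_codes
    ],
    batch_size=500,
)
```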
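
And this is the multiprocessing version I have in mind (a sketch, not tested: create_one_object is a hypothetical helper wrapping the per-row ORM calls from the first sketch, and the script is assumed to run after django.setup() or inside a management command). The main gotcha I know of is that database connections must not be shared across fork(), so they have to be closed before the pool starts:

```python
from multiprocessing import Pool

from django import db


def upload_chunk(chunk):
    # Each worker process opens its own DB connection on its first query.
    for row in chunk:
        create_one_object(row)  # hypothetical helper doing the per-row creates


if __name__ == "__main__":
    rows = df.to_dict("records")               # plain dicts pickle cleanly across processes
    chunks = [rows[i::23] for i in range(23)]  # 23 roughly equal slices
    db.connections.close_all()                 # don't share one connection across fork()
    with Pool(processes=23) as pool:
        pool.map(upload_chunk, chunks)
```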

@KenWhitesell, how would you do it? Use celery?