@rohitbrleaf, which method did you use in the end that produces the best performance?
Context
- Iterating over a pandas DataFrame with a simple for loop
- I am currently uploading data using the Django ORM
- Each object has 9 foreign keys, 1 many-to-many field, and 1 image upload
- My local PC and the AWS EC2 instance running Django + Postgres are on opposite sides of the world
- My internet download and upload speed is ~900 Mbps (measured by Google), likely faster than my AWS EC2 network speed
Performance
Originally, each object creation took an average of 25 secs (latency due to distance; if I write to a local Django + Postgres, it's blazing fast).
After I replaced `update_or_create` with `create`, that dropped to roughly 16 secs.
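For reference, a minimal sketch of the per-row loop as described, assuming a hypothetical model `Sample` with illustrative field names. The gain comes from `update_or_create` first running a SELECT inside a transaction before it inserts or updates, so each row costs an extra round-trip at this latency, while `create` is a single INSERT:

```python
from myapp.models import Sample  # hypothetical app and model


def upload(df):
    for row in df.itertuples(index=False):
        # create() issues one INSERT per row; update_or_create() would
        # first run a SELECT inside a transaction, adding a round-trip
        # to the remote Postgres for every object.
        Sample.objects.create(name=row.name, value=row.value)
```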
At the current rate, it will take me months to upload all the data.
Approaches
- `bulk_create`: since my objects have so many relationships, I don't think I should use this, given the caveats (QuerySet API reference | Django documentation | Django)
- `acreate`: I don't think async will help with object creation, since it's bottlenecked by my simple for loop?
- multiprocessing: seems promising. Unlike `bulk_create`, I can picture each process handling one object and its relations in a single flow (see the sketch after this list). Even if I use 23 processes (23 threads), I think Django should be able to handle 23 connections in parallel?
- Celery: I wonder if this is overkill, because this data upload is a one-time job, and the source data is sent from my local PC to the remote server
@KenWhitesell, how would you do it? Use Celery?