Multimaster Django setup


Django supports multiple databases, but I wonder whether it’s possible to do a master-master Django instance setup.

If you have done something similar, but using other methods (load balancer etc), please share your experience :grin:

I may be asking too much, but I’d prefer to keep it as simple as possible.

As it’s just an indie project, going full-blown AWS Application Load Balancer + ECS cluster would be overkill. I don’t have that kind of traffic, nor that kind of :moneybag:. This data-load thing is a batch job, not a constant workload.

Use case

1 Django instance (webapp + postgres) hosted in AWS US EC2

I am writing a huge amount of data using the Django ORM, but I am located in APAC. It takes an average of 16 seconds per record. I use the ORM instead of bulk_create because of relations + django-reversion (A big bulk uploading ..! - #7 by shawnngtq)
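A back-of-envelope model of why each record is so slow (the query count and round-trip time below are assumptions, not measurements): every ORM query pays a full APAC↔US round trip, and a create with relations plus a reversion snapshot issues many sequential queries:

```python
# Illustrative figures only: neither number is measured.
RTT_S = 0.25               # assumed APAC <-> US round-trip time per query
QUERIES_PER_RECORD = 64    # assumed: related lookups + inserts + reversion rows

per_record = QUERIES_PER_RECORD * RTT_S
print(f"{per_record:.0f} s per record")  # prints "16 s per record"
```

So even modest per-record query counts explain the observed 16 s once each query crosses the Pacific.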

I am thinking of setting up an APAC EC2 Django instance to speed this up.

But I have no idea how to keep the APAC and US EC2 Django instances in sync with a master-master setup.

My expected outcome

1 Django instance (webapp + postgres) hosted in AWS US EC2

1 Django instance (webapp + postgres) hosted in AWS APAC EC2

I will create records using the AWS APAC EC2 instance. Then AWS US EC2 will sync with AWS APAC EC2.

Why not multiple databases

My current data flow:

APAC send data → AWS US EC2 webapp + postgres

If I create an AWS EC2 APAC postgres database, I expect the data flow to become:

APAC send data → AWS US EC2 Django database router → AWS APAC EC2 postgres

Which is worse in both performance and cost (AWS data transfer).
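To make that router hop concrete, here is a minimal sketch of the DATABASE_ROUTERS setup it would imply (the `apac` database alias and the `imports` app label are made up); the point is that the US webapp still mediates every query, it just forwards the writes to APAC:

```python
class ApacWriteRouter:
    """Route writes for the bulk-import app to the APAC database alias."""

    def db_for_read(self, model, **hints):
        return None  # reads fall through to "default" (the US database)

    def db_for_write(self, model, **hints):
        if model._meta.app_label == "imports":  # assumed app label
            return "apac"  # assumed alias in settings.DATABASES
        return None

    def allow_relation(self, obj1, obj2, **hints):
        return None  # no opinion; let Django decide

    def allow_migrate(self, db, app_label, **hints):
        return None  # no opinion; migrate everywhere

# settings.py would then list: DATABASE_ROUTERS = ["path.to.ApacWriteRouter"]
```

Every write now travels APAC → US webapp → APAC database, which is exactly the extra hop (and data-transfer bill) described above.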

Load balancer

Even if I use a load balancer that can redirect user traffic to either US or APAC, it would be pointless if the two Django instances are not in sync (master-master / master-slave), wouldn’t it?

Is my understanding of load balancer wrong?

Multimaster postgres

Is multimaster postgres the solution? Maybe the data sync between the US and APAC EC2 instances shouldn’t be handled by Django, but by postgres?

But what if records are created in US & APAC at the same time? Wouldn’t that cause data conflicts, since the sync is not real-time?
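One common mitigation for that insert conflict (a sketch, not a full conflict-resolution scheme) is to make primary keys globally unique instead of auto-incrementing — on the Django side that would be a `models.UUIDField(primary_key=True, default=uuid.uuid4)`. A plain-Python illustration of why the two regions then can’t collide on inserts:

```python
import uuid

# Auto-increment keys in two independent masters both start at 1 and collide;
# UUIDs generated in each region are globally unique, so concurrent inserts
# never clash on the primary key. (Concurrent *updates* to the same row still
# need a conflict-resolution policy.)
us_ids = {uuid.uuid4() for _ in range(1000)}
apac_ids = {uuid.uuid4() for _ in range(1000)}
print(len(us_ids & apac_ids))  # prints 0
```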

Reference

https://lincolnloop.com/high-performance-django/intro.html

In thinking about this, there may be other options, but the best choice is really going to depend upon the specifics of your exact situation.

You mention that “This … is a batch job, not going to be a constant workload”. Does that mean that this is a 1-time event, or something that is going to happen periodically? (If periodically, how frequently?)

Are there other updates being applied to the database? How frequently does that occur?

Do you have the ability to stop the system for long enough to do these bulk updates? Or must that system remain consistently available?

Do you know the link speed between “APAC” and “AWS EC2”? If that link is no faster than the link speed between “YOU” and “AWS EC2”, then a multi-master situation isn’t going to provide any benefit.

You mention that “This … is a batch job, not going to be a constant workload”. Does that mean that this is a 1-time event, or something that is going to happen periodically? (If periodically, how frequently?)

Maybe once every half a year?

Do you have the ability to stop the system for long enough to do these bulk updates? Or must that system remain consistently available?

I can stop the system if required. It took roughly 36 days to import around 200k records. If my next batch of data is 500M, that won’t work.
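Scaling the observed rate makes that concrete (reading “500M” as 500 million records — an assumption):

```python
records, days = 200_000, 36
rate = records / days            # ~5,555 records per day at ~16 s each
next_batch = 500_000_000         # assumed: "500M" = 500 million records
years = next_batch / rate / 365
print(round(years))              # prints 247
```

Roughly 247 years at the current rate, so the approach clearly can’t scale to the next batch.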

Do you know the link speed between “APAC” and “AWS EC2”? If that link is no faster than the link speed between “YOU” and “AWS EC2”, then a multi-master situation isn’t going to provide any benefit.

Link speed between AWS EC2 APAC and USA should be 5Gbps* business internet.

I think it should be faster than the speed between my APAC computer and EC2 USA since mine is 1Gbps* commercial internet.

The latency between these 2 regions is the worst, since they’re literally on opposite sides of the world.

My temp “solution”

After this long data import, I think the easiest solution for me is to move from AWS EC2 USA to AWS EC2 APAC to overcome the latency problem, even if it’s going to increase my AWS bill by 70%.

I don’t have the resources to deal with the multi-region challenge right now :sweat_smile:

I see that other popular frameworks such as Java Spring and Ruby on Rails don’t deal with this challenge either; Django is still the king to me.

Possible Django future enhancements

  1. Django cluster in same region (multiple availability zone)
    • Where django instances share their app state / cache
    • My idea is that all the Django instances use the same remote redis cache. If there is a Django setting that allows us to do this, that would be perfect. Enhancing the cache framework (Django’s cache framework | Django documentation | Django). Or can the current redis backend already handle this?
  2. Django cluster in multi-region
    • Maybe impossible due to latency
    • At the DB level, the CAP theorem says that under a network partition we can’t have both consistency (ACID-style) and high availability
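For idea 1, a settings sketch (the Redis host name is made up): Django’s built-in Redis cache backend (Django 4.0+) already lets every instance in the cluster point at the same remote cache, with no new setting needed:

```python
# settings.py -- shared by every Django instance in the cluster
CACHES = {
    "default": {
        # Built into Django 4.0+; earlier versions can use django-redis.
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://cache.internal.example:6379",  # assumed shared host
    }
}

# Optional: keep sessions in the shared cache too, so any instance
# can serve any user.
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
```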

I think load balancing & web acceleration are out-of-scope since they’re upstream. But Django app → cache → database may be an area to explore.

Reference

https://lincolnloop.com/high-performance-django/intro.html#why-the-rush-to-cache

Load balancer → Web accelerator → Django app → Cache → Database

I’d be looking at a different approach for this. Since you’re only looking at doing this twice a year, I’d consider replicating the data locally, doing the data prep, and then exporting the updates as an update file that could then be copied to the target system and updated directly.


I think I get what you mean

  1. load the local db with the latest data dump
  2. create records in the local db
  3. dump that local db to a sql file
  4. upsert that sql file into the target db

Nice, I will research step 4 since it’s scary.
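Step 4 is less scary if the dump from step 3 is first restored into a staging table (e.g. with psql or pg_restore); the upsert is then a single Postgres `INSERT ... ON CONFLICT` statement. A sketch with invented table and column names:

```python
# Hypothetical names: "records" is the live table, "records_staging" holds the
# restored dump. ON CONFLICT makes the statement an upsert: existing ids are
# updated, new ids are inserted.
UPSERT_SQL = """
INSERT INTO records (id, payload)
SELECT id, payload
FROM records_staging
ON CONFLICT (id) DO UPDATE
    SET payload = EXCLUDED.payload;
"""
print(UPSERT_SQL.strip())
```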

But I think we should keep this open, since I am still very interested in the possibility of multimaster Django, and it’s the first such topic in this forum.

Yes. You could make it more sophisticated (reduce the size of the reload) by doing a diff between the file you retrieved and the file you’re sending back, and only reloading those rows that have changed or been inserted.
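That diff idea, sketched with invented in-memory rows (real code would compare per-table dumps keyed by primary key):

```python
# id -> row as originally fetched vs. after local editing (made-up data)
before = {1: "a", 2: "b"}
after = {1: "a", 2: "B", 3: "c"}

# Keep only rows that changed (id 2) or were inserted (id 3);
# unchanged rows (id 1) drop out of the reload entirely.
delta = {pk: row for pk, row in after.items() if before.get(pk) != row}
print(delta)  # prints {2: 'B', 3: 'c'}
```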