I am facing a problem and am not sure which way is best to go; I would appreciate your help and experience in making the right choice.
I have several Single Board Computers (SBCs) running a Django server with an SQLite DB. They collect all sorts of data and can run independently of each other. Most of the time they run on a LAN only.
I have a web server connected to the Internet, running Django, which hosts the main PostgreSQL database.
Users can input data and visualize it directly from the website.
Regularly, each SBC will connect to the Internet and will need to synchronise its data with the main DB.
How should I handle this synchronisation?
- Should I declare the main DB in each SBC and run a script to synchronise the data?
- Should I write a script that sends POST requests to the main server to populate its DB?
- Should I write a script that prepares a CSV/JSON file to be uploaded to the main server, with a routine on the main server to process it?
And how should I manage synchronisation in the other direction, from the main DB to the SBCs?
Well, at the moment I am a bit lost and would appreciate your opinion on how to handle this.
Trying to get this straight:
- The users can enter data on the web site into the main DB.
- Do the SBCs only retrieve data from the main DB, or do they also supply data to it?
- When an SBC pulls data from the main DB, does it need to pull all the data, or is the data segregated by SBC?
- If the SBCs are sending data to the main DB, is that data then propagated to the other SBCs?
- From the perspective of the SBC, does the data it pulls from the main DB replace the data already on the SBC, or add to the data that currently exists?
- Do you have a feel for how much data is being moved between the nodes?
(Let’s start with these and see where we can go from here)
Users can enter data on the main DB as well as on the SBCs. I will certainly need a way to manage conflicts.
The SBCs both retrieve and supply data. I think I would need a “last_edit” date field and synchronise rows whose last edit is later than the last synchronisation.
An SBC doesn't need to pull all the data: only selected tables, and I could also add conditions if needed to limit the amount of data pulled. Yes, after data is modified on the main DB, it should be propagated to the others.
It adds to the existing data, but if there was an edit on the main DB (for example to a name), the edit should be propagated.
The quantity of data is not enormous; I would say a few thousand table entries.
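The "last_edit wins" rule described above can be sketched as a small merge helper. This is just an illustration with made-up row shapes, not Django code: rows are plain dicts keyed by primary key, each carrying a `last_edit` datetime.

```python
def merge_rows(local, remote):
    """Last-write-wins merge of two dicts of rows keyed by primary key.

    Each row is a dict carrying a 'last_edit' datetime; whichever copy
    was edited more recently wins. (Hypothetical helper, for illustration.)
    """
    merged = dict(local)
    for pk, remote_row in remote.items():
        local_row = merged.get(pk)
        if local_row is None or remote_row["last_edit"] > local_row["last_edit"]:
            merged[pk] = remote_row
    return merged
```

Running the same merge on both sides (SBC and main server) makes the two copies converge, at the cost of silently discarding the older of two conflicting edits.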
The more I think about it, the clearer it becomes. The main DB's architecture would need to be a bit different from the SBCs' DBs.
The main DB should keep a record of the last time each device was updated.
Each table should get a “last_edit” field.
Well, there is some work to put in there, but I am not lazy.
The big question is: how should I interact with the main DB? Through a view on the main server, or should each SBC connect directly to the database?
I think it is best to go through a view on the main server, taking as parameters the ID of the SBC and a JSON payload of all the new data generated or edited on the SBC, and exchanging the data via a JSON response. A script on the SBC side would then update its own database, while the view on the main server handles the update of the main one. I think this is much safer in case of model updates.
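A framework-free sketch of the round trip such a view would implement (all names hypothetical, and plain dicts stand in for the Postgres tables): the SBC posts its device ID, its last-sync timestamp, and its changed rows; the server applies them and returns every row changed since that timestamp.

```python
import json
from datetime import datetime, timezone

# In-memory stand-ins for the main DB and the per-device sync log
# (illustration only; in the real app these would be Postgres tables).
MAIN_DB = {}       # pk -> {"name": ..., "last_edit": ISO timestamp}
DEVICE_SYNC = {}   # device_id -> ISO timestamp of last successful sync

def sync_exchange(payload: str) -> str:
    """One sync round trip, JSON in / JSON out, as the Django view would do it."""
    req = json.loads(payload)
    device_id = req["device_id"]
    since = req.get("since", "")  # last sync time reported by the SBC

    # 1. Apply the SBC's changes; the newer last_edit wins.
    for pk, row in req["changes"].items():
        current = MAIN_DB.get(pk)
        if current is None or row["last_edit"] > current["last_edit"]:
            MAIN_DB[pk] = row

    # 2. Send back everything edited since the SBC last synced.
    #    ISO-8601 strings compare correctly as plain strings.
    outgoing = {pk: row for pk, row in MAIN_DB.items() if row["last_edit"] > since}

    DEVICE_SYNC[device_id] = datetime.now(timezone.utc).isoformat()
    return json.dumps({"changes": outgoing, "server_time": DEVICE_SYNC[device_id]})
```

In the real view the request body would arrive via `request.body` and the reply would go out as a `JsonResponse`, but the shape of the exchange is the same.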
Do you think this is a good direction to take?
I think you’ve got a really good handle on this. So I’m just going to toss out some random thoughts for your consideration. You may or may not find any of this to be helpful.
This may be a really good case for using a version 1 UUID field as primary keys for your tables.
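Why v1 specifically: a version-1 UUID embeds a timestamp and the host's MAC address, so every SBC can mint primary keys offline with no risk of colliding with keys minted elsewhere. A quick demonstration with Python's standard `uuid` module (the `uuid1_time` helper is my own, just to show what's inside the value):

```python
import uuid
from datetime import datetime, timedelta, timezone

def uuid1_time(u: uuid.UUID) -> datetime:
    """Recover the creation timestamp embedded in a version-1 UUID.

    UUID v1 stores a count of 100 ns intervals since 1582-10-15
    (the Gregorian calendar epoch).
    """
    gregorian_epoch = datetime(1582, 10, 15, tzinfo=timezone.utc)
    return gregorian_epoch + timedelta(microseconds=u.time // 10)

# Keys generated independently on different machines won't collide.
pk = uuid.uuid1()

# In a Django model this would look like (sketch):
#   id = models.UUIDField(primary_key=True, default=uuid.uuid1, editable=False)
```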
Yes, I would manage the data exchanges as JSON. Depending upon the number of tables involved, I might even suggest using DRF. (I run Django / DRF on a Raspberry Pi 3B+ with zero problems.)
I wouldn’t have the SBC connect directly to the DB. All interaction would be through APIs on the primary server.
Again, depending upon usage patterns and requirements, I'd consider using Celery to propagate the transactions to the other servers at the time the update is made. This doesn't necessarily mean that those transactions must be applied immediately.
1. User makes an update on the primary server.
2. View submits a Celery task to distribute that update.
3. View responds to the user with the next page.
4. Celery task examines the transaction and calls APIs on all applicable SBCs.
   1. Or, the transaction is queued in specific queues for the appropriate SBCs.
   2. The main server exposes an API to retrieve data from the queues.
Then, the SBC either exposes an API for step 4 to receive incremental updates or, on whatever schedule is required, calls the step 4.2 API to retrieve its data as a batch.
This same logic can be reversed from the perspective of the SBC for handling its updates as well.
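The queued variant (the "step 4.2" path above) can be sketched without Celery at all; here plain dicts and deques stand in for the broker queues, and the routing table is invented for the example:

```python
from collections import defaultdict, deque

# Hypothetical routing table: which SBCs subscribe to which tables.
SUBSCRIPTIONS = {"sensors": ["sbc-1", "sbc-2"], "config": ["sbc-1"]}
SBC_QUEUES = defaultdict(deque)  # device_id -> pending transactions

def distribute(table: str, transaction: dict) -> None:
    """What the Celery task would do: fan the update out to the queue
    of every SBC subscribed to this table (the queued variant)."""
    for device_id in SUBSCRIPTIONS.get(table, []):
        SBC_QUEUES[device_id].append(transaction)

def drain(device_id: str, limit: int = 100) -> list:
    """What the queue-retrieval API would return when an SBC polls
    for its batch of pending updates."""
    batch = []
    queue = SBC_QUEUES[device_id]
    while queue and len(batch) < limit:
        batch.append(queue.popleft())
    return batch
```

With Celery, `distribute` would be the body of a `@shared_task`, and the per-SBC queues would live in the broker or in a database table rather than in process memory.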
Regarding the updates themselves, again based on needs and data structures, you could send the transactions as a set of updates based on ([field name, old value, new value], timestamp) rather than just associating a timestamp with the entire record. That way it becomes possible to merge updates across non-overlapping field sets and identify those conditions where they do overlap.
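To make the field-level idea concrete, here is a sketch of merging two such transaction lists; non-overlapping fields merge cleanly, and overlapping fields are resolved by the later timestamp while being reported as conflicts (the function and the tuple layout are assumptions, not an established API):

```python
def merge_field_updates(updates_a, updates_b):
    """Merge two lists of ([field, old, new], timestamp) updates.

    Updates to distinct fields merge cleanly. When both sides touched
    the same field, the later timestamp wins and the field is reported
    so the conflict can be logged or reviewed.
    """
    merged, conflicts = {}, []
    for (field, old, new), ts in updates_a + updates_b:
        if field in merged and merged[field][1] != ts:
            conflicts.append(field)
        if field not in merged or ts > merged[field][1]:
            merged[field] = (new, ts)
    return merged, conflicts
```

Keeping the `old` value around also gives a crude audit trail: replaying the tuples in reverse order rolls a record back.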
UUID v1 I didn't know at all; I will consider it.
Regarding DRF, I don't know it either, and for the volume of what I need to do I think I will try to handle it with standard views. I will still have a look at it to understand a bit more what it is about.
Agreed on the API.
For Celery, same as DRF: I will have a closer look. I think it might be useful for many things.
Regarding your last point, I agree it would be good to get down to the detail of which fields have been modified, but I don't quite see how to handle it. I don't know Django's object-saving mechanism very well. Can I extract the detail of each transaction and see exactly which fields were modified? This would also be really useful for keeping a history and rolling back if an update was a mistake.
DRF - Would be useful if you’re creating an update process that needs to work across a number of tables - more than 3 or 4. If all you’re updating is one table, my opinion is that there’s not going to be a huge benefit to it.
Really, when you get right down to it, DRF is the “boilerplate code” you would otherwise write to create an API for a model. (It really is a fantastic project. I just don’t see where it helps all that much if you’re only exposing one or two endpoints - you don’t end up creating that much in the way of boilerplate code to do that.)
Yes, Django does make that information available to you if you want to take advantage of it. If you’re creating the transactions, you can have it contain whatever information you’d like. As to how to do that, again, that in some measure depends upon some other factors. (For example, if your table only consists of two or three columns, or, if your table consists of 20 columns but it’s common that 18 of them get updated every time, then it’s probably not worth it to track changes by column - track it by the entire row. Yes, it’s a judgement call.)
How big are your SBCs' databases, and do you typically purge old records on them?
If they aren't huge, I would consider copying or rsyncing the whole SQLite database file back to your main server to be imported. That might be faster, depending on their size.
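Once the SQLite file has been copied to the server, the import step can be done with SQLite's `ATTACH`; a sketch, assuming the copied file and the staging database share the table's schema (the function name and the staging-table approach are my own, and the real main DB here is Postgres, so this would only feed a staging copy):

```python
import sqlite3

def import_sbc_dump(staging_db_path: str, sbc_db_path: str, table: str) -> int:
    """Pull all rows of `table` from a copied SBC database file into a
    staging SQLite database, overwriting rows with matching primary keys.
    Assumes both databases define `table` with the same schema.
    """
    conn = sqlite3.connect(staging_db_path)
    try:
        conn.execute("ATTACH DATABASE ? AS sbc", (sbc_db_path,))
        cur = conn.execute(
            f"INSERT OR REPLACE INTO main.{table} SELECT * FROM sbc.{table}"
        )
        conn.commit()
        return cur.rowcount  # number of rows imported
    finally:
        conn.close()
```

A server-side routine would then reconcile the staging rows into Postgres, which keeps the conflict-resolution logic in one place.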