I’m wondering how others keep the data in their development database in sync with their production database, without entirely duplicating it?
I have developed a toy Django application and deployed it successfully to a SaaS provider. I am using a local PostgreSQL instance and the Django development server for development, and a separate Postgres database and gunicorn in production.
I can manually upload backups of the production data to the development database, but there are some parts of the production database that I don’t want to replicate. For example, I want to keep separate username and password credentials in development and I don’t want to pollute the development database with migrations records from the production database.
I can exclude the internal Django tables in the production database from the manual backups or uploads to development, but then the foreign key links (e.g. to the user model) will break. Natural keys seem to give me more detailed information about the relations between table entries, and help to avoid some database inconsistencies, but don’t seem to solve this particular problem.
What are some approaches to work around this issue? Is there an opinionated ‘Django way’ to handle it?
Thanks in advance,
This is a great question, and I think there’s no “single way” of doing it; it mostly comes down to preference or requirements. So I’ll give you my personal opinion/experience:
I just don’t, because 99% of the time I don’t need any of the data that is in production to develop new features. Whenever I build a new feature, I add tests (unit/integration/end-to-end) to assert that it works properly, and in those tests I create the data they need. If I need to manually test my application during development, I create the required data by hand, or you can do better than me by setting up fixtures.
It’s mostly a question of why: why do you need production data to develop new features?
Thanks Leandro, for your interesting answer!
I have had a bit of a think about your question:
First, my main driver, has been to make sure the two databases are similar enough for reliable feature development. I think the missing piece there is testing, which I am starting to focus on. I have explored fixtures - but not yet as part of testing. (I initially used them to load data into the development database, though I am now writing data migrations to load initial data into both development and production databases.)
Second, I think I imagined that it would be helpful to have a working backup of the production database locally, and the development database seemed convenient. On reflection, I can see that doesn’t work, because you want the development database to be disposable and not actually part of the production systems at all.
So, in short, I probably need to re-focus my efforts on developing tests (with reusable fixtures to mirror snippets of production-like data), rather than looking to keep the development and production databases in sync.
My rule of thumb for using fixtures vs data migrations is this:
- Fixtures: Data that is not required for the system to work, but can help in development.
- Data migrations: Data that is required for the system to work, like creating objects that can’t be created in the admin, objects that should always be present in the application, etc.
And about this:
This will probably make way more sense when your production database is really large. For example, how do you dump a 100GB production database onto your machine? Woof…
Offering a different perspective here:
We wrote an “anonymizer”. It reads the production DB and writes to a test DB, with all the key data made “reasonably random”. (Names, addresses, emails, etc) It allows us to produce a version that can be shown to the client, and allow them to verify pages and reports that they wouldn’t ordinarily be able to see, before deploying it to production.
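The core idea of such an anonymizer can be sketched in a few lines. This is a hypothetical illustration, not the poster’s actual tool: the field names, the dummy-name pool, and the hashing approach are all assumptions. Hashing the real value to pick a replacement has the nice property that the same input always maps to the same fake value, so references to the same person stay consistent across tables.

```python
import hashlib

# Illustrative pool of dummy values; a real tool would use a much larger
# source of fake names (e.g. the Faker library mentioned below).
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dan", "Erin", "Frank"]


def pseudonymize(value: str, pool: list) -> str:
    """Deterministically map a real value to a dummy one from the pool,
    so the same real value always gets the same replacement."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return pool[int(digest, 16) % len(pool)]


def anonymize_row(row: dict) -> dict:
    """Return a copy of the row with the PII fields replaced.
    The field names here ('first_name', 'email') are assumptions."""
    out = dict(row)
    out["first_name"] = pseudonymize(row["first_name"], FIRST_NAMES)
    out["email"] = pseudonymize(row["email"], FIRST_NAMES).lower() + "@example.com"
    return out
```

A real anonymizer would iterate over every table, apply a mapping like this per column, and write the result into the test database.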
That’s interesting, and makes sense.
We take another approach - and it’s because of our use-case.
We use fixtures for all data that “is required for the system to work”. We have multiple deployments, where the deployments have some data that is different per-deployment. So we create a fixture directory for each deployment in addition to a “base fixture” directory. Deployment consists of applying the base fixtures followed by the per-deployment fixtures.
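The layering described above might look like the following sketch, where the ordered list of fixture files (base first, then per-deployment) would be handed to Django’s `loaddata`. The directory names are assumptions about the layout, not the poster’s actual structure.

```python
from pathlib import Path


def fixture_paths(deployment: str, root: str = "fixtures") -> list:
    """Collect fixture files in apply order: the shared base fixtures
    first, then the per-deployment overrides. Assumes a layout like
    fixtures/base/*.json and fixtures/<deployment>/*.json."""
    base = sorted(str(p) for p in Path(root, "base").glob("*.json"))
    extra = sorted(str(p) for p in Path(root, deployment).glob("*.json"))
    return base + extra
```

Deployment would then be roughly `manage.py loaddata` over that list, e.g. via `django.core.management.call_command("loaddata", *fixture_paths("acme"))`.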
I’ve got some experience of this from two companies.
In the first, we built a tool around the time we started to hear in the UK about GDPR and the impact it would have on our ability to store personally identifiable information (PII), which was a huge part of the business. Around then it was fairly commonplace for developers to work with a copy of production data if they needed to reproduce specific scenarios.
This tool was a Python app that would run through the database, identify columns, and replace any data deemed private with dummy data. There were config options so specific projects/database tables could override how this worked, in case a “name” field was a business name, not PII. The tool was then integrated into the backup pipeline, which used EC2 to take backups and truncate large tables that weren’t relevant to the development copy of the backup.
This was an excellent solution and a neat user experience for devs: we even built a script into projects that could pull the latest anonymised database, load it into your development environment, and update Django’s settings.
At the company I’m with now, we’ve got a similar tool: each project has a management command which knows where PII is stored and replaces it with anonymous data. There’s less focus on PII in the projects here, so it doesn’t need the robust approach above, but it does show that the key really is to have a management command or Python package that can run on your project.
If you want to generate realistic data, you can use Faker:
pip install Faker
It’s quite neat for keeping a realistic experience, while using random data.
from datetime import date
from faker import Faker

fake = Faker()

# Map each column name to a callable that generates a plausible dummy value.
d = dict()
d['first_name'] = lambda: fake.first_name()
d['last_name'] = lambda: fake.last_name()
d['personal_email'] = lambda: fake.email()
d['birth_date'] = lambda: fake.date_between_dates(
    date_start=date(1960, 1, 1), date_end=date(2000, 1, 1)
)
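Applying such a per-column mapping to a row is then a one-liner. The sketch below uses stdlib stand-ins for the generators so it runs without Faker installed; in practice you would swap them for the `fake.*` callables above. The row shape and column names are illustrative assumptions.

```python
import random

# Stand-in generators; replace with the Faker-backed callables in practice.
generators = {
    "first_name": lambda: random.choice(["Alice", "Bob", "Carol"]),
    "last_name": lambda: random.choice(["Smith", "Jones", "Lee"]),
    "personal_email": lambda: "user%04d@example.com" % random.randint(0, 9999),
}


def anonymize(row: dict) -> dict:
    """Replace every mapped column with freshly generated dummy data,
    leaving unmapped columns (ids, foreign keys) untouched."""
    return {k: (generators[k]() if k in generators else v) for k, v in row.items()}
```

Because unmapped columns pass through unchanged, primary and foreign keys survive, which keeps the relations between tables intact.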