How to temporarily route queries originating from a single script run to a different database with the same model structure?

Bear with me on this. I’m learning as I go here, and I likely have some things wrong that need to be corrected. Our project is over a year old now and has a large code base, so unless I can figure out a way to juggle these two databases, a major refactor will be needed, and that would have to be a long-term goal. So what I’m looking for is whether or not what I’m about to describe is possible…

There are a number of factors that led to this question, but basically, we have a data submission validation page so that users who want to submit files for us (the curators) to load into the database can validate their files and fix simple issues before the files get to us.

Originally, we wanted to simply load the files using the load scripts in debug/dry-run mode in the validation view, but given the way the load scripts were developed, they have side effects, even when an exception is raised. We tried wrapping the load in an atomic transaction, but a number of methods in the load make queries that raise exceptions when run inside an atomic transaction.

One quick fix we used was to simply create a separate database, and after the initial implementation (just being careful to use .using(db) and .save(using=db)), it worked for a while - validation loads went into the validation database and errors were returned that users could act on. But we found that adding new code unpredictably led to some confusing problems and errors from the validation load. Honestly, we don’t fully understand what is going on. Various methods called from the load scripts that make independent queries turned out to still be going to the default database, which we didn’t realize would happen (this is our first Django project and we’re learning as we go).

Everything not involving the validation interface (including the load scripts) works as designed and passes all tests. It’s just that we’re having a hard time controlling which database is used when we attempt to make the validation view use only the validation database. For example, a problem I worked around yesterday was a call in a new load script to get_or_create (going through .using("validation")). I added debug prints in the Django base code that showed 4 or 5 queries going to validation, but 2 subsequent queries going to default, and the error we were getting was that the primary key (an AutoField) violated the unique constraint - and I knew that that table in the validation database was empty. I worked around the problem by doing separate get and create calls - and that worked. No more unique constraint violations.

So while trying to figure all this out, I learned of “database routers” and figured I could create a router and switch its default database whenever we run the validation loads, but as I read the docs this morning, it seems that this use case isn’t really what they’re designed for. It seems like routers are designed for individual models always defaulting to a statically set database?
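For anyone following along, a database router is just a plain class that Django consults per model and per operation, which is why it reads as model-oriented: the router is asked “which database for this model?”, not “which database for this script?”. A minimal sketch (the app label `validation_app` is a hypothetical stand-in, not from our project):

```python
# A minimal sketch of a Django database router. Django calls these hooks
# for every ORM read/write, passing the model class involved. The app
# label used here is hypothetical.

class ValidationRouter:
    """Route a fixed set of apps' models to the 'validation' database."""

    # Hypothetical set of app labels whose models should live in 'validation'.
    validation_apps = {"validation_app"}

    def db_for_read(self, model, **hints):
        if model._meta.app_label in self.validation_apps:
            return "validation"
        return None  # no opinion; fall through to the next router / default

    def db_for_write(self, model, **hints):
        if model._meta.app_label in self.validation_apps:
            return "validation"
        return None

    def allow_relation(self, obj1, obj2, **hints):
        return None  # no opinion

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return None  # no opinion
```

Because the decision is keyed off the model (plus optional hints), a router alone can’t distinguish one user’s validation run from other users’ concurrent traffic against the same models - which matches the reading of the docs above.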

I started to wonder how I could implement the desired outcome of having all queries from one script go to the validation database, and whether any such attempt could actually work without affecting the queries of other users who are browsing the site while one user runs the validation code…

Is this even possible? Should I start an overhaul to try to make atomic transactions in the load scripts possible, so that we don’t need a second database?

That’s going to be tough.

As you’ve identified, if you’re calling external libraries that are making queries, you don’t really have any effective means of controlling those queries.

So my first reaction to this would be to spin this action off into a separate task, where that task is launched with a different settings file having a different database default. (What mechanism you use for this may depend upon how frequently this is going to be done. You could set this up as a Celery task, but it’s not necessary to do that. You could just run this as an external process.)
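To make the “different settings file” part concrete, the override module could be as small as this - a sketch, assuming your base settings module (called `myproject.settings` here, a hypothetical name) already defines a `validation` entry in `DATABASES`, as described above:

```python
# settings_validation.py -- hypothetical override settings module.
# The base module name "myproject.settings" is illustrative; use yours.
from myproject.settings import *  # noqa: F401,F403

# Make the validation database the default for this process only, so every
# query that doesn't explicitly say .using(...) lands in 'validation'.
DATABASES["default"] = DATABASES["validation"]
```

The web processes keep the normal settings; only the spawned task is started with this module, so other users browsing the site are unaffected.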

Huh. That could work. At one point in this project I used Celery to implement a download status bar, but my PR was rejected by consensus. I was the only one advocating for it. They determined it was overkill, and I couldn’t blame them. It was indeed a lot of overhead just so users could tell how long their download was going to take.

I think, given my experience with the team, a task-based (Celery or otherwise) implementation with separate settings would also be a hard sell. It’s probably better to start working on the refactor so we can wrap the loads in an atomic transaction.

But I’m gratified to hear that my instincts were right about the database router alone not being appropriate for this use case. I was worried I was missing something.

Like I said, it doesn’t need to be Celery. It doesn’t really need to be anything special, unless the process takes so long that it would exceed the timeout value for a request - in which case you’d need to develop an alternative method regardless of the routing issue. But there’s nothing preventing you from running a task from within a view using Python’s multiprocessing module.
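As a sketch of that, using only the standard library - `run_validation_load` is a hypothetical stand-in for your real load entry point, and the view would call `validate_in_subprocess`:

```python
# Sketch: run the validation load in a child process launched from a view.
import multiprocessing


def run_validation_load(file_path):
    # Hypothetical stand-in for the real load script; imagine it returns
    # the list of user-facing validation errors.
    return [f"checked {file_path}"]


def _worker(file_path, queue):
    queue.put(run_validation_load(file_path))


def validate_in_subprocess(file_path):
    # "fork" keeps this sketch simple. Note that a forked child inherits the
    # parent's already-configured Django settings, so a real implementation
    # that needs a different DJANGO_SETTINGS_MODULE would exec a fresh
    # interpreter (e.g. a management command in a subprocess) instead.
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(file_path, queue))
    proc.start()
    errors = queue.get()  # block until the load reports back
    proc.join()
    return errors
```

If the load runs long, you would return a response immediately and poll for the result instead of blocking on `queue.get()`, per the timeout caveat above.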

You could even set the core of this up as a custom management command, and run this with a custom DJANGO_SETTINGS_MODULE environment setting. This doesn’t appear that it needs to be a persistent task.
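The invocation would then look something like this - the command name `validate_submission` and the paths are hypothetical placeholders:

```shell
# Run a hypothetical custom management command against the validation
# database by pointing this one process at the override settings module.
DJANGO_SETTINGS_MODULE=myproject.settings_validation \
    python manage.py validate_submission /path/to/submission.tsv
```

Only this child process sees the alternate settings; the running web server keeps its normal default database.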

OK, thanks! I was just looking over the possibility of a refactor and it is a bit daunting. I’ll definitely bring this up as a viable solution. Maybe I’ll give it a whirl today and see how it goes.