[GSOC 2020] Parallel Test Runner

Hello everyone!

Communication

I’ve set up a blog here to document my progress and thoughts about the project so far. I’ve made an initial post about the tickets I tackled during the community bonding period; you can find it here. I’m mainly going to use the blog for progress updates and post here for direct communication when I’m stuck on a particular problem. Of course, if my mentors prefer otherwise, I don’t mind posting my progress both here and on the blog, or using another communication channel.

Scheduling

The schedule I mentioned in my proposal is the one I’ll largely be sticking to. I have a long period of exams and assessments running from mid-May until the first week of July. This won’t affect my proposed schedule, however; I’m going to compensate by beginning work on the first milestone this week, which gives me more buffer time during June.

The schedule for documentation purposes is largely this:

  • First Milestone: fixing worker database connections and adapting SQLite’s cloning method to work with spawn.
  • Second Milestone: Adding Oracle backend support to the parallel test runner.
  • Third Milestone: General cleanup, documentation, and tackling other related tickets.

After the first milestone, the parallel test runner will be fully operational on Windows and macOS using spawn. I’ll also ensure the added logic doesn’t break running the parallel test runner with fork.

Current implementation issue

Running the entire test suite with the current patch leads to nondeterministic failures on these tests:

  • check_framework
  • servers
  • test_utils
  • queries
  • transactions
  • transaction_hooks
  • view_tests
  • utils_tests

I say nondeterministic because the number of errors and failures, as well as which tests fail, varies from run to run.

The majority of the errors are OperationalErrors caused by queried tables not existing; the same missing tables are also responsible for the failures.

Curiously enough, running these tests in isolation produces no errors, although two failures remain, from test_utils and utils_tests.

I’m not sure what exactly causes the tables to be removed/not exist when the full test suite is run versus only running one set of tests in isolation.

Running the test suite with --start-at=test_utils also gives no errors, just the two failures from test_utils and utils_tests. After reading through the test runner options, I’m going to use --pair and --bisect to narrow down what causes the failures and post my results afterwards.

Here’s the link to the Jenkins build. It shows the exact errors and failures along with the test names.


Hey Ahmad! It’s fantastic to see your progress. I had a play with your branch tonight - it’s looking good!

Regarding the spurious failures, this seems to be related to Django’s database teardown code. From my investigation it seems that:

  1. A spawned worker dies for whatever reason, which triggers Django to tear down the database (and thus remove the file).
  2. The multiprocessing pool restarts the worker, which somehow results in an empty database.
  3. Subsequent tests then fail because the database has been wiped clean.

I’m not entirely sure this is the exact chain of events, but running runtests.py --keepdb on my MacBook results in far fewer of these random failures, though quite a few remain relating to AttributeError: Can't pickle local object 'infix.<locals>.Operator', which appears to come from a TemplateDoesNotExist exception.

Multiprocessing and files can get tricky. I’m sorry if we’ve already covered/considered this, but have you thought about a two-phase approach where the master process writes a sqlite.db file with all the migrations applied, then each child loads it into memory?

import os
import sqlite3

# Tell Django the database lives in memory, then hand it a connection
# pre-populated from the migrated on-disk copy.
connection.settings_dict['NAME'] = ':memory:'
dest = os.path.join(DIR, 'main.sqlite3')  # DIR: wherever the master wrote it
old_conn = sqlite3.connect(dest)
new_conn = sqlite3.connect(':memory:')
old_conn.backup(new_conn)
new_conn.commit()
connection.connection = new_conn

That might avoid issues where a worker picks up an invalid file after a previous crash, by keeping the on-disk database untouched?

Hey Tom! Thanks for checking out my branch! I pushed a new commit now fixing all test failures except three that are thankfully consistent:

  • test_main_module_is_resolved (utils_tests.test_autoreload.TestIterModulesAndFiles)
  • test (test_utils.test_transactiontestcase.TestSerializedRollbackInhibitsPostMigrate)
  • test_registered_check_did_run (check_framework.tests.ChecksRunDuringTests)

There are also two new failures from the Jenkins build.

After trying out different approaches this week, the current, somewhat stable version uses in-memory databases the way you suggested. Only one database is created per alias; I did this by ignoring any clone suffix number greater than 1 (sketched below). This is a slight optimization that makes a lot of sense in my opinion, because multiple processes can read the same database file concurrently.
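A minimal sketch of that naming logic (illustrative only, with a hypothetical helper name, not the actual patch):

# Illustrative only: every clone suffix of an alias resolves to the same
# on-disk source file, since SQLite allows multiple processes to read one
# database file concurrently.
def source_database_path(alias, suffix):
    # The suffix is deliberately ignored: clones 1, 2, ..., N of an alias
    # all read from a single shared source database.
    return '%s.sqlite3' % alias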

Cloning is done during worker initialization by backing up the on-disk database into a unique in-memory database for each connection, like so:

import sqlite3

# Open the on-disk database created for this alias by the parent process.
sourcedb = sqlite3.connect('%s.sqlite3' % alias)

# Point the connection at a uniquely named shared in-memory database,
# then connect and back the on-disk contents up into it.
settings_dict = connection.settings_dict
settings_dict['NAME'] = 'file:memorydb_{}_{}?mode=memory&cache=shared'.format(alias, _worker_id)
connection.connect()
sourcedb.backup(connection.connection)

Adding connection.connect() removed failures related to database setup, such as transaction behavior in atomic blocks and missing database functions. Under spawn, connections start uninitialized and workers don’t see existing databases. For PostgreSQL and MySQL, we don’t connect during worker initialization; we just point the worker at the correct database name, and the worker connects with that name during the first test run (see the sketch below).
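For comparison, the non-SQLite path amounts to something like this (a hedged sketch of the idea, not the exact patch):

# Sketch only: for PostgreSQL/MySQL the worker just rewrites the database
# name to point at its clone; no connection is opened here. The first
# test that runs triggers the actual connect with this name.
settings_dict = connection.settings_dict
settings_dict['NAME'] = '{}_{}'.format(settings_dict['NAME'], _worker_id)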

Initializing a connection for SQLite in-memory databases, by contrast, was necessary in order to back an existing database up onto them.

It’s also important to connect the database using a unique URI filename, to separate it from other in-memory databases in the same process, as per the documentation.

Connecting with a plain ':memory:' name, as in the first line below, causes cache conflicts and connection conflicts with other database aliases in the same process:

settings_dict['NAME'] = ':memory:'  # Not unique

# Unique (it's possible to drop _worker_id from the name, but I kept it
# for consistency with test names)
settings_dict['NAME'] = 'file:memorydb_{}_{}?mode=memory&cache=shared'.format(alias, _worker_id)
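As a standalone illustration (independent of the patch), two connections in the same process that use the same shared-cache URI see one database, while every plain ':memory:' connection gets its own:

import sqlite3

# Both connections name the same shared-cache in-memory database.
a = sqlite3.connect('file:memorydb_demo?mode=memory&cache=shared', uri=True)
b = sqlite3.connect('file:memorydb_demo?mode=memory&cache=shared', uri=True)
a.execute('CREATE TABLE t (x INTEGER)')
b.execute('SELECT * FROM t')  # works: b sees the table a created

# A plain ':memory:' connection is private to itself.
c = sqlite3.connect(':memory:')
# c.execute('SELECT * FROM t')  # would raise sqlite3.OperationalError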

The last major change I made was switching from VACUUM INTO to backup().
In my testing with in-memory databases, there’s consistently a 20-40 second difference between the two methods. In any case, I’ll benchmark this again after the last three (or five, counting the new Jenkins failures?) stubborn failures are dealt with, to put the matter to rest.
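Something along these lines is what I have in mind for that benchmark (a rough harness; the file names are illustrative):

import sqlite3
import time

def time_backup(path):
    # Clone the on-disk database into memory via the backup API.
    src = sqlite3.connect(path)
    dst = sqlite3.connect(':memory:')
    start = time.perf_counter()
    src.backup(dst)
    return time.perf_counter() - start

def time_vacuum_into(path, dest_path):
    # VACUUM INTO (SQLite 3.27+) writes a compacted copy to a new file.
    src = sqlite3.connect(path)
    start = time.perf_counter()
    src.execute("VACUUM INTO '%s'" % dest_path)
    return time.perf_counter() - start

print('backup():   ', time_backup('default.sqlite3'))
print('VACUUM INTO:', time_vacuum_into('default.sqlite3', 'clone.sqlite3'))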

@adamchainz might be interested to see how the benchmark turns out. I’m betting on backup! :slight_smile:

Side note: directory management wasn’t necessary, so I stripped it all away. It might be needed if we consider Django users, I think, so I’ll test that later. Another small side-goal I have for the parallel test runner in general is to make it more usable and extensible for Django users.