A configurable setting should be added to astrolabe that specifies a special namespace, e.g. sentinel_database.sentinel_collection in the test database, which will then be used by astrolabe and workload executors to synchronize their operations.
This might look something like this:
After starting the workload executor, astrolabe writes the following record (with writeConcern: majority) to sentinel_database.sentinel_collection:
{ '_id': <run_id>, 'status': 'inProgress' }
Here, run_id is an identifier known to both astrolabe and the workload executor.
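The startup handshake could be sketched as follows. This is a minimal sketch, not astrolabe's actual code: the function name write_initial_sentinel is hypothetical, and sentinel_coll stands for any collection handle with an insert_one method.

```python
def write_initial_sentinel(sentinel_coll, run_id):
    """Astrolabe inserts the sentinel record at the start of a run.

    sentinel_coll is assumed to be configured with writeConcern: majority;
    run_id is the identifier shared with the workload executor."""
    doc = {"_id": run_id, "status": "inProgress"}
    sentinel_coll.insert_one(doc)
    return doc
```

With pymongo, sentinel_coll could be obtained via `client["sentinel_database"].get_collection("sentinel_collection", write_concern=WriteConcern("majority"))`, importing WriteConcern from pymongo.write_concern.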
After each iteration of running all operations in the operations array (see https://mongodb-labs.github.io/drivers-atlas-testing/spec-test-format.html), the workload executor checks the sentinel_database.sentinel_collection collection (with readConcern: majority) for the record bearing _id: <run_id>. On seeing that the status is still inProgress, the workload executor continues on to the next iteration of running operations.
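The executor-side check between iterations might look like this sketch. The function name should_continue is hypothetical; sentinel_coll is any object with a find_one method (e.g. a pymongo Collection configured with readConcern: majority).

```python
def should_continue(sentinel_coll, run_id):
    """Return True if the workload executor should run another iteration.

    Reads the sentinel record; a missing record or a status other than
    'inProgress' means the executor should stop."""
    doc = sentinel_coll.find_one({"_id": run_id})
    return doc is not None and doc.get("status") == "inProgress"
```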
Once the maintenance has completed and astrolabe wants to tell the workload executor to quit, it updates the sentinel record (using writeConcern: majority) to:
{ '_id': <run_id>, 'status': 'done' }
On the next check, the workload executor sees that the status is now done, and it updates this record with execution statistics (using writeConcern: majority). After this, the workload executor exits.
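The shutdown leg of the protocol could be sketched as below. Both function names are hypothetical, and the proposal does not pin down the exact shape of the statistics document, so the stats fields are the caller's choice.

```python
def signal_done(sentinel_coll, run_id):
    """Astrolabe flips the sentinel status to 'done' to ask the
    workload executor to quit (writeConcern: majority assumed)."""
    sentinel_coll.update_one({"_id": run_id}, {"$set": {"status": "done"}})


def report_stats(sentinel_coll, run_id, stats):
    """Workload executor merges its execution statistics into the
    sentinel record before exiting (writeConcern: majority assumed)."""
    sentinel_coll.update_one({"_id": run_id}, {"$set": dict(stats)})
```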
Astrolabe waits for the workload executor process ($PID) to exit. Once it has exited, astrolabe reads the execution statistics written by the workload executor.
Advantages of this approach
No more signal handling - signal handling has been a thorn in our side during implementation, and we are only up to 2 languages at this point. Workload executors are already equipped to talk to the Atlas deployment, so the approach proposed here should be painless to implement.
Workload executors can 'run anywhere' - since we no longer rely on platform-specific signals, we can coordinate between astrolabe and a workload executor no matter where they are running. This will be especially helpful in the context of running the workload executors inside containers where signals are not a viable option for process synchronization.
Enable support for more complex communication - this design leaves room for richer interactions between the workload executor and astrolabe in the future.
No more sentinel files - we no longer rely on files written by the workload executor to communicate execution stats.
Use what you build - this one is pretty obvious (databases exist to store state and communicate it between processes that might get partitioned).
Edge cases
Workload executor is partitioned from the Atlas test cluster: this will make the workload executor unable to read the sentinel document (caused, e.g., by a bug in the driver being tested or by the Atlas test cluster going offline). This can be handled by putting an appropriate timeout on astrolabe's wait on the workload executor's $PID. If the workload executor does not stop running within the timeout, an error will be reported.
Astrolabe is partitioned from the Atlas test cluster: this is possible even in the current design. If astrolabe cannot write the sentinel document at the start of a run, we can mark the run as a system failure. If astrolabe cannot update the record when it needs to signal the workload executor to stop, or it cannot read the execution stats, we can mark this as a test failure, since the maintenance or workload possibly broke something.
prashantmital changed the title from "Use test database instance to synchronize operations between astrolabe and workload executors" to "Use Atlas test cluster to synchronize operations between astrolabe and workload executors" on Jun 30, 2020.
CC: @mbroadst @vincentkam