Add alternative mechanism to signals for stopping workload executors #77

prashantmital · 2020-06-02T23:32:31Z

The current design for this project uses SIGINT (on *nix) and CTRL_BREAK_EVENT (on Windows) to coordinate the shutdown of the workload executor process after maintenance has been successfully run on the Atlas cluster.
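
As a rough sketch of what sending these signals involves on the controlling side (this is not astrolabe's actual code; `start_executor` and `stop_executor` are hypothetical names), a Python parent process could start and stop an executor as shown below. On Windows, `CTRL_BREAK_EVENT` can only be sent to a child that was started in its own process group, hence the extra creation flag.

```python
import signal
import subprocess
import sys


def start_executor(argv, **popen_kwargs):
    """Start the workload executor so that it can be signalled later.

    Hypothetical helper for illustration only.
    """
    if sys.platform == "win32":
        # CTRL_BREAK_EVENT is delivered to a process group, so the child
        # must be created in a new one.
        popen_kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
    return subprocess.Popen(argv, **popen_kwargs)


def stop_executor(proc, timeout=10):
    """Ask the executor to shut down, then wait for it and return its exit code."""
    if sys.platform == "win32":
        proc.send_signal(signal.CTRL_BREAK_EVENT)
    else:
        proc.send_signal(signal.SIGINT)
    return proc.wait(timeout=timeout)
```

Whether the executor shuts down cleanly still depends on it trapping the signal, which is the part that has been difficult for some drivers.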
Driver authors have to rely on the standard APIs provided by their language to write a workload executor that conforms to this spec. In practice, this has proven to be easier said than done. To reduce implementation complexity, we should consider providing an alternative to signals: something that is easier to implement and more platform-independent. An obvious solution would be to have astrolabe write a tombstone file to a predetermined location when maintenance has completed, and to have workload executors periodically check for this file and terminate once it is found.
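
A minimal sketch of the executor side of the tombstone idea, in Python. The function name `run_until_tombstone`, its parameters, and the polling interval are assumptions made for illustration; nothing here is specified by the project.

```python
import os
import time


def run_until_tombstone(tombstone_path, run_one_iteration, pause=0.1):
    """Run workload iterations until the tombstone file appears.

    tombstone_path: location agreed on with astrolabe (hypothetical).
    run_one_iteration: callable that performs one pass over the workload.
    Returns the number of iterations that were executed.
    """
    iterations = 0
    # The existence check is the only coordination point; no signal
    # handling is needed in the executor.
    while not os.path.exists(tombstone_path):
        run_one_iteration()
        iterations += 1
        time.sleep(pause)
    return iterations
```

After the loop returns, the executor would write its results file and exit. The cost of this approach is latency: shutdown is noticed only at the next check, so a long-running iteration delays termination.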

prashantmital · 2020-06-02T23:33:23Z

Note that this issue is blocking .NET integration. @vincentkam to add a code example that reproduces the issue we are seeing with signals on Windows + .NET.

mbroadst · 2020-06-03T13:14:14Z

@prashantmital note that using a tombstone file will not work if the workload executor is implemented as a Docker container. In such scenarios, the container would not have access to that file unless it were explicitly mounted into the container, and even then it may still cause trouble. Another approach might be to use a lightweight RPC over domain sockets/named pipes, or to use something like a value in a MongoDB document. I still think the path of least complexity lies with using system signals, so I'm eager to see @vincentkam's code exemplifying the difficulty with trapping these signals on Windows.
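
For reference, this is roughly what trapping the stop signal looks like in a language where it is straightforward. The sketch is in Python and uses a hypothetical helper, `install_stop_handler`; it is an illustration of the signal approach only and says nothing about how hard the equivalent is on .NET or the JVM.

```python
import signal
import threading


def install_stop_handler():
    """Register handlers for the stop signal and return an Event.

    The returned Event is set once the signal has been received, so the
    workload loop can poll it between operations. Must be called from the
    main thread, as required by signal.signal().
    """
    stop_requested = threading.Event()

    def _handler(signum, frame):
        stop_requested.set()

    signal.signal(signal.SIGINT, _handler)
    # On Windows, Python surfaces CTRL_BREAK_EVENT as SIGBREAK.
    if hasattr(signal, "SIGBREAK"):
        signal.signal(signal.SIGBREAK, _handler)
    return stop_requested
```

A workload loop would then run `while not stop_requested.is_set(): ...` and write its results file after the loop exits.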

jyemin · 2020-06-03T15:35:31Z

I haven't delved too deeply yet, but I know that on the JVM there is no straightforward way to install a signal handler. There is also no JVM support for domain sockets.

vincentkam · 2020-06-03T18:32:23Z

The following sample exemplifies the issues we've been running into while trying to get signal handling to work with the combination of Cygwin bash + Windows Python + dotnet.
https://github.com/vincentkam/drivers-atlas-testing/tree/dotnet-signaling-issues

The workload-executor is a Cygwin bash script adapted from the Python driver's bash script. It in turn executes the "native" workload executor, which in this case is basically Program.cs.

I've commented Program.cs to illustrate the flow a bit better, as I know not everyone has a Windows box handy, although a spawnhost with the dotnet toolchain installed should work if anyone wants to play with this example.

The TLDR is that something appears to be terminating the native workload executor before it can finish executing. I suspect it's a Cygwin bash problem because I see similar behavior when using kill -INT on the workload-executor bash script.

Here is a sample test run using Cygwin bash to invoke astrolabe, which was installed via pip under Python on Windows:

Vincent@Astorma:~/projects/drivers-atlas-testing/integrations/dotnet$ /cygdrive/c/Python38/Scripts/astrolabe.exe spec-tests validate-workload-executor -e workload-executor --connection-string mongodb://localhost
test_num_errors (astrolabe.validator.ValidateWorkloadExecutor) ... INFO:astrolabe.utils:Starting workload executor subprocess
INFO:astrolabe.utils:Started workload executor [PID: 19028]
INFO:astrolabe.utils:Waiting 1.0 seconds for the workload executor subprocess to start
+ set -o errexit
+ FRAMEWORK=netcoreapp2.1
+ MAGIC_FILE_NAME=nox
+ CONNECTION_STRING=mongodb://localhost
+ WORKLOAD_SPEC='{"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}, {"object": "collection", "name": "doesNotExist", "arguments": {"foo": "bar"}}]}'
+ echo I am 1307...
I am 1307...
+ rm -f nox
+ export MAGIC_FILE_NAME
+ trap 'echo You have activated my trap card; touch $MAGIC_FILE_NAME; wait $NATIVE_WORKLOAD_EXECUTOR_PID; exit $?' INT
+ export NATIVE_WORKLOAD_EXECUTOR_PID=1309
+ dotnet run --framework netcoreapp2.1 -p workload-executor.csproj mongodb://localhost '{"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}, {"object": "collection", "name": "doesNotExist", "arguments": {"foo": "bar"}}]}'
+ NATIVE_WORKLOAD_EXECUTOR_PID=1309
+ wait 1309
dotnet main> Magic: nox
dotnet main> Arg: mongodb://localhost
dotnet main> Arg: {"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}, {"object": "collection", "name": "doesNotExist", "arguments": {"foo": "bar"}}]}
INFO:astrolabe.utils:Stopping workload executor [PID: 19028]

dotnet int handler> The main program has been interrupted.
dotnet int handler>  Key pressed: ControlBreak
dotnet int handler>  Cancel property: False
dotnet int handler> Setting the Cancel property to true...
dotnet int handler> Spinning until 4s have elapsed. Time (ms) elapsed: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ++ echo You have activated my trap card
7 7 7 You have activated my trap card
7 8 8 ++ touch nox
8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 16 16 16 16 16 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 18 18 18 18 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 23 23 23 23 23 23 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 25 25 25 25 25 25 26 26 26 26 26 26 26 26 26 26 26 26 26 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 28 28 28 28 28 28 28 28 28 28 28 28 28 28 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 ++ wait 1309
INFO:astrolabe.utils:Stopped workload executor [PID: 19028]
INFO:astrolabe.utils:Reading sentinel file 'C:\\users\\vincent\\Projects\\drivers-atlas-testing\\integrations\\dotnet\\results.json'
ERROR:astrolabe.utils:Sentinel file not found
FAIL
test_simple (astrolabe.validator.ValidateWorkloadExecutor) ... INFO:astrolabe.utils:Starting workload executor subprocess
INFO:astrolabe.utils:Started workload executor [PID: 4416]
INFO:astrolabe.utils:Waiting 1.0 seconds for the workload executor subprocess to start
+ set -o errexit
+ FRAMEWORK=netcoreapp2.1
+ MAGIC_FILE_NAME=nox
+ CONNECTION_STRING=mongodb://localhost
+ WORKLOAD_SPEC='{"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}]}'
+ echo I am 1311...
I am 1311...
+ rm -f nox
+ export MAGIC_FILE_NAME
+ trap 'echo You have activated my trap card; touch $MAGIC_FILE_NAME; wait $NATIVE_WORKLOAD_EXECUTOR_PID; exit $?' INT
+ export NATIVE_WORKLOAD_EXECUTOR_PID=1313
+ dotnet run --framework netcoreapp2.1 -p workload-executor.csproj mongodb://localhost '{"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}]}'
+ NATIVE_WORKLOAD_EXECUTOR_PID=1313
+ wait 1313
dotnet main> Magic: nox
dotnet main> Arg: mongodb://localhost
dotnet main> Arg: {"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}]}
INFO:astrolabe.utils:Stopping workload executor [PID: 4416]

dotnet int handler> The main program has been interrupted.
dotnet int handler>  Key pressed: ControlBreak
dotnet int handler>  Cancel property: False
dotnet int handler> Setting the Cancel property to true...
dotnet int handler> Spinning until 4s have elapsed. Time (ms) elapsed: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 ++ echo You have activated my trap card
8 9 You have activated my trap card
9 9 9 ++ touch nox
9 10 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 13 13 13 14 14 14 14 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 17 17 18 18 18 18 18 18 18 18 18 18 18 18 19 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 21 21 21 21 23 24 24 24 24 24 25 25 25 25 25 25 25 25 25 25 25 25 26 26 26 26 26 26 26 26 26 26 26 27 27 27 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 30 30 30 31 31 31 31 31 31 32 32 32 32 32 32 32 33 33 33 33 33 33 33 33 33 34 34 34 34 34 35 35 35 35 35 35 36 36 36 36 36 36 36 36 36 37 37 37 37 37 37 38 38 38 38 38 38 39 39 39 40 40 40 40 41 41 41 41 41 41 42 42 42 42 42 42 42 43 43 43 43 43 44 44 44 44 44 44 44 45 45 45 45 46 46 46 46 46 46 46 46 46 47 47 47 47 47 48 48 48 48 48 48 48 48 48 48 48 49 49 49 49 49 49 49 49 50 50 50 50 50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 51 51 52 52 52 52 52 53 53 53 53 53 54 54 54 54 54 54 54 54 54 55 55 55 55 55 56 56 56 56 56 56 56 56 57 57
 58 58 58 58 59 59 59 59 59 59 59 59 60 60 60 60 60 60 60 60 61 61 61 61 61 61 61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63 63 63 63 64 ++ wait 1313
INFO:astrolabe.utils:Stopped workload executor [PID: 4416]
INFO:astrolabe.utils:Reading sentinel file 'C:\\users\\vincent\\Projects\\drivers-atlas-testing\\integrations\\dotnet\\results.json'
ERROR:astrolabe.utils:Sentinel file not found
FAIL

======================================================================
FAIL: test_num_errors (astrolabe.validator.ValidateWorkloadExecutor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\python38\lib\site-packages\astrolabe\validator.py", line 122, in test_num_errors
    stats = self.run_test(driver_workload)
  File "C:\python38\lib\site-packages\astrolabe\validator.py", line 71, in run_test
    self.fail("The workload executor did not write a results.json "
AssertionError: The workload executor did not write a results.json file in the expected location, or the file that was written contained malformed JSON.

======================================================================
FAIL: test_simple (astrolabe.validator.ValidateWorkloadExecutor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\python38\lib\site-packages\astrolabe\validator.py", line 92, in test_simple
    stats = self.run_test(driver_workload)
  File "C:\python38\lib\site-packages\astrolabe\validator.py", line 71, in run_test
    self.fail("The workload executor did not write a results.json "
AssertionError: The workload executor did not write a results.json file in the expected location, or the file that was written contained malformed JSON.

----------------------------------------------------------------------
Ran 2 tests in 16.298s

FAILED (failures=2)

prashantmital · 2020-06-30T04:17:05Z

See #79 for a proposed alternate strategy for communicating state between astrolabe and workload-executors.

prashantmital added the "high priority" label (This is needed urgently) on Jun 2, 2020.
