Add alternative mechanism to signals for stopping workload executors #77

Open
prashantmital opened this issue Jun 2, 2020 · 5 comments
Labels
high priority This is needed urgently

Comments

@prashantmital
Contributor

The current design for this project uses SIGINT (on *nix) and CTRL_BREAK_EVENT (on Windows) to coordinate the shutdown of the workload executor process after maintenance has been successfully run on the Atlas cluster.
Driver authors have to rely on the standard APIs provided by their language to write a workload executor that conforms to this spec. In practice, this has proven to be easier said than done. To reduce implementation complexity, we should consider providing an alternative mechanism to signals: something that is easier to implement and more platform-independent. An obvious solution would be to have astrolabe write a tombstone file to a predetermined location when maintenance has completed, have workload executors periodically check for the existence of this file, and have them terminate once the file is found.
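
As a rough illustration of the tombstone idea, a workload executor's main loop could look something like the Python sketch below. The function name `run_until_tombstone`, the `do_work` callback, and the polling interval are all hypothetical; none of them are part of astrolabe or the spec.

```python
import os
import time


def run_until_tombstone(tombstone_path, do_work, poll_interval=0.1, timeout=None):
    """Call do_work() repeatedly until a file exists at tombstone_path.

    Hypothetical helper: the path, callback, and interval are assumptions
    made for this sketch only.
    Returns the number of completed iterations. If `timeout` (in seconds)
    is given and elapses before the tombstone appears, TimeoutError is raised.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    iterations = 0
    # Check for the tombstone before each unit of work.
    while not os.path.exists(tombstone_path):
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError("tombstone file never appeared: " + tombstone_path)
        do_work()
        iterations += 1
        time.sleep(poll_interval)
    return iterations
```

The only platform-dependent piece here is the filesystem path, which is what makes this approach attractive compared to signal handling. The polling interval bounds how long an executor keeps running after maintenance completes.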

@prashantmital
Contributor Author

Note that this issue is blocking .NET integration. @vincentkam to add a code example that reproduces the issue we are seeing with signals on Windows + .NET.

@prashantmital added the "high priority" (This is needed urgently) label Jun 2, 2020
@mbroadst
Member

mbroadst commented Jun 3, 2020

@prashantmital note that using a tombstone file will not work if the workload executor is implemented as a Docker container. In such scenarios, the container would not have access to that file unless it were explicitly mounted into the container, and even then it may still cause trouble. Another approach might be to use a lightweight RPC over domain sockets/named pipes, or to use something like a value in a MongoDB document. I still think the path of least complexity lies with using system signals, so I'm eager to see @vincentkam's code exemplifying the difficulty with trapping these signals on Windows.

@jyemin
Contributor

jyemin commented Jun 3, 2020

I haven't delved too deeply yet, but I know that on the JVM there is no straightforward way to install a signal handler. There is also no JVM support for domain sockets.

@vincentkam

vincentkam commented Jun 3, 2020

The following sample exemplifies the issues we've been running into in getting signal handling to work with a combination of Cygwin bash + Windows Python + dotnet.
https://github.com/vincentkam/drivers-atlas-testing/tree/dotnet-signaling-issues

The workload-executor is a Cygwin bash script adapted from the Python driver's bash script. It in turn executes the "native" workload executor, which, in this case, is basically Program.cs.

I've commented Program.cs to illustrate the flow a bit better, as I know not everyone has a Windows box handy, although a spawnhost with the dotnet toolchain installed should work if anyone wants to play with this example.

The TLDR is that something appears to be terminating the native workload executor before it can finish executing. I suspect it's a Cygwin bash problem because I see similar behavior when using kill -INT on the workload-executor bash script.

Here is a sample test run using Cygwin bash to invoke astrolabe, which was installed via pip with Python on Windows:

Vincent@Astorma:~/projects/drivers-atlas-testing/integrations/dotnet$ /cygdrive/c/Python38/Scripts/astrolabe.exe spec-tests validate-workload-executor -e workload-executor --connection-string mongodb://localhost
test_num_errors (astrolabe.validator.ValidateWorkloadExecutor) ... INFO:astrolabe.utils:Starting workload executor subprocess
INFO:astrolabe.utils:Started workload executor [PID: 19028]
INFO:astrolabe.utils:Waiting 1.0 seconds for the workload executor subprocess to start
+ set -o errexit
+ FRAMEWORK=netcoreapp2.1
+ MAGIC_FILE_NAME=nox
+ CONNECTION_STRING=mongodb://localhost
+ WORKLOAD_SPEC='{"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}, {"object": "collection", "name": "doesNotExist", "arguments": {"foo": "bar"}}]}'
+ echo I am 1307...
I am 1307...
+ rm -f nox
+ export MAGIC_FILE_NAME
+ trap 'echo You have activated my trap card; touch $MAGIC_FILE_NAME; wait $NATIVE_WORKLOAD_EXECUTOR_PID; exit $?' INT
+ export NATIVE_WORKLOAD_EXECUTOR_PID=1309
+ dotnet run --framework netcoreapp2.1 -p workload-executor.csproj mongodb://localhost '{"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}, {"object": "collection", "name": "doesNotExist", "arguments": {"foo": "bar"}}]}'
+ NATIVE_WORKLOAD_EXECUTOR_PID=1309
+ wait 1309
dotnet main> Magic: nox
dotnet main> Arg: mongodb://localhost
dotnet main> Arg: {"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}, {"object": "collection", "name": "doesNotExist", "arguments": {"foo": "bar"}}]}
INFO:astrolabe.utils:Stopping workload executor [PID: 19028]

dotnet int handler> The main program has been interrupted.
dotnet int handler>  Key pressed: ControlBreak
dotnet int handler>  Cancel property: False
dotnet int handler> Setting the Cancel property to true...
dotnet int handler> Spinning until 4s have elapsed. Time (ms) elapsed: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ++ echo You have activated my trap card
7 7 7 You have activated my trap card
7 8 8 ++ touch nox
8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 16 16 16 16 16 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 18 18 18 18 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 23 23 23 23 23 23 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 25 25 25 25 25 25 26 26 26 26 26 26 26 26 26 26 26 26 26 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 28 28 28 28 28 28 28 28 28 28 28 28 28 28 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 ++ wait 1309
INFO:astrolabe.utils:Stopped workload executor [PID: 19028]
INFO:astrolabe.utils:Reading sentinel file 'C:\\users\\vincent\\Projects\\drivers-atlas-testing\\integrations\\dotnet\\results.json'
ERROR:astrolabe.utils:Sentinel file not found
FAIL
test_simple (astrolabe.validator.ValidateWorkloadExecutor) ... INFO:astrolabe.utils:Starting workload executor subprocess
INFO:astrolabe.utils:Started workload executor [PID: 4416]
INFO:astrolabe.utils:Waiting 1.0 seconds for the workload executor subprocess to start
+ set -o errexit
+ FRAMEWORK=netcoreapp2.1
+ MAGIC_FILE_NAME=nox
+ CONNECTION_STRING=mongodb://localhost
+ WORKLOAD_SPEC='{"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}]}'
+ echo I am 1311...
I am 1311...
+ rm -f nox
+ export MAGIC_FILE_NAME
+ trap 'echo You have activated my trap card; touch $MAGIC_FILE_NAME; wait $NATIVE_WORKLOAD_EXECUTOR_PID; exit $?' INT
+ export NATIVE_WORKLOAD_EXECUTOR_PID=1313
+ dotnet run --framework netcoreapp2.1 -p workload-executor.csproj mongodb://localhost '{"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}]}'
+ NATIVE_WORKLOAD_EXECUTOR_PID=1313
+ wait 1313
dotnet main> Magic: nox
dotnet main> Arg: mongodb://localhost
dotnet main> Arg: {"database": "validation_db", "collection": "validation_coll", "testData": [{"_id": "validation_sentinel", "count": 0}], "operations": [{"object": "collection", "name": "updateOne", "arguments": {"filter": {"_id": "validation_sentinel"}, "update": {"$inc": {"count": 1}}}}]}
INFO:astrolabe.utils:Stopping workload executor [PID: 4416]

dotnet int handler> The main program has been interrupted.
dotnet int handler>  Key pressed: ControlBreak
dotnet int handler>  Cancel property: False
dotnet int handler> Setting the Cancel property to true...
dotnet int handler> Spinning until 4s have elapsed. Time (ms) elapsed: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 ++ echo You have activated my trap card
8 9 You have activated my trap card
9 9 9 ++ touch nox
9 10 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 13 13 13 14 14 14 14 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 17 17 18 18 18 18 18 18 18 18 18 18 18 18 19 19 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 21 21 21 21 23 24 24 24 24 24 25 25 25 25 25 25 25 25 25 25 25 25 26 26 26 26 26 26 26 26 26 26 26 27 27 27 29 29 29 29 29 29 29 29 30 30 30 30 30 30 30 30 30 30 30 31 31 31 31 31 31 32 32 32 32 32 32 32 33 33 33 33 33 33 33 33 33 34 34 34 34 34 35 35 35 35 35 35 36 36 36 36 36 36 36 36 36 37 37 37 37 37 37 38 38 38 38 38 38 39 39 39 40 40 40 40 41 41 41 41 41 41 42 42 42 42 42 42 42 43 43 43 43 43 44 44 44 44 44 44 44 45 45 45 45 46 46 46 46 46 46 46 46 46 47 47 47 47 47 48 48 48 48 48 48 48 48 48 48 48 49 49 49 49 49 49 49 49 50 50 50 50 50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 51 51 52 52 52 52 52 53 53 53 53 53 54 54 54 54 54 54 54 54 54 55 55 55 55 55 56 56 56 56 56 56 56 56 57 57
 58 58 58 58 59 59 59 59 59 59 59 59 60 60 60 60 60 60 60 60 61 61 61 61 61 61 61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63 63 63 63 64 ++ wait 1313
INFO:astrolabe.utils:Stopped workload executor [PID: 4416]
INFO:astrolabe.utils:Reading sentinel file 'C:\\users\\vincent\\Projects\\drivers-atlas-testing\\integrations\\dotnet\\results.json'
ERROR:astrolabe.utils:Sentinel file not found
FAIL

======================================================================
FAIL: test_num_errors (astrolabe.validator.ValidateWorkloadExecutor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\python38\lib\site-packages\astrolabe\validator.py", line 122, in test_num_errors
    stats = self.run_test(driver_workload)
  File "C:\python38\lib\site-packages\astrolabe\validator.py", line 71, in run_test
    self.fail("The workload executor did not write a results.json "
AssertionError: The workload executor did not write a results.json file in the expected location, or the file that was written contained malformed JSON.

======================================================================
FAIL: test_simple (astrolabe.validator.ValidateWorkloadExecutor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\python38\lib\site-packages\astrolabe\validator.py", line 92, in test_simple
    stats = self.run_test(driver_workload)
  File "C:\python38\lib\site-packages\astrolabe\validator.py", line 71, in run_test
    self.fail("The workload executor did not write a results.json "
AssertionError: The workload executor did not write a results.json file in the expected location, or the file that was written contained malformed JSON.

----------------------------------------------------------------------
Ran 2 tests in 16.298s

FAILED (failures=2)

@prashantmital
Contributor Author

See #79 for a proposed alternate strategy for communicating state between astrolabe and workload-executors.
