-
Notifications
You must be signed in to change notification settings - Fork 492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Amqp link recovery fix #611
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @varunpuranik !
/cc @tameraw @myagley @abhipsaMisra While working on tests we found a gap in our fault injection code: it will not actually tear-down the AMQP connection (just links). We need to implement a different technique for fault injection that tests this scenario. The reason this change doesn't have any automated tests associated at the moment: My proposal is to use network-driver level tools such as clumsy/WinDivert to ensure connections torn down at TCP layer get recovered for all protocols. This is planned together with long-running tests (this fault only happens if the remote endpoint is down for two retries). @varunpuranik manually tested this for both AMQP and MQTT. |
@@ -219,7 +220,11 @@ async Task<T> ExecuteWithErrorHandlingAsync<T>(Func<Task<T>> asyncOperation, boo | |||
throw new IotHubClientTransientException("Transient error occurred, please retry.", ex); | |||
} | |||
|
|||
await this.Reset(openCompletionBeforeOperationStarted, handlerBeforeOperationStarted).ConfigureAwait(false); | |||
if (reset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to investigate all other cases for this as it appears to be an old design-level bug: InternalClient (Device/Module) has a circular dependency through some of the callbacks.
Instead of having the state machine spread out between Error, Retry and InternalClient callbacks we should push the logic into the correct handler.
@varunpuranik we're currently investigating issues with method recovery E2E tests that have failed trying to run the CI system. /cc @abhipsaMisra |
6ff1d20
to
caacbf1
Compare
Blocking merge: @prmathur-microsoft @tameraw @yzhong94 - we will not accept new PRs until we can enable all E2E tests. @varunpuranik this change is breaking IoT Hub recovery so we cannot accept it as-is. On top of that, to avoid further regressions we first need to enable all existing E2E tests. |
@CIPop - Can you please provide some more details on what you mean by it is breaking IoT Hub recovery? What scenarios is it breaking? Do you have any logs that might help understand why? If you have any particular E2E tests that are breaking because of this, maybe I can take a look at those. By now, I am fairly familiar with the way the SDK behaves. Also, the other day I provided logs of the working scenario to @abhipsaMisra and also showed her the demo. This fix has been tested and validated by @ancaantochi to be working as well. |
@varunpuranik there are two issues here:
|
@CIPop - Alright, for # 1 - if you can tell me how to run the method recovery tests and get logs, I can try to run them and see what is going on. For # 2, let me know when the E2E tests are enabled. Do you have an ETA on that? |
Our existing E2E tests fail with your change but pass w/o it. Fill in relevant portions of https://github.com/Azure/azure-iot-sdk-csharp/tree/master/e2e/test/prerequisites, run the script to load the environment variables then start Visual Studio from the same command line. Run all of the Methods tests with the IoTHub-FaultInjection trait.
I refactored the code for E2E here: https://github.com/CIPop/azure-iot-sdk-csharp/tree/mqtttests1 While the message tests are all passing for me (only tried Windows), the Method and Twin tests fault injection are still failing intermittently. I've added even more logging but didn't get a chance to diff the running case with the failing case yet. I do suspect we're hitting something similar to the AMQP DPS synchronous completion but for IoT Hub (similar to the one you've addressed in dotNetty): #552 (proposed fix here: #619 (comment) ). This needs to be fixed in all AMQP clients (IoT Device, ServiceClient and DPS) and for all potential operations using IOCP that might complete synchronously. |
caacbf1
to
0537945
Compare
@varunpuranik adding here for completeness: we have found that using TcpView and the Method sample the current code is recovering and is able to receive more method calls. |
@varunpuranik temporarily closing this as it's been open for a while. Windows logs:
Linux logs show the exact same stopping point:
Looking at a passing run, the test that fails is Method recovery:
|
Checklist
master
branch.Description of the changes
Issue: Over AMQP, the link recovery callbacks don't retry multiple times. Because of this, if the server is not available when the first retry happens, the link goes silent. This affects receiving links - receiving messages, method invocations and twin updates. In particular, I was investigating this for the receiving messages link.
Underlying cause: The error delegating handler cleans up all connections if an exception is thrown. This causes the links to get closed entirely, and the retry gets cancelled. This works for MQTT, since it expects the connection to be closed, and re-opened in the next retry. On the other hand, for AMQP, each link is treated separately, and as such, the connection (and all its links) should not be closed if the recovery fails once. So added code to reset the connection only for MQTT.
I spent significant time trying other options and somehow "unifying" the approach for both AMQP and MQTT, but because of the way the 2 protocols are written and expected to behave, this was not possible, without a significant refactor of the codebase.
The proposed fix works for both cases.
Tests: I will work with @CIPop to add E2E tests for these scenarios.
Reference/Link to the issue solved with this PR (if any)
#571