Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Amqp link recovery fix #611

Closed
wants to merge 1 commit into from

Conversation

varunpuranik
Copy link
Contributor

@varunpuranik varunpuranik commented Sep 5, 2018

Checklist

Description of the changes

Issue: Over AMQP, the link recovery callbacks don't retry multiple times. Because of this, if the server is not available when the first retry happens, the link goes silent. This affects receiving links - receiving messages, method invocations and twin updates. In particular, I was investigating this for the receiving messages link.

Underlying cause: The error delegating handler cleans up all connections if an exception is thrown. This causes the links to get closed entirely, and the retry gets cancelled. This works for MQTT, since it expects the connection to be closed, and re-opened in the next retry. On the other hand, for AMQP, each link is treated separately, and as such, the connection (and all its links) should not be closed if the recovery fails once. So added code to reset the connection only for MQTT.
I spent significant time trying other options and somehow "unifying" the approach for both AMQP and MQTT, but because of the way the 2 protocols are written and expected to behave, this was not possible, without a significant refactor of the codebase.
The proposed fix works for both cases.

Tests: I will work with @CIPop to add E2E tests for these scenarios.

Reference/Link to the issue solved with this PR (if any)

#571

Copy link
Member

@CIPop CIPop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @varunpuranik !

@CIPop
Copy link
Member

CIPop commented Sep 6, 2018

/cc @tameraw @myagley @abhipsaMisra

While working on tests we found a gap in our fault injection code: it will not actually tear-down the AMQP connection (just links). We need to implement a different technique for fault injection that tests this scenario.

The reason this change doesn't have any automated tests associated at the moment: My proposal is to use network-driver level tools such as clumsy/WinDivert to ensure connections torn down at TCP layer get recovered for all protocols. This is planned together with long-running tests (this fault only happens if the remote endpoint is down for two retries).

@varunpuranik manually tested this for both AMQP and MQTT.

@CIPop CIPop requested a review from abhipsaMisra September 6, 2018 22:44
@@ -219,7 +220,11 @@ async Task<T> ExecuteWithErrorHandlingAsync<T>(Func<Task<T>> asyncOperation, boo
throw new IotHubClientTransientException("Transient error occurred, please retry.", ex);
}

await this.Reset(openCompletionBeforeOperationStarted, handlerBeforeOperationStarted).ConfigureAwait(false);
if (reset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @tameraw @abhipsaMisra

We need to investigate all other cases for this as it appears to be an old design-level bug: InternalClient (Device/Module) has a circular dependency through some of the callbacks.
Instead of having the state machine spread out between Error, Retry and InternalClient callbacks we should push the logic into the correct handler.

@CIPop
Copy link
Member

CIPop commented Sep 8, 2018

@varunpuranik we're currently investigating issues with method recovery E2E tests that have failed trying to run the CI system. /cc @abhipsaMisra

@CIPop CIPop added the do-not-merge PR is not ready to be merged in. label Sep 18, 2018
@CIPop CIPop self-assigned this Sep 18, 2018
@CIPop
Copy link
Member

CIPop commented Sep 18, 2018

Blocking merge: @prmathur-microsoft @tameraw @yzhong94 - we will not accept new PRs until we can enable all E2E tests.

@varunpuranik this change is breaking IoT Hub recovery so we cannot accept it as-is. On top of that, to avoid further regressions we first need to enable all existing E2E tests.

@varunpuranik
Copy link
Contributor Author

@CIPop - Can you please provide some more details on what you mean by it is breaking IoT Hub recovery? What scenarios is it breaking? Do you have any logs that might help understand why? If you have any particular E2E tests that are breaking because of this, maybe I can take a look at those. By now, I am fairly familiar with the way the SDK behaves.

Also, the other day I provided logs of the working scenario to @abhipsaMisra and also showed her the demo. This fix has been tested and validated by @ancaantochi to be working as well.

@CIPop
Copy link
Member

CIPop commented Sep 19, 2018

@varunpuranik there are two issues here:

  1. When we tried to merge your change through CI, method recovery tests failed. I don't have the logs anymore but we could try to run tests again.
  2. I found a lot of issues with the E2E tests and @prmathur-microsoft , @tameraw and I decided that we will block PRs until we get all tests enabled to avoid any new regressions.

@varunpuranik
Copy link
Contributor Author

@CIPop - Alright, for # 1 - if you can tell me how to run the method recovery tests and get logs, I can try to run them and see what is going on.

For # 2, let me know when the E2E tests are enabled. Do you have an ETA on that?

@CIPop
Copy link
Member

CIPop commented Sep 19, 2018

if you can tell me how to run the method recovery tests and get logs, I can try to run them and see what is going on.

Our existing E2E tests fail with your change but pass w/o it. Fill in relevant portions of https://github.com/Azure/azure-iot-sdk-csharp/tree/master/e2e/test/prerequisites, run the script to load the environment variables then start Visual Studio from the same command line. Run all of the Methods tests with the IoTHub-FaultInjection trait.

For # 2, let me know when the E2E tests are enabled. Do you have an ETA on that?

I refactored the code for E2E here: https://github.com/CIPop/azure-iot-sdk-csharp/tree/mqtttests1

While the message tests are all passing for me (only tried Windows), the Method and Twin tests fault injection are still failing intermittently.

I've added even more logging but didn't get a chance to diff the running case with the failing case yet.

I do suspect we're hitting something similar to the AMQP DPS synchronous completion but for IoT Hub (similar to the one you've addressed in dotNetty): #552 (proposed fix here: #619 (comment) ). This needs to be fixed in all AMQP clients (IoT Device, ServiceClient and DPS) and for all potential operations using IOCP that might complete synchronously.

@CIPop
Copy link
Member

CIPop commented Oct 23, 2018

@varunpuranik adding here for completeness: we have found that using TcpView and the Method sample the current code is recovering and is able to receive more method calls.
If we use the change we're seeing that methods don't recovery anymore.

@CIPop
Copy link
Member

CIPop commented Oct 23, 2018

@varunpuranik temporarily closing this as it's been open for a while.
The current master tests did not pass in our gate: the tests ran for 2hrs on both Windows and Linux then VSTS cancelled the run.

Windows logs:

2018-10-22T22:48:02.6351184Z Test run for D:\a\1\s\e2e\test\bin\Release\netcoreapp2.1\Microsoft.Azure.Devices.E2ETests.dll(.NETCoreApp,Version=v2.1)
[...]
2018-10-22T23:03:25.3502170Z Passed   Message_QuotaExceededRecovery_Http
2018-10-22T23:04:37.4356436Z Passed   Message_AuthenticationRecovery_Amqp
2018-10-22T23:04:49.1407201Z Passed   Message_AuthenticationRecovery_AmqpWs
2018-10-22T23:05:01.3844043Z Passed   Message_AuthenticationRecovery_Http
2018-10-22T23:05:12.8635000Z Passed   Message_GracefulShutdownSendRecovery_Amqp
2018-10-22T23:05:24.5573442Z Passed   Message_GracefulShutdownSendRecovery_AmqpWs
2018-10-22T23:06:38.1487101Z Passed   Message_GracefulShutdownSendRecovery_Mqtt
2018-10-22T23:07:55.8802798Z Passed   Message_GracefulShutdownSendRecovery_MqttWs
2018-10-22T23:07:57.3927158Z Passed   Message_TimeOutReachedResponse
2018-10-22T23:07:57.3927562Z Passed   Message_NoTimeoutPassed
2018-10-22T23:07:58.9345009Z Passed   X509_DeviceSendSingleMessage_Amqp
2018-10-22T23:08:01.3204310Z Passed   X509_DeviceSendSingleMessage_AmqpWs
2018-10-22T23:08:02.8889075Z Passed   X509_DeviceSendSingleMessage_Mqtt
2018-10-22T23:08:02.8890097Z Passed   X509_DeviceSendSingleMessage_MqttWs
2018-10-22T23:08:04.4146759Z Passed   X509_DeviceSendSingleMessage_Http
2018-10-22T23:08:07.6682405Z Passed   Method_DeviceReceivesMethodAndResponse_Mqtt
2018-10-22T23:08:09.2447829Z Passed   Method_DeviceReceivesMethodAndResponse_MqttWs
2018-10-22T23:08:09.2448367Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_Mqtt
2018-10-22T23:08:10.8500953Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_MqttWs
2018-10-22T23:08:10.8501476Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_Mqtt
2018-10-22T23:08:10.8502441Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_MqttWs
2018-10-22T23:08:12.4121346Z Passed   Method_DeviceReceivesMethodAndResponse_Amqp
2018-10-22T23:08:12.4121903Z Passed   Method_DeviceReceivesMethodAndResponse_AmqpWs
2018-10-22T23:08:13.9799461Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_Amqp
2018-10-22T23:08:13.9799946Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_AmqpWs
2018-10-22T23:08:13.9801059Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_Amqp
2018-10-22T23:08:15.5297865Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_AmqpWs
2018-10-22T23:08:15.5302908Z Skipped  Method_DeviceReceivesMethodAndResponseRecovery_Mqtt
2018-10-22T23:08:15.5303374Z Skipped  Method_DeviceReceivesMethodAndResponseRecovery_MqttWs
2018-10-22T23:08:15.5303779Z Skipped  Method_DeviceMethodGracefulShutdownRecovery_Mqtt
2018-10-22T23:08:15.5304655Z Skipped  Method_DeviceMethodGracefulShutdownRecovery_MqttWs
2018-10-22T23:08:15.5305780Z Skipped  Method_DeviceMethodTcpConnRecovery_Amqp
2018-10-23T00:37:00.6349559Z Attempting to cancel the build...
2018-10-23T00:37:00.6355471Z Attempting to cancel the build...
2018-10-23T00:37:00.6665804Z      1>Done Building Project "D:\a\1\s\e2e\test\E2ETests.csproj" (VSTest target(s)) -- FAILED.
2018-10-23T00:37:00.6890530Z 
2018-10-23T00:37:00.6890808Z Build FAILED.
2018-10-23T00:37:00.6890945Z     0 Warning(s)

Linux logs show the exact same stopping point:

2018-10-22T22:58:35.8720782Z Passed   X509_DeviceSendSingleMessage_Amqp
2018-10-22T22:58:37.3978270Z Passed   X509_DeviceSendSingleMessage_AmqpWs
2018-10-22T22:58:38.9364626Z Passed   X509_DeviceSendSingleMessage_Mqtt
2018-10-22T22:58:40.4683807Z Passed   X509_DeviceSendSingleMessage_MqttWs
2018-10-22T22:58:41.9900253Z Passed   X509_DeviceSendSingleMessage_Http
2018-10-22T22:58:47.4524557Z Passed   Method_DeviceReceivesMethodAndResponse_Mqtt
2018-10-22T22:58:48.9829057Z Passed   Method_DeviceReceivesMethodAndResponse_MqttWs
2018-10-22T22:58:48.9830065Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_Mqtt
2018-10-22T22:58:50.5234349Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_MqttWs
2018-10-22T22:58:50.5235894Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_Mqtt
2018-10-22T22:58:52.0630886Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_MqttWs
2018-10-22T22:58:53.6234037Z Passed   Method_DeviceReceivesMethodAndResponse_Amqp
2018-10-22T22:58:53.6235845Z Passed   Method_DeviceReceivesMethodAndResponse_AmqpWs
2018-10-22T22:58:53.6236364Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_Amqp
2018-10-22T22:58:55.2393732Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_AmqpWs
2018-10-22T22:58:55.2395584Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_Amqp
2018-10-22T22:58:55.2397335Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_AmqpWs
2018-10-22T22:58:55.2399042Z Skipped  Method_DeviceReceivesMethodAndResponseRecovery_Mqtt
2018-10-22T22:58:55.2399375Z Skipped  Method_DeviceReceivesMethodAndResponseRecovery_MqttWs
2018-10-22T22:58:55.2399624Z Skipped  Method_DeviceMethodGracefulShutdownRecovery_Mqtt
2018-10-22T22:58:55.2399895Z Skipped  Method_DeviceMethodGracefulShutdownRecovery_MqttWs
2018-10-22T22:58:55.2400114Z Skipped  Method_DeviceMethodTcpConnRecovery_Amqp
2018-10-23T00:36:59.2505853Z ##[error]The operation was canceled.
2018-10-23T00:36:59.2534659Z ##[section]Finishing: build

Looking at a passing run, the test that fails is Method recovery:

2018-10-21T07:26:36.0214253Z Passed   Method_DeviceReceivesMethodAndResponse_Amqp
2018-10-21T07:26:36.0215015Z Passed   Method_DeviceReceivesMethodAndResponse_AmqpWs
2018-10-21T07:26:36.0215329Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_Amqp
2018-10-21T07:26:37.5739387Z Passed   Method_DeviceReceivesMethodAndResponseWithObseletedSetMethodHandler_AmqpWs
2018-10-21T07:26:37.5739837Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_Amqp
2018-10-21T07:26:39.1250296Z Passed   Method_DeviceReceivesMethodAndResponseWithDefaultMethodHandler_AmqpWs
2018-10-21T07:26:39.1254512Z Skipped  Method_DeviceReceivesMethodAndResponseRecovery_Mqtt
2018-10-21T07:26:39.1255653Z Skipped  Method_DeviceReceivesMethodAndResponseRecovery_MqttWs
2018-10-21T07:26:39.1256997Z Skipped  Method_DeviceMethodGracefulShutdownRecovery_Mqtt
2018-10-21T07:26:39.1257440Z Skipped  Method_DeviceMethodGracefulShutdownRecovery_MqttWs
2018-10-21T07:26:39.1258077Z Skipped  Method_DeviceMethodTcpConnRecovery_Amqp
2018-10-21T07:26:42.1018071Z Passed   Method_DeviceMethodTcpConnRecovery_AmqpWs              <------------------------------------ passing (w/o your changes)
2018-10-21T07:26:42.1092850Z Skipped  Method_DeviceMethodAmqpConnLostRecovery_Amqp
2018-10-21T07:26:42.1093693Z Skipped  Method_DeviceMethodAmqpConnLostRecovery_AmqpWs
2018-10-21T07:26:42.1094044Z Skipped  Method_DeviceMethodSessionLostRecovery_Amqp
2018-10-21T07:26:42.1094650Z Skipped  Method_DeviceMethodSessionLostRecovery_AmqpWs
2018-10-21T07:26:42.1094935Z Skipped  Method_DeviceMethodReqLinkDropRecovery_Amqp
2018-10-21T07:26:42.1095201Z Skipped  Method_DeviceMethodReqLinkDropRecovery_AmqpWs
2018-10-21T07:26:42.1095525Z Skipped  Method_DeviceMethodRespLinkDropRecovery_Amqp
2018-10-21T07:26:42.1095856Z Skipped  Method_DeviceMethodRespLinkDropRecovery_AmqpWs
2018-10-21T07:26:42.1096128Z Skipped  Method_DeviceMethodGracefulShutdownRecovery_Amqp
2018-10-21T07:26:43.6170106Z Skipped  Method_DeviceMethodGracefulShutdownRecovery_AmqpWs

@CIPop CIPop closed this Oct 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge PR is not ready to be merged in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants