
[BUG] System.InvalidOperationException at Azure.Messaging.EventHubs.EventProcessorClient+<OnInitializingPartitionAsync>d__60.MoveNext #12466

Closed
petrformanek opened this issue Jun 3, 2020 · 4 comments
Assignees
Labels
  • bug: This issue requires a change to an existing behavior in the product in order to be resolved.
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Event Hubs
Milestone

Comments

@petrformanek

Description of the bug
A thread concurrency issue surfaced in our logs.

Expected behavior
No exceptions thrown.

Actual behavior (include Exception or Stack Trace)

{"method":"System.ThrowHelper.ThrowInvalidOperationException_ConcurrentOperationsNotSupported","assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","level":0,"line":0}
{"method":"System.Collections.Generic.Dictionary`2.TryInsert","assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","level":1,"line":0}
{"method":"Azure.Messaging.EventHubs.EventProcessorClient+<OnInitializingPartitionAsync>d__60.MoveNext","assembly":"Azure.Messaging.EventHubs.Processor, Version=5.1.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":2,"line":0}
{"method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","level":3,"line":0}
{"method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","assembly":"System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","level":4,"line":0}
{"method":"Azure.Messaging.EventHubs.Primitives.EventProcessor`1+<TryStartProcessingPartitionAsync>d__75.MoveNext","assembly":"Azure.Messaging.EventHubs, Version=5.1.0.0, Culture=neutral, PublicKeyToken=92742159e12e44c8","level":5,"line":0}

To Reproduce
Hard to reproduce.

Environment:

  • Azure.Messaging.EventHubs.Processor v5.1.0
  • Linux Docker container running in AKS with .NET Core 3.1
@ghost ghost added the needs-triage, customer-reported, and question labels Jun 3, 2020
@jsquire jsquire self-assigned this Jun 3, 2020
@jsquire jsquire added the Client, Event Hubs, and Investigate labels Jun 3, 2020
@ghost ghost removed the needs-triage label Jun 3, 2020
@jsquire
Member

jsquire commented Jun 3, 2020

Hi @petrformanek. We're sorry that you're experiencing difficulties and appreciate you reaching out to let us know. This isn't one that we've seen before, but it looks like a race condition at first glance. The good news is that this would trigger the retry and recovery paths and should not have an impact on the overall health of the processor. In the worst case, I would expect to see a small delay in this partition being processed.

I'd definitely like to take a deeper look at this, and would appreciate any additional context or insight that you're able to provide. Some things that may be helpful:

  • Was this something that you saw regular occurrences of, in your logs?
  • How many partitions does the Event Hub being processed have?
  • How many EventProcessorClient instances are active against that Event Hub for the consumer group?
  • Do you have a handler hooked up for the PartitionInitializing event and, if so, would you be willing to share the general structure/responsibility of it?

I'd also welcome any additional insight that you think may be relevant. Again, thank you for taking the time to report this and your help in gathering information for investigation.

@jsquire jsquire added the needs-author-feedback label and removed the question label Jun 3, 2020
@petrformanek
Author

Hi @jsquire. Thank you for the quick response.

As you wrote, the processor's overall health was OK; it kept processing events on the other partitions because they had already been initialized and had been processing for some time.
When the processor tried to take ownership of partition 0, it got stuck on the InvalidOperationException and kept calling my PartitionInitializing handler every 10 seconds for about 30 minutes, until ownership of the partition was taken over by another processing instance running in AKS. So there was roughly a 30-minute delay in event processing.

Answers to your points:

  • I saw it for the first time after about 2 months of using the v5 SDK.
  • In this case the Event Hub has 8 partitions. We also have a second channel with 32 partitions.
  • A minimum of 2 instances, which can automatically scale up to 8 instances; for the second channel, up to 32 instances.
  • Yes, it is very simple; see below.
public virtual Task PartitionInitializingAsync(PartitionInitializingEventArgs arg)
{
    if (arg.CancellationToken.IsCancellationRequested)
    {
        return Task.CompletedTask;
    }
    arg.DefaultStartingPosition = EventPosition.Latest;
    _logger.LogInformation($"Initializing partition {arg.PartitionId}");

    return Task.CompletedTask;
}
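
For context, a handler like the one above is typically registered through the processor's PartitionInitializingAsync event. The following is a minimal sketch only; the connection strings, container name, and checkpointing choices are placeholders, not details from this issue.

using System;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Storage.Blobs;

class ProcessorSetupSketch
{
    static async Task Main()
    {
        // Placeholder connection details; substitute real values.
        var checkpointStore = new BlobContainerClient("<storage-connection-string>", "checkpoints");

        var processor = new EventProcessorClient(
            checkpointStore,
            EventHubConsumerClient.DefaultConsumerGroupName,
            "<event-hub-connection-string>",
            "<event-hub-name>");

        // Same logic as the handler shown above.
        processor.PartitionInitializingAsync += args =>
        {
            if (args.CancellationToken.IsCancellationRequested)
            {
                return Task.CompletedTask;
            }

            args.DefaultStartingPosition = EventPosition.Latest;
            return Task.CompletedTask;
        };

        // The processor also requires event and error handlers before starting.
        processor.ProcessEventAsync += args =>
            args.Data is null ? Task.CompletedTask : args.UpdateCheckpointAsync();

        processor.ProcessErrorAsync += args =>
        {
            Console.WriteLine(args.Exception.Message);
            return Task.CompletedTask;
        };

        await processor.StartProcessingAsync();
        await Task.Delay(TimeSpan.FromMinutes(1));
        await processor.StopProcessingAsync();
    }
}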

@ghost ghost added the needs-team-attention label and removed the needs-author-feedback label Jun 4, 2020
@jsquire jsquire removed the needs-team-attention label Jun 4, 2020
@jsquire
Member

jsquire commented Jun 4, 2020

Thanks, @petrformanek. I appreciate the additional context. We'll start investigating.

jsquire added a commit to jsquire/azure-sdk-for-net that referenced this issue Jun 23, 2020
The focus of these changes is a set of small refactorings and tweaks, mostly
unrelated to any core theme.  Included are:

- Refactoring of the processor blob storage manager, mostly changes to formatting
  to help improve readability.

- Refactoring the initialization of starting position for partitions into a concurrent
  dictionary to resolve a potential race condition (Azure#12466)

- Fixing timing on some track one tests for test-infrastructure.  (Azure#12874)

- Ignoring tests with intermittent failures (Azure#12929, Azure#12930)

- Enhancing log information for when the Event Processor starts processing a
  partition to contain the starting position used.

- Enhancing log information for the core service operations (send/receive) to
  group the start/error/end events together with an operation identifier and
  report the total number of retries used when interacting with the service.
jsquire added a commit that referenced this issue Jun 23, 2020 (same commit message as above)
@jsquire
Member

jsquire commented Jun 23, 2020

Apologies for the delay and, again, thanks for calling attention to this. We introduced a race condition by not using a concurrent set where we should have for tracking the partitions being processed. This was fixed by #12928 and will be included in our next release.
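
As a general illustration of the kind of change described in the commit message above, and not the SDK's actual code, moving per-partition bookkeeping from a plain Dictionary to a ConcurrentDictionary removes this kind of race. The type and member names below are hypothetical.

using System.Collections.Concurrent;
using Azure.Messaging.EventHubs.Consumer;

// Hypothetical helper, illustrative only: ConcurrentDictionary makes the
// per-partition starting-position bookkeeping safe when several partitions
// are initialized concurrently, which a plain Dictionary is not.
class PartitionStartingPositions
{
    private readonly ConcurrentDictionary<string, EventPosition> _positions =
        new ConcurrentDictionary<string, EventPosition>();

    public void Set(string partitionId, EventPosition position) =>
        _positions[partitionId] = position;          // thread-safe write

    public EventPosition GetOrDefault(string partitionId) =>
        _positions.TryGetValue(partitionId, out var position)
            ? position
            : EventPosition.Earliest;                // fallback when nothing was recorded

    public void Remove(string partitionId) =>
        _positions.TryRemove(partitionId, out _);
}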

I'm going to close this out as resolved. Please feel free to reopen if you believe that further discussion is needed.

@jsquire jsquire closed this as completed Jun 23, 2020
@jsquire jsquire added the bug label and removed the Investigate label Jun 23, 2020
@jsquire jsquire added this to the [2020] July milestone Jun 23, 2020
prmathur-microsoft pushed a commit that referenced this issue Jul 8, 2020 (same commit message as above)
@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2023