Fixing a race condition in EnrichCoordinatorProxyAction that can leave an item stuck in its queue #90688

masseyke
Member

@masseyke masseyke commented Oct 5, 2022

There is a race condition in EnrichCoordinatorProxyAction that can result in an item being stuck in its queue even once all threads related to any schedule() calls have completed. The item will be flushed out on the next call to schedule() but there is no guarantee if or when that will happen. This PR adds an additional check for orphaned items in the queue.

Here's what I believe is happening (I can only reproduce it in fewer than 1 in 10,000 tries so I don't have direct evidence):

  • Say thread #1 calls schedule(), and then coordinateLookups(). It drains the whole queue, comes through the while loop a second time, finds nothing there, and is about to call remoteRequestPermits.release().
  • While that happens, thread #2 is in schedule() and calls offer() on the queue.
  • Thread #1 still holds the remoteRequestPermits permit and has already decided there is nothing in the queue.
  • Thread #2 now comes into coordinateLookups() and gets false from remoteRequestPermits.tryAcquire() because thread #1 still holds the permit.
  • Thread #2 exits from coordinateLookups() and thread #1 exits from coordinateLookups(), but there is still one item in the queue.
  • That item is going to stick around until someone schedules something else.

(Note that there are actually more threads than just the 2 I mention since coordinateLookups() makes an async call back to itself)
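
The interleaving above, and the extra check this PR describes, can be sketched roughly as follows. This is a minimal illustration and not the actual EnrichCoordinatorProxyAction code: the class name LookupCoordinatorSketch, the String items, the processed counter, and the synchronous batch handling are all assumptions made for the example. Only schedule(), coordinateLookups(), remoteRequestPermits, tryAcquire(), release(), and the offer to the queue are taken from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the coordinator; not the Elasticsearch class.
final class LookupCoordinatorSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Semaphore remoteRequestPermits;
    private final int maxLookupsPerBatch; // assumed batch-size limit
    final AtomicInteger processed = new AtomicInteger();

    LookupCoordinatorSketch(int maxConcurrentBatches, int maxLookupsPerBatch) {
        this.remoteRequestPermits = new Semaphore(maxConcurrentBatches);
        this.maxLookupsPerBatch = maxLookupsPerBatch;
    }

    // Enqueue an item, then try to get it processed.
    void schedule(String item) {
        queue.offer(item); // unbounded queue, so the offer always succeeds here
        coordinateLookups();
    }

    void coordinateLookups() {
        while (queue.isEmpty() == false && remoteRequestPermits.tryAcquire()) {
            List<String> batch = new ArrayList<>();
            if (queue.drainTo(batch, maxLookupsPerBatch) == 0) {
                // Another thread drained the queue first; give the permit back.
                remoteRequestPermits.release();
                // An item may have been offered between the drain and the release,
                // by a thread whose tryAcquire() then failed. Without this re-check
                // that item would be orphaned until the next schedule() call.
                if (queue.isEmpty() == false) {
                    continue;
                }
                return;
            }
            try {
                // The real action is asynchronous; this sketch just counts items.
                processed.addAndGet(batch.size());
            } finally {
                remoteRequestPermits.release();
            }
        }
    }

    int pending() {
        return queue.size();
    }
}
```

If another thread has already picked up the late item, the re-check costs one more pass through the loop condition and then exits, which fits the observation that the additional loops are rare and fast.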

Closes #90598

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Oct 5, 2022
@elasticsearchmachine
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke
Member Author

masseyke commented Oct 5, 2022

This PR causes a few more loops in the code, but I don't think it will be a noticeable performance hit -- the additional loops are rare and fast. I ran the test (CoordinatorTests.testAllSearchesExecuted()) 100,000 times, and this branch was hit fewer than 7,000 times. The runtime was no different than it was without the change.

@masseyke masseyke added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP and removed :Data Management/Other labels Oct 5, 2022
@masseyke
Member Author

masseyke commented Oct 5, 2022

@elasticmachine update branch

Member

@jbaiera jbaiera left a comment

LGTM, fixed a small typo is all.

…ich/action/EnrichCoordinatorProxyAction.java

Co-authored-by: James Baiera <[email protected]>
@masseyke masseyke merged commit 120da9b into elastic:main Oct 5, 2022
@masseyke masseyke deleted the fix/EnrichCoordinatorProxyAction-race-condition branch October 5, 2022 20:28
Member

@martijnvg martijnvg left a comment

LGTM

Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v8.6.0
Development

Successfully merging this pull request may close these issues.

[CI] CoordinatorTests testAllSearchesExecuted failing
5 participants