fix: race condition problem while updating upstream.nodes #11916
Description
Background
For a route configured with an upstream that uses service discovery, when a request hits the route it fetches the latest nodes of the current upstream through `discovery.nodes` and compares them with the `compare_upstream_node` function. Only if the node list has changed is `new_nodes` assigned to `upstream.nodes`. The `upstream` table is then copied with `table.clone`, creating a new table that replaces the previous upstream.

Another function worth mentioning is `fill_node_info`, which fills in some required fields on `upstream.nodes`.
Race Condition
Now let me describe a race condition scenario:
Request A gets `[{"port":80,"weight":50,"host":"10.244.1.33"}]` from `discovery.nodes`. After `fill_node_info` is executed, it becomes `{"port":80,"weight":50,"host":"10.244.1.33","priority":0}`. Then some function call triggers a coroutine yield (as tested, `pcall` triggers a yield).

At this point, Request B hits the same route and gets the very same `upstream` table as Request A, but fetches new nodes, `[{"port":80,"weight":50,"host":"10.244.1.34"}]`, from `discovery.nodes`.

Since the current code updates `upstream.nodes` before calling `table.clone`, when the coroutine switches back to Request A, A's `upstream.nodes` has already been replaced with `[{"port":80,"weight":50,"host":"10.244.1.34"}]`. These nodes, which lack the `priority` field, cause the code to panic.
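For illustration only, the sketch below reorders the update from the earlier snippet so that the shared `upstream` table is never mutated: the clone is taken first, and the new nodes are attached only to the clone after `fill_node_info` has run. It shows the idea behind avoiding the race under the same simplified helpers as above; it is not the literal patch in this PR.

```lua
-- Sketch of the reordered update (reusing fill_node_info/shallow_clone
-- from the previous snippet): concurrent requests keep seeing the old,
-- fully filled `upstream.nodes` until the clone replaces the upstream.
local function update_upstream_nodes_fixed(upstream, new_nodes)
    local new_upstream = shallow_clone(upstream)  -- private copy first
    fill_node_info(new_nodes)                     -- fill fields such as `priority`
    new_upstream.nodes = new_nodes                -- only the clone is mutated
    return new_upstream
end
```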
This issue can only occur when the nodes obtained through service discovery are continuously changing (e.g., during a rolling update of a Kubernetes deployment) and the gateway is handling highly concurrent requests. Because of this, I am unable to provide a valid test case.
Fixes # (issue)
Checklist