fix: race condition problem while update upstream.nodes #11916

Draft · wants to merge 1 commit into base: master

Conversation

@nic-6443 (Member) commented Jan 15, 2025

Description

Background

For a route configured with an upstream that uses service discovery, each request that hits the route fetches the latest nodes of the current upstream through discovery.nodes and compares them with the existing list using the compare_upstream_node function. Only if the node list has changed is new_nodes assigned to upstream.nodes. We then copy upstream using table.clone, creating a new table to replace the previous upstream.
Another function worth mentioning is fill_node_info, which fills some necessary fields in upstream.nodes.
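
To make the ordering concrete, here is a simplified sketch of this flow. It is illustrative only: compare_upstream_node and fill_node_info stand for the functions named above, while the surrounding structure, arguments, and return-value semantics are assumed rather than taken from the APISIX source.

```lua
-- Simplified sketch of the flow described above; structure and signatures
-- are assumed, not copied from the APISIX source. compare_upstream_node and
-- fill_node_info stand for the functions named in the description.
local table_clone = require("table.clone")   -- LuaJIT shallow copy

local function refresh_upstream(upstream, discovery)
    -- fetch the latest node list for this upstream from service discovery
    local new_nodes = discovery.nodes(upstream.service_name)

    -- assumed to return true when the node list is unchanged
    if compare_upstream_node(upstream.nodes, new_nodes) then
        return upstream
    end

    -- current ordering: the shared upstream table is mutated first and only
    -- afterwards shallow-copied; fill_node_info then adds required fields
    -- such as priority
    upstream.nodes = new_nodes
    local up_conf = table_clone(upstream)
    fill_node_info(up_conf.nodes)
    return up_conf
end
```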

Race Condition

Now let me describe a race condition scenario:

1. Request A gets [{"port":80,"weight":50,"host":"10.244.1.33"}] from discovery.nodes. After fill_node_info executes, it becomes {"port":80,"weight":50,"host":"10.244.1.33","priority":0}.
2. Some function call then triggers a coroutine yield (as tested, pcall triggers a yield).
3. At this point, Request B hits the same route and gets the same upstream table as Request A, but receives new nodes from discovery.nodes: [{"port":80,"weight":50,"host":"10.244.1.34"}].
4. Since the current code updates upstream.nodes before calling table.clone, when the coroutine switches back to Request A, A's upstream.nodes has already been modified to [{"port":80,"weight":50,"host":"10.244.1.34"}].
5. These nodes, which lack the priority field, cause the code to panic:

2024/09/04 14:39:59 [error] 47#47: *739315077 lua entry thread aborted: runtime error: /usr/local/apisix/apisix/balancer.lua:55: table index is nil
stack traceback:
coroutine 0:
        /usr/local/apisix/apisix/balancer.lua: in function 'transform_node'
        /usr/local/apisix/apisix/balancer.lua:80: in function 'fetch_health_nodes'
        /usr/local/apisix/apisix/balancer.lua:115: in function 'create_obj_fun'
        /usr/local/apisix/apisix/core/lrucache.lua:95: in function 'lrucache_server_picker'
        /usr/local/apisix/apisix/balancer.lua:247: in function 'pick_server'
        /usr/local/apisix/apisix/init.lua:498: in function 'handle_upstream'
        /usr/local/apisix/apisix/init.lua:680: in function 'http_access_phase'
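
For context on the error itself: judging from the trace, transform_node uses each node's priority as a table key, so a node list that never went through fill_node_info ends up indexing a table with a nil key. A minimal, self-contained illustration of that failure mode (assumed shape, not the real balancer code):

```lua
-- Minimal illustration of the failure mode (assumed shape, not the real
-- transform_node): nodes are grouped by their `priority` field, so a node
-- that never went through fill_node_info has priority == nil and the
-- grouping table is indexed with a nil key.
local function group_by_priority(nodes)
    local grouped = {}
    for _, node in ipairs(nodes) do
        grouped[node.priority] = grouped[node.priority] or {}
        grouped[node.priority][node.host .. ":" .. node.port] = node.weight
    end
    return grouped
end

-- a node list that skipped fill_node_info (no priority field):
group_by_priority({ { host = "10.244.1.34", port = 80, weight = 50 } })
-- -> runtime error: table index is nil (the same class of error as in the log above)
```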

This issue can only occur when the nodes obtained through service discovery are continuously changing (e.g., during a rolling update of a Kubernetes deployment) and the gateway is handling highly concurrent requests, so I am unable to provide a valid test case.
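
Although a deterministic test is hard to write, one possible way to close the window is sketched below: never let other in-flight requests observe a node list that has not yet gone through fill_node_info, by working on a private copy of upstream instead of mutating the shared table first. This is a hedged sketch under the same assumed helpers as above, not necessarily the change made in this commit; in particular, how and when the shared upstream.nodes is eventually updated (so that compare_upstream_node keeps working) is left out.

```lua
-- One possible ordering that avoids publishing unfilled nodes to the shared
-- upstream table (a sketch only; the real fix may differ).
local table_clone = require("table.clone")   -- LuaJIT shallow copy

local function refresh_upstream_safe(upstream, discovery)
    local new_nodes = discovery.nodes(upstream.service_name)

    if compare_upstream_node(upstream.nodes, new_nodes) then
        return upstream
    end

    -- work on a private copy, so a coroutine yield here cannot let another
    -- request observe a half-updated, unfilled node list
    local up_conf = table_clone(upstream)
    up_conf.nodes = new_nodes
    fill_node_info(up_conf.nodes)   -- priority etc. are set before any reader sees the nodes
    return up_conf
end
```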

Fixes # (issue)

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels Jan 15, 2025
@nic-6443 nic-6443 marked this pull request as draft January 15, 2025 09:46