fix: race condition problem while update upstream.nodes #11916

Draft · wants to merge 1 commit into base: master

Conversation

@nic-6443 (Member) commented Jan 15, 2025

Description

Background

For a route configured with an upstream that uses service discovery, each request that hits the route fetches the latest nodes of the current upstream through discovery.nodes and compares them with the existing list using the compare_upstream_node function. Only if the node list has changed is new_nodes assigned to upstream.nodes. We then copy upstream using table.clone, creating a new table to replace the previous upstream.
Another function worth mentioning is fill_node_info, which fills some necessary fields in upstream.nodes.
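
To make the ordering concrete, here is a simplified sketch of this flow. It is illustrative only: compare_upstream_node and fill_node_info stand for the functions named above, while the surrounding structure, arguments, and return-value semantics are assumed rather than taken from the APISIX source.

```lua
-- Simplified sketch of the flow described above; structure and signatures
-- are assumed, not copied from the APISIX source. compare_upstream_node and
-- fill_node_info stand for the functions named in the description.
local table_clone = require("table.clone")   -- LuaJIT shallow copy

local function refresh_upstream(upstream, discovery)
    -- fetch the latest node list for this upstream from service discovery
    local new_nodes = discovery.nodes(upstream.service_name)

    -- assumed to return true when the node list is unchanged
    if compare_upstream_node(upstream.nodes, new_nodes) then
        return upstream
    end

    -- current ordering: the shared upstream table is mutated first and only
    -- afterwards shallow-copied; fill_node_info then adds required fields
    -- such as priority
    upstream.nodes = new_nodes
    local up_conf = table_clone(upstream)
    fill_node_info(up_conf.nodes)
    return up_conf
end
```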

Race Condition

Now let me describe a race condition scenario:

1. Request A gets [{"port":80,"weight":50,"host":"10.244.1.33"}] from discovery.nodes. After fill_node_info executes, it becomes {"port":80,"weight":50,"host":"10.244.1.33","priority":0}.
2. Some function call then triggers a coroutine yield (as tested, pcall triggers a yield).
3. At this point, Request B hits the same route and gets the same upstream table as Request A, but receives new nodes from discovery.nodes: [{"port":80,"weight":50,"host":"10.244.1.34"}].
4. Since the current code updates upstream.nodes before calling table.clone, when the coroutine switches back to Request A, A's upstream.nodes has already been modified to [{"port":80,"weight":50,"host":"10.244.1.34"}].
5. These nodes, which lack the priority field, cause the code to panic:

2024/09/04 14:39:59 [error] 47#47: *739315077 lua entry thread aborted: runtime error: /usr/local/apisix/apisix/balancer.lua:55: table index is nil
stack traceback:
coroutine 0:
        /usr/local/apisix/apisix/balancer.lua: in function 'transform_node'
        /usr/local/apisix/apisix/balancer.lua:80: in function 'fetch_health_nodes'
        /usr/local/apisix/apisix/balancer.lua:115: in function 'create_obj_fun'
        /usr/local/apisix/apisix/core/lrucache.lua:95: in function 'lrucache_server_picker'
        /usr/local/apisix/apisix/balancer.lua:247: in function 'pick_server'
        /usr/local/apisix/apisix/init.lua:498: in function 'handle_upstream'
        /usr/local/apisix/apisix/init.lua:680: in function 'http_access_phase'
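
For context on the error itself: judging from the trace, transform_node uses each node's priority as a table key, so a node list that never went through fill_node_info ends up indexing a table with a nil key. A minimal, self-contained illustration of that failure mode (assumed shape, not the real balancer code):

```lua
-- Minimal illustration of the failure mode (assumed shape, not the real
-- transform_node): nodes are grouped by their `priority` field, so a node
-- that never went through fill_node_info has priority == nil and the
-- grouping table is indexed with a nil key.
local function group_by_priority(nodes)
    local grouped = {}
    for _, node in ipairs(nodes) do
        grouped[node.priority] = grouped[node.priority] or {}
        grouped[node.priority][node.host .. ":" .. node.port] = node.weight
    end
    return grouped
end

-- a node list that skipped fill_node_info (no priority field):
group_by_priority({ { host = "10.244.1.34", port = 80, weight = 50 } })
-- -> runtime error: table index is nil (the same class of error as in the log above)
```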

This issue can only occur when the nodes obtained through service discovery are continuously changing (e.g., during a rolling update of a Kubernetes deployment) and the gateway is handling highly concurrent requests, so I am unable to provide a valid test case.
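
Although a deterministic test is hard to write, one possible way to close the window is sketched below: never let other in-flight requests observe a node list that has not yet gone through fill_node_info, by working on a private copy of upstream instead of mutating the shared table first. This is a hedged sketch under the same assumed helpers as above, not necessarily the change made in this commit; in particular, how and when the shared upstream.nodes is eventually updated (so that compare_upstream_node keeps working) is left out.

```lua
-- One possible ordering that avoids publishing unfilled nodes to the shared
-- upstream table (a sketch only; the real fix may differ).
local table_clone = require("table.clone")   -- LuaJIT shallow copy

local function refresh_upstream_safe(upstream, discovery)
    local new_nodes = discovery.nodes(upstream.service_name)

    if compare_upstream_node(upstream.nodes, new_nodes) then
        return upstream
    end

    -- work on a private copy, so a coroutine yield here cannot let another
    -- request observe a half-updated, unfilled node list
    local up_conf = table_clone(upstream)
    up_conf.nodes = new_nodes
    fill_node_info(up_conf.nodes)   -- priority etc. are set before any reader sees the nodes
    return up_conf
end
```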

Fixes # (issue)

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels Jan 15, 2025
@nic-6443 nic-6443 marked this pull request as draft January 15, 2025 09:46