bug: 100% CPU usage of worker process caused by healthcheck impl with error (Failed to release lock) #9775
Comments
@alptugay Which version of lua-resty-healthcheck were you running when the error occurred? Do the upstream servers change status frequently at that time, i.e. pods restarting or shutting down, new upstreams being added, or transient network errors? I suggest using lua-resty-healthcheck 2.2.3 and increasing the size of the shared dict. And what is the value of the related configuration item?
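If the suggestion is about a health-check shared dict running out of space, its usage can be logged from Lua: OpenResty shared dicts expose capacity() and free_space(). A minimal diagnostic sketch follows; the dict name "upstream-healthcheck" and the helper function are assumptions for illustration only, substitute whichever dict is actually meant here.

```lua
-- Diagnostic sketch: log how full a shared dict is.
-- "upstream-healthcheck" is an assumed dict name, not confirmed by this issue.
-- capacity() and free_space() are available on recent OpenResty releases.
local function report_dict_usage(name)
    local dict = ngx.shared[name]
    if not dict then
        ngx.log(ngx.ERR, "shared dict not found: ", name)
        return
    end
    ngx.log(ngx.WARN, name,
            ": capacity=", dict:capacity(), " bytes",
            ", free=", dict:free_space(), " bytes")
end

report_dict_usage("upstream-healthcheck")
```

If free space stays near zero, increasing the dict size (via the lua_shared_dict directive that defines it) would be the corresponding fix.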
Maybe it's the same as #9015.
Hi @leslie-tsang, their team is monitoring the status; they will update when necessary.
Due to the lack of the reporter's response this issue has been labeled with "no response". It will be closed in 3 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
We have encountered the issue once again. Our current findings are:
APISIX INSTANCE 1
APISIX INSTANCE 2
APISIX INSTANCE 3
APISIX INSTANCE 4
For example:
Upstream: 58e0a3b27292c0d9267ce78a7dfa45ab
Upstream: f51cbfa1fac84b2110ba93f947c93313
XXX.XXX.XXX.29:32337 is one of the upstreams of 58e0a3b27292c0d9267ce78a7dfa45ab
You can ignore
I think the SSL errors are related to our finding number 3.
This is a normal log (after we kill the affected workers) for the related service; it is normal that all upstreams return 502 because of our finding number 3.
But this is an unusual log; the logs related to this service during the problem look like the one below:
Please look at upstream, upstream_status, upstream_connect_time and upstream_response_time. They look incomplete; they all end with a colon character.
Hello @alptugay, the issue is being fixed; check api7/lua-resty-healthcheck@f1d81af for more information.
@alptugay have you tried this fix? api7/lua-resty-healthcheck@f1d81af
@alptugay you can try APISIX version >= 3.4, where the lua-resty-healthcheck library has been upgraded.
Fixed by api7/lua-resty-healthcheck@f1d81af
Current Behavior
Our APISIX instance has more than 2500 routes and upstreams, all of which have active health checks enabled. Sometimes (once or twice a week) we see that one or more workers use 100% of the CPU. At the same time we see the following error log:
We have encountered this situation on multiple instances and multiple times:
At the exact same time we can see that the CPU core running this worker starts to use 100% CPU.
That CPU core also sees an increase in time spent.
Sockstat usage increases as well:
After killing the worker process it returns to normal (by the way, we can't kill the process gracefully; we have to use "kill -9").
We are normally using APISIX 3.2.1, but to solve the issue we cherry-picked v3.0.0 of lua-resty-healthcheck, because the lock mechanism seems to have changed; however, that caused a massive memory leak, so we reverted.
All of our upstreams have the following timeout and healthcheck values:
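The actual values were attached to the report and are not reproduced here. Purely as a hypothetical illustration of what such a configuration looks like, this is roughly the shape of the checks table that lua-resty-healthcheck (and therefore an APISIX upstream's checks object) accepts; every number and path below is an assumption, not the reporter's configuration.

```lua
-- Hypothetical illustration only: the general shape of an active health-check
-- configuration for lua-resty-healthcheck. None of these values come from the issue.
local checks = {
    active = {
        type = "http",
        http_path = "/healthz",   -- assumed probe path
        timeout = 1,              -- probe timeout, in seconds
        healthy = {
            interval = 2,         -- seconds between probes of healthy targets
            successes = 2,        -- consecutive successes to mark a target healthy
        },
        unhealthy = {
            interval = 1,         -- seconds between probes of unhealthy targets
            http_failures = 3,
            tcp_failures = 3,
            timeouts = 3,
        },
    },
}
```

With 2500+ upstreams, each active checker probes its targets on these intervals from ngx.timer contexts, which is presumably why a misbehaving lock in that path can keep a worker busy.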
There are some other abnormalities as well; I don't know whether they are related, so I'll briefly share them too:
We see lots of upstream timeout errors even though the upstreams are healthy and running. These connections seem to be related to a K8s watch API (not sure).
We have Tengine running in parallel with APISIX; both have health checks enabled. But the connection states are very different: for example, on Tengine we see fewer connections in TimeWait and more inUse:
However, in APISIX we see more in TimeWait and fewer inUse:
When we disable health checks, it returns to a state similar to Tengine's.
Our config.yml
Expected Behavior
CPU usage should not reach 100%.
Error Logs
2023/07/04 03:15:53 [error] 161670#161670: 27856946269 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/routes/461091278096960700) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/routes/461091278096960700:target_lock:...:8500': unlocked, context: ngx.timer, client: ...*, server: 0.0.0.0:80
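For context on this message: the lock key in the log (…:target_lock:…) is the library's per-target lock, and the trailing error "unlocked" matches what lua-resty-lock's unlock() returns when the lock object is not (or no longer) holding the lock, for example because it has already been released. Below is a minimal sketch of that acquire/work/release pattern, with a hypothetical dict name and timings; this is not the actual healthcheck.lua code.

```lua
-- Minimal sketch of the lock/unlock pattern behind
-- "failed to release lock ...: unlocked". Not the actual healthcheck.lua code.
-- Assumes an nginx directive such as: lua_shared_dict my_locks 1m;
local resty_lock = require "resty.lock"

local function with_target_lock(key, fn)
    local lock, err = resty_lock:new("my_locks", { exptime = 10, timeout = 5 })
    if not lock then
        return nil, "failed to create lock: " .. err
    end

    local elapsed, lerr = lock:lock(key)
    if not elapsed then
        return nil, "failed to acquire lock '" .. key .. "': " .. lerr
    end

    local ok, res = pcall(fn)   -- the protected work, e.g. updating a target's status

    -- unlock() returns nil, "unlocked" when this lock object is not (or no
    -- longer) holding the lock; that is the error string seen in the log above.
    local released, uerr = lock:unlock()
    if not released then
        ngx.log(ngx.ERR, "failed to release lock '", key, "': ", uerr)
    end

    if not ok then
        return nil, res
    end
    return true
end
```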
Steps to Reproduce
Seems to be random, but we have 2500+ routes/services and upstreams, all with active health checks enabled.
Environment
nginx version: openresty/1.21.4.1
built with OpenSSL 1.1.1s 1 Nov 2022
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -DAPISIX_BASE_VER=1.21.4.1 -DNGX_GRPC_CLI_ENGINE_PATH=/usr/local/openresty/libgrpc_engine.so -DNGX_HTTP_GRPC_CLI_ENGINE_PATH=/usr/local/openresty/libgrpc_engine.so -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/usr/local/openresty/openssl111/include' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.33 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.09 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.21 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.9 --add-module=../ngx_stream_lua-0.0.11 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-rpath,/usr/local/openresty/wasmtime-c-api/lib -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -L/usr/local/openresty/openssl111/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib:/usr/local/openresty/openssl111/lib' --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../mod_dubbo-1.0.2 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../ngx_multi_upstream_module-1.1.1 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0/src/stream --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0/src/meta --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../wasm-nginx-module-0.6.4 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../lua-var-nginx-module-v0.5.3 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../grpc-client-nginx-module-v0.4.2 --with-poll_module --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --with-compat --with-stream --with-http_ssl_module