bug: Etcd jwt refresh error can cause gateway cluster crash. #2899
Comments
What is the error log on the APISIX side? We can sleep longer if refreshing the JWT auth fails. |
APISIX renders this error. I think it is because etcd is too busy to handle the requests.
This is very weird: the auth is right, so what triggers the JWT refresh to go crazy? |
I read the etcd doc, ref: https://etcd.io/docs/v3.4.0/learning/design-auth-v3/
JWT is stateless and can authenticate every request, so this is not a problem. |
I don't think there is a dead loop. If a dead loop existed, the request interval wouldn't be tens of ms; it would be much shorter. Can you provide a standalone example to prove your suspicion? |
@gy09535 |
With one user and 3 APISIX instances, this issue can sometimes be observed. I am not sure what triggers it, since there is no obvious error in the log file. I think the trigger is some API being called too frequently, such as the watcher, so changing the auth logic can avoid it. I am trying to prove it. |
@gy09535 |
4 |
Some interesting results:
Some strange results:
|
Did you find the caller that is sending the requests so frequently? In any case, we should control the auth failure handler: we cannot allow requests to re-authenticate again and again, and we must protect etcd from receiving too many requests. Some special condition causes it, not every time. When it happened, I restarted the gateway; after that the auth was normal and the etcd requests returned to normal. |
As I said, I can't reproduce it.
You need to provide the backtrace. |
Yeah, I am trying to find it. |
I tried using a bad user and found that too many auth requests can drive etcd CPU to 100%. APISIX renders too many auth errors in config_etcd.lua, ref: https://github.com/apache/apisix/blob/master/apisix/core/config_etcd.lua#L509 (Screenshots: etcd CPU at 100%, etcd error log, APISIX error log.) @spacewander You can use my case to reproduce it. |
@gy09535 |
etcd CPU at 100% and ~30 requests per second don't look strange to me. They are expected. As I said yesterday: "the number of requests is bigger than what 4 workers can create (about 2.5x I guess). I can't reproduce it on my side. We need to log the backtrace when the request is fired." ~30 requests per second is lower than the expected peak (~40) that 4 workers can create. But the captured packets show that when the issue occurs, the peak can reach ~100 requests per second. This is what feels strange to me. We have already wasted too much time on red herrings. |
Anyway, we can share the etcd client, to reduce the auth requests. Although there may still be |
@spacewander |
I was misled by the error log screenshots. The two error logs cover different time intervals: the etcd one spans one second, and the Nginx error log spans 40 seconds. So you actually have 30 requests in 40 seconds for 1 worker, which is not strange. |
Hahaha, ignore it, this code is OK. I will try to find out why etcd receives 40 auth failures in a second. |
Anyway, I just submitted an in-progress PR: #2932. The PR can reduce the auth requests significantly. Some details and tests are not finished yet. |
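To illustrate why sharing one client can reduce auth requests, here is a minimal sketch in Python. It is not the code from the PR or from lua-resty-etcd. The class `SharedTokenCache`, the `authenticate` callback, and the TTL value are all hypothetical names chosen for this example. The idea is only that callers reuse one cached token instead of each authenticating on its own.

```python
import threading
import time
from typing import Callable, Optional


class SharedTokenCache:
    """Caches one auth token so that many callers share a single auth request.

    Hypothetical helper for illustration; not part of APISIX or lua-resty-etcd.
    """

    def __init__(self, authenticate: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._authenticate = authenticate  # assumed: performs the real auth call
        self._ttl = ttl_seconds            # assumed token lifetime
        self._clock = clock
        self._lock = threading.Lock()
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, authenticating only if it is missing or expired."""
        with self._lock:
            now = self._clock()
            if self._token is None or now >= self._expires_at:
                self._token = self._authenticate()
                self._expires_at = now + self._ttl
            return self._token

    def invalidate(self) -> None:
        """Drop the token, e.g. after the server rejects it."""
        with self._lock:
            self._token = None
```

With this shape, 16 sync jobs that call `get()` within the token lifetime cause one auth request instead of 16.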
For every auth fetch, there are 32 fetches from this code.
I think this "ok" var will always be true, so when auth fails it cannot sleep 3s.
It will sleep 0.5s from this code when auth fails.
There are 16 batch jobs in one worker, which confuses me; it should be 8.
For one worker, that is 16*2=32 requests (max) in one second when auth fails, which I think is a problem. I think we should remove this code; we can accept that the config is not synced immediately.
And I think we should sleep longer when the fetch fails. |
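The arithmetic above, and the suggestion to sleep longer on errors, can be sketched as follows. This is a Python illustration, not APISIX code. It reads the 16*2 as 16 jobs each retrying once every 0.5s. The doubling policy, the 32s cap, and the function names are assumptions made for this example.

```python
def backoff_delay(failures: int, base: float = 0.5, cap: float = 32.0) -> float:
    """Seconds to sleep after `failures` consecutive failed fetches.

    base=0.5 matches the 0.5s sleep mentioned in the thread; the doubling
    and the cap are assumptions for this sketch.
    """
    if failures <= 0:
        return 0.0
    return min(cap, base * (2 ** (failures - 1)))


def peak_auth_rate(jobs: int, delay: float) -> float:
    """Upper bound on requests per second if each job retries once per `delay` seconds."""
    return jobs / delay


# Fixed 0.5s sleep: 16 jobs produce up to 32 requests per second per worker.
fixed_rate = peak_auth_rate(16, 0.5)

# After 4 consecutive failures the delay has grown to 4s, so the bound drops to 4 per second.
backed_off_rate = peak_auth_rate(16, backoff_delay(4))
```

A growing delay keeps the first retry fast but bounds the load on etcd when the auth keeps failing.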
After I changed the sync code, the etcd auth failures became normal, and the etcd CPU became normal. I think this issue should be fixed in two directions. After I changed the sync code, with a wrong auth, etcd renders: |
It improves the behavior under apache#2899.
The logs come from two different processes: one is a worker and the other is the privileged agent. You can confirm this via the different pids. I have submitted #2977, which is based on your PR. |
Got it, thanks for your professional responses. |
Issue description
Today I found some auth errors from etcd. I connected to etcd to investigate and found that a large number of auth requests were causing etcd API timeouts. These are my etcd logs:
I captured packets on the etcd server and found these packets:
After checking the APISIX code, I found that this code can cause a dead loop of auth requests when the JWT refresh sometimes fails.
ref: https://github.com/api7/lua-resty-etcd/blob/master/lib/resty/etcd/v3.lua#L221
Environment
apisix version:
Minimal test code / Steps to reproduce the issue
What's the actual result? (including assertion message & call stack if applicable)
What's the expected result?