Spring_security_filterchains "after" longTaskTimer runs forever and count increases constantly after external cancel() or random internal cancel() #2935
Comments
@jtorkkel As suggested here spring-projects/spring-framework#31365 (comment)
Thanks, I think there are two separate issues.
The random cancel is the root cause in my opinion, but it is not clear where the issue is now. External cancel cannot and should not be avoided. Both need proper cancel() handling in the observation, but the unnecessary random cancel() should be avoided, as it will result in inconsistent metrics too. I can create a Spring Security issue for 2., but it is not clear where 1. belongs. I already had a case for Spring Framework regarding 1., and it was closed after two http_server_requests related fixes.
For the moment I don't see how Reactor Netty is involved in this issue
@jtorkkel As this is related to metrics in Spring Security, I'm closing this issue. Please open an issue in the Spring Security project. If you think Reactor Netty issues random ...
I have not been able to identify where the extra random cancel() comes from. Reactor debugging is out of my skill set. I know it happens frequently, impacting metrics/observations. I can only reopen a specific cancel() issue in Spring Framework or Reactor Netty, as it is so easy to reproduce and maybe you could trace what causes it.
yes sure |
We have done some more analysis, and we suspect that the Reactor Netty ChannelOperations terminate() sometimes triggers an extra cancel(). This exception indicates that the cancel probably comes from ChannelOperations. The cancel() normally comes about 5 ms after onComplete(), which might explain why it happens more frequently on Windows.
Should I post a new issue or ...
I added nettyChannelOperation=TRACE logs and now, when the cancel is happening, I can see extra "exception" logging; the full logs contain 5 cancels.
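For reference, this kind of TRACE logging for the reactor-netty ChannelOperations logger can be enabled with the Spring Boot property logging.level.reactor.netty.channel.ChannelOperations=TRACE, or programmatically. A minimal sketch, assuming the default Spring Boot Logback setup (the class and method names here are hypothetical, not part of the example application):

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class EnableChannelOperationsTrace {

    // Raises the reactor.netty.channel.ChannelOperations logger to TRACE at runtime,
    // equivalent to logging.level.reactor.netty.channel.ChannelOperations=TRACE.
    public static void enableTrace() {
        Logger logger = (Logger) LoggerFactory.getLogger("reactor.netty.channel.ChannelOperations");
        logger.setLevel(Level.TRACE);
    }
}
```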
Yes please do so |
Current situation:
We are using Spring Boot 3.1 in preproduction and we have noticed that the new longTaskTimer metric spring_security_filterchains_active_seconds_active_count{spring_security_reached_filter_name="none",spring_security_reached_filter_section="after"} keeps increasing constantly.
In a 24h test where we made 50 million requests to the service using ~20 parallel threads with connection pools, spring_security_filterchains_active_seconds_active_count increased to 8000; after 10 days the value was still 8000 (the server was running, with no more requests).
The same issue happens when a test client like JMeter is forced to stop requests before getting the response. When repeated more than 10000 times, spring_security_filterchains_active_seconds_active_count increases to over 10000, indicating there is no maximum. Each longTaskTimer is stored in the meter registry, so this is effectively a memory leak and eventually the service will OOM or slow down.
When the connection pool renews a socket, it closes the socket gracefully after using it and opens a new one.
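To watch the suspected leak directly, the active task count of the LongTaskTimer can be read from the MeterRegistry. A minimal sketch, assuming Micrometer is on the classpath and that the meter name "spring.security.filterchains.active" is the one behind the exported spring_security_filterchains_active_seconds metric (an assumption, not confirmed here):

```java
import io.micrometer.core.instrument.LongTaskTimer;
import io.micrometer.core.instrument.MeterRegistry;

public class ActiveFilterChainProbe {

    private final MeterRegistry registry;

    public ActiveFilterChainProbe(MeterRegistry registry) {
        this.registry = registry;
    }

    // Sums the currently active tasks across all tag combinations of the
    // Spring Security filter chain LongTaskTimer; a steadily growing value
    // after traffic has stopped indicates observations that were never stopped.
    public int activeFilterChainObservations() {
        return registry.find("spring.security.filterchains.active")
                .longTaskTimers()
                .stream()
                .mapToInt(LongTaskTimer::activeTasks)
                .sum();
    }
}
```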
Our analysis indicates that
The stack trace does not clearly indicate where the cancel comes from; it originates from
but the success case also originates from the same "terminate"
In all these cases the cancel() causes the following issues
Issue status:
I am not sure where I should raise this case, and it might still be multiple different cases.
Desired situation:
If spring_security_filterchains_active_seconds_active_count is used in production, the longTaskTimer count might increase to very high values, probably into the millions.
cancel() does not impact the business logic or the actual data streams; in testing, no errors were seen in HTTP responses. It seems to only impact the "observations".
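To confirm that only the observations are affected, a simple diagnostic is to log every observation start and stop and look for starts that never get a matching stop. A minimal sketch, assuming the Micrometer Observation API (micrometer-observation 1.10+); the class name is hypothetical:

```java
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationHandler;
import io.micrometer.observation.ObservationRegistry;

public class StartStopLoggingHandler implements ObservationHandler<Observation.Context> {

    @Override
    public void onStart(Observation.Context context) {
        // logged for every observation, including the security filter chain ones
        System.out.println("observation started: " + context.getName());
    }

    @Override
    public void onStop(Observation.Context context) {
        // observations that are cancelled but never stopped will not show up here
        System.out.println("observation stopped: " + context.getName());
    }

    @Override
    public boolean supportsContext(Observation.Context context) {
        return true;
    }

    // Registration example, e.g. called from a Spring @Configuration class.
    public static void register(ObservationRegistry registry) {
        registry.observationConfig().observationHandler(new StartStopLoggingHandler());
    }
}
```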
Environment:
Repro steps:
Example application included. example.zip.
It also contains
Make a few thousand calls to the endpoint "GET /monoCancel" with a minimum of 10 threads. If there is no issue, use 50 threads. The issue does not reproduce if fewer threads are used (a hypothetical sketch of the endpoint is included after these repro steps).
It also seems that on Windows this is much more frequent than on Linux; on Windows even 5-10% of the calls might result in a cancel().
If there are too many console entries without a cancel(), use "GET /monoCancelNoLog"; then you only see the stack trace when a cancel happens.
Also set the properties
These disable the logging from the WebFilter and the extra onStart/onStop logging.
There are also a few other APIs.
Use a mock server as the external server; adding a delay allows a cancel while waiting for the response.
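The actual endpoints are in the attached example.zip; based on the TEST 6 logs below (logger categories like reactor.Mono.Just.N), a hypothetical reconstruction of what such an endpoint looks like is roughly the following, which is enough to observe the extra cancel() signals:

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
public class CancelReproController {

    @GetMapping("/monoCancel")
    public Mono<String> monoCancel() {
        // .log() prints onSubscribe/request/onNext/onComplete/cancel for each request
        return Mono.just("test").log();
    }

    @GetMapping("/monoCancelNoLog")
    public Mono<String> monoCancelNoLog() {
        // same endpoint without reactive signal logging, to keep the console readable
        return Mono.just("test");
    }
}
```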
TEST 1:
Expected: request with onComplete() ONLY, when using the external WebClient
TEST 2:
With an additional cancel(), it can also be
TEST 3: I cannot reproduce this with the example application, but it reproduces easily with the real service.
with cancel() only
TEST 4: external cancel(), the call is canceled during a slow external WebClient call
TEST 5: active count reaches 135.0
calling GET /monoCancelNoLog
Send a large number of parallel requests, for example 5000 requests (10 threads, 500 requests each), sent in about 3 s.
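A minimal load-generator sketch, roughly equivalent to the JMeter setup described above (10 threads, 500 requests each); the base URL http://localhost:8080 is an assumption about where the example application listens:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MonoCancelLoadTest {

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8080/monoCancelNoLog")).GET().build();

        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int t = 0; t < 10; t++) {
            pool.submit(() -> {
                for (int i = 0; i < 500; i++) {
                    try {
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                    } catch (Exception e) {
                        // connection resets etc. are expected when cancels happen
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```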
The logs show 136 exceptions due to cancel().
The metrics show:
The extra 26 are due to (#31365), fixed in #31417 (onComplete() + cancel()), and the related 26 status="UNKNOWN" were fixed in #31388; the remaining 116 are just cancel() without onComplete().
These 136 cancel() calls will, in addition to the already fixed http_server_requests issue, result in:
Detailed metrics:
custom_channel_invocations_total{source="None",uri="/monoCancelNoLog",} 5000.0
reactor_netty_http_server_data_sent_bytes_count{uri="/",} 5000.0
reactor_netty_http_server_data_received_time_seconds_count{method="GET",uri="/",} 5000.0
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/monoCancelNoLog",} 4836.0
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/monoCancelNoLog",} 190.0
spring_security_filterchains_seconds_count{error="none",spring_security_filterchain_position="13",spring_security_filterchain_size="13",spring_security_reached_filter_name="ServerWebExchangeReactorContextWebFilter",spring_security_reached_filter_section="after",} 4853.0
spring_security_filterchains_active_seconds_active_count{spring_security_filterchain_position="0",spring_security_filterchain_size="0",spring_security_reached_filter_name="none",spring_security_reached_filter_section="after",} 135.0
spring_security_filterchains_active_seconds_duration_sum{spring_security_filterchain_position="0",spring_security_filterchain_size="0",spring_security_reached_filter_name="none",spring_security_reached_filter_section="after",} 7701.453904792
TEST 6: active count 6.0
calling GET /monoCancel
1000 requests, 10 threads, 100 each
2023-10-16T14:33:24.842+03:00 INFO 6956 --- [ parallel-5] reactor.Mono.Just.1 : | onSubscribe([Synchronous Fuseable] Operators.ScalarSubscription)
2023-10-16T14:33:24.846+03:00 INFO 6956 --- [ parallel-5] reactor.Mono.Just.1 : | request(unbounded)
2023-10-16T14:33:24.847+03:00 INFO 6956 --- [ parallel-1] reactor.Mono.Just.1 : | onNext(test)
2023-10-16T14:33:24.855+03:00 INFO 6956 --- [ parallel-1] reactor.Mono.Just.1 : | onComplete()
2023-10-16T14:33:25.058+03:00 INFO 6956 --- [ parallel-11] reactor.Mono.Just.262 : | onSubscribe([Synchronous Fuseable] Operators.ScalarSubscription)
2023-10-16T14:33:25.058+03:00 INFO 6956 --- [ parallel-11] reactor.Mono.Just.262 : | request(unbounded)
2023-10-16T14:33:25.058+03:00 INFO 6956 --- [ parallel-11] reactor.Mono.Just.262 : | onNext(test)
2023-10-16T14:33:25.058+03:00 INFO 6956 --- [ parallel-11] reactor.Mono.Just.262 : | onComplete()
2023-10-16T14:33:25.064+03:00 INFO 6956 --- [ctor-http-nio-6] reactor.Mono.Just.262 : | cancel()
2023-10-16T14:33:25.110+03:00 INFO 6956 --- [ parallel-8] reactor.Mono.Just.352 : | onSubscribe([Synchronous Fuseable] Operators.ScalarSubscription)
2023-10-16T14:33:25.110+03:00 INFO 6956 --- [ parallel-8] reactor.Mono.Just.352 : | request(unbounded)
2023-10-16T14:33:25.110+03:00 INFO 6956 --- [ parallel-8] reactor.Mono.Just.352 : | onNext(test)
2023-10-16T14:33:25.110+03:00 INFO 6956 --- [ parallel-8] reactor.Mono.Just.352 : | onComplete()
2023-10-16T14:33:25.115+03:00 INFO 6956 --- [tor-http-nio-10] reactor.Mono.Just.352 : | cancel()
2023-10-16T14:33:25.138+03:00 INFO 6956 --- [ parallel-9] reactor.Mono.Just.416 : | onSubscribe([Synchronous Fuseable] Operators.ScalarSubscription)
2023-10-16T14:33:25.138+03:00 INFO 6956 --- [ parallel-9] reactor.Mono.Just.416 : | request(unbounded)
2023-10-16T14:33:25.138+03:00 INFO 6956 --- [ parallel-9] reactor.Mono.Just.416 : | onNext(test)
2023-10-16T14:33:25.138+03:00 INFO 6956 --- [ parallel-9] reactor.Mono.Just.416 : | onComplete()
2023-10-16T14:33:25.142+03:00 INFO 6956 --- [ctor-http-nio-4] reactor.Mono.Just.416 : | cancel()
2023-10-16T14:33:25.202+03:00 INFO 6956 --- [ parallel-5] reactor.Mono.Just.574 : | onSubscribe([Synchronous Fuseable] Operators.ScalarSubscription)
2023-10-16T14:33:25.203+03:00 INFO 6956 --- [ parallel-5] reactor.Mono.Just.574 : | request(unbounded)
2023-10-16T14:33:25.203+03:00 INFO 6956 --- [ parallel-5] reactor.Mono.Just.574 : | onNext(test)
2023-10-16T14:33:25.203+03:00 INFO 6956 --- [ parallel-5] reactor.Mono.Just.574 : | onComplete()
2023-10-16T14:33:25.205+03:00 INFO 6956 --- [ctor-http-nio-4] reactor.Mono.Just.574 : | cancel()
2023-10-16T14:33:25.250+03:00 INFO 6956 --- [ parallel-1] reactor.Mono.Just.695 : | onSubscribe([Synchronous Fuseable] Operators.ScalarSubscription)
2023-10-16T14:33:25.251+03:00 INFO 6956 --- [ parallel-1] reactor.Mono.Just.695 : | request(unbounded)
2023-10-16T14:33:25.251+03:00 INFO 6956 --- [ parallel-1] reactor.Mono.Just.695 : | onNext(test)
2023-10-16T14:33:25.251+03:00 INFO 6956 --- [ parallel-1] reactor.Mono.Just.695 : | onComplete()
2023-10-16T14:33:25.260+03:00 INFO 6956 --- [tor-http-nio-12] reactor.Mono.Just.695 : | cancel()
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="/monoCancel",} 989.0
http_server_requests_seconds_count{error="none",exception="none",method="GET",outcome="UNKNOWN",status="UNKNOWN",uri="/monoCancel",} 18.0
reactor_netty_http_server_response_time_seconds_count{method="GET",status="200",uri="/",} 1000.0
reactor_netty_http_server_data_sent_time_seconds_count{method="GET",status="200",uri="/",} 1000.0
custom_channel_invocations_total{source="None",uri="/monoCancel",} 1000.0
spring_security_filterchains_seconds_count{error="none",spring_security_filterchain_position="13",spring_security_filterchain_size="13",spring_security_reached_filter_name="ServerWebExchangeReactorContextWebFilter",spring_security_reached_filter_section="after",} 993.0
spring_security_filterchains_active_seconds_active_count{spring_security_filterchain_position="0",spring_security_filterchain_size="0",spring_security_reached_filter_name="none",spring_security_reached_filter_section="after",} 6.0
spring_security_filterchains_active_seconds_duration_sum{spring_security_filterchain_position="0",spring_security_filterchain_size="0",spring_security_reached_filter_name="none",spring_security_reached_filter_section="after",} 2885.6595331