Fix getting stuck when adding producers #12202
Conversation
@merlimat @sijie @eolivelli Could you help to check this?
```java
Lock producerLock = producer.getAccessMode() == ProducerAccessMode.Shared
        ? lock.readLock() : lock.writeLock();
producerLock.lock();
```
The counter in `internalAddProducer()` is not thread-safe; it will be a problem if you use the read lock.
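A minimal sketch (not Pulsar's actual code, field and method names are hypothetical) of why this matters: a read lock admits many threads at once, so a plain counter increment plus a limit check becomes a check-then-act race.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: a plain int counter guarded by a read lock is not safe,
// because multiple readers may hold the lock at the same time.
class ProducerCounterSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int producerCount = 0;        // hypothetical counter, not thread-safe
    private final int maxProducers = 10;  // hypothetical limit

    boolean addProducerUnsafe() {
        lock.readLock().lock();
        try {
            if (producerCount >= maxProducers) {  // check ...
                return false;
            }
            producerCount++;                      // ... then act: lost updates under a read lock
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```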
Thanks for your reply! Let me take the liberty to ask: what does the counter refer to in `internalAddProducer`?
@wuzhanpeng Do you have the complete stack information? Not sure if there is a deadlock in the metadata cache. Have you tried restarting the broker to see if the problem can be resolved?
@codelipenghui Sometimes restarting the problematic broker can solve it, and sometimes it causes the same problem on other brokers (when bundles are transferred to other brokers, the pressure of handling producers is transferred as well). As I mentioned in the description above, restarting the broker frequently can easily trigger this problem. Because the complete jstack results from the production environment may contain sensitive information, I'm afraid the full version cannot be uploaded. 😞
I suspect the write lock may not be the root cause of this issue. When checking … So in my opinion, you'd better check the zk read latency, or whether there is something wrong in …
@codelipenghui FYI. The file has been desensitized.
Thank you for your reminder~ We also checked why the caching strategy of the namespace policy did not take effect. The actual situation is that when the broker gets into the loop-waiting problem, every time the …

There are many ways to break the deadlock condition in this scenario. However, IMHO reducing the use of locks may be a more thorough solution. After all, if shared-mode producers account for most of the topics, the existence of this lock itself is already reducing overall performance. In addition, the logic involved in the cache layer is extensive, so avoiding modifications to the current cache design may be a safer solution.
@wuzhanpeng Would you address the reason for …
As I described in the motivation, when the core pool is occupied by all the threads waiting for the lock (identified as thread#1 above), the thread (thread#2) that loads the ns policy from zk can only wait in the submission queue until it times out.
Thanks for your contribution. For this PR, do we need to update docs? (The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)
@Anonymitaet Thanks for your reminder. No need to update the documentation.
@wuzhanpeng After taking a look at the complete stack, it looks like the issue is related to checking topic policies.
After #13082, I think the issue has been fixed, but that can only fix the master branch and 2.10. For versions < 2.10, I think we should change …
Removing the …
The PR had no activity for 30 days; marking it with the Stale label.
@wuzhanpeng Please add the following content to your PR description and select a checkbox:
Closed as stale and conflicting. Please rebase and resubmit the patch if it's still relevant.
Motivation
In our production environment, when the broker receives a large number of `PRODUCER` requests in a short period of time, we have observed that the broker runs into a loop-waiting problem when handling these requests, which in severe cases causes the broker to get stuck and a huge amount of other requests to time out.

To simplify the description of the problem, assume that multiple producers send `PRODUCER` requests to the same topic at the same time. As we can see in `AbstractTopic#addProducer`, the handling of `addProducer` for each `PRODUCER` request can be broken down into 3 processes (among them, `AbstractTopic#isProducersExceeded` is checked while internally adding the producer). It should be noted that these 3 processes are serial.
Assuming that the core size of the thread pool (actually `ForkJoinPool.commonPool()`) that processes the above requests is only 1, and that only one thread can successfully obtain the lock (`AbstractTopic#lock`) among the simultaneous `PRODUCER` requests, the remaining threads must queue in the submission queue of the thread pool. Unfortunately, there is a high probability that thread#2 (the thread that loads the namespace policy from zk) will be put into the queue waiting for scheduling. In this situation, thread#1, which acquired the lock, cannot complete because it has to wait for thread#2, while the other threads that have not acquired the lock must acquire it first. This process cannot make progress until thread#2 times out and throws an exception. From the jstack result, we can easily see …
To make matters worse, in the default configuration both the zk timeout and the client operation timeout are 30 seconds. This causes each retried request to also end with a timeout, over and over again.
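The loop wait can be illustrated with a minimal, self-contained sketch (not Pulsar code; the single-threaded executor stands in for the saturated `ForkJoinPool.commonPool()`, and all names are hypothetical): thread#1 holds the write lock while waiting for a future that only the pool can complete, but the completion task is queued behind another task that blocks on the same lock.

```java
import java.util.concurrent.*;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LoopWaitSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // stands in for the saturated common pool
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CompletableFuture<Void> policyLoad = new CompletableFuture<>();

        // thread#1: takes the write lock, then waits for the "policy load" future.
        Thread t1 = new Thread(() -> {
            lock.writeLock().lock();
            try {
                policyLoad.get(5, TimeUnit.SECONDS);
                System.out.println("thread#1: policy loaded, producer added");
            } catch (Exception e) {
                System.out.println("thread#1 gave up after timeout: " + e);
            } finally {
                lock.writeLock().unlock();
            }
        });
        t1.start();
        Thread.sleep(100);

        // Another addProducer() call occupies the only pool thread, blocked on the same lock.
        pool.submit(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });

        // thread#2: the task that would complete the future is queued behind the blocked task
        // and cannot run until thread#1 times out and releases the lock.
        pool.submit(() -> policyLoad.complete(null));

        t1.join();
        pool.shutdownNow();
    }
}
```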
Modifications
This problem is so hidden that it is very difficult to detect, and we still have no way to reproduce it in a test environment. However, in our production environment, this problem is more likely to occur if bundle re-loads or broker restarts are triggered frequently (this phenomenon may be more obvious in our production scenarios, where each topic may have thousands of producers). Once the problem occurs in the cluster, there will be a lot of operation timeout exceptions.
Below we give a solution that reduces the use of locks, because we think that for the conventional production model it is sufficient to use a read lock when adding producers in `Shared` mode.
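A minimal sketch of this idea (illustrative only, not the actual Pulsar implementation; it assumes the producer counter and limit check are made thread-safe, e.g. with an `AtomicInteger`, to address the review comment above):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only; class, field, and limit values are hypothetical.
class SharedModeAddProducerSketch {
    enum ProducerAccessMode { Shared, Exclusive, WaitForExclusive }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final AtomicInteger producerCount = new AtomicInteger();
    private final int maxProducers = 1000;  // hypothetical limit

    boolean addProducer(ProducerAccessMode mode) {
        // Shared producers only need the read lock, so they can be added concurrently;
        // exclusive access modes still take the write lock.
        Lock producerLock = mode == ProducerAccessMode.Shared
                ? lock.readLock() : lock.writeLock();
        producerLock.lock();
        try {
            // Atomic check-and-increment keeps the limit correct even when many
            // threads hold the read lock at the same time.
            while (true) {
                int current = producerCount.get();
                if (current >= maxProducers) {
                    return false;
                }
                if (producerCount.compareAndSet(current, current + 1)) {
                    return true;
                }
            }
        } finally {
            producerLock.unlock();
        }
    }
}
```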