
Fix getting stuck when adding producers #12202

Conversation

wuzhanpeng
Contributor

Motivation

In our production environment, when a broker receives a large number of PRODUCER requests in a short period of time, we have observed that the broker can run into a loop-waiting problem while handling these requests. In severe cases this causes the broker to get stuck and a huge amount of other requests to time out.

To simplify the description of the problem, assume that multiple producers send PRODUCER requests to the same topic at the same time. As can be seen in AbstractTopic#addProducer, the addProducer path of each PRODUCER request can be broken down into three steps:

  1. acquire the lock in thread#1
  2. load data from zk in thread#2 with a timeout (i.e. AbstractTopic#isProducersExceeded inside internalAddProducer)
  3. return to thread#1 to release the lock

Note that these three steps run serially.

Assume that the thread pool processing these steps (in practice ForkJoinPool.commonPool()) has a core size of only 1, and that only one of the simultaneous PRODUCER requests succeeds in acquiring the lock (AbstractTopic#lock); the remaining threads must wait in the pool's submission queue. Unfortunately, there is a high probability that thread#2 is also placed in that queue waiting to be scheduled. In this situation, the thread#1 that acquired the lock cannot complete because it is waiting for thread#2, while the other threads that did not acquire the lock are waiting for thread#1 to release it. Nothing can make progress until thread#2 times out and throws an exception.
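
To make the interleaving concrete, here is a minimal, self-contained Java sketch of the pattern (purely illustrative, not Pulsar code; all names are made up). It uses an explicit single-threaded executor to stand in for the saturated common pool assumed above: the task holding the lock submits the "zk read" to the same pool and blocks on it, so the inner task can never run and the outer one finishes only when the timeout fires.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PoolStarvationSketch {
    public static void main(String[] args) throws Exception {
        // A single-threaded pool stands in for the "core size 1" assumption above.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // thread#1: acquires the lock, then waits for the "zk read" below.
        Future<?> addProducer = pool.submit(() -> {
            lock.writeLock().lock();
            try {
                // thread#2: the "zk read" is submitted to the same pool, but the
                // only worker is busy right here, so this get() can only time out.
                Future<String> zkRead = pool.submit(() -> "namespace-policy");
                System.out.println(zkRead.get(2, TimeUnit.SECONDS));
            } catch (TimeoutException e) {
                System.out.println("zk read timed out while the lock was held");
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                lock.writeLock().unlock();          // the lock is released only now
            }
        });

        addProducer.get();   // completes only after the inner timeout
        pool.shutdown();
    }
}
```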

From the jstack output, we can easily see:

"ForkJoinPool.commonPool-worker-110" #482 daemon prio=5 os_prio=0 tid=0x00007fd714021000 nid=0x61a3 waiting on condition  [0x00007fd562772000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00000006284caad0> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
        at java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1798)
        at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3146)
        at java.util.concurrent.CompletableFuture.timedGet([email protected]/CompletableFuture.java:1868)
        at java.util.concurrent.CompletableFuture.get([email protected]/CompletableFuture.java:2021)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.broker.service.AbstractTopic.isProducersExceeded(AbstractTopic.java:156)
        at org.apache.pulsar.broker.service.AbstractTopic.internalAddProducer(AbstractTopic.java:629)
        at org.apache.pulsar.broker.service.AbstractTopic.lambda$addProducer$8(AbstractTopic.java:405)
        at org.apache.pulsar.broker.service.AbstractTopic$$Lambda$1433/1422007940.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniAcceptNow([email protected]/CompletableFuture.java:753)
        at java.util.concurrent.CompletableFuture.uniAcceptStage([email protected]/CompletableFuture.java:731)
        at java.util.concurrent.CompletableFuture.thenAccept([email protected]/CompletableFuture.java:2108)
        at org.apache.pulsar.broker.service.AbstractTopic.addProducer(AbstractTopic.java:392)
        at org.apache.pulsar.broker.service.persistent.PersistentTopic.addProducer(PersistentTopic.java:540)
        at org.apache.pulsar.broker.service.ServerCnx.lambda$null$22(ServerCnx.java:1233)
        at org.apache.pulsar.broker.service.ServerCnx$$Lambda$1428/932296811.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire([email protected]/CompletableFuture.java:714)
        at java.util.concurrent.CompletableFuture.postComplete([email protected]/CompletableFuture.java:506)
        at java.util.concurrent.CompletableFuture.complete([email protected]/CompletableFuture.java:2073)
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$null$6(BookkeeperSchemaStorage.java:217)
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage$$Lambda$1421/1611023719.apply(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniHandle([email protected]/CompletableFuture.java:930)
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire([email protected]/CompletableFuture.java:907)
        at java.util.concurrent.CompletableFuture$Completion.exec([email protected]/CompletableFuture.java:479)
        at java.util.concurrent.ForkJoinTask.doExec([email protected]/ForkJoinTask.java:290)
        at java.util.concurrent.ForkJoinPool.runWorker([email protected]/ForkJoinPool.java:1603)
        at java.util.concurrent.ForkJoinWorkerThread.run([email protected]/ForkJoinWorkerThread.java:177)

   Locked ownable synchronizers:
        - <0x0000000624e2e9a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

To make matters worse, with the default configuration both the zk operation timeout and the client operation timeout are 30 seconds, so every retried request also ends in a timeout, over and over again.

Modifications

This problem is well hidden and very hard to detect; we still have no way to reproduce it in a test environment. In our production environment, however, it is more likely to occur when bundle re-load or broker restart operations are triggered frequently (the effect is amplified in our scenario, where a single topic may have thousands of producers). Once the problem occurs in the cluster, a large number of operation-timeout exceptions follow.

Below we propose a solution that reduces the use of the lock: for the conventional production model, we believe a read lock is sufficient when adding producers in Shared mode, as sketched below.
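
As a minimal sketch of this idea (illustrative names only, not the actual Pulsar classes; the real change is in the diff below): Shared-mode producers take the read lock so they can be added concurrently, while other access modes still take the write lock, with the caveat that whatever runs under the read lock must be thread-safe on its own.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch, not the actual Pulsar classes.
class ProducerRegistrationSketch {
    enum AccessMode { Shared, Exclusive }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void addProducer(AccessMode mode, Runnable internalAdd) {
        Lock producerLock = (mode == AccessMode.Shared) ? lock.readLock() : lock.writeLock();
        producerLock.lock();
        try {
            internalAdd.run();   // anything mutated here must be thread-safe under a read lock
        } finally {
            producerLock.unlock();
        }
    }
}
```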

@wuzhanpeng
Contributor Author

@merlimat @sijie @eolivelli Could you help to check this?

Comment on lines +409 to +411
Lock producerLock = producer.getAccessMode() == ProducerAccessMode.Shared
        ? lock.readLock() : lock.writeLock();
producerLock.lock();
Contributor

The counter in internalAddProducer() is not thread-safe; it will be a problem if you use the read lock.
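
A generic illustration of the concern (not Pulsar's actual code): a read lock allows many holders at once, so any plain read-modify-write performed under it can lose updates.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Generic illustration, not Pulsar's code: two threads holding the *read*
// lock at the same time can race on a plain counter and lose increments.
class ReadLockCounterRace {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int producerCount = 0;   // not thread-safe by itself

    void addProducerUnderReadLock() {
        lock.readLock().lock();      // many threads may enter concurrently
        try {
            producerCount++;         // unsynchronized read-modify-write
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Mutations like this would need their own synchronization (an AtomicInteger, a concurrent map, or similar) before the write lock could be relaxed to a read lock.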

Contributor Author

Thanks for your reply! If I may ask, which counter in internalAddProducer do you mean?

@codelipenghui
Contributor

codelipenghui commented Sep 27, 2021

@wuzhanpeng Do you have the complete stack information? I'm not sure whether there is a deadlock in the metadata cache. Have you tried restarting the broker to see whether the problem goes away?

@wuzhanpeng
Contributor Author

> @wuzhanpeng Do you have the complete stack information? I'm not sure whether there is a deadlock in the metadata cache. Have you tried restarting the broker to see whether the problem goes away?

@codelipenghui Sometimes restarting the problematic broker solves it, and sometimes the same problem then appears on other brokers (when bundles are transferred to other brokers, the pressure of handling producers moves with them). As I mentioned in the description above, restarting brokers frequently can easily trigger this problem.

Because the complete jstack output from our production environment may contain sensitive information, I'm afraid the full version cannot be uploaded. 😞

@hangc0276
Contributor

hangc0276 commented Sep 28, 2021

I suspect the write lock may not be the root cause of this issue.

When checking isProducersExceeded, the broker loads policy data from zk if the topic policy is not configured. Once the first thread calls isProducersExceeded, the policy is cached, and the following checks won't be blocked.

So in my opinion, you'd better check the zk read latency or whether something is wrong in ZooKeeperDataCache. You could also share the stack trace related to ZooKeeperDataCache.

@wuzhanpeng
Contributor Author

broker.jstack.txt

@codelipenghui FYI. The file has been sanitized.

@wuzhanpeng
Contributor Author

wuzhanpeng commented Sep 29, 2021

> I suspect the write lock may not be the root cause of this issue.
>
> When checking isProducersExceeded, the broker loads policy data from zk if the topic policy is not configured. Once the first thread calls isProducersExceeded, the policy is cached, and the following checks won't be blocked.
>
> So in my opinion, you'd better check the zk read latency or whether something is wrong in ZooKeeperDataCache. You could also share the stack trace related to ZooKeeperDataCache.

Thank you for your reminder~

We also investigated why the caching of the ns policy did not take effect. What actually happens is that once the broker runs into the loop-waiting problem, every time ZooKeeperDataCache#get times out it also invalidates the corresponding z-path. As a result, the next read of the ns policy has to fetch the data from zk again. Once the problem occurs, it is therefore difficult for the cache to ever be populated and for the broker to get out of the predicament.
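
The pattern described above can be sketched roughly like this (a hypothetical cache, not the real ZooKeeperDataCache API): invalidating the entry on every timeout means the next caller has to go back to zk again, so the cache never warms up while the pool is stuck.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical cache sketch, not the real ZooKeeperDataCache API.
class TimeoutInvalidatingCache<T> {
    private final ConcurrentHashMap<String, CompletableFuture<T>> cache = new ConcurrentHashMap<>();
    private final long timeoutSeconds;
    private final Function<String, CompletableFuture<T>> zkLoader;

    TimeoutInvalidatingCache(long timeoutSeconds, Function<String, CompletableFuture<T>> zkLoader) {
        this.timeoutSeconds = timeoutSeconds;
        this.zkLoader = zkLoader;
    }

    T get(String path) throws Exception {
        CompletableFuture<T> pending = cache.computeIfAbsent(path, zkLoader);
        try {
            return pending.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            cache.remove(path, pending);   // invalidated on timeout -> next get() must hit zk again
            throw e;
        }
    }
}
```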

There are many ways to break the deadlock condition in this scenario. However, IMHO reducing the use of the lock may be the more thorough solution: if Shared-mode producers account for the majority on a topic, the lock itself already reduces overall performance. In addition, the cache layer is involved in a lot of logic, so avoiding modifications to the current cache design may be the safer approach.

@hangc0276
Contributor

> every time ZooKeeperDataCache#get times out

@wuzhanpeng Could you explain why ZooKeeperDataCache#get times out every time?

@wuzhanpeng
Contributor Author

> every time ZooKeeperDataCache#get times out
>
> @wuzhanpeng Could you explain why ZooKeeperDataCache#get times out every time?

As I described in the motivation, when the core pool is fully occupied by threads waiting for the lock (identified as thread#1 above), the thread that loads the ns policy from zk (thread#2) can only wait in the submission queue until it times out.

@Anonymitaet
Member

Thanks for your contribution. For this PR, do we need to update docs?

(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@eolivelli eolivelli modified the milestones: 2.9.0, 2.10.0 Oct 6, 2021
@wuzhanpeng
Contributor Author

> Thanks for your contribution. For this PR, do we need to update docs?
>
> (The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@Anonymitaet Thanks for your reminder. No need to update the documentation.

@Anonymitaet Anonymitaet added the doc-not-needed Your PR changes do not impact docs label Oct 8, 2021
@codelipenghui
Contributor

@wuzhanpeng After taking a look at the complete stack, it looks like the issue is related to checking topic policies:

"ForkJoinPool.commonPool-worker-110" #482 daemon prio=5 os_prio=0 tid=0x00007fd714021000 nid=0x61a3 waiting on condition  [0x00007fd562772000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x00000006284caad0> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
	at java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1798)
	at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3146)
	at java.util.concurrent.CompletableFuture.timedGet([email protected]/CompletableFuture.java:1868)
	at java.util.concurrent.CompletableFuture.get([email protected]/CompletableFuture.java:2021)
	at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
	at org.apache.pulsar.broker.service.AbstractTopic.isProducersExceeded(AbstractTopic.java:156)
	at org.apache.pulsar.broker.service.AbstractTopic.internalAddProducer(AbstractTopic.java:629)
	at org.apache.pulsar.broker.service.AbstractTopic.lambda$addProducer$8(AbstractTopic.java:405)
	at org.apache.pulsar.broker.service.AbstractTopic$$Lambda$1433/1422007940.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniAcceptNow([email protected]/CompletableFuture.java:753)
	at java.util.concurrent.CompletableFuture.uniAcceptStage([email protected]/CompletableFuture.java:731)
	at java.util.concurrent.CompletableFuture.thenAccept([email protected]/CompletableFuture.java:2108)
	at org.apache.pulsar.broker.service.AbstractTopic.addProducer(AbstractTopic.java:392)
	at org.apache.pulsar.broker.service.persistent.PersistentTopic.addProducer(PersistentTopic.java:540)
	at org.apache.pulsar.broker.service.ServerCnx.lambda$null$22(ServerCnx.java:1233)
	at org.apache.pulsar.broker.service.ServerCnx$$Lambda$1428/932296811.accept(Unknown Source)
	at java.util.concurrent.CompletableFuture$UniAccept.tryFire([email protected]/CompletableFuture.java:714)
	at java.util.concurrent.CompletableFuture.postComplete([email protected]/CompletableFuture.java:506)
	at java.util.concurrent.CompletableFuture.complete([email protected]/CompletableFuture.java:2073)
	at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$null$6(BookkeeperSchemaStorage.java:217)
	at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage$$Lambda$1421/1611023719.apply(Unknown Source)
	at java.util.concurrent.CompletableFuture.uniHandle([email protected]/CompletableFuture.java:930)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire([email protected]/CompletableFuture.java:907)
	at java.util.concurrent.CompletableFuture$Completion.exec([email protected]/CompletableFuture.java:479)
	at java.util.concurrent.ForkJoinTask.doExec([email protected]/ForkJoinTask.java:290)
	at java.util.concurrent.ForkJoinPool.runWorker([email protected]/ForkJoinPool.java:1603)
	at java.util.concurrent.ForkJoinWorkerThread.run([email protected]/ForkJoinWorkerThread.java:177)

   Locked ownable synchronizers:
	- <0x0000000624e2e9a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

After #13082, I think the issue has been fixed, but that only fixes the master branch and 2.10. For versions < 2.10, I think we should change topicPolicies.getMaxProducersPerTopic() to only check the topic policies already in the cache, similar to getIfPresent. We should fix branch-2.8 and branch-2.9.
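
A generic sketch of the "only check the cache" idea, using Guava's Cache purely for illustration (Pulsar's topic-policies cache has its own types, and this is not the actual fix): the lookup returns immediately with whatever is cached instead of blocking on a metadata read while a lock is held.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.Optional;

// Illustration only: a non-blocking, cache-only lookup in the spirit of getIfPresent.
class CachedPolicyLookup {
    private final Cache<String, Integer> maxProducersCache =
            CacheBuilder.newBuilder().maximumSize(10_000).build();

    // Returns empty when the policy is not cached yet, rather than waiting on zk.
    Optional<Integer> maxProducersIfCached(String topic) {
        return Optional.ofNullable(maxProducersCache.getIfPresent(topic));
    }
}
```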

@michaeljmarshall
Member

Removing the release/2.8.3 label since this will miss the release.

@github-actions

This PR has had no activity for 30 days; marking it with the Stale label.

@github-actions

github-actions bot commented Dec 9, 2022

@wuzhanpeng Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@tisonkun
Member

Closed as stale and conflicting. Please rebase and resubmit the patch if it is still relevant.

@tisonkun tisonkun closed this Dec 10, 2022