[improve][broker] PIP-307 Expose inflight state waiting time and service channel monitor interval configs. Handle AddEntry failure during topic transfer #21668

heesung-sn · 2023-12-04T20:05:21Z

Fixes: #21654

PIP: #307

Motivation

We want to expose the in-flight state waiting time and service channel monitor interval configs to allow finer control over the Extensible Load Balancer's behavior. We don't expose these variables in broker.conf in this PR, as they rarely need tuning.
Reduce the flakiness of ExtensibleLoadManagerImplTest by removing the static mock.
Add AddEntry failure handling logic for PIP-307 to cover the edge case.

Modifications

Exposed the in-flight state waiting time and service channel monitor interval configs in ServiceConfiguration.
Removed the static mock in ExtensibleLoadManagerImplTest.
Added retries to the ExtensibleLoadManagerImplTest test cases.
Added AddEntry failure handling logic for PIP-307 to cover the edge case (added a transferring state in AbstractTopic).

Verifying this change

Make sure that the change passes the CI checks.
The AddEntry failure handling logic for PIP-307 is covered by testTransferClientReconnectionWithoutLookup (this test was flaky because AddEntry failed while the ledger was fenced by the unloading).

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

- [ ] `doc`
- [ ] `doc-required`
- [x] `doc-not-needed`
- [ ] `doc-complete`

Matching PR in forked repository

PR in forked repository: heesung-sn#54

github-actions · 2023-12-04T20:05:56Z

@heesung-sn Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

dragosvictor

Looks good, left some comments.

dragosvictor · 2023-12-04T20:29:23Z

...ava/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java

+        this.stateTombstoneDelayTimeInSeconds = config.getLoadBalancerServiceUnitStateTombstoneDelayTimeInSeconds()
                * 1000;


Nit: this shouldn't be multiplied by 1000 anymore.

Oh, thanks for the catch. The name should be stateTombstoneDelayTimeInMillis.
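The exchange above comes down to keeping a field's unit suffix and its value in agreement. Here is a minimal sketch of the difference, assuming a delay that may be configured either in seconds or in milliseconds. The class and method names are illustrative and are not part of Pulsar.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: these names are hypothetical, not Pulsar's.
final class TombstoneDelayUnits {

    // A config value expressed in seconds must be converted before it is used as millis.
    static long fromSecondsConfig(long delayInSeconds) {
        return TimeUnit.SECONDS.toMillis(delayInSeconds);
    }

    // A config value already expressed in millis must be used unchanged.
    static long fromMillisConfig(long delayInMillis) {
        return delayInMillis;
    }

    // The pattern flagged in the review: a millis value multiplied by 1000 again.
    static long buggyFromMillisConfig(long delayInMillis) {
        return delayInMillis * 1000;
    }
}
```

With a one-hour delay, the first two methods agree on 3,600,000 ms, while the third yields a value 1000 times larger.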

dragosvictor · 2023-12-04T20:32:14Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java

+        if (isClosingOrDeleting
+                && ExtensibleLoadManagerImpl.isLoadManagerExtensionEnabled(getBrokerService().pulsar())) {
+            if (log.isDebugEnabled()) {
+                log.debug("[{}] Failed to persist msg in store: {} while closing or deleting.",
+                        topic, exception.getMessage(), exception);
+            }
+            return;
+        }


Looks correct, but is there anything to be done about the PublishContext callback?

I think the callback is used to signal clients so that they can also handle this case on their side. In this case, we don't need to send anything to the clients.

dragosvictor · 2023-12-04T20:40:04Z

...test/java/org/apache/pulsar/broker/loadbalance/extensions/ExtensibleLoadManagerImplTest.java

+        Awaitility.await().atMost(5, TimeUnit.SECONDS).untilAsserted(() -> {
+            assertTrue(producer.isConnected());
+            verify(lookup, times(lookupCountBeforeUnload)).getBroker(topicName);
+        });


Can we simplify as suggested below? We aren't expecting any new calls to lookup.getBroker(topicName), so we could go further and replace the condition with never().

Suggested change

-        Awaitility.await().atMost(5, TimeUnit.SECONDS).untilAsserted(() -> {
-            assertTrue(producer.isConnected());
-            verify(lookup, times(lookupCountBeforeUnload)).getBroker(topicName);
-        });
+        Awaitility.await().atMost(5, TimeUnit.SECONDS).until(producer::isConnected);
+        verify(lookup, times(lookupCountBeforeUnload)).getBroker(topicName);
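The suggestion separates the waiting step (poll until the producer is connected) from a one-shot check that the lookup count has not changed. Below is a rough JDK-only stand-in for that shape. The `awaitUntil` helper is a hypothetical substitute for Awaitility, and the counter is a substitute for the Mockito `verify`; neither is the test's actual code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Illustrative only: a plain-JDK model of "wait for a condition, then assert once".
final class AwaitThenVerifySketch {

    // Polls the condition until it holds or the timeout elapses.
    static boolean awaitUntil(BooleanSupplier condition, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (System.nanoTime() - deadline < 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return condition.getAsBoolean();
    }

    // Simulates a reconnect on another thread, waits for it, then checks the counter once.
    static boolean reconnectWithoutLookup() {
        AtomicBoolean connected = new AtomicBoolean(false);
        AtomicInteger lookupCalls = new AtomicInteger(1); // one lookup before the unload
        int lookupCountBeforeUnload = lookupCalls.get();

        new Thread(() -> connected.set(true)).start();

        boolean isConnected = awaitUntil(connected::get, 5, TimeUnit.SECONDS);
        // One-shot check after the wait, mirroring verify(..., times(n)).
        return isConnected && lookupCalls.get() == lookupCountBeforeUnload;
    }
}
```

One trade-off to keep in mind: in the original form the count check is retried inside the polling loop, whereas here it runs exactly once after the connection is observed.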

dragosvictor · 2023-12-04T20:42:40Z

...ava/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelTest.java

@@ -654,12 +654,12 @@ public void splitAndRetryTest() throws Exception {
        FieldUtils.writeDeclaredField(channel1,
                "inFlightStateWaitingTimeInMillis", 30 * 1000, true);
        FieldUtils.writeDeclaredField(channel1,
-                "semiTerminalStateWaitingTimeInMillis", 300 * 1000, true);
+                "stateTombstoneDelayTimeInSeconds", 300 * 1000, true);


Here and below: are these values still correct, since we switched to a different time unit?

dragosvictor · 2023-12-05T00:25:04Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Producer.java

+                // if the topic is transferring, we don't send error code to the clients.
+                if (producer.getTopic().isTransferring()) {
+                    if (log.isDebugEnabled()) {
+                        log.debug("[{}] Received producer exception: {} while transferring.",
+                                producer.getTopic().getName(), exception.getMessage(), exception);
+                    }
+                    return;
+                }


Wouldn't we still need to execute the code inside the lambda below, apart from sending the error? There are other cleanup operations performed there that avoid resource leaks.

I agree. I'm moving this logic into the lambda so that the cleanups are still called. Thank you.
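The agreed outcome is that only the error reply is skipped while the topic is transferring, and the cleanup still runs in both cases. A minimal sketch of that control flow follows, with a hypothetical handler that records what it does instead of talking to a real producer. The event names are placeholders, not Pulsar's actual cleanup steps.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models the control flow, not Pulsar's Producer class.
final class PublishFailureSketch {
    final List<String> events = new ArrayList<>();

    // Cleanup always runs; the error reply is skipped while the topic is transferring.
    void onPublishFailure(boolean topicTransferring, String error) {
        if (topicTransferring) {
            // The client is not told about the failure during a transfer.
            events.add("debug-log:" + error);
        } else {
            events.add("send-error:" + error);
        }
        // Cleanup that must not be skipped, to avoid leaking resources.
        events.add("cleanup");
    }
}
```

An early `return` placed before the cleanup, as in the first version of the diff, would have dropped the `cleanup` event in the transferring case.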

dragosvictor · 2023-12-05T00:26:08Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java

@@ -1473,6 +1486,9 @@ public CompletableFuture<Void> close(

        lock.writeLock().lock();
        try {
+            if (!disconnectClients) {
+                transferring = true;


Just confirming that this flag is not meant to ever go back to false. Is that the intent? As it is right now, publishing a message to a topic that was transferred would lead to a TopicClosedException, whereas in the current proposal the exception would be silenced forever.

Yes, transferring will not go back to false.

In the worst case, if the transfer gets stuck, the leader monitor will send a message to the source broker to fix the stuck state. handleOwnEvent processes it in the branch } else if ((data.force() || isTransferCommand(data)) && isTargetBroker(data.sourceBroker())) {. In this case, the topics will be forcefully closed with disconnectClients=true.

I agree, the topics will be forcefully closed as you mentioned, but further messages produced will still be ignored instead of responded to with TopicClosedException.
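The concern in this thread can be illustrated with a deliberately simplified model of a flag that latches to true and is never reset. The class below is hypothetical and far smaller than PersistentTopic. It only shows why a later forced close would not change what a producer observes.

```java
// Illustrative only: a toy model of the latching flag, not Pulsar's PersistentTopic.
final class TransferringTopicSketch {
    private volatile boolean transferring = false;
    private volatile boolean closed = false;

    // close(false) models a transfer: clients stay connected and the flag latches to true.
    void close(boolean disconnectClients) {
        if (!disconnectClients) {
            transferring = true;
        }
        closed = true;
    }

    // Returns what a producer would observe for a publish attempt in this model.
    String publish() {
        if (transferring) {
            return "ignored"; // no error is sent back to the client
        }
        if (closed) {
            return "TopicClosedException";
        }
        return "ok";
    }
}
```

In this model, calling `close(false)` and then `close(true)` still leaves `publish()` returning `"ignored"`, which is the behavior the reviewer points out.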

Demogorgon314 · 2023-12-05T01:31:25Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java

+         Instead, we will rely on the service unit state channel's bundle(topic) transfer protocol.
+         At the end of the transfer protocol, at Owned state, the source broker should close the topic properly.
+         */
+        if (transferring) {


Is this change relevant to the flaky test? If not, a separate PR might be better.

Yes, this is required. Otherwise, testTransferClientReconnectionWithoutLookup will be flaky.

Demogorgon314 · 2023-12-05T01:33:42Z

pulsar-broker-common/src/main/java/org/apache/pulsar/broker/ServiceConfiguration.java

+                    + "by reassigning the ownerships if stuck too long, longer than this period."
+                    + "(only used in load balancer extension logics)"
+    )
+    private long loadBalancerInFlightServiceUnitStateWaitingTimeInMillis = 30 * 1000;


Do we need a PIP to add configuration? /cc @codelipenghui

I don't think we need a PIP for this config, as this is a minor addition (and these values rarely need tuning). If required, I think we could update the config list in PIP-192.

Yes, updating PIP-192 makes sense. It would be better to share it on the mailing list as well.

I updated PIP-192 to list the ExtensibleLoadBalancer's service configurations.
I shared this info on the mailing list: https://lists.apache.org/thread/zz612q2bhh6rccl04w3jz2mvt0z5kch8
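For context on what the new setting controls: the config description in the diff says ownerships are reassigned when a state is stuck "longer than this period". Below is a minimal sketch of such a check, assuming timestamps in milliseconds. The class and method names are hypothetical, and only the 30-second default comes from the diff.

```java
// Illustrative only: the names are hypothetical; the default value mirrors the diff.
final class InFlightStateCheckSketch {
    // Default from the diff: 30 seconds, expressed in milliseconds.
    static final long IN_FLIGHT_WAITING_TIME_MILLIS = 30 * 1000;

    // True when a state has been in flight for strictly longer than the waiting time.
    static boolean isStuck(long stateStartedAtMillis, long nowMillis, long waitingTimeMillis) {
        return nowMillis - stateStartedAtMillis > waitingTimeMillis;
    }
}
```

Keeping the unit in the constant's name makes it harder to pass a value in seconds by mistake.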

Demogorgon314 · 2023-12-05T01:34:56Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractTopic.java

@@ -157,6 +157,7 @@ public abstract class AbstractTopic implements Topic, TopicPolicyListener<TopicP
    protected final LongAdder msgOutFromRemovedSubscriptions = new LongAdder();
    protected final LongAdder bytesOutFromRemovedSubscriptions = new LongAdder();
    protected volatile Pair<String, List<EntryFilter>> entryFilters;
+    protected volatile boolean transferring = false;


Can we reuse isFenced?

No. I think we should have a different state here. isFenced=true can happen in other cases (e.g. when a BK write fails), so we need to define a new state separately.

dragosvictor · 2023-12-05T01:49:29Z

Looks good to me, thanks!

Demogorgon314 · 2023-12-05T02:25:22Z

pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Topic.java

@@ -338,7 +338,7 @@ default boolean isSystemTopic() {

    boolean isPersistent();

-    boolean isFenced();


Should we keep this interface here? It might break some protocol handler compatibility if removed

No. This was only introduced for PIP-307. Since we don't use it anymore, we are deleting it.

I see, thanks for explaining.

[improve][broker] PIP-307 Expose tombstone and inflight state waiting…

34b88b3

… time configs. Handle AddEntry failure during topic transfer

github-actions bot added the doc-label-missing label Dec 4, 2023

heesung-sn changed the title ~~[improve][broker] PIP-307 Expose tombstone and inflight state waiting…~~ [improve][broker] PIP-307 Expose tombstone and inflight state waiting time configs. Handle AddEntry failure during topic transfer Dec 4, 2023

github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Dec 4, 2023

dragosvictor reviewed Dec 4, 2023

heesung-sn self-assigned this Dec 4, 2023

heesung-sn force-pushed the pip-192-config branch from 37abda8 to 66a50f6 Compare December 4, 2023 21:16

resolved comments

6e168bd

heesung-sn force-pushed the pip-192-config branch from 66a50f6 to 6e168bd Compare December 4, 2023 21:19

added transferring state in topic

140fd2e

dragosvictor reviewed Dec 5, 2023

resolved comment

1264543

Demogorgon314 reviewed Dec 5, 2023

heesung-sn mentioned this pull request Dec 5, 2023

Flaky-test: ClassCastException in ExtensibleLoadManagerImpl.checkOwnershipAsync #21654

Closed

1 task

dragosvictor approved these changes Dec 5, 2023

Demogorgon314 approved these changes Dec 5, 2023

heesung-sn added ready-to-test type/flaky-tests labels Dec 5, 2023

heesung-sn added this to the 3.2.0 milestone Dec 5, 2023

heesung-sn closed this Dec 5, 2023

heesung-sn reopened this Dec 5, 2023

heesung-sn requested a review from codelipenghui December 5, 2023 17:40

codelipenghui approved these changes Dec 6, 2023

codelipenghui merged commit 93df344 into apache:master Dec 6, 2023
73 of 76 checks passed

nikam14 mentioned this pull request Mar 18, 2024

Flaky-test: ExtensibleLoadManagerImplTest.testSplitBundleAdminAPI #21553

Closed

2 tasks

heesung-sn deleted the pip-192-config branch April 2, 2024 17:45
