Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve information on CancellationException for LROs #2236

Merged
merged 22 commits into from
Nov 14, 2023

Conversation

diegomarquezp
Copy link
Contributor

Fixes #1858
When a LRO runs out of retries/time, it throws a CancellationException that is then swallowed by guava's AbstractFuture, meaning there is no simple way to bubble up a custom exception with message.

This approach uses a logger with a message that points to LRO documentation to ensure the user is informed on what they can do about this.

@diegomarquezp diegomarquezp requested a review from a team as a code owner November 10, 2023 20:19
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Nov 10, 2023
@diegomarquezp diegomarquezp marked this pull request as draft November 10, 2023 20:22
@diegomarquezp diegomarquezp added the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 13, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 13, 2023
@diegomarquezp diegomarquezp marked this pull request as ready for review November 13, 2023 16:35
@@ -77,6 +87,13 @@ public boolean shouldRetry(TimedAttemptSettings nextAttemptSettings)
if (super.shouldRetry(nextAttemptSettings)) {
return true;
}
if (LOGGER.isLoggable(Level.SEVERE)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ, is there a reason SEVERE logging level was picked? It seems like that is meant for more fatal application issues (and less for an RPC timeout), but I also am not sure if there is a better level (Info? Warning?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. My reasoning was more like "exception=severe", but in the context of a timeout on a request I agree something like WARNING is better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decreased it to WARNING

@diegomarquezp diegomarquezp requested a review from lqiu96 November 13, 2023 17:02

/**
* Operation timed polling algorithm, which uses exponential backoff factor for determining when the
* next polling operation should be executed. If the polling exceeds the total timeout this
* algorithm cancels polling.
*/
public class OperationTimedPollAlgorithm extends ExponentialRetryAlgorithm {

@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think the logger needs a VisibleForTesting annotation.

@@ -121,6 +124,8 @@ public void testHttpJson_LROSuccessfulResponse_doesNotExceedTotalTimeout() throw
@Test
public void testGRPC_LROUnsuccessfulResponse_exceedsTotalTimeout_throwsDeadlineExceededException()
throws Exception {
FakeLogHandler logHandler = new FakeLogHandler();
Copy link
Contributor

@lqiu96 lqiu96 Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Might be better to create field for FakeLogHandler that all tests can use without having to initialize in multiple tests, if possible

setUp() -> Creates it
clean() -> clears all the message so the next test can run

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks! I only added a setUp() since creating a new handler wipes out the old handler with its maessages

Comment on lines 171 to 172
FakeLogHandler logHandler = new FakeLogHandler();
OperationTimedPollAlgorithm.LOGGER.addHandler(logHandler);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Contributor

@lqiu96 lqiu96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, added a few nits. Can you also have @burkedavison or @JoeWang1127 take a quick look? Thanks for this @diegomarquezp


/**
* Operation timed polling algorithm, which uses exponential backoff factor for determining when the
* next polling operation should be executed. If the polling exceeds the total timeout this
* algorithm cancels polling.
*/
public class OperationTimedPollAlgorithm extends ExponentialRetryAlgorithm {

public static Logger LOGGER = Logger.getLogger(OperationTimedPollAlgorithm.class.getName());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the logger has to be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so it can be accessed by the test in order to add a log handler. @lqiu96, maybe we can fall back to @VisibleForTesting because it is... visible only for testing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Let's try the the default package scope, which I believe should still be accessible. If not, then we can add the annotation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -149,6 +161,9 @@ public void testGRPC_LROUnsuccessfulResponse_exceedsTotalTimeout_throwsDeadlineE
assertThrows(CancellationException.class, operationFuture::get);
int attemptCount = operationFuture.getPollingFuture().getAttemptSettings().getAttemptCount();
assertThat(attemptCount).isGreaterThan(1);
assertEquals(1, logHandler.getAllMessages().size());
assertThat(logHandler.getAllMessages().get(0))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a guarantee that the message is the first element of the list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far the implementation has only one case where the logger is used. I think the current assertions are a safe heuristic. Maybe I can assert that at least one message contains the entry I'm looking for?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I can assert that at least one message contains the entry I'm looking for?

I think this maybe better.

@JoeWang1127
Copy link
Collaborator

I don't have a lot of context of this change so I provide some general suggestions.

@@ -39,6 +43,13 @@
*/
public class ITLongRunningOperation {
Copy link
Member

@burkedavison burkedavison Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that this already-complex integration test the best place to add these new assertions? Rephrased: Is the showcase server needed to verify the client's logging behavior?

The changes we're making are in com.google.api.gax.longrunning.OperationTimedPollAlgorithm, but we don't yet have any unit tests for this class. Is it possible for us to write a direct unit test that focuses purely on the logging behavior of OperationTimedPollAlgorithm.shouldRetry()?

Without looking at this PR and only the logic we're trying to test, I would have expected to see a set of unit tests that were roughly:

// WARNING: This is all completely pseudocode...
public void testTroubleshootingLinkProvided() {
  LOGGER.addHandler(handler);
  assertThrows(..., algorithm.shouldRetry(NEXT_ATTEMPT_SHOULD_NOT_RETRY));
  assertThat(handler.getLogMessage().exists())
  assertThat(handler.getLogMessage().contains(LRO_TROUBLESHOOTING_LINK));
}
public void testTroubleshootingLinkNotProvided() {
  LOGGER.addHandler(handler);
  assertThat(algorithm.shouldRetry(NEXT_ATTEMPT_SHOULD_RETRY), is(true));
  assertThat(handler.getLogMessage().doesNotExist())
}

Then the integration test can be kept clear of client-side-only implementation details, allowing the person debugging a problem to focus only on the relevant interactions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@burkedavison thank you. At some point writing the test in this file helped me find that the exception thrown in OperationTimedPollAlgorithm does not bubble up to the client (hence the use of a logger), but maybe that's just part of finding out about the behavior of LROs. I will move the test case to a unit test in gax

Comment on lines 103 to 114
FakeApiClock clock = new FakeApiClock(System.nanoTime());
OperationTimedPollAlgorithm algorithm =
OperationTimedPollAlgorithm.create(FAST_RETRY_SETTINGS, clock);
TimedAttemptSettings settings =
TimedAttemptSettings.newBuilder()
.setGlobalSettings(FAST_RETRY_SETTINGS)
.setRetryDelay(Duration.ofMillis(1l))
.setRpcTimeout(Duration.ofMillis(1l))
.setRandomizedRetryDelay(Duration.ofMillis(1l))
.setAttemptCount(0)
.setFirstAttemptStartTimeNanos(clock.nanoTime())
.build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the improvement by moving this to a unit test!

At this point, I'm just nit-picking: you could choose to extract all three of these variables to class fields, and further reduce both of these tests to their essential checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

.setAttemptCount(0)
.setFirstAttemptStartTimeNanos(clock.nanoTime())
.build();
clock.incrementNanoTime(1 * 1000 * 1000 * 1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider a quick comment here to clarify what part of the logic this is triggering to force a no-retry.
For example (if this is accurate):

clock.incrementNanoTime(1 * 1000 * 1000 * 1000); // Force Rpc Timeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. I added this comment

Comment on lines 60 to 61
private static TimedAttemptSettings timedAttemptSettings;
private static FakeApiClock clock;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By making these static and initializing them in an instance method (@Before), the test now becomes flaky if multithreading is ever enabled. One test may overwrite the reference during a different test.

Suggested change
private static TimedAttemptSettings timedAttemptSettings;
private static FakeApiClock clock;
private TimedAttemptSettings timedAttemptSettings;
private FakeApiClock clock;

Copy link
Contributor Author

@diegomarquezp diegomarquezp Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, good catch. thanks. I didn't think through the need for static in this test class. I just switched them as instance variables

edit: I didn't notice this comment had a suggestion but I added the same change in another commit

Copy link

[gapic-generator-java-root] Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

80.0% 80.0% Coverage
0.0% 0.0% Duplication

Copy link

[java_showcase_integration_tests] Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

80.0% 80.0% Coverage
0.0% 0.0% Duplication

@diegomarquezp diegomarquezp merged commit 741e40c into main Nov 14, 2023
35 checks passed
@diegomarquezp diegomarquezp deleted the longrunning-error-message-2 branch November 14, 2023 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve LRO's Error Message on Timeout
4 participants