Modified UserFateStore.create to behave like MetaFateStore #4787

dlmarion · 2024-08-05T14:32:39Z

MetaFateStore.create retries forever when a collision occurs while creating a Fate transaction. The probability of a collision is low because a random UUID is used. Before this change, UserFateStore.create would retry 5 times and then throw an exception. That limit was removed in favor of unlimited retries so that the two Fate stores behave the same.

Closes #4246
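
A minimal sketch of the retry-until-unique behavior described above, for illustration only. `SketchFateStore`, its in-memory set, and the injectable `idSource` are hypothetical stand-ins, not the Accumulo classes; the real stores persist to ZooKeeper or an Accumulo table.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for a Fate store; not the Accumulo API.
class SketchFateStore {
  private final Set<UUID> existing = ConcurrentHashMap.newKeySet();
  private final Supplier<UUID> idSource;

  SketchFateStore(Supplier<UUID> idSource) {
    this.idSource = idSource;
  }

  // Loops until an unused id is reserved; a collision just draws a new id.
  UUID create() {
    while (true) {
      UUID candidate = idSource.get();
      // Set.add returns false when the id is already present (a collision).
      if (existing.add(candidate)) {
        return candidate;
      }
    }
  }
}
```

In normal use the supplier would be `UUID::randomUUID`, so the loop is expected to exit on the first pass almost every time.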

kevinrr888

This is a good change and keeps the two stores consistent. I wasn't sure whether this would conflict with the changes in #4524, but looking at it, there shouldn't be any problems.

kevinrr888 · 2024-08-05T15:14:46Z

Hmm, thinking about this some more. Wondering if it would be better if we had a finite number of retries for both stores instead of infinite. The probability of collision is very low.

I'm not sure that both stores are functioning the same under these new changes. The MFS will throw an error if any KeeperException other than NodeExistsException is seen. The UFS will retry forever if an UNKNOWN status is received. So if there is something wrong and we keep receiving UNKNOWN, it will keep retrying.

dlmarion · 2024-08-05T15:20:39Z

> Hmm, thinking about this some more. Wondering if it would be better if we had a finite number of retries for both stores instead of infinite. The probability of collision is very low.

> I'm not sure that both stores are functioning the same under these new changes. The MFS will throw an error if any KeeperException other than NodeExistsException is seen. The UFS will retry forever if an UNKNOWN status is received. So if there is something wrong and we keep receiving UNKNOWN, it will keep retrying.

So, I first went down the path of creating an IT to determine what error a user would receive when something like TableOperations.create failed when the number of retries in the UserFateStore would be exceeded. When I realized that MetaFateStore tried infinitely, I looked at how we could test for a duplicate FateId in the ConditionalMutationWriter. I didn't see an easy way to do that since the FateId is in the row. If we could do that, then we could set a status for duplicate FateId, and continue to retry, and then fail on REJECTED or UNKNOWN.

kevinrr888 · 2024-08-05T15:42:53Z

> I looked at how we could test for a duplicate FateId in the ConditionalMutationWriter. I didn't see an easy way to do that since the FateId is in the row. If we could do that, then we could set a status for duplicate FateId, and continue to retry, and then fail on REJECTED or UNKNOWN.

As it works now, it is testing for a duplicate FateId (while not directly looking at the FateId). It will be REJECTED if there is already a TStatus set for the FateId, and a TStatus is always set on creation of the FateId and persists throughout the existence of the FateId. Maybe I'm misunderstanding you?

Maybe there is no issue with both stores retrying infinitely, but what would be the problem with both stores retrying only a finite number of times? The limit could be larger than 5, but still finite. Collisions across that many attempts would be, for all intents and purposes, impossible. If creation still fails after these attempts, then there is a bigger problem, not a problem with collisions.

dlmarion · 2024-08-05T18:25:14Z

> > I looked at how we could test for a duplicate FateId in the ConditionalMutationWriter. I didn't see an easy way to do that since the FateId is in the row. If we could do that, then we could set a status for duplicate FateId, and continue to retry, and then fail on REJECTED or UNKNOWN.

> As it works now, it is testing for a duplicate FateId (while not directly looking at the FateId). It will be REJECTED if there is already a TStatus set for the FateId, and a TStatus is always set on creation of the FateId and persists throughout the existence of the FateId. Maybe I'm misunderstanding you?

> Maybe there is no issue with both stores retrying infinitely, but what would be the problem with both stores retrying only a finite number of times? The limit could be larger than 5, but still finite. Collisions across that many attempts would be, for all intents and purposes, impossible. If creation still fails after these attempts, then there is a bigger problem, not a problem with collisions.

I think I misunderstood your comment in relation to the code. You are saying that we should continue to retry on REJECTED, but throw an exception on UNKNOWN?

kevinrr888 · 2024-08-05T18:30:31Z

I'm just wondering if it would be better/safer to keep a finite number of retries, and maybe change MFS to use a finite number of retries as well. UFS should still retry on UNKNOWN or REJECTED.

kevinrr888 · 2024-08-05T18:45:40Z

My comment regarding the retry for UNKNOWN was just that if there is something wrong with the TabletServer or something else that would cause UNKNOWN to be received indefinitely, this current impl will retry forever.

dlmarion · 2024-08-05T19:57:23Z

I think with the changes in 5882212, the two implementations are consistent. They continue forever when there is a dupe, otherwise they throw an exception.

We can discuss whether or not they should retry forever. I'm wondering what they did previously, because I think that's the answer: the calling code is already written to handle that behavior.

Stale review

core/src/main/java/org/apache/accumulo/core/fate/user/UserFateStore.java

kevinrr888 · 2024-08-05T20:26:44Z

Apologies if any of my comments were confusing/misleading.
My only concern is retrying infinitely, not about the logic for retrying. It may be perfectly okay to retry infinitely, I'm just not sure and was throwing that out for discussion

dlmarion · 2024-08-05T20:43:38Z

> Apologies if any of my comments were confusing/misleading. My only concern is retrying infinitely, not about the logic for retrying. It may be perfectly okay to retry infinitely, I'm just not sure and was throwing that out for discussion

All good, I'm just trying to make sure things are handled consistently. Based on the old ZooStore implementation, which is now the MetaFateStore class, I think we should retry indefinitely in the UserFateStore. Reason being that the code that calls Fate.startTransaction() doesn't currently expect or handle an exception that would cause them to retry. The code that calls Fate.startTransaction expects success. If we want to limit the number of retries, then we should throw a RetriesExceededException or similar and modify the calling code to either retry or supply appropriate messaging to the user that their call to TableOperations.compact or similar has failed to start.
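
A rough sketch of what the bounded alternative would require, under stated assumptions. `RetriesExceededException` here is a hypothetical local class (the comment above only proposes "RetriesExceededException or similar"), and `tryReserve` and `startOperation` are illustrative names, not existing Fate or TableOperations methods.

```java
import java.util.UUID;
import java.util.function.Predicate;

class BoundedCreateSketch {
  // Hypothetical exception type, defined locally for this sketch.
  static class RetriesExceededException extends RuntimeException {
    RetriesExceededException(String msg) {
      super(msg);
    }
  }

  // tryReserve is a hypothetical hook that returns true when the id was stored.
  static UUID create(Predicate<UUID> tryReserve, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      UUID id = UUID.randomUUID();
      if (tryReserve.test(id)) {
        return id;
      }
    }
    throw new RetriesExceededException("gave up after " + maxAttempts + " attempts");
  }

  // The caller must now handle failure instead of assuming success.
  static String startOperation(Predicate<UUID> tryReserve) {
    try {
      return "started " + create(tryReserve, 5);
    } catch (RetriesExceededException e) {
      return "failed to start: " + e.getMessage();
    }
  }
}
```

The point of the sketch is the second method: once `create` can give up, every caller needs a failure path, which is the extra work the comment above is weighing against simply retrying indefinitely.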

cshannon

This seems like a good change to me so things are consistent. I think it makes sense to treat both stores the same and keep retrying forever for the normal create() case where we are generating a UUID (vs supplying it).

As already noted, the meta store is pretty much guaranteed to eventually succeed because it just generates a new UUID on a collision, which is very rare anyway. It will still abort and exit if an exception is thrown due to an issue with ZooKeeper (such as the client being unable to reach the server).

For the User store, it also generates a random ID in the same way so it should be fine to keep retrying and I don't really see an issue with that. I think we can keep retrying for REJECTED for sure and probably UNKNOWN.

REJECTED of course just means there was a collision so we definitely want to retry with a new UUID.
For the UNKNOWN case, there may have been a transient issue with the client getting confirmation after submission, so we don't know whether the write went through. I think we can try again without issue: if it was actually successful, the FateCleaner will handle cleanup and age-off.
If another type of AccumuloException is thrown (like the client can't connect at all anymore) or security exception is thrown then a runtime exception bubbles up and we will also exit the method (just like with the meta store and a ZK error).
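
The three cases above can be sketched roughly as follows. This is an illustration, not the UserFateStore code: the `Status` enum is a local type that only mirrors the status names used in this thread, and `tryInsert` is a hypothetical hook standing in for the conditional mutation.

```java
import java.util.UUID;
import java.util.function.Function;

class StatusRetrySketch {
  // Local enum mirroring the status names in this thread; not Accumulo's type.
  enum Status { ACCEPTED, REJECTED, UNKNOWN }

  // tryInsert is a hypothetical hook; any RuntimeException it throws
  // propagates to the caller and ends the loop.
  static UUID create(Function<UUID, Status> tryInsert) {
    while (true) {
      UUID id = UUID.randomUUID();
      Status status = tryInsert.apply(id);
      switch (status) {
        case ACCEPTED:
          return id;
        case REJECTED: // the id already exists: a collision
        case UNKNOWN: // outcome unconfirmed; any stray entry is left for later cleanup
          break; // leaves the switch, then the loop retries with a new id
      }
    }
  }
}
```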

The case where we should limit retries is when a FateKey is provided and the ID is generated from the FateKey. That path still has a retry limit of 5, which makes sense: each attempt reuses the same ID rather than generating a new one, so you don't want to block forever.
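
To illustrate why the keyed path is capped, here is a small sketch. It assumes the ID is derived deterministically from the key via `UUID.nameUUIDFromBytes`; that derivation, the `String` key, and the `tryReserve` hook are assumptions for illustration, and the real FateKey handling may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.UUID;
import java.util.function.Predicate;

class KeyedCreateSketch {
  // Assumption: the id is a deterministic function of the key.
  static UUID idFor(String key) {
    return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8));
  }

  // Every attempt reuses the same id, so the loop is capped instead of unbounded.
  static Optional<UUID> createKeyed(String key, Predicate<UUID> tryReserve, int maxAttempts) {
    UUID id = idFor(key);
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (tryReserve.test(id)) {
        return Optional.of(id);
      }
    }
    return Optional.empty();
  }
}
```

Unlike the random-UUID path, a retry here cannot escape a collision by drawing a fresh ID, so an unbounded loop could spin forever on an entry that already exists.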

@keith-turner - Do you see any issue with this approach?

keith-turner · 2024-08-09T16:55:16Z

> @keith-turner - Do you see any issue with this approach?

No, that all sounds good.

keith-turner · 2024-08-09T16:57:15Z

> For the UNKNOWN case, there may have been a transient issue with the client getting confirmation after submission, so we don't know whether the write went through. I think we can try again without issue: if it was actually successful, the FateCleaner will handle cleanup and age-off.

More complexity could be added around the unknown case, but it's not worth the maintenance burden. I agree it's better to just let the FateCleaner deal with it.

cshannon · 2024-08-09T17:18:14Z

> More complexity could be added around the unknown case, but it's not worth the maintenance burden. I agree it's better to just let the FateCleaner deal with it.

Yeah I figured it wouldn't be worth trying to scan and check if it was successful as it should be rare and the FateCleaner would clean it up. To really handle UNKNOWN effectively you'd need something more complex like how Ample combines a retry and rejection handler API.

dlmarion added this to the 4.0.0 milestone Aug 5, 2024

dlmarion requested review from cshannon and keith-turner August 5, 2024 14:32

dlmarion self-assigned this Aug 5, 2024

dlmarion linked an issue Aug 5, 2024 that may be closed by this pull request

Move the test case in AccumuloStoreIT into FateStoreIT #4246

Closed

kevinrr888 previously approved these changes Aug 5, 2024

Merge branch 'elasticity' into 4246-fate-create-retry

90023e5

Throw exception on Unknown status return

5882212

kevinrr888 reviewed Aug 5, 2024

core/src/main/java/org/apache/accumulo/core/fate/user/UserFateStore.java (outdated)

Revert "Throw exception on Unknown status return"

99b1ece

This reverts commit 5882212.

cshannon approved these changes Aug 9, 2024

keith-turner approved these changes Aug 9, 2024

kevinrr888 approved these changes Aug 9, 2024

dlmarion merged commit 0d72500 into apache:elasticity Aug 9, 2024
8 checks passed

dlmarion deleted the 4246-fate-create-retry branch August 9, 2024 17:49
