Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified UserFateStore.create to behave like MetaFateStore #4787

Merged
merged 4 commits into from
Aug 9, 2024

Conversation

dlmarion
Copy link
Contributor

@dlmarion dlmarion commented Aug 5, 2024

MetaFateStore.create will retry forever when a collision happens when trying to create a Fate transaction. The probabiliy of a collision is low due to a random UUID being used. Before this change the UserFateStore.create method would retry 5 times then throw an exception. This was removed in favor of an unlimited retry so that the behavior of the two Fate stores is the same.

Closes #4246

MetaFateStore.create will retry forever when a collision happens
when trying to create a Fate transaction. The probabiliy of a
collision is low due to a random UUID being used. Before this change
the UserFateStore.create method would retry 5 times then throw an
exception. This was removed in favor of an unlimited retry so that
the behavior of the two Fate stores is the same.

Closes apache#4246
@dlmarion dlmarion added this to the 4.0.0 milestone Aug 5, 2024
@dlmarion dlmarion self-assigned this Aug 5, 2024
@dlmarion dlmarion linked an issue Aug 5, 2024 that may be closed by this pull request
kevinrr888
kevinrr888 previously approved these changes Aug 5, 2024
Copy link
Member

@kevinrr888 kevinrr888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good change and keeps the two stores consistent. Wasn't sure how this would conflict with the changes in #4524, but looking at this, shouldn't have any problems.

@kevinrr888
Copy link
Member

Hmm thinking about this some more. Wondering if it would be better if we had finite number of retries for both stores instead of infinite. The probability of collision is very low.

I'm not sure that both stores are functioning the same under these new changes. The MFS will throw an error if any other KeeperException is seen other than NodeExistsException. The UFS will retry forever if an UNKNOWN status is received. So if there is something wrong and we keep receiving UNKNOWN, it will keep retrying.

@dlmarion
Copy link
Contributor Author

dlmarion commented Aug 5, 2024

Hmm thinking about this some more. Wondering if it would be better if we had finite number of retries for both stores instead of infinite. The probability of collision is very low.

I'm not sure that both stores are functioning the same under these new changes. The MFS will throw an error if any other KeeperException is seen other than NodeExistsException. The UFS will retry forever if an UNKNOWN status is received. So if there is something wrong and we keep receiving UNKNOWN, it will keep retrying.

So, I first went down the path of creating an IT to determine what error a user would receive when something like TableOperations.create failed when the number of retries in the UserFateStore would be exceeded. When I realized that MetaFateStore tried infinitely, I looked at how we could test for a duplicate FateId in the ConditionalMutationWriter. I didn't see an easy way to do that since the FateId is in the row. If we could do that, then we could set a status for duplicate FateId, and continue to retry, and then fail on REJECTED or UNKNOWN.

@kevinrr888
Copy link
Member

kevinrr888 commented Aug 5, 2024

I looked at how we could test for a duplicate FateId in the ConditionalMutationWriter. I didn't see an easy way to do that since the FateId is in the row. If we could do that, then we could set a status for duplicate FateId, and continue to retry, and then fail on REJECTED or UNKNOWN.

As it works now, it is testing for a duplicate FateId (while not directly looking at the FateId). It will be REJECTED if there is already a TStatus set for the FateId, and a TStatus is always set on creation of the FateId and persists throughout the existence of the FateId. Maybe I'm misunderstanding you?

Maybe there is no issue with both stores retrying infinitely, but what would be the problem with both stores retrying only a finite number of times? This could be a larger than 5, but still finite. The probability of collision would be, for all intents and purposes, impossible. If it fails after these attempts, then there is a bigger problem and not a problem with collisions

@dlmarion
Copy link
Contributor Author

dlmarion commented Aug 5, 2024

I looked at how we could test for a duplicate FateId in the ConditionalMutationWriter. I didn't see an easy way to do that since the FateId is in the row. If we could do that, then we could set a status for duplicate FateId, and continue to retry, and then fail on REJECTED or UNKNOWN.

As it works now, it is testing for a duplicate FateId (while not directly looking at the FateId). It will be REJECTED if there is already a TStatus set for the FateId, and a TStatus is always set on creation of the FateId and persists throughout the existence of the FateId. Maybe I'm misunderstanding you?

Maybe there is no issue with both stores retrying infinitely, but what would be the problem with both stores retrying only a finite number of times? This could be a larger than 5, but still finite. The probability of collision would be, for all intents and purposes, impossible. If it fails after these attempts, then there is a bigger problem and not a problem with collisions

I think I mis-understood your comment in relation to the code. You are saying that we should continue to retry on REJECTED, but throw an exception on UNKNOWN?

@kevinrr888
Copy link
Member

I'm just wondering if it would be better/safer to keep a finite number of retries and maybe change MFS to be a finite number of retries as well. It should still retry on UNKNOWN or REJECTED for UFS

@kevinrr888
Copy link
Member

My comment regarding the retry for UNKNOWN was just that if there is something wrong with the TabletServer or something else that would cause UNKNOWN to be received indefinitely, this current impl will retry forever.

@dlmarion
Copy link
Contributor Author

dlmarion commented Aug 5, 2024

I think with the changes in 5882212, the two implementations are consistent. They continue forever when there is a dupe, otherwise they throw an exception.

We can have a discussion about whether or not they should try forever and I'm wondering what they did previously. Because I think that's the answer as the calling code is already written to handle that.

@kevinrr888 kevinrr888 dismissed their stale review August 5, 2024 20:14

Stale review

@kevinrr888
Copy link
Member

Apologies if any of my comments were confusing/misleading.
My only concern is retrying infinitely, not about the logic for retrying. It may be perfectly okay to retry infinitely, I'm just not sure and was throwing that out for discussion

@dlmarion
Copy link
Contributor Author

dlmarion commented Aug 5, 2024

Apologies if any of my comments were confusing/misleading. My only concern is retrying infinitely, not about the logic for retrying. It may be perfectly okay to retry infinitely, I'm just not sure and was throwing that out for discussion

All good, I'm just trying to make sure things are handled consistently. Based on the old ZooStore implementation, which is now the MetaFateStore class, I think we should retry indefinitely in the UserFateStore. Reason being that the code that calls Fate.startTransaction() doesn't currently expect or handle an exception that would cause them to retry. The code that calls Fate.startTransaction expects success. If we want to limit the number of retries, then we should throw a RetriesExceededException or similar and modify the calling code to either retry or supply appropriate messaging to the user that their call to TableOperations.compact or similar has failed to start.

Copy link
Contributor

@cshannon cshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good change to me so things are consistent. I think it makes sense to treat both stores the same and keep retrying forever for the normal create() case where we are generating a UUID (vs supplying it).

As already noted, the meta store is pretty much guaranteed to eventually succeed because it just generates new UUIDs on a collision which is very rare anyways. It will still abort and exit if an exception is thrown due to an issue with Zookeeper (such as the client can't get to the server or something).

For the User store, it also generates a random ID in the same way so it should be fine to keep retrying and I don't really see an issue with that. I think we can keep retrying for REJECTED for sure and probably UNKNOWN.

  1. REJECTED of course just means there was a collision so we definitely want to retry with a new UUID.
  2. For the UNKNOWN case that might mean there was just a transient issue with the client getting confirmation after submission so we don't know if it went through so I think we can try again without issue. If it was actually successful the FateCleaner will handle cleanup and age off.
  3. If another type of AccumuloException is thrown (like the client can't connect at all anymore) or security exception is thrown then a runtime exception bubbles up and we will also exit the method (just like with the meta store and a ZK error).

The time when we should limit the retries would be when a FateKey is provided and we are generating the ID off of the FateKey. That still has a retry of 5 and that makes sense so you don't block forever with multiple attempts as that ID will keep trying the same one over and over and won't generate a new one.

@keith-turner - Do see you any issue with this approach?

@keith-turner
Copy link
Contributor

@keith-turner - Do see you any issue with this approach?

No, that all sounds good.

@keith-turner
Copy link
Contributor

For the UNKNOWN case that might mean there was just a transient issue with the client getting confirmation after submission so we don't know if it went through so I think we can try again without issue. If it was actually successful the FateCleaner will handle cleanup and age off.

More complexity could be added around the unknown case, but its not worth the maintenance burden. I agree its better to just let the FateCleaner deal with it.

@cshannon
Copy link
Contributor

cshannon commented Aug 9, 2024

More complexity could be added around the unknown case, but its not worth the maintenance burden. I agree its better to just let the FateCleaner deal with it.

Yeah I figured it wouldn't be worth trying to scan and check if it was successful as it should be rare and the FateCleaner would clean it up. To really handle UNKNOWN effectively you'd need something more complex like how Ample combines a retry and rejection handler API.

@dlmarion dlmarion merged commit 0d72500 into apache:elasticity Aug 9, 2024
8 checks passed
@dlmarion dlmarion deleted the 4246-fate-create-retry branch August 9, 2024 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Move the test case in AccumuloStoreIT into FateStoreIT
4 participants