Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fate reservations moved out of memory #4524

Merged
merged 24 commits into from
Sep 19, 2024

Conversation

kevinrr888
Copy link
Member

@kevinrr888 kevinrr888 commented May 3, 2024

closes #4131
Fate reservations were moved out of memory. This is a WIP PR* for one part that is needed for having Fate be distributed in the future.

  • Reservations for MetaFateStore were moved out of memory into ZooKeeper
  • Reservations for UserFateStore were moved out of memory into the Accumulo Fate table
    • Is an additional column which just indicates whether the FateId of that row is reserved or not
  • Added test MultipleStoresIT
    • Tests the new functionality of how reservations are stored
    • I also included a couple tests in MultipleStoresIT that are testing the reserve/unreserve functionality and that it still works as intended. This can be moved from this class if desired.
  • Verified existing tests in the fate test package (FateIT, FateOpsCommandsIT, FateStoreIT, ZooMutatorIT, FateMutatorImplIT, UserFateStoreIT) still pass

*There are still a few things I have marked as TODO that I would like input/suggestions on. These are marked in the code as "TODO 4131".

- Reservations for MetaFateStore were moved out of memory into ZooKeeper
- Reservations for UserFateStore were moved out of memory into the Accumulo Fate table
- Added test MultipleStoresIT
This commit is one part needed for having Fate be distributed
@kevinrr888
Copy link
Member Author

Marking this as ready for review, since it may not be reviewed otherwise

@kevinrr888 kevinrr888 marked this pull request as ready for review May 7, 2024 15:28
- Combined the LockID and reservation attempt UUID into one new object: FateReservation
- Now cover edge case with tryReserve() where a write may make it to the server, but the server dies before a response is received
- New FATE Thread: DeadReservationCleaner which deletes reservations held by Managers that have since died
- Created new classes: ColumnValueMappingIterator and ReservationMappingIterator. ColumnValueMappingIterator abstracts out the common functionality of ReservationMappingIterator and StatusMappingIterator. ReservationMappingIterator is an iterator used for determining if the reservation column for a FateId has a FateReservation set or not
- Expanded/improved/simplified MultipleStoresIT tests
@kevinrr888
Copy link
Member Author

kevinrr888 commented Jun 3, 2024

I have made the following changes in e6671fb:

  • Combined the LockID and reservation attempt UUID into one new object: FateReservation
  • Now cover edge case with tryReserve() where a write may make it to the server, but the server dies before a response is received
  • New FATE Thread: DeadReservationCleaner which deletes reservations held by Managers that have since died
  • Created new classes: ColumnValueMappingIterator and ReservationMappingIterator. ColumnValueMappingIterator abstracts out the common functionality of ReservationMappingIterator and StatusMappingIterator. ReservationMappingIterator is an iterator used for determining if the reservation column for a FateId has a FateReservation set or not
  • Expanded/improved/simplified MultipleStoresIT tests

There are also some comments/questions labeled "TODO 4131" that I would like input on.

I also have to resolve the merge conflicts which I will do now. (edit: resolved)

Formatting, fixed a test failure occurring with a newly added test that
was merged in (logic error in my existing code)
hadoop version used in the github build (3.0.3), causing a build failure
hadoop version used in the github build (3.0.3), causing a build
failure. This was not fixed with
0809fd3. Should be fixed now.
Changed MultipleStoresIT from using a shaded library in hadoop (Mockito)
to using EasyMock
Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking at this PR. Posting the comments I have so far.

byte[] serialize() {
try (ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos)) {
dos.writeUTF(status.name());
if (isReserved()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a change to make in this PR. Thinking if we push all of the Fate data into a single node it would be nice to use JSON. That would make the data stored in ZK more human readable and it would make it easier to serialize and de-serialize.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good idea

@@ -502,4 +520,14 @@ protected Serializable deserializeTxInfo(TxInfo txInfo, byte[] data) {
throw new IllegalStateException("Bad node data " + txInfo);
}
}

/**
* TODO 4131 this is a temporary method used to create a dummy lock when using a FateStore outside
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a follow on issue. We can have a path in ZK for utilities to get a ZK lock. Currently I think some of the fate admin utilities that change the fate store require that no manager process is running. These utilities could be changed to get a lock at this utility path in ZK and reserve fate txs they operate on instead of requiring no manager process.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that sounds like a great change. And yes you are correct about some of the fate admin utils requiring the Manager to be down, so that would be helpful for those as well.

@Override
public void run() {
while (keepRunning.get()) {
store.deleteDeadReservations();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not looked into it, but need to understand what happens if this thread dies/throws an exception. Accumulo servers processes can designate some threads as critical and that would essentially cause the manager process to die when that is done. May want to do that for this thread, if its not already done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the DeadReservationCleaner dies, the Manager continues to run normally. I changed it to be a critical thread. When looking at this, I noticed that none of the other Fate threads are critical. It seems that maybe they should be? I can create a follow on issue if so. If the WorkFinder or any of the actual workers (TransactionRunner) die, Manager will continue to run normally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be good to open a follow on issue about making all of the fate threads critical.

- Simplified MetaFateStore.isReserved(FateId) (unnecessary error handling)
- Added check in MetaFateStore.FateTxStoreImpl.setStatus() that was missing: needed to ensure that the reservation stored in the store and in ZK were equivalent
- Added some logging to MetaFateStore.FateTxStoreImpl.unreserve()
- No longer using a ConditionalMutation to check if a FateId is reserved for UserFateStore, just scan the table instead
- Made the DeadReservationCleaner a critical thread to the Manager (if it dies, so does the Manager)
- Stores now take a Predicate<ZooUtil.LockID> isLockHeld which is used to determine if the given LockID is live. This considerably simplified the new MultipleStoresIT and the stores themselves.
@kevinrr888
Copy link
Member Author

@keith-turner Thanks for the review. I addressed the suggested changes. This PR is ready for re-review

Copy link
Member

@DomGarguilo DomGarguilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small changes. Feel free to take or leave any

Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still review this, posting the comments I have made so far. Hoping to circle back to this really soon like later today or tomorrow and finish reviewing.

I went through and resolved all the comments I made last time, all of those updates look good. The one I left unresolved were only because those made need follow on issues opened so I did not want to forget about those.

Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed everything but the test changes, plan on looking at those tomorrow and then I will be finished reviewing.

@@ -55,7 +60,7 @@

//TODO use zoocache? - ACCUMULO-1297
//TODO handle zookeeper being down gracefully - ACCUMULO-1297

// TODO 4131 noticed this class is not in the fate.zookeeper package. Should it be?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it probably should be. That would be consistent with UserFateStore, which is in a sub package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's what I thought. This can be done in a follow on issue/PR. This can be kept unresolved until then

- Stricter check of the column value for fate reservations. If anything unexpected is seen, an error is now thrown.
- Simplified MetaFateStore.getActiveReservations() to only read from ZooKeeper once.
- Combined the two scans that were done in AbstractFateStore.runnable() into one. This meant adding FateReservation to FateIdStatus and refactoring Meta/UserFateStore.getTransactions().
- No longer use/store a string representation of the FateReservation (was used in UserFateStore). Now, only the serialized value is used. This keeps the usage of FateReservation consistent across Meta and UserFateStore. It was also unneccessary, so simplifies code.
- Moved AbstractFateStore.createAndReserve() implementation into Meta and UserFateStore, and rewrote the impl for each to work with the new way reservations are stored. This also allowed me to delete methods that were only used by AFS.createAndReserve(): create(FateId, FateKey), getStatusAndKey(FateId), create(FateKey).
- Fixed how concurrentStatusChangeCallers was decremented in AbstractFateStore.waitForStatusChange()
- Small change to MetaFateStore.deleteDeadReservations() to avoid reading from ZK unnecessarily
- Added isReservedBy() method to MetaFateStore.NodeValue to avoid code duplication and make the code more clear.
- Since the FateIdStatus now has the FateReservation, realized Meta and UserFateStore.getActiveReservations() could now be simplified to just call list(). This also made the impls the same, so moved to AbstractFateStore. This also made me realize that I had put getActiveReservations() method signature in FateStore, but would be better suited for ReadOnlyFateStore, so moved it there.
- Deleted FateStore.isReserved(FateId)... No longer needed/used
- Moved UNKNOWN status check in AbstractFateStore.reserve() into waiting loop
- Now log when a dead reservation is detected and deleted
- Minor change to Fate: no longer create the executor for the dead reservation cleaner if it's not going to be used
@kevinrr888
Copy link
Member Author

kevinrr888 commented Jul 23, 2024

@keith-turner, @DomGarguilo
I have addressed all of the review changes, and this PR is ready for re-review. However, some tests are no longer passing (e.g., ComprehensiveIT). I believe this is an issue with my new implementation of createAndReserve. I believe the issue is with the UserFateStore implementation, but could be both. Posting this new commit now to get the other changes out, and to potentially receive feedback on what might be wrong with createAndReserve(). I will continue to look into what the issue might be in the meantime. Please take a look at my new review comments and my new commit message

- New createAndReserve was not working as expected for UserFateStore, fixed the bug
@kevinrr888
Copy link
Member Author

I have found and addressed the bug in my most recent commit. This PR is now ready for full review.
The tests verified to still pass: the sunny day tests, all fate tests (ZooMutatorIT, FateMutatorImplIT, UserFateStoreIT, FateInterleavingIT, FateIT, FateOpsCommandsIT, ComprehensiveFlakyFateIT, DeleteRowsFlakyFateIT, ManagerRepoIT, and MultipleStoresIT (new test added)) except FateStoreIT. FateStoreIT reflects a method that has been removed in this PR, and want to receive feedback about the removal of this method before changes are made to this test. Have a "TODO 4131" marking this so this is not forgotten.

testDeadReservationsCleanup(FateInstanceType.USER);
}

private void testDeadReservationsCleanup(FateInstanceType storeType) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks good. Looking over this test realized we need another test in a follow on PR. Need a test that checks that when the reservation does not exists in the persisted store that every update fails. Test would look something like the following.

    FateStore.FateTxStore<?> txStore = store1.reserve(fateId);
    // TODO delete the reservation in persisted store
    assertThrows(Exception.class, ()->txStore.push(repo));
    assertThrows(Exception.class, ()->txStore.setStatus(ReadOnlyFateStore.TStatus.IN_PROGRESS));
    assertThrows(Exception.class, ()->txStore.setTransactionInfo(t,v));
    //etc test all methods that write

For now may only be able to make this work with the UserFateStore as the current zookeeper schema does not support the conditional checks of reservations on updates.

// TODO 4131 could potentially have separate classes for testing MetaFateStore and UserFateStore
// similar to how FateTestRunner is used, however that interface doesn't work as nicely here
// since we are using multiple stores instead of just one. Can do something similar to
// FateTestRunner here if desired
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. Wonder if we could pass a factory to test like the following.

interface TestFateStoreFactory {
    FateStore create(ZooUtil.LockID lockid);
}

Then test could use the factory to create as many stores as needed. We could run the test passing in an impl that creates MetaFateStore and another that creates UserFateStore.

@kevinrr888
Copy link
Member Author

kevinrr888 commented Aug 6, 2024

@keith-turner I have compiled and wrote all the follow on issues for this to be merged. This covers all the issues I marked in this PR as TODO 4131 and all the unresolved review comments.

Here is a synopsis of the follow on issues I will create (so far - still need some questions answered):

  • Refactor MultipleStoresIT to function more similarly to other Fate tests like FateIT
  • Add FateKey toString() method
  • Move MetaFateStore to org.apache.accumulo.core.fate.zookeeper
  • Deprecate AbstractFateStore.createDummyLockID(): this includes creating a path in ZK for utilities to get a ZK lock, changing admin fate fail and admin fate delete commands to get a LockID at this path and no longer require the Manager to be down, and only use a LockID for a store if write operations are expected: the store should fail on write operations if writes are not expected.
  • Replace AFS.verifyReserved with a condition on the RESERVATION_COLUMN to verify that it is reserved
  • Make WorkFinder and the TransactionRunners critical to the Manager
  • Refactor how the RESERVATION_COLUMN works for UserFateStore: create/delete column on reserve/unreserve
  • Additional fate test case from Fate reservations moved out of memory #4524 (comment)

Here are some questions I still have:

  1. This was an existing TODO that you had marked in your "distributed FATE" WIP PR:
// TODO 4131
// TODO make the max time a function of the number of concurrent callers, as the number of
// concurrent callers increases then increase the max wait time
// TODO could support signaling within this instance for known events
// TODO made the maxWait low so this would be responsive... that may put a lot of load in
// the case there are lots of things waiting...

I made the max wait = curr num callers in seconds. Does this address the first TODO? For the other two TODOs, I was not sure what was to be done, so I left them. Can these be safely deleted without creating follow on issue(s)? If follow on issues should be made, perhaps it would be best for you to write these or explain it here because I am not sure what should be done.

  1. Re one of your review comments:
    "Not a change to make in this PR. Thinking if we push all of the Fate data into a single node it would be nice to use JSON. That would make the data stored in ZK more human readable and it would make it easier to serialize and de-serialize."
    Could you elaborate on what you mean by "all of the Fate data into a single node". By "all" do you mean the Repos, TxInfo, Status, Reservation, and FateKey would be in one ZK node? Writing the issue, I want to be sure I'm understanding correctly.

  2. Can Fate reservations moved out of memory #4524 (comment) be resolved? Or should an issue be made?

  3. There was a TODO regarding whether the ZooUtil.LockID was the best type for the lock. Should I make a follow on issue regarding this or is ZooUtil.LockID okay?

  4. Can this Fate reservations moved out of memory #4524 (comment) be resolved or should a follow on issue be made?

  5. Can these be resolved: Fate reservations moved out of memory #4524 (comment), Fate reservations moved out of memory #4524 (comment)

  6. The DeadReservationCleaner runs every 30 seconds. Is this okay? Should this be longer? Shorter?

Once the above questions are answered, I can complete my full list of new follow-on issues. This will allow me to remove all the TODO 4131 in this PR and all the comments in this PR can be resolved as all of these will be addressed or have a follow on issue created for them.

@keith-turner
Copy link
Contributor

keith-turner commented Aug 6, 2024

I resolved a few of the comments.

  1. There was a TODO regarding whether the ZooUtil.LockID was the best type for the lock. Should I make a follow on issue regarding this or is ZooUtil.LockID okay?

That type seems like the best existing type to use, so I do not think an issue is needed.

The DeadReservationCleaner runs every 30 seconds. Is this okay? Should this be longer? Shorter?

30 seconds seems too short to me. Would be doing a lot of work and usually never finding anything, should probably be longer maybe 2 or 3 minutes.

Could you elaborate on what you mean by "all of the Fate data into a single node". By "all" do you mean the Repos, TxInfo, Status, Reservation, and FateKey would be in one ZK node? Writing the issue, I want to be sure I'm understanding correctly.

I was thinking of putting all data that is related to a single fate id into a single zk node. This will allow atomic updates. It will be slower (always have to rewrite all repos when adding/removing). We are only using it for the metadata table. So thinking it will make things more correct at the expense of speed but that tradeoff is good for metadata related fates.

Can these be safely deleted without creating follow on issue(s)? If follow on issues should be made, perhaps it would be best for you to write these or explain it here because I am not sure what should be done.

Yeah you can delete those. I will look into it and open an issue if needed.

@keith-turner
Copy link
Contributor

@kevinrr888 it seems like the only unresolved conversations are covered by your list of issues to open. I just left those open for now.

@kevinrr888
Copy link
Member Author

kevinrr888 commented Aug 7, 2024

@keith-turner - The TODO 4131 comments have now been removed and I believe this PR is ready to be merged 🎉
I was planning on creating the follow-on issues after this is merged in as a batch of new issues.

I will post a list of all the outstanding changes to be made below. The trivial ones, I will just make a PR for without creating an issue, the more involved follow-ons will be tracked in an issue.

Here is the list of changes to be made:

To be tracked by issue:

  • Refactor MultipleStoresIT to function more similarly to other Fate tests like FateIT
  • Deprecate AbstractFateStore.createDummyLockID(): this includes creating a path in ZK for utilities to get a ZK lock, changing admin fate fail and admin fate delete commands to get a LockID at this path and no longer require the Manager to be down, and only use a LockID for a store if write operations are expected: the store should fail on write operations if writes are not expected.
  • A single ZK node for all fate data for each fate id
  • Replace AFS.verifyReserved with a condition on the RESERVATION_COLUMN to verify that it is reserved
  • Make WorkFinder and the TransactionRunners critical to the Manager
  • Refactor how the RESERVATION_COLUMN works for UserFateStore: create/delete column on reserve/unreserve

Will do in one or a few small PRs:

  • Add a toString() to FateKey
  • Move MetaFateStore to org.apache.accumulo.core.fate.zookeeper
  • Periodic clean up of dead reservations should be increased from every 30 seconds to every few minutes
  • Add additional fate test case suggested in Fate reservations moved out of memory #4524 (comment)

- AbstractFateStore still had a synchronization block for accessing `deferred` which was changed to a synchronized map in this PR; removed since there shouldn't be any more sync blocks in this class.
@kevinrr888
Copy link
Member Author

Resolved the merge conflicts with AbstractFateStore using the new CountDownTimer.
Also noticed a sync block that should have been removed so removed that.

AbstractFateStore should probably be looked at again before this is merged.

@dlmarion dlmarion changed the base branch from elasticity to main August 26, 2024 12:16
@kevinrr888 kevinrr888 linked an issue Sep 11, 2024 that may be closed by this pull request
Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look at the changes since my last review and those all look good.

@kevinrr888 kevinrr888 merged commit d84a3b9 into apache:main Sep 19, 2024
8 checks passed
@kevinrr888
Copy link
Member Author

@kevinrr888 -

This was an existing TODO that you had marked in your "distributed FATE" WIP PR:
// TODO 4131
// TODO make the max time a function of the number of concurrent callers, as the number of
// concurrent callers increases then increase the max wait time
// TODO could support signaling within this instance for known events
// TODO made the maxWait low so this would be responsive... that may put a lot of load in
// the case there are lots of things waiting...
I made the max wait = curr num callers in seconds. Does this address the first TODO? For the other two TODOs, I was not sure what was to be done, so I left them. Can these be safely deleted without creating follow on issue(s)? If follow on issues should be made, perhaps it would be best for you to write these or explain it here because I am not sure what should be done.

@keith-turner -

Yeah you can delete those. I will look into it and open an issue if needed.

This may still need to be looked into just posting this as a reminder just in case

kevinrr888 added a commit to kevinrr888/accumulo that referenced this pull request Oct 31, 2024
This commit makes several improvements/fixes: improvements to the `admin fate` util which were made possible from apache#4524, replaced incorrect use of `createDummyLockID()` in real code (now only used in tests), `<User/Meta>FateStore` now support a null lock id if they will be used as read-only stores: write ops will fail on a store with a null lock, and some other misc. changes.

Full list of changes:
- Removed the check for a dead Manager in the Admin fate util (AdminUtil) which was checked before `admin fate delete <tx>` or `admin fate fail <tx>` was able to run. This check is no longer needed with the changes from apache#4524. apache#4524 moved reservations out of Manager memory into the FATE table (for UserFateStore) and into ZK (for MetaFateStore). Prior to this, the Admin process would have no way of knowing if the Manager process had a transaction reserved, so the Manager had to be shutdown to ensure it was not. But now that reservations are visible to any process, we can try to reserve the transaction in Admin, and if it cannot be reserved and deleted/failed in a reasonable time, we let the user know that the Manager would need to be shutdown if deleting/failing the transaction is still desired.
	- This has several benefits:
	- It is one less thing to worry about when implementing multiple managers in the future since Admin assumes only one Manager for these commands. However, there is still the case where the Manager may
	keep a transaction reserved for a long period of time and the Admin can never reserve it. In this case, we inform the user that the transaction could not be deleted/failed and that if deleting/failing
	is still desired, the Manager may need to be shutdown.
	- It covers a potential issue in the previously existing code where there was nothing stopping or ensuring a Manager is not started after the check is already performed in Admin but before the delete/
	fail was executed.
	- It also should make the commands easier to use now since the Manager is not required to be shutdown before use.
- Changes and adds some tests for `admin fate fail` and `admin fate delete`: ensures the Manager is not required to be down to fail/delete a transaction, and ensures that if the Manager does have a transaction reserved, admin will fail to reserve and fail/delete the transaction.
- Another change which was needed as a prerequisite for the above changes was creating a ZK lock for Admin so transactions can be properly reserved by the command. Added new constant `Constants.ZADMIN_LOCK = "/admin/lock"`, changed `ServiceLockPaths`, and added `Admin.createAdminLock()` to support this
- New class `TestLock` (in test package) which is used by tests to create a real ZK lock, or a dummy one. Removed `createDummyLockID()` from `AbstractFateStore` (moved to TestLock), and `createDummyLock()` is now only used in test code. Added new constant `ZTEST_LOCK = "/test/lock"`, changed `ServiceLockPaths`, and added `createTestLock()` which is used to create a real lock id (one held in ZK) which is needed for some tests.
	- This fixes an unexpected failure that could have occurred for `ExternalCompaction_1_IT`. Was using a dummy lock for the store before and the fate data was being stored in the same locations that the
	Manager uses for it's fate data. The DeadReservationCleaner running in Manager would have cleaned up reservations created using this store if it ran when reservations were present. Now the test creates
	a real ZK lock so the DeadReservationCleaner won't clean these up unexpectedly.
- Stores now support a null lock id for the situation where they will be used as read-only stores. A store with a null lock id will fail on write ops. Changed all existing uses of stores to only have a lock id if writes will occur (previously, all instances of the stores had a lock id).
- Removed unused or unneccesary constructors for AbstractFateStore, MetaFateStore, UserFateStore
- Ensured all tests changed, all FATE tests, and sunny day tests still pass

closes apache#4904
@kevinrr888 kevinrr888 deleted the elasticity-feature-4131 branch November 1, 2024 15:04
kevinrr888 added a commit that referenced this pull request Jan 16, 2025
…ed (#5028)

* `admin fate` improvements, `LockID`s use for fate stores improved/fixed

This commit makes several improvements/fixes: improvements to the `admin fate` util which were made possible from #4524, replaced incorrect use of `createDummyLockID()` in real code (now only used in tests), `<User/Meta>FateStore` now support a null lock id if they will be used as read-only stores: write ops will fail on a store with a null lock, and some other misc. changes.

Full list of changes:
- Removed the check for a dead Manager in the Admin fate util (AdminUtil) which was checked before `admin fate delete <tx>` or `admin fate fail <tx>` was able to run. This check is no longer needed with the changes from #4524. #4524 moved reservations out of Manager memory into the FATE table (for UserFateStore) and into ZK (for MetaFateStore). Prior to this, the Admin process would have no way of knowing if the Manager process had a transaction reserved, so the Manager had to be shutdown to ensure it was not. But now that reservations are visible to any process, we can try to reserve the transaction in Admin, and if it cannot be reserved and deleted/failed in a reasonable time, we let the user know that the Manager would need to be shutdown if deleting/failing the transaction is still desired.
	- This has several benefits:
	- It is one less thing to worry about when implementing multiple managers in the future since Admin assumes only one Manager for these commands. However, there is still the case where the Manager may
	keep a transaction reserved for a long period of time and the Admin can never reserve it. In this case, we inform the user that the transaction could not be deleted/failed and that if deleting/failing
	is still desired, the Manager may need to be shutdown.
	- It covers a potential issue in the previously existing code where there was nothing stopping or ensuring a Manager is not started after the check is already performed in Admin but before the delete/
	fail was executed.
	- It also should make the commands easier to use now since the Manager is not required to be shutdown before use.
- Changes and adds some tests for `admin fate fail` and `admin fate delete`: ensures the Manager is not required to be down to fail/delete a transaction, and ensures that if the Manager does have a transaction reserved, admin will fail to reserve and fail/delete the transaction.
- Another change which was needed as a prerequisite for the above changes was creating a ZK lock for Admin so transactions can be properly reserved by the command. Added new constant `Constants.ZADMIN_LOCK = "/admin/lock"`, changed `ServiceLockPaths`, and added `Admin.createAdminLock()` to support this
- New class `TestLock` (in test package) which is used by tests to create a real ZK lock, or a dummy one. Removed `createDummyLockID()` from `AbstractFateStore` (moved to TestLock), and `createDummyLock()` is now only used in test code. Added new constant `ZTEST_LOCK = "/test/lock"`, changed `ServiceLockPaths`, and added `createTestLock()` which is used to create a real lock id (one held in ZK) which is needed for some tests.
	- This fixes an unexpected failure that could have occurred for `ExternalCompaction_1_IT`. Was using a dummy lock for the store before and the fate data was being stored in the same locations that the
	Manager uses for it's fate data. The DeadReservationCleaner running in Manager would have cleaned up reservations created using this store if it ran when reservations were present. Now the test creates
	a real ZK lock so the DeadReservationCleaner won't clean these up unexpectedly.
- Stores now support a null lock id for the situation where they will be used as read-only stores. A store with a null lock id will fail on write ops. Changed all existing uses of stores to only have a lock id if writes will occur (previously, all instances of the stores had a lock id).
- Removed unused or unneccesary constructors for AbstractFateStore, MetaFateStore, UserFateStore
- Ensured all tests changed, all FATE tests, and sunny day tests still pass

closes #4904
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Move reservations in FATE from memory into the persisted store
4 participants