Explore storing FATE data in the metadata table #3559

Closed
keith-turner opened this issue Jun 30, 2023 · 14 comments

Comments

@keith-turner
Contributor

keith-turner commented Jun 30, 2023

FATE operations store their data in ZooKeeper. This is fine when FATE is used for a relatively low volume of operations. However, in #3462 a use case arose where using FATE to commit compactions would have been nice, but committing compactions was deemed too high volume for ZooKeeper. So as part of #3462 a simple ~refresh metadata section was created to ensure that tablet refresh happens after compaction commit even when the process dies between commit and refresh. This is similar in functionality to FATE, but much simpler and more specialized. Using FATE for this use case would be nice as it would reduce custom code in Accumulo.

The ~refresh entry introduced as part of #3462 is stored in ZK, the root table, or the metadata table depending on the tablet that is committing a compaction. Could FATE data be stored in a similar way? Specifically, could the following be done (a rough routing sketch follows the list)?

  • For root table and system FATE operations, store the FATE data in ZK
  • For metadata table FATE operations, store the FATE data in the root table
  • For user table FATE operations, store the FATE data in the metadata table
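
As a very rough illustration of that routing idea, here is a minimal sketch, assuming a hypothetical FateStoreLevel/FateStoreRouter (neither exists in Accumulo) and the 2.1-style RootTable.ID / MetadataTable.ID constants:

```java
import org.apache.accumulo.core.data.TableId;
import org.apache.accumulo.core.metadata.MetadataTable;
import org.apache.accumulo.core.metadata.RootTable;

// Hypothetical sketch: pick where a FATE operation's data lives based on the
// level of the table it operates on, mirroring how the ~refresh entry is placed.
enum FateStoreLevel { ZOOKEEPER, ROOT_TABLE, METADATA_TABLE }

class FateStoreRouter {
  static FateStoreLevel levelFor(TableId tableId) {
    if (tableId == null || RootTable.ID.equals(tableId)) {
      return FateStoreLevel.ZOOKEEPER;     // root table and system ops -> ZK
    }
    if (MetadataTable.ID.equals(tableId)) {
      return FateStoreLevel.ROOT_TABLE;    // metadata table ops -> root table
    }
    return FateStoreLevel.METADATA_TABLE;  // user table ops -> metadata table
  }
}
```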

Currently many FATE operations are tightly coupled to table locks. These would also need to be stored in the metadata table; it is not clear whether that could be done using conditional mutations. That is an example of something that could cause problems for this strategy.

The benefit of this change is that it would open FATE up to more use cases and would place less memory and write pressure on ZooKeeper for current use cases. However, if #3462 is the only use case that needs this, it may not be worthwhile.

Another possible use case that could benefit from this is #3382, where system-initiated splits are now executed as FATE operations. System-initiated splits would not normally cause the sustained high levels of metadata activity that compactions would, but they could cause bursts of metadata activity that place a lot of write and memory pressure on ZK, for example if a table decides all of its tablets need to split around the same time.

Also wondering if this could help with making multiple managers possible by making it easier to partition the FATE operations and data.

The purpose of this issue is to decide if this is workable and worthwhile.

@keith-turner keith-turner converted this from a draft issue Jun 30, 2023
@keith-turner
Contributor Author

keith-turner commented Jun 30, 2023

User compactions are currently driven by a FATE operation; however, those create ZK nodes that scale with the number of compaction requests coming in via the API. User compactions do not create ZK nodes for each tablet in a compaction request. For example, if a user-initiated compaction has a range that covers 100K tablets, it will create a few ZK nodes but not N*100K ZK nodes. The compaction commit process in #3462 is a per tablet operation, so for it to use FATE it would need to create a FATE operation per tablet. This is why it was deemed too much for ZK.

It's best to avoid placing any per tablet information in ZK, as that assumes all tablet metadata can fit in memory on a single machine, which is an assumption Accumulo tries to avoid. That does make the changes in #3382 to use FATE for per tablet splits kind of iffy; however, that could be solved by putting a cap on the number of active split operations, which FATE does not currently support (a rough sketch of such a cap follows). That may be ok for splits, but it is not clear we would want to bottleneck the number of compactions that could concurrently commit in that way.
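
To make the cap idea concrete, a minimal sketch, assuming a hypothetical gate placed in front of seeding split FATE operations (FATE itself has no such feature today; all names here are made up):

```java
import java.util.concurrent.Semaphore;

// Hypothetical: bound how many system-initiated split FATE operations can be
// active at once, so a burst of splits cannot flood the FATE store.
class SplitFateGate {
  private final Semaphore permits;

  SplitFateGate(int maxActiveSplits) {
    permits = new Semaphore(maxActiveSplits);
  }

  // Try to reserve a slot before seeding a new split FATE op; returns false
  // if the cap is already reached and the split should be retried later.
  boolean tryAcquire() {
    return permits.tryAcquire();
  }

  // Release the slot when the split FATE op completes (success or failure).
  void release() {
    permits.release();
  }
}
```

A cap like this may be acceptable for splits, but as noted above it would be more questionable for compaction commits, since it would limit how many compactions could commit concurrently.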

@ctubbsii
Member

You tagged this with the elasticity project, but it seems to me that this could be an independent development effort. It doesn't seem strictly tied to the elasticity effort, though that effort would benefit from it. This could be done as an independent intermediate step toward the elasticity effort, rather than as something tightly coupled and interdependent with it.

Also, related to this idea, there was an idea to change the serialization of Fate storage to use Java 17 record types. That doesn't impact where we store the data... just how it's serialized when we do store it. But, if somebody were to work on this, they could take that proposed change into account, if they had to change how things were serialized to use the metadata table.

@keith-turner
Contributor Author

Also, related to this idea, there was an idea to change the serialization of Fate storage to use Java 17 record types.

No, but that sounds really nice.

Also, something unrelated is prioritization of FATE operations. This is also driven by elasticity. Currently the system-initiated split FATE operations are mixed in with user-initiated FATE ops. Ideally the user-initiated operations would get some sort of priority. If FATE ops were stored in the metadata table, then maybe the schema could naturally support prioritization and/or different queues in its sort order (a hypothetical key layout is sketched below).
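
Purely as a hypothetical illustration of what "prioritization in the sort order" could mean, a FATE row key could lead with a queue/priority component so a sequential scan returns user-initiated work before system-initiated work. The ~fate prefix, priority encoding, and queue names below are invented for this sketch, not an existing or proposed schema:

```java
// Hypothetical key layout: lower priority values sort first, so a scan over
// the FATE section sees user-initiated operations before system-initiated ones.
static String fateRowKey(int priority, String queue, long fateTxId) {
  // e.g. fateRowKey(0, "user", 0x1234L)  -> "~fate:00:user:0000000000001234"
  //      fateRowKey(5, "split", 0x9999L) -> "~fate:05:split:0000000000009999"
  return String.format("~fate:%02d:%s:%016x", priority, queue, fateTxId);
}
```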

@keith-turner
Contributor Author

@ctubbsii I think using records for serialization deserves its own issue if it does not have one.

@keith-turner
Contributor Author

keith-turner commented Jun 30, 2023

Taking a quick look at how table locks currently work, I think they may work as conditional mutations. The table locks do not use ephemeral nodes or notifications, features that only ZK has. Fate threads will try to get a lock and, if they cannot, jump to another fate operation, so they never wait on a lock in a way that would need ZK notifications. This polling approach would work with conditional mutations. The lock nodes in ZK are tied to a FATE operation and not to a running process, which is why ephemeral nodes are not used. The server process locks in ZK do use ephemeral nodes.
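
A minimal sketch of that polling approach using the existing conditional-mutation client API (ConditionalWriter, ConditionalMutation, and Condition are the real classes; the ~lock row and lock:holder column layout is invented here for illustration):

```java
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;

// Hypothetical: attempt to take a table lock for a FATE transaction by writing
// a lock column only if no lock column currently exists for that table.  If the
// write is rejected, the fate thread moves on and polls again later.
static boolean tryLockTable(AccumuloClient client, String lockTable,
    String tableId, long fateTxId) throws Exception {
  try (ConditionalWriter cw =
      client.createConditionalWriter(lockTable, new ConditionalWriterConfig())) {
    ConditionalMutation cm = new ConditionalMutation("~lock:" + tableId);
    // with no value set, this condition requires the column to be absent
    cm.addCondition(new Condition("lock", "holder"));
    cm.put("lock", "holder", Long.toHexString(fateTxId));
    return cw.write(cm).getStatus() == ConditionalWriter.Status.ACCEPTED;
  }
}
```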

@keith-turner
Contributor Author

Bulk may benefit from this feature as it would allow Fate ops driving bulk imports for user tables to be stored in the metadata table. For any system with lots of bulk imports, this would take pressure off of ZK.

@ctubbsii
Member

@ctubbsii I think using records for serialization deserves its own issue if it does not have one.

It was mentioned on #2473 (comment), which is an overarching issue about refactoring FATE. This issue and that one probably overlap a lot, as they are both very high level and not clear on details. There may be smaller bits that can be separated out and worked on independently. However, I don't want to create any new issues, given that there are already two broadly themed ones and it's not clear what the path forward is going to be. I think the ideas between the two tickets need to be thought through so a clear design can be proposed. At that point, it might make sense to create sub-issues to track the individual tasks as part of working towards the overall goal.

Another thing to consider: we have a whole accumulo.* namespace. If the metadata table isn't appropriate, then it's possible another table in that namespace could be... as long as we don't create any bootstrapping issues between which tables come online first.

@keith-turner
Contributor Author

This issue and that one probably overlap a lot, as they are both very high level and not clear on details.

I think this issue is very narrowly defined; it's about making FATE's storage more scalable so that FATE can be used more frequently than it currently is. Below are the details and background.

  • All writes to ZooKeeper must go through a single server with a single write-ahead log and then be pushed to a quorum of other servers. Therefore writes to ZK are not scalable (ZK is more scalable on the read side).
  • FATE operations can be write-heavy ZK operations. This could negatively impact the general health of ZK, on which Accumulo depends for ZK's more scalable read functionality.
  • The metadata table is much more scalable for writes, as each tablet server has its own walog and metadata tablets can span many tablet servers.
  • Using the metadata table for storage of FATE would increase the number of FATE operations that could execute concurrently and reduce the general write load on ZK.
  • Making FATE storage more scalable would open it up to being used for operations it is currently not suitable for, like per tablet compaction commit. It may also help improve user operations like bulk import at scale.

@keith-turner
Contributor Author

In addition to walog scalability, using the metadata table as storage for fate operations is also more scalable in terms of memory usage. Everything stored in ZK must be able to fit in memory on a single machine. Storing data in the metadata table does not have this constraint, and it can leverage much more memory for caching data across many tservers.

@cshannon
Contributor

cshannon commented Oct 8, 2023

I have started looking into this and have started thinking about a schema design for storing FATE. @keith-turner and I talked offline, and one option is to actually store FATE data in its own table rather than the metadata table, which might be cleaner.
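
Purely as a strawman for that discussion, a dedicated FATE table could get by with a very small schema, for example one row per transaction. The layout below is invented here for illustration and is not an actual proposal; the status values echo the existing FATE TStatus names:

```
row: <fate tx id, zero-padded hex>
  tx:status     -> NEW | IN_PROGRESS | FAILED_IN_PROGRESS | FAILED | SUCCESSFUL | ...
  tx:createTime -> creation time in millis
  tx:txinfo     -> serialized operation-level info (op name, exception, return value)
  repo:<seq>    -> serialized Repo for each level of the operation's stack
```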

@cshannon
Contributor

@ctubbsii - You had mentioned in an earlier comment doing this as an independent effort outside of elasticity, but @keith-turner had some good points about why this might be better to do as part of elasticity. I've only just begun to look at this issue and am still trying to figure out what all is involved here, so he could probably explain the reasons better. (Maybe Keith can respond here or it can be discussed on slack/dev list, etc.) Mostly I just wanted to bring it up because, now that I am starting to look into this, it would be good to get on the same page as to what part of the work is being targeted at main vs elasticity to avoid any issues down the road.

@keith-turner
Contributor Author

keith-turner commented Oct 10, 2023

The following are reasons why I think it makes sense to do this in elasticity.

@ctubbsii
Member

@cshannon, in general I prefer to keep features as independent as possible. I think this is good development practice, and it results in more modular code design, which is easier to maintain. It also helps with release and feature planning when things aren't so co-dependent and tightly coupled. I'm fine with targeting elasticity, though, if the benefits of doing so are substantial.

On the two points @keith-turner mentioned:

  • The conditional mutation work isn't done yet, nor is the removal of table locks, but that's also a feature that could (possibly) be done outside of the elasticity branch. It's also possible that conditional mutations can be used for this feature without the other metadata table work using conditional mutations yet, since conditional mutations are a per-write feature, not a per-table feature. I agree that there's a good chance this may be easier or save some work by starting in the elasticity branch, because of how it relates to those things, but right now I think this is all very speculative, as there are so many things in flight. I think it's worth considering those things, but also making an effort to think of how it can be done independently as the exploration of this feature proceeds.
  • The use cases for this may primarily exist in the elasticity branch, but that's not an argument against doing this incrementally with some progress in the main branch. If the dependency is one way, there's no reason we couldn't make some incremental changes in the main branch that are utilized more heavily later. Right now, I think it's too early to say what that might look like, since this is still at an exploratory stage.

I'm okay with starting the exploration in elasticity branch, but please consider putting in some effort to make things modular and work towards incremental progress, wherever possible. If it doesn't strictly need to be implemented in the elasticity branch, I'd prefer it not be. I'd much rather make incremental progress in the main branch than dump everything into elasticity, with an all-or-nothing approach... if we can help it (we might not be able to help it... it may still make sense to bundle it all in elasticity, and if so, I'll accept that... I just don't think it's inevitable yet).

@ddanielr
Contributor

There are also specific system operations that are table agnostic.

Shutting down a tserver creates a FATE operation to accomplish that task.
https://github.com/apache/accumulo/blob/main/server/manager/src/main/java/org/apache/accumulo/manager/ManagerClientServiceHandler.java#L322-L323

If that tserver is causing system issues, this specific FATE operation might need to be prioritized over other FATE types.

It might help to separate the FATEs into specific categories to determine proper storage locations.

keith-turner added a commit to keith-turner/accumulo that referenced this issue Dec 21, 2023
keith-turner added a commit that referenced this issue Jan 13, 2024
The custom refresh tracking code was removed and compaction commit was
moved to being a FATE operation with the following 4 steps.

 1. Rename file done in RenameCompactionFile class
 2. Update the metadata table via a conditional mutation done in
   CommitCompaction class
 3. Write the gc candidates done in PutGcCandidates class
 4. Optionally send a RPC refresh request if the tablet was hosted
    done in RefreshTablet class

There is some follow on work that still needs to be done to improve
how this change works with detecting dead compactions.  After that is
done these changes should address problems outlined #3811 and #3802
that were related to process death before adding GC candidates.  Now
that GC candidates are written in FATE, if it dies it will run again
later.

This is currently storing the compaction commit FATE operations in
zookeeper.  This would not be suitable for a cluster because per tablet
information should never be stored in zookeeper.  However it's fine as a
temporary situation in the elasticity branch until FATE storage is
available in an accumulo table, see #4049 and #3559
@cshannon cshannon closed this as completed Feb 7, 2024
@ctubbsii ctubbsii added this to the 4.0.0 milestone Jul 12, 2024