Explore storing FATE data in the metadata table #3559

Closed
keith-turner opened this issue Jun 30, 2023 · 14 comments

Comments

@keith-turner
Contributor

keith-turner commented Jun 30, 2023

FATE operations store their data in ZooKeeper. This is fine when FATE is used for a relatively low volume of operations. However, in #3462 a use case arose where using FATE to commit compactions would have been nice, but committing compactions was deemed too high volume for ZooKeeper. So as part of #3462 a simple ~refresh metadata section was created to ensure that tablet refresh happens after compaction commit even when the process dies between commit and refresh. This is similar in functionality to FATE, but much simpler and more specialized. Using FATE for this use case would be nice as it would reduce custom code in Accumulo.

The ~refresh entry introduced as part of #3462 is stored in ZK, the root table, or the metadata table depending on the tablet that is committing a compaction. Could FATE data be stored in a similar way? Specifically, could the following be done (a rough routing sketch follows the list)?

  • For root table and system FATE operations, store the FATE data in ZK
  • For metadata table FATE operations, store the FATE data in the root table
  • For user table FATE operations, store the FATE data in the metadata table
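
As a very rough illustration of that routing idea, here is a minimal sketch, assuming a hypothetical FateStoreLevel/FateStoreRouter (neither exists in Accumulo) and the 2.1-style RootTable.ID / MetadataTable.ID constants:

```java
import org.apache.accumulo.core.data.TableId;
import org.apache.accumulo.core.metadata.MetadataTable;
import org.apache.accumulo.core.metadata.RootTable;

// Hypothetical sketch: pick where a FATE operation's data lives based on the
// level of the table it operates on, mirroring how the ~refresh entry is placed.
enum FateStoreLevel { ZOOKEEPER, ROOT_TABLE, METADATA_TABLE }

class FateStoreRouter {
  static FateStoreLevel levelFor(TableId tableId) {
    if (tableId == null || RootTable.ID.equals(tableId)) {
      return FateStoreLevel.ZOOKEEPER;     // root table and system ops -> ZK
    }
    if (MetadataTable.ID.equals(tableId)) {
      return FateStoreLevel.ROOT_TABLE;    // metadata table ops -> root table
    }
    return FateStoreLevel.METADATA_TABLE;  // user table ops -> metadata table
  }
}
```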

Currently many FATE operations are tightly coupled to table locks. These would also need to be stored in the metadata table; it is not clear whether that could be done using conditional mutations. That is an example of something that could cause problems for this strategy.

The benefit of this change is that it would open FATE up to more use cases and would place less memory and write pressure on ZooKeeper for current use cases. However, if #3462 is the only use case that needs this, it may not be worthwhile.

Another possible use case that could benefit from this is #3382, where system-initiated splits are now executed as FATE operations. System-initiated splits would not normally cause the sustained high levels of metadata activity that compactions would, but they could cause bursts of metadata activity that place a lot of write and memory pressure on ZK, for example if a table decides all of its tablets need to split around the same time.

Also wondering if this could help with making multiple managers possible by making it easier to partition the FATE operations and data.

The purpose of this issue is to decide if this is workable and worthwhile.

@keith-turner keith-turner converted this from a draft issue Jun 30, 2023
@keith-turner
Contributor Author

keith-turner commented Jun 30, 2023

User compactions are currently driven by a FATE operation; however, those create ZK nodes that scale with the number of compaction requests coming in via the API. User compactions do not create ZK nodes for each tablet in a compaction request. For example, if a user-initiated compaction has a range that covers 100K tablets, it will create a few ZK nodes but not N*100K ZK nodes. The compaction commit process in #3462 is a per tablet operation, so for it to use FATE it would need to create a FATE operation per tablet. This is why it was deemed too much for ZK.

It's best to avoid placing any per tablet information in ZK, as that assumes all tablet metadata can fit in memory on a single machine, which is an assumption Accumulo tries to avoid. That does make the changes in #3382 to use FATE for per tablet splits kind of iffy; however, that could be solved by putting a cap on the number of active split operations, which FATE does not currently support (a rough sketch of such a cap follows). That may be ok for splits, but it is not clear we would want to bottleneck the number of compactions that could concurrently commit in that way.
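
To make the cap idea concrete, a minimal sketch, assuming a hypothetical gate placed in front of seeding split FATE operations (FATE itself has no such feature today; all names here are made up):

```java
import java.util.concurrent.Semaphore;

// Hypothetical: bound how many system-initiated split FATE operations can be
// active at once, so a burst of splits cannot flood the FATE store.
class SplitFateGate {
  private final Semaphore permits;

  SplitFateGate(int maxActiveSplits) {
    permits = new Semaphore(maxActiveSplits);
  }

  // Try to reserve a slot before seeding a new split FATE op; returns false
  // if the cap is already reached and the split should be retried later.
  boolean tryAcquire() {
    return permits.tryAcquire();
  }

  // Release the slot when the split FATE op completes (success or failure).
  void release() {
    permits.release();
  }
}
```

A cap like this may be acceptable for splits, but as noted above it would be more questionable for compaction commits, since it would limit how many compactions could commit concurrently.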

@ctubbsii
Member

You tagged this with the elasticity project, but it seems to me that this could be an independent development effort. It doesn't seem strictly tied to the elasticity effort, though that effort would benefit from it. This could be done as an independent intermediate step toward the elasticity effort, rather than as something tightly coupled and interdependent with it.

Also, related to this idea, there was an idea to change the serialization of Fate storage to use Java 17 record types. That doesn't impact where we store the data... just how it's serialized when we do store it. But, if somebody were to work on this, they could take that proposed change into account, if they had to change how things were serialized to use the metadata table.

@keith-turner
Contributor Author

Also, related to this idea, there was an idea to change the serialization of Fate storage to use Java 17 record types.

No, but that sounds really nice.

Also, something unrelated is prioritization of FATE operations. This is also driven by elasticity. Currently the system-initiated split FATE operations are mixed in with user-initiated FATE ops. Ideally the user-initiated operations would get some sort of priority. If FATE ops were stored in the metadata table, then maybe the schema could naturally support prioritization and/or different queues in its sort order (a hypothetical key layout is sketched below).
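
Purely as a hypothetical illustration of what "prioritization in the sort order" could mean, a FATE row key could lead with a queue/priority component so a sequential scan returns user-initiated work before system-initiated work. The ~fate prefix, priority encoding, and queue names below are invented for this sketch, not an existing or proposed schema:

```java
// Hypothetical key layout: lower priority values sort first, so a scan over
// the FATE section sees user-initiated operations before system-initiated ones.
static String fateRowKey(int priority, String queue, long fateTxId) {
  // e.g. fateRowKey(0, "user", 0x1234L)  -> "~fate:00:user:0000000000001234"
  //      fateRowKey(5, "split", 0x9999L) -> "~fate:05:split:0000000000009999"
  return String.format("~fate:%02d:%s:%016x", priority, queue, fateTxId);
}
```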

@keith-turner
Contributor Author

@ctubbsii I think using records for serialization deserves its own issue if it does not have one.

@keith-turner
Contributor Author

keith-turner commented Jun 30, 2023

Taking a quick look at how table locks currently work, I think they may work as conditional mutations. The table locks do not use ephemeral nodes or notifications, features that only ZK has. Fate threads will try to get a lock and, if they cannot, jump to another fate operation, so they never wait on a lock in a way that would need ZK notifications. This polling approach would work with conditional mutations. The lock nodes in ZK are tied to a FATE operation and not to a running process, which is why ephemeral nodes are not used. The server process locks in ZK do use ephemeral nodes.
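
A minimal sketch of that polling approach using the existing conditional-mutation client API (ConditionalWriter, ConditionalMutation, and Condition are the real classes; the ~lock row and lock:holder column layout is invented here for illustration):

```java
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;

// Hypothetical: attempt to take a table lock for a FATE transaction by writing
// a lock column only if no lock column currently exists for that table.  If the
// write is rejected, the fate thread moves on and polls again later.
static boolean tryLockTable(AccumuloClient client, String lockTable,
    String tableId, long fateTxId) throws Exception {
  try (ConditionalWriter cw =
      client.createConditionalWriter(lockTable, new ConditionalWriterConfig())) {
    ConditionalMutation cm = new ConditionalMutation("~lock:" + tableId);
    // with no value set, this condition requires the column to be absent
    cm.addCondition(new Condition("lock", "holder"));
    cm.put("lock", "holder", Long.toHexString(fateTxId));
    return cw.write(cm).getStatus() == ConditionalWriter.Status.ACCEPTED;
  }
}
```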

@keith-turner
Contributor Author

Bulk may benefit from this feature as it would allow Fate ops driving bulk imports for user tables to be stored in the metadata table. For any system with lots of bulk imports, this would take pressure off of ZK.

@ctubbsii
Member

@ctubbsii I think using records for serialization deserves its own issue if it does not have one.

It was mentioned on #2473 (comment), which is an overarching issue about refactoring FATE. This issue and that one probably overlap a lot, as they are both very high level and not clear on details. There may be smaller bits that can be separated out and worked on independently. However, I don't want to create any new issues, given that there are already two broadly themed ones and it's not clear what the path forward is going to be. I think the ideas between the two tickets need to be thought through so a clear design can be proposed. At that point, it might make sense to create sub-issues to track the individual tasks as part of working towards the overall goal.

Another thing to consider: we have a whole accumulo.* namespace. If the metadata table isn't appropriate, then it's possible another table in that namespace could be... as long as we don't create any bootstrapping issues between which tables come online first.

@keith-turner
Contributor Author

This issue and that one probably overlap a lot, as they are both very high level and not clear on details.

I think this issue is very narrowly defined; it's about making FATE's storage more scalable so that FATE can be used more frequently than it currently is. Below are the details and background.

  • All writes to ZooKeeper must go through a single server with a single write-ahead log and then be pushed to a quorum of other servers. Therefore writes to ZK are not scalable (ZK is more scalable on the read side).
  • FATE operations can be write-heavy ZK operations. This could negatively impact the general health of ZK, on which Accumulo depends for ZK's more scalable read functionality.
  • The metadata table is much more scalable for writes, as each tablet server has its own walog and metadata tablets can span many tablet servers.
  • Using the metadata table for storage of FATE would increase the number of FATE operations that could execute concurrently and reduce the general write load on ZK.
  • Making FATE storage more scalable would open it up to being used for operations it is currently not suitable for, like per tablet compaction commit. It may also help improve user operations like bulk import at scale.

@keith-turner
Contributor Author

In addition to walog scalability, using the metadata table as storage for fate operations is also more scalable in terms of memory usage. Everything stored in ZK must be able to fit in memory on a single machine. Storing data in the metadata table does not have this constraint, and it can leverage much more memory for caching data across many tservers.

@cshannon
Contributor

cshannon commented Oct 8, 2023

I have started looking into this and have started thinking about a schema design for storing FATE. @keith-turner and I talked offline, and one option is to actually store FATE data in its own table rather than the metadata table, which might be cleaner.
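
Purely as a strawman for that discussion, a dedicated FATE table could get by with a very small schema, for example one row per transaction. The layout below is invented here for illustration and is not an actual proposal; the status values echo the existing FATE TStatus names:

```
row: <fate tx id, zero-padded hex>
  tx:status     -> NEW | IN_PROGRESS | FAILED_IN_PROGRESS | FAILED | SUCCESSFUL | ...
  tx:createTime -> creation time in millis
  tx:txinfo     -> serialized operation-level info (op name, exception, return value)
  repo:<seq>    -> serialized Repo for each level of the operation's stack
```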

@cshannon
Contributor

@ctubbsii - You had mentioned in an earlier comment doing this as an independent effort outside of elasticity, but @keith-turner had some good points about why this might be better to do as part of elasticity. I've only just begun to look at this issue and am still trying to figure out what all is involved here, so he could probably explain the reasons better. (Maybe Keith can respond here or it can be discussed on slack/dev list, etc.) Mostly I just wanted to bring it up because, now that I am starting to look into this, it would be good to get on the same page as to what part of the work is being targeted at main vs elasticity to avoid any issues down the road.

@keith-turner
Contributor Author

keith-turner commented Oct 10, 2023

The following are reasons why I think it makes sense to do this in elasticity.

@ctubbsii
Member

@cshannon, in general I prefer to keep features as independent as possible. I think this is good development practice, and it results in more modular code design, which is easier to maintain. It also helps with release and feature planning when things aren't so co-dependent and tightly coupled. I'm fine with targeting elasticity, though, if the benefits of doing so are substantial.

On the two points @keith-turner mentioned:

  • The conditional mutation work isn't done yet, nor is the removal of table locks, but that's also a feature that could (possibly) be done outside of the elasticity branch. It's also possible that conditional mutations can be used for this feature without the other metadata table work using conditional mutations yet, since conditional mutations are a per-write feature, not a per-table feature. I agree that there's a good chance this may be easier or save some work by starting in the elasticity branch, because of how it relates to those things, but right now I think this is all very speculative, as there are so many things in flight. I think it's worth considering those things, but also making an effort to think of how it can be done independently as the exploration of this feature proceeds.
  • The use cases for this may primarily exist in the elasticity branch, but that's not an argument against doing this incrementally with some progress in the main branch. If the dependency is one way, there's no reason we couldn't make some incremental changes in the main branch that are utilized more heavily later. Right now, I think it's too early to say what that might look like, since this is still at an exploratory stage.

I'm okay with starting the exploration in elasticity branch, but please consider putting in some effort to make things modular and work towards incremental progress, wherever possible. If it doesn't strictly need to be implemented in the elasticity branch, I'd prefer it not be. I'd much rather make incremental progress in the main branch than dump everything into elasticity, with an all-or-nothing approach... if we can help it (we might not be able to help it... it may still make sense to bundle it all in elasticity, and if so, I'll accept that... I just don't think it's inevitable yet).

@ddanielr
Contributor

There are also specific system operations that are table agnostic.

Shutting down a tserver creates a FATE operation to accomplish that task.
https://github.com/apache/accumulo/blob/main/server/manager/src/main/java/org/apache/accumulo/manager/ManagerClientServiceHandler.java#L322-L323

If that tserver is causing system issues, this specific FATE operation might need to be prioritized over other FATE types.

It might help to separate the FATEs into specific categories to determine proper storage locations.

keith-turner added a commit to keith-turner/accumulo that referenced this issue Dec 21, 2023
keith-turner added a commit that referenced this issue Jan 13, 2024
The custom refresh tracking code was removed and compaction commit was
moved to being a FATE operation with the following 4 steps.

 1. Rename file done in RenameCompactionFile class
 2. Update the metadata table via a conditional mutation done in
   CommitCompaction class
 3. Write the gc candidates done in PutGcCandidates class
 4. Optionally send a RPC refresh request if the tablet was hosted
    done in RefreshTablet class

There is some follow on work that still needs to be done to improve
how this change works with detecting dead compactions.  After that is
done these changes should address problems outlined #3811 and #3802
that were related to process death before adding GC candidates.  Now
that GC candidates are written in FATE, if it dies it will run again
later.

This is currently storing the compaction commit FATE operations in
zookeeper.  This would not be suitable for a cluster because per tablet
information should never be stored in zookeeper.  However it's fine as a
temporary situation in the elasticity branch until FATE storage is
available in an accumulo table, see #4049 and #3559
@cshannon cshannon closed this as completed Feb 7, 2024
@ctubbsii ctubbsii added this to the 4.0.0 milestone Jul 12, 2024