Explore storing FATE data in the metadata table #3559
Comments
User compactions are currently driven by FATE operations, but those create ZK nodes that scale with the number of compaction requests coming in via the API. User compactions do not create ZK nodes for each tablet in a compaction request; for example, if a user-initiated compaction has a range that covers 100K tablets, it will create a few ZK nodes but not N*100K ZK nodes. The compaction commit process in #3462 is a per-tablet operation, so for it to use FATE it would need to create a FATE operation per tablet. This is why it was deemed too much for ZK. It is best to avoid placing any per-tablet information in ZK, as that assumes all tablet metadata can fit in memory on a single machine, which is an assumption that Accumulo tries to avoid. That does make the changes in #3382 to use FATE for per-tablet splits kind of iffy; however, that could be solved by putting a cap on the number of active split operations (which is not currently supported by FATE). It may be ok for splits, but I am not sure we would want to bottleneck the number of compactions that could concurrently commit in that way.
You tagged this with the elasticity project, but it seems to me that this could be an independent development effort. It doesn't seem strictly tied to the elasticity effort. Though, that effort would benefit from it. This could be done as an independent intermediate step towards the elasticity effort, rather than as something tightly coupled and interdependent on it. Also, related to this idea, there was an idea to change the serialization of Fate storage to use Java 17 record types. That doesn't impact where we store the data... just how it's serialized when we do store it. But, if somebody were to work on this, they could take that proposed change into account, if they had to change how things were serialized to use the metadata table.
No, but that sounds really nice. Also, something unrelated is prioritization of FATE operations. This is also driven by elasticity. Currently the system-initiated split FATE operations are mixed in with user-initiated FATE ops. Ideally the user-initiated operations would get some sort of priority. If FATE ops were stored in the metadata table, then maybe the schema could naturally support prioritization and/or different queues in its sort order.
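The comment does not propose a concrete layout, but as a hedged sketch of the sort-order idea: a queue/priority prefix in the row key would make user-initiated operations sort ahead of system-initiated ones in a scan. The "fate:" prefix, queue numbers, and row format below are invented for illustration only.

```java
// Hedged sketch: encode a FATE queue/priority in the row key so a scan over
// the FATE range naturally returns user-initiated operations first.
// The prefix, queue numbers, and format are illustrative, not a real schema.
public class FateRowSketch {
  static final int USER_QUEUE = 0;   // user-initiated ops (compactions, bulk imports, ...)
  static final int SYSTEM_QUEUE = 1; // system-initiated ops (splits, ...)

  // e.g. fateRow(USER_QUEUE, 0x75bcd15L) -> "fate:00:00000000075bcd15",
  // which sorts before any "fate:01:..." row in an Accumulo scan.
  static String fateRow(int queue, long fateTxId) {
    return String.format("fate:%02d:%016x", queue, fateTxId);
  }
}
```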
@ctubbsii I think using records for serialization deserves its own issue if it does not have one.
Taking a quick look at how table locks currently work, I think they could be implemented as conditional mutations. The table locks do not use ephemeral nodes or notifications, features that only ZK provides. FATE threads try to get a lock and, if they cannot, jump to another FATE operation, so they never wait on a lock in a way that would need ZK notifications. This polling approach would work with conditional mutations. The lock nodes in ZK are tied to a FATE operation and not to a running process, which is why ephemeral nodes are not used; the server process locks in ZK do use ephemeral nodes.
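As a rough sketch of that polling-style approach (not the actual table-lock implementation): a conditional mutation that only succeeds if no other FATE transaction currently holds the lock. The table name, row, and column names are invented, and real table locks are read/write locks, which would need a richer scheme than this exclusive-lock example; only the ConditionalWriter API usage is real.

```java
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriter.Status;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;

public class ConditionalLockSketch {
  // Returns true if this FATE transaction obtained the lock, false if it
  // should move on to another operation and retry later (the polling model).
  static boolean tryLock(AccumuloClient client, String fateTable, String tableId, long fateTxId)
      throws Exception {
    try (ConditionalWriter cw =
        client.createConditionalWriter(fateTable, new ConditionalWriterConfig())) {
      ConditionalMutation cm = new ConditionalMutation("lock:" + tableId);
      // Condition with no value: only succeeds if the lock column is absent,
      // i.e. no other FATE transaction currently holds the lock.
      cm.addCondition(new Condition("lock", "holder"));
      cm.put("lock", "holder", Long.toHexString(fateTxId));
      return cw.write(cm).getStatus() == Status.ACCEPTED;
    }
  }
}
```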
Bulk may benefit from this feature, as it would allow FATE ops driving bulk imports for user tables to be stored in the metadata table. For any system with lots of bulk imports, this would take pressure off of ZK.
It was mentioned on #2473 (comment), which is an overarching issue about refactoring FATE. This issue and that one probably overlap a lot, as they are both very high level and not clear on details. There may be smaller bits that can be separated out and worked on independently. However, I don't want to create any new issues, given that there are already 2 broadly themed ones and it's not clear what the path forward is going to be. I think the ideas between the two tickets need to be thought through, so a clear design can be proposed. At that point, it might make sense to create sub-issues to track the individual tasks as part of working towards the overall goal. Another thing to consider: we have a whole
I think this issue is very narrowly defined; it's about making FATE's storage more scalable so that FATE can be used more frequently than it currently is. Below are the details and background.
In addition to walog scalability, using the metadata table as storage for FATE operations is also more scalable in terms of memory usage. Everything stored in ZK must fit in memory on a single machine. Storing data in the metadata table does not have this constraint, and it can leverage much more memory for caching data across many tservers.
I have started looking into this and have started thinking about a schema design for storing FATE. @keith-turner and I talked offline, and one option is to store FATE data in its own table rather than the metadata table, which might be cleaner.
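Purely as a strawman for that schema discussion, not an agreed-upon design: a hedged sketch of a dedicated FATE table where each transaction is one row, with columns for its status and serialized repo stack. The table name "accumulo.fate" and all column names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class FateTableSketch {
  // Writes a hypothetical new FATE transaction entry: row = zero-padded tx id,
  // one column for status and one for the serialized repo stack.
  static void writeNewTx(AccumuloClient client, long fateTxId, byte[] repoStack) throws Exception {
    try (BatchWriter bw = client.createBatchWriter("accumulo.fate", new BatchWriterConfig())) {
      Mutation m = new Mutation(String.format("%016x", fateTxId));
      m.put(new Text("tx"), new Text("status"), new Value("NEW".getBytes(StandardCharsets.UTF_8)));
      m.put(new Text("tx"), new Text("repos"), new Value(repoStack));
      bw.addMutation(m);
    }
  }
}
```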
@ctubbsii - You had mentioned in an earlier comment doing this as an independent effort outside of elasticity, but @keith-turner had some good points about why this might be better to do as part of elasticity. I've only just begun to look at this issue and am still trying to figure out what is involved, so he could probably explain the reasons better. (Maybe Keith can respond here, or it can be discussed on slack/dev list, etc.) Mostly I just wanted to bring it up because, now that I am starting to look into this, it would be good to get on the same page about what part of the work is being targeted at main vs elasticity, to avoid any issues down the road.
The following are reasons why I think it makes sense to do this in elasticity.
@cshannon, In general, I prefer to keep features as independent as possible. I think this is good development practice, and results in more modular code design, which is easier to maintain. It also helps with release and feature planning when things aren't so co-dependent and tightly coupled. I'm fine with targeting elasticity, though, if the benefits of doing so are substantial. On the two points @keith-turner mentioned:
I'm okay with starting the exploration in the elasticity branch, but please consider putting in some effort to make things modular and work towards incremental progress, wherever possible. If it doesn't strictly need to be implemented in the elasticity branch, I'd prefer it not be. I'd much rather make incremental progress in the main branch than dump everything into elasticity with an all-or-nothing approach... if we can help it (we might not be able to help it... it may still make sense to bundle it all in elasticity, and if so, I'll accept that... I just don't think it's inevitable yet).
There are also specific system operations that are table agnostic. Shutting down a tserver creates a FATE operation to accomplish that task. If that tserver is causing system issues, this specific FATE operation might need to be prioritized over other FATE types. It might help to separate the FATEs into specific categories to determine proper storage locations.
The custom refresh tracking code was removed and compaction commit was moved to being a FATE operation with the following 4 steps:
1. Rename the file, done in the RenameCompactionFile class
2. Update the metadata table via a conditional mutation, done in the CommitCompaction class
3. Write the GC candidates, done in the PutGcCandidates class
4. Optionally send an RPC refresh request if the tablet was hosted, done in the RefreshTablet class

There is some follow-on work that still needs to be done to improve how this change works with detecting dead compactions. After that is done, these changes should address the problems outlined in #3811 and #3802 that were related to process death before adding GC candidates. Now that GC candidates are written in FATE, if the process dies the step will run again later. This currently stores the compaction commit FATE operations in zookeeper. That would not be suitable for a cluster, because per-tablet information should never be stored in zookeeper. However, it is fine as a temporary situation in the elasticity branch until FATE storage is available in an accumulo table; see #4049 and #3559.
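To make the step chaining concrete, here is a minimal sketch of how four steps like these hand off to one another in a FATE-style chain, where each step's call() returns the next step and null ends the transaction. The Step interface, Env type, and method bodies are simplified placeholders modeled loosely on FATE's repo pattern, not the actual classes named in the commit.

```java
import java.io.Serializable;

// Simplified stand-in for a FATE repo: call() performs one step and returns
// the next step to run, or null when the transaction is complete.
interface Step extends Serializable {
  Step call(long fateTxId, Env env) throws Exception;
}

// Placeholder for the manager-side environment passed to each step.
class Env implements Serializable {}

class RenameCompactionFile implements Step {
  public Step call(long tid, Env env) { /* rename the new file into place */ return new CommitCompaction(); }
}

class CommitCompaction implements Step {
  public Step call(long tid, Env env) { /* conditional mutation on the tablet's metadata */ return new PutGcCandidates(); }
}

class PutGcCandidates implements Step {
  public Step call(long tid, Env env) { /* write GC candidates; reruns safely after process death */ return new RefreshTablet(); }
}

class RefreshTablet implements Step {
  public Step call(long tid, Env env) { /* optional RPC refresh if the tablet is hosted */ return null; }
}
```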
FATE operations store their data in zookeeper. This is ok when FATE is used for a relatively low volume of operations. However, in #3462 a use case arose where using FATE to commit compactions would have been nice, but it was deemed that committing compactions would be too high volume for zookeeper. So as part of #3462, a simple ~refresh metadata section was created to ensure that tablet refresh happens after compaction commit, even when the process dies between commit and refresh. This is similar in functionality to FATE, but much simpler and more specialized. Using FATE for this use case would be nice, as it would reduce custom code in Accumulo.
The ~refresh entry introduced as part of #3462 is stored in ZK, the root table, or the metadata table, depending on the tablet that is committing a compaction. Could FATE data be stored in a similar way? Could the following be done?
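As a hedged guess at one way this could work, mirroring how the ~refresh entries are placed by tablet level: root-tablet operations stay in ZK, metadata-tablet operations go to the root table, and user-tablet operations go to the metadata table. The enum and the returned store identifiers below are illustrative placeholders, not existing Accumulo classes.

```java
// Hedged sketch: choose where a FATE operation's data lives based on the
// level of the tablet it operates on, mirroring the ~refresh placement.
enum TabletLevel { ROOT, METADATA, USER }

class FateStoreRouter {
  static String storeFor(TabletLevel level) {
    switch (level) {
      case ROOT:
        return "zookeeper";         // root tablet ops would stay in ZK
      case METADATA:
        return "accumulo.root";     // metadata tablet ops would go to the root table
      case USER:
        return "accumulo.metadata"; // user tablet ops would go to the metadata table
      default:
        throw new IllegalArgumentException("unknown level: " + level);
    }
  }
}
```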
Currently, many FATE operations are tightly coupled to table locks. These would also need to be stored in the metadata table, and I am not sure if they could be implemented using conditional mutations. That is an example of something that could cause problems for this strategy.
The benefit of this change is that it would open FATE up to more use cases and would place less memory and write pressure on zookeeper for current use cases. However, if #3462 is the only use case that needs this, it may not be worthwhile.
Another possible use case that could benefit from this comes from #3382, where system-initiated splits are now executed as FATE operations. System-initiated splits would not normally cause the sustained high levels of metadata activity that compactions would, but they could cause bursts of metadata activity that would place a lot of write and memory pressure on ZK, for example if a table decides all of its tablets need to split around the same time.
I am also wondering if this could help make multiple managers possible by making it easier to partition the FATE operations and data.
The purpose of this issue is to decide if this is workable and worthwhile.