Keep local copy of the metastore for disaster recovery #5575
Labels: enhancement, operability, P0
Is your feature request related to a problem? Please describe.
Today, in the remote circumstance where Kafka loses data in the ksql command topic, we lose the entire metastore irrecoverably. The backup option is to look at the ksql application logs for the CREATE, DROP, and TERMINATE statements and resubmit those statements to reconstruct the store.
Kafka could truncate the command topic if settings like retention.ms are erroneously changed to destructive values. Think of setting retention.ms to 0 on the command topic: it will be emptied.
It would be better to keep a copy of the metastore in each state store directory in a given cluster. That way, if the command topic data is lost, we can use the local copy to recreate it.
Describe the solution you'd like
One simple idea is to log the messages coming off the command topic during the process of updating the metastore. The sequence could be something like:
We should also have a tool to reconstruct the metastore from the saved file and compare the checksums. If they match, we have a clean backup. We can then recreate the command topic from the file.
If they don’t match, we will have to check the contents of the file, the log messages, etc., to figure out what the deviations are. This may mean restoring a less-than-complete state.
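To make the idea concrete, here is a minimal sketch of the backup-and-verify flow in Python. It is not KSQL code. The JSON-lines file layout, the function names (append_record, read_records, checksum, backup_matches), and the choice of SHA-256 over the ordered statement text are all assumptions made for illustration; the issue does not specify how the checksum is computed.

```python
import hashlib
import json
import os


def append_record(backup_path, key, value):
    """Append one command-topic record to the local backup file (hypothetical format)."""
    with open(backup_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}, sort_keys=True) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the record is on disk before moving on


def read_records(backup_path):
    """Read the backup file back as a list of {"key", "value"} dicts, in order."""
    with open(backup_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def checksum(statements):
    """Order-sensitive SHA-256 over the statement texts (an assumed checksum scheme)."""
    digest = hashlib.sha256()
    for statement in statements:
        digest.update(statement.encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent statements cannot blur together
    return digest.hexdigest()


def backup_matches(backup_path, live_statements):
    """Compare the checksum of the saved file against the live metastore's statements."""
    restored = [record["value"] for record in read_records(backup_path)]
    return checksum(restored) == checksum(live_statements)
```

In this sketch, append_record would be called for each message as it is applied to the metastore, and backup_matches would be the core of the verification tool. A mismatch only says that the two histories differ, not where, so the manual inspection described above would still be needed.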
Describe alternatives you've considered
Another way is to actually add a serializer to the metastore which syncs its contents to disk on every write. That's a viable way of doing things, but the benefit of the approach above is that it is trivial to reconstruct the command topic from the saved file: just write each record to the command topic one by one to get back the original state.
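The "write each record one by one" step could look roughly like the Python sketch below. It assumes a backup file of JSON lines with key and value fields, and it takes a caller-supplied send function rather than naming a specific Kafka client API; both the file layout and the names iter_backup and replay are hypothetical.

```python
import json
from typing import Callable, Iterator, Tuple


def iter_backup(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (key, value) pairs from the backup file in their original order."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                yield record["key"], record["value"]


def replay(path: str, send: Callable[[str, str], None]) -> int:
    """Send every saved record, in order, through `send`; return how many were sent.

    `send` is a placeholder for whatever produces to the command topic.
    """
    count = 0
    for key, value in iter_backup(path):
        send(key, value)
        count += 1
    return count
```

Keeping the producer behind a plain callable means the same function can be dry-run against a list before it is pointed at a real topic.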