Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep local copy of the metastore for disaster recovery #5575

Closed
apurvam opened this issue Jun 8, 2020 · 4 comments
Closed

Keep local copy of the metastore for disaster recovery #5575

apurvam opened this issue Jun 8, 2020 · 4 comments
Assignees
Labels
enhancement operability Issues pertaining to running and operating KSQL, notably in production P0 Denotes must-have for a given milestone
Milestone

Comments

@apurvam
Copy link
Contributor

apurvam commented Jun 8, 2020

Is your feature request related to a problem? Please describe.
Today, in the remote circumstances where Kafka loses data the ksql command topic, we will lose the entire metastore irrecoverably. The backup option is to look at the ksql application logs for the CREATE, DROP, TERMINATE statements and resubmit those statements to reconstruct the store.

Kafka could truncate the command topic if settings like retention.ms etc., are erroneously changed to destructive values. Think of setting retention.ms to 0 on the command topic: it will be emptied.

It would be better to keep a copy of the metastore in each state store directory in a given cluster. That way, if the command topic data is lost, we can use the local copy to recreate it.

Describe the solution you'd like

One simple idea is to log the messages coming of the command topic during the process of updating the metastore. The sequence could be something like:

  1. Consume message from command topic
  2. Write it to a compressed local file in the state store directory.
  3. Update the metastore as we do today.
  4. Write a checksum of the new metastore to a separate file in the state store directory.

We should also have a tool to reconstruct the metastore from the saved file and compare the checksums. If they match, we have a clean backup. We can then recreate the command topic from the file.

If they don’t match, we will have to check the contents of the file, log messages, etc., to figure out what the deviations are. It may mean restoring less than complete state

Describe alternatives you've considered
Another way is to actually add a serializer to the metastore which syncs its contents to disk on every write. That's a viable way of doing things, but the benefit of the approach above is that it is trivial to reconstruct the command topic from the saved file: just write each record to the command topic one by one to get back the original state.

@apurvam apurvam added enhancement operability Issues pertaining to running and operating KSQL, notably in production P0 Denotes must-have for a given milestone labels Jun 8, 2020
@apurvam apurvam added this to the 0.11.0 milestone Jun 8, 2020
@agavra
Copy link
Contributor

agavra commented Jun 8, 2020

might be duplicate of #5569

@apurvam
Copy link
Contributor Author

apurvam commented Jun 9, 2020

True, but the two solutions are slightly different. Happy to consolidate, as long as it's on the radar for 0.11.0

@apurvam
Copy link
Contributor Author

apurvam commented Jun 23, 2020

@spena are you focusing on getting this done in the 0.11.0 timeframe? If not, we should reassign.

@apurvam
Copy link
Contributor Author

apurvam commented Jul 20, 2020

Fixed by #5831

@apurvam apurvam closed this as completed Jul 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement operability Issues pertaining to running and operating KSQL, notably in production P0 Denotes must-have for a given milestone
Projects
None yet
Development

No branches or pull requests

3 participants