Point In Time Recovery #551

Open · Zvirovyi wants to merge 43 commits into main
Conversation


@Zvirovyi Zvirovyi commented Nov 20, 2024

Important!

This PR relies on the latest version of the charmed-mysql-snap and on canonical/charmed-mysql-snap#63.

Overview

MySQL stores binary transaction logs. This PR adds a service job that uploads these logs to the S3 bucket, plus the ability to use them later for point-in-time recovery via a new restore-to-time parameter of the restore action. The parameter accepts either a MySQL timestamp or the keyword latest (to replay all available transaction logs).
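For illustration only, a minimal sketch of how the two accepted forms of restore-to-time could be distinguished on the charm side (the helper name and format string are assumptions, not code from this PR):

from datetime import datetime

def is_valid_restore_to_time(value: str) -> bool:
    # Hypothetical helper: accept the keyword "latest" or a MySQL
    # 'YYYY-MM-DD HH:MM:SS' timestamp, as used in the usage example below.
    if value == "latest":
        return True
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False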

Also, a new application blocked status is introduced: Another cluster S3 repository. It signals to the user that the configured S3 repository is claimed by another cluster; while this status is set, the binlogs collecting job is disabled and creating new backups is restricted (these are the only workload limitations). This is crucial to keep the stored binary logs safe from other clusters. The check uses @@GLOBAL.group_replication_group_name.
After a restore, the cluster's group replication is reinitialized, so it effectively becomes a new cluster. In that case, the Another cluster S3 repository message is replaced with Move restored cluster to another S3 repository to describe the situation more clearly to the user.
Both block messages disappear when the S3 configuration is removed or changed to point at an empty repository.
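A rough sketch of the ownership check described above, assuming the group name read from MySQL is compared against an identifier stored alongside the binlogs in the S3 repository (all helper names here are hypothetical):

def get_group_name(run_sql) -> str:
    # run_sql is assumed to execute a statement and return rows of tuples.
    return run_sql("SELECT @@GLOBAL.group_replication_group_name")[0][0]

def repository_claimed_by_another_cluster(run_sql, stored_group_name) -> bool:
    # An empty repository has no stored group name and can be claimed;
    # otherwise the stored name must match this cluster's group name.
    if stored_group_name is None:
        return False
    return stored_group_name != get_group_name(run_sql)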

Usage example

  1. Deploy mysql + s3-integrator and integrate them.
  2. Create a full backup.
  3. Create test data:
create database zvirovyi;
use zvirovyi;
create table asd(message varchar(255) primary key);
select current_timestamp; # 2024-11-20 17:10:01
insert into asd values ('hello');
select current_timestamp; # 2024-11-20 17:10:12
insert into asd values ('world');
flush binary logs;
  4. Wait several minutes for the binlogs to be uploaded.
  5. Restore: juju run mysql/leader restore backup-id=2024-11-20T17:08:24Z restore-to-time="2024-11-20 17:10:01"
use zvirovyi;
select * from asd; # empty set returned
  6. Observe the application block message.
  7. Restore: juju run mysql/leader restore backup-id=2024-11-20T17:08:24Z restore-to-time="latest"
use zvirovyi;
select * from asd; # hello, world returned

Key notes

@Zvirovyi Zvirovyi marked this pull request as ready for review December 8, 2024 09:12
Zvirovyi commented Dec 8, 2024

Tests are WIP!
Update: integration tests are done; I will fix and add the unit tests after the PR review.


@shayancanonical shayancanonical left a comment


Since we are only collecting binlogs from the leader unit, we need to add handling for when Juju elects a new leader -> what should happen here? Should the new leader unit start collecting binlogs instead?

Furthermore, while thinking of the above use case, we should also handle the scaling scenario -> what if the leader unit is scaled down?

Also, I would really prefer it if we could add an integration test for the above scenario (where the leader unit is scaled down, after which the PITR is performed) once we determine how to handle the scenario.


@paulomach paulomach left a comment


Left some comments and I'll try to test it.

@paulomach

Hi @Zvirovyi, please take a look at the failing tests. Ping me to authorize the test run.

@Zvirovyi

@shayancanonical

Since we are only collecting binlogs from the leader unit, we need to add handling for when Juju elects a new leader -> what should happen here? Should the new leader unit start collecting binlogs instead?

This is handled by binding the S3-changed logic to the leader-elected event, i.e. self.framework.observe(self.charm.on.leader_elected, self._on_s3_credentials_changed) in lib/charms/mysql/v0/backups.py. A peer-relation-changed event then occurs on the other units, which disables their binlogs collector.
This is the same approach used in the PostgreSQL PITR.
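A minimal sketch of that wiring (the observe call is quoted from the PR; the handler bodies are illustrative assumptions, not the actual charm code):

from ops.framework import Object

class MySQLBackups(Object):
    def __init__(self, charm):
        super().__init__(charm, "backups")
        self.charm = charm
        # Re-run the S3 credentials handler whenever a new leader is elected,
        # so binlog collection follows leadership (as in backups.py).
        self.framework.observe(
            self.charm.on.leader_elected, self._on_s3_credentials_changed
        )

    def _on_s3_credentials_changed(self, event) -> None:
        # Illustrative body: only the leader keeps the binlogs collector
        # running; other units stop it once peer relation data propagates.
        if self.charm.unit.is_leader():
            self._start_binlogs_collector()
        else:
            self._stop_binlogs_collector()

    def _start_binlogs_collector(self) -> None:
        ...  # placeholder: start the snap's binlog collector service

    def _stop_binlogs_collector(self) -> None:
        ...  # placeholder: stop the snap's binlog collector service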

Furthermore, while thinking of the above use case, we should also handle the scaling scenario -> what if the leader unit is scaled down?

Same as above; I don't see a difference here.

Also, I would really prefer it if we could add an integration test for the above scenario (where the leader unit is scaled down, after which the PITR is performed) once we determine how to handle the scenario.

Would this be a VM-specific integration test? If so, should we add it?


@shayancanonical shayancanonical left a comment


PR looks great! Will help resolve the unit test failures so that we can see the integration test results.


codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 34.05172% with 153 lines in your changes missing coverage. Please review.

Project coverage is 64.11%. Comparing base (e7c1809) to head (00ee5a8).
Report is 133 commits behind head on main.

Files with missing lines            Patch %   Lines
lib/charms/mysql/v0/backups.py      34.82%    63 Missing and 10 partials ⚠️
lib/charms/mysql/v0/s3_helpers.py   41.66%    28 Missing ⚠️
lib/charms/mysql/v0/mysql.py        29.03%    22 Missing ⚠️
src/mysql_vm_helpers.py             21.42%    21 Missing and 1 partial ⚠️
src/charm.py                        20.00%     6 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #551      +/-   ##
==========================================
- Coverage   66.25%   64.11%   -2.15%     
==========================================
  Files          17       20       +3     
  Lines        3180     4489    +1309     
  Branches      424      742     +318     
==========================================
+ Hits         2107     2878     +771     
- Misses        935     1370     +435     
- Partials      138      241     +103     

☔ View full report in Codecov by Sentry.
