Add design for repository maintenance job #7375
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```diff
@@            Coverage Diff             @@
##             main    #7375      +/-   ##
==========================================
- Coverage   61.76%   61.64%   -0.13%
==========================================
  Files         262      263       +1
  Lines       28433    28634     +201
==========================================
+ Hits        17563    17651      +88
- Misses       9640     9741     +101
- Partials     1230     1242      +12
```

View full report in Codecov by Sentry.
design/repository-maintenance.md (Outdated)

```go
	"/tmp/credentials",
	filesystem.NewFileSystem(),
)
cmd.CheckError(err)
```
Where can we see this error message? Will it be dumped to Velero's debug log or to the backupRepository CR's message?
There are two kinds of messages: fmt.Fprintf-style messages, which are redirected to the termination-log, and regular log messages, which can be retrieved from the job's log.
So we want to add a utility in Velero to handle errors when it's running in a job: it would selectively dump the error to the termination-log, so the Velero server can read it via the k8s API server. That should be added in the design.
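For illustration, here is a minimal sketch of the server-side half of that idea, assuming client-go. The helper name is hypothetical; it relies only on the standard `job-name` label that the Job controller puts on its pods and on the kubelet copying `/dev/termination-log` into the container's terminated state:

```go
package maintenance

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getJobTerminationMessage (hypothetical helper) reads the termination
// message from the pods of a finished maintenance Job, so the Velero
// server could surface it in the backupRepository CR's message.
func getJobTerminationMessage(ctx context.Context, client kubernetes.Interface, job *batchv1.Job) (string, error) {
	pods, err := client.CoreV1().Pods(job.Namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "job-name=" + job.Name,
	})
	if err != nil {
		return "", err
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			// The kubelet records the contents of /dev/termination-log
			// here when the container exits.
			if cs.State.Terminated != nil && cs.State.Terminated.Message != "" {
				return cs.State.Terminated.Message, nil
			}
		}
	}
	return "", nil
}
```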
I've added the FileHook section in the design, which can dump errors into the termination-log.
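A minimal sketch of what such a hook could look like, assuming logrus (which Velero uses for logging). The type name matches the FileHook mentioned above, but the implementation details here are illustrative, not the design's final shape:

```go
package logging

import (
	"os"

	"github.com/sirupsen/logrus"
)

// FileHook mirrors error-level log entries into a file such as
// /dev/termination-log, so the kubelet records them as the container's
// termination message.
type FileHook struct {
	Path string
}

func (h *FileHook) Levels() []logrus.Level {
	return []logrus.Level{logrus.ErrorLevel, logrus.FatalLevel, logrus.PanicLevel}
}

func (h *FileHook) Fire(entry *logrus.Entry) error {
	line, err := entry.String()
	if err != nil {
		return err
	}
	// Append rather than truncate so multiple errors are preserved; the
	// kubelet caps the termination message at 4096 bytes per container.
	f, err := os.OpenFile(h.Path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString(line)
	return err
}
```

The job's logger would then register it with something like `logger.AddHook(&FileHook{Path: "/dev/termination-log"})`.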
Our CLI command is designed as follows:

```shell
$ velero repo-maintenance --repo-name $repo-name --repo-type $repo-type --backup-storage-location $bsl
```
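For context, a rough sketch of how such a command could be wired up with cobra (the CLI library Velero already uses). The flag names come from the line above; `runMaintenance` and the package layout are placeholders, not the design's actual code:

```go
package repomaintenance

import "github.com/spf13/cobra"

// NewCommand builds a hypothetical "repo-maintenance" command; the real
// design may structure this differently.
func NewCommand(runMaintenance func(repoName, repoType, bsl string) error) *cobra.Command {
	var repoName, repoType, bsl string

	cmd := &cobra.Command{
		Use:   "repo-maintenance",
		Short: "Run maintenance on a backup repository (meant to run inside a maintenance Job)",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Delegates to Kopia/Restic; errors would surface in the
			// termination-log via the FileHook discussed above.
			return runMaintenance(repoName, repoType, bsl)
		},
	}

	cmd.Flags().StringVar(&repoName, "repo-name", "", "name of the backup repository")
	cmd.Flags().StringVar(&repoType, "repo-type", "", "type of the backup repository (kopia or restic)")
	cmd.Flags().StringVar(&bsl, "backup-storage-location", "", "backup storage location to use")
	return cmd
}
```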
Does this command trigger a maintenance job every time it is run? If so, wouldn't it be better to let users set a schedule for maintenance jobs? I hope I haven't misunderstood the design.
This command calls the Kopia library or Restic to do the maintenance and is used inside a maintenance job; the command also reads other configuration from the job, such as environment variables and the secret. So for users, running this command is less efficient than handling the repository directly with Kopia or Restic.
Does this mean that the repo (and BSL) needs to be accessible from where the command is being run? Wouldn't that run counter to some other issues that are being worked on (such as #7344)? The idea there is that the BSL may not be reachable from the machine that runs the command. Moreover, the repo may not even be on object storage. For these reasons, I think it is better to have the Velero server run the maintenance job. I agree with the concerns mentioned in the document, such as memory/CPU requirements, but those apply even to large backups. Moreover, all Velero functionality has been available in a declarative way so far, and a requirement to run a command breaks that. I feel that users should be able to schedule the maintenance job just like backups, and the Velero server should run it as per the schedule.
My comments may have come a bit late if code changes have already begun, but I am curious to know what other developers think.
The command will be run in a k8s Job, and the Job will be created by the Velero pod. We do not expect users to run this command from the client side.
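To illustrate that flow, a minimal sketch of the Job the Velero pod might create, assuming client-go; the naming, namespace handling, and image are assumptions, and the `completions`/`parallelism` settings mirror the spec quoted below:

```go
package maintenance

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildMaintenanceJob sketches how the Velero server could assemble the
// maintenance Job that wraps the CLI command above.
func buildMaintenanceJob(ns, repoName, repoType, bsl, image string) *batchv1.Job {
	one := int32(1)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: repoName + "-maintain-job-",
			Namespace:    ns,
		},
		Spec: batchv1.JobSpec{
			Completions: &one, // one pod must finish successfully
			Parallelism: &one, // never run pods concurrently
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "velero-repo-maintenance",
						Image: image,
						Command: []string{
							"/velero", "repo-maintenance",
							"--repo-name", repoName,
							"--repo-type", repoType,
							"--backup-storage-location", bsl,
						},
					}},
				},
			},
		},
	}
}
```

The server would then submit it with `client.BatchV1().Jobs(ns).Create(...)` and watch the Job for completion.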
```yaml
# Only have one job at a time
completions: 1
# Do not run jobs in parallel
parallelism: 1
```
@shawn-hurley has warned me in the past about using Jobs (from the #2601 discussion): this process has to be able to tolerate being started twice.

> Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.

Keep that in mind; I haven't reviewed whether repo maintenance starting twice "accidentally" would cause an issue or not.
I assume the two starts would happen very close to each other, while the Job object's status hasn't yet been updated to Running.
My napkin algorithm would be to have some kind of leader election, even if the expected number of running jobs is one, such that if more than one starts, the later one is a no-op or is stopped.
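For reference, a sketch of that idea using client-go's Lease-based leader election; the lock name, namespace, timings, and wiring are all assumptions, not part of the design:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runExclusively runs fn only while holding a per-repository Lease, so a
// second "accidental" pod for the same repository waits instead of
// running maintenance concurrently.
func runExclusively(ctx context.Context, repoName string, fn func(context.Context)) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname() // the pod name makes a reasonable identity

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "repo-maintenance-" + repoName,
			Namespace: "velero",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(c context.Context) {
				fn(c)
				cancel() // release the lease once maintenance finishes
			},
			OnStoppedLeading: func() { os.Exit(1) },
		},
	})
}
```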
Thanks for reminding me. If the job started twice "accidentally" in the maintenance scenario, that might indicate a bug, but it would not cause issues. The maintenance job updates LastMaintenanceTime in the backuprepositories CR, which only affects the next maintenance time and does not decide whether the repository is ready or not; the "started twice" scenario may simply lead to multiple maintenance runs within one maintenance frequency.
Also, for the same repository, only the maintenance job that acquires the repository's file lock can run maintenance; the repository itself guarantees that only one maintenance runs at a time per repository.
Thank you for contributing to Velero!

Fixes #7291

/kind changelog-not-required