This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

How to backup and restore user data stored by rest-server. #5786

Closed
siaimes opened this issue Jun 7, 2022 · 5 comments

Comments

@siaimes
Contributor

siaimes commented Jun 7, 2022

Organization Name:

Short summary about the issue/question:
My certificate expired, and the cluster crashed while I was renewing it. I now need to reset and reinstall the cluster, but all user data disappears after that operation, and I could not find a backup and recovery solution on GitHub.

volumes:
- name: pai-configuration-rest-server
  configMap:
    name: pai-configuration
{% if cluster_cfg['authentication']['OIDC'] %}
- name: auth-configuration-rest-server
  configMap:
    name: auth-configuration
{% endif %}
{%- if cluster_cfg["cluster"]["common"]["cluster-type"] == "k8s" %}
{%- if cluster_cfg['hivedscheduler']['config']|length > 1 %}
- name: hived-spec-rest-server
  configMap:
    name: hivedscheduler-config
{%- endif %}
- name: k8s-exit-spec-rest-server
  configMap:
    name: k8s-job-exit-spec-configuration
{%- endif %}
- name: group-configuration-rest-server
  configMap:
    name: group-configuration
{% if cluster_cfg['cluster']['common']['k8s-rbac'] == 'true' %}
serviceAccountName: rest-server-account
{% endif %}

It seems that rest-server does not mount any directory, so where is its data stored? How can I back it up and restore it?

Brief what process you are following:

How to reproduce it:

OpenPAI Environment:

  • OpenPAI version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know:

@Binyang2014
Contributor

If the db file has not been deleted, you can recover the data. Here is a guide for this: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/troubleshooting.html#how-to-solve-the-problem
@hzy46 Can you help take a look?

@siaimes
Contributor Author

siaimes commented Jun 8, 2022

> If the db file has not been deleted, you can recover the data. Here is a guide for this: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/troubleshooting.html#how-to-solve-the-problem @hzy46 Can you help take a look?

User data doesn't seem to be stored there; only job data is.

After I reset and reinstalled the cluster, the job data still existed, but the user data was gone, including usernames, passwords, e-mail addresses, SSH public keys, etc.

@siaimes
Contributor Author

siaimes commented Jun 8, 2022

I see that user information and group information are stored in Kubernetes Secrets, so the problem now seems to be how to back up and restore those Secrets.

@Binyang2014
Contributor

You are right: if you delete the etcd data files, the user/group info will be lost. We need to dump the Secrets first, then apply them to the new cluster.
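
The dump-then-apply approach described above could look roughly like the following. This is a minimal sketch, not an official OpenPAI procedure: the helper names `backup_secrets` and `restore_secrets` are hypothetical, and the namespaces that hold the user and group Secrets are not named in this thread, so list them with `kubectl get namespaces` on your own cluster first.

```shell
#!/usr/bin/env bash
# Sketch only: dump and re-apply Kubernetes Secrets, one namespace at a time.
# Function names and the directory layout are illustrative assumptions.

# Dump every Secret in namespace $1 to $2/<namespace>-secrets.yaml.
backup_secrets() {
  local ns="$1" outdir="$2"
  mkdir -p "$outdir"
  kubectl get secrets --namespace "$ns" --output yaml > "$outdir/$ns-secrets.yaml"
}

# Re-create the Secrets from the dump on the new cluster.
# The namespace is created first if it does not exist yet.
restore_secrets() {
  local ns="$1" outdir="$2"
  kubectl get namespace "$ns" > /dev/null 2>&1 || kubectl create namespace "$ns"
  kubectl apply --filename "$outdir/$ns-secrets.yaml"
}

# Example (namespace names are assumptions; verify them on your cluster):
#   backup_secrets  pai-user-v2 ./pai-backup    # before the reset
#   restore_secrets pai-user-v2 ./pai-backup    # after reinstalling
```

Run the backup before the reset and keep the dump off the cluster nodes. The files contain credentials that are only base64-encoded, so treat them as sensitive. A full dump may also include auto-generated service account token Secrets, which are specific to the old cluster and probably should not be restored.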

@siaimes
Contributor Author

siaimes commented Jun 9, 2022

So running the following command resets the cluster, but it also wipes all etcd data, including the user and group Secrets. Please be careful.

ansible-playbook -i inventory/pai/hosts.yml -e "ansible_python_interpreter=/usr/bin/python3" reset.yml --become --become-user=root -e "@inventory/pai/openpai.yml"
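
Since the reset wipes etcd, one extra precaution is to take an etcd snapshot beforehand. The sketch below wraps `etcdctl snapshot save`. The function name `snapshot_etcd` is hypothetical, and the endpoint and certificate paths are placeholders that depend on how etcd was deployed; none of them come from this thread.

```shell
#!/usr/bin/env bash
# Sketch only: snapshot etcd before a destructive reset.
# Every argument is site-specific; the values in the example are placeholders.

# $1 endpoint URL, $2 CA cert, $3 client cert, $4 client key, $5 output file
snapshot_etcd() {
  local endpoint="$1" cacert="$2" cert="$3" key="$4" out="$5"
  ETCDCTL_API=3 etcdctl \
    --endpoints="$endpoint" \
    --cacert="$cacert" --cert="$cert" --key="$key" \
    snapshot save "$out"
}

# Example, run on an etcd node (all paths are assumptions):
#   snapshot_etcd https://127.0.0.1:2379 /path/to/ca.pem \
#     /path/to/client.pem /path/to/client-key.pem /backup/etcd-snapshot.db
```

A snapshot captures the whole cluster state, so restoring it onto a freshly installed cluster is a heavier operation than re-applying a handful of Secrets. For recovering only user and group data, the Secret dump is likely the simpler path.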
