Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore getting stuck for 10 min at different intervals due to "Failed to discover group" and "I0308: client-side throttling" #7516

Closed
mayankagg9722 opened this issue Mar 13, 2024 · 14 comments
Assignees

Comments

@mayankagg9722
Copy link
Contributor

mayankagg9722 commented Mar 13, 2024

What steps did you take and what happened:
Triggered restore and the restore operation got halt/stuck for 10 min at different intervals when its in-progress due to which it took longer than expected to complete.

What did you expect to happen:
Restore flow should not be halt/stopped due to the "Failed to discover group" and "I0308: client-side throttling" errors.

Failures observed at which velero got stuck/halt:

  • Failed to discover group: external.metrics.k8s.io/v1beta1
  • Different GET requests failing in cluster. Sample: I0308 15:14:27.441940 1 request.go:690] Waited for 1.039794137s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.eastus2.azmk8s.io:443/apis/image-assurance.operator.tigera.io/v1?timeout=32s :: {}

The following information will help us better understand what's going on:

We have observed below logs in the cluster from velero:

Restore is properly happening in the cluster until it restored 244 items.
Time: 2024-03-08 14:49:30.6996640
Log: Restored 244 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"name":"default","namespace":"tigera-dex","progress":"","resource":"serviceaccounts"}

then we are observing failures and the restore operation got halt/stuck for 10 min.

I0308 14:53:24.851065 1 request.go:690] Waited for 1.042889102s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.privatelink.eastus2.azmk8s.io:443/apis/maps.k8s.elastic.co/v1alpha1?timeout=32s :: {}

I0308 14:53:34.897298 1 request.go:690] Waited for 3.996218636s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.privatelink.eastus2.azmk8s.io:443/apis/metrics.k8s.io/v1beta1?timeout=32s :: {}

Failed to discover group: external.metrics.k8s.io/v1beta1 :: {"error.message":"the server is currently unable to handle the request"}

next restore log in velero came after 10 minutes
Time: 2024-03-08 15:09:30.8773140
Log: Restored 245 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"progress":"","resource":"serviceaccounts","name":"default","namespace":"twistlock"}

Similar things we have observed at various intervals like below:

restore happened properly till
Time: 2024-03-08 15:09:43.178358
Log: Restored 382 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"name":"tigera-pull-secret","namespace":"tigera-prometheus","progress":"","resource":"secrets"}

--- sample failure logs in between the duration of restore where velero got halt/stuck.

Message
I0308 15:14:27.441940 1 request.go:690] Waited for 1.039794137s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.privatelink.eastus2.azmk8s.io:443/apis/image-assurance.operator.tigera.io/v1?timeout=32s :: {}

I0308 15:14:37.489143 1 request.go:690] Waited for 3.997041158s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.privatelink.eastus2.azmk8s.io:443/apis/azmon.app.monitoring/v1?timeout=32s :: {}

Failed to discover group: external.metrics.k8s.io/v1beta1 :: {"error.message":"the server is currently unable to handle the request"}

next restore log in velero came after 10 minutes

Time: 2024-03-08 15:29:43.2419230
Log: Restored 383 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"resource":"secrets","name":"akeyless-customer-fragment","namespace":"akeyless","progress":""}

Additional Information captured:

Code that could have impacted this:

Discovery helper Go Routine:
https://github.com/vmware-tanzu/velero/blob/main/pkg/cmd/server/server.go#L520

Restore calling discovery refresh:

if createdOrUpdatedCRDs {

Environment:
Velero version (use velero version): v1.11.0
velero-plugin-for-csi: v0.5.1
Kubernetes version: AKS 1.27.9
Cloud provider or hardware configuration: Azure
OS (e.g. from /etc/os-release): AKSUbuntu-2204gen2containerd-202402.07.0

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@mayankagg9722 mayankagg9722 changed the title Restore getting stuck for 10 min at different interncalonger due to "Failed to discover group" and "I0308: client-side throttling" Restore getting stuck for 10 min at different intervals due to "Failed to discover group" and "I0308: client-side throttling" Mar 13, 2024
@danfengliu
Copy link
Contributor

Thanks for reporting this issue!
I got your point that it took so long time to get a refresh result, we will investigate and give a solution in throttling scenario, in addition, did your restore get Failed / partiallyFailed result due to throttling error?

@blackpiglet
Copy link
Contributor

@mayankagg9722
It is related to your k8s environment's metric server.
The metric server is registered to the kube-apiserver as an aggregated API service.

This is the metric server API-Service in my test environment.
If the pod behind the service is not ready, the kube-apiserver cannot serve the corresponding API.
As a result, please ensure the metric service-related workload and related resources are ready before applying the restore.

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    components.gke.io/component-name: metrics-server
    components.gke.io/component-version: 0.6.3-gke.1
    components.gke.io/layer: addon
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
  version: v1beta1
  versionPriority: 100

@mayankagg9722
Copy link
Contributor Author

Thanks for reporting this issue! I got your point that it took so long time to get a refresh result, we will investigate and give a solution in throttling scenario, in addition, did your restore get Failed / partiallyFailed result due to throttling error?

we are getting partially failed restore, as there are some errors too.
sample: Velero restore error: error getting namespace tigera-system: timed out waiting for the condition :: {}

@mayankagg9722
Copy link
Contributor Author

@mayankagg9722 It is related to your k8s environment's metric server. The metric server is registered to the kube-apiserver as an aggregated API service.

This is the metric server API-Service in my test environment. If the pod behind the service is not ready, the kube-apiserver cannot serve the corresponding API. As a result, please ensure the metric service-related workload and related resources are ready before applying the restore.

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    components.gke.io/component-name: metrics-server
    components.gke.io/component-version: 0.6.3-gke.1
    components.gke.io/layer: addon
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
  version: v1beta1
  versionPriority: 100

@blackpiglet , yes we understand as well that this occur as when API service is not configured properly.

But our ask majorly is around why is Restore getting impacted i.e. seeing 10 min delay in processing restore resources in velero.

1 . We wanted to basically understand where exactly we are putting sleep in velero of 10 min that is causing this issue.
2. How many retries are currently configured in such failures i.e. "discovery groups" and also what is the root cause for throttling in this cluster?
3. Can we make changes in velero go routine for discovery groups (that runs every 5 min) to avoid throttling of such cases?

We have already tried to go through below code pieces to understand the sleep of 10 minutes and throttling but couldn't figure out the root cause, it will be helpful if you can help us.

Code:

It is critical for us and we wanted to understand and do custom patch to avoid these issues during restore.

@blackpiglet
Copy link
Contributor

IMO, the 10-minute timeout is not set for the service discovery.
It is set for the namespace's restore-waiting timeout defaultResourceTerminatingTimeout.

err := wait.PollUntilContextTimeout(context.Background(), time.Second, timeout, true, func(ctx context.Context) (bool, error) {
clusterNS, err := client.Get(ctx, namespace.Name, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
// Namespace isn't in cluster, we're good to create.
return true, nil
}
if err != nil {
// Return the err and exit the loop.
return true, err
}
if clusterNS != nil && (clusterNS.GetDeletionTimestamp() != nil || clusterNS.Status.Phase == corev1api.NamespaceTerminating) {
// Marked for deletion, keep waiting
return false, nil
}
// clusterNS found, is not nil, and not marked for deletion, therefore we shouldn't create it.
ready = true
return true, nil
})

@mayankagg9722
Copy link
Contributor Author

IMO, the 10-minute timeout is not set for the service discovery. It is set for the namespace's restore-waiting timeout defaultResourceTerminatingTimeout.

err := wait.PollUntilContextTimeout(context.Background(), time.Second, timeout, true, func(ctx context.Context) (bool, error) {
clusterNS, err := client.Get(ctx, namespace.Name, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
// Namespace isn't in cluster, we're good to create.
return true, nil
}
if err != nil {
// Return the err and exit the loop.
return true, err
}
if clusterNS != nil && (clusterNS.GetDeletionTimestamp() != nil || clusterNS.Status.Phase == corev1api.NamespaceTerminating) {
// Marked for deletion, keep waiting
return false, nil
}
// clusterNS found, is not nil, and not marked for deletion, therefore we shouldn't create it.
ready = true
return true, nil
})

Thanks @blackpiglet for sharing this. I think you rightly pointed this out as in includeNamespace list "tigera-system" namespace is mentioned in restore payload, which might have actually caused this issue of sleep and halt in restore.

I could see in the logs for this cluster that total 10 resource types came for restore in the namespace "tigera-system"

Sample Logs: Log link in velero

1. Resource 'services' will be restored into namespace 'tigera-system' :: {}
2. Resource 'serviceaccounts' will be restored into namespace 'tigera-system' :: {}
3. Resource 'secrets' will be restored into namespace 'tigera-system' :: {}
...... like this......
9. Resource 'deployments.apps' will be restored into namespace 'tigera-system' :: {}
10. Resource 'configmaps' will be restored into namespace 'tigera-system' :: {}

Also, I could see the error during restore for fetching "tigera-system" namespace.
To note: This error came 17 times and it comes to a total of 170 minutes of delay for completing polling for fetching the namespace to ready state. (PollUntilContextTimeout used in code)

Log Link
Error Log: Velero restore error: error getting namespace tigera-system: timed out waiting for the condition :: {}

I have also calculated the sleep of 10 min at different intervals that I observed during restore and it was also coming around Total of 160 min. Note: multiples of 10, due to number of resources of different types that might came for restore

image

Restore Code:
EnsureNamespaceExistsAndIsReadyis called in velero restore.go at 2 places, which have caused total 10 min delays when restoring those 10 different types of resources in the cluster:

  1. processSelectedResource
  2. restoreItem

@mayankagg9722
Copy link
Contributor Author

mayankagg9722 commented Mar 27, 2024

PR #7424 this needs to take care of this scenario as well. Tagging @kaovilai @blackpiglet.

EnsureNamespaceExistsAndIsReady

Problem: During restore for every resource within the namespace, we are calling the check to await on if namespace exists (wait for 10 min polling). For instance, if we have 100 resources in a namespace that itself is in terminating state for very long then it impacts the restore flow to get stuck/halt and it too increases the time to restore as it does it for each resource in the same namespace.

@blackpiglet
Copy link
Contributor

PR #7424 this needs to take care of this scenario as well. Tagging @kaovilai @blackpiglet.

EnsureNamespaceExistsAndIsReady

Problem: During restore for every resource within the namespace, we are calling the check to await on if namespace exists (wait for 10 min polling). For instance, if we have 100 resources in a namespace that itself is in terminating state for very long then it impacts the restore flow to get stuck/halt and it too increases the time to restore as it does it for each resource in the same namespace.

I understand that is inconvenient for the user to wait a long time to find out that the backup cannot proceed, but I think the current behavior is correct for this scenario.
The Velero cannot control when the terminating namespace is deleted, so it seems reasonable to continue to wait for it.
A possible improvement is making the log here more meaningful to tell the user it hangs due to the terminating namespace.

What do you think? Any suggestions?

@mayankagg9722
Copy link
Contributor Author

PR #7424 this needs to take care of this scenario as well. Tagging @kaovilai @blackpiglet.
EnsureNamespaceExistsAndIsReady
Problem: During restore for every resource within the namespace, we are calling the check to await on if namespace exists (wait for 10 min polling). For instance, if we have 100 resources in a namespace that itself is in terminating state for very long then it impacts the restore flow to get stuck/halt and it too increases the time to restore as it does it for each resource in the same namespace.

I understand that is inconvenient for the user to wait a long time to find out that the backup cannot proceed, but I think the current behavior is correct for this scenario. The Velero cannot control when the terminating namespace is deleted, so it seems reasonable to continue to wait for it. A possible improvement is making the log here more meaningful to tell the user it hangs due to the terminating namespace.

What do you think? Any suggestions?

Instead of waiting for the target namespace to be ready for each and every resource individually, we should perhaps iterate on all the needed namespaces at once and then perform restore for each and every resource. Also, we could also decrease the polling time for checking namespace state that will help in overall duration of restore if many such namespaces exists.

@blackpiglet
Copy link
Contributor

Instead of waiting for the target namespace to be ready for each and every resource individually, we should perhaps iterate on all the needed namespaces at once and then perform restore for each and every resource. Also, we could also decrease the polling time for checking namespace state that will help in overall duration of restore if many such namespaces exists.

The proposal is reasonable, but Velero doesn't understand the resource dependency.
During restoration, Velero creates resources according to a default resource list.

var defaultRestorePriorities = restore.Priorities{

As a result, Velero doesn't know whether the restored resource's dependent namespace already exists.
There was some discussion about how to make Velero aware of the resources' dependency, but that is still not in the planned releases.
#7199

@mayankagg9722
Copy link
Contributor Author

Instead of waiting for the target namespace to be ready for each and every resource individually, we should perhaps iterate on all the needed namespaces at once and then perform restore for each and every resource. Also, we could also decrease the polling time for checking namespace state that will help in overall duration of restore if many such namespaces exists.

The proposal is reasonable, but Velero doesn't understand the resource dependency. During restoration, Velero creates resources according to a default resource list.

var defaultRestorePriorities = restore.Priorities{

As a result, Velero doesn't know whether the restored resource's dependent namespace already exists. There was some discussion about how to make Velero aware of the resources' dependency, but that is still not in the planned releases. #7199

@blackpiglet can we keep the decreased polling time for terminating namespaces which is currently 10 min?

@blackpiglet
Copy link
Contributor

Making it configurable seems feasible.

Copy link

github-actions bot commented Jun 8, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

@github-actions github-actions bot added the staled label Jun 8, 2024
Copy link

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants