Release stuck in CannotUpdateExternalResource #58

Closed
negz opened this issue Nov 2, 2020 · 5 comments
@negz
Member

negz commented Nov 2, 2020

What happened?

I'm trying to create a CompositeCluster via Upbound Cloud using the AWS reference platform v0.0.4, which I think corresponds roughly to https://github.com/upbound/platform-ref-aws/tree/fee50c794da296e832fb7a80bdee8cff508aac96. It seems like the Helm release created by this platform is stuck permanently in CannotUpdateExternalResource:

Name:         clustor-xnkq2-ftz94
Namespace:    
Labels:       crossplane.io/claim-name=clustor
              crossplane.io/claim-namespace=it
              crossplane.io/composite=clustor-xnkq2
Annotations:  crossplane.io/external-name: clustor-xnkq2-ftz94
API Version:  helm.crossplane.io/v1alpha1
Kind:         Release
Metadata:
  Creation Timestamp:  2020-11-02T23:15:54Z
  Finalizers:
    finalizer.managedresource.crossplane.io
  Generate Name:  clustor-xnkq2-
  Generation:     2
  Owner References:
    API Version:     aws.platformref.crossplane.io/v1alpha1
    Controller:      true
    Kind:            Services
    Name:            clustor-xnkq2-gv4h2
    UID:             f003da91-c3b0-43b8-ae5d-db91cf147131
  Resource Version:  881946
  Self Link:         /apis/helm.crossplane.io/v1alpha1/releases/clustor-xnkq2-ftz94
  UID:               1796dea7-2826-48f4-abff-32467441192f
Spec:
  For Provider:
    Chart:
      Name:  kube-prometheus-stack
      Pull Secret Ref:
        Name:       
        Namespace:  
      Repository:   https://prometheus-community.github.io/helm-charts
      Version:      10.1.0
    Namespace:      operators
    Values:
  Provider Config Ref:
    Name:  clustor
Status:
  At Provider:
    Release Description:  Initial install underway
    Revision:             1
    State:                pending-install
  Conditions:
    Last Transition Time:  2020-11-02T23:28:10Z
    Message:               update failed: failed to upgrade release: "clustor-xnkq2-ftz94" has no deployed releases
    Reason:                ReconcileError
    Status:                False
    Type:                  Synced
    Last Transition Time:  2020-11-02T23:27:48Z
    Reason:                Unavailable
    Status:                False
    Type:                  Ready
Events:
  Type     Reason                        Age                   From                                Message
  ----     ------                        ----                  ----                                -------
  Warning  CannotConnectToProvider       25m (x2 over 25m)     managed/release.helm.crossplane.io  provider could not be retrieved: ProviderConfig.helm.crossplane.io "clustor" not found
  Warning  CannotConnectToProvider       16m (x20 over 25m)    managed/release.helm.crossplane.io  secret referred in provider could not be retrieved: secret data is nil
  Warning  CannotUpdateExternalResource  9m7s (x9 over 13m)    managed/release.helm.crossplane.io  failed to upgrade release: "clustor-xnkq2-ftz94" has no deployed releases
  Warning  CannotUpdateExternalResource  5m2s (x5 over 7m27s)  managed/release.helm.crossplane.io  failed to upgrade release: "clustor-xnkq2-ftz94" has no deployed releases
  Warning  CannotUpdateExternalResource  7s (x6 over 3m7s)     managed/release.helm.crossplane.io  failed to upgrade release: "clustor-xnkq2-ftz94" has no deployed releases

How can we reproduce it?

Use Upbound Cloud to create a CompositeNetwork, then a CompositeCluster.

What environment did it happen in?

kubectl crossplane install provider crossplane/provider-helm:v0.3.5
@negz negz added the bug Something isn't working label Nov 2, 2020
@negz negz changed the title from "Release stuck in c" to "Release stuck in CannotUpdateExternalResource" Nov 3, 2020
@turkenh
Collaborator

turkenh commented Nov 3, 2020

Seems to be related to the case described here: #36 (comment)

Enabling rollback could help to recover from such cases: #36 (comment)

@turkenh
Collaborator

turkenh commented Nov 3, 2020

An actual fix would probably be to implement something like this in the Helm provider controller: helm/helm#7139 (comment)

@turkenh
Collaborator

turkenh commented Nov 3, 2020

@muvaf reported the same issue, and we don't see any pod restarts in the Helm provider. The Helm deployment was probably cancelled due to the context deadline in this case.

@turkenh
Collaborator

turkenh commented Nov 3, 2020

In summary:

If the chart contains hooks, the Helm deployment takes longer because the Helm Go client blocks until the hooks have completed (for pre hooks) or until all other resources have completed (for post hooks). This can become more severe because of things like:

  • cluster nodes not being ready (composed k8s cluster could report ready before all nodes are running, which could be the case here)
  • slow network (pulling pod images)
  • hook job itself taking longer

The actual Helm release is marked as pending-install, and there is no way to continue the installation or do an upgrade on top of that release, if the deployment process is cancelled partway through (similar to hitting Ctrl-C in the Helm client), for example by:

  • controller pod being restarted/killed
  • context deadline exceeded

A short-term solution could be to increase the context deadline, which would block the reconciler loop for longer. This is not a perfect solution for a k8s controller, but since the Helm provider is already configured to run 10 workers by default, it wouldn't hurt much.

The actual fix should be to find a way to do the deployment asynchronously instead of blocking on the Helm Go client.

Even with a fix, I believe rollback should be enabled in the Helm release (especially if there are hooks in the chart) by setting spec.rollbackLimit like this.

turkenh added a commit to upbound/platform-ref-aws that referenced this issue Nov 3, 2020
turkenh added a commit to upbound/platform-ref-aws that referenced this issue Nov 3, 2020
turkenh added a commit to upbound/platform-ref-aws that referenced this issue Nov 3, 2020
@turkenh turkenh self-assigned this Nov 3, 2020
@turkenh turkenh added this to the Platform Sprint 11 milestone Nov 3, 2020
@turkenh
Collaborator

turkenh commented Nov 4, 2020

With the short-term solution merged, closing this issue in favor of #63.

Please reopen if you observe this again.

@turkenh turkenh closed this as completed Nov 4, 2020
@turkenh turkenh mentioned this issue Nov 4, 2020