[Bug]: RDS Instance creation, resource already exists #1346
Comments
@mjnovice, could you please try changing the field |
@turkenf that helps, but doesn't that also create another RDS instance? Is there a known cause for this issue? |
I could not reproduce the bug. When you get this error, are you sure that there is no other instance with the identifier |
@turkenf nope. There was none before applying the CR. This happens intermittently for me. |
I've got a similar problem. I'm trying to create an RDS instance from another instance using

apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
metadata:
  annotations:
    crossplane.io/external-create-failed: '2024-06-19T10:42:51Z'
    crossplane.io/external-create-pending: '2024-06-19T10:42:51Z'
    crossplane.io/external-create-succeeded: '2024-06-19T09:18:22Z'
  creationTimestamp: '2024-06-19T09:18:21Z'
  finalizers:
    - finalizer.managedresource.crossplane.io
  generation: 2
  name: obfuscated-c75kw-mssql-db
  resourceVersion: '1261333276'
  uid: e04aadb8-ed95-4ec9-b09e-940f6bc4ade2
  selfLink: >-
    /apis/rds.aws.upbound.io/v1beta2/instances/obfuscated-c75kw-mssql-db
status:
  atProvider: {}
  conditions:
    - lastTransitionTime: '2024-06-19T09:18:22Z'
      reason: Creating
      status: 'False'
      type: Ready
    - lastTransitionTime: '2024-06-19T10:42:51Z'
      message: "create failed: async create failed: failed to create the resource: [{0 creating RDS DB Instance (restore to point-in-time) (obfuscated-c75kw-mssql-db): DBInstanceAlreadyExists: DB instance already exists\n\tstatus code: 400, request id: 5edaec05-73f3-4a22-8a3a-cf6ebe39610d []}]"
      reason: ReconcileError
      status: 'False'
      type: Synced
    - lastTransitionTime: '2024-06-19T10:42:51Z'
      message: "async create failed: failed to create the resource: [{0 creating RDS DB Instance (restore to point-in-time) (obfuscated-c75kw-mssql-db): DBInstanceAlreadyExists: DB instance already exists\n\tstatus code: 400, request id: 5edaec05-73f3-4a22-8a3a-cf6ebe39610d []}]"
      reason: AsyncCreateFailure
      status: 'False'
      type: LastAsyncOperation
spec:
  deletionPolicy: Delete
  forProvider:
    autoGeneratePassword: true
    autoMinorVersionUpgrade: true
    backupRetentionPeriod: 0
    caCertIdentifier: rds-ca-rsa4096-g1
    dbSubnetGroupName: prod1
    deleteAutomatedBackups: true
    engine: sqlserver-ee
    engineVersion: ''
    identifier: obfuscated-c75kw-mssql-db
    instanceClass: db.r5.xlarge
    multiAz: false
    optionGroupName: prod1-obfuscated
    region: eu-west-1
    skipFinalSnapshot: true
    tags:
      crossplane-kind: instance.rds.aws.upbound.io
      crossplane-name: obfuscated-c75kw-mssql-db
      crossplane-providerconfig: default
    vpcSecurityGroupIds:
      - sg-xxxxxxxxxxxxxxx
  initProvider:
    restoreToPointInTime:
      - sourceDbInstanceIdentifier: prod1-obfuscated-mssql-db
        useLatestRestorableTime: true
  managementPolicies:
    - '*'
  providerConfigRef:
    name: default
  writeConnectionSecretToRef:
    name: obfuscated-c75kw-mssql-db
    namespace: production

This worked correctly when I was using v1beta1 and the rds provider in version

Edit: I've managed to get the controller to work correctly by updating the manager resource with the

Edit2: After the controller decided that there's actually no problem with the Instance resource it updated the |
The terraform provider version 5.0 upgrade contained a breaking change to this resource, which we mitigated, but may not have been able to fully restore the terraform provider 4.x behavior. The details of the change are both important and confusing. In terraform provider aws v4.x, the terraform
For an upjet-based crossplane provider, the format of the terraform id is very important. Ideally, we can construct it from the In terraform provider aws v4.x, which used the
I have not yet tried to reproduce this issue, but my working theory is that the sequence of events goes like this:
If my theory is correct (which I hope some answers below can help confirm or refute), then the remaining task is to figure out what's happening in step 2, and make it not happen. If we're lucky, it could be as simple as a timeout configuration needing to be tweaked, but it may end up being much more complicated. Some questions for those of you who have experienced this issue:
|
I am attaching debug logs from the RDS provider that mention this instance. Explore-logs-2024-06-20 12_08_32.txt To further add to that, from the events, it seems that after 30 minutes, problems start appearing. The point-in-time restore takes a fair bit longer for that database than 30 minutes. The instance itself is configured correctly. All changes specified in the manifest are there. Right now, it happens each time I try to recreate that database, which is around once per day. |
Ok, one thing that actually is misconfigured compared with the manifest is the backup retention window. In the manifest, I set |
Thanks for the logs @PatTheSilent, they were really helpful. What I see confirms my working theory above. It looks like the request to create the database instance is timing out after about 40 minutes, and then the provider is repeatedly trying to make more

This is also consistent with your finding about

I'm not sure, but I don't think there's a user-facing way for you to adjust the timeout, and it can only be adjusted in the provider source code. That might be a good enhancement to add to upjet. It seems like this is a resource that could potentially take a very long time to create, especially if you're restoring from a large snapshot. Can you tell how long it's taking this RDS instance to actually be created?

There's an additional problem, which is that this situation should result in the provider adding the external create incomplete annotation, raising an error, and stopping further reconciliation of this resource, because it's lost track of the resource it created. That's not happening, and I want to figure out why. |
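For readers following along, here is a minimal Python sketch of how the three external-create annotations could be compared. It assumes the rule is "creation is incomplete when the pending timestamp is newer than both the succeeded and the failed timestamps"; that rule and the function name `external_create_incomplete` are assumptions made for illustration, not the provider's actual Go code. The annotation values are the ones from the manifest posted earlier in this thread.

```python
from datetime import datetime, timezone

PENDING = "crossplane.io/external-create-pending"
SUCCEEDED = "crossplane.io/external-create-succeeded"
FAILED = "crossplane.io/external-create-failed"

# Sentinel used when an annotation is absent.
_UNSET = datetime.min.replace(tzinfo=timezone.utc)


def _parse(annotations: dict, key: str) -> datetime:
    """Return the annotation as an aware datetime, or _UNSET if it is missing."""
    value = annotations.get(key)
    if not value:
        return _UNSET
    # The timestamps in the manifest are RFC 3339 with a trailing "Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def external_create_incomplete(annotations: dict) -> bool:
    """Assumed rule: pending is set and is newer than both outcomes."""
    pending = _parse(annotations, PENDING)
    if pending == _UNSET:
        return False  # creation was never attempted
    latest_outcome = max(_parse(annotations, SUCCEEDED), _parse(annotations, FAILED))
    return pending > latest_outcome


# Annotation values copied from the manifest in this thread.
manifest_annotations = {
    FAILED: "2024-06-19T10:42:51Z",
    PENDING: "2024-06-19T10:42:51Z",
    SUCCEEDED: "2024-06-19T09:18:22Z",
}

print(external_create_incomplete(manifest_annotations))  # False
```

Under the assumed rule, the manifest's annotations do not count as incomplete, because the failed timestamp is equal to the pending one. Whether that is the reason reconciliation kept retrying is not confirmed anywhere in this thread.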
@mbbush from what I could see, it took over an hour for it to be available. It's a MS SQL database with a few TB of data. And yeah, it would be great if we could configure timeouts like in Terraform, maybe via an annotation. |
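To illustrate why a fixed wait timeout matters here, the following Python sketch polls a simulated instance until it becomes available. Everything in it is hypothetical (`wait_until_available`, `FakeClock`, the 65-minute creation time standing in for "over an hour", and the 40-minute timeout taken from the estimate above); it is not how upjet or Terraform implement their waiters.

```python
class WaitTimeout(Exception):
    """Raised when the resource is not available before the deadline."""


def wait_until_available(get_status, timeout_s, interval_s, now, sleep):
    """Poll get_status() until it returns "available" or the timeout elapses."""
    deadline = now() + timeout_s
    while get_status() != "available":
        if now() >= deadline:
            raise WaitTimeout(f"still not available after {timeout_s} seconds")
        sleep(interval_s)


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self):
        self.t = 0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


def try_create(timeout_min, creation_min=65):
    """Simulate a restore that takes creation_min minutes on the cloud side."""
    clock = FakeClock()

    def status():
        return "available" if clock.now() >= creation_min * 60 else "creating"

    try:
        wait_until_available(status, timeout_min * 60, 60, clock.now, clock.sleep)
    except WaitTimeout:
        # The remote restore keeps running even though the client gave up.
        return "timed out"
    return f"available after {clock.now() // 60} min"


print(try_create(timeout_min=40))   # timed out
print(try_create(timeout_min=120))  # available after 65 min
```

The point of the sketch is only that a client-side timeout shorter than the real creation time reports a failure for an instance that will still come up, which is the situation a configurable timeout would avoid.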
Hey @PatTheSilent, I haven't read all of the past discussion above, but based on my impression from the issue description and on what we discussed off-channel with @mbbush, here's what I believe happens:
Here's the crucial part: Upon encountering an error, Terraform doesn't let us know the instance ID that it received in Step 2. If it did, we could have set the external-name annotation. When the external-name annotation is not set, we try to create the resource again in the next reconciliation loop.

Rather fortunately, trying to recreate an existing RDS instance fails, so the user becomes aware of the problem. Recreating other types of existing resources (probably VPC) may succeed. Therefore, in the error case described above, we may end up creating and leaking multiple resources without the user being aware.

The solution is to have Terraform save the instance ID to the state even when an error occurs. Indeed, doing so is Terraform's policy. From their Resource Contribution Guideline:
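The behaviour described above can be sketched as a small Python simulation. All names here (`FakeCloud`, `reconcile`, `run`, the `report_id_on_error` switch, and the `example-db` identifier) are hypothetical and exist only to illustrate the argument; the real provider is written in Go and is considerably more involved.

```python
class AlreadyExists(Exception):
    """Stands in for the DBInstanceAlreadyExists error."""


class CreateTimedOut(Exception):
    """Client-side timeout; may or may not carry the ID of what was created."""

    def __init__(self, resource_id=None):
        super().__init__("create timed out")
        self.resource_id = resource_id


class FakeCloud:
    def __init__(self, slow=True):
        self.instances = set()
        self.slow = slow

    def create(self, identifier, report_id_on_error):
        if identifier in self.instances:
            raise AlreadyExists(identifier)
        self.instances.add(identifier)  # the remote creation has started
        if self.slow:
            # The client gives up waiting, but the instance still exists remotely.
            raise CreateTimedOut(identifier if report_id_on_error else None)
        return identifier

    def exists(self, identifier):
        return identifier in self.instances


def reconcile(cloud, resource, report_id_on_error):
    external_name = resource.get("external-name")
    if external_name and cloud.exists(external_name):
        return "observed"
    try:
        resource["external-name"] = cloud.create(resource["identifier"], report_id_on_error)
    except CreateTimedOut as err:
        if err.resource_id:  # only possible when the ID is saved on error
            resource["external-name"] = err.resource_id
        return "create timed out"
    except AlreadyExists:
        return "DBInstanceAlreadyExists"
    return "created"


def run(report_id_on_error, loops=3):
    cloud, resource = FakeCloud(), {"identifier": "example-db"}
    return [reconcile(cloud, resource, report_id_on_error) for _ in range(loops)]


print(run(report_id_on_error=False))
# ['create timed out', 'DBInstanceAlreadyExists', 'DBInstanceAlreadyExists']
print(run(report_id_on_error=True))
# ['create timed out', 'observed', 'observed']
```

With the ID lost, every later loop attempts another create and fails; with the ID saved on error, the next loop finds the existing instance instead of trying to recreate it.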
I opened an upstream issue and fix last week.
The fix was merged yesterday. Next time we update our Terraform provider dependency, the fix should take effect in our provider. |
@mergenci Awesome stuff! Thank you very much for your work! |
I faced the same issue. It is reproducible, but only for a specific instanceIdentifier. We tried other claims with the same configuration, and they were created just fine. I can share some details in answer to these questions:
Yes, the identifier exists on the managed resource. The instance name in the AWS console was the same. Yes, the crossplane tags are set on the actual RDS instance:
It's reproducible and it takes
Also, here are the CloudTrail request logs from the first successful create event and the first failing event

It may be worth mentioning that we set the identifier in our composition:

- type: FromCompositeFieldPath
  fromFieldPath: metadata.labels[crossplane.io/claim-name]
  toFieldPath: spec.forProvider.identifier
- type: FromCompositeFieldPath
  fromFieldPath: spec.resourceConfig.instanceIdentifier
  toFieldPath: spec.forProvider.identifier
|
@karlderkaefer, thank you for your well-prepared report. Which provider version are you using? The upstream Terraform fix that addresses the issue was consumed in #1406, which was released in v1.10.0 (the latest provider version is v1.11.0, by the way). RDS creation errors are still possible in v1.10.0 and newer, simply because creation takes a very long time and something may time out, as the credentials do in your case. But such errors should be recovered from automatically in the next reconciliation loop, because we now receive the RDS ID (even if something timed out), which lets us detect that the instance exists. In other words, we detect that the creation we thought had failed actually succeeded. If your provider cannot detect that the instance was created successfully, and therefore keeps trying to recreate it and failing, it means there is another edge case we have yet to cover. |
@mergenci, we are currently using version |
@mergenci I can confirm the issue is now fixed with version |
I think we have enough evidence for this issue being fixed, therefore I'm closing it. Feel free to reopen, or preferably, to open a new one. Thank you everyone for your contribution 🙏 |