sql: cluster unusable while foreign keys validated #32118
Comments
Hi @damienhollis - can you provide details about the DDL you're attempting to import or the file itself? Feel free to email me at [email protected], assuming you don't want to post it on a public forum.
Provided the DDL via email.
Thanks @damienhollis - I saw your update and responded with a folder to store the import. Look forward to assisting further.
Wanted to add my findings after doing some tests on my end as well. When dumping data into a cluster on a Kubernetes-hosted GKE instance I was able to reproduce the issue described by the user. QPS started out at about 35 and quickly degraded to as low as 6 QPS. Once the dump reached the ALTER TABLE statements, the cluster began to become unusable. However, I can still query to get results, and it appears that VALIDATE CONSTRAINT is taking a very long time, over an hour in this case:
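For context, here is a minimal sketch (not necessarily the exact statements run above; the output is omitted) of how one might observe the long-running validation from another session using standard CockroachDB statements:

```sql
-- From a second SQL session, list the statements currently executing;
-- a long-running ALTER TABLE ... VALIDATE CONSTRAINT should appear here.
SHOW QUERIES;

-- Schema changes are also surfaced as jobs; the start time gives a rough
-- idea of how long the validation has been running.
SHOW JOBS;
```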
Also, running locally, after the 18th ALTER TABLE statement the SQL client hangs and outputs the following errors:
In my local debug zip, I was also seeing liveness errors like these:
@vivekmenezes - can you take a look? The issue here is with ALTER TABLE VALIDATE CONSTRAINT, so I think this falls under schema? If not, let me know who should take a look. We have the SQL files and DDL on a private store; let @roncrdb know if you need a link.
@tschottdorf we are using v2.1. The problem occurred in both a single-node cluster and a 3-node cluster.
It sounds like it was Ron's SQL client that crashed with the trace trap and not the Kubernetes cluster. I'm hopeful that his client is simply not upgraded to v2.1.
@benesch we have had this issue in two environments and both had 2.1 clients. Also, I'm pretty sure we had to restart the cockroach cluster to make it usable again, not just connect a new client.
Yes, apologies, @damienhollis. I was referring only to the specific crash that @roncrdb posted above and not the broader issue.
@benesch no worries, just thought I'd give more info. A couple of other things: the first instance of this was against a Kubernetes cluster, and we left it running overnight, but the client had crashed by the morning. The second instance was on my development machine and it never finished validating the foreign keys, because I killed it after it had been running all night - my laptop is set not to sleep when plugged in, so I'm pretty sure it ran for over 8 hours.
You were correct - I updated to v2.1 locally, so there is no more trace trap error, but it still hangs on that same ALTER TABLE statement. @benesch it was local, not Kubernetes. I ran
I then ran
Let me know if you need anything else.
Zendesk ticket #2801 has been linked to this issue.
@dt I hope you can follow up on this one.
Doesn't look like the
adding a "me too" to this issue. While the validate is executing, the table being altered is inaccessible (you cant for example run a count on it from another session), but the other tables, including the one that is associated by a foreign key, are accessible normally. You also cant run a SHOW TABLES, while in this state. I haven't provided debug logs here, as it would similar to what was already provided by @roncrdb.
|
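To make the reported behaviour concrete, here is a hypothetical two-session illustration (table and constraint names are made up, not taken from the reporter's schema):

```sql
-- Session 1: the long-running validation.
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;

-- Session 2, while session 1 is still running:
SELECT count(*) FROM orders;     -- hangs: the table being validated is inaccessible
SELECT count(*) FROM customers;  -- works: the referenced table is still accessible
SHOW TABLES;                     -- also hangs, per the report above
```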
This is present because of the call to the InternalExecutor, which has a limitation: while it can reuse a user transaction, it cannot reuse a TableCollection associated with a transaction. Therefore, if a user runs a schema change before a VALIDATE in the same transaction, the transaction can get deadlocked on the transaction having an outstanding intent on the table while the InternalExecutor triggers a table lease acquisition on the same table. Stop using the InternalExecutor in VALIDATE CONSTRAINT. Added the missing call to rows.Close() in validateCheckExpr(). Related to cockroachdb#32118. Release note (sql change): Fix deadlock when using ALTER TABLE VALIDATE CONSTRAINT in a transaction with a schema change.
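For illustration, a sketch of the transaction shape the commit message is describing, with hypothetical table, column, and constraint names:

```sql
BEGIN;

-- The schema change leaves an intent on the table descriptor.
ALTER TABLE orders ADD COLUMN note STRING;

-- The validation used to go through the InternalExecutor, which would try to
-- acquire a lease on the same table and block on that intent: deadlock.
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;

COMMIT;
```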
This problem has not been fixed. The FK validation happens in three steps: 1. read the schema, 2. run the validation, 3. write the validated schema. CockroachDB requires that this all be done at the same timestamp. There is some contention on the schema because all nodes refresh the schema every 5 minutes; any of these reads on the schema will necessarily push the above schema change, requiring it to be retried. I've come to the conclusion that it's best to run the validation outside of this schema change: the validation command itself should schedule a schema change that runs steps 1, 2, and 3 as separate transactions.
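Until such a refactor lands, the user-level way to keep validation out of a larger schema-change transaction (and what dump-generated DDL at the time already approximated, since ALTER TABLE ... ADD FOREIGN KEY left the constraint unvalidated) looks roughly like this sketch with illustrative names:

```sql
-- Add the foreign key without validating existing rows, as its own statement.
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_customer
    FOREIGN KEY (customer_id) REFERENCES customers (id);

-- Validate it separately, outside any larger transaction.
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;
```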
We pushed this from 19.1 because it requires a refactor.
I'm going to close this issue. As of 19.2 we'd fixed the problem of using a bad join for the validation query, and we started validating FKs by default when they're created, which makes |
Describe the problem
We exported our database and reimported it into a different cluster. The import worked up to the point where it attempted to validate the foreign key constraints. The validation ran for several hours and then our console died (unrelated to this issue). The database is not that big, but some tables have about 300k rows. In the end it turned out the data had been loaded, but the foreign key validation caused the import to take an extremely long time and made the cluster unusable.
To Reproduce
Export databases and import into a new cluster.
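For reference, the tail of such a dump typically looks like the following; this is an illustrative sketch with made-up names, not the actual DDL shared privately above:

```sql
-- End of a dump file: data has already been inserted, and the foreign keys
-- are added and then validated one by one.
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
    FOREIGN KEY (customer_id) REFERENCES customers (id);
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;

ALTER TABLE order_items ADD CONSTRAINT fk_items_order
    FOREIGN KEY (order_id) REFERENCES orders (id);
ALTER TABLE order_items VALIDATE CONSTRAINT fk_items_order;
```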
Expected behavior
The validation of foreign keys would finish in a reasonable time. Perhaps the foreign key validation is actually not really necessary, given that we assume the exported data was valid, and even if it wasn't, we still need it imported into another database.
Additional data / screenshots
Environment:
cockroach sql
Additional context
The import took an extremely long time and the cluster was unusable.