sql/tests: TestMonotonicInserts/distsql=on failed #67802
Comments
I've stressed it with
...
Seems like a KV issue.
Another instance: https://teamcity.cockroachdb.com/viewLog.html?buildId=3200931&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Testrace If we keep hitting it, I will have to skip TestMonotonicInserts.
Chasing the deadlock, I saw that the test always gets stuck with all 6 parallel queries waiting on a lock queue. I also noticed that if I bump the timeout to 90s and run it for a while, the test time varies widely: most iterations finish in 15s, but some run for 45-60s and eventually hit the 90s threshold. I'll try to bisect whether this behaviour is new, or whether we just have more heavily stressed machines running the tests.
So I bisected the failure to this PR: #66505, which bumps Go to 1.16.5.
Could it be related to #67658? If yes, we could try bisecting among changes to
Current findings: the test runs fine unless one of the transactions gets into a bad state where it restarts indefinitely with the same txn ID while holding locks. On each iteration it obtains more and more locks for its unsuccessful writes. This can either lead to a test timeout, or, if luck strikes, txn_interceptor_heartbeater.go attempts a heartbeat at the moment the error has been sent to the client but the client hasn't retried yet (the retry runs in a tight loop); that triggers a transaction abort which kills the transaction non-retryably and lets the other writers proceed.
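For context, the tight client-side retry mentioned here is the savepoint-based retry protocol that CockroachDB clients use (e.g. via cockroach-go's ExecuteTx): rolling back to the savepoint keeps the same SQL transaction, and therefore the same txn ID and its locks, alive across attempts. Below is a minimal sketch of that pattern, assuming a plain database/sql setup and a simplified SQLSTATE check; it is not the test's literal code.

```go
// A minimal sketch of the savepoint-based client retry protocol, NOT the
// test's actual code. Rolling back to the savepoint keeps the same SQL
// transaction (same txn ID, same locks) alive across attempts, and the retry
// happens in a tight loop.
package example

import (
	"context"
	"database/sql"
	"strings"
)

func retryTight(ctx context.Context, db *sql.DB, attempt func(*sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, "SAVEPOINT cockroach_restart"); err != nil {
		_ = tx.Rollback()
		return err
	}
	for {
		err := attempt(tx)
		if err == nil {
			if _, err = tx.ExecContext(ctx, "RELEASE SAVEPOINT cockroach_restart"); err == nil {
				return tx.Commit()
			}
		}
		// SQLSTATE 40001 marks a retryable error (check simplified here). The
		// immediate retry below is the tight loop mentioned above: the window
		// in which the heartbeater can race with the client and abort the
		// transaction non-retryably.
		if strings.Contains(err.Error(), "40001") {
			if _, rerr := tx.ExecContext(ctx, "ROLLBACK TO SAVEPOINT cockroach_restart"); rerr == nil {
				continue
			}
		}
		_ = tx.Rollback()
		return err
	}
}
```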
Update: retracting the bisect conclusion about the Go update. After running the pre-1.16 build for longer, I made it fail as well. So the failure may be more frequent now, but it was definitely happening even before that change. Another observation: in the cases where we get stuck, requests try to push the transaction, and those push requests end up waiting on the same lock table object (see 0xc0006bd680 below)?
Update:
Refs: cockroachdb#67802
Reason: flaky test
Generated by bin/skip-test.
Release justification: non-production code changes
Release note: None
…ierProcessor}
Also TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces.
Refs: cockroachdb#67802, cockroachdb#68127, cockroachdb#68795, cockroachdb#68801
Release note: None
68794: roachtest: fix apt-get upgrade in Jepsen tests r=knz a=erikgrinaker

An `apt-get upgrade` command blocked on:

```
Setting up openssh-server (1:8.2p1-4ubuntu0.3) ...
debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
Configuring openssh-server
--------------------------

A new version (/tmp/fileMcydAr) of configuration file /etc/ssh/sshd_config is available, but the version installed currently has been locally modified.

  1. install the package maintainer's version
  2. keep the local version currently installed
  3. show the differences between the versions
  4. show a side-by-side difference between the versions
  5. show a 3-way difference between available versions
  6. do a 3-way merge between available versions
  7. start a new shell to examine the situation

What do you want to do about modified configuration file sshd_config?
```

This patch sets `DEBIAN_FRONTEND=noninteractive` to avoid this.

Resolves #68783. Resolves #68782. Resolves #68781. Resolves #68780. Resolves #68779. Resolves #68778. Resolves #68777. Resolves #68776. Resolves #68775. Resolves #68774. Resolves #68773. Resolves #68772. Resolves #68771. Resolves #68770. Resolves #68769. Resolves #68768. Resolves #68767. Resolves #68766. Resolves #68765. Resolves #68764. Resolves #68763. Resolves #68762. Resolves #68761. Resolves #68760. Resolves #68759. Resolves #68758. Resolves #68757. Resolves #68756. Resolves #68755. Resolves #68754. Resolves #68753. Resolves #68752. Resolves #68744. Resolves #68726. Resolves #68724. Resolves #68722. Resolves #68720. Resolves #68719. Resolves #68718.

Release note: None

68802: *: skip Test{MonotonicInserts,BackupRestoreEmpty,StreamIngestionFront… r=irfansharif a=irfansharif

…ierProcessor}

Also TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces.

Refs: #67802, #68127, #68795, #68801

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
As discussed at some point with @nvanbenschoten, we thought the problem was that we refresh to the closed timestamp, but I think it is just a matter of how the test uses cluster_logical_timestamp(), which pins the transaction to the closed timestamp from the previous try. Unless there's a guaranteed way a client could do the same thing "legally", I don't think it is a bug. So we'd rather fix the test, or at least make it respond to the situation where it gets temporarily stuck. As for the stuck time: the test can pause on the aforementioned select max(val) on all 6 workers for a few seconds and then gets unblocked by the test termination code. I think it isn't even trying to run the verification query at that time, since we are still waiting for all workers to finish.
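To make the cluster_logical_timestamp() dependence concrete, here is a hypothetical, simplified rendering of a worker's statement; the table and column names are illustrative, not the test's exact schema. The point is that the inserted row records the transaction's own commit timestamp, so a transaction that keeps restarting at a stale timestamp keeps producing stale values.

```go
// Hypothetical, simplified shape of a worker's statement (illustrative names,
// not the test's exact schema). Pairing max(val)+1 with
// cluster_logical_timestamp() pins the written row to the transaction's
// commit timestamp, which after repeated restarts can sit at or just above an
// old closed timestamp rather than near wall-clock time.
package example

const monotonicInsert = `
INSERT INTO mono (val, sts)
SELECT max(val) + 1, cluster_logical_timestamp()
FROM mono
RETURNING val, sts`
```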
OK, so I added a check: once cluster_logical_timestamp() diverges from the wall clock by more than 2 seconds, drop the transaction and start again. With that, it runs without getting stuck. It is a band-aid, but I think the problem of being unable to restart is inherent to the test, not to the way we refresh transactions.
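A minimal sketch of that band-aid, assuming the check is issued as an extra query inside the worker's transaction and that the caller handles the rollback and restart; this is not the actual patch.

```go
// A sketch of the divergence check described above (assumed helper, not the
// actual patch). cluster_logical_timestamp() returns a DECIMAL whose integer
// part is wall-clock nanoseconds since the Unix epoch; if that lags the local
// clock by more than 2s, the caller drops the transaction and starts over.
package example

import (
	"context"
	"database/sql"
	"errors"
	"strconv"
	"strings"
	"time"
)

var errStaleTxnTimestamp = errors.New("txn timestamp lags wall clock; restart the transaction")

func checkTxnTimestampFreshness(ctx context.Context, tx *sql.Tx) error {
	var hlc string
	if err := tx.QueryRowContext(
		ctx, "SELECT cluster_logical_timestamp()::STRING",
	).Scan(&hlc); err != nil {
		return err
	}
	// Keep only the integer (wall-clock nanos) part of the HLC decimal.
	if i := strings.IndexByte(hlc, '.'); i >= 0 {
		hlc = hlc[:i]
	}
	nanos, err := strconv.ParseInt(hlc, 10, 64)
	if err != nil {
		return err
	}
	if time.Since(time.Unix(0, nanos)) > 2*time.Second {
		return errStaleTxnTimestamp // caller rolls back and begins a fresh txn
	}
	return nil
}
```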
There is. Clients have access to the

Fixing the test sounds like a good idea, but I don't think we should miss this opportunity to improve upon how transactions are retried after being bumped by the closed timestamp. Right now, transactions that get bumped by the closed timestamp push their write timestamp to immediately above the closed timestamp (

This can be visualized as:
This was something we noticed in #51294. The 600ms referenced there was the period of the closed timestamp at the time. We addressed that issue by preventing transaction refreshes from being starved in such situations. But in retrospect, it's not surprising that transactions that use
I imagine that if we were to address that, this test would no longer be flaky. Even if a transaction took 3s during its first epoch due to some slowdown, it wouldn't enter this endless starvation loop where it has effectively no chance of ever succeeding in subsequent retries. Instead, it would be given a full 3s again to complete its next retry.
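A sketch of that idea, using time.Time as a stand-in for hlc.Timestamp and ignoring all of the real restart machinery; it only illustrates how much further ahead of the closed timestamp the retried transaction would be placed.

```go
// A sketch of the retry-timestamp idea from the comment above, using
// time.Time as a stand-in for hlc.Timestamp; this is not CockroachDB's
// actual implementation.
package example

import "time"

// retryTimestamp picks the timestamp for the next epoch of a transaction
// that was bumped by the closed timestamp.
func retryTimestamp(closedTS time.Time, prevEpochDuration time.Duration) time.Time {
	// Current behavior (roughly): restart immediately above the closed
	// timestamp. Since the closed timestamp trails real time, a transaction
	// whose epoch takes longer than that trailing interval gets bumped again
	// right away, and starves.
	current := closedTS.Add(1) // analogous to closedTS.Next()

	// Proposed behavior: also move forward by roughly the duration of the
	// previous epoch, so the retry gets about as much runway before the
	// closed timestamp catches up to it as the first attempt had.
	proposed := closedTS.Add(prevEpochDuration)
	if proposed.Before(current) {
		proposed = current
	}
	return proposed
}
```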
Will this change work: #69096? I'm a bit wary of changing the transaction timestamp "randomly", and I don't know how much to trust our tests to cover that part.
sql/tests.TestMonotonicInserts/distsql=on failed with artifacts on master @ 9f39af35fd81a7754e6123862b63247e3277b458:
Reproduce
To reproduce, try:
Parameters in this failure: