Lock wait timeout exceeded MySQL errors #9905
Comments
Hey @nickdunn, do you have SEOmatic installed by chance? If so, can you please ensure that you are running its latest version? We have seen similar issues where that ended up being the culprit; it was fixed in its v3.4.14 release. |
@brandonkelly I've been experiencing the same on a page recently. It's a "Pruning extra revisions" job that fails. We are now running the latest version, but we noticed the issue 10 days ago after running 3.7.11 for 2 weeks. Before that we had been running 3.6.17 without any problems, so it might be a change in 3.7 causing this? Another thing I have noticed is that this often happens after saving a specific entry. The entry contains a matrix field with a bunch of blocks, but this entry has had its revisions pruned without a problem earlier.
System info: PHP 7.4.15
Plugins: AsyncQueue 2.3.0
Notes: It's an issue that occurred before AsyncQueue was installed.
Timeline: Sun Jul 04 2021: Updated to 3.6.17
Hope this helps. Let me know if you need me to test anything more. Edit: |
@lassemt Does the job finish successfully if you restart it? Or does it consistently fail? |
@brandonkelly I didn't try restarting the queue, as most of the time I had to restart the server. I have the queue.log from when it happened and you can see it tries to restart the job itself:
|
Thanks. The first and last errors are 12 minutes apart, so that probably rules out the possibility that the lock was just legitimately tied up by a separate process. Do you see any other |
@brandonkelly that makes sense. Here is a batch from queue.log from when I tried to force the error to recur by manually saving the entry a couple of times. This was done right after I restarted the server, so the first event you see is the first queued job after the restart, and the last event is from right after I triggered the server restart through Digital Ocean.
The only error in phperrors.log close to this timestamp is at … I was watching the queue in the admin when the above was happening and took a screenshot of the 2 "Pruning extra revisions" jobs that "got stuck". The images are in the order they were in the queue overview in the admin. I'm not sure if it matters, but notice that the 2 jobs are related to the same SourceID, and that the one with ID 292913, which was pushed first, failed right after the one with ID 292917 was pushed to the queue. I believe it was right after the second one was pushed to the queue that the server started to run slow. |
@brandonkelly I updated and restarted the server now. The same issue is still there and the logs are the same. But I noticed that it only happens when there are 2 "Pruning extra revisions" jobs for the same entry. Neither of the jobs completes before the server starts to slow down, and it is always the first one pushed to the queue that gets the Lock wait timeout exceeded error message. The last of the 2 jobs rarely shows up in queue.log. If there is only one "Pruning extra revisions" job for the entry, the job finishes as expected:
I had to open the entry editor in 2 tabs and save them almost simultaneously to get 2 jobs going. |
We do, and have recently updated to the latest version. We haven't seen the error since... but then again we've only seen the error a handful of times in 6 months, so it's sporadic. I don't think this was the culprit, since we've had |
@lassemt Do you have multiple queue runners? |
I'd like to chip in with a similar experience, although it's not just a single resource/element that it's happening on. We do have a fairly chunky install, with 9 sites and ~50k rows in … MariaDB pegged a single core (out of 4) once this behaviour started, which had a huge impact on DB performance, such that the CP was unusable (couldn't save entries reliably). Killing the DB process was the only way to resolve things. I have since run prune-revisions via the CLI and it ripped through pretty quickly. queue.log attached for info, not that it reports anything that hasn't been seen already (what's with the order of the entries being messed up, though? why aren't they in ascending time order?) Craft 3.6.18 queue.log: |
@brandonkelly I don't think so (I'm not too familiar with queues in PHP). As far as I know, I'm only running the default queue driver. I disabled the ostark/craft-async-queue plugin before updating and running the previous tests. |
Using Async Queue here, default 2 concurrent processes. |
@nickdunn I have a feeling multiple queue runners have a tendency to exacerbate these sorts of issues. |
Yeah, use 1 queue processor and you'll be fine. We learned this the hard way too, hence: https://twitter.com/sjcallender/status/1433238613039407105 |
Ok, we've upgraded Craft, Neo and SEOmatic to the latest versions and have dropped Async Queue to one queue processor, so we'll see if this fixes it. |
Same here! Our client's entire site went down after a few "Internal server error" messages. As the admin I could test this once, and after that the website + CP crashed. There were 2 failed tasks in the queue before it threw me out that said something about Pruning extra revisions:
We found this query task on our server. We think this was our little devil that crashed everything. (We manually killed it because there was no other option left.) Also these: … Maybe you guys can shed some light on this issue so we can prevent it in the future.
|
@kevinmu17 - This is the exact issue that forced us to use just one queue processor. We've been fine ever since. |
We didn't install any, or does craft out of the box use more than 1? :) |
By default, the queue is run when you visit the control panel. I'm not sure if two or more can ever be initiated this way. @brandonkelly? I'd recommend setting |
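For reference, and assuming the truncated recommendation above refers to Craft's `runQueueAutomatically` setting (that is my assumption, not something stated in the thread), a single dedicated queue worker can be set up roughly like this:

```php
<?php
// config/general.php - a minimal sketch, not a confirmed fix from this thread.
// Assumes the goal is a single queue worker: stop control panel requests from
// spawning queue runners, and run one worker process instead.
return [
    '*' => [
        'runQueueAutomatically' => false,
    ],
];
```

A single worker can then be kept alive by a process manager (e.g. Supervisor) running `php craft queue/listen`, so only one process handles "Pruning extra revisions" jobs at a time.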
@kevinmu17 I'd suggest updating to the latest 3.7 release as a starting point and see if the behavior changes. There have been improvements made around transactions and locking that you're missing. |
Could #9977 be a potential fix for this? Edit: unlikely — locally I've just run |
@nickdunn You updated to 3.7.19 first right? |
Ha, yes I did! The … But I'm thinking maybe 55K … I must admit I don't fully understand what the lower level … |
Ah… it’s due to Neo. Each individual Neo field/element combo is going to add its own additional structure. |
We just updated to 3.7.10.1 since we had similar problems with a version < 3.7.19, but the problem still persists, although now it does not create that massive DB load anymore, which previously could only be resolved by restarting it. Here are some facts:
I know that some of my findings above sound rather confused, but this is what I am seeing. I should mention that we have Neo installed at version 2.8.16 and we need to update it. That will happen soon and I will keep you posted. |
@hiasl We have the exact same issue. We've tried rebooting the cluster DB and still the issue persists. Did you do anything to resolve yours? |
Fixes #9905 - where a deadlock error would occur when updating a row referencing an elements table row, which may be locked by another request
We finally got to the bottom of this, with much help from @aaronbushnell 🎉 The deadlock errors occurred when updating rows in the … tables. We fixed it by dropping the foreign keys that relate these tables together, and deferring the “cascade deletes” to garbage collection. Since that involved a new DB migration, we decided to fix it for Craft 4.4 rather than the next patch releases. To test the fix early, you can change your … |
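As a rough illustration of the idea (this is only a sketch of the approach described above, not Craft's actual migration or garbage-collection code), deferring the cascade deletes means a periodic cleanup query removes rows whose element row no longer exists, instead of relying on an ON DELETE CASCADE foreign key:

```php
<?php
// Illustrative sketch only, not Craft's implementation. With the foreign key
// gone, structureelements rows pointing at hard-deleted elements are no longer
// removed automatically, so a garbage-collection pass deletes the orphans.
// MySQL multi-table DELETE syntax; table/column names (structureelements.elementId,
// elements.id) are the ones discussed in this thread.
Craft::$app->getDb()->createCommand('
    DELETE se FROM {{%structureelements}} se
    LEFT JOIN {{%elements}} e ON e.id = se.elementId
    WHERE se.elementId IS NOT NULL AND e.id IS NULL
')->execute();
```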
I tried this update in my Craft 4 branch of this project and it works, @brandonkelly! Thanks so much for the work you and Brad put into this! I presume this probably requires Craft 4 and couldn't be a fix backported to Craft 3? |
Yay, well done! Stupid question, but is it genuinely safe to remove these? Don't the FKs provide data integrity for those relationships? And is there a risk that any plugins rely on these FK rows and it may be a bigger breaking change? |
@aaronbushnell Due to the schema change, we wouldn’t backport the fix to v3 unless there was going to be a 3.8 release. So maybe down the road (e.g. if we have a need to bump the minimum PHP version), but not in the immediate future. Update: We ended up releasing 3.8 alongside 4.4. @nickdunn It can be a concern, but not in this case. Both … (Plus, elements typically get hard-deleted via garbage collection, so in most cases those extra rows will only be around for a fraction of a millisecond, between the time the elements are hard-deleted and the orphaned extra rows are deleted.) |
That makes sense, thanks for the follow-up @brandonkelly! |
Craft 3.8 and 4.4 are out now with this deadlock fix! |
@brandonkelly I'm having a lot of issues with structures being orphaned on multiple 3.8 sites. Could this change have anything to do with it? |
@boboldehampsink This change could lead to orphaned … rows. If you want to send in a DB backup + Composer files and point out the problem, we can look into it from there. Or if you can come up with steps to reproduce, post a new issue about it. |
@brandonkelly I was already mailing with support (Oli) about this. Just sent another dump where it can be reproduced by running GC, see support mail. |
Related spicywebau/craft-neo#725 |
@boboldehampsink Is your issue specific to Neo as well? |
@brandonkelly I do have this specific issue, but I had one other site that had non-Neo structure-related issues after GC - it just broke a Structure section. So my guess is it's not Neo but Craft (structure) specific. |
After an upgrade from 3.7.68 -> 3.8.3 last week we've had several structures flatten, and attempts to edit them result in server errors. This is an instance with the Neo and Navigation plugins, among others, but non-plugin structures (the page tree, for instance) were also flattened. We don't know when this background process ran, so we're rolling back (spinning up a new host based on a snapshot, rolling back the DB to before the upgrade, and restoring a pre-upgrade release). Stack: headless Craft, NextJS, Postgres |
Thanks all! I was able to reproduce with @boboldehampsink’s DB, and tracked it down to a PostgreSQL + garbage collection bug in 3.8 + 4.4. Craft 3.8.4 and 4.4.4 are out now with a fix. (96c56f1) |
@brandonkelly The flattening of structures appears to be fixed, but GC in 3.8.4 seems to be removing tag, category and asset relations. I've been able to reproduce this two times now by rolling back, looking at entries, running GC, looking again. |
Doh, sorry about that. We were so focused on this being a structure bug, I completely forgot that we made the same GC change for relations as well. Craft 3.8.5 and 4.4.5 are out with the same fix for relations. |
We'll test it out and will confirm. |
Looks good to us, thanks! |
Hi, I recently upgraded a large site from Craft CMS 3.7.55.1 to Craft CMS 4.4.11 and I am receiving the following error when trying to save changes to an entry: … I had added some test blocks to a page and forgotten to remove them. I went into the entry, deleted the test blocks and clicked save, and that is when the error happened. I now can't make any changes to the entry without receiving the error above in the logs and an Internal Server Error in the CP. I tried to add a block to other entries and they seem to be fine; however, I am concerned that this is not an isolated case and I could receive more of these errors. |
Hey @paragonn, it sounds like maybe your mutex lock didn’t clear itself for some reason. How is your |
Hey @brandonkelly -- I should also have mentioned that it is a load-balanced environment across two servers. We have a redis component configured, but no mutex. Should we add a mutex configuration? Below is the components configuration that we have in place in the config/app.php: |
@paragonn Yup! See https://craftcms.com/knowledge-base/configuring-load-balanced-environments#mutex-locks Probably worth giving the rest of that page a once-over, too. |
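For anyone else on a load-balanced setup, the shape of that mutex override looks roughly like the following. This is a hedged sketch assuming the yiisoft/yii2-redis extension with placeholder connection values, not the exact config from this thread; see the linked knowledge-base article for the authoritative version.

```php
<?php
// config/app.php - a sketch, not the exact config from the KB article.
// Assumes yiisoft/yii2-redis is installed and both web servers can reach the
// same Redis instance, so locks are shared across the load balancer.
return [
    'components' => [
        'mutex' => [
            'class' => yii\redis\Mutex::class,
            'redis' => [
                'hostname' => getenv('REDIS_HOSTNAME') ?: 'localhost',
                'port' => 6379,
                'database' => 0,
            ],
        ],
    ],
];
```

Depending on the Craft version, the article may instead show nesting this under Craft's own mutex component; the key point is that every node shares one lock store.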
Thanks @angrybrad ! |
Description
Very sporadically I am getting `Lock wait timeout exceeded` errors from MySQL. These appear in the Craft logs. Running `show engine innodb status` shows they are consistently from queries on the `structureelements` table.

The database server is pretty chunky and performs well elsewhere. The `innodb_lock_wait_timeout` setting was at the 50-second default. I have just bumped it to 100, but this feels like a sticking plaster, i.e. telling it to wait a bit longer just in case!

#5872 suggests a large number of revisions could be the issue. I only have about 6k `entries` and 4k `revisions`, while there are 57k rows in `structureelements`.

I'm conscious StackOverflow is awash with useless threads on these errors, and the usual reply is "refactor your code to make sure everything is in one transaction", which doesn't seem too useful.

The offending queries seem fairly innocuous to me. The issue is very sporadic — maybe 10 times in 3 months — and always seemingly when a content editor goes to save an entry.
Reckons?
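For anyone wanting to check the same values mentioned above from within Craft, here is a diagnostic sketch (mine, not part of the original report) using Craft's database connection, e.g. from a console controller:

```php
<?php
// Diagnostic sketch: read the current lock wait timeout and the InnoDB status
// (where the blocked structureelements queries showed up for the reporter).
$db = Craft::$app->getDb();

// The reporter's value was the 50-second default, later raised to 100.
$timeout = $db->createCommand('SELECT @@innodb_lock_wait_timeout')->queryScalar();
echo "innodb_lock_wait_timeout: {$timeout}s\n";

// Full InnoDB status text, including the latest detected lock waits/deadlocks.
$status = $db->createCommand('SHOW ENGINE INNODB STATUS')->queryOne();
echo $status['Status'] ?? '';
```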
Additional info