Increasing "database is locked" errors during update #175
I enabled WAL on my database manually at around 2020-06-25T11:34 UTC.

Update: So, yeah, it definitely worked, but not all the way.

$ date
Mon Jun 29 21:11:02 UTC 2020
$ cat /var/log/reader/update.log | grep ^20 | cut -dT -f1 | uniq -c | tail -
1649 2020-06-20
1969 2020-06-21
1896 2020-06-22
2384 2020-06-23
1963 2020-06-24
2976 2020-06-25
3451 2020-06-26
3464 2020-06-27
3372 2020-06-28
3208 2020-06-29
$ cat /var/log/reader/update.log | python3 locked.py | cut -dT -f1 | uniq -c | tail
24 2020-06-20
24 2020-06-21
24 2020-06-22
24 2020-06-23
24 2020-06-24
14 2020-06-25
1 2020-06-26
1 2020-06-27
5 2020-06-28
3 2020-06-29
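locked.py itself isn't included in this thread; a minimal stand-in that prints one timestamp per "database is locked" error, so its output can go through the same cut/uniq pipeline as above, might look like this (the log format assumptions are mine):

    # hypothetical stand-in for locked.py: read the update log on stdin and
    # print the timestamp of the record each "database is locked" error
    # belongs to; assumes records start with an ISO timestamp (hence grep ^20)
    import sys

    last_timestamp = None
    for line in sys.stdin:
        if line.startswith("20"):
            # a new log record; remember its timestamp
            last_timestamp = line.split()[0]
        if "database is locked" in line and last_timestamp:
            print(last_timestamp)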
A few notes:

- The logs above show that some days the hourly …
- This shows that feed.last_updated/.last_exception (#68) don't have enough info to show in the web app that reader itself doesn't work. We could have a global equivalent, but I don't think it's the right place to solve this: feed.last_exception contains exceptions reader can't do anything about, while a global last_exception would contain bugs in reader (update_feeds() should never fail, from the point of view of the web application user). Also, how are you going to set it if you can't write to the database?
- If it becomes a habit that …
OK, so I think I …: using HAVE_USLEEP either doesn't help, or it was already enabled (it doesn't always show up in …).

I used this script to run the update commands like cron would, but at a scaled-down timeframe; for all 4 combinations of journal mode (delete/WAL) and SQLite build (system / custom HAVE_USLEEP), the hourly update fails during the second minutely update (which is what we see for real).
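For reference, the settings in question can be checked on a copy of the database with a few PRAGMAs; this is just a sketch, and the path is an example:

    # check the journal mode, busy timeout, and whether the SQLite build
    # lists HAVE_USLEEP (it may not appear even in builds that have it)
    import sqlite3

    db = sqlite3.connect("db.sqlite")  # example path
    print(db.execute("PRAGMA journal_mode;").fetchone()[0])   # 'delete' or 'wal'
    print(db.execute("PRAGMA busy_timeout;").fetchone()[0])   # in milliseconds
    print([opt for (opt,) in db.execute("PRAGMA compile_options;")
           if "USLEEP" in opt])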
So yeah, we should probably do HTML stripping outside of the transaction. Note how the strip_html() cumulative time makes up most of the sqlite3.Connection.execute() one:
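The profile output itself isn't reproduced here; cumulative numbers like these can be collected with cProfile, roughly as follows (the database path is an example, and search is assumed to be already enabled):

    # profile a search update and compare the cumulative time spent in
    # strip_html() vs sqlite3.Connection.execute()
    import cProfile
    import pstats

    from reader import make_reader

    reader = make_reader("db.sqlite")  # example path; search already enabled

    profiler = cProfile.Profile()
    profiler.runcall(reader.update_search)

    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats("strip_html|execute")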
OK, we could do this to take strip_html() out of transactions.

Currently:

    # everything under "with sql" is a single SQL statement

    while True:
        with transaction:
            # write lock
            with sql:
                input = get_entries_to_update(chunk_size or -1)
                # this takes most of the time
                stripped_entries = strip_entries(get_unstripped_entries(input))
                insert_into_search(
                    part
                    for entry in stripped_entries
                    for part in entry.parts
                )
            if sql.rowcount == 0:
                # nothing to update, bail early
                break
            with sql:
                # input is identical to the previous one,
                # since we're in a transaction
                input = get_entries_to_update(limit)
                set_to_update_flag(input, 0)
        if not chunk_size:
            # the first iteration did everything, bail
            break
        if sql.rowcount < chunk_size:
            # this is the last chunk, bail
            break

We could instead do this:

    while True:
        with transaction:
            # read lock
            with sql:
                input = get_entries_to_update(chunk_size)
                unstripped_entries = get_unstripped_entries(input)

        if not unstripped_entries:
            # nothing to update
            break

        # this takes most of the time
        stripped_entries = strip_entries(unstripped_entries)

        for stripped_entry in stripped_entries:
            with transaction:
                # explicitly acquire a write lock
                # update: this check added after the initial implementation
                if not current_entry.to_update:
                    # a concurrent call updated this entry, skip it
                    continue
                with sql:
                    insert_into_search(
                        part
                        for part in stripped_entry.parts
                        if current_entry.last_updated == stripped_entry.last_updated
                    )
                # presumably we could have inserted the whole chunk,
                # but we'd have to throw everything away if just one entry changed
                if sql.rowcount != len(stripped_entry.parts):
                    # entry changed since we got it, skip it,
                    # we'll catch it on the next loop
                    rollback
                    continue
                with sql:
                    set_to_update_flag(stripped_entry, 0)
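Concretely, the second loop maps onto something like the following with plain sqlite3; the schema (an entries table with to_update/last_updated columns and an entries_search table) and strip_html() are simplified stand-ins, not reader's actual implementation:

    # sketch of the second loop with real sqlite3 calls; assumes the connection
    # was opened with isolation_level=None, so transactions are controlled
    # explicitly with BEGIN/COMMIT/ROLLBACK
    import sqlite3


    def strip_html(text):
        # placeholder for the expensive HTML stripping step
        return text


    def update_search_once(db: sqlite3.Connection, chunk_size: int = 256) -> bool:
        """Process one chunk; return True if there may be more work left."""
        # plain SELECT: no write lock is held while we do the expensive work
        rows = db.execute(
            "SELECT id, last_updated, content FROM entries WHERE to_update LIMIT ?",
            (chunk_size,),
        ).fetchall()
        if not rows:
            return False

        # the expensive part happens outside any transaction
        stripped = [
            (entry_id, last_updated, strip_html(content))
            for entry_id, last_updated, content in rows
        ]

        for entry_id, last_updated, text in stripped:
            # short write transaction per entry
            db.execute("BEGIN IMMEDIATE")
            try:
                current = db.execute(
                    "SELECT to_update, last_updated FROM entries WHERE id = ?",
                    (entry_id,),
                ).fetchone()
                if not current or not current[0] or current[1] != last_updated:
                    # gone, already done by a concurrent call, or changed since
                    # we read it; skip it, the next loop will pick it up
                    db.execute("ROLLBACK")
                    continue
                db.execute(
                    "INSERT INTO entries_search (id, text) VALUES (?, ?)",
                    (entry_id, text),
                )
                db.execute("UPDATE entries SET to_update = 0 WHERE id = ?", (entry_id,))
                db.execute("COMMIT")
            except BaseException:
                db.execute("ROLLBACK")
                raise

        return True

The point is that the only long-running work (the stripping) happens between transactions, so other writers are blocked only for the duration of a single small insert.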
OK, after 82234aa: The script I used above still fails when using 5/8 seconds as timeout (the "scaled down" one), but does not fail anymore when using the default 5 seconds one (it did before the change). We see a 1.7x increase in the update_search() timings when updating >64 entries:
The profile shows the cumulative time of sqlite3.Connection.execute(many) decreased from 2.8s to 0.6s, but increased from 0.2s when not accounting for the time spent in strip_html(); however, this time is now spread over 2k calls (as opposed to ~40 before). The current implementation starts a transaction for each of the 2k calls; we could probably batch the inserts somewhat, at the expense of having to retry the whole batch if a single entry in it got updated since we started work on this batch. At the moment, this seems overkill.
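For completeness, the batching variant might look roughly like this (same stand-in schema as the sketch above); the trade-off is that one stale entry throws away the whole batch:

    # hypothetical batched variant: one transaction per batch, and a
    # whole-batch retry if any entry in it changed since we read it
    def insert_batch(db, batch):
        """batch: iterable of (entry_id, last_updated, text); True if committed."""
        db.execute("BEGIN IMMEDIATE")
        try:
            for entry_id, last_updated, text in batch:
                current = db.execute(
                    "SELECT last_updated FROM entries WHERE id = ? AND to_update",
                    (entry_id,),
                ).fetchone()
                if not current or current[0] != last_updated:
                    # one stale entry invalidates the whole batch
                    db.execute("ROLLBACK")
                    return False
                db.execute(
                    "INSERT INTO entries_search (id, text) VALUES (?, ?)",
                    (entry_id, text),
                )
                db.execute(
                    "UPDATE entries SET to_update = 0 WHERE id = ?", (entry_id,)
                )
            db.execute("COMMIT")
            return True
        except BaseException:
            db.execute("ROLLBACK")
            raise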
I deployed the change above at about 2020-07-03T13:34 UTC.

Update: The CPU utilization on the instance went to 90% ... Timeline:

There's nothing in the logs about …

Update #2: As expected, there was an infinite loop. It was because after the fancy …
OK, so the number of errors kept increasing. Looking at some backups, I noticed the following:
Some conclusions:
Update: This gist has a script that reproduces the issue and output showing what's happening.

Update #2: The scenario is:
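The gist's contents aren't reproduced in this thread; independent of it, the smallest way to provoke the error itself is two connections with short busy timeouts competing for the write lock (file name and timeouts are arbitrary):

    # minimal "database is locked" demo, not the actual reproduction script
    import sqlite3

    init = sqlite3.connect("locked-demo.sqlite")
    init.execute("CREATE TABLE IF NOT EXISTS t (x)")
    init.commit()
    init.close()

    one = sqlite3.connect("locked-demo.sqlite", isolation_level=None, timeout=0.1)
    two = sqlite3.connect("locked-demo.sqlite", isolation_level=None, timeout=0.1)

    one.execute("BEGIN IMMEDIATE")   # the first connection takes the write lock...
    one.execute("INSERT INTO t VALUES (1)")

    try:
        two.execute("BEGIN IMMEDIATE")   # ...so the second can't get it in time
    except sqlite3.OperationalError as e:
        print(e)   # database is locked
    finally:
        one.execute("ROLLBACK")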
…anges between insert loops. For #175.
to avoid inserting them more than once if the entry changes between update loops. For #175.
I deployed the change above at about 2020-07-07T20:15 UTC and re-enabled the …

Update: Still seeing "database is locked", but at least I didn't see any duplicate entries_search rows.
This gist has a summary of timings for each db.execute(many) call from the updates between roughly --07-08T19/--07-09T07. Some observations:
Next steps:
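For reference, per-statement timings like the ones summarized in that gist can be collected by wrapping the execute calls and logging the elapsed time; a minimal sketch (the helper name is made up, and this isn't necessarily how --debug-storage does it):

    # log how long each statement takes, plus a trimmed copy of the SQL
    import logging
    import time

    logging.basicConfig(level=logging.INFO)


    def timed_execute(db, sql, params=()):
        start = time.monotonic()
        try:
            return db.execute(sql, params)
        finally:
            elapsed = time.monotonic() - start
            logging.info("%.6fs %s", elapsed, " ".join(sql.split())[:80])

Calling timed_execute(db, ...) instead of db.execute(...) then gives one log line per statement, which is enough to aggregate into a summary like the gist's.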
Deployed the smaller …

Update: That doesn't seem to have helped much, we got roughly the same:
causing two entries_search scans instead of one. For #175.
I deployed the change above at --07-11T09:25. It should almost halve the _delete_from_search query time.

BTW, I'm leaving the "after update on feeds" trigger fix for last: it's a sure win, and deploying it now would make it harder to measure the impact of the other changes.

Update: 2/6 failures for hourly …
I deployed the change above at --07-11T15:43.

Update: As of --07-11T20:26: 0/5 hourly …

Update: As of --07-12T08:25: Still no failed …
To summarize what happened:
Lessons learned:
It started again today...
Meanwhile, the web app is very slow (/reader/?limit=64 loads in 24s as opposed to 0.2s). --debug-storage output shows stuff like:
The burst balance started decreasing at --07-12T08:15, but I can't figure out why (I made no change around that time). Rebooted at --07-13T14:00.

Update: ... and didn't happen again after that.
Starting with 2020-04-15, there has been an increasing number of "database is locked" errors during update on my reader deployment (update every hour, and update --new-only && search update every minute).

Most errors happen at XX:01:0X, which I think indicates the hourly and minutely updates are clashing. It's likely that search update is hogging the database (since we know it has long-running transactions). I didn't see any metric changes on the host around -04-15.

Logs.

I should check if there was a trigger that started this, or if the number of entries/feeds simply hit some threshold. Also, it would be nice to show the pid in the logs so I can see which of the commands is failing, and maybe to intercept the exception and show some nicer error message.

Things that are likely to improve this:

- Using --workers 20 to give the hourly update a chance to finish before the second minute of the hour. Obviously, this isn't actually addressing the problem.
- Making the search update chunks more spaced out, to allow other stuff to happen.
- Making search update not hog the database by not stripping HTML inside the transaction.