fix(lock) handle all resty.lock failure modes #112

flrgh · 2022-07-01T01:56:12Z

This fixes some issues with the business logic used when calling a mutexed function:

The result of pcall(resty_lock, lock_obj, key) was checked to ensure no exception was raised, but the actual result of acquiring the lock was never checked, so functions could be executed without holding a valid lock.
The conditions that would cause a mutexed function to be retried in a timer were too broad, which could lead to scheduling more timers than we really want.
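
To make the first problem concrete, here is a minimal sketch in plain Lua. It is not taken from the patch: `fake_lock` is a hypothetical stand-in for a resty.lock instance whose key is already held, and the two helper names are made up for illustration. It relies only on the fact that resty.lock:lock() reports a failed acquisition by returning nil plus an error string rather than by raising.

```lua
-- Hypothetical stand-in for a resty.lock instance whose key is already held.
-- A failed acquisition is reported as `nil, err`, not as a raised error.
local fake_lock = {}
function fake_lock:lock(key)
  return nil, "timeout"
end

-- Flawed check: only looks at pcall's status, which is true whenever
-- lock() returns normally, even if it returned nil.
local function acquired_unchecked(lock, key)
  local ok = pcall(lock.lock, lock, key)
  return ok
end

-- Stricter check: also inspects the value returned by lock().
local function acquired_checked(lock, key)
  local ok, elapsed, err = pcall(lock.lock, lock, key)
  if not ok then
    -- lock() raised; pcall puts the error message in the second slot
    return false, elapsed
  end
  if not elapsed then
    -- lock() returned normally but did not acquire the lock
    return false, err
  end
  return true
end

print(acquired_unchecked(fake_lock, "targets"))
print(acquired_checked(fake_lock, "targets"))
```

With the stub above, the unchecked helper reports the lock as acquired, while the checked helper reports false together with "timeout".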

Both run_mutexed_fn and run_fn_locked_target_list shared much of the same logic, so I factored it out into a common run_locked function which contains several behavioral changes:

resty.lock:lock() is no longer wrapped with pcall. Instead we explicitly check if we're in a yieldable phase and set the lock timeout to 0 if not. This ensures that resty.lock will not attempt to sleep if the lock is already held.
We only re-schedule the function in a timer if resty.lock:lock() returned "timeout" and we're not in a yieldable phase.
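
The two rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: `new_lock` (a factory returning a resty.lock-like object) and the `yieldable` flag are injected so the example runs outside OpenResty, the "done"/"reschedule" return values are an invented convention, and wrapping `fn` in pcall so the lock is always released is a choice made for this sketch. In real code the yieldable check would presumably be derived from ngx.get_phase().

```lua
-- Sketch only. `new_lock` and `yieldable` are injected so this runs outside
-- OpenResty; both names, and the "done"/"reschedule" statuses, are hypothetical.
local function run_locked(new_lock, yieldable, key, fn, ...)
  local opts = {}
  if not yieldable then
    -- timeout = 0 makes lock() return immediately instead of sleeping
    opts.timeout = 0
  end

  local lock, new_err = new_lock(opts)
  if not lock then
    return nil, "failed to create lock: " .. tostring(new_err)
  end

  local elapsed, lock_err = lock:lock(key)
  if not elapsed then
    if lock_err == "timeout" and not yieldable then
      -- the only case in which the caller should retry in a timer
      return "reschedule"
    end
    return nil, "failed to acquire lock: " .. tostring(lock_err)
  end

  -- We hold the lock: run fn, and release the lock even if fn raises.
  local ok, fn_err = pcall(fn, ...)
  local unlocked, unlock_err = lock:unlock()

  if not ok then
    return nil, "locked function failed: " .. tostring(fn_err)
  end
  if not unlocked then
    return nil, "failed to release lock: " .. tostring(unlock_err)
  end
  return "done"
end

-- Stub factory standing in for resty.lock; `held` simulates contention and
-- `seen` records the options the lock was created with.
local function stub_factory(held, seen)
  return function(opts)
    seen.timeout = opts.timeout
    local lock = {}
    function lock:lock(_key)
      if held then return nil, "timeout" end
      return 0
    end
    function lock:unlock()
      return 1
    end
    return lock
  end
end

local seen = {}
print(run_locked(stub_factory(true, seen), false, "targets", function() end))
```

In the usage line the lock is contended and the phase is treated as non-yieldable, so the sketch asks the caller to re-schedule. The same contention in a yieldable phase is reported as an ordinary error instead.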

The only logic that was left up to the callers of run_locked was the handling of the re-schedule condition. This was done so that callers could log a more informative/context-aware message instead of just a generic "we re-scheduled some kind of function" log entry.
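
As a rough illustration of that split, a caller might look like the sketch below. Everything here is hypothetical: `update_target_locked`, the injected `deps.run_locked`, `deps.schedule` and `deps.log`, and the "reschedule" status are invented for the example. In OpenResty the scheduling step would typically go through ngx.timer.at, but that is stubbed out so the sketch runs in plain Lua.

```lua
local unpack = table.unpack or unpack -- works on Lua 5.1/LuaJIT and 5.2+

-- Hypothetical caller. `deps.run_locked`, `deps.schedule` and `deps.log` are
-- injected stand-ins; the "reschedule" status is an assumed convention.
local function update_target_locked(deps, target, fn, ...)
  local status, err = deps.run_locked("target_list", fn, ...)
  if status ~= "reschedule" then
    return status, err
  end

  -- Only the caller knows what was being attempted, so it writes the log line.
  deps.log("target list is locked; retrying update of " .. target .. " in a timer")

  -- Capture the varargs so the timer callback can replay them.
  local args = { n = select("#", ...), ... }
  return deps.schedule(function()
    return deps.run_locked("target_list", fn, unpack(args, 1, args.n))
  end)
end

-- Stubbed dependencies so the sketch runs outside OpenResty.
local logged, timers = {}, {}
local deps = {
  run_locked = function() return "reschedule" end,
  schedule = function(cb) timers[#timers + 1] = cb; return true end,
  log = function(msg) logged[#logged + 1] = msg end,
}

update_target_locked(deps, "10.0.0.1:8080", function() end)
print(logged[1])
```

Keeping the log call in the caller means the message can name the target being updated, which a shared helper has no way of knowing.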

I think this is the correct way to address some of the observed problems, but review from somebody more knowledgeable with the project is very much needed. If it looks good I can port to the 1.x tree.

This fixes a bug in the `run_fn_locked_target_list` function. The function wraps `resty.lock:lock()` with pcall, but it was only checking the pcall return status and not the actual result of the lock operation. Therefore it would continue even if no lock was acquired. Additionally, I expanded the error-handling to explicitly check for known errors returned by `resty.lock:lock()` ("timeout" and "locked"). This ensures that we only retry the function if the lock operation failed in a recoverable way.

CLAassistant · 2022-07-01T01:56:17Z

All committers have signed the CLA.

flrgh · 2022-07-01T19:26:18Z

FYI I haven't really looked at the tests yet. I figure this could be a pretty difficult one to test, but if the approach looks sound I'll see if there's anything we can do to ensure test coverage.

kikito

I have mostly minor comments about the use of return values and the meaning of the first return value.

lib/resty/healthcheck.lua

* fix(healthchecker) port 2.x lock fixes to 1.5.x
* chore(healthcheck) remove unused vars
* chore(healthcheck) fix indent level
* fix(healthcheck) correct duplicate handling in add_target
* fix(healthchecker) handle fetch_target_list failure in checker callback
* chore(healthcheck) apply suggestions from #112

Co-authored-by: Vinicius Mignot <[email protected]>

This is a squashed commit that realigns master branch with 3.0.0 release. In order to do so the master branch was reverted back to 1.3.0 release (commit: dc2a6b6) and then the 3.0.0 release branch was merged to it (up to commit: a2bec67). Below you can see all the details of the squashed commits.

---------

* release 1.4.0
* fix(healthcheck) use single timer for all active checks (#62)
* fix(healthcheck) use single timer for all active checks
* tests(*) removed tests that are not needed
* docs(*) docs for release 1.4.0
* chore(ci) use newer openresty and luarocks releases (#68)
* fix(healthcheck) single worker actively checks the status (#67)
* release 1.4.1
* fix(healthcheck) record `last_run` when healthcheck is scheduled (#72)
  Prevents a thundering herd issue whereby additional healthchecks are scheduled in the time in which it takes the healthcheck to complete.
* tests(active-probes) interval is respected (#73)
* fix(healthcheck) record `last_run` when healthcheck is scheduled
  Prevents a thundering herd issue whereby additional healthchecks are scheduled in the time in which it takes the healthcheck to complete.
* tests(active-probes) interval is respected
  Co-authored-by: Brian Fox <[email protected]>
* fix(healthcheck) remove event watcher when stopping hc (#74)
  Co-authored-by: Brian Fox <[email protected]>
  Co-authored-by: Brian Fox <[email protected]>
* tests(*) avoid some flakiness (#75)
* release 1.4.2
* chore(*) add GitHub Actions workflows (#82)
* chore(*) add GitHub Actions workflows
* fix(healthcheck) lint error
* Simplify start of the checking timer (#85)
* simplify start of the checking timer, ensuring only one worker actively sends healthchecks. one timer per worker, but before doing anything, tries to acquire an expiration lock. if fails, try again later. if the "winning" worker ever fails to renew it, some other worker would get it.
* chore(rockspec) added rockspec for release 1.5.0-1
  Also:
  - updated scm-1 rockspec
  - bumped openresty version in CI tests
* feat(*) add header support for active checks
* feat(active) support map headers
* feat(healthcheck) delayed_clear function (#88)
  Added new function delayed_clear. This function marks all targets to be removed, but do not actually remove them. If before the delay parameter any of them is re-added, it is unmarked for removal. This function makes it possible to keep target state during config changes, where the targets might be removed and then re-added.
* chore(readme) 1.5.0 release (#91)
* chore(readme) 1.5.0 release
* docs(*) release 1.5.0
  Also added docs missing to delayed_clear() function.
* fix(healthcheck) Use pair instead ipair for hcs weak table (#93)
* release 1.5.1 (#95)
* chore(readme) update badges (#98)
* docs(readme) updated with 1.4.x changes
* chore(workflows) updates for 1.6.0 release
  - added latest openresty to the CI matrix
  - added tests for when lua-resty-worker-events or lua-resty-events are used
* feat(healthcheck) support setting the events module (#105)
* feat(healthcheck) support setting the events module
* fix(healthcheck) defaults to lua-resty-worker-events
* tests(workflows) fixed manual deps install
* fix(healthcheck) check empty opts
* chore(workflows) use last luarocks
* test(workflows) use pre-built deps, test with or 1.13-1.21
* chore(workflows) install lua-resty-events in ci
* tests(workflows) debug
* fixed tests and resty-events usage
* init resty-events in init_worker
* fix(tests) init events module (#107)
* add init_worker in 03-get_target_status.t
* fix 03-get_target_status.t
* fix 03-get_target_status_with_sleeps.t
* fix 04-report_success.t
* fix 05/06
* fix 07/08
* fix 09
* change 10
* fix 11
* fix 12
* change 13
* fix 15
* partial fix 16
* change 17
* fix 18
* change 13
* fix 16
* style 05
* fix 01/02
* use string.buffer in OpenResty 1.21.4.1 (#109)
* use string.buffer in OpenResty 1.21.4.1
* remove cjson require
* fix(healthcheck) use the events module set in defaults
* tests(with_resty-events) disabled tests that need more work
* fix(healthcheck) avoid breaking when opts are nil
* tests(with_resty-events) removed unnecessary test
* tests(with_resty-events) increased sleeps
  Co-authored-by: Chrono <[email protected]>
* release 1.6.0 (#110)
* docs(readme) release 1.6.0
* fix(rockspec) typo
* chore(rockspec) release 1.6.0
* docs(*) release 1.6.0
* chore(*) localize string.format (#111)
* fix(healthcheck) support any lua-resty-events 0.1.x (#118)
* chore(workflows) bump deps versions
* chore(helathcheck) support any lua-resty-events 0.1.x
* fix(healthchecker) port 2.x lock fixes to 1.5.x (#113)
* fix(healthchecker) port 2.x lock fixes to 1.5.x
* chore(healthcheck) remove unused vars
* chore(healthcheck) fix indent level
* fix(healthcheck) correct duplicate handling in add_target
* fix(healthchecker) handle fetch_target_list failure in checker callback
* chore(healthcheck) apply suggestions from #112
  Co-authored-by: Vinicius Mignot <[email protected]>
* chore(healthcheck) increase verbosity for locked function failures (#114)
* chore(healthcheck) increase verbosity for locked function failures
* tests(healthcheck) add tests for run_locked()
* fix(healthcheck) lower the cleanup check frequency
  the health-check timer also checks if targets must be removed. to safely remove targets, the targets list is locked. if this check runs on every health-check cycle and there are a large number of targets, a bazillion locks will be created. this change avoids that by lowering the frequency the cleanup list is checked. the side-effect is that targets marked for cleanup may exist for more time (2.5s) than expected, and some unexpected active checks could happen.
* tests(clear) increase delay for delayed clear tests
  with less locks the wait for delayed clean is longer.
* docs(readme) release 1.6.1
* chore(rockspecs) release 1.6.1
* release 1.6.1
* docs(readme) updated build badge
* chore(ci) remove old openresty versions
* feat(healthcheck) avoid duplication post in rebuild healthcheck scenario
* release 1.6.2
* Added support for https_sni in healthcheck.lua (#49)
* fix(mtls) use OpenResty's API for mtls (#99)
* chore(ci): fix cache path (#136)
  ${{ env.* }} is not evaluated in `with` causing gha tries to cache `/`.
* release 1.6.3 (#135)
* release 3.0.0 (#142)
* feat(ci/KAG-1800): add lint and sast workflows using shared actions
* chore(ci): pin shared code quality actions
* chore(*): backport - localize some functions
  A commit on master 80ee2e1 introduced localizing some functions. This commit backports that one.
  Backports: #92
* fix(healthcheck): fixed incorrect default http_statuses when new() was called multiple times (#83)
* chore(lint): bump kong/public-shared-actions
* docs(README): added 1.5.2 and 1.5.3 releases
* chore(*) rename readme, add release instructions
* chore(healthcheck): fix get_defaults function
* fix(test): fix worker-events test
* release 3.0.0
* chore(github): cancel in progress workflows when new pushed

---------

Co-authored-by: saisatish karra <[email protected]>
Co-authored-by: Shuoqing Ding <[email protected]>
Co-authored-by: Vinicius Mignot <[email protected]>
Co-authored-by: Thijs Schreijer <[email protected]>

* chore(*): revert commits back to 1.3.0
  This reverts the master branch backs to the commit of dc2a6b6 so that we can skip over 2.0.0 release. The 1.3.0 release is the first common commit between master branch and 1.6.x (also 3.0.x) branches.
* chore(docs): fix semgrep https warnings
* docs(readme): update shield badges
  Co-authored-by: Vinicius Mignot <[email protected]>
* chore(*): add 2.0.0 rockspecs and fix tests
  Release 2.0.x introduced some rockspecs with fixes. Reverting back to 1.3.0 and reapplying changes from 3.0.0 reversed those fixes. This commit reintroduces them.
  KAG-2704

---------

Co-authored-by: Vinicius Mignot <[email protected]>
Co-authored-by: Brian Fox <[email protected]>
Co-authored-by: Murillo Paula <[email protected]>
Co-authored-by: Javier <[email protected]>
Co-authored-by: Thijs Schreijer <[email protected]>
Co-authored-by: Mayo <[email protected]>
Co-authored-by: Tomasz Nowak <[email protected]>
Co-authored-by: Chrono <[email protected]>
Co-authored-by: Michael Martin <[email protected]>
Co-authored-by: Jun Ouyang <[email protected]>
Co-authored-by: HansK-p <[email protected]>
Co-authored-by: Qi <[email protected]>
Co-authored-by: Wangchong Zhou <[email protected]>
Co-authored-by: Aapo Talvensaari <[email protected]>
Co-authored-by: saisatish karra <[email protected]>
Co-authored-by: Shuoqing Ding <[email protected]>

flrgh requested a review from locao July 1, 2022 01:56

flrgh requested a review from Tieske July 1, 2022 01:57

flrgh added 2 commits July 1, 2022 12:01

refactor(lock) use special run_locked helper

e7c7955

fix(healthchecker) handle fetch_target_list failure in checker callback

de2b0fc

flrgh requested a review from bungle July 1, 2022 19:02

flrgh mentioned this pull request Jul 1, 2022

fix(healthchecker) port 2.x lock fixes to 1.5.x #113

Merged

kikito reviewed Jul 4, 2022


kikito reviewed Jul 4, 2022


flrgh added 6 commits July 5, 2022 09:37

refactor(lock) don't return function value in timer

d8d5d10

fix(lock) change run_locked return signature when rescheduling

53c772a

fix(lock) fix return check conditional

da41f49

chore(*) revert a docstring change

abcd1c5

chore(*) re-organize dependencies and cached locals

edc66c1

chore(lock) add explanation for resty_lock's limited scope

bd40338

locao merged commit c0950b9 into master Jul 6, 2022

locao deleted the fix/resty-lock-error-handling branch July 6, 2022 14:36

locao added a commit that referenced this pull request Jul 6, 2022

chore(healthcheck) apply suggestions from #112

19274c1

locao added a commit that referenced this pull request Jul 6, 2022

chore(healthcheck) apply suggestions from #112

14349e2

mcdullbloom mentioned this pull request Aug 2, 2023

"failed to release lock" error during active healthchecks Kong/kong#9221

Closed

1 task
