Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(healthcheck) Use pair instead ipair for hcs weak table #93

Merged
merged 1 commit into from
Mar 23, 2022

Conversation

kayano
Copy link
Contributor

@kayano kayano commented Mar 9, 2022

This PR fixes an issue with active healthchecks being disabled after sometime when targets are being removed/added to an upstream in kong.

This PR is based on #78 but for lua-resty-healthcheck version 1.5.0 used in kong 2.8.0.

Steps to reproduce the issue.

  1. In a new shell run docker-compose up with following docker-compose.yml
version: '3.7'

volumes:
  kong_data: {}

networks:
  kong-net:
    external: false

services:
  kong:
    image: kong:2.8.0
    depends_on:
      - db
    environment:
      KONG_LOG_LEVEL: debug
      KONG_NGINX_DAEMON: "off"
      KONG_ADMIN_LISTEN: 0.0.0.0:8001
      KONG_ADMIN_ACCESS_LOG: /dev/stdout
      KONG_ADMIN_ERROR_LOG: /dev/stderr
      KONG_PROXY_LISTEN: 0.0.0.0:8000
      KONG_PROXY_ACCESS_LOG: /dev/stdout
      KONG_PROXY_ERROR_LOG: /dev/stderr
      KONG_NGINX_WORKER_PROCESSES: "2"
      KONG_DATABASE: postgres
      KONG_PG_DATABASE: kong
      KONG_PG_HOST: db
      KONG_PG_USER: kong
      KONG_PG_PASSWORD: kong
    ports:
      - 8000:8000
      - 8001:8001
    healthcheck:
      test: ["CMD", "kong", "health"]
      interval: 10s
      timeout: 10s
      retries: 10
    restart: on-failure
    networks:
      - kong-net

  kong-migrations:
    image: kong:2.8.0
    command: kong migrations bootstrap
    environment:
      KONG_DATABASE: postgres
      KONG_PG_DATABASE: kong
      KONG_PG_HOST: db
      KONG_PG_USER: kong
      KONG_PG_PASSWORD: kong
    depends_on:
      - db
    restart: on-failure
    networks:
      - kong-net

  kong-migrations-up:
    image: kong:2.8.0
    command: kong migrations up && kong migrations finish
    environment:
      KONG_DATABASE: postgres
      KONG_PG_DATABASE: kong
      KONG_PG_HOST: db
      KONG_PG_USER: kong
      KONG_PG_PASSWORD: kong
    depends_on:
      - db
    restart: on-failure
    networks:
      - kong-net

  db:
    image: postgres:12
    environment:
      POSTGRES_DB: kong
      POSTGRES_USER: kong
      POSTGRES_PASSWORD: kong
    healthcheck:
      interval: 5s
      retries: 10
      test:
      - CMD
      - pg_isready
      - --dbname=kong
      - --username=kong
      timeout: 10s
    stop_signal: SIGKILL
    restart: on-failure
    networks:
      - kong-net
    volumes:
      - kong_data:/var/lib/postgresql/data

It starts a single kong node and a postgres db.

  1. In a new shell setup a kong service, a route and an upstream with 2 targets running (in the directory with docker-compose.yml from step 1):
#!/usr/bin/env bash

curl -s -X POST http://localhost:8001/services \
     --data name=example_service \
     --data host='example_upstream'

curl -s -X POST http://localhost:8001/services/example_service/routes \
     --data 'paths[]=/mock' \
     --data name=mocking

curl -s -X POST http://localhost:8001/upstreams \
     --data name=example_upstream

curl -s -X POST http://localhost:8001/upstreams/example_upstream/targets \
     --data target='mockbin.org:80'

curl -s -X POST http://localhost:8001/upstreams/example_upstream/targets \
     --data target='httpbin.org:80'


curl -X PATCH -H 'content-type: application/json' http://localhost:8001/upstreams/example_upstream \
     -d @- << EOF
{
  "healthchecks": {
    "active": {
      "healthy": {
        "interval": 5,
        "successes": 2
      },
      "unhealthy": {
        "http_failures": 3,
        "interval": 2,
        "timeouts": 5
      }
    },
    "passive": {
      "unhealthy": {
        "http_failures": 3,
        "timeouts": 5
      },
      "healthy": {
        "successes": 2
      }
    }
  }
}
EOF

At this point active healthchecks should be running for both targets and in the shell from step 1 should periodically appear logs like

kong_1                | 2022/03/09 21:17:13 [debug] 1111#0: *439 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) Reporting 'mockbin.org (172.67.201.247:80)' (got HTTP 200)
kong_1                | 2022/03/09 21:17:13 [debug] 1111#0: *439 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) Reporting 'mockbin.org (104.21.60.227:80)' (got HTTP 200)
kong_1                | 2022/03/09 21:17:13 [debug] 1111#0: *439 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) Reporting 'httpbin.org (54.221.78.73:80)' (got HTTP 200)
kong_1                | 2022/03/09 21:17:13 [debug] 1111#0: *439 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) Reporting 'httpbin.org (52.55.211.119:80)' (got HTTP 200)
kong_1                | 2022/03/09 21:17:13 [debug] 1111#0: *439 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) Reporting 'httpbin.org (52.86.136.68:80)' (got HTTP 200)
kong_1                | 2022/03/09 21:17:13 [debug] 1111#0: *439 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) Reporting 'httpbin.org (54.85.66.171:80)' (got HTTP 200)
kong_1                | 2022/03/09 21:17:13 [debug] 1111#0: *439 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) Reporting 'httpbin.org (3.209.99.235:80)' (got HTTP 200)
kong_1                | 2022/03/09 21:17:13 [debug] 1111#0: *439 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) Reporting 'httpbin.org (34.227.133.25:80)' (got HTTP 200)
kong_1                | 2022/03/09 21:17:14 [debug] 1111#0: *119 [lua] healthcheck.lua:1235: log(): [healthcheck] (0fdb1f6d-5629-472b-ac71-8680db1ae231:example_upstream) checking unhealthy targets: nothing to do

indicating that active healthchecks are working.

  1. To quicker reproduce the issue run:
docker-compose exec -u 0 kong bash -c "sed -i '1590 i collectgarbage("collect")' /usr/local/share/lua/5.1/resty/healthcheck.lua"

It adds a line to run one complete cycle of garbage collection.

  1. Reload kong after library change running:
docker-compose exec kong bash -c "kong reload"
  1. When kong is reloaded and working again delete one target running:
curl -X DELETE "localhost:8001/upstreams/example_upstream/targets/httpbin.org:80"
  1. Looking at logs active healthchecks are disabled now for all targets in example_upstream but they shouldn't.
    When passive healthcheck fails for all targets in the example_upstream kong returns eventually
{"message":"failure to get a peer from the ring-balancer"}

even after targets are healthy again.

Changing ipairs to pairs while iterating over hcs table fixes the issue.

@CLAassistant
Copy link

CLAassistant commented Mar 9, 2022

CLA assistant check
All committers have signed the CLA.

@kayano
Copy link
Contributor Author

kayano commented Mar 9, 2022

@locao as you released lua-resty-healthcheck 1.5.0 can you help to have my change released? What needs to be done to have the library bump in kong also if my change makes sense?

@locao locao merged commit 3f7efd1 into Kong:release/1.5.x Mar 23, 2022
@locao
Copy link
Contributor

locao commented Mar 23, 2022

Hi @kayano! Thanks for your contribution!

I will release a new version of this module with your fix. For the Kong side, we need to bump the dependency and also release a new version.

nowNick added a commit that referenced this pull request Nov 8, 2023
This is a squashed commit that realigns master branch with 3.0.0 release. In order to do so the master branch was reverted back to 1.3.0 release (commit: dc2a6b6) and then the 3.0.0 release branch was merged to it (up to commit: a2bec67). 

Below you can see all the details of the squashed commits.

---------

* release 1.4.0

* fix(healthcheck) use single timer for all active checks (#62)

* fix(healthcheck) use single timer for all active checks

* tests(*) removed tests that are not needed

* docs(*) docs for release 1.4.0

* chore(ci) use newer openresty and luarocks releases (#68)

* fix(healthcheck) single worker actively checks the status (#67)

* release 1.4.1

* fix(healthcheck) record `last_run` when healthcheck is scheduled (#72)

Prevents a thundering herd issue whereby additional healthchecks are scheduled in the time in which it takes the healthcheck to complete.

* tests(active-probes) interval is respected (#73)

* fix(healthcheck) record `last_run` when healthcheck is scheduled

Prevents a thundering herd issue whereby additional healthchecks are scheduled in the time in which it takes the healthcheck to complete.

* tests(active-probes) interval is respected

Co-authored-by: Brian Fox <[email protected]>

* fix(healthcheck) remove event watcher when stopping hc (#74)

Co-authored-by: Brian Fox <[email protected]>

Co-authored-by: Brian Fox <[email protected]>

* tests(*) avoid some flakiness (#75)

* release 1.4.2

* chore(*) add GitHub Actions workflows (#82)

* chore(*) add GitHub Actions workflows

* fix(healthcheck) lint error

* Simplify start of the checking timer (#85)

* simplify start of the checking timer, ensuring only one worker actively sends healthchecks.

one timer per worker, but before doing anything, tries to acquire an
expiration lock.  if fails, try again later.  if the "winning" worker
ever fails to renew it, some other worker would get it.

* chore(rockspec) added rockspec for release 1.5.0-1

Also:
  - updated scm-1 rockspec
  - bumped openresty version in CI tests

* feat(*) add header support for active checks

* feat(active) support map headers

* feat(healthcheck) delayed_clear function (#88)

Added new function delayed_clear. This function marks all targets to be
removed, but do not actually remove them. If before the delay parameter
any of them is re-added, it is unmarked for removal.

This function makes it possible to keep target state during config
changes, where the targets might be removed and then re-added.

* chore(readme) 1.5.0 release (#91)

* chore(readme) 1.5.0 release

* docs(*) release 1.5.0

Also added docs missing to delayed_clear() function.

* fix(healthcheck) Use pair instead ipair for hcs weak table (#93)

* release 1.5.1 (#95)

* chore(readme) update badges (#98)

* docs(readme) updated with 1.4.x changes

* chore(workflows) updates for 1.6.0 release

- added latest openresty to the CI matrix
- added tests for when lua-resty-worker-events or lua-resty-events are
  used

* feat(healthcheck) support setting the events module (#105)

* feat(healthcheck) support setting the events module

* fix(healthcheck) defaults to lua-resty-worker-events

* tests(workflows) fixed manual deps install

* fix(healthcheck) check empty opts

* chore(workflows) use last luarocks

* test(workflows) use pre-built deps, test with or 1.13-1.21

* chore(workflows) install lua-resty-events in ci

* tests(workflows) debug

* fixed tests and resty-events usage

* init resty-events in init_worker

* fix(tests) init events module (#107)

* add init_worker in 03-get_target_status.t

* fix 03-get_target_status.t

* fix 03-get_target_status_with_sleeps.t

* fix 04-report_success.t

* fix 05/06

* fix 07/08

* fix 09

* change 10

* fix 11

* fix 12

* change 13

* fix 15

* partial fix 16

* change 17

* fix 18

* change 13

* fix 16

* style 05

* fix 01/02

* use string.buffer in OpenResty 1.21.4.1 (#109)

* use string.buffer in OpenResty 1.21.4.1

* remove cjson require

* fix(healthcheck) use the events module set in defaults

* tests(with_resty-events) disabled tests that need more work

* fix(healthcheck) avoid breaking when opts are nil

* tests(with_resty-events) removed unnecessary test

* tests(with_resty-events) increased sleeps

Co-authored-by: Chrono <[email protected]>

* release 1.6.0 (#110)

* docs(readme) release 1.6.0

* fix(rockspec) typo

* chore(rockspec) release 1.6.0

* docs(*) release 1.6.0

* chore(*) localize string.format (#111)

* fix(healthcheck) support any lua-resty-events 0.1.x (#118)

* chore(workflows) bump deps versions

* chore(helathcheck) support any lua-resty-events 0.1.x

* fix(healthchecker) port 2.x lock fixes to 1.5.x (#113)

* fix(healthchecker) port 2.x lock fixes to 1.5.x

* chore(healthcheck) remove unused vars

* chore(healthcheck) fix indent level

* fix(healthcheck) correct duplicate handling in add_target

* fix(healthchecker) handle fetch_target_list failure in checker callback

* chore(healthcheck) apply suggestions from #112

Co-authored-by: Vinicius Mignot <[email protected]>

* chore(healthcheck) increase verbosity for locked function failures (#114)

* chore(healthcheck) increase verbosity for locked function failures

* tests(healthcheck) add tests for run_locked()

* fix(healthcheck) lower the cleanup check frequency

the health-check timer also checks if targets must be removed. to
safely remove targets, the targets list is locked. if this check runs
on every health-check cycle and there are a large number of targets, a
bazillion locks will be created. this change avoids that by lowering
the frequency the cleanup list is checked.

the side-effect is that targets marked for cleanup may exist for more
time (2.5s) than expected, and some unexpected active checks could
happen.

* tests(clear) increase delay for delayed clear tests

with less locks the wait for delayed clean is longer.

* docs(readme) release 1.6.1

* chore(rockspecs) release 1.6.1

* release 1.6.1

* docs(readme) updated build badge

* chore(ci) remove old openresty versions

* feat(healthcheck) avoid duplication post in rebuild healthcheck scenario

* release 1.6.2

* Added support for https_sni in healthcheck.lua (#49)

* fix(mtls) use OpenResty's API for mtls (#99)

* chore(ci): fix cache path (#136)

${{ env.* }} is not evaluated in `with` causing gha tries to cache `/`.

* release 1.6.3 (#135)

* release 3.0.0 (#142)

* feat(ci/KAG-1800): add lint and sast workflows using shared actions

* chore(ci): pin shared code quality actions

* chore(*): backport - localize some functions

A commit on master 80ee2e1
introduced localizing some functions. This commit backports that one.

Backports: #92

* fix(healthcheck): fixed incorrect default http_statuses when new() was called multiple times (#83)

* chore(lint): bump kong/public-shared-actions

* docs(README): added 1.5.2 and 1.5.3 releases

* chore(*) rename readme, add release instructions

* chore(healthcheck): fix get_defaults function

* fix(test): fix worker-events test

* release 3.0.0

* chore(github): cancel in progress workflows when new pushed

---------

Co-authored-by: saisatish karra <[email protected]>
Co-authored-by: Shuoqing Ding <[email protected]>
Co-authored-by: Vinicius Mignot <[email protected]>
Co-authored-by: Thijs Schreijer <[email protected]>

* chore(*): revert commits back to 1.3.0

This reverts the master branch backs to the commit
of dc2a6b6 so that
we can skip over 2.0.0 release.
The 1.3.0 release is the first common commit between
master branch and 1.6.x (also 3.0.x) branches.

* chore(docs): fix semgrep https warnings

* docs(readme): update shield badges

Co-authored-by: Vinicius Mignot <[email protected]>

* chore(*): add 2.0.0 rockspecs and fix tests

Release 2.0.x introduced some rockspecs with fixes.
Reverting back to 1.3.0 and reapplying changes from 3.0.0
reversed those fixes. This commit reintroduces them.

KAG-2704

---------

Co-authored-by: Vinicius Mignot <[email protected]>
Co-authored-by: Brian Fox <[email protected]>
Co-authored-by: Murillo Paula <[email protected]>
Co-authored-by: Javier <[email protected]>
Co-authored-by: Thijs Schreijer <[email protected]>
Co-authored-by: Mayo <[email protected]>
Co-authored-by: Tomasz Nowak <[email protected]>
Co-authored-by: Chrono <[email protected]>
Co-authored-by: Michael Martin <[email protected]>
Co-authored-by: Jun Ouyang <[email protected]>
Co-authored-by: HansK-p <[email protected]>
Co-authored-by: Qi <[email protected]>
Co-authored-by: Wangchong Zhou <[email protected]>
Co-authored-by: Aapo Talvensaari <[email protected]>
Co-authored-by: saisatish karra <[email protected]>
Co-authored-by: Shuoqing Ding <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants