bug: 100% CPU usage of worker process caused by healthcheck impl with error (Failed to release lock) #9775
Comments
@alptugay Which version of lua-resty-healthcheck were you running when the error occurred? Do the upstream servers change status frequently at that time, i.e. pods restarting or shutting down, new upstreams being added, or transient network errors? I suggest using lua-resty-healthcheck 2.2.3 and increasing the size of the shared dict. And what is the value of the related configuration item?
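If the suggestion is about a health-check shared dict running out of space, its usage can be logged from Lua: OpenResty shared dicts expose capacity() and free_space(). A minimal diagnostic sketch follows; the dict name "upstream-healthcheck" and the helper function are assumptions for illustration only, substitute whichever dict is actually meant here.

```lua
-- Diagnostic sketch: log how full a shared dict is.
-- "upstream-healthcheck" is an assumed dict name, not confirmed by this issue.
-- capacity() and free_space() are available on recent OpenResty releases.
local function report_dict_usage(name)
    local dict = ngx.shared[name]
    if not dict then
        ngx.log(ngx.ERR, "shared dict not found: ", name)
        return
    end
    ngx.log(ngx.WARN, name,
            ": capacity=", dict:capacity(), " bytes",
            ", free=", dict:free_space(), " bytes")
end

report_dict_usage("upstream-healthcheck")
```

If free space stays near zero, increasing the dict size (via the lua_shared_dict directive that defines it) would be the corresponding fix.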
Maybe it's the same as #9015.
Hi @leslie-tsang, their team is monitoring the status; they will update when necessary.
Due to the lack of the reporter's response this issue has been labeled with "no response". It will be closed in 3 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
We have encountered the issue once again. Our current findings are:
APISIX INSTANCE 1
APISIX INSTANCE 2
APISIX INSTANCE 3
APISIX INSTANCE 4
For example:
Upstream: 58e0a3b27292c0d9267ce78a7dfa45ab
Upstream: f51cbfa1fac84b2110ba93f947c93313
XXX.XXX.XXX.29:32337 is one of the upstreams of 58e0a3b27292c0d9267ce78a7dfa45ab
You can ignore
I think the SSL errors are related to our finding number 3.
This is a normal log (after we kill the affected workers) for the related service; it is normal that all upstreams return 502 because of our finding number 3.
But this is an unusual log; the logs related to this service during the problem look like the one below:
Please look at upstream, upstream_status, upstream_connect_time and upstream_response_time. They look incomplete; they all end with a colon character.
Hello @alptugay, the issue is being fixed; check api7/lua-resty-healthcheck@f1d81af for more information.
@alptugay have you tried this fix? api7/lua-resty-healthcheck@f1d81af
@alptugay you can try APISIX version >= 3.4, where the lua-resty-healthcheck library has been upgraded.
Fixed by api7/lua-resty-healthcheck@f1d81af
Current Behavior
Our APISIX instance has more than 2500 routes and upstreams, all of which have active health checks enabled. Sometimes (once or twice a week) we see that one or more workers use 100% of the CPU. At the same time we see the following error log:
We have encountered this situation on multiple instances and multiple times:
At the exact same time we can see that the CPU core running this worker starts to use 100% CPU.
That CPU core also sees an increase in time spent.
Sockstat usage increases as well:
After killing the worker process it returns to normal (by the way, we can't kill the process gracefully; we have to use "kill -9").
We are normally using APISIX 3.2.1, but to solve the issue we cherry-picked v3.0.0 of lua-resty-healthcheck, because the lock mechanism seems to have changed; however, that caused a massive memory leak, so we reverted.
All of our upstreams have the following timeout and healthcheck values:
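The actual values were attached to the report and are not reproduced here. Purely as a hypothetical illustration of what such a configuration looks like, this is roughly the shape of the checks table that lua-resty-healthcheck (and therefore an APISIX upstream's checks object) accepts; every number and path below is an assumption, not the reporter's configuration.

```lua
-- Hypothetical illustration only: the general shape of an active health-check
-- configuration for lua-resty-healthcheck. None of these values come from the issue.
local checks = {
    active = {
        type = "http",
        http_path = "/healthz",   -- assumed probe path
        timeout = 1,              -- probe timeout, in seconds
        healthy = {
            interval = 2,         -- seconds between probes of healthy targets
            successes = 2,        -- consecutive successes to mark a target healthy
        },
        unhealthy = {
            interval = 1,         -- seconds between probes of unhealthy targets
            http_failures = 3,
            tcp_failures = 3,
            timeouts = 3,
        },
    },
}
```

With 2500+ upstreams, each active checker probes its targets on these intervals from ngx.timer contexts, which is presumably why a misbehaving lock in that path can keep a worker busy.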
There are some other abnormalities as well; I don't know whether they are related, so I'll briefly share them too:
We see lots of upstream timeout errors even though the upstreams are healthy and running. These connections seem to be related to a K8s watch API (not sure).
We have Tengine running in parallel with APISIX; both have health checks enabled. But the connection states are very different: for example, on Tengine we see fewer connections in TimeWait and more inUse:
However, in APISIX we see more in TimeWait and fewer inUse:
When we disable health checks, it returns to a state similar to Tengine's.
Our config.yml
Expected Behavior
CPU usage should not reach 100%.
Error Logs
2023/07/04 03:15:53 [error] 161670#161670: 27856946269 [lua] healthcheck.lua:1150: log(): [healthcheck] (upstream#/apisix/routes/461091278096960700) failed to release lock 'lua-resty-healthcheck:upstream#/apisix/routes/461091278096960700:target_lock:...:8500': unlocked, context: ngx.timer, client: ...*, server: 0.0.0.0:80
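For context on this message: the lock key in the log (…:target_lock:…) is the library's per-target lock, and the trailing error "unlocked" matches what lua-resty-lock's unlock() returns when the lock object is not (or no longer) holding the lock, for example because it has already been released. Below is a minimal sketch of that acquire/work/release pattern, with a hypothetical dict name and timings; this is not the actual healthcheck.lua code.

```lua
-- Minimal sketch of the lock/unlock pattern behind
-- "failed to release lock ...: unlocked". Not the actual healthcheck.lua code.
-- Assumes an nginx directive such as: lua_shared_dict my_locks 1m;
local resty_lock = require "resty.lock"

local function with_target_lock(key, fn)
    local lock, err = resty_lock:new("my_locks", { exptime = 10, timeout = 5 })
    if not lock then
        return nil, "failed to create lock: " .. err
    end

    local elapsed, lerr = lock:lock(key)
    if not elapsed then
        return nil, "failed to acquire lock '" .. key .. "': " .. lerr
    end

    local ok, res = pcall(fn)   -- the protected work, e.g. updating a target's status

    -- unlock() returns nil, "unlocked" when this lock object is not (or no
    -- longer) holding the lock; that is the error string seen in the log above.
    local released, uerr = lock:unlock()
    if not released then
        ngx.log(ngx.ERR, "failed to release lock '", key, "': ", uerr)
    end

    if not ok then
        return nil, res
    end
    return true
end
```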
Steps to Reproduce
Seems to be random, but we have 2500+ routes/services and upstreams, all with active health checks enabled.
Environment
nginx version: openresty/1.21.4.1
built with OpenSSL 1.1.1s 1 Nov 2022
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -DAPISIX_BASE_VER=1.21.4.1 -DNGX_GRPC_CLI_ENGINE_PATH=/usr/local/openresty/libgrpc_engine.so -DNGX_HTTP_GRPC_CLI_ENGINE_PATH=/usr/local/openresty/libgrpc_engine.so -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/usr/local/openresty/openssl111/include' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.33 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.09 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.21 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.9 --add-module=../ngx_stream_lua-0.0.11 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-rpath,/usr/local/openresty/wasmtime-c-api/lib -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -L/usr/local/openresty/openssl111/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib:/usr/local/openresty/openssl111/lib' --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../mod_dubbo-1.0.2 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../ngx_multi_upstream_module-1.1.1 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0/src/stream --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../apisix-nginx-module-1.12.0/src/meta --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../wasm-nginx-module-0.6.4 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../lua-var-nginx-module-v0.5.3 --add-module=/builds/platform/infra/load-balancer/packages/apisix/apisix-3.2.1/openresty-1.21.4.1/../grpc-client-nginx-module-v0.4.2 --with-poll_module --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --with-compat --with-stream --with-http_ssl_module