Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Severe performance regression after 2.2.0 (debian bullseye -> bookworm) #460

Open
tik-stbuehler opened this issue Jun 13, 2023 · 14 comments
Assignees
Labels

Comments

@tik-stbuehler
Copy link

Describe the bug

Seeing high CPU usage and load on icinga2 master after upgrade to bookworm:

  • 100% vs 20% CPU usage on two (virtual) cores
  • Load about 130 vs less than 3 before
  • Lots of check_ssl_cert checks time out

Bookworm should have version 2.60.0 of check_ssl_cert, but same issue with 2.70.

Using the old 2.2.0 script (with a small patch for the new "-m|--match" option) fixes the issue.

To Reproduce

  • icinga master on two cores of AMD EPYC 7313P
  • 4GB memory (2GB swap) - not an issue, swap is completely free, almost 3GB "avail Mem" in top.
  • about 300 SSL checks: mostly Active Directory Domain Controllers: IPv4+IPv6 (if present), LDAP + LDAPS, normal port + global catalog - quite a few variations for each host.

Expected behavior

Less CPU usage, no timeouts, ...

System (please complete the following information):

  • OS: Debian
  • OS version: bookworm (12)
  • check_ssl_cert version: 2.60.0 and 2.70.0
  • OpenSSL version (openssl version): OpenSSL 3.0.9 30 May 2023 (Library: OpenSSL 3.0.9 30 May 2023)

Additional context/output

Invocations look like this:

/usr/lib/nagios/plugins/check_ssl_cert --no-ssl3 --no-tls1 --no-tls1_1 -H $IPADDRESS -P ldap -c 14 -m $SERVERNAME -p 3268 -r /etc/ssl/trusted-cas.crt -w 30

Manual calls usually work, but are slower than before.

@matteocorti matteocorti self-assigned this Jun 13, 2023
@matteocorti
Copy link
Owner

It could be caused by the check if the port is actually open.
Can you please post the output with the --debug-time command line option?

@tik-stbuehler
Copy link
Author

Wow, quick response - thanks :)

First the output of /usr/bin/time:

  • production system (bullseye, more cores), patched v.2.0 script:
    0.18user 0.04system 0:00.24elapsed 97%CPU (0avgtext+0avgdata 10900maxresident)k
    0inputs+56outputs (6major+31538minor)pagefaults 0swaps
  • qs system (bookworm), patched v2.2.0 script:
    0.78user 0.05system 0:00.96elapsed 87%CPU (0avgtext+0avgdata 11556maxresident)k
    0inputs+0outputs (1major+41023minor)pagefaults 0swaps
  • qs system, v2.70.0 script:
    1.61user 0.12system 0:02.72elapsed 63%CPU (0avgtext+0avgdata 11548maxresident)k
    0inputs+0outputs (0major+96181minor)pagefaults 0swaps

With --debug-time it runs for 4-5 seconds.
(I doubt that a "is the port open" check would cause high CPU usage - and I think the timeouts are only happening due to high CPU usage. Also as said before, I couldn't observe timeouts when I ran the check manually on CLI, even during high load.)

Full --debug-time output (also has a timeout in the icinga2 history this morning, but not a DC):
check_ssl_cert_debug_time.txt

I didn't see anything suspicious (timing wise), only [DBG 0000s] '/usr/bin/openssl s_client' supports '-name': using icinga2 looks wrong (icinga2 happens to be the hostname of the system running the check)

@matteocorti
Copy link
Owner

I just installed bookworm and something is very strange: a standard check on github.com takes more than 32 seconds. On all the other Linux systems (Fedora, Debian) it takes less than 4 seconds.

I'll investigate further

@matteocorti
Copy link
Owner

Something to do with /bin/timeout:

Example

corti@debian check_ssl_cert> time /bin/curl    --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\"

real	0m0.010s
user	0m0.009s
sys	0m0.001s
corti@debian check_ssl_cert> time /bin/timeout 105 /bin/sh -c "/bin/curl    --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\""

real	0m5.039s
user	0m0.006s
sys	0m0.010s

@matteocorti
Copy link
Owner

Nope ist 'curl'

corti@debian check_ssl_cert> time /bin/curl --user-agent 'check_ssl_cert/2.70.0' --location http://cacerts.digicert.com/DigiCertGlobalRootCA.crt --output /dev/null 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   947  100   947    0     0    187      0  0:00:05  0:00:05 --:--:--   198

real	0m5.052s
user	0m0.000s
sys	0m0.019s

@matteocorti
Copy link
Owner

It really seems a problem with curl:

corti@debian check_ssl_cert> time curl github.com

real	0m5.078s
user	0m0.030s
sys	0m0.017s

This takes around 10 ms on all other machines and systems I tested with.

As the problem is not related to the script, I'll close the issue.

@tik-stbuehler
Copy link
Author

I'm quite certain it isn't timeout/curl (should also be visible in the --debug-time log):

# time /bin/curl --user-agent 'check_ssl_cert/2.70.0' --location http://cacerts.digicert.com/DigiCertGlobalRootCA.crt --output /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   947  100   947    0     0  55669      0 --:--:-- --:--:-- --:--:-- 59187

real    0m0.023s
user    0m0.007s
sys     0m0.000s
# time /bin/timeout 105 /bin/sh -c "/bin/curl    --silent --user-agent 'check_ssl_cert/2.70.0' --location \"http://cacerts.digicert.com/DigiCertGlobalRootCA.crt\""

real    0m0.024s
user    0m0.007s
sys     0m0.003s

Given what I was seeing in top I suspect there are more (expensive?) openssl calls than before.

@waja
Copy link
Contributor

waja commented Jun 13, 2023

$ time curl github.com

real	0m0.088s
user	0m0.010s
sys	0m0.022s
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

@matteocorti
Copy link
Owner

Ok then it's only on my machine ...

@matteocorti matteocorti reopened this Jun 14, 2023
@matteocorti
Copy link
Owner

And the issue is gone (as of today)

$ time curl github.com

real	0m0.060s
user	0m0.007s
sys	0m0.004s
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

@matteocorti
Copy link
Owner

Given what I was seeing in top I suspect there are more (expensive?) openssl calls than before.

Could be, but should not have such a big impact

@matteocorti
Copy link
Owner

I looked at your output with the timing (total around 5 seconds). The only calls to OpenSSL that I see are:

  1. The main one with openssl s_client to fetch the certificate (cannot be skipped)
  2. Three calls to openssl x509 to check the three elements of the chain (these are just using local data and should not be expensive). You can try to skip these checks with --first-element-only

Do you get longer times only with LDAP checks or generally even with HTTPS?

@tik-stbuehler
Copy link
Author

Found some time to take a look at this again.

I built a wrapper script to measure openssl times, comparing debian bookworm openssl (OpenSSL 3.0.13) with a locally built OpenSSL 1.1.1w (in ~/src/openssl, built with ./config + make -j8).

#!/bin/sh

exec 9>/dev/tty
echo >&9 "TIME: openssl $@"
# exec /usr/bin/time -o /dev/fd/9 /usr/bin/openssl "$@"
LD_LIBRARY_PATH=~/src/openssl exec /usr/bin/time -o /dev/fd/9 ~/src/openssl/apps/openssl "$@"
  • patched v2.2.0 script, openssl 1.1.1w: 0.29user
  • patched v2.2.0 script, openssl 3.0.13: 0.81user
  • HEAD (bf0addc), openssl 1.1.1w: 0.59user
  • HEAD (bf0addc), openssl 3.0.13: 1.87user

It seems openssl x509 ... takes at least 20ms or more with openssl 3.0.13, and "nothing" in openssl 1.1.1w, and the more recent version of check_ssl_certs extracts way more data from the certificates (and/or from more certificates?).

@tik-stbuehler
Copy link
Author

Setting SSL_CERT_FILE= when extracting data with openssl helps a lot (my locally built openssl 1.1.1w tries to load a non-existing file from /usr/local); no idea why it loads the trusted CAs by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants