nixos-rebuild switch spends too much time checking ACME certs #62958

Closed
majewsky opened this issue Jun 10, 2019 · 3 comments · Fixed by #91121

@majewsky
Contributor

majewsky commented Jun 10, 2019

I have a NixOS server hosting websites on multiple domains. It provides nearly a dozen unique domains:

$ systemctl show acme-certificates.target | grep Wants | tr ' ' '\n' | wc -l         
11

I noticed that sudo nixos-rebuild switch is taking a rather long time on that server, and went to investigate. With no changes to the configuration.nix or the channels, it takes almost 20 seconds to do a no-op switch:

$ ( echo start; sudo nixos-rebuild switch 2>&1; echo done ) | while read LINE; do echo "$(date +%s.%N) $LINE"; done
1560184891.708082156 start
1560184892.703134201 building Nix...
1560184893.479601133 building the system configuration...
1560184899.042561931 updating GRUB 2 menu...
1560184899.173778641 activating the configuration...
1560184899.272107865 setting up /etc...
1560184899.786263303 reloading user units for stefan...
1560184899.902325272 setting up tmpfiles
1560184910.582049876 done
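The time spent in each step can be read off by subtracting consecutive timestamps. A minimal sketch, assuming the log has the `EPOCH.NANOS message` shape shown above (the function name `step_durations` is made up for illustration):

```shell
# Print "ELAPSED_SECONDS step" for every step except the last,
# given "EPOCH.NANOS message" lines on stdin.
step_durations() {
  # LC_ALL=C keeps "." as the decimal separator regardless of locale.
  LC_ALL=C awk '
    NR > 1 { printf "%.3f %s\n", $1 - prev_ts, prev_msg }
    {
      prev_ts = $1
      $1 = ""                    # drop the timestamp field
      prev_msg = substr($0, 2)   # strip the leading separator left behind
    }
  '
}
```

Piping the timestamped output above through `step_durations` puts the interval between "setting up tmpfiles" and "done" at roughly 10.7 seconds, and the "building the system configuration..." step at roughly 5.6 seconds.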

After some further digging, it turns out that a rather large amount of time (10 seconds) is spent in this systemctl start invocation. In particular, what's taking a long time is

$ time sudo systemctl start acme-certificates.target
sudo systemctl start acme-certificates.target  0,01s user 0,02s system 0% cpu 10,227 total

This starts every acme-$DOMAIN.service unit (apparently one by one, without any parallelization, if the timing is any indication). This should be no big deal because no certificates need to be renewed most of the time, but simp_le, being a Python program, just takes about a second for anything because of the whole interpreter-startup and reading-code and compiling steps.
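If each of the 11 units costs about a second and they run one after another, that alone would account for roughly the 10 seconds measured above. A rough way to check the per-unit cost is to time each unit on its own. This is only a sketch: `time_labeled` is a hypothetical helper, and it assumes GNU `date` (for `%N`, as already used above).

```shell
# Run a command and print "ELAPSED_SECONDS LABEL".
# The first argument is the label; the rest is the command to time.
time_labeled() {
  label=$1
  shift
  start=$(date +%s.%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s.%N)
  LC_ALL=C awk -v s="$start" -v e="$end" -v l="$label" \
    'BEGIN { printf "%.3f %s\n", e - s, l }'
}
```

On the server this could be used along the lines of `for unit in $(systemctl show -p Wants --value acme-certificates.target); do time_labeled "$unit" sudo systemctl start "$unit"; done`, assuming a systemd version that supports `--value`.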

As a user, I would like nixos-rebuild switch to take less time, to make the edit-rebuild-test loop more fluid. I can see two major avenues:

  1. Since a replacement for simp_le is being considered in #34941 ("security.acme: simp_le -> dehydrated or certbot or acme.sh or lego"), I would like the replacement to be something that doesn't start an entire dynamic-language interpreter, so that systemctl start acme-$DOMAIN.service finishes in less than a second.

  2. From the timings that I've seen, systemctl start acme-certificates.target appears to be starting its constituent units serially. Can we enable some parallelization here? (Or is it already parallelizing, but without it helping the overall runtime?)

Technical details

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 4.19.48, NixOS, 19.03.172865.4fb3b869e21 (Koi)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.2.2`
 - channels(root): `"nixos-19.03.172865.4fb3b869e21"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
matthewbauer added this to the 19.09 milestone on Jun 10, 2019
@arianvp
Member

arianvp commented Jun 18, 2019

apparently one by one, without any parallelization, if the timing is any indication.

I'm almost certain this is not the case. The cert services don't depend on each other, so they should start up in parallel, unless simp_le does its own locking and retrying when multiple instances of it are running. You could verify this with the output of systemd-analyze dot, which prints the unit dependency graph.
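Besides looking at the dependency graph, another rough check is to compare when each cert service last started and exited. If no two run intervals overlap, the units effectively ran one after another. The sketch below is an assumption-laden illustration: `overlap_check` and `acme_intervals` are hypothetical helper names, and the timestamps come from the `ExecMainStartTimestampMonotonic` and `ExecMainExitTimestampMonotonic` properties that systemd reports for service units.

```shell
# Print "parallel" if any two intervals overlap, otherwise "serial".
# Input: one "START END" pair (e.g. monotonic microseconds) per line on stdin.
overlap_check() {
  sort -n | LC_ALL=C awk '
    NR > 1 && $1 + 0 < max_end { overlapping = 1 }   # started before an earlier unit finished
    $2 + 0 > max_end { max_end = $2 + 0 }
    END { print (overlapping ? "parallel" : "serial") }
  '
}

# Collect "START END" for every unit wanted by acme-certificates.target.
# Only meaningful on a systemd host after the units have run at least once;
# a unit that has never run reports 0 for both values.
acme_intervals() {
  for unit in $(systemctl show -p Wants --value acme-certificates.target); do
    start=$(systemctl show -p ExecMainStartTimestampMonotonic --value "$unit")
    end=$(systemctl show -p ExecMainExitTimestampMonotonic --value "$unit")
    echo "$start $end"
  done
}
```

After a `systemctl start acme-certificates.target`, running `acme_intervals | overlap_check` would then give a first indication of whether the services overlapped in time.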

Anyhow, I've been rewriting the acme module because it has a few bugs I ran into. Perhaps I can tackle this one too. Do you have any suggestions on how we should tackle this? I'll open a PR for the rewrite soonish; it's still all local for now. One of the things it does is remove acme-certificates.target and make nginx/lighttpd depend directly on the corresponding services, to fix #60180. However, I don't think that will solve the slow startup times.

The reason they get started on every switch is that nginx.service has a Wants=acme-certificates.target (https://github.com/arianvp/nixpkgs/blob/master/nixos/modules/security/acme.nix#L359), or in my case a list of acme-${cert}.service units. I wonder if we can drop that. However, I think that will break the first-time bootstrap of certs, so we'll have to find a way to get them started the first time you deploy the certificates.

@majewsky
Contributor Author

majewsky commented Jun 21, 2019

I wonder if we can drop that. However, I think that will break the first-time bootstrap of certs, so we'll have to find a way to get them started the first time you deploy the certificates.

Yes, that's the big issue. But I've just been thinking, and I have an idea how we could have our cake and eat it too. We could introduce a new set of services, e.g. acme-ensure-$cert.service, that is identical to acme-$cert.service except that the script starts with something like

test -f /path/to/cert.pem && exit 0

So this service would only call simp_le if a bootstrap is required. Then we could have Wants=acme-ensure-*.service on nginx.service and it would still be fast.
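To make the idea concrete, the guard could be wrapped in a small helper like the one below. This is only a sketch of the proposal, not how the acme module works: `ensure_cert` is a hypothetical name, and the certificate path and ACME client command are placeholders supplied by the caller.

```shell
# Run the given command only when the certificate file does not exist yet.
# Usage: ensure_cert CERT_PATH COMMAND [ARGS...]
ensure_cert() {
  cert_path=$1
  shift
  if [ -f "$cert_path" ]; then
    return 0   # certificate already bootstrapped; skip the slow ACME client
  fi
  "$@"         # first deployment: run the real issuance command
}
```

Note that the guard only tests for the file's existence, so renewing an existing certificate would still have to happen through the regular acme-$cert.service.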

I don't know if this is actually feasible, especially with regard to the acme-selfsigned-* machinery, and I'm also not entirely sure the extra complexity of introducing another set of services is worth it. It just crossed my mind, and I figured it's a good enough idea to put up for consideration.

arianvp mentioned this issue on Jul 1, 2019
@stale

stale bot commented Jun 2, 2020

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

stale bot added the "2.status: stale" label (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) on Jun 2, 2020
m1cr0man mentioned this issue on Jun 19, 2020