small random sleep at renewal to avoid load spikes #1656

Closed
1 task done
jsha opened this issue Jun 15, 2022 · 10 comments · Fixed by #1657
@jsha
Contributor

jsha commented Jun 15, 2022

Welcome

  • Yes, I've searched similar issues on GitHub and didn't find any.

How do you use lego?

Other

Detailed Description

Hiya! I'm an engineer at Let's Encrypt. Lately we've been having severe load spikes at 00:00 UTC every day. We've run into similar issues in the past, and historically the issue has been large numbers of users manually configuring their cron jobs to run renewals at exactly midnight every day. It gets worse - because many of these jobs fail due to overload conditions, the next day the load spike may be bigger!

In Certbot the problem is partly solved by packaging: all the major packaging for Certbot arranges for it to run at a random time throughout the day. However, there are still people who manually set up Certbot in crontab, and they often choose 00:00 UTC to run it. So, some years ago Certbot added some code:

certbot/certbot#6391
certbot/certbot#6596
certbot/certbot#6599

When certbot renew is run non-interactively, it will sleep a random amount of time up to 8 minutes. This helps spread out the midnight load spike significantly.

Right now lego-cli is the top participant in the load spike. During the first 30 seconds after 00:00 UTC today, lego-cli accounted for 33k new-order requests, while the next biggest contributor accounted for only 5.6k new-order requests. For all requests (not just new-order) lego-cli accounted for 173k vs 19k for the next biggest contributor (60% of the total).

Considering all lego-cli traffic to Let's Encrypt, it's very spiky:

[Image: graph of lego-cli traffic to Let's Encrypt]

So, two questions:

  • Could you implement a randomized delay for non-interactive renewals, similar to Certbot's?
  • Are you aware of any major integrations or packaging for lego-cli that includes a cron job or systemd unit that runs at 00:00?

Thanks,
Jacob

@ldez ldez self-assigned this Jun 15, 2022
@ldez
Member

ldez commented Jun 15, 2022

Hello Jacob,

The lego-cli user agent is specific to the lego binary.

Users of lego as a library will not have this user agent.

Sorry, but I don't know which projects use lego as a binary.

I think we can implement a randomized delay in the CLI inside the renew command, I will work on it.

@jsha
Contributor Author

jsha commented Jun 15, 2022

Thanks so much!

@dmke
Member

dmke commented Jun 15, 2022

@jsha, do you have a version breakdown of that spiky chart for lego-cli or can you identify what percentage uses the latest release (lego-cli/4.7.0)?

If the latter number is low, we have a slow adoption rate for updates and it might take time to actually resolve this issue (even with random delay available in v4.8.0).

Regarding Lego-as-library users, https://github.com/go-acme/lego/network/dependents lists a few popular repositories (measured by stars, each > 1k):

Sadly, this list cannot be sorted by stars (I've only skimmed the first few pages), and I don't know if it traverses whole dependency trees. However, this would be a good start for further research.

@ldez
Member

ldez commented Jun 15, 2022

@dmke the users of the library don't have the user agent lego-cli so it's not the right list.

@jsha
Contributor Author

jsha commented Jun 15, 2022

I don't have the version breakdown but I can get it tomorrow. Yep, uptake will take some time - but "the best time to plant a tree was twenty years ago. The second best time is now." :-)

@dmke
Member

dmke commented Jun 15, 2022

> @dmke the users of the library don't have the user agent lego-cli so it's not the right list.

Indeed, they might use xenolf-acme/<version>, unless of course, they've reconfigured it.

In any case, I'll later prepare a PR with some text snippets and configuration examples about "being a good netizen", to put into the README and/or on the website.

@ldez
Member

ldez commented Jun 15, 2022

> Indeed, they might use xenolf-acme/<version>, unless of course, they've reconfigured it.

Not exactly: xenolf-acme/<version> is always present; users can only "append" a user-agent.

@dmke
Member

dmke commented Jun 15, 2022

> users can only "append" a user-agent

Even better :)

dmke added a commit to dmke/lego that referenced this issue Jun 15, 2022
- split examples into multiple pages
- restructure content a bit
- flesh out some sections
- add a section about load spikes (cf. go-acme#1656)
@jsha
Contributor Author

jsha commented Jun 17, 2022

Here's the distribution of versions seen during the interval from 00:00:00 to 00:00:30. It looks as expected, with the nearly-most-recent version heading the pack:

#  Version  Count
1  4.6.0    40,859
2  4.3.1    23,385
3  4.5.3    22,067
4  4.4.0    21,192
5  4.2.0    11,441
6  3.7.0    10,974
7  4.1.3     8,395
8  4.0.1     8,127
9  4.5.2     7,219

@ldez
Member

ldez commented Sep 17, 2022

I'm just adding the blog post Bitnami wrote about this topic.

https://blog.bitnami.com/2022/07/bitnami-lets-encrypt-lego-teams-troubleshooted-bncert-issues.html

> Conclusion
>
> We would like to highlight the great work done by the Lego community for adding those features and thank the Let’s Encrypt team for helping us improve our solutions. The ability of the Lego community and the Let’s Encrypt team to quickly release a new version that added new features to address this issue - and that too, in less than 24 hours - is evidence of the power of the open-source community. They have done an excellent job and we want to recognize them for their great work in creating a highly popular service to enable HTTPS (SSL/TLS) for websites for free.

I really appreciate the thanks from Bitnami/VMware and the highlight on the power of open-source.

mohsenasm added a commit to mohsenasm/swarm-dashboard that referenced this issue Feb 27, 2024
…elay](go-acme/lego#1656) to the `renew` command, but we don't need this delay at the start because it's not an automated task.