Feature Request: Hot Reload #25

Closed
elhackeado opened this issue Apr 10, 2024 · 9 comments

Labels
F-Configuration  Functionality relating to configuration
F-Reload  Functionality relating to graceful reloading

Comments

@elhackeado

Feature Description:

Hot reloading functionality will enable river to dynamically reload its configuration file without requiring a restart of the application or service. This capability improves system flexibility, uptime, and ease of maintenance by allowing administrators to make configuration changes on the fly while the application is still running.

How has Nginx implemented it?

In order for nginx to re-read the configuration file, a HUP signal should be sent to the master process. The master process first checks the syntax validity, then tries to apply new configuration, that is, to open log files and new listen sockets. If this fails, it rolls back changes and continues to work with old configuration. If this succeeds, it starts new worker processes, and sends messages to old worker processes requesting them to shut down gracefully. Old worker processes close listen sockets and continue to service old clients. After all clients are serviced, old worker processes are shut down.
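
For concreteness, sending that signal typically looks like the following (the paths assume the default /usr/local/nginx install prefix used in the example below; adjust for your layout):

# Ask the master process to re-read the configuration
kill -HUP $(cat /usr/local/nginx/logs/nginx.pid)

# Equivalent shortcut: the nginx binary looks up the master PID and sends HUP for you
/usr/local/nginx/sbin/nginx -s reload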

Let’s illustrate this by example. Imagine that nginx is run on FreeBSD and the command

ps axw -o pid,ppid,user,%cpu,vsz,wchan,command | egrep '(nginx|PID)'
produces the following output:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1148 pause  nginx: master process /usr/local/nginx/sbin/nginx
33127 33126 nobody   0.0  1380 kqread nginx: worker process (nginx)
33128 33126 nobody   0.0  1364 kqread nginx: worker process (nginx)
33129 33126 nobody   0.0  1364 kqread nginx: worker process (nginx)

If HUP is sent to the master process, the output becomes:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1164 pause  nginx: master process /usr/local/nginx/sbin/nginx
33129 33126 nobody   0.0  1380 kqread nginx: worker process is shutting down (nginx)
33134 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33135 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33136 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)

One of the old worker processes with PID 33129 still continues to work. After some time it exits:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1164 pause  nginx: master process /usr/local/nginx/sbin/nginx
33134 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33135 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33136 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)

[SOURCE] https://nginx.org/en/docs/control.html

Any limitations with Nginx's approach?

Too-frequent hot reloading can make connections unstable and cause business data to be lost.

When NGINX executes the reload command, the old worker process keeps serving existing connections and disconnects automatically once it has processed all remaining requests. However, if a connection is dropped before the client has received responses to all of its requests, the business data for those remaining requests is lost forever. Naturally, this draws the attention of client-side users.

In some circumstances, recycling the old worker process takes so long that it affects normal business operations.

For example, when proxying the WebSocket protocol, NGINX can't know whether a request has finished because it doesn't parse the frame headers. So even though the worker process receives the quit command from the master process, it can't exit until those connections raise exceptions, time out, or disconnect.

Here is another example: when NGINX acts as a reverse proxy for TCP and UDP traffic, it has no notion of request boundaries, so it can't tell when a connection can finally be shut down.

Therefore, the old worker process usually takes a long time to exit, especially in industries like live streaming, media, and speech recognition; sometimes the recycling time can reach half an hour or even longer. Meanwhile, if users reload the server frequently, many shutting-down worker processes pile up and can eventually lead to NGINX running out of memory (OOM), which can seriously affect the business.
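
As a rough operational sketch (not from the article; it just reuses the ps output format shown above), one way to avoid stacking up old workers is to check how many are still draining before issuing another reload:

# Count old workers still draining connections; the [o] stops grep from matching its own command line
ps axw -o command | grep -c 'worker process is shutting d[o]wn'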

APISIX solved this problem in its own way; do check out this article before making any design decisions: https://api7.ai/blog/how-nginx-reload-work

@moderation

Envoy proxy has implemented hot restart and it is used at scale. See Envoy hot restart from Envoy creator @mattklein123 and the recent documentation.

@taikulawo

> Envoy proxy has implemented hot restart and it is used at scale. See Envoy hot restart from Envoy creator @mattklein123 and the recent documentation.

The author is asking about reloading the configuration, not restarting the binary itself. There is a difference.

@jamesmunns
Collaborator

jamesmunns commented Apr 12, 2024

As a note, pingora already supports hot-reload: https://github.com/cloudflare/pingora/blob/main/docs/user_guide/start_stop.md edit: also https://github.com/cloudflare/pingora/blob/main/docs/user_guide/graceful.md

It is likely river will take a similar path, doing a hot-reload (e.g. starting and stopping the binary, but maintaining connections).

It's possible this could be implemented in a way that doesn't require starting a new process, but as this is implemented within pingora itself, it's likely River will mimic their implementation 1:1.
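
For reference, the graceful-upgrade flow those pingora docs describe looks roughly like the sketch below. The binary name, config file, and pid-file path are placeholders; the --daemon/--upgrade/-c flags and the SIGQUIT signal are taken from the linked pages, and River's eventual interface may differ:

# Start the new instance; --upgrade tells it to take over the listening sockets
# from the running instance instead of binding them fresh
./proxy_binary --daemon --upgrade -c proxy.yaml

# Then ask the old instance to shut down gracefully, draining existing connections
kill -QUIT $(cat /run/proxy_old.pid)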

@jamesmunns jamesmunns added F-Configuration Functionality relating to configuration F-Reload Functionality relating to graceful reloading labels Apr 12, 2024
@jamesmunns jamesmunns modified the milestones: Backlog, Kickstart Spike 1 Apr 12, 2024
@jamesmunns
Collaborator

Putting this in the "Backlog" milestone, as I'm not sure if this will make it into River before the end of April, but it might.

@Et7f3

Et7f3 commented Apr 13, 2024

I'll also add another technique that can be applied to processes like Docker containers: https://iximiuz.com/en/posts/multiple-containers-same-port-reverse-proxy/

@elhackeado
Author

> As a note, pingora already supports hot-reload: https://github.com/cloudflare/pingora/blob/main/docs/user_guide/start_stop.md edit: also https://github.com/cloudflare/pingora/blob/main/docs/user_guide/graceful.md
@jamesmunns I believe Pingora's Graceful Upgrade is the way to go. Since Pingora is already battle-tested in production, at this point I would rather rely on Pingora's way of doing it than introduce something new that would need time to mature.

@studersi

There is also a different Rust-based reverse proxy project with a strong focus on changing configurations without any downtime or lost connections: https://github.com/sozu-proxy/sozu.

Quote from their website (https://github.com/sozu-proxy/sozu):

SŌZU is a HTTP reverse proxy built in Rust, that can handle fine grained configuration changes at runtime without reloads, and designed to never ever stop.

I am not sure how they achieve it exactly but their implementation might be worth looking into when designing this feature.

@jamesmunns jamesmunns modified the milestones: Backlog, Kickstart Spike 2 May 24, 2024
@jamesmunns
Collaborator

Noting that this has been scheduled for the just-starting-now milestone; there should be some progress on this in the next few weeks.

@jamesmunns
Collaborator

This was implemented by #49, please feel free to open an issue if there are any follow-on needs!
