
[2.6 beta1 w/ dco] server side explicit-exit-notify not working #189

Closed

Originalimoc opened this issue Dec 7, 2022 · 49 comments

Labels: enhancement, non-trivial change (this is a larger code change, not a simple bugfix)

Comments

Originalimoc commented Dec 7, 2022

Describe the bug
Server and client are both 2.6 beta1 w/ DCO.
server:
udp
explicit-exit-notify 1
client:
udp
explicit-exit-notify 2
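
(For reference, a minimal pair of configs matching the fragments above might look like the sketch below; the certificate/key paths, addresses, and port are placeholders, not taken from the actual setup.)

# server.conf (sketch; cert/key paths and network values are placeholders)
proto udp
port 1194
dev tun
server 10.8.0.0 255.255.255.0
ca ca.crt
cert server.crt
key server.key
dh dh.pem
explicit-exit-notify 1

# client.conf (sketch)
client
proto udp
remote vpn.example.com 1194
dev tun
ca ca.crt
cert client.crt
key client.key
explicit-exit-notify 2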

To Reproduce
Establish a connection with the TLS config first, then send the server a SIGUSR1/SIGHUP/SIGTERM.
The server will log (this one is SIGTERM):

2022-12-07 11:26:09 event_wait : Interrupted system call (fd=-1,code=4)
2022-12-07 11:26:09 SENT CONTROL [Client]: 'RESTART' (status=1)
2022-12-07 11:26:11 Closing DCO interface

but the client receives nothing and logs nothing; a manual SIGUSR1 on the client is needed to re-establish the connection.

Expected behavior
Client receives RESTART, then generates an internal SIGUSR1, so the client reconnects after the server restarts.

Version information (please complete the following information):

  • OS: Ubuntu 22.04
  • OpenVPN version: 2.6 beta1 w/ dco
schwabe (Contributor) commented Dec 8, 2022

I cannot reproduce this.

Server side:

/src//openvpn/openvpn --server 10.33.0.0 255.255.255.0 --server-ipv6 fd00:f00f::1/64  --topology subnet --cert ~/bclient.pem --key ~/bclient.pem --dev tun --verb 4  --data-ciphers chacha20-poly1305:AES-256-GCM:AES-128-GCM:AES-192-GCM --tun-mtu 1400 --dh none --keepalive 10 25 --config ~/fp --push "explicit-exit-notify 3"  --verb 4  --explicit-exit-notify 1
[...]
2022-12-08 12:07:23 us=551539 DCO device tun0 opened
[...]
^C2022-12-08 12:07:29 us=150785 event_wait : Interrupted system call (fd=-1,code=4)
2022-12-08 12:07:29 us=150854 SENT CONTROL [test-client]: 'RESTART' (status=1)
^C2022-12-08 12:07:30 us=795453 event_wait : Interrupted system call (fd=-1,code=4)

Client side:

2022-12-08 13:07:25 Initialization Sequence Completed
2022-12-08 13:07:29 Connection reset command was pushed by server ('')
2022-12-08 13:07:29 Closing TUN/TAP interface

Originalimoc (Author):

Not only this, but also this (reneg set to 15 seconds for debugging):

TLS: soft reset sec=16/15 bytes=0/-1 pkts=0/0
TLS key negotiation failed to occur within 5 seconds (check your network connectivity)
TLS Error: TLS handshake failed
TLS: move_session: dest=TM_LAME_DUCK src=TM_ACTIVE reinit_src=1
TLS Error: TLS key negotiation failed to occur within 5 seconds (check your network connectivity)
TLS Error: TLS handshake failed
TLS Error: TLS key negotiation failed to occur within 5 seconds (check your network connectivity)
TLS Error: TLS handshake failed
...

Something is wrong with the control channel.

Originalimoc (Author):

I'm using dh and tls-crypt; try adding those two? data-ciphers is AES-128-GCM.

schwabe (Contributor) commented Dec 8, 2022

reneg-sec 15 is an extremely low value, and you seem to have a 5 s timeout there as well, so I am not surprised that your control channel breaks with such aggressive timings. So please try to reproduce the issue with sensible timings. Convoluting this issue with aggressive timings that may cause other problems is not helpful here.
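
(For context, the stock values being referenced are, per the OpenVPN 2.6 manual, roughly as in the sketch below; double-check against your own man page.)

# defaults the report deviates from (sketch)
reneg-sec 3600    # data-channel key renegotiation interval
hand-window 60    # window in which the TLS handshake must complete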

Originalimoc (Author) commented Dec 8, 2022

My network is super fast/low latency with effectively 0% packet loss. And yes, this happened on the default 3600. I had never even used reneg-*/hand-window configs before.

Originalimoc (Author) commented Dec 8, 2022

Add dh and tls-crypt to your config to see if you can reproduce it; adding the cannotreproduce label is too early.

PS: my network, ping -i 0 -c 1000:

--- 10.0.0.1 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 16792ms
rtt min/avg/max/mdev = 67.754/69.313/71.813/0.329 ms, pipe 5, ipg/ewma 16.808/69.555 ms

So if TLS doesn't complete within even one second, I can safely assume it has failed (a test via SIGUSR1):

45:58 us=957848 Restart pause, 1 second(s)
45:59 us=958289 Re-using SSL/TLS context
46:00 us=267117 Initialization Sequence Completed

I'll move on without reneg; it doesn't seem to matter with AES-128.

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc I cannot reproduce this. You need to provide proper instructions to reproduce it, as I cannot. dh and tls-crypt do not make a difference here. At least I posted my whole configuration, whereas you only posted some fragments.

Please also try to reproduce this with a minimal configuration, without all kinds of extra/unusual options, to figure out whether it is related to one of the options you seem to be using. Try removing tls-crypt and see if you can still reproduce it, and so on.

Originalimoc (Author):

Full config for both sides: #191 (comment)

cron2 (Contributor) commented Dec 8, 2022 via email

Originalimoc (Author) commented Dec 8, 2022

It does happen with defaults. I already stated from the very beginning that reneg was set to 15 seconds for debugging.

cron2 (Contributor) commented Dec 8, 2022 via email

cron2 (Contributor) commented Dec 8, 2022 via email

Originalimoc (Author):

To me both look like something is wrong with the control channel, so I posted them both here. And the full config is given: #191 (comment)

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc It did not happen for me with default settings, as I already posted in my comment. So currently this is not reproducible. You are adding all kinds of questionable non-default settings like hand-window 5, sndbuf, replay-window 5000 3 to your configs. I have to assume that the rest of your setup is just as sketchy as your configuration.

To repeat myself: unless you provide a proper way to reproduce this issue, I consider it non-reproducible. You have not even provided full logs from server and client yet.

Originalimoc (Author):

After disabling server-side DCO, both problems (EEN/RENEG) go away. And hand-window 5/reneg 15 s works perfectly even though it's "aggressive".

Nah...

cron2 (Contributor) commented Dec 8, 2022 via email

cron2 (Contributor) commented Dec 8, 2022 via email

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc let me tell you, as someone who actually understands OpenVPN, that these timing settings can cause problems that you do not even understand. But if you think you are more knowledgeable about OpenVPN than I am, then please fix the issue yourself and post a patch instead of questioning our knowledge of OpenVPN.

Originalimoc (Author) commented Dec 8, 2022

In case you didn't read the above, I already said:
1. This happened on the default 3600 s reneg; reneg 15 seconds was just for debugging.
2. I had never even used reneg-*/hand-window configs before, for 5+ years.
3. My network can very reliably complete a full TLS handshake in 0.3 seconds, as #189 (comment) shows.

Originalimoc (Author):

It's not my job either. I can move on without reneg if necessary. It's all under a watchdog anyway; I can just auto-SIGUSR1 and restart it all within 0.5 seconds. I reported this hoping for an improvement; if you can patch it, great. I'm not really trying to enter arguing mode. Having a bad day, huh @swb?

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc if you are not going to help us figure out why it is not working on your side, I am just going to close the ticket.

Originalimoc (Author):

You're not even using the same config, yet you say "it's not reproducible".

Originalimoc (Author):

@cron2 hand-window 5, reneg-sec, tran-window WERE NOT THERE WHEN THE PROBLEM FIRST APPEARED. Can you read, FGS?

Originalimoc reopened this Dec 8, 2022
cron2 (Contributor) commented Dec 8, 2022

@cron2 hand-window 5, reneg-sec, tran-window WERE NOT THERE WHEN THE PROBLEM FIRST APPEARED. Can you read, FGS?

Whatever "FGS" is supposed to stand for, but yes, I can read. And I read that you are bringing up new and unrelated stuff all the time, which eats my time.

cron2 (Contributor) commented Dec 8, 2022

The original problem is caused by your aggressive reconnect timers.

In DCO mode (EEN in control channel) the server waits for 3 seconds after sending the EEN so all clients have time to ACK the control-channel message. Your client connects so fast that it hits the server "going down", and then the server goes down "for real" and will not send another EEN.

If you look properly into your client log files, you see that it receives the RESTART, reconnects, and then the server disappears (without another RESTART). We're going to fix this server-side by not allowing reconnect in this 2-3 second time frame.

Originalimoc (Author):

ForGodSake.
OK.
No.
You still didn't read.

"client log files, you see that it receives the RESTART"

No:

"but client receives nothing/logs nothing, need a manual SIGUSR1 on client to reestablish connection."

Original problem? Let's turn off reneg (0) for now. So you mean connect-retry?
From the documentation, EEN is send-and-forget and doesn't need an ACK?
So messy right now.

Originalimoc (Author):

We'd better start clean: a new issue post with all the info packed into one.

cron2 (Contributor) commented Dec 8, 2022

EEN with DCO needs to be sent over the control channel, and control-channel packets always need an ACK. This is the way the OpenVPN protocol works (we could make the wait time shorter if all ACKs are in, but this would be a much larger code change, which nobody had time and interest to implement).

Old-style EEN is sent as part of the data channel (OCC messages), which does not work with DCO.

cron2 (Contributor) commented Dec 8, 2022

We'd better start clean: a new issue post with all the info packed into one.

Do not randomly open new issues. One issue per problem, relevant content in there, non-relevant content left out. If you feel part of your contributions have not been relevant, delete them.

Originalimoc (Author) commented Dec 8, 2022

DONE, NOT A BUG?
The kernel module with the data channel continues to work, but the OpenVPN process itself entered a bad state where it can no longer receive RESTART; yes, it can only receive it once per process lifecycle.

Can the wait between "SENT CONTROL [Client]: 'RESTART' (status=1)" and "SIGUSR1[hard,] received, process restarting" be removed? It takes 2 seconds, then another 3-second restart pause, so at least a 5-second wait is needed on the client side. That's way too long, enough to take a round trip to the moon.

A better man page is needed, with a warning regarding --connect-retry.

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc either change your attitude or go away. Telling us things like "for god's sake" and questioning our ability to read is something I perceive as an insult. I do not like working with people who insult me.

Originalimoc (Author) commented Dec 8, 2022

I haven't liked talking to you from the beginning either, thanks. Getting angry afterwards. c2 is clearly actually more knowledgeable on the subject.

cron2 pushed a commit that referenced this issue Dec 12, 2022
Currently we still allow clients to connect while the server is waiting
to shut down. This window is very small (2s) and is only used when
explicit-exit-notify is enabled on the server side.

The chance of a client connecting during this time period is very low
unless someone puts something stupid like --connect-retry 1 3 into his/her
client config and forces the client to reconnect during this time period.

Github: #189

Signed-off-by: Arne Schwabe <[email protected]>
Acked-by: Gert Doering <[email protected]>
Message-Id: <[email protected]>
URL: https://www.mail-archive.com/[email protected]/msg25638.html
Signed-off-by: Gert Doering <[email protected]>
(cherry picked from commit 7d0a903)
cron2 pushed a commit that referenced this issue Dec 12, 2022
Currently we still allow clients to connect while the server is waiting
to shut down. This window is very small (2s) and is only used when
explicit-exit-notify is enabled on the server side.

The chance of a client connecting during this time period is very low
unless someone puts something stupid like --connect-retry 1 3 into his/her
client config and forces the client to reconnect during this time period.

Github: #189

Signed-off-by: Arne Schwabe <[email protected]>
Acked-by: Gert Doering <[email protected]>
Message-Id: <[email protected]>
URL: https://www.mail-archive.com/[email protected]/msg25638.html
Signed-off-by: Gert Doering <[email protected]>
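
(As a hedged aside on the directive called out in the commit message: --connect-retry takes an initial retry interval plus an optional backoff cap, so a client line like the sketch below retries roughly every 1-3 seconds, which is what makes it likely to reconnect inside the server's shutdown window.)

# client.conf fragment (illustrative only, not from the reporter's config)
connect-retry 1 3    # retry after 1 s, exponential backoff capped at 3 s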
cron2 (Contributor) commented Dec 12, 2022

This patch (from @schwabe, as the subject-matter expert on this) will sidestep the problem somewhat: when "in the process of shutting down", the server will no longer accept new connections from clients. So the race condition between "the client reconnects very quickly" and "the server is not yet restarted" will no longer lead to clients thinking they have a valid connection that is no longer valid (= the client has to wait for a ping timeout).

Yes, we should probably rework the server-side code to take less time to actually shut down, but the openvpn event loop is complex and not very good at doing things "really quick now!". So this will not happen in the 2.6 beta cycle, but we might revisit it in the "refactor big parts in master, after a release has been done" phase.
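
(To illustrate the idea only: the sketch below is not the actual OpenVPN patch and uses made-up names. It models the guard described above: while the server sits in its post-EEN shutdown wait, new connects are refused, so a fast-reconnecting client cannot attach to an instance that is about to exit.)

/* Sketch only, with hypothetical names; not OpenVPN source code. */
#include <stdbool.h>
#include <stdio.h>

enum server_state { SERVER_RUNNING, SERVER_EEN_WAIT, SERVER_STOPPED };

/* Accept new clients only while the server is fully running. */
static bool accept_new_client(enum server_state state)
{
    return state == SERVER_RUNNING;
}

int main(void)
{
    printf("connect while running:   %s\n",
           accept_new_client(SERVER_RUNNING) ? "accepted" : "rejected");
    printf("connect during EEN wait: %s\n",
           accept_new_client(SERVER_EEN_WAIT) ? "accepted" : "rejected");
    return 0;
}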

cron2 added the labels "enhancement" and "non-trivial change (this is a larger code change, not a simple bugfix)" and removed the "cannotreproduce" label, Dec 12, 2022
schwabe (Contributor) commented Dec 12, 2022

And I am not sure that the extra logic/code for doing a really quick shutdown is worth adding. Aside from this ticket, I have never heard of the shutdown time of a server that waits 5 s causing problems. So adding 30+ lines of code to fix such an obscure problem is not worth the complexity it adds.

Originalimoc (Author) commented Dec 12, 2022

"Something stupid like --connect-retry 1 3" LOL.
No, the inability to maintain a good internal state and to instant reconnect in case of endpoint restart is stupid instead. And adding some arbitrary 60s/5s/3s/2s timeout without explain-able good reason and can break until a manual restart if peer "misbehave" is another extra stupid point. My upper tunnel can behave stateless and never fatally break if either side does anything configurable, at most wait a few sec it behave as new.

Originalimoc (Author) commented Dec 12, 2022

It can/needs to be a resource concern, but not an architectural one. Unless you're EXPECTING the server/client of a session to stop working after an unexpected incoming connection, this is a serious bug.

cron2 (Contributor) commented Dec 12, 2022

"Something stupid like --connect-retry 1 3" LOL. No, the inability to maintain a good internal state and to instant reconnect in case of endpoint restart is stupid instead. And adding some arbitrary 60s/5s/3s/2s timeout without explain-able good reason and can break if peer "misbehave" is another extra stupid point.

This is not "arbitrary". It's called exponential backoff, and it is standard practice for well-behaved software. If you ever have a system go down because some software retried in milliseconds and logged every attempt, filling all disk space in the process, you'll understand.

Originalimoc (Author):

Not the point.
The starting point is off. Why do you need a 60 s timeout, are you talking to Mars or something? And why does the server need a restart pause? Its initial state connects to nothing.

cron2 (Contributor) commented Dec 12, 2022

It can/needs to be a resource concern, but not an architectural one. Unless you're EXPECTING the server/client of a session to stop working after an unexpected incoming connection, this is a serious bug.

No, we did not expect this. Control-channel EEN is new, and you triggered an unexpected behaviour by using more aggressive timers than our test framework does. So @schwabe fixed it.

Was this an annoying side effect? I'm sure it was. Was it "a serious bug"? Certainly not. No side crashed, no security impact.

cron2 (Contributor) commented Dec 12, 2022

Not the point. The starting point is off. Why do you need a 60 s timeout, are you talking to Mars or something? And why does the server need a restart pause? Its initial state connects to nothing.

The starting point is 5 s: first retry after 5 s, then 10, 20, ... And if a remote server is down hard, retrying every 60 seconds is a very reasonable compromise between "reconnecting quickly" and "needlessly burning resources".
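
(As a generic illustration of that schedule, a sketch rather than the OpenVPN implementation: start at 5 seconds and double after every failed attempt, capping at 60 seconds.)

/* Sketch of exponential backoff with a cap; not OpenVPN source. */
#include <stdio.h>

int main(void)
{
    int delay = 5;        /* initial retry interval, seconds */
    const int cap = 60;   /* maximum retry interval, seconds */

    for (int attempt = 1; attempt <= 8; attempt++) {
        printf("attempt %d: wait %d s before retrying\n", attempt, delay);
        delay *= 2;       /* double the wait after each failure... */
        if (delay > cap) {
            delay = cap;  /* ...but never wait longer than the cap */
        }
    }
    return 0;
}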

Originalimoc (Author):

The EEN part essentially crashed, because in the next session the peer receives nothing.

schwabe (Contributor) commented Dec 12, 2022

And besides, the defaults are chosen as a compromise that works for as many people as possible. Your setup might be able to tolerate a 3 s retry from clients, but that is not the case for everyone. I implemented this exponential backoff because mobile phones that lost server connectivity for an hour or two would quickly burn their battery with the default of 5 s.

schwabe (Contributor) commented Dec 12, 2022

The EEN part essentially crashed, because in the next session the peer receives nothing.

I don't think you understand what "crashed" means. "Crashed" means the program is not able to recover. And your "big problem" is only caused by you tweaking all kinds of obscure settings in even more obscure ways, so you are walking off the established and well-tested paths. That your setup has more problems than the ones that run with well-tested defaults is not surprising at all.

Originalimoc (Author):

Hand window is 60.
The mobile side is better with persist-key, no ping, and float.

Originalimoc (Author) commented Dec 12, 2022

Nah, the state maintenance still sucks. I've now configured things to bypass these. That doesn't mean there's no bug though; way too much legacy debt.

cron2 (Contributor) commented Dec 12, 2022

Hand window is 60. The mobile side is better with persist-key, no ping, and float.

And what, exactly, does this have to do with the issue here?

Nah, the state maintenance still sucks. I've now configured things to bypass these. That doesn't mean there's no bug though.

I will now block you. The way you refuse to listen, and keep being offensive ("sucks") is really a waste of our time.

schwabe (Contributor) commented Dec 12, 2022

Hand window is 60. The mobile side is better with persist-key, no ping, and float.

You are just trolling at this point. I don't think there is any use in continuing the conversation with you.

OrimoB commented Dec 12, 2022

Huh, blocking, that's a good move. Am I here spending time to praise you, or what? Coding from the ground up is better than dealing with all these legacy bad decisions without proper state management, and that's exactly what I'll do next. You are the one who refused to listen, insisting "this is good and well tested"; live in your old dream. What sucks remains sucks.

cron2 (Contributor) commented Dec 12, 2022

Huh, blocking, that's a good move. Am I here spending time to praise you, or what? Coding from the ground up is better than dealing with all these legacy bad decisions without proper state management, and that's exactly what I'll do next. You are the one who refused to listen, insisting "this is good and well tested"; live in your old dream. What sucks remains sucks.

If you pay for software and support, feel free to be as unfriendly as you want. If you get some piece of software for free, and get support for free as well, we expect you to spend some of your time on "be friendly and provide the information that is asked from you" in return.

You prefer to insult us, which is okay, but we are free to just not listen to you. So, yes, block.
