
[2.6 beta1 w/ dco] server side explicit-exit-notify not working #189

Closed

Originalimoc opened this issue Dec 7, 2022 · 49 comments

Labels: enhancement, non-trivial change (this is a larger code change, not a simple bugfix)

Comments

Originalimoc commented Dec 7, 2022

Describe the bug
Server and client are both 2.6 beta1 w/ DCO.
server:
udp
explicit-exit-notify 1
client:
udp
explicit-exit-notify 2
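
(For reference, a minimal pair of configs matching the fragments above might look like the sketch below; the certificate/key paths, addresses, and port are placeholders, not taken from the actual setup.)

# server.conf (sketch; cert/key paths and network values are placeholders)
proto udp
port 1194
dev tun
server 10.8.0.0 255.255.255.0
ca ca.crt
cert server.crt
key server.key
dh dh.pem
explicit-exit-notify 1

# client.conf (sketch)
client
proto udp
remote vpn.example.com 1194
dev tun
ca ca.crt
cert client.crt
key client.key
explicit-exit-notify 2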

To Reproduce
Establish a connection with the TLS config first, then send the server a SIGUSR1/SIGHUP/SIGTERM.
The server will log (this one is SIGTERM):

2022-12-07 11:26:09 event_wait : Interrupted system call (fd=-1,code=4)
2022-12-07 11:26:09 SENT CONTROL [Client]: 'RESTART' (status=1)
2022-12-07 11:26:11 Closing DCO interface

but the client receives nothing and logs nothing; a manual SIGUSR1 on the client is needed to re-establish the connection.

Expected behavior
Client receives RESTART, then generates an internal SIGUSR1, so the client reconnects after the server restarts.

Version information (please complete the following information):

  • OS: Ubuntu 22.04
  • OpenVPN version: 2.6 beta1 w/ dco
schwabe (Contributor) commented Dec 8, 2022

I cannot reproduce this.

Server side:

/src//openvpn/openvpn --server 10.33.0.0 255.255.255.0 --server-ipv6 fd00:f00f::1/64  --topology subnet --cert ~/bclient.pem --key ~/bclient.pem --dev tun --verb 4  --data-ciphers chacha20-poly1305:AES-256-GCM:AES-128-GCM:AES-192-GCM --tun-mtu 1400 --dh none --keepalive 10 25 --config ~/fp --push "explicit-exit-notify 3"  --verb 4  --explicit-exit-notify 1
[...]
2022-12-08 12:07:23 us=551539 DCO device tun0 opened
[...]
^C2022-12-08 12:07:29 us=150785 event_wait : Interrupted system call (fd=-1,code=4)
2022-12-08 12:07:29 us=150854 SENT CONTROL [test-client]: 'RESTART' (status=1)
^C2022-12-08 12:07:30 us=795453 event_wait : Interrupted system call (fd=-1,code=4)

Client side:

2022-12-08 13:07:25 Initialization Sequence Completed
2022-12-08 13:07:29 Connection reset command was pushed by server ('')
2022-12-08 13:07:29 Closing TUN/TAP interface

Originalimoc (Author):

Not only this, but also this (reneg set to 15 seconds for debugging):

TLS: soft reset sec=16/15 bytes=0/-1 pkts=0/0
TLS key negotiation failed to occur within 5 seconds (check your network connectivity)
TLS Error: TLS handshake failed
TLS: move_session: dest=TM_LAME_DUCK src=TM_ACTIVE reinit_src=1
TLS Error: TLS key negotiation failed to occur within 5 seconds (check your network connectivity)
TLS Error: TLS handshake failed
TLS Error: TLS key negotiation failed to occur within 5 seconds (check your network connectivity)
TLS Error: TLS handshake failed
...

Something is wrong with the control channel.

Originalimoc (Author):

I'm using dh and tls-crypt; try adding those two? data-ciphers is AES-128-GCM.

schwabe (Contributor) commented Dec 8, 2022

reneg-sec 15 is an extremely low value, and you seem to have a 5 s timeout there as well, so I am not surprised that your control channel breaks with such aggressive timings. So please try to reproduce the issue with sensible timings. Convoluting this issue with aggressive timings that may cause other problems is not helpful here.
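
(For context, the stock values being referenced are, per the OpenVPN 2.6 manual, roughly as in the sketch below; double-check against your own man page.)

# defaults the report deviates from (sketch)
reneg-sec 3600    # data-channel key renegotiation interval
hand-window 60    # window in which the TLS handshake must complete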

Originalimoc (Author) commented Dec 8, 2022

My network is super fast/low latency with effectively 0% packet loss. And yes, this happened on the default 3600. I had never even used reneg-*/hand-window configs before.

Originalimoc (Author) commented Dec 8, 2022

Add dh and tls-crypt to your config to see if you can reproduce it; adding the cannotreproduce label is too early.

PS: my network, ping -i 0 -c 1000:

--- 10.0.0.1 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 16792ms
rtt min/avg/max/mdev = 67.754/69.313/71.813/0.329 ms, pipe 5, ipg/ewma 16.808/69.555 ms

So if TLS doesn't complete within even one second, I can safely assume it has failed (a test via SIGUSR1):

45:58 us=957848 Restart pause, 1 second(s)
45:59 us=958289 Re-using SSL/TLS context
46:00 us=267117 Initialization Sequence Completed

I'll move on without reneg; it doesn't seem to matter with AES-128.

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc I cannot reproduce this. You need to provide proper instructions to reproduce it, as I cannot. dh and tls-crypt do not make a difference here. At least I posted my whole configuration, whereas you only posted some fragments.

Please also try to reproduce this with a minimal configuration, without all kinds of extra/unusual options, to figure out whether it is related to one of the options you seem to be using. Try removing tls-crypt and see if you can still reproduce it, and so on.

Originalimoc (Author):

Full config for both sides: #191 (comment)

cron2 (Contributor) commented Dec 8, 2022 via email

Originalimoc (Author) commented Dec 8, 2022

It does happen with defaults. I already stated from the very beginning that reneg was set to 15 seconds for debugging.

cron2 (Contributor) commented Dec 8, 2022 via email

cron2 (Contributor) commented Dec 8, 2022 via email

Originalimoc (Author):

To me both look like something is wrong with the control channel, so I posted them both here. And the full config is given: #191 (comment)

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc It did not happen for me with default settings, as I already posted in my comment. So currently this is not reproducible. You are adding all kinds of questionable non-default settings like hand-window 5, sndbuf, replay-window 5000 3 to your configs. I have to assume that the rest of your setup is just as sketchy as your configuration.

To repeat myself: unless you provide a proper way to reproduce this issue, I consider it non-reproducible. You have not even provided full logs from server and client yet.

Originalimoc (Author):

After disabling server-side DCO, both problems (EEN/RENEG) go away. And hand-window 5/reneg 15 s works perfectly even though it's "aggressive".

Nah...

cron2 (Contributor) commented Dec 8, 2022 via email

cron2 (Contributor) commented Dec 8, 2022 via email

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc let me tell you, as someone who actually understands OpenVPN, that these timing settings can cause problems that you do not even understand. But if you think you are more knowledgeable about OpenVPN than I am, then please fix the issue yourself and post a patch instead of questioning our knowledge of OpenVPN.

Originalimoc (Author) commented Dec 8, 2022

In case you didn't read the above, I already said:
1. This happened on the default 3600 s reneg; reneg 15 seconds was just for debugging.
2. I had never even used reneg-*/hand-window configs before, for 5+ years.
3. My network can very reliably complete a full TLS handshake in 0.3 seconds, as #189 (comment) shows.

Originalimoc (Author):

It's not my job either. I can move on without reneg if necessary. It's all under a watchdog anyway; I can just auto-SIGUSR1 and restart it all within 0.5 seconds. I reported this hoping for an improvement; if you can patch it, great. I'm not really trying to enter arguing mode. Having a bad day, huh @swb?

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc if you are not going to help us figure out why it is not working on your side, I am just going to close the ticket.

Originalimoc (Author):

You're not even using the same config, yet you say "it's not reproducible".

Originalimoc (Author):

@cron2 hand-window 5, reneg-sec, tran-window WERE NOT THERE WHEN THE PROBLEM FIRST APPEARED. Can you read, FGS?

Originalimoc reopened this Dec 8, 2022
cron2 (Contributor) commented Dec 8, 2022

@cron2 hand-window 5, reneg-sec, tran-window WERE NOT THERE WHEN THE PROBLEM FIRST APPEARED. Can you read, FGS?

Whatever "FGS" is supposed to stand for, but yes, I can read. And I read that you are bringing up new and unrelated stuff all the time, which eats my time.

cron2 (Contributor) commented Dec 8, 2022

The original problem is caused by your aggressive reconnect timers.

In DCO mode (EEN in control channel) the server waits for 3 seconds after sending the EEN so all clients have time to ACK the control-channel message. Your client connects so fast that it hits the server "going down", and then the server goes down "for real" and will not send another EEN.

If you look properly into your client log files, you see that it receives the RESTART, reconnects, and then the server disappears (without another RESTART). We're going to fix this server-side by not allowing reconnect in this 2-3 second time frame.

Originalimoc (Author):

ForGodSake.
OK.
No.
You still didn't read.

"client log files, you see that it receives the RESTART"

No:

"but client receives nothing/logs nothing, need a manual SIGUSR1 on client to reestablish connection."

Original problem? Let's turn off reneg (0) for now. So you mean connect-retry?
From the documentation, EEN is send-and-forget and doesn't need an ACK?
So messy right now.

Originalimoc (Author):

We'd better start clean: a new issue post with all the info packed into one.

cron2 (Contributor) commented Dec 8, 2022

EEN with DCO needs to be sent over the control channel, and control-channel packets always need an ACK. This is the way the OpenVPN protocol works (we could make the wait time shorter if all ACKs are in, but this would be a much larger code change, which nobody had time and interest to implement).

Old-style EEN is sent as part of the data channel (OCC messages), which does not work with DCO.

cron2 (Contributor) commented Dec 8, 2022

We'd better start clean: a new issue post with all the info packed into one.

Do not randomly open new issues. One issue per problem, relevant content in there, non-relevant content left out. If you feel part of your contributions have not been relevant, delete them.

Originalimoc (Author) commented Dec 8, 2022

DONE, NOT A BUG?
The kernel module with the data channel continues to work, but the OpenVPN process itself entered a bad state where it can no longer receive RESTART; yes, it can only receive it once per process lifecycle.

Can the wait between "SENT CONTROL [Client]: 'RESTART' (status=1)" and "SIGUSR1[hard,] received, process restarting" be removed? It takes 2 seconds, then another 3-second restart pause, so at least a 5-second wait is needed on the client side. That's way too long, enough to take a round trip to the moon.

A better man page is needed, with a warning regarding --connect-retry.

schwabe (Contributor) commented Dec 8, 2022

@Originalimoc either change your attitude or go away. Telling us things like "for god's sake" and questioning our ability to read is something I perceive as an insult. I do not like working with people who insult me.

Originalimoc (Author) commented Dec 8, 2022

I haven't liked talking to you from the beginning either, thanks. Getting angry afterwards. c2 is clearly actually more knowledgeable on the subject.

cron2 pushed a commit that referenced this issue Dec 12, 2022
Currently we still allow clients to connect while the server is waiting
to shut down. This window is very small (2s) and is only used when
explicit-exit-notify is enabled on the server side.

The chance of a client connecting during this time period is very low
unless someone puts something stupid like --connect-retry 1 3 into his/her
client config and forces the client to reconnect during this time period.

Github: #189

Signed-off-by: Arne Schwabe <[email protected]>
Acked-by: Gert Doering <[email protected]>
Message-Id: <[email protected]>
URL: https://www.mail-archive.com/[email protected]/msg25638.html
Signed-off-by: Gert Doering <[email protected]>
(cherry picked from commit 7d0a903)
cron2 pushed a commit that referenced this issue Dec 12, 2022
Currently we still allow clients to connect while the server is waiting
to shut down. This window is very small (2s) and is only used when
explicit-exit-notify is enabled on the server side.

The chance of a client connecting during this time period is very low
unless someone puts something stupid like --connect-retry 1 3 into his/her
client config and forces the client to reconnect during this time period.

Github: #189

Signed-off-by: Arne Schwabe <[email protected]>
Acked-by: Gert Doering <[email protected]>
Message-Id: <[email protected]>
URL: https://www.mail-archive.com/[email protected]/msg25638.html
Signed-off-by: Gert Doering <[email protected]>
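
(As a hedged aside on the directive called out in the commit message: --connect-retry takes an initial retry interval plus an optional backoff cap, so a client line like the sketch below retries roughly every 1-3 seconds, which is what makes it likely to reconnect inside the server's shutdown window.)

# client.conf fragment (illustrative only, not from the reporter's config)
connect-retry 1 3    # retry after 1 s, exponential backoff capped at 3 s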
cron2 (Contributor) commented Dec 12, 2022

This patch (from @schwabe, as the subject-matter expert on this) will sidestep the problem somewhat: when "in the process of shutting down", the server will no longer accept new connections from clients. So the race condition between "the client reconnects very quickly" and "the server is not yet restarted" will no longer lead to clients thinking they have a valid connection that is no longer valid (= the client has to wait for a ping timeout).

Yes, we should probably rework the server-side code to take less time to actually shut down, but the openvpn event loop is complex and not very good at doing things "really quick now!". So this will not happen in the 2.6 beta cycle, but we might revisit it in the "refactor big parts in master, after a release has been done" phase.
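
(To illustrate the idea only: the sketch below is not the actual OpenVPN patch and uses made-up names. It models the guard described above: while the server sits in its post-EEN shutdown wait, new connects are refused, so a fast-reconnecting client cannot attach to an instance that is about to exit.)

/* Sketch only, with hypothetical names; not OpenVPN source code. */
#include <stdbool.h>
#include <stdio.h>

enum server_state { SERVER_RUNNING, SERVER_EEN_WAIT, SERVER_STOPPED };

/* Accept new clients only while the server is fully running. */
static bool accept_new_client(enum server_state state)
{
    return state == SERVER_RUNNING;
}

int main(void)
{
    printf("connect while running:   %s\n",
           accept_new_client(SERVER_RUNNING) ? "accepted" : "rejected");
    printf("connect during EEN wait: %s\n",
           accept_new_client(SERVER_EEN_WAIT) ? "accepted" : "rejected");
    return 0;
}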

cron2 added the labels "enhancement" and "non-trivial change (this is a larger code change, not a simple bugfix)" and removed the "cannotreproduce" label, Dec 12, 2022
schwabe (Contributor) commented Dec 12, 2022

And I am not sure that the extra logic/code for doing a really quick shutdown is worth adding. Aside from this ticket, I have never heard of the shutdown time of a server that waits 5 s causing problems. So adding 30+ lines of code to fix such an obscure problem is not worth the complexity it adds.

Originalimoc (Author) commented Dec 12, 2022

"Something stupid like --connect-retry 1 3" LOL.
No, the inability to maintain a good internal state and to instant reconnect in case of endpoint restart is stupid instead. And adding some arbitrary 60s/5s/3s/2s timeout without explain-able good reason and can break until a manual restart if peer "misbehave" is another extra stupid point. My upper tunnel can behave stateless and never fatally break if either side does anything configurable, at most wait a few sec it behave as new.

Originalimoc (Author) commented Dec 12, 2022

It can/needs to be a resource concern, but not an architectural one. Unless you're EXPECTING the server/client of a session to stop working after an unexpected incoming connection, this is a serious bug.

cron2 (Contributor) commented Dec 12, 2022

"Something stupid like --connect-retry 1 3" LOL. No, the inability to maintain a good internal state and to instant reconnect in case of endpoint restart is stupid instead. And adding some arbitrary 60s/5s/3s/2s timeout without explain-able good reason and can break if peer "misbehave" is another extra stupid point.

This is not "arbitrary". It's called exponential backoff, and it is standard practice for well-behaved software. If you ever have a system go down because some software retried in milliseconds and logged every attempt, filling all disk space in the process, you'll understand.

Originalimoc (Author):

Not the point.
The starting point is off. Why do you need a 60 s timeout, are you talking to Mars or something? And why does the server need a restart pause? Its initial state connects to nothing.

cron2 (Contributor) commented Dec 12, 2022

It can/needs to be a resource concern, but not an architectural one. Unless you're EXPECTING the server/client of a session to stop working after an unexpected incoming connection, this is a serious bug.

No, we did not expect this. Control-channel EEN is new, and you triggered an unexpected behaviour by using more aggressive timers than our test framework does. So @schwabe fixed it.

Was this an annoying side effect? I'm sure it was. Was it "a serious bug"? Certainly not. No side crashed, no security impact.

cron2 (Contributor) commented Dec 12, 2022

Not the point. The starting point is off. Why do you need a 60 s timeout, are you talking to Mars or something? And why does the server need a restart pause? Its initial state connects to nothing.

The starting point is 5 s: first retry after 5 s, then 10, 20, ... And if a remote server is down hard, retrying every 60 seconds is a very reasonable compromise between "reconnecting quickly" and "needlessly burning resources".
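
(As a generic illustration of that schedule, a sketch rather than the OpenVPN implementation: start at 5 seconds and double after every failed attempt, capping at 60 seconds.)

/* Sketch of exponential backoff with a cap; not OpenVPN source. */
#include <stdio.h>

int main(void)
{
    int delay = 5;        /* initial retry interval, seconds */
    const int cap = 60;   /* maximum retry interval, seconds */

    for (int attempt = 1; attempt <= 8; attempt++) {
        printf("attempt %d: wait %d s before retrying\n", attempt, delay);
        delay *= 2;       /* double the wait after each failure... */
        if (delay > cap) {
            delay = cap;  /* ...but never wait longer than the cap */
        }
    }
    return 0;
}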

Originalimoc (Author):

The EEN part essentially crashed, because in the next session the peer receives nothing.

schwabe (Contributor) commented Dec 12, 2022

And besides, the defaults are chosen as a compromise that works for as many people as possible. Your setup might be able to tolerate a 3 s retry from clients, but that is not the case for everyone. I implemented this exponential backoff because mobile phones that lost server connectivity for an hour or two would quickly burn their battery with the default of 5 s.

schwabe (Contributor) commented Dec 12, 2022

The EEN part essentially crashed, because in the next session the peer receives nothing.

I don't think you understand what "crashed" means. "Crashed" means the program is not able to recover. And your "big problem" is only caused by you tweaking all kinds of obscure settings in even more obscure ways, so you are walking off the established and well-tested paths. That your setup has more problems than the ones that run with well-tested defaults is not surprising at all.

Originalimoc (Author):

Hand window is 60.
The mobile side is better with persist-key, no ping, and float.

Originalimoc (Author) commented Dec 12, 2022

Nah, the state maintenance still sucks. I've now configured things to bypass these. That doesn't mean there's no bug though; way too much legacy debt.

cron2 (Contributor) commented Dec 12, 2022

Hand window is 60. The mobile side is better with persist-key, no ping, and float.

And what, exactly, does this have to do with the issue here?

Nah, the state maintenance still sucks. I've now configured things to bypass these. That doesn't mean there's no bug though.

I will now block you. The way you refuse to listen, and keep being offensive ("sucks") is really a waste of our time.

schwabe (Contributor) commented Dec 12, 2022

Hand window is 60. The mobile side is better with persist-key, no ping, and float.

You are just trolling at this point. I don't think there is any use in continuing the conversation with you.

OrimoB commented Dec 12, 2022

Huh, blocking, that's a good move. Am I here spending time to praise you, or what? Coding from the ground up is better than dealing with all these legacy bad decisions without proper state management, and that's exactly what I'll do next. You are the one who refused to listen, insisting "this is good and well tested"; live in your old dream. What sucks remains sucks.

cron2 (Contributor) commented Dec 12, 2022

Huh, blocking, that's a good move. Am I here spending time to praise you, or what? Coding from the ground up is better than dealing with all these legacy bad decisions without proper state management, and that's exactly what I'll do next. You are the one who refused to listen, insisting "this is good and well tested"; live in your old dream. What sucks remains sucks.

If you pay for software and support, feel free to be as unfriendly as you want. If you get some piece of software for free, and get support for free as well, we expect you to spend some of your time on "be friendly and provide the information that is asked from you" in return.

You prefer to insult us, which is okay, but we are free to just not listen to you. So, yes, block.
