UDP networking unstable on macos and linux environments #5615
Comments
Hello @cpq, thank you for reporting this one. We will take a look. |
Thank you. Here is an example of a Linux environment failing to talk to time.google.com via UDP: https://github.com/cesanta/mongoose/actions/runs/2390068186 Multiple tests failed there. Exactly the same tests succeeded in the Windows and macOS environments on this run. |
Is this a problem that is happening with all UDP traffic, or is it isolated to time.google.com? For example, what happens if you use time.windows.com instead? |
@chkimes, thank you for looking into this. DNS works just fine, which is UDP over port 53. I have added one extra test to query udp://time.windows.com:123 before querying Google. It fails on time.windows.com too. The same test executes fine from a local workstation, and within the same run on the Windows environment. |
Thanks - that's very interesting. Can you run a local VM packet capture while the run is executing? You can add these steps to the workflow:
The workflow logs will output a file.io link you can use to download the capture. Alternatively, you can upload it as an artifact of the job. |
Done. To double-check that tcpdump works, I've done a second run with port 53 included: https://github.com/cesanta/mongoose/runs/6613208125?check_suite_focus=true Does that mean that the traffic does not hit the interface? It looks like it gets dropped somewhere on the way.
|
I think file.io is deleting the files once you've downloaded them, so I'm unable to view them. However, if there are no packets in one capture and packets in another, then that definitely points to client-side behavior (unrelated to the hosting environment). We should at least see outbound packets. MacOS hosting and networking is quite different, however Windows and Linux are extremely similar. That there are differences between the latter two also indicates something client-side, although I'm not immediately sure what. Is the application picking a specific source port for the UDP traffic? Or is it using an OS-assigned high port? |
No specific port is chosen, so it's whatever ephemeral port the OS assigns. |
The source port was a long shot, but easy enough to check. One of the older SNTP specs says to specify src and dst: https://datatracker.ietf.org/doc/html/rfc2030#section-4
If an NTP implementation was already listening on port 123, the kernel would block picking that as the source port. This also would not get captured by tcpdump, since the packet never hits an interface. However, if you're using ephemeral high ports then that's not a concern. Using the SNTP tool, I can see that communications to port 123 are behaving as I might expect on Linux: https://github.com/chkimes/test-sntp/runs/6627294882?check_suite_focus=true This result, coupled with the different behavior on Linux and Windows (where, from a virtualization standpoint, the networking layer is identical), reinforces to me the idea that networking is behaving as expected. As for why this doesn't work now but used to work, I don't have an immediate answer. It is very likely related to a new image version, of which the most likely candidates in my mind are:
|
@chkimes thank you! Actually SNTP version 4, which was used in the first place, allows the source port to be anything. I did some more debugging and I am getting somewhere.

```
1810710e59e 2 sock.c:129:iolog
-- 2 0.0.0.0:60086 -> 8.8.8.8:53 33
0000 00 01 01 00 00 01 00 00 00 00 00 00 04 74 69 6d .............tim
0010 65 06 67 6f 6f 67 6c 65 03 63 6f 6d 00 00 01 00 e.google.com....
0020 01 .
1810710e59e 3 sock.c:498:mg_iotest n=2 ms=50 <-- just before poll(), with timeout of 50 ms
1810710e59e 4 sock.c:580:mg_mgr_poll 2 -- tchrc
1810710e59e 4 sock.c:580:mg_mgr_poll 1 -- tchRc
1810710e59e 3 sock.c:498:mg_iotest n=2 ms=50 <--- next poll() iteration - but the timestamp is the same
1810710e59e 4 sock.c:580:mg_mgr_poll 2 -- tchrc
1810710e59e 4 sock.c:580:mg_mgr_poll 1 -- tchR
1810710e59e 3 sock.c:498:mg_iotest n=2 ms=50 <---- still the same!
1810710e59e 4 sock.c:580:mg_mgr_poll 2 -- tchrc
1810710e59e 4 sock.c:580:mg_mgr_poll 1 -- tchRc
1810710e59e 3 sock.c:498:mg_iotest n=2 ms=50 <--- still the same! so instead of waiting for a response, we busy-loop
...
```

The app has a limit on the number of iterations, so after 50 iterations it stops. Obviously, an sntp test program does not have such a count limit, and it succeeds, because it waits for a timeout based on a timestamp, not on an iteration limit. When I run the test on a workstation, the
So, apparently, GA images ignore the timeout value for |
Thanks - that's very interesting! After a brief scan of your code, it looks like you're checking |
@chkimes, I have to apologise. The issue we saw was due to a bug in the app: a regression after the switch from select() to poll().

```c
for (struct mg_connection *c = mgr->conns; c != NULL; c = c->next, i++) {
  if (c->is_closing || c->is_resolving || FD(c) == INVALID_SOCKET) {
    // Socket not valid, ignore   <<<<<----- SOME FDs in the array are ignored
  } else {
    fds[i].fd = FD(c);
```

That loop populated the fds[] array but skipped some fds, leaving fds[0].fd == 0, and poll() reported it as ready, which caused the busy loop. The fix was to avoid gaps in the fds[] array and keep only those fds that must be tested. Now I hope the only issue left is the intermittent macOS failures, but I guess we can live with that. Apologies again, and thanks for your support and attention. |
Glad we were able to get to a resolution on Linux! Is the behavior similar in the MacOS failures? What kind of intermittent failure rate are you seeing there? |
It's about 50/50 - now I am throwing printf()s there to see more precisely what happens - I'll report soon. |
Interesting. https://github.com/cesanta/mongoose/runs/6632199560?check_suite_focus=true time.windows.com DNS-resolved and SNTP-ed successfully.
|
Yep! I believe the same |
Got the capture. The MS site worked, Google did not. The request was sent, but no response came back: |
Thanks for that! That packet shows in the trace at least, so we can see the client is trying to send it. Understanding how far the packet got or whether a response packet was sent will, unfortunately, be a pretty difficult task. I will note that MacOS is quite different from Linux/Windows from a virtualization and networking standpoint, however it's not different in a way that I would expect it to behave differently for different endpoints/ports. Since I don't have an immediate suspect, I think our best path forward is to generate some data and see if we can find patterns. I would ask you to run tcpdump in your tests on MacOS for a while, and we can use that data to correlate it with the failures. I can combine that with some of our backend telemetry to get things like what region a job ran in or what kind of hardware it was on. In order to avoid managing a bunch of file uploads, we can instead just spit the output into the logs:
|
Thank you @chkimes , much appreciated. |
Let's keep it open, since the MacOS issue is still ongoing. |
So far, these are the stats. The lines with "ok" are successful runs; those without, failures.
|
Thanks for generating the data @cpq! I investigated the runs from backend telemetry and I have found a pretty clear correlation with which datacenters the jobs are running on. I'm going to spend some time to investigate what the differences are between these. |
@chkimes thank you! I guess a command-line sntp tool might be used for testing... |
An update: it actually seems that only I haven't yet identified the root cause, but it seems like SNTP specifically is being blocked if it's outside of the MS network. I'm waiting on followup from some network engineers to learn more. |
Update: I'm still waiting on investigation from the network team. |
Thank you for the update @chkimes ! |
I got confirmation that there is a network ACL blocking port 123 that is unexpectedly being applied to our traffic. I'm still working with the network team on a path forward (so that the rule applies to what is intended, but excludes Mac VM traffic), but we at least understand the root cause. |
@chkimes thank you for your effort, much appreciated! |
FYI - recently I see a much higher rate of failures, now including linux environments. |
The network team will be working on a change to isolate Mac traffic from this network rule. Not clear on rollout date, but I will update when I hear about it. Linux environments should not be impacted by this particular rule, so we may have some additional investigation to do since the root cause is likely not the same. FYI I will be out on leave starting tomorrow until Tuesday. |
Are there any news on the issue? |
I'm still working with the networking team on this, it turns out the blocking rule is part of a sweeping mitigation against NTP reflection attacks and carving out an exemption specifically for Actions Mac traffic is taking some time to get correct (plus additional review from security teams). Are you able to switch NTP targets to |
@chkimes thanks for hinting about time.windows.com! |
From time to time, there are network issues on GitHub runners. They still didn't fix them, and it's a huge problem to restart jobs everytime. - Windows: actions/runner-images#5850 - Linux/MacOS: actions/runner-images#5615
Hey @cpq! How is it going on your side? Any feedback? |
@erik-bershel let us enable SNTP lookup on Mac and see. Will follow up. Thanks for pinging me. |
Hello! The issue seems stale now; feel free to reach out again if you have any new related problems or issues! |
Description
Repo: https://github.com/cesanta/mongoose
The GitHub Actions run executes a unit test which makes a request to the time.google.com SNTP server via UDP, to synchronise time.
On the macOS environment, that request fails the majority of the time. Even if the unit test is patched to repeat the request, all subsequent UDP requests seem to be dropped. Only a restart may fix it, and sometimes only after several attempts.
The most flaky environment is macOS, and recently Linux started to show the same behavior.
Note: TCP requests to external hosts seem to be fine. Only UDP has problems.
Example of failed test (this case, macos environment failed): https://github.com/cesanta/mongoose/actions/runs/2388366976
Virtual environments affected
Image version and build link
https://github.com/cesanta/mongoose/actions/runs/2388366976
Macos:
Linux:
Is it regression?
No response
Expected behavior
UDP requests to time.google.com must go through.
Note: the same test, same requests, same everything, works flawlessly on the Windows environment, and most of the time on Linux. It almost always fails on macOS, and sometimes on Linux.
Actual behavior
No response
Repro steps
Send UDP packets to time.google.com