Mirrored network becomes unavailable 24h after session start #10587

Closed
1 of 2 tasks
sbalmos opened this issue Oct 4, 2023 · 9 comments

@sbalmos

sbalmos commented Oct 4, 2023

Windows Version

Microsoft Windows [Version 10.0.22621.2361]

WSL Version

2.0.1.0

Are you using WSL 1 or WSL 2?

  • WSL 2
  • WSL 1

Kernel Version

5.15.123.1-1

Distro Version

Debian 11

Other Software

1Password SSH agent relay tunneling using npiperelay.

Repro Steps

Within 24 hours of starting a Debian 11 WSL session running with mirrored networking, the network becomes unavailable. All existing connections are dropped, and all attempts to use non-loopback IPs return Network is unreachable. Remediation requires completely exiting WSL and performing a full shutdown of the WSL environment through wsl.exe --shutdown.

Expected Behavior

Networking remains available throughout the life of the session.

Actual Behavior

Networking becomes unavailable the next day.

sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 192.168.0.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.036 ms
^C
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.036/0.036/0.036/0.000 ms
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ exit
logout
PS C:\Users\sbalmos> wsl
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ exit
logout
PS C:\Users\sbalmos> wsl --shutdown
PS C:\Users\sbalmos> wsl
removing previous socket...
Starting SSH-Agent relay...
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=56 time=14.1 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=56 time=11.5 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=56 time=11.9 ms
^C
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 11.511/12.520/14.105/1.134 ms
sbalmos@stormfront:/mnt/c/Users/sbalmos$

Current time is 10:48am ET. Since I restarted WSL just now to regain networking, I expect networking to become unavailable again tomorrow at around 10:45am ET, give or take a few minutes.

Diagnostic Logs

No response

OneBlue added the network label Oct 4, 2023
@keith-horton
Member

Hi there. Since this takes a very long time to repro, could you run the following until you get a repro? It will capture minimal traces (just the WSL traces for managing the Linux network settings).

e.g.

C:>logman start wsl_trace -p {b99cdb5a-039c-5046-e672-1a0de0a40211} -o wsl_trace.etl -ets
The command completed successfully.

<<<<<<<< Now Repro >>>>>>>>

C:>logman stop wsl-trace -ets

Error:
Data Collector Set was not found.

C:>logman stop wsl_trace -ets
The command completed successfully.

C:>dir *.etl
Volume in drive C has no label.
Volume Serial Number is C64F-A1F6

Directory of C:\

10/04/2023 07:53 PM 368,640 wsl_trace.etl
1 File(s) 368,640 bytes
0 Dir(s) 34,756,386,816 bytes free

Please send back the generated ETL file.

Once you have a repro, could you then run a very short repro attempting to make a network connection from the WSL container? The command below captures a much deeper trace to try to find where data is getting lost.

powershell .\collect-wsl-logs.ps1 .\wsl_networking.wprp

(from https://github.com/microsoft/WSL/tree/master/diagnostics)

Thanks!

@sbalmos
Author

sbalmos commented Oct 5, 2023

ETS and WSL log traces attached. Started the ETS approx. 2 hours before predicted loss of networking, and ended the ETS afterwards. WSL log trace was performed immediately after loss of networking. Loss occurred at 24h+10m... and here's the juicy new tidbit - I just happened to notice that when the event occurred, the interfaces don't lose IPs or anything at the interface level. However, the in-WSL Linux routing table is completely wiped clean. No default routes, nothing. That may be the smoking gun or where to point the Eye of Sauron next.

WSLTraces.zip
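
The routing-table observation above can be checked mechanically. The following is a minimal sketch, assuming the standard text output of `ip -4 route show` inside the distro; the function names and the sample route lines are illustrative and do not come from the attached traces.

```python
import subprocess


def has_ipv4_default_route(route_output: str) -> bool:
    """Return True if the given `ip -4 route show` output lists a default route."""
    for line in route_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "default":
            return True
    return False


def read_ipv4_routes() -> str:
    """Read the live IPv4 routing table (run inside the WSL distro)."""
    result = subprocess.run(
        ["ip", "-4", "route", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Illustrative samples only; the host address on the second line is hypothetical.
healthy = (
    "default via 192.168.0.1 dev eth0 proto kernel\n"
    "192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.50\n"
)
wiped = ""  # the failure state described above: no routes at all

print(has_ipv4_default_route(healthy))
print(has_ipv4_default_route(wiped))
```

Calling `has_ipv4_default_route(read_ipv4_routes())` periodically and logging a timestamp when it first returns False would be one way to pin down when the routes disappear.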

@keith-horton
Copy link
Member

Thanks. I can see from the trace that our code in WSL has been successfully pushing IP updates into the container. There weren't errors setting things up with Linux.

It doesn't look like the wsl_networking.wprp created a trace to observe traffic failing. While it's in this bad state, can you dump out the Linux state (https://github.com/microsoft/WSL/blob/master/diagnostics/networking.sh), then run

wpr.exe -start wsl_networking.wprp -filemode

(then generate network traffic from the Linux container, like trying to ping an address, or wget bing.com a few times)

wpr.exe -stop wsl_networking.etl

Please let me know what traffic you tried to send, and share that ETL file.

Could you also cat /etc/resolv.conf so we can see what the DNS configuration is?
Thanks!

@sbalmos
Author

sbalmos commented Oct 6, 2023

The ETL file is approximately 300 megs, zipped to 70, too big for an attachment. I have made it available at https://1drv.ms/u/s!AtUhMGXKAUHRgqFFUgXnAVLNKoHZmA?e=uLvtda

For giggles and completeness, networking-good.txt is a run of the networking shell script while everything is okay, and networking-bad.txt is a run of the same script in the bad state. The ETL is also attached. I attempted pings against 1.1.1.1, my local router at 192.168.0.1, bing, google, etc. Interestingly, looking at the networking-bad dump and some other observations, the IPv4 default route and subnet routes are nuked, but IPv6 remains up and available. In fact, if I know the IPv6 address of a service, new traffic to it is passed. Existing traffic was dropped: at the time of the event, I had an IPv6 connection open to one of the Libera IRC network servers, and it was dropped. But I was able to successfully ping the IPv6 addresses of both Google and the Libera IRC server I was connected to at the time. All of those pings, both successful and unsuccessful, were captured in the ETL file.
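
The good-versus-bad comparison described above can be sketched as a set difference over route destinations. This is only an illustration, assuming plain `ip route show` style text for each address family; the helper names and the sample route lines are hypothetical and are not taken from networking-good.txt or networking-bad.txt.

```python
def route_destinations(route_output: str) -> set:
    """Collect the destination (first field) of every non-empty route line."""
    return {line.split()[0] for line in route_output.splitlines() if line.strip()}


def lost_routes(good: str, bad: str) -> list:
    """Destinations present in the good state but missing in the bad state."""
    return sorted(route_destinations(good) - route_destinations(bad))


# Hypothetical samples shaped like `ip -4 route show` and `ip -6 route show` output.
good_v4 = (
    "default via 192.168.0.1 dev eth0 proto kernel\n"
    "192.168.0.0/24 dev eth0 proto kernel scope link\n"
)
bad_v4 = ""  # IPv4 default route and subnet route both gone

good_v6 = (
    "fe80::/64 dev eth0 proto kernel metric 256 pref medium\n"
    "default via fe80::1 dev eth0 metric 1024 pref medium\n"
)
bad_v6 = good_v6  # IPv6 routes unchanged in the bad state

print("lost IPv4 routes:", lost_routes(good_v4, bad_v4))
print("lost IPv6 routes:", lost_routes(good_v6, bad_v6))
```

With these samples, the IPv4 comparison reports both the default route and the subnet route as lost, while the IPv6 comparison reports nothing, which matches the pattern in the dumps: IPv4 unreachable, IPv6 still working.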

@keith-horton
Member

Thanks. The traffic over IPv6 is working (because there's a v6 route), but IPv4 doesn't have a route, so all of that traffic is failing. We have now heard a couple of instances where something is running on the Windows host that is affecting the vNIC that we use - causing the vmNIC in the container to go down & up again, at which point Linux will delete the IPv4 route (that's just Linux stack behavior, for whatever reasons).

It's not clear what is changing the state of the vNIC on the host, though. Nothing in WSL indicates that it changed; if HNS (the component that creates the vNICs) had changed it, for example, we would get a callback notification.

We are going to talk more internally about better responding to this and syncing IP state in Linux when we see changes occur unexpectedly.

@sbalmos
Author

sbalmos commented Oct 13, 2023

Yup, thanks Keith! I read the other thread, and that one's author is a lot more thorough than I am. I just confirmed over on that thread that the IPv6 Temporary IP behavior he suspected is also what triggers it for me.

@keith-horton
Member

Thank you all for your help debugging this. I was able to reproduce this and I have a fix which will hopefully be out with the next update.

@keith-horton
Member

The preview release should have the fix for this, which will hopefully be going to the public release soon.
You can get the prerelease here:

wsl --update --pre-release

Thanks again!

@CatalinFetoiu
Collaborator

Closing since the issue is fixed. If you still encounter the problem, please open a new issue. Thanks.
