Moving Windows troubleshooting topics to /tasks/debug-application-cluster/

Signed-off-by: Mark Rossetti <[email protected]>
marosset committed Mar 31, 2022
1 parent aef1728 commit 534c45f
Showing 2 changed files with 174 additions and 258 deletions.
@@ -704,9 +704,8 @@ Privileged containers are [not supported](#compatibility-v1-pod-spec-containers-

## Getting help and troubleshooting {#troubleshooting}

Your main source of help for troubleshooting your Kubernetes cluster should start
with the [Troubleshooting](/docs/tasks/debug-application-cluster/troubleshooting/)
page.
For help with debugging and troubleshooting your Kubernetes cluster and/or workloads please start
with the [Troubleshooting](/docs/tasks/debug-application-cluster/) section.

Some additional, Windows-specific troubleshooting help is included
in this section. Logs are an important element of troubleshooting
@@ -715,268 +714,15 @@ troubleshooting assistance from other contributors. Follow the
instructions in the
SIG Windows [contributing guide on gathering logs](https://github.com/kubernetes/community/blob/master/sig-windows/CONTRIBUTING.md#gathering-logs).

### Node-level troubleshooting {#troubleshooting-node}

1. How do I know `start.ps1` completed successfully?

You should see kubelet, kube-proxy, and (if you chose Flannel as your networking
solution) flanneld host-agent processes running on your node, with running logs
being displayed in separate PowerShell windows. In addition to this, your Windows
node should be listed as "Ready" in your Kubernetes cluster.

1. Can I configure the Kubernetes node processes to run in the background as services?

The kubelet and kube-proxy are already configured to run as native Windows Services,
offering resiliency by re-starting the services automatically in the event of
failure (for example a process crash). You have two options for configuring these
node components as services.

1. As native Windows Services

You can run the kubelet and kube-proxy as native Windows Services using `sc.exe`.

```powershell
# Create the services for kubelet and kube-proxy in two separate commands
sc.exe create <component_name> binPath= "<path_to_binary> --service <other_args>"
# Please note that if the arguments contain spaces, they must be escaped.
sc.exe create kubelet binPath= "C:\kubelet.exe --service --hostname-override 'minion' <other_args>"
# Start the services
Start-Service kubelet
Start-Service kube-proxy
# Stop the services (-Force is optional)
Stop-Service kubelet -Force
Stop-Service kube-proxy -Force
# Query the service status
Get-Service kubelet
Get-Service kube-proxy
```

1. Using `nssm.exe`

You can also use alternative service managers such as
[nssm.exe](https://nssm.cc/) to run these processes (flanneld,
kubelet, and kube-proxy) in the background for you. You can use this
[sample script](https://github.com/Microsoft/SDN/tree/master/Kubernetes/flannel/register-svc.ps1),
which uses nssm.exe to register kubelet, kube-proxy, and flanneld.exe to run
as Windows services in the background.

```powershell
register-svc.ps1 -NetworkMode <Network mode> -ManagementIP <Windows Node IP> -ClusterCIDR <Cluster subnet> -KubeDnsServiceIP <Kube-dns Service IP> -LogDir <Directory to place logs>
# NetworkMode = The network mode l2bridge (flannel host-gw, also the default value) or overlay (flannel vxlan) chosen as a network solution
# ManagementIP = The IP address assigned to the Windows node. You can use ipconfig to find this
# ClusterCIDR = The cluster subnet range. (Default value 10.244.0.0/16)
# KubeDnsServiceIP = The Kubernetes DNS service IP (Default value 10.96.0.10)
# LogDir = The directory where kubelet and kube-proxy logs are redirected into their respective output files (Default value C:\k)
```

If the above referenced script is not suitable, you can manually configure
`nssm.exe` using the following examples.

```powershell
# Register flanneld.exe
nssm install flanneld C:\flannel\flanneld.exe
nssm set flanneld AppParameters --kubeconfig-file=c:\k\config --iface=<ManagementIP> --ip-masq=1 --kube-subnet-mgr=1
nssm set flanneld AppEnvironmentExtra NODE_NAME=<hostname>
nssm set flanneld AppDirectory C:\flannel
nssm start flanneld
# Register kubelet.exe
# Microsoft releases the pause infrastructure container at mcr.microsoft.com/oss/kubernetes/pause:3.6
nssm install kubelet C:\k\kubelet.exe
nssm set kubelet AppParameters --hostname-override=<hostname> --v=6 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.6 --resolv-conf="" --allow-privileged=true --enable-debugging-handlers --cluster-dns=<DNS-service-IP> --cluster-domain=cluster.local --kubeconfig=c:\k\config --hairpin-mode=promiscuous-bridge --image-pull-progress-deadline=20m --cgroups-per-qos=false --log-dir=<log directory> --logtostderr=false --enforce-node-allocatable="" --network-plugin=cni --cni-bin-dir=c:\k\cni --cni-conf-dir=c:\k\cni\config
nssm set kubelet AppDirectory C:\k
nssm start kubelet
# Register kube-proxy.exe (l2bridge / host-gw)
nssm install kube-proxy C:\k\kube-proxy.exe
nssm set kube-proxy AppDirectory c:\k
nssm set kube-proxy AppParameters --v=4 --proxy-mode=kernelspace --hostname-override=<hostname> --kubeconfig=c:\k\config --enable-dsr=false --log-dir=<log directory> --logtostderr=false
nssm.exe set kube-proxy AppEnvironmentExtra KUBE_NETWORK=cbr0
nssm set kube-proxy DependOnService kubelet
nssm start kube-proxy
# Register kube-proxy.exe (overlay / vxlan)
nssm install kube-proxy C:\k\kube-proxy.exe
nssm set kube-proxy AppDirectory c:\k
nssm set kube-proxy AppParameters --v=4 --proxy-mode=kernelspace --feature-gates="WinOverlay=true" --hostname-override=<hostname> --kubeconfig=c:\k\config --network-name=vxlan0 --source-vip=<source-vip> --enable-dsr=false --log-dir=<log directory> --logtostderr=false
nssm set kube-proxy DependOnService kubelet
nssm start kube-proxy
```

For initial troubleshooting, you can use the following flags in [nssm.exe](https://nssm.cc/) to redirect stdout and stderr to an output file:

```powershell
nssm set <Service Name> AppStdout C:\k\mysvc.log
nssm set <Service Name> AppStderr C:\k\mysvc.log
```

For additional details, see [NSSM - the Non-Sucking Service Manager](https://nssm.cc/usage).

1. My Pods are stuck at "Container Creating" or restarting over and over

Check that your pause image is compatible with your OS version. The
[instructions](https://docs.microsoft.com/en-us/virtualization/windowscontainers/kubernetes/deploying-resources)
assume that both the OS and the containers are version 1803. If you have a later
version of Windows, such as an Insider build, you need to adjust the images
accordingly. See [Pause container](#pause-container) for more details.
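
The first two questions in this list come down to checking that the node services exist and are running. The following is a minimal sketch of such a check, not part of Kubernetes or the SDN scripts: `Get-NodeServiceSummary` and its parameters are hypothetical names introduced here for illustration, and the function only assumes it receives objects with `Name` and `Status` properties, such as those returned by `Get-Service`.

```powershell
# Illustrative helper (hypothetical name, not shipped with Kubernetes):
# summarize whether each expected node service is present and running.
function Get-NodeServiceSummary {
    param(
        # Services you expect on the node; add 'flanneld' if you use Flannel.
        [string[]]$ExpectedNames = @('kubelet', 'kube-proxy'),
        # Objects exposing Name and Status, for example the output of Get-Service.
        [object[]]$Services = @()
    )
    foreach ($name in $ExpectedNames) {
        $match = $Services | Where-Object { $_.Name -eq $name } | Select-Object -First 1
        $status = 'NotInstalled'
        if ($match) { $status = "$($match.Status)" }
        # Emit one row per expected service.
        [pscustomobject]@{
            Name    = $name
            Status  = $status
            Healthy = ($status -eq 'Running')
        }
    }
}
```

On a Windows node, one way to call it would be `Get-NodeServiceSummary -Services (Get-Service -Name kubelet, kube-proxy -ErrorAction SilentlyContinue)`. Any row with `Healthy` set to `False` points at a service that still needs to be registered or started.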

### Network troubleshooting {#troubleshooting-network}

1. My Windows Pods do not have network connectivity

If you are using virtual machines, ensure that MAC spoofing is **enabled** on all
the VM network adapter(s).

1. My Windows Pods cannot ping external resources

Windows Pods do not have outbound rules programmed for the ICMP protocol. However,
TCP/UDP is supported. When trying to demonstrate connectivity to resources
outside of the cluster, substitute `ping <IP>` with corresponding
`curl <IP>` commands.

If you are still facing problems, most likely your network configuration in
[cni.conf](https://github.com/Microsoft/SDN/blob/master/Kubernetes/flannel/l2bridge/cni/config/cni.conf)
deserves some extra attention. You can always edit this static file. The
configuration update will apply to any new Kubernetes resources.

One of the Kubernetes networking requirements
(see [Kubernetes model](/docs/concepts/cluster-administration/networking/)) is
for cluster communication to occur without NAT internally. To honor this
requirement, there is an
[ExceptionList](https://github.com/Microsoft/SDN/blob/master/Kubernetes/flannel/l2bridge/cni/config/cni.conf#L20)
for all the communication where you do not want outbound NAT to occur. However,
this also means that you need to exclude the external IP you are trying to query
from the `ExceptionList`. Only then will the traffic originating from your Windows
pods be SNAT'ed correctly to receive a response from the outside world. In this
regard, your `ExceptionList` in `cni.conf` should look as follows:

```conf
"ExceptionList": [
"10.244.0.0/16", # Cluster subnet
"10.96.0.0/12", # Service subnet
"10.127.130.0/24" # Management (host) subnet
]
```

1. My Windows node cannot access `NodePort` type Services

Local NodePort access from the node itself fails. This is a known
limitation. NodePort access works from other nodes or external clients.

1. vNICs and HNS endpoints of containers are being deleted

This issue can be caused when the `hostname-override` parameter is not passed to
[kube-proxy](/docs/reference/command-line-tools-reference/kube-proxy/). To resolve
it, users need to pass the hostname to kube-proxy as follows:

```powershell
C:\k\kube-proxy.exe --hostname-override=$(hostname)
```

1. With flannel, my nodes are having issues after rejoining a cluster

Whenever a previously deleted node is being re-joined to the cluster, flanneld
tries to assign a new pod subnet to the node. Users should remove the old pod
subnet configuration files in the following paths:

```powershell
Remove-Item C:\k\SourceVip.json
Remove-Item C:\k\SourceVipRequest.json
```

1. After launching `start.ps1`, flanneld is stuck in "Waiting for the Network to be created"

There are numerous reports of this [issue](https://github.com/coreos/flannel/issues/1066); most likely it is a timing issue for when the management IP of the flannel network is set. A workaround is to relaunch `start.ps1`, or to relaunch flanneld manually as follows:

```powershell
[Environment]::SetEnvironmentVariable("NODE_NAME", "<Windows_Worker_Hostname>")
C:\flannel\flanneld.exe --kubeconfig-file=c:\k\config --iface=<Windows_Worker_Node_IP> --ip-masq=1 --kube-subnet-mgr=1
```

1. My Windows Pods cannot launch because of missing `/run/flannel/subnet.env`

This indicates that Flannel didn't launch correctly. You can either try
to restart `flanneld.exe`, or you can copy the file over manually from
`/run/flannel/subnet.env` on the Kubernetes master to `C:\run\flannel\subnet.env`
on the Windows worker node and modify the `FLANNEL_SUBNET` row to a different
number. For example, if node subnet 10.244.4.1/24 is desired:

```env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.4.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=true
```

1. My Windows node cannot access my services using the service IP

This is a known limitation of the networking stack on Windows. However, Windows Pods can access the Service IP.

1. No network adapter is found when starting the kubelet

The Windows networking stack needs a virtual adapter for Kubernetes networking to work. If the following commands return no results (in an admin shell), virtual network creation — a necessary prerequisite for the kubelet to work — has failed:

```powershell
Get-HnsNetwork | ? Name -ieq "cbr0"
Get-NetAdapter | ? Name -Like "vEthernet (Ethernet*"
```

Often it is worthwhile to modify the [InterfaceName](https://github.com/microsoft/SDN/blob/master/Kubernetes/flannel/start.ps1#L7) parameter of the start.ps1 script, in cases where the host's network adapter isn't "Ethernet". Otherwise, consult the output of the `start-kubelet.ps1` script to see if there are errors during virtual network creation.

1. DNS resolution is not properly working

Check the DNS limitations for Windows in this [section](#dns-limitations).

1. `kubectl port-forward` fails with "unable to do port forwarding: wincat not found"

Support for `kubectl port-forward` on Windows was implemented in Kubernetes 1.15 by including `wincat.exe` in the pause infrastructure container `mcr.microsoft.com/oss/kubernetes/pause:3.6`. Be sure to use a supported version of Kubernetes.
If you would like to build your own pause infrastructure container, be sure to include [wincat](https://github.com/kubernetes/kubernetes/tree/master/build/pause/windows/wincat).

1. My Kubernetes installation is failing because my Windows Server node is behind a proxy

If you are behind a proxy, the following PowerShell environment variables must be defined:

```PowerShell
[Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://proxy.example.com:80/", [EnvironmentVariableTarget]::Machine)
[Environment]::SetEnvironmentVariable("HTTPS_PROXY", "http://proxy.example.com:443/", [EnvironmentVariableTarget]::Machine)
```
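
The `ExceptionList` rule discussed earlier in this list (destinations covered by the list are not NAT'ed, so an external IP must fall outside it) can be checked by hand. The following is a minimal IPv4-only sketch under that reading of the text. The helper names `ConvertTo-IPv4Number`, `Test-IPv4InCidr`, and `Get-MatchingException` are hypothetical and introduced here for illustration; they are not part of any CNI plugin or Kubernetes tooling. The subnets are the ones from the `ExceptionList` example.

```powershell
# Convert a dotted-quad IPv4 address to a number (illustrative helper).
function ConvertTo-IPv4Number {
    param([Parameter(Mandatory)][string]$Address)
    $octets = $Address.Split('.') | ForEach-Object { [double]$_ }
    return ($octets[0] * 16777216) + ($octets[1] * 65536) + ($octets[2] * 256) + $octets[3]
}

# Return $true when $Address falls inside the CIDR range $Cidr.
# Two addresses share a /N network when they land in the same block of 2^(32-N) addresses.
function Test-IPv4InCidr {
    param(
        [Parameter(Mandatory)][string]$Address,
        [Parameter(Mandatory)][string]$Cidr
    )
    $parts     = $Cidr.Split('/')
    $blockSize = [math]::Pow(2, 32 - [int]$parts[1])
    $addrBlock = [math]::Floor((ConvertTo-IPv4Number $Address) / $blockSize)
    $netBlock  = [math]::Floor((ConvertTo-IPv4Number $parts[0]) / $blockSize)
    return $addrBlock -eq $netBlock
}

# Return the ExceptionList entries that cover the destination.
# No output means the destination is outside the list, so outbound NAT applies to it.
function Get-MatchingException {
    param(
        [Parameter(Mandatory)][string]$Destination,
        [Parameter(Mandatory)][string[]]$ExceptionList
    )
    $ExceptionList | Where-Object { Test-IPv4InCidr -Address $Destination -Cidr $_ }
}

# Subnets copied from the ExceptionList example: cluster, service, and management subnets.
$exceptionList = @('10.244.0.0/16', '10.96.0.0/12', '10.127.130.0/24')
```

For example, `Get-MatchingException -Destination <external IP> -ExceptionList $exceptionList` should return nothing for an address you expect to reach through outbound NAT. If it returns an entry, that entry is the one preventing SNAT for the destination.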

### Further investigation

If these steps don't resolve your problem, you can get help running Windows containers on Windows nodes in Kubernetes through:

* StackOverflow [Windows Server Container](https://stackoverflow.com/questions/tagged/windows-server-container) topic
* Kubernetes Official Forum [discuss.kubernetes.io](https://discuss.kubernetes.io/)
* Kubernetes Slack [#SIG-Windows Channel](https://kubernetes.slack.com/messages/sig-windows)

### Reporting issues and feature requests

If you have what looks like a bug, or you would like to
make a feature request, please use the
[GitHub issue tracking system](https://github.com/kubernetes/kubernetes/issues).
You can open issues on
[GitHub](https://github.com/kubernetes/kubernetes/issues/new/choose) and assign
them to SIG-Windows. You should first search the list of issues in case it was
make a feature request, please follow the [SIG Windows contributing guide](https://github.com/kubernetes/community/blob/master/sig-windows/CONTRIBUTING.md#reporting-issues-and-feature-requests) to create a new issue.
You should first search the list of issues in case it was
reported previously and comment with your experience on the issue and add additional
logs. SIG-Windows Slack is also a great avenue to get some initial support and
troubleshooting ideas prior to creating a ticket.

If filing a bug, please include detailed information about how to reproduce the problem, such as:

* Kubernetes version: output from `kubectl version`
* Environment details: Cloud provider, OS distro, networking choice and configuration, and Docker version
* Detailed steps to reproduce the problem
* [Relevant logs](https://github.com/kubernetes/community/blob/master/sig-windows/CONTRIBUTING.md#gathering-logs)

It helps if you tag the issue as **sig/windows**, by commenting on the issue with `/sig windows`. This helps to bring
the issue to a SIG Windows member's attention.


## {{% heading "whatsnext" %}}

### Deployment tools