diff --git a/content/en/docs/setup/production-environment/windows/intro-windows-in-kubernetes.md b/content/en/docs/setup/production-environment/windows/intro-windows-in-kubernetes.md index 91118c4d6c812..dc22759af6d9d 100644 --- a/content/en/docs/setup/production-environment/windows/intro-windows-in-kubernetes.md +++ b/content/en/docs/setup/production-environment/windows/intro-windows-in-kubernetes.md @@ -704,9 +704,8 @@ Privileged containers are [not supported](#compatibility-v1-pod-spec-containers- ## Getting help and troubleshooting {#troubleshooting} -Your main source of help for troubleshooting your Kubernetes cluster should start -with the [Troubleshooting](/docs/tasks/debug-application-cluster/troubleshooting/) -page. +For help with debugging and troubleshooting your Kubernetes cluster and/or workloads please start +with the [Troubleshooting](/docs/tasks/debug-application-cluster/) section. Some additional, Windows-specific troubleshooting help is included in this section. Logs are an important element of troubleshooting @@ -715,268 +714,15 @@ troubleshooting assistance from other contributors. Follow the instructions in the SIG Windows [contributing guide on gathering logs](https://github.com/kubernetes/community/blob/master/sig-windows/CONTRIBUTING.md#gathering-logs). -### Node-level troubleshooting {#troubleshooting-node} - -1. How do I know `start.ps1` completed successfully? - - You should see kubelet, kube-proxy, and (if you chose Flannel as your networking - solution) flanneld host-agent processes running on your node, with running logs - being displayed in separate PowerShell windows. In addition to this, your Windows - node should be listed as "Ready" in your Kubernetes cluster. - -1. Can I configure the Kubernetes node processes to run in the background as services? 
- - The kubelet and kube-proxy are already configured to run as native Windows Services, - offering resiliency by re-starting the services automatically in the event of - failure (for example a process crash). You have two options for configuring these - node components as services. - - 1. As native Windows Services - - You can run the kubelet and kube-proxy as native Windows Services using `sc.exe`. - - ```powershell - # Create the services for kubelet and kube-proxy in two separate commands - sc.exe create binPath= " --service " - - # Please note that if the arguments contain spaces, they must be escaped. - sc.exe create kubelet binPath= "C:\kubelet.exe --service --hostname-override 'minion' " - - # Start the services - Start-Service kubelet - Start-Service kube-proxy - - # Stop the service - Stop-Service kubelet (-Force) - Stop-Service kube-proxy (-Force) - - # Query the service status - Get-Service kubelet - Get-Service kube-proxy - ``` - - 1. Using `nssm.exe` - - You can also always use alternative service managers like - [nssm.exe](https://nssm.cc/) to run these processes (flanneld, - kubelet & kube-proxy) in the background for you. You can use this - [sample script](https://github.com/Microsoft/SDN/tree/master/Kubernetes/flannel/register-svc.ps1), - leveraging nssm.exe to register kubelet, kube-proxy, and flanneld.exe to run - as Windows services in the background. - - ```powershell - register-svc.ps1 -NetworkMode -ManagementIP -ClusterCIDR -KubeDnsServiceIP -LogDir - - # NetworkMode = The network mode l2bridge (flannel host-gw, also the default value) or overlay (flannel vxlan) chosen as a network solution - # ManagementIP = The IP address assigned to the Windows node. You can use ipconfig to find this - # ClusterCIDR = The cluster subnet range. 
(Default value 10.244.0.0/16) - # KubeDnsServiceIP = The Kubernetes DNS service IP (Default value 10.96.0.10) - # LogDir = The directory where kubelet and kube-proxy logs are redirected into their respective output files (Default value C:\k) - ``` - - If the above referenced script is not suitable, you can manually configure - `nssm.exe` using the following examples. - - ```powershell - # Register flanneld.exe - nssm install flanneld C:\flannel\flanneld.exe - nssm set flanneld AppParameters --kubeconfig-file=c:\k\config --iface= --ip-masq=1 --kube-subnet-mgr=1 - nssm set flanneld AppEnvironmentExtra NODE_NAME= - nssm set flanneld AppDirectory C:\flannel - nssm start flanneld - - # Register kubelet.exe - # Microsoft releases the pause infrastructure container at mcr.microsoft.com/oss/kubernetes/pause:3.6 - nssm install kubelet C:\k\kubelet.exe - nssm set kubelet AppParameters --hostname-override= --v=6 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.6 --resolv-conf="" --allow-privileged=true --enable-debugging-handlers --cluster-dns= --cluster-domain=cluster.local --kubeconfig=c:\k\config --hairpin-mode=promiscuous-bridge --image-pull-progress-deadline=20m --cgroups-per-qos=false --log-dir= --logtostderr=false --enforce-node-allocatable="" --network-plugin=cni --cni-bin-dir=c:\k\cni --cni-conf-dir=c:\k\cni\config - nssm set kubelet AppDirectory C:\k - nssm start kubelet - - # Register kube-proxy.exe (l2bridge / host-gw) - nssm install kube-proxy C:\k\kube-proxy.exe - nssm set kube-proxy AppDirectory c:\k - nssm set kube-proxy AppParameters --v=4 --proxy-mode=kernelspace --hostname-override=--kubeconfig=c:\k\config --enable-dsr=false --log-dir= --logtostderr=false - nssm.exe set kube-proxy AppEnvironmentExtra KUBE_NETWORK=cbr0 - nssm set kube-proxy DependOnService kubelet - nssm start kube-proxy - - # Register kube-proxy.exe (overlay / vxlan) - nssm install kube-proxy C:\k\kube-proxy.exe - nssm set kube-proxy AppDirectory c:\k - nssm set 
kube-proxy AppParameters --v=4 --proxy-mode=kernelspace --feature-gates="WinOverlay=true" --hostname-override= --kubeconfig=c:\k\config --network-name=vxlan0 --source-vip= --enable-dsr=false --log-dir= --logtostderr=false - nssm set kube-proxy DependOnService kubelet - nssm start kube-proxy - ``` - - For initial troubleshooting, you can use the following flags in [nssm.exe](https://nssm.cc/) to redirect stdout and stderr to a output file: - - ```powershell - nssm set AppStdout C:\k\mysvc.log - nssm set AppStderr C:\k\mysvc.log - ``` - - For additional details, see [NSSM - the Non-Sucking Service Manager](https://nssm.cc/usage). - -1. My Pods are stuck at "Container Creating" or restarting over and over - - Check that your pause image is compatible with your OS version. The - [instructions](https://docs.microsoft.com/en-us/virtualization/windowscontainers/kubernetes/deploying-resources) - assume that both the OS and the containers are version 1803. If you have a later - version of Windows, such as an Insider build, you need to adjust the images - accordingly. See [Pause container](#pause-container) for more details. - -### Network troubleshooting {#troubleshooting-network} - -1. My Windows Pods do not have network connectivity - - If you are using virtual machines, ensure that MAC spoofing is **enabled** on all - the VM network adapter(s). - -1. My Windows Pods cannot ping external resources - - Windows Pods do not have outbound rules programmed for the ICMP protocol. However, - TCP/UDP is supported. When trying to demonstrate connectivity to resources - outside of the cluster, substitute `ping ` with corresponding - `curl ` commands. - - If you are still facing problems, most likely your network configuration in - [cni.conf](https://github.com/Microsoft/SDN/blob/master/Kubernetes/flannel/l2bridge/cni/config/cni.conf) - deserves some extra attention. You can always edit this static file. The - configuration update will apply to any new Kubernetes resources. 
- - One of the Kubernetes networking requirements - (see [Kubernetes model](/docs/concepts/cluster-administration/networking/)) is - for cluster communication to occur without - NAT internally. To honor this requirement, there is an - [ExceptionList](https://github.com/Microsoft/SDN/blob/master/Kubernetes/flannel/l2bridge/cni/config/cni.conf#L20) - for all the communication where you do not want outbound NAT to occur. However, - this also means that you need to exclude the external IP you are trying to query - from the `ExceptionList`. Only then will the traffic originating from your Windows - pods be SNAT'ed correctly to receive a response from the outside world. In this - regard, your `ExceptionList` in `cni.conf` should look as follows: - - ```conf - "ExceptionList": [ - "10.244.0.0/16", # Cluster subnet - "10.96.0.0/12", # Service subnet - "10.127.130.0/24" # Management (host) subnet - ] - ``` - -1. My Windows node cannot access `NodePort` type Services - - Local NodePort access from the node itself fails. This is a known - limitation. NodePort access works from other nodes or external clients. - -1. vNICs and HNS endpoints of containers are being deleted - - This issue can be caused when the `hostname-override` parameter is not passed to - [kube-proxy](/docs/reference/command-line-tools-reference/kube-proxy/). To resolve - it, users need to pass the hostname to kube-proxy as follows: - - ```powershell - C:\k\kube-proxy.exe --hostname-override=$(hostname) - ``` - -1. With flannel, my nodes are having issues after rejoining a cluster - - Whenever a previously deleted node is being re-joined to the cluster, flannelD - tries to assign a new pod subnet to the node. Users should remove the old pod - subnet configuration files in the following paths: - - ```powershell - Remove-Item C:\k\SourceVip.json - Remove-Item C:\k\SourceVipRequest.json - ``` - -1. 
After launching `start.ps1`, flanneld is stuck in "Waiting for the Network to be created" - - There are numerous reports of this [issue](https://github.com/coreos/flannel/issues/1066); most likely it is a timing issue for when the management IP of the flannel network is set. A workaround is to relaunch `start.ps1` or relaunch it manually as follows: - - ```powershell - [Environment]::SetEnvironmentVariable("NODE_NAME", "") - C:\flannel\flanneld.exe --kubeconfig-file=c:\k\config --iface= --ip-masq=1 --kube-subnet-mgr=1 - ``` - -1. My Windows Pods cannot launch because of missing `/run/flannel/subnet.env` - - This indicates that Flannel didn't launch correctly. You can either try - to restart `flanneld.exe` or you can copy the files over manually from - `/run/flannel/subnet.env` on the Kubernetes master to `C:\run\flannel\subnet.env` - on the Windows worker node and modify the `FLANNEL_SUBNET` row to a different - number. For example, if node subnet 10.244.4.1/24 is desired: - - ```env - FLANNEL_NETWORK=10.244.0.0/16 - FLANNEL_SUBNET=10.244.4.1/24 - FLANNEL_MTU=1500 - FLANNEL_IPMASQ=true - ``` - -1. My Windows node cannot access my services using the service IP - - This is a known limitation of the networking stack on Windows. However, Windows Pods can access the Service IP. - -1. No network adapter is found when starting the kubelet - - The Windows networking stack needs a virtual adapter for Kubernetes networking to work. If the following commands return no results (in an admin shell), virtual network creation — a necessary prerequisite for the kubelet to work — has failed: - - ```powershell - Get-HnsNetwork | ? Name -ieq "cbr0" - Get-NetAdapter | ? Name -Like "vEthernet (Ethernet*" - ``` - - Often it is worthwhile to modify the [InterfaceName](https://github.com/microsoft/SDN/blob/master/Kubernetes/flannel/start.ps1#L7) parameter of the start.ps1 script, in cases where the host's network adapter isn't "Ethernet". 
Otherwise, consult the output of the `start-kubelet.ps1` script to see if there are errors during virtual network creation. - -1. DNS resolution is not properly working - - Check the DNS limitations for Windows in this [section](#dns-limitations). - -1. `kubectl port-forward` fails with "unable to do port forwarding: wincat not found" - - This was implemented in Kubernetes 1.15 by including `wincat.exe` in the pause infrastructure container `mcr.microsoft.com/oss/kubernetes/pause:3.6`. Be sure to use a supported version of Kubernetes. - If you would like to build your own pause infrastructure container be sure to include [wincat](https://github.com/kubernetes/kubernetes/tree/master/build/pause/windows/wincat). - -1. My Kubernetes installation is failing because my Windows Server node is behind a proxy - - If you are behind a proxy, the following PowerShell environment variables must be defined: - - ```PowerShell - [Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://proxy.example.com:80/", [EnvironmentVariableTarget]::Machine) - [Environment]::SetEnvironmentVariable("HTTPS_PROXY", "http://proxy.example.com:443/", [EnvironmentVariableTarget]::Machine) - ``` - -### Further investigation - -If these steps don't resolve your problem, you can get help running Windows containers on Windows nodes in Kubernetes through: - -* StackOverflow [Windows Server Container](https://stackoverflow.com/questions/tagged/windows-server-container) topic -* Kubernetes Official Forum [discuss.kubernetes.io](https://discuss.kubernetes.io/) -* Kubernetes Slack [#SIG-Windows Channel](https://kubernetes.slack.com/messages/sig-windows) - ### Reporting issues and feature requests If you have what looks like a bug, or you would like to -make a feature request, please use the -[GitHub issue tracking system](https://github.com/kubernetes/kubernetes/issues). -You can open issues on -[GitHub](https://github.com/kubernetes/kubernetes/issues/new/choose) and assign -them to SIG-Windows. 
You should first search the list of issues in case it was +make a feature request, please follow the [SIG Windows contributing guide](https://github.com/kubernetes/community/blob/master/sig-windows/CONTRIBUTING.md#reporting-issues-and-feature-requests) to create a new issue. +You should first search the list of issues in case it was reported previously and comment with your experience on the issue and add additional logs. SIG-Windows Slack is also a great avenue to get some initial support and troubleshooting ideas prior to creating a ticket. -If filing a bug, please include detailed information about how to reproduce the problem, such as: - -* Kubernetes version: output from `kubectl version` -* Environment details: Cloud provider, OS distro, networking choice and configuration, and Docker version -* Detailed steps to reproduce the problem -* [Relevant logs](https://github.com/kubernetes/community/blob/master/sig-windows/CONTRIBUTING.md#gathering-logs) - -It helps if you tag the issue as **sig/windows**, by commenting on the issue with `/sig windows`. This helps to bring -the issue to a SIG Windows member's attention - - ## {{% heading "whatsnext" %}} ### Deployment tools diff --git a/content/en/docs/tasks/debug-application-cluster/windows.md b/content/en/docs/tasks/debug-application-cluster/windows.md new file mode 100644 index 0000000000000..af6ffbd49fad9 --- /dev/null +++ b/content/en/docs/tasks/debug-application-cluster/windows.md @@ -0,0 +1,170 @@ +--- +reviewers: +- aravindhp +- jayunit100 +- jsturtevant +- marosset +title: Windows debugging tips +content_type: concept +--- + + + + + +## Node-level troubleshooting {#troubleshooting-node} + +1. My Pods are stuck at "Container Creating" or restarting over and over + + Ensure that your pause image is compatible with your Windows OS version. 
+ See [Pause container](/docs/setup/production-environment/windows/intro-windows-in-kubernetes#pause-container) + for the latest recommended pause image and further details. + + {{< note >}} + If using containerd as your container runtime, the pause image is specified in the + `plugins.plugins.cri.sandbox_image` field of the config.toml configuration file. + {{< /note >}} + +1. My pods show status as `ErrImagePull` or `ImagePullBackOff` + + Ensure that your Pod is getting scheduled to a [compatible](https://docs.microsoft.com/virtualization/windowscontainers/deploy-containers/version-compatibility) Windows Node. + + More information on how to specify a compatible node for your Pod can be found in [this guide](/docs/setup/production-environment/windows/user-guide-windows-containers/#ensuring-os-specific-workloads-land-on-the-appropriate-container-host). + +## Network troubleshooting {#troubleshooting-network} + +1. My Windows Pods do not have network connectivity + + If you are using virtual machines, ensure that MAC spoofing is **enabled** on all + the VM network adapter(s). + +1. My Windows Pods cannot ping external resources + + Windows Pods do not have outbound rules programmed for the ICMP protocol. However, + TCP/UDP is supported. When trying to demonstrate connectivity to resources + outside of the cluster, substitute `ping ` with corresponding + `curl ` commands. + + If you are still facing problems, most likely your network configuration in + [cni.conf](https://github.com/Microsoft/SDN/blob/master/Kubernetes/flannel/l2bridge/cni/config/cni.conf) + deserves some extra attention. You can always edit this static file. The + configuration update will apply to any new Kubernetes resources. + + One of the Kubernetes networking requirements + (see [Kubernetes model](/docs/concepts/cluster-administration/networking/)) is + for cluster communication to occur without + NAT internally. 
To honor this requirement, there is an + [ExceptionList](https://github.com/Microsoft/SDN/blob/master/Kubernetes/flannel/l2bridge/cni/config/cni.conf#L20) + for all the communication where you do not want outbound NAT to occur. However, + this also means that you need to exclude the external IP you are trying to query + from the `ExceptionList`. Only then will the traffic originating from your Windows + pods be SNAT'ed correctly to receive a response from the outside world. In this + regard, your `ExceptionList` in `cni.conf` should look as follows: + + ```conf + "ExceptionList": [ + "10.244.0.0/16", # Cluster subnet + "10.96.0.0/12", # Service subnet + "10.127.130.0/24" # Management (host) subnet + ] + ``` + +1. My Windows node cannot access `NodePort` type Services + + Local NodePort access from the node itself fails. This is a known + limitation. NodePort access works from other nodes or external clients. + +1. vNICs and HNS endpoints of containers are being deleted + + This issue can be caused when the `hostname-override` parameter is not passed to + [kube-proxy](/docs/reference/command-line-tools-reference/kube-proxy/). To resolve + it, users need to pass the hostname to kube-proxy as follows: + + ```powershell + C:\k\kube-proxy.exe --hostname-override=$(hostname) + ``` + +1. My Windows node cannot access my services using the service IP + + This is a known limitation of the networking stack on Windows. However, Windows Pods can access the Service IP. + +1. No network adapter is found when starting the kubelet + + The Windows networking stack needs a virtual adapter for Kubernetes networking to work. + If the following commands return no results (in an admin shell), + virtual network creation — a necessary prerequisite for the kubelet to work — has failed: + + ```powershell + Get-HnsNetwork | ? Name -ieq "cbr0" + Get-NetAdapter | ? 
Name -Like "vEthernet (Ethernet*" + ``` + + Often it is worthwhile to modify the [InterfaceName](https://github.com/microsoft/SDN/blob/master/Kubernetes/flannel/start.ps1#L7) parameter of the start.ps1 script, + in cases where the host's network adapter isn't "Ethernet". + Otherwise, consult the output of the `start-kubelet.ps1` script to see if there are errors during virtual network creation. + +1. DNS resolution is not working properly + + Check the [DNS limitations](/docs/setup/production-environment/windows/intro-windows-in-kubernetes#dns-limitations) for Windows. + +1. `kubectl port-forward` fails with "unable to do port forwarding: wincat not found" + + Port forwarding support was implemented in Kubernetes 1.15 by including `wincat.exe` in the pause infrastructure container `mcr.microsoft.com/oss/kubernetes/pause:3.6`. + Be sure to use a supported version of Kubernetes. + If you would like to build your own pause infrastructure container, be sure to include [wincat](https://github.com/kubernetes/kubernetes/tree/master/build/pause/windows/wincat). + +1. My Kubernetes installation is failing because my Windows Server node is behind a proxy + + If you are behind a proxy, the following PowerShell environment variables must be defined: + + ```PowerShell + [Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://proxy.example.com:80/", [EnvironmentVariableTarget]::Machine) + [Environment]::SetEnvironmentVariable("HTTPS_PROXY", "http://proxy.example.com:443/", [EnvironmentVariableTarget]::Machine) + ``` + +### Flannel troubleshooting + +1. With Flannel, my nodes are having issues after rejoining a cluster + + Whenever a previously deleted node is re-joined to the cluster, flanneld + tries to assign a new pod subnet to the node. Users should remove the old pod + subnet configuration files in the following paths: + + ```powershell + Remove-Item C:\k\SourceVip.json + Remove-Item C:\k\SourceVipRequest.json + ``` + +1. 
Flanneld is stuck in "Waiting for the Network to be created" + + There are numerous reports of this [issue](https://github.com/coreos/flannel/issues/1066); + most likely it is a timing issue for when the management IP of the flannel network is set. + A workaround is to relaunch `start.ps1` or relaunch it manually as follows: + + ```powershell + [Environment]::SetEnvironmentVariable("NODE_NAME", "") + C:\flannel\flanneld.exe --kubeconfig-file=c:\k\config --iface= --ip-masq=1 --kube-subnet-mgr=1 + ``` + +1. My Windows Pods cannot launch because of missing `/run/flannel/subnet.env` + + This indicates that Flannel didn't launch correctly. You can either try + to restart `flanneld.exe` or you can copy the files over manually from + `/run/flannel/subnet.env` on the Kubernetes master to `C:\run\flannel\subnet.env` + on the Windows worker node and modify the `FLANNEL_SUBNET` row to a different + number. For example, if node subnet 10.244.4.1/24 is desired: + + ```env + FLANNEL_NETWORK=10.244.0.0/16 + FLANNEL_SUBNET=10.244.4.1/24 + FLANNEL_MTU=1500 + FLANNEL_IPMASQ=true + ``` + +### Further investigation + +If these steps don't resolve your problem, you can get help running Windows containers on Windows nodes in Kubernetes through: + +* StackOverflow [Windows Server Container](https://stackoverflow.com/questions/tagged/windows-server-container) topic +* Kubernetes Official Forum [discuss.kubernetes.io](https://discuss.kubernetes.io/) +* Kubernetes Slack [#SIG-Windows Channel](https://kubernetes.slack.com/messages/sig-windows) \ No newline at end of file