This guide provides step-by-step troubleshooting instructions to ensure your Kubernetes cluster and JupyterHub service are running correctly after a system reboot.
- Verify GPU Availability
Run the following command to check whether the GPUs are accessible:
nvidia-smi
Ensure the GPUs are detected correctly. If not, confirm that the NVIDIA drivers and CUDA are installed and properly configured.
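The check can also be scripted so that a missing driver and an empty GPU list are reported separately. This is a minimal sketch, assuming `nvidia-smi -L` prints one `GPU <index>: ...` line per device; the function name `check_gpu` is hypothetical.

```shell
# Report how many GPUs the driver exposes; fail if none are visible.
check_gpu() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; the NVIDIA driver may not be installed" >&2
        return 1
    fi
    # `nvidia-smi -L` lists one "GPU <index>: <name> (UUID: ...)" line per device.
    gpu_count=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ')
    if [ "$gpu_count" -eq 0 ]; then
        echo "no GPUs detected; check the driver and CUDA installation" >&2
        return 1
    fi
    echo "$gpu_count GPU(s) detected"
}
```

Calling `check_gpu` before the Kubernetes checks makes it clear whether a later scheduling failure is a driver problem or a cluster problem.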
- Network Connectivity
a. Check Ping to Master Node
Verify connectivity to the master node:
ping 10.105.10.80
b. Check Kubernetes API Port
Ensure the Kubernetes API is accessible:
nc -zv 10.105.10.80 6443
c. Check Access to Services in Browser
Use port-forwarding or the service's external IP to verify that the services are reachable in a browser. Replace the namespace and service name in the command with the values for your deployment:
kubectl port-forward --address 0.0.0.0 -n jhub svc/proxy-public 8888:80
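Right after a reboot the API server may need some time before it accepts connections, so a single `nc` probe can fail even on a healthy node. The sketch below retries the port check; it assumes the installed `nc` supports `-z` and `-w`, and the function name `wait_for_port` and its defaults are hypothetical.

```shell
# wait_for_port HOST PORT [ATTEMPTS] [DELAY_SECONDS]
# Probe a TCP port repeatedly until it answers or the attempts run out.
wait_for_port() {
    wfp_host=$1
    wfp_port=$2
    wfp_attempts=${3:-10}
    wfp_delay=${4:-5}
    wfp_i=1
    while [ "$wfp_i" -le "$wfp_attempts" ]; do
        # -z: only test the connection; -w 2: give up on one probe after 2 seconds.
        if nc -z -w 2 "$wfp_host" "$wfp_port" >/dev/null 2>&1; then
            echo "$wfp_host:$wfp_port reachable after $wfp_i attempt(s)"
            return 0
        fi
        sleep "$wfp_delay"
        wfp_i=$((wfp_i + 1))
    done
    echo "$wfp_host:$wfp_port unreachable after $wfp_attempts attempt(s)" >&2
    return 1
}
```

For the master node used in this guide, the call would be `wait_for_port 10.105.10.80 6443`.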
- Restart kubelet Service
Restart the kubelet, the Kubernetes node agent:
sudo systemctl restart kubelet
Verify the status of the kubelet service:
sudo systemctl status kubelet
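When the kubelet does not come back up, its journal usually names the cause. The following sketch combines the status check with a look at the most recent log lines; the function name `kubelet_health` and the choice of 30 lines are hypothetical.

```shell
# Print a one-line status; on failure, dump the tail of the kubelet journal.
kubelet_health() {
    if systemctl is-active --quiet kubelet; then
        echo "kubelet is active"
        return 0
    fi
    echo "kubelet is not active; last log lines follow" >&2
    # Reading the journal may require root or membership in the systemd-journal group.
    journalctl -u kubelet -n 30 --no-pager >&2
    return 1
}
```

A typical use is `kubelet_health || sudo systemctl restart kubelet`, followed by a second `kubelet_health` call to confirm the restart helped.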
- Verify Kubernetes Context
Check the current Kubernetes contexts:
kubectl config get-contexts
- Check Cluster State
a. Verify Pods in All Namespaces
kubectl get pods --all-namespaces
b. Check API Server Pods
kubectl -n kube-system get pods | grep apiserver
c. Check API Port Binding
Ensure the Kubernetes API server is bound to the correct port:
sudo netstat -tuln | grep 6443
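On a busy cluster the full pod list is long, and the pods that need attention are easy to miss. The filter below is a sketch that keeps only pods whose STATUS column is neither `Running` nor `Completed`; it assumes the six-column layout printed by `kubectl get pods --all-namespaces --no-headers`, and the name `filter_unhealthy` is hypothetical.

```shell
# Read `kubectl get pods --all-namespaces --no-headers` output on stdin and
# print "<namespace>/<pod>: <status>" for every pod that is not healthy.
filter_unhealthy() {
    # Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}
```

Usage: `kubectl get pods --all-namespaces --no-headers | filter_unhealthy`. An empty result means every pod reports `Running` or `Completed`; a pod that is `Running` but not yet ready (for example `0/1`) is not flagged by this filter.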
- If the cluster fails to start, reinitialize it:
Disable swap:
sudo swapoff -a
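`swapoff -a` only lasts until the next reboot, so swap coming back is a common reason for the kubelet to fail after a restart. The sketch below checks whether any swap area is active by reading `/proc/swaps`; the function name `swap_is_off` and the optional file argument (used for testing) are hypothetical.

```shell
# Succeed (exit 0) when no swap area is active.
# Reads /proc/swaps by default; another file can be passed as the first argument.
swap_is_off() {
    # /proc/swaps starts with one header line; each further line is an active swap area.
    awk 'NR > 1 { active = 1 } END { exit (active ? 1 : 0) }' "${1:-/proc/swaps}"
}
```

Usage: `swap_is_off || sudo swapoff -a`. To keep swap disabled across reboots, the swap entry in `/etc/fstab` also has to be commented out.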
- Initialize Kubernetes:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/cri-dockerd.sock
- Reapply network plugin (Flannel):
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
- Verify all pods are running:
kubectl get pods --all-namespaces
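Instead of re-running the pod listing by hand until everything settles, `kubectl wait` can block until the nodes and system pods report Ready. This is a sketch under the assumption that the control-plane pods live in `kube-system`; the function name `wait_cluster_ready` and the default timeout are hypothetical.

```shell
# wait_cluster_ready [TIMEOUT]
# Block until all nodes, then all kube-system pods, have condition Ready.
wait_cluster_ready() {
    wcr_timeout=${1:-300s}
    kubectl wait --for=condition=Ready nodes --all --timeout="$wcr_timeout" &&
        kubectl wait --for=condition=Ready pods --all -n kube-system --timeout="$wcr_timeout"
}
```

Nodes normally stay `NotReady` until the network plugin is running, so a timeout here often points at the Flannel step rather than at `kubeadm init` itself.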
- Upgrade JupyterHub Release
If JupyterHub fails to start, upgrade the release:
helm upgrade jhub2 jupyterhub/jupyterhub -f jupyterhub-config_oct28.yaml -n jhub2
- Verify JupyterHub Pods
Check the status of JupyterHub pods:
kubectl --namespace=jhub2 get pod
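After a Helm upgrade the pods can take a moment to roll over, and `kubectl rollout status` waits for that to finish. The sketch below assumes the chart created Deployments named `hub` and `proxy`, which is the usual layout of the JupyterHub Helm chart but is not confirmed by this guide; the function name `jhub_rollout_ok` is hypothetical.

```shell
# jhub_rollout_ok [NAMESPACE]
# Wait for the hub and proxy Deployments to finish rolling out.
jhub_rollout_ok() {
    jro_ns=${1:-jhub2}
    # Assumed Deployment names; adjust if `kubectl get deploy -n <ns>` shows others.
    for jro_deploy in hub proxy; do
        kubectl rollout status "deployment/$jro_deploy" -n "$jro_ns" --timeout=120s || return 1
    done
}
```

A non-zero exit status means at least one Deployment did not become available within two minutes.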
- Delete Faulty Pods
If any pods are in an error state, delete them; pods managed by a Deployment are recreated automatically:
kubectl delete pod <pod-name> -n jhub2
- Check Services
Verify the services in the namespace:
kubectl get svc -n jhub2
Describe the services if more detail is needed (add a service name to describe a specific one):
kubectl describe svc -n jhub2
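A service can exist and still route nowhere if no ready pod backs it. The sketch below reads the endpoint addresses of a service and fails when the list is empty; the function name `svc_has_endpoints` is hypothetical, and `proxy-public` is taken from the port-forward command in this guide.

```shell
# svc_has_endpoints SERVICE NAMESPACE
# Print the pod IPs behind a service, or fail if there are none.
svc_has_endpoints() {
    she_ips=$(kubectl get endpoints "$1" -n "$2" \
        -o jsonpath='{.subsets[*].addresses[*].ip}') || return 1
    if [ -z "$she_ips" ]; then
        echo "service $1 has no ready endpoints" >&2
        return 1
    fi
    echo "service $1 -> $she_ips"
}
```

Usage: `svc_has_endpoints proxy-public jhub2`. An empty endpoint list usually means the backing pod is not ready, so the pod checks above are the next place to look.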
To access JupyterHub, use port-forwarding:
kubectl port-forward --address 0.0.0.0 -n jhub2 svc/proxy-public 8888:80
Open the following URL in your browser:
http://<your-ip>:8888
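The same reachability test can be run from a terminal while the port-forward is active. This sketch treats any 2xx or 3xx response as success; the function name `hub_is_up` is hypothetical, and the URL passed to it is whatever address you would open in the browser.

```shell
# hub_is_up URL
# Succeed when the URL answers with a 2xx or 3xx HTTP status.
hub_is_up() {
    # -w '%{http_code}' prints only the status code; curl reports 000 if it cannot connect.
    hiu_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
    case "$hiu_code" in
        2??|3??) echo "reachable (HTTP $hiu_code)" ;;
        *) echo "not reachable (HTTP $hiu_code)" >&2; return 1 ;;
    esac
}
```

Usage: `hub_is_up http://localhost:8888` on the machine that runs the port-forward. A redirect to the login page counts as reachable.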
If issues persist, view recent events in the namespace:
kubectl get events -n jhub2
Retrieve logs for a specific pod:
kubectl logs -n jhub2 <pod-name>
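For a crashing pod, the logs of the previous container instance are often more informative than the current ones. The sketch below gathers recent events, current logs, and previous logs in one pass; the function name `pod_diag` and the line limits are hypothetical.

```shell
# pod_diag NAMESPACE POD
# Collect recent events plus current and previous logs for one pod.
pod_diag() {
    pd_ns=$1
    pd_pod=$2
    echo "== last events in $pd_ns =="
    # Sorting by timestamp puts the newest events at the bottom.
    kubectl get events -n "$pd_ns" --sort-by=.lastTimestamp | tail -n 20
    echo "== current logs for $pd_pod =="
    kubectl logs -n "$pd_ns" "$pd_pod" --tail=50
    echo "== logs from the previous container instance =="
    # --previous fails if the container has never restarted.
    kubectl logs -n "$pd_ns" "$pd_pod" --previous --tail=50 2>/dev/null ||
        echo "(no previous instance)"
}
```

Usage: `pod_diag jhub2 <pod-name> > diag.txt` saves the output to a file for later comparison.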
Ensure the storage class is properly configured:
kubectl get sc
kubectl describe sc local-storage
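Storage problems usually surface as claims that never bind, which in turn keeps the hub or user pods in `Pending`. The filter below is a sketch that lists every claim whose STATUS is not `Bound`; it assumes the column layout of `kubectl get pvc --no-headers` for a single namespace, and the name `unbound_pvcs` is hypothetical.

```shell
# Read `kubectl get pvc -n <namespace> --no-headers` output on stdin and
# print every claim that is not bound to a volume.
unbound_pvcs() {
    # Columns start with NAME STATUS; later columns shift when no volume is bound.
    awk '$2 != "Bound" { print $1 " is " $2 }'
}
```

Usage: `kubectl get pvc -n jhub2 --no-headers | unbound_pvcs`. No output means all claims in the namespace are bound.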
Always replace placeholders (e.g., <pod-name>, <your-ip>) with actual values. Regularly monitor system resources (nvidia-smi, kubectl top nodes) to prevent resource exhaustion.
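The GPU part of that monitoring can be reduced to a threshold check. The sketch below parses the CSV output of `nvidia-smi --query-gpu` and reports GPUs whose memory use is at or above a limit; the function name `gpu_mem_alerts` and the default limit of 90 percent are hypothetical.

```shell
# gpu_mem_alerts [LIMIT_PERCENT]
# Read "index, memory.used, memory.total" CSV lines on stdin and print an
# alert for every GPU whose memory use reaches the limit.
gpu_mem_alerts() {
    awk -F', *' -v limit="${1:-90}" '
        $3 + 0 > 0 {
            pct = int($2 * 100 / $3)
            if (pct >= limit) print "GPU " $1 ": " pct "% memory used"
        }'
}
```

Usage: `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits | gpu_mem_alerts 90`. Note that `kubectl top nodes` only returns data when a metrics server is running in the cluster.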