This guide can be used to troubleshoot basic issues with a JARVICE deployment.
When JARVICE does not appear to be operating correctly, the status of the
JARVICE system pods is the first thing to check. This can be done with the
following kubectl command:
$ kubectl --namespace jarvice-system get pods
NAME READY STATUS RESTARTS AGE
jarvice-api-658bd7dcd-4zwbn 1/1 Running 0 15h
jarvice-api-658bd7dcd-7bqk5 1/1 Running 0 15h
jarvice-api-experimental-689974999b-2l7hl 1/1 Running 0 15h
jarvice-api-experimental-689974999b-5whpc 1/1 Running 0 15h
jarvice-appsync-654f585fbc-j45wj 1/1 Running 0 15h
jarvice-dal-54c8db44cc-65ddc 1/1 Running 0 15h
jarvice-dal-54c8db44cc-hbkkq 1/1 Running 0 15h
jarvice-db-bf4449ccc-l8chx 1/1 Running 0 43h
jarvice-mc-portal-6c84b8864-dn9dq 1/1 Running 0 15h
jarvice-mc-portal-6c84b8864-rwnw7 1/1 Running 0 15h
jarvice-mc-portal-experimental-767b57f49c-bqsgv 1/1 Running 0 15h
jarvice-mc-portal-experimental-767b57f49c-wts7t 1/1 Running 0 15h
jarvice-memcached-0 1/1 Running 0 6d17h
jarvice-memcached-1 1/1 Running 0 43h
jarvice-memcached-2 1/1 Running 0 20d
jarvice-pod-scheduler-7d45c6c9cc-j5r4b 1/1 Running 0 15h
jarvice-scheduler-5f5b9b566d-7652b 1/1 Running 0 15h
A healthy running system should look very similar to the output above. An unhealthy system may show statuses similar to this:
$ kubectl --namespace jarvice-system get pods
NAME READY STATUS RESTARTS AGE
jarvice-api-cfd76ff6f-2j8qn 0/1 ErrImagePull 0 67s
jarvice-appsync-766ddb4b-d4l9t 0/1 ErrImagePull 0 66s
jarvice-dal-679d6b44cd-dtg27 0/1 ErrImagePull 0 66s
jarvice-db-bf4449ccc-xqxp6 1/1 Running 0 42h
jarvice-mc-portal-75c949fd4d-99nnx 0/1 ErrImagePull 0 65s
jarvice-scheduler-867d454b86-rv6ct 0/1 ErrImagePull 0 64s
The sections below provide more troubleshooting information. For more detailed information on the lifecycle of kubernetes pods, visit: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
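As an additional quick check, the most recent events in the namespace can be listed to surface scheduling and image pull errors at a glance:
$ kubectl --namespace jarvice-system get events --sort-by=.lastTimestamp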
In order to get more detail on pods that are not "Running", execute the following kubectl describe command to find out what the issue may be:
$ kubectl --namespace jarvice-system describe pods jarvice-dal-679d6b44cd-dtg27
Name: jarvice-dal-679d6b44cd-dtg27
Namespace: jarvice-system
Priority: 0
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m37s default-scheduler Successfully assigned jarvice-paulbsch/jarvice-dal-679d6b44cd-dtg27 to dal1k8s-worker-08
Normal Pulling 72s (x4 over 2m32s) kubelet, dal1k8s-worker-08 Pulling image "gcr.io/jarvice-system/jarvice-dal:jarvice-master"
Warning Failed 72s (x4 over 2m32s) kubelet, dal1k8s-worker-08 Failed to pull image "gcr.io/jarvice-system/jarvice-dal:jarvice-master": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/jarvice-system/jarvice-dal/manifests/jarvice-master: unknown: Unable to parse json key.
Warning Failed 72s (x4 over 2m32s) kubelet, dal1k8s-worker-08 Error: ErrImagePull
Warning Failed 47s (x6 over 2m32s) kubelet, dal1k8s-worker-08 Error: ImagePullBackOff
Normal BackOff 35s (x7 over 2m32s) kubelet, dal1k8s-worker-08 Back-off pulling image "gcr.io/jarvice-system/jarvice-dal:jarvice-master"
The clipped describe output above shows that there is an issue pulling the JARVICE container images.
If kubernetes is unable to pull the JARVICE container images, it is either unable to contact the gcr.io container registry or the value of jarvice.imagePullSecret is not valid. Be sure that your gcr.io registry key was properly converted for use in your override.yaml values file.
Here is the command to get the appropriate string to use for jarvice.imagePullSecret:
$ echo "_json_key:$(cat file_path_to/gcr.io.json)" | base64 -w 0
A jarvice-db pod stuck in the Pending state typically occurs when persistence is enabled and the pod is unable to bind its PersistentVolumeClaim. A kubectl describe of the PersistentVolumeClaim may provide more information. That command should look similar to the following:
$ kubectl --namespace jarvice-system describe pvc jarvice-db-pvc
Please review the jarvice-helm chart documentation on using persistence with the jarvice-db deployment for more information.
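It can also help to verify that a suitable StorageClass exists and that a PersistentVolume is available for the claim:
$ kubectl --namespace jarvice-system get pvc
$ kubectl get storageclass
$ kubectl get pv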
If kubectl describe of JARVICE system pods shows that kubernetes is unable to find any nodes to place pods on due to node taints, please review the section on Node taints and pod tolerations.
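To quickly review the taints currently applied to the cluster nodes, a command similar to the following can be used:
$ kubectl describe nodes | grep -i taints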
If kubectl describe of JARVICE system pods shows that kubernetes is unable to find any nodes to place pods on due to insufficient memory or CPU resources, it will be necessary to add another node to the kubernetes cluster for jarvice-system pods. This could entail adding more physical/virtual nodes and/or modifying taints on nodes already joined to the cluster. If modifying taints on already existing nodes, please review the section on Node taints and pod tolerations.
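To gauge whether the existing nodes have enough allocatable CPU and memory left for jarvice-system pods, review the allocated resources summary from kubectl describe, for example:
$ kubectl describe nodes | grep -A 8 "Allocated resources"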
A CrashLoopBackOff status on JARVICE system pods is typically due to the jarvice-dal pods not being able to communicate with the database. If using a database outside of the one provided in the jarvice-helm chart, check that it is accessible from the kubernetes cluster. If using jarvice-db as provided in this chart, be sure that its pod is in the Running state.
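The jarvice-dal logs will typically show the database connection errors. Assuming the default deployment and pod names from the chart, the logs and the database pod status can be checked with commands similar to:
$ kubectl --namespace jarvice-system logs deployment/jarvice-dal
$ kubectl --namespace jarvice-system get pods | grep jarvice-db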
If the above issues can be ruled out, it may be necessary to disable the NetworkPolicy for the jarvice-db pod and/or globally for all jarvice-system pods. This can be done by setting jarvice_db.networkPolicy.enabled to false and/or setting jarvice.networkPolicy.enabled to false respectively.
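As a sketch, assuming the chart was installed as a helm release named jarvice from the local jarvice-helm chart directory (these values can also be set in the override.yaml values file), the network policies could be disabled with an upgrade similar to:
$ helm upgrade jarvice ./jarvice --namespace jarvice-system --reuse-values \
    --set jarvice_db.networkPolicy.enabled=false \
    --set jarvice.networkPolicy.enabled=false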
The JARVICE control plane manages job status from all attached downstream clusters (including the default in a single or multi-cluster configuration). The system is designed to reconcile job status changes periodically so that it does not rely on complex notification mechanisms or require full-time connectivity. When troubleshooting any job status issue, the log levels of the respective component(s) will need to be increased from 30 (errors and warnings only) to at least 20 (informational) or even 10 (debug, very verbose) in some cases. See each section below for details.
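For example, assuming a helm release named jarvice installed from the local jarvice-helm chart directory, a component's log level could be raised with a command similar to the following (or by setting the value in the override.yaml values file and upgrading):
$ helm upgrade jarvice ./jarvice --namespace jarvice-system --reuse-values \
    --set jarvice.JARVICE_SCHED_PASS_LOGLEVEL=20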
On Kubernetes clusters, job pods must be "bound" to compute nodes before processing can begin. jarvice-pod-scheduler runs downstream and on the default cluster and attempts to "best-fit" pods onto nodes, packing where possible, and ensuring jobs with multiple pods are "gang-scheduled" on like nodes. Setting this component to log level 20 via the jarvice.JARVICE_POD_SCHED_LOGLEVEL setting is the minimum recommended to see a summary of why nodes are being skipped. To see precise (but highly verbose) details for each node evaluated, including why it is disqualified from scheduling a particular job, set the log level to 10 (debug).
Common reasons for job pods not being bound include, but are not limited to:
- Insufficient resources (the most common).
- Lack of "like" nodes when trying to schedule a multi-node job - nodes are weighed and grouped by weight.
- Improper taints or node label mismatches.
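After raising the pod scheduler log level, its output can be followed to see each scheduling pass and why nodes are skipped (assuming the default jarvice-pod-scheduler deployment name):
$ kubectl --namespace jarvice-system logs --follow deployment/jarvice-pod-scheduler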
jarvice-sched-pass (upstream control plane only) is responsible for all job status changes. Setting this component to log level 20 via the jarvice.JARVICE_SCHED_PASS_LOGLEVEL setting is generally sufficient to see why a job's status is not changing. Most errors are well spelled out.
Common reasons for job status not changing include, but are not limited to:
- Unable to communicate with downstream components - in this case, check the connection from upstream to downstream including any Ingress controllers downstream.
- Failures (not timeouts) when communicating with downstream components - in this case, examine jarvice-k8s-scheduler downstream (on Kubernetes clusters) by putting it in log level 20 (or 10 for additional verbosity) via the jarvice.JARVICE_K8S_SCHED_LOGLEVEL setting.
- Job pods not binding to nodes (on Kubernetes) - see Job pods remain in Pending state (Kubernetes)
- Pass "budget" insufficient and not all jobs can be examined - see Advanced: Scheduler Performance Tuning in the Resource Planning and Scaling Guide
When scheduling passes take too long or not all jobs are examined in each pass, this generally relates to capacity and scaling parameter issues. Setting jarvice-sched-pass to log level 20 via the jarvice.JARVICE_SCHED_PASS_LOGLEVEL parameter is recommended to determine why.
Generally, see Advanced: Scheduler Performance Tuning in the Resource Planning and Scaling Guide for information on how to overcome this issue.