This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

prometheus-server CrashLoopBackOff #15742

Closed
ratnakarreddyg opened this issue Jul 22, 2019 · 18 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@ratnakarreddyg

Describe the bug
After installing the stable/prometheus chart, the prometheus-server pod never becomes ready and stays in CrashLoopBackOff.

Version of Helm and Kubernetes:
Helm Version: v2.14.2
Kubernetes Version: v1.15.0

Which chart:
stable/prometheus

What happened:
The pod named prometheus-server-66fbdff99b-z4vbj is always in the CrashLoopBackOff state.

What you expected to happen:
The prometheus-server pod is expected to start and stay in the Running state.

How to reproduce it (as minimally and precisely as possible):
helm install stable/prometheus --name prometheus --namespace prometheus --set server.global.scrape_interval=5s,server.global.evaluation_interval=5s

Anything else we need to know:

@marinakog

I have the same issue. Here is the describe output of the pod:
Name: prometheus-server-55479c9d54-6gh9t
Namespace: monitoring
Priority: 0
PriorityClassName: <none>
Node: phx3187268/100.111.143.19
Start Time: Tue, 30 Jul 2019 14:38:10 -0700
Labels: app=prometheus
chart=prometheus-8.15.0
component=server
heritage=Tiller
pod-template-hash=1103575810
release=prometheus
Annotations: <none>
Status: Running
IP: 192.168.0.30
Controlled By: ReplicaSet/prometheus-server-55479c9d54
Containers:
prometheus-server-configmap-reload:
Container ID: docker://405fd0c96cb567d3182a7e6d2baa1d6ff5c7ae062fe79f7f3b8ceebc3032ec46
Image: jimmidyson/configmap-reload:v0.2.2
Image ID: docker-pullable://jimmidyson/configmap-reload@sha256:befec9f23d2a9da86a298d448cc9140f56a457362a7d9eecddba192db1ab489e
Port: <none>
Host Port: <none>
Args:
--volume-dir=/etc/config
--webhook-url=http://127.0.0.1:9090/-/reload
State: Running
Started: Tue, 30 Jul 2019 14:38:25 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/etc/config from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-kb7pt (ro)
prometheus-server:
Container ID: docker://d5d45806e69bda9abfad75a6210d03ad7d6e9ecbc292de51af56440fc95cf162
Image: prom/prometheus:v2.11.1
Image ID: docker-pullable://prom/prometheus@sha256:8f34c18cf2ccaf21e361afd18e92da2602d0fa23a8917f759f906219242d8572
Port: 9090/TCP
Host Port: 0/TCP
Args:
--storage.tsdb.retention.time=15d
--config.file=/etc/config/prometheus.yml
--storage.tsdb.path=/data
--web.console.libraries=/etc/prometheus/console_libraries
--web.console.templates=/etc/prometheus/consoles
--web.enable-lifecycle
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 30 Jul 2019 14:38:52 -0700
Finished: Tue, 30 Jul 2019 14:38:52 -0700
Ready: False
Restart Count: 2
Liveness: http-get http://:9090/-/healthy delay=30s timeout=30s period=10s #success=1 #failure=3
Readiness: http-get http://:9090/-/ready delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/data from storage-volume (rw)
/etc/config from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-kb7pt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-server
Optional: false
storage-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pvc-prometheus
ReadOnly: false
prometheus-server-token-kb7pt:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-server-token-kb7pt
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message


Normal Scheduled 47s default-scheduler Successfully assigned monitoring/prometheus-server-55479c9d54-6gh9t to phx3187268
Normal Pulling 44s kubelet, phx3187268 pulling image "jimmidyson/configmap-reload:v0.2.2"
Normal Pulled 32s kubelet, phx3187268 Successfully pulled image "jimmidyson/configmap-reload:v0.2.2"
Normal Created 32s kubelet, phx3187268 Created container
Normal Started 32s kubelet, phx3187268 Started container
Normal Pulling 32s kubelet, phx3187268 pulling image "prom/prometheus:v2.11.1"
Normal Pulled 26s kubelet, phx3187268 Successfully pulled image "prom/prometheus:v2.11.1"
Warning BackOff 20s (x3 over 23s) kubelet, phx3187268 Back-off restarting failed container
Normal Created 5s (x3 over 25s) kubelet, phx3187268 Created container
Normal Started 5s (x3 over 25s) kubelet, phx3187268 Started container
Normal Pulled 5s (x2 over 24s) kubelet, phx3187268 Container image "prom/prometheus:v2.11.1" already present on machine
Warning DNSConfigForming 4s (x8 over 45s) kubelet, phx3187268 Search Line limits were exceeded, some search paths have been omitted, the applied search line is: monitoring.svc.cluster.local svc.cluster.local cluster.local devweblogicphx.oraclevcn.com subnet3ad3phx.devweblogicphx.oraclevcn.com us.oracle.com

@taylorfturner

I'm seeing very similar behavior on my dask-scheduler pod when deploying stable/dask; I opened a ticket for it: #15979

@marinakog

Here is the log:
[opc@marina-kogan-sandbox prometheus]$ kubectl -n monitoring logs prometheus-server-5bc5568444-5s8bk -c prometheus-server
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.1, branch=HEAD, revision=e5b22494857deca4b806f74f6e3a6ee30c251763)"
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:330 build_context="(go=go1.12.7, user=root@d94406f2bb6f, date=20190710-13:51:17)"
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:331 host_details="(Linux 4.14.35-1902.2.0.el7uek.x86_64 #2 SMP Fri Jun 14 21:15:44 PDT 2019 x86_64 prometheus-server-5bc5568444-5s8bk (none))"
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-07-31T18:24:39.386Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-07-31T18:24:39.387Z caller=main.go:652 msg="Starting TSDB ..."
level=info ts=2019-07-31T18:24:39.387Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:521 msg="Stopping scrape discovery manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:535 msg="Stopping notify discovery manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:557 msg="Stopping scrape manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:531 msg="Notify discovery manager stopped"
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:517 msg="Scrape discovery manager stopped"
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:551 msg="Scrape manager stopped"
level=info ts=2019-07-31T18:24:39.388Z caller=manager.go:776 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=manager.go:782 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-07-31T18:24:39.388Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
level=info ts=2019-07-31T18:24:39.388Z caller=main.go:722 msg="Notifier manager stopped"
level=error ts=2019-07-31T18:24:39.391Z caller=main.go:731 err="opening storage failed: lock DB directory: open /data/lock: permission denied"
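
The error above points at ownership of the mounted persistent volume rather than at the chart's configuration. A quick way to confirm this, sketched below under the assumption that the claim is the pvc-prometheus PVC from the describe output earlier (the pod spec itself is illustrative, not from this thread), is to mount the same PVC in a throwaway pod and inspect /data:

    apiVersion: v1
    kind: Pod
    metadata:
      name: pvc-inspect                     # hypothetical throwaway pod, delete it afterwards
      namespace: monitoring
    spec:
      restartPolicy: Never
      containers:
        - name: shell
          image: busybox:1.32
          command: ["sh", "-c", "ls -ldn /data && id"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pvc-prometheus       # claim name taken from the describe output above

If /data turns out to be owned by root and not writable by UID 65534 (the non-root user the chart runs Prometheus as), the process cannot create /data/lock, which matches the error reported above. For a ReadWriteOnce claim you may need to scale the prometheus-server deployment to zero first so the volume can be attached to the inspection pod.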

@stale

stale bot commented Aug 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2019
@nightcatnl

I'm having the same issue.

Events:
Type Reason Age From Message


Normal Scheduled 7m21s default-scheduler Successfully assigned monitoring/prometheus-server-75959db9-5v6dm to docker02
Normal Pulled 7m20s kubelet, docker02 Container image "jimmidyson/configmap-reload:v0.2.2" already present on machine
Normal Created 7m20s kubelet, docker02 Created container prometheus-server-configmap-reload
Normal Started 7m20s kubelet, docker02 Started container prometheus-server-configmap-reload
Normal Pulled 6m30s (x4 over 7m20s) kubelet, docker02 Container image "prom/prometheus:v2.11.1" already present on machine
Normal Created 6m30s (x4 over 7m20s) kubelet, docker02 Created container prometheus-server
Normal Started 6m30s (x4 over 7m20s) kubelet, docker02 Started container prometheus-server
Warning BackOff 2m19s (x27 over 7m19s) kubelet, docker02 Back-off restarting failed container

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 4, 2019
@elliotpryde

Also seeing this when simply running helm install stable/prometheus.

Helm Version: v2.14.3
Kubernetes Version: v1.14.6

level=info ts=2019-09-06T21:03:04.361Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.1, branch=HEAD, revision=e5b22494857deca4b806f74f6e3a6ee30c251763)"
level=info ts=2019-09-06T21:03:04.361Z caller=main.go:330 build_context="(go=go1.12.7, user=root@d94406f2bb6f, date=20190710-13:51:17)"
level=info ts=2019-09-06T21:03:04.361Z caller=main.go:331 host_details="(Linux 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 x86_64 kissing-warthog-prometheus-server-b94c6d879-n8jj9 (none))"
level=info ts=2019-09-06T21:03:04.361Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-09-06T21:03:04.361Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-09-06T21:03:04.362Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-09-06T21:03:04.362Z caller=main.go:652 msg="Starting TSDB ..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:521 msg="Stopping scrape discovery manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:535 msg="Stopping notify discovery manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:557 msg="Stopping scrape manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:531 msg="Notify discovery manager stopped"
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:517 msg="Scrape discovery manager stopped"
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:551 msg="Scrape manager stopped"
level=info ts=2019-09-06T21:03:04.363Z caller=manager.go:776 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=manager.go:782 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-09-06T21:03:04.363Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
level=info ts=2019-09-06T21:03:04.363Z caller=main.go:722 msg="Notifier manager stopped"
level=error ts=2019-09-06T21:03:04.364Z caller=main.go:731 err="opening storage failed: lock DB directory: open /data/lock: permission denied"

@ivanstreams

I'm seeing the same problem. Is there a solution or workaround?

@tjg184
Contributor

tjg184 commented Oct 17, 2019

Same problem here using helm install stable/prometheus.

caller=main.go:731 err="opening storage failed: lock DB directory: open /data/lock: permission denied"

@ivanstreams

I tried setting server.skipTSDBLock=true. It bypasses that step, but it fails at the next one:

main.go:731 err="opening storage failed: create dir: mkdir /data/wal: permission denied"

Then I tried server.persistentVolume.mountPath=/tmp as a test, and it also fails:

main.go:731 err="opening storage failed: create dir: mkdir /tmp/wal: permission denied"
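
For reference, the two attempts above would have looked roughly like this (the release name prometheus is an assumption; the flag names are the ones quoted in the comment):

    # First attempt: skip the TSDB lock file
    helm upgrade prometheus stable/prometheus --reuse-values \
      --set server.skipTSDBLock=true

    # Second attempt: move the server's data mount to /tmp
    helm upgrade prometheus stable/prometheus --reuse-values \
      --set server.persistentVolume.mountPath=/tmp

Both fail for the same underlying reason: wherever the persistent volume is mounted, the Prometheus process (running as UID 65534 by default) cannot write to it, so skipping the lock file only moves the permission error to the next write (/data/wal or /tmp/wal).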

@KshamaG

KshamaG commented Oct 21, 2019

I was seeing the same error. I was able to resolve the issue by applying the workaround given here.
Note: replace prometheus-alertmanager with prometheus-server in the workaround steps.
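
The linked workaround is not quoted in this thread, so the snippet below is only a sketch of one commonly used approach (my assumption, not confirmed by the comment): an initContainer added to the prometheus-server deployment that chowns the data volume to the chart's default user 65534 before Prometheus starts. The container name is hypothetical; the volume name storage-volume is taken from the describe output above.

      initContainers:
        - name: fix-data-permissions       # hypothetical name, for illustration only
          image: busybox:1.32
          command: ["sh", "-c", "chown -R 65534:65534 /data"]
          securityContext:
            runAsUser: 0                   # must run as root to change ownership
          volumeMounts:
            - name: storage-volume         # volume name used by the chart's deployment
              mountPath: /data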

@guswns531

guswns531 commented Oct 22, 2019

I solved this problem in the following way:

kubectl edit deploy prometheus-server -n prometheus

from

      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534

to

  securityContext:
    fsGroup: 0
    runAsGroup: 0
    runAsUser: 0

Honestly, I am not sure this change won't cause other problems, but it works now.
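
The same change can be applied without an interactive edit; here is a one-line sketch, assuming the deployment name and namespace used in the comment above:

    kubectl -n prometheus patch deploy prometheus-server --type=json -p='[
      {"op": "replace",
       "path": "/spec/template/spec/securityContext",
       "value": {"fsGroup": 0, "runAsGroup": 0, "runAsUser": 0}}
    ]'

Keep in mind that manual edits or patches to a Helm-managed deployment may be overwritten the next time the release is upgraded, so setting this through the chart's values is the more durable route.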

@mram0509

I am having the same issue, and changing securityContext does not fix it.
Has anybody found a workaround for this?

@stale

stale bot commented Dec 16, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 16, 2019
@stale

stale bot commented Dec 30, 2019

This issue is being automatically closed due to inactivity.

@stale stale bot closed this as completed Dec 30, 2019
@ZacHaque

ZacHaque commented Jun 25, 2020

Nice, it worked after setting:

      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0

@uweeby

uweeby commented Jul 31, 2020

This fixed my issue as well:

      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsUser: 0

So is there some other part of the setup/config that is expected to be done ahead of time that's missing?

@smakintel

smakintel commented Jun 14, 2021

Worked for me (using Rancher; I edited the prometheus-server deployment YAML).

      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsNonRoot: false
        runAsUser: 0

@tutstechnology

tutstechnology commented Oct 14, 2021

That worked for me. Thank you very much.

  securityContext:
    fsGroup: 0
    runAsGroup: 0
    runAsNonRoot: false
    runAsUser: 0

I did the installation via Helm, edited the values file, and put in these values.
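
A sketch of what that values override might look like; the server.securityContext key is my assumption based on the stable/prometheus chart's layout, so verify it against the chart's values.yaml before relying on it:

    # custom-values.yaml (file name is illustrative)
    server:
      securityContext:
        fsGroup: 0
        runAsGroup: 0
        runAsNonRoot: false
        runAsUser: 0

Installed with the same Helm 2 style command used earlier in the thread:

    helm install stable/prometheus --name prometheus --namespace prometheus -f custom-values.yaml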
