
Monitor-Telegraf pod is in CrashLoopBackOff state on master node with K8s 1.4.0 #146

Closed
felipejfc opened this issue Oct 2, 2016 · 38 comments

@felipejfc

Containers:
  deis-monitor-telegraf:
    Container ID:   docker://49183ac8c79d76792489bdc4314eae09bca2dddecb49e81f7a2be533295c7238
    Image:      quay.io/deis/telegraf:v2.4.0
    Image ID:       docker://sha256:90156d3ebc440f6b017dae901da5e096e5e92291ab2f2a345516d7416315236a
    Port:
    State:      Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    7
      Started:      Sun, 02 Oct 2016 15:35:26 -0300
      Finished:     Sun, 02 Oct 2016 15:35:29 -0300
    Ready:      False
    Restart Count:  7

It used to run fine on my 1.3.5 cluster, but now the pod scheduled on the master is in CrashLoopBackOff for some reason; the ones scheduled on minions are normal.
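Worth noting (it only comes up later in the thread): exit code 7 matches curl's "couldn't connect" error, and the container's entrypoint is a curl-based start script, so the describe output above already hints at a connection failure rather than a broken image. A quick local way to see that exit code, hitting a port that refuses connections:

$ curl -s http://127.0.0.1:1/ ; echo "exit=$?"
exit=7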

@jchauncey
Member

What do the pod logs say?


@felipejfc
Author

Nothing helpful...

$ kube-stag logs deis-monitor-telegraf-hsrw3 --namespace deis
Creating topic with URL: http://100.70.57.61:4151/topic/create?topic=metrics
$ kube-stag logs -p deis-monitor-telegraf-hsrw3 --namespace deis
Creating topic with URL: http://100.70.57.61:4151/topic/create?topic=metrics

@jchauncey
Member

You should have a lot more output than that. That means something is wrong with the image. I did an install last night on a 1.4.0 cluster with 2.6.0 and everything came up fine.

@jchauncey
Member

Did you change anything in the chart configuration? Or is this a stock install? Have you tried deleting the pods and recreating them using the daemonset file in the manifest directory of the chart?
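Roughly like this; the manifest filename below is a placeholder rather than the chart's actual file name, and the daemonset name is inferred from the pod names in this thread:

$ kubectl --namespace deis delete daemonset deis-monitor-telegraf
$ kubectl --namespace deis create -f <workflow-chart>/manifests/<telegraf-daemonset>.yaml
# then watch where the new pods land
$ kubectl --namespace deis get pods -o wide | grep telegraf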

@felipejfc
Author

It is a stock install... I tried deleting the pod (note that not all of them are in crashloop, only the one on the master node).

It seems to be restarting about every 2 minutes.

@felipejfc
Author

Just upgraded Deis to 2.6; monitor-telegraf was bumped to version 2.5.1 and the problem is still happening... :/

@jchauncey
Member

Telegraf running on the master node is new behavior, I think. I noticed too that on my 1.4 cluster the master shows up in the list when you do kubectl get nodes; not sure why they are doing that.

What OS are you using?

@felipejfc
Author

Debian with kernel 4.4

@mboersma mboersma added the bug label Oct 3, 2016
@WillPlatnick

👍 same issue, deployed 1.4 via kops

@shulcsm

shulcsm commented Oct 4, 2016

Same issue, fresh install.

@jchauncey
Member

@shulcsm kops too? @felipejfc, are you also using kops?

@shulcsm

shulcsm commented Oct 4, 2016

kubeadm on Ubuntu 16.04

@jchauncey
Member

OK, there is a hunch going around that this may be related to some Kubernetes 1.4 work where they made the add-ons daemonsets (which is how we deploy Telegraf). This is why we are seeing Telegraf get scheduled onto the master node. I am working on a way to restrict that from happening.
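One workaround sketch (mine, not the fix that eventually shipped): kops labels the master kubernetes.io/role=master (visible in the kubelet flags later in this thread) and, as far as I know, the workers kubernetes.io/role=node, so the daemonset template could get a nodeSelector that only matches workers. The daemonset name is inferred from the pod names in this thread.

$ kubectl --namespace deis patch daemonset deis-monitor-telegraf \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/role":"node"}}}}}'
# on 1.4 the daemonset controller does not replace existing pods after a
# template change, so clean up the master's stuck pod by hand
$ kubectl --namespace deis delete pod deis-monitor-telegraf-hsrw3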

@felipejfc
Author

@jchauncey yes! I do use kops

@jchauncey
Member

This is related to this issue - kubernetes/kubernetes#29178

@jchauncey
Member

I'm not entirely sure how to solve this problem yet, considering that 1.4 doesn't have a label to tell a daemonset not to schedule there. I'll keep thinking about other ways to solve this. But I would still like to know why Telegraf is crashlooping.

@shulcsm

shulcsm commented Oct 4, 2016

I tainted the master (my cluster consists of one node) and everything is running now.

@jchauncey
Member

@shulcsm what taint did you apply?

@shulcsm

shulcsm commented Oct 4, 2016

kubectl taint nodes --all dedicated-
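To spell out what that does (my reading of the taint syntax, not stated in the thread): the trailing dash removes the taint whose key is dedicated from every node rather than adding one. As I recall, kubeadm 1.4 taints the master along the lines of dedicated=master:NoSchedule, so clearing it lets ordinary workloads land on the master of a single-node cluster.

$ kubectl taint nodes --all dedicated-
# "-" suffix = remove the "dedicated" taint; without it the same command
# would be adding a taint using key=value:Effect syntax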

@jchauncey
Member

@felipejfc and @WillPlatnick, if you two can check whether the above command fixes your issue, that would be great.

@felipejfc
Author

Well, what would be the implications of tainting my master with dedicated-?

@jchauncey
Member

Afaik it should make it so nothing runs on it


@jchauncey
Member

@felipejfc is it possible for you to SSH into your master node, look at the kubelet configuration, and see if you can find the value of the following flag: --pod-cidr=

Trying to see if we are also being affected by this problem - kubernetes/kops#204

@felipejfc
Author

@jchauncey


admin@ip-172-21-124-39:~$ ps aux | grep kubelet
root       843  1.9  1.0 448284 89452 ?        Ssl  Oct02 110:46 /usr/local/bin/kubelet --allow-privileged=true --api-servers=http://127.0.0.1:8080 --babysit-daemons=true --cgroup-root=docker --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --config=/etc/kubernetes/manifests --configure-cbr0=true --enable-debugging-handlers=true --hostname-override=ip-172-21-124-39.ec2.internal --network-plugin-mtu=9001 --network-plugin=kubenet --node-labels=kubernetes.io/role=master --non-masquerade-cidr=100.64.0.0/10 --pod-cidr=10.123.45.0/29 --reconcile-cidr=true --register-schedulable=false --v=2

@jchauncey
Member

--pod-cidr=10.123.45.0/29 does not provide enough IPs for the number of pods we are trying to run on the master. It should probably be upped to a /28.
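Back-of-the-envelope numbers (mine, not from the thread): a /29 is 2^(32-29) = 8 addresses, and once the network and broadcast addresses plus the cbr0 gateway that kubenet claims are gone, only about 5 pod IPs remain. A /28 is 16 addresses, roughly 13 usable.

$ echo $((2**(32-29))) $((2**(32-28)))
8 16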

@WillPlatnick

Spoke to kops maintainer @justinsb. He's going to put in a PR to kops to raise it, but he requests that Deis put in a PR for this to kube-up too so a discussion can be had there.

@bacongobbler
Member

bacongobbler commented Oct 10, 2016

Just to confirm it's a v1.4.0 issue, can you try running this on Kubernetes v1.3.8? From what I'm reading in kubernetes/kubernetes, kube-up with GCE uses /30 as the pod CIDR on v1.3.8 and /29 on v1.4.0. Not sure if that's what is making a difference here, but kubernetes/kubernetes#32886 is the PR in question. Just thought I'd report on what upstream's pod CIDR ranges are.

@WillPlatnick

kops merged in a default /28 for us. I updated the cluster and verified kubelet is running with a /28, and the issue is still occurring. Nothing in the logs other than the "Creating topic" line.

@jchauncey
Member

OK, let me think of some other things that might help us debug this problem.

@felixbuenemann

The deis monitor was already a daemon set before 1.4.x and it runs on all nodes on 1.3.x as well.

@felipejfc
Author

Yes, but on 1.3.0 it was not getting stuck in a crash loop restart.

@felixbuenemann

felixbuenemann commented Oct 25, 2016

It is working for me on 1.4.3/1.4.4 on CoreOS beta with a podCIDR of 10.2.0.0/16, and on 1.3.8/1.3.9 with the same podCIDR on CoreOS stable.

If the container crashes and the only log message is "Creating topic with URL …", then the curl request must be failing. So my guess would be a connectivity issue to nsqd. A modified deis-monitor-telegraf image that uses "curl -v -s" would be helpful to see what's going on.

See https://github.com/deis/monitor/blob/master/telegraf/rootfs/start-telegraf#L17
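A quicker check in the same spirit, without rebuilding the image (the URL and image tag are taken from earlier in this thread; <worker-telegraf-pod> and <master-node-name> are placeholders): run the topic-create request from a healthy worker pod, then from a one-off pod pinned to the master, and compare the verbose output.

$ kubectl --namespace deis exec <worker-telegraf-pod> -- \
    curl -v -s "http://100.70.57.61:4151/topic/create?topic=metrics"
# the crashing pod on the master only lives a few seconds, so pin a throwaway
# pod there via nodeName (bypasses the scheduler, like a daemonset does)
$ cat <<EOF | kubectl --namespace deis create -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl-test
spec:
  nodeName: <master-node-name>
  restartPolicy: Never
  containers:
  - name: curl
    image: quay.io/deis/telegraf:v2.4.0
    command: ["curl", "-v", "-s", "http://100.70.57.61:4151/topic/create?topic=metrics"]
EOF
$ kubectl --namespace deis logs curl-test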

@felixbuenemann

I've done some debugging with @WillPlatnick and it seems connectivity from pods on the controller to the service network is not working, while it works on the workers. This seems to be specific to kops.

@jchauncey
Member

Is there any way to get enough debug information so we can open an issue with kops?

@felixbuenemann

I think @WillPlatnick is already working on opening an issue with kops.

@WillPlatnick

The base issue is a Kubernetes one, apparently. They tried to fix it yesterday, but it didn't go too well and had to be reverted.

kubernetes/kubernetes#35526 is the active PR to fix this. Hopefully it will be in 1.5.

@justinsb

I think the problem is specific to configurations where the master is registered as a node, when running kubenet. Hopefully we'll get it fixed upstream.

@bacongobbler
Member

kubernetes/kubernetes#35526 has since been merged and is available upstream in k8s v1.5.0+. Closing.
