
Monitor-Telegraf pod is in CrashLoopBackOff state on master node with K8s 1.4.0 #146

Closed
felipejfc opened this issue Oct 2, 2016 · 38 comments

@felipejfc

Containers:
  deis-monitor-telegraf:
    Container ID:   docker://49183ac8c79d76792489bdc4314eae09bca2dddecb49e81f7a2be533295c7238
    Image:      quay.io/deis/telegraf:v2.4.0
    Image ID:       docker://sha256:90156d3ebc440f6b017dae901da5e096e5e92291ab2f2a345516d7416315236a
    Port:
    State:      Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    7
      Started:      Sun, 02 Oct 2016 15:35:26 -0300
      Finished:     Sun, 02 Oct 2016 15:35:29 -0300
    Ready:      False
    Restart Count:  7

It used to run fine on my 1.3.5 cluster, but now the pod scheduled on the master is in CrashLoopBackOff for some reason; the ones scheduled on minions are normal.
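Worth noting (it only comes up later in the thread): exit code 7 matches curl's "couldn't connect" error, and the container's entrypoint is a curl-based start script, so the describe output above already hints at a connection failure rather than a broken image. A quick local way to see that exit code, hitting a port that refuses connections:

$ curl -s http://127.0.0.1:1/ ; echo "exit=$?"
exit=7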

@jchauncey
Member

What do the pod logs say?


@felipejfc
Author

Nothing helpful...

$ kube-stag logs deis-monitor-telegraf-hsrw3 --namespace deis
Creating topic with URL: http://100.70.57.61:4151/topic/create?topic=metrics
$ kube-stag logs -p deis-monitor-telegraf-hsrw3 --namespace deis
Creating topic with URL: http://100.70.57.61:4151/topic/create?topic=metrics

@jchauncey
Member

You should have a lot more output than that. That means something is wrong with the image. I did an install last night on a 1.4.0 cluster with 2.6.0 and everything came up fine.

@jchauncey
Member

Did you change anything in the chart configuration? Or is this a stock install? Have you tried deleting the pods and recreating them using the daemonset file in the manifest directory of the chart?
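Roughly like this; the manifest filename below is a placeholder rather than the chart's actual file name, and the daemonset name is inferred from the pod names in this thread:

$ kubectl --namespace deis delete daemonset deis-monitor-telegraf
$ kubectl --namespace deis create -f <workflow-chart>/manifests/<telegraf-daemonset>.yaml
# then watch where the new pods land
$ kubectl --namespace deis get pods -o wide | grep telegraf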

@felipejfc
Author

It is a stock install... I tried deleting the pod (note that not all of them are in crashloop, only the one on the master node).

It seems to be restarting about every 2 minutes.

@felipejfc
Author

Just upgraded Deis to 2.6; monitor-telegraf was bumped to version 2.5.1 and the problem is still happening... :/

@jchauncey
Member

Telegraf running on the master node is new behavior, I think. I noticed too that on my 1.4 cluster the master shows up in the list when you do kubectl get nodes; not sure why they are doing that.

What OS are you using?

@felipejfc
Author

Debian with kernel 4.4

@mboersma mboersma added the bug label Oct 3, 2016
@WillPlatnick

👍 same issue, deployed 1.4 via kops

@shulcsm

shulcsm commented Oct 4, 2016

Same issue, fresh install.

@jchauncey
Member

@shulcsm kops too? @felipejfc, are you also using kops?

@shulcsm

shulcsm commented Oct 4, 2016

kubeadm on Ubuntu 16.04

@jchauncey
Member

OK, there is a hunch going around that this may be related to some Kubernetes 1.4 work where they made the add-ons daemonsets (which is how we deploy Telegraf). This is why we are seeing Telegraf get scheduled onto the master node. I am working on a way to restrict that from happening.
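One workaround sketch (mine, not the fix that eventually shipped): kops labels the master kubernetes.io/role=master (visible in the kubelet flags later in this thread) and, as far as I know, the workers kubernetes.io/role=node, so the daemonset template could get a nodeSelector that only matches workers. The daemonset name is inferred from the pod names in this thread.

$ kubectl --namespace deis patch daemonset deis-monitor-telegraf \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/role":"node"}}}}}'
# on 1.4 the daemonset controller does not replace existing pods after a
# template change, so clean up the master's stuck pod by hand
$ kubectl --namespace deis delete pod deis-monitor-telegraf-hsrw3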

@felipejfc
Author

@jchauncey yes! I do use kops

@jchauncey
Member

This is related to this issue - kubernetes/kubernetes#29178

@jchauncey
Member

I'm not entirely sure how to solve this problem yet, considering that 1.4 doesn't have a label to tell a daemonset not to schedule there. I'll keep thinking about other ways to solve this. But I would still like to know why Telegraf is crashlooping.

@shulcsm

shulcsm commented Oct 4, 2016

I tainted the master (my cluster consists of one node) and everything is running now.

@jchauncey
Member

@shulcsm what taint did you apply?

@shulcsm

shulcsm commented Oct 4, 2016

kubectl taint nodes --all dedicated-
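To spell out what that does (my reading of the taint syntax, not stated in the thread): the trailing dash removes the taint whose key is dedicated from every node rather than adding one. As I recall, kubeadm 1.4 taints the master along the lines of dedicated=master:NoSchedule, so clearing it lets ordinary workloads land on the master of a single-node cluster.

$ kubectl taint nodes --all dedicated-
# "-" suffix = remove the "dedicated" taint; without it the same command
# would be adding a taint using key=value:Effect syntax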

@jchauncey
Member

@felipejfc and @WillPlatnick, if you two can check whether the above command fixes your issue, that would be great.

@felipejfc
Author

Well, what would be the implications of tainting my master with dedicated-?

@jchauncey
Member

Afaik it should make it so nothing runs on it


@jchauncey
Member

@felipejfc is it possible for you to SSH into your master node, look at the kubelet configuration, and see if you can find the value of the following flag: --pod-cidr=

Trying to see if we are also being affected by this problem - kubernetes/kops#204

@felipejfc
Author

@jchauncey


admin@ip-172-21-124-39:~$ ps aux | grep kubelet
root       843  1.9  1.0 448284 89452 ?        Ssl  Oct02 110:46 /usr/local/bin/kubelet --allow-privileged=true --api-servers=http://127.0.0.1:8080 --babysit-daemons=true --cgroup-root=docker --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --config=/etc/kubernetes/manifests --configure-cbr0=true --enable-debugging-handlers=true --hostname-override=ip-172-21-124-39.ec2.internal --network-plugin-mtu=9001 --network-plugin=kubenet --node-labels=kubernetes.io/role=master --non-masquerade-cidr=100.64.0.0/10 --pod-cidr=10.123.45.0/29 --reconcile-cidr=true --register-schedulable=false --v=2

@jchauncey
Member

--pod-cidr=10.123.45.0/29 does not provide enough IPs for the number of pods we are trying to run on the master. It should probably be upped to a /28.
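Back-of-the-envelope numbers (mine, not from the thread): a /29 is 2^(32-29) = 8 addresses, and once the network and broadcast addresses plus the cbr0 gateway that kubenet claims are gone, only about 5 pod IPs remain. A /28 is 16 addresses, roughly 13 usable.

$ echo $((2**(32-29))) $((2**(32-28)))
8 16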

@WillPlatnick

Spoke to kops maintainer @justinsb. He's going to put in a PR to kops to raise it, but he requests that Deis put in a PR for this to kube-up too so a discussion can be had there.

@bacongobbler
Member

bacongobbler commented Oct 10, 2016

Just to confirm it's a v1.4.0 issue, can you try running this on Kubernetes v1.3.8? From what I'm reading in kubernetes/kubernetes, kube-up with GCE uses /30 as the pod CIDR on v1.3.8 and /29 on v1.4.0. Not sure if that's what is making a difference here, but kubernetes/kubernetes#32886 is the PR in question. Just thought I'd report on what upstream's pod CIDR ranges are.

@WillPlatnick

kops merged in a default /28 for us. I updated the cluster and verified kubelet is running with a /28, and the issue is still occurring. Nothing in the logs other than the "Creating topic" line.

@jchauncey
Member

OK, let me think of some other things that might help us debug this problem.

@felixbuenemann

The deis monitor was already a daemon set before 1.4.x and it runs on all nodes on 1.3.x as well.

@felipejfc
Author

Yes, but on 1.3.0 it was not getting stuck in a crash loop restart.

@felixbuenemann

felixbuenemann commented Oct 25, 2016

It is working for me on 1.4.3/1.4.4 on CoreOS beta with a podCIDR of 10.2.0.0/16, and on 1.3.8/1.3.9 with the same podCIDR on CoreOS stable.

If the container crashes and the only log message is "Creating topic with URL …", then the curl request must be failing. So my guess would be a connectivity issue to nsqd. A modified deis-monitor-telegraf image that uses "curl -v -s" would be helpful to see what's going on.

See https://github.com/deis/monitor/blob/master/telegraf/rootfs/start-telegraf#L17
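A quicker check in the same spirit, without rebuilding the image (the URL and image tag are taken from earlier in this thread; <worker-telegraf-pod> and <master-node-name> are placeholders): run the topic-create request from a healthy worker pod, then from a one-off pod pinned to the master, and compare the verbose output.

$ kubectl --namespace deis exec <worker-telegraf-pod> -- \
    curl -v -s "http://100.70.57.61:4151/topic/create?topic=metrics"
# the crashing pod on the master only lives a few seconds, so pin a throwaway
# pod there via nodeName (bypasses the scheduler, like a daemonset does)
$ cat <<EOF | kubectl --namespace deis create -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl-test
spec:
  nodeName: <master-node-name>
  restartPolicy: Never
  containers:
  - name: curl
    image: quay.io/deis/telegraf:v2.4.0
    command: ["curl", "-v", "-s", "http://100.70.57.61:4151/topic/create?topic=metrics"]
EOF
$ kubectl --namespace deis logs curl-test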

@felixbuenemann

I've done some debugging with @WillPlatnick and it seems connectivity from pods on the controller to the service network is not working, while it works on the workers. This seems to be specific to kops.

@jchauncey
Member

Is there any way to get enough debug information so we can open an issue with kops?

@felixbuenemann

I think @WillPlatnick is already working on opening an issue with kops.

@WillPlatnick

The base issue is a Kubernetes one, apparently. They tried to fix it yesterday, but it didn't go too well and had to be reverted.

kubernetes/kubernetes#35526 is the active PR to fix this. Hopefully it will be in 1.5.

@justinsb

I think the problem is specific to configurations where the master is registered as a node, when running kubenet. Hopefully we'll get it fixed upstream.

@bacongobbler
Member

kubernetes/kubernetes#35526 has since been merged and is available upstream in k8s v1.5.0+. Closing.
