Monitor-Telegraf Pod is in CrashLoopBackOff state on the master node with K8s 1.4.0 #146
Comments
What do the pod logs say?
Nothing helpful...
You should have a lot more output than that; that suggests something is wrong with the image. I did an install last night on a […]
Did you change anything in the chart configuration? Or is this a stock install? Have you tried deleting the pods and recreating them using the daemonset file in the manifest directory of the chart?
It is a stock install... I tried deleting the pod (note that not all of them are in a crash loop, only the one on the master node); it seems to be restarting about every 2 minutes.
Just upgraded Deis to 2.6; monitor-telegraf was bumped to version 2.5.1 and the problem is still happening... :/
Telegraf running on the master node is new behavior, I think. I noticed too that on my 1.4 cluster it shows up in the list when you do […]. What OS are you using?
Debian with kernel 4.4
👍 same issue, deployed 1.4 via kops
Same issue, fresh install.
@shulcsm kops too? @felipejfc are you also using kops?
kubeadm on Ubuntu 16.04
Ok, there is a hunch going around that this may be related to some Kubernetes 1.4 work where they made add-ons daemonsets (which is how we deploy telegraf). This is why we are seeing telegraf get scheduled onto the master node. I am working on a way to restrict that from happening.
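One hedged sketch of such a restriction (not necessarily what the chart ended up doing): pin the DaemonSet's pods to labeled worker nodes with a nodeSelector. The `node-type: worker` label here is hypothetical and would have to be applied to each worker node first with `kubectl label nodes <node> node-type=worker`.

```yaml
# Hypothetical fragment of the telegraf DaemonSet spec; the label
# key/value is an assumption, not the actual chart change.
spec:
  template:
    spec:
      nodeSelector:
        node-type: worker
```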
@jchauncey yes! I do use kops
This is related to this issue - kubernetes/kubernetes#29178
I'm not entirely sure how to solve this problem yet, considering that 1.4 doesn't have a label to tell a daemonset not to schedule there. I'll keep thinking about other ways to solve this. But I would still like to know why telegraf is crashlooping.
I tainted the master (my cluster consists of one node) and everything is running now.
@shulcsm what taint did you apply?
kubectl taint nodes --all dedicated- |
@felipejfc and @WillPlatnick if you two can see whether the above command fixes your issue, that would be great.
Well, what would be the implications of tainting my master with dedicated-?
Afaik it should make it so nothing runs on it.
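For reference, a note on what that `kubectl taint` invocation actually does: a trailing dash after a taint key removes that taint, so the command from the thread clears the "dedicated" taint from every node rather than adding one. The add form shown below is illustrative only (node name and effect are assumptions):

```shell
# A trailing dash after the key REMOVES the taint, allowing ordinary
# pods to schedule on the node again:
#
#   kubectl taint nodes --all dedicated-
#
# Adding a taint back onto a single node would look like:
#
#   kubectl taint nodes my-master dedicated=master:NoSchedule
#
# Echo only, so this sketch is safe to run without a cluster:
echo "kubectl taint nodes --all dedicated-"
```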
@felipejfc is it possible for you to ssh into your master node and look at the kubelet configuration to see if you can find where it does the following: […] Trying to see if we are also being affected by this problem - kubernetes/kops#204
Spoke to kops maintainer @justinsb - He's going to put in a PR in kops to raise it, but he requests that Deis put in a PR for this with kube-up too so a discussion can be had there.
Just to confirm it's a v1.4.0 issue, can you try running this on Kubernetes v1.3.8? From what I'm reading in kubernetes/kubernetes, with kube-up on GCE, v1.3.8 uses /30 and v1.4.0 uses /29 as the pod CIDR. Not sure if that's what is making a difference here, but kubernetes/kubernetes#32886 is the PR in question. Just thought I'd report on what upstream's pod CIDR ranges are.
kops merged in a default /28 for us. Updated the cluster, verified kubelet is running with a /28, and the issue is still occurring. Nothing in the logs other than creating topic.
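As a quick sanity check on the per-node pod CIDR sizes being discussed, this shell sketch counts the pod IPs each prefix length leaves, assuming kubenet reserves the network, gateway, and broadcast addresses (that reservation count of 3 is an assumption about the plugin, not something stated in the thread):

```shell
# Pod IPs available in a per-node CIDR of the given prefix length,
# assuming 3 addresses (network, gateway, broadcast) are reserved.
usable_pod_ips() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 3 ))
}

usable_pod_ips 30   # /30 -> 1 usable pod IP
usable_pod_ips 29   # /29 -> 5
usable_pod_ips 28   # /28 -> 13
```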
OK, let me think of some other things that might help us debug this problem.
The deis monitor was already a daemonset before 1.4.x, and it runs on all nodes on 1.3.x as well.
Yes, but on 1.3.0 it was not getting stuck in a crash loop restart.
It is working for me on 1.4.3/1.4.4 on CoreOS beta with a podCIDR of 10.2.0.0/16, and on 1.3.8/1.3.9 with the same podCIDR on CoreOS stable. If the container crashes and the only log message is "Creating topic with URL …", then the curl request must be failing. So my guess would be a connectivity issue to nsqd. A modified deis-monitor-telegraf image which uses "curl -v -s" should be helpful to see what's going on. See https://github.com/deis/monitor/blob/master/telegraf/rootfs/start-telegraf#L17
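A hedged sketch of what that topic-creation step might look like with verbose curl output added. The variable names, service address, and topic name below are assumptions rather than the actual contents of start-telegraf, and the curl call is commented out so the sketch runs without a cluster:

```shell
# Hypothetical reconstruction of the topic-creation step; nsqd's HTTP
# API does expose POST /topic/create?topic=<name>, but the default URL
# and topic here are guesses.
NSQD_URL="${NSQD_URL:-http://deis-nsqd.deis:4151}"
TOPIC_URL="${NSQD_URL}/topic/create?topic=metrics"
echo "Creating topic with URL ${TOPIC_URL}"
# Verbose variant for debugging connectivity to nsqd:
# curl -v -s -X POST "${TOPIC_URL}"
```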
I've done some debugging with @WillPlatnick, and it seems connectivity from pods on the controller to the service network is not working, while it works on the workers. This seems to be specific to kops.
Is there any way to get enough debug information so we can open an issue with kops?
I think @WillPlatnick is already working on opening an issue with kops.
The base issue is apparently a Kubernetes one. They tried to fix it yesterday, but it didn't go too well and had to be reverted. kubernetes/kubernetes#35526 is the active PR to fix this. Hopefully it will be in 1.5.
I think the problem is specific to configurations where the master is registered as a node when running kubenet. Hopefully we'll get it fixed upstream.
kubernetes/kubernetes#35526 has since been merged and is available upstream in k8s v1.5.0+. Closing. |
It used to run fine on my 1.3.5 cluster, but now the pod that is scheduled on the master is in CrashLoopBackOff for some reason; the ones scheduled on minions are normal, though.