nsqadmin: missing channel from topic list #753
I'm looking forward to the explanation on this one
I've seen the same thing recently with NSQ 0.3.6
Thanks @chrusty, that's helpful confirmation. This was with 0.3.6-alpha+build.27.967fce5. I had to resolve the production situation where I observed this, and wasn't able to reproduce it easily in a development environment, so it might take me longer than I had hoped to track this down.
After having a look around, I'm beginning to wonder if this could be a symptom of having my system clocks out of sync. Recently some of my system clocks have skewed a bit, and I know it has caused issues for some of the databases I'm using (this is pretty shameful, I know). This is still pretty vague, I know.
I appear to have stumbled upon another case of this in production. with version The general situation is that, given 10 servers, this happens where only one nsqd has an extra channel. The channel is reported and returned properly from every nsqlookupd, and returned properly from the individual nsqd stats endpoints. nsqadmin also properly queries the nsqd node with the odd channel. I'm including redacted nsqadmin logs, nsqlookupd output and nsqd stats output. In this example the channel
$ curl 'http://lookup01:4161/lookup?topic=events' | jq .
{
"producers": [
{
"version": "1.0.0-alpha+build.7.ea01a579",
"http_port": 4151,
"tcp_port": 4150,
"broadcast_address": "a01",
"hostname": "a01",
"remote_address": "1.2.3.4:58636"
},
{
"version": "1.0.0-alpha+build.7.ea01a579",
"http_port": 4151,
"tcp_port": 4150,
"broadcast_address": "s01",
"hostname": "s01",
"remote_address": "1.2.3.5:34638"
},
// other hosts
],
"channels": [
"others....",
"queuereader_m",
"others..."
]
}
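For anyone wanting to script the same check, here is a minimal Go sketch that decodes a lookup response shaped like the one above and tests whether a given channel is listed. The struct names, the `hasChannel` helper and the trimmed sample body are assumptions made for illustration; only the JSON field names come from the output shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// producer and lookupResponse are hypothetical types that mirror only the
// fields visible in the redacted lookup output above.
type producer struct {
	BroadcastAddress string `json:"broadcast_address"`
	HTTPPort         int    `json:"http_port"`
}

type lookupResponse struct {
	Producers []producer `json:"producers"`
	Channels  []string   `json:"channels"`
}

// sampleLookup is a trimmed copy of the output above, without the
// "// other hosts" placeholder, which is not valid JSON.
const sampleLookup = `{
  "producers": [
    {"broadcast_address": "a01", "http_port": 4151},
    {"broadcast_address": "s01", "http_port": 4151}
  ],
  "channels": ["queuereader_m"]
}`

// hasChannel reports whether the lookup response body lists the channel.
func hasChannel(body []byte, channel string) (bool, error) {
	var resp lookupResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, err
	}
	for _, c := range resp.Channels {
		if c == channel {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := hasChannel([]byte(sampleLookup), "queuereader_m")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(ok)
}
```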
@jehiah so you think this is a bug in
I think it's an nsqadmin aggregation bug. I suspect the aggregation somehow only returns items that are in the first nsq stats output it parses. Unfortunately I forgot to capture the matching nsqadmin API output.
@jehiah you intend to take this one?
not immediately
I've run into something probably unrelated but possibly sometimes giving similar symptoms: nsqadmin seems to keep and re-use HTTP client connections after a week or so of inactivity, but they don't work (presumably because the AWS EC2 security-groups firewall forgot the connection) and they time out. Then on the second (and further) attempts to load the same page, it uses a new connection and works instantly. I started noticing this a month or two ago, and once the pattern was figured out, it is surprisingly consistent. I load up nsqadmin after leaving it alone for a full week or so, go to the "Nodes" list page, and then go to an individual nsqd node page. (So it makes a stats request to just one nsqd for the page. This also affects topic and channel pages, but with multiple nsqd at a time, so for isolating the issue I used nodes.) For every single nsqd, the first request times out. Later requests work. I can curl nsqd stats while on the nsqadmin server, just before loading the node page for that nsqd. Curl works the first time, nsqadmin fails the first time. I ran tcpdump once, and was able to confirm that there was no SYN etc. on the first attempt, just an HTTP request on a theoretically existing connection, then a bunch of TCP retransmissions. The crazy part is that there are multiple reasons why this should be impossible. First, I've ensured I'm using an nsqadmin built from the master branch in early August, with go-1.8.3 on macOS cross-building for Linux. EDIT: fixed typo of missing critical "no" in "no SYN". EDIT2: after reading more, I've learned that OS/kernel configuration of TCP keepalive is not enough to enable it; the process must enable it on a socket with EDIT3: that's because
RFR @mreiferson @ploxiln. I was able to reproduce this bug with the following setup: Node A: topic: T, channels a,b,c. With that setup the count was previously 3 on both nodes, and depending on the order
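A minimal sketch of the kind of merge involved, assuming the problem is a "found" flag that is not reset for each incoming channel. The types are simplified stand-ins rather than nsqadmin's real stats types, and the second node's channel set (a,b,c,d) is an assumption, since the setup above only spells out Node A.

```go
package main

import "fmt"

// ChannelStats and TopicStats are simplified, hypothetical stand-ins for the
// per-node stats being aggregated; the real types carry many more fields.
type ChannelStats struct {
	ChannelName string
	Depth       int64
}

type TopicStats struct {
	TopicName string
	Channels  []*ChannelStats
}

// Add merges the channels of a into t, keeping the union of channel names.
func (t *TopicStats) Add(a *TopicStats) {
	for _, aChannelStats := range a.Channels {
		// Reset for every incoming channel. If this flag were declared once
		// outside the loop, a single early match would suppress the append
		// for every channel that follows.
		found := false
		for _, channelStats := range t.Channels {
			if aChannelStats.ChannelName == channelStats.ChannelName {
				found = true
				channelStats.Depth += aChannelStats.Depth
				break
			}
		}
		if !found {
			t.Channels = append(t.Channels, aChannelStats)
		}
	}
}

// newTopic is a hypothetical helper that builds a topic with empty channels.
func newTopic(topic string, channels ...string) *TopicStats {
	t := &TopicStats{TopicName: topic}
	for _, name := range channels {
		t.Channels = append(t.Channels, &ChannelStats{ChannelName: name})
	}
	return t
}

func main() {
	nodeA := newTopic("T", "a", "b", "c")
	nodeB := newTopic("T", "a", "b", "c", "d") // assumed: one extra channel
	nodeA.Add(nodeB)
	fmt.Println(len(nodeA.Channels)) // prints 4
}
```

With the flag reset inside the loop, the merged count is the same whichever node is processed first.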
for _, aChannelStats := range a.Channels {
	found := false
nice fix 👍
it only took me about a hundred printf's to find it =)
@@ -81,7 +81,7 @@
 {{#if ../graph_active}}
 <td class="bold rate" target="{{rate "topic" node topic_name ""}}"></td>
 {{/if}}
-<td>{{commafy ../channels.length}}</td>
+<td>{{commafy this/channels.length}}</td>
this is greek to me :)
I wonder if "this/" is not needed and just "channels.length" would suffice?
I'm not sure... I confused myself by working in the wrong part of the template for a while, but there is a global channels var, so this is at least more explicit.
fair enough :) no objection
LGTM, nice catch
The API endpoint /api/topics/:topic seems to be missing a channel that I see in /stats output from the nsqd nodes. I'm not sure what the bug is yet, but this causes the channel to be missing from the nsqadmin UI at /topics/:topic even though it shows properly if you manually navigate to /topics/:topic/:channel.