Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqadmin: missing channel from topic list #753

Merged
merged 2 commits into from
Jul 31, 2018

Conversation

jehiah
Copy link
Member

@jehiah jehiah commented Jul 31, 2018

the api endpoint /api/topics/:topic seems to be missing a channel that i see in /stats output from nsqd's. I'm not sure what the bug is yet, but this causes it to be missing from the nsqadmin UI at /topics/:topic even though it shows properly if you manually navigate to /topics/:topic/:channel.

@jehiah jehiah added the bug label May 2, 2016
@mreiferson
Copy link
Member

I'm looking forward to the explanation on this one

@chrusty
Copy link

chrusty commented May 4, 2016

I've seen the same thing recently with NSQ 0.3.6

@jehiah
Copy link
Member Author

jehiah commented May 4, 2016

thanks @chrusty that's helpful confirmation. This was with 0.3.6-alpha+build.27.967fce5

I had to resolve the production situation where i observed this, and wasn't able to reproduce this easily in a development environment so it might take me longer than i had hoped to track this down.

@chrusty
Copy link

chrusty commented May 6, 2016

After having a look around, I'm beginning to wonder if this could be a symptom of having my system-clocks out of sync. Recently some of my system clocks have skewed a bit and I know it has caused issues for some of the databases I'm using (this is pretty shameful, I know).

This is still pretty vague, I know.

@jehiah
Copy link
Member Author

jehiah commented Jun 23, 2017

I appear to have stumbled upon another case of this in production. with version 1.0.0-alpha+build.7.ea01a579

The general situation is that given 10 servers, this happens where only one nsqd has a an extra channel. It's reported and returned properly from every nsqlookupd, and returned properly from the individual nsqd stats endpoints. Nsqadmin also properly queries the nsqd node with the odd channel.

I'm including redacted nsqadmin logs, nsqlookup output and nsqd stats output. In this example the channel queuereader_m is missing from the nsqadmin page for events.

[nsqadmin] 2017/06/23 20:12:08.345616 CI: querying nsqlookupd http://lookup05:4161/lookup?topic=events
[nsqadmin] 2017/06/23 20:12:08.345634 CI: querying nsqlookupd http://lookup04:4161/lookup?topic=events
[nsqadmin] 2017/06/23 20:12:08.345672 CI: querying nsqlookupd http://lookup03:4161/lookup?topic=events
[nsqadmin] 2017/06/23 20:12:08.345620 CI: querying nsqlookupd http://lookup02:4161/lookup?topic=events
[nsqadmin] 2017/06/23 20:12:08.345624 CI: querying nsqlookupd http://lookup01:4161/lookup?topic=events
[nsqadmin] 2017/06/23 20:12:08.348347 CI: querying nsqd http://a01:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348354 CI: querying nsqd http://a11:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348358 CI: querying nsqd http://s02:4151/stats?format=json <---
[nsqadmin] 2017/06/23 20:12:08.348370 CI: querying nsqd http://a12:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348359 CI: querying nsqd http://a05:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348422 CI: querying nsqd http://s03:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348364 CI: querying nsqd http://s01:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348442 CI: querying nsqd http://a09:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348472 CI: querying nsqd http://a01:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348351 CI: querying nsqd http://a03:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348516 CI: querying nsqd http://a10:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348575 CI: querying nsqd http://a02:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.348615 CI: querying nsqd http://a13:4151/stats?format=json
[nsqadmin] 2017/06/23 20:12:08.380435 200 GET /api/topics/events (127.0.0.1:56997) 34.826237ms
$ curl 'http://lookup01:4161/lookup?topic=events' | jq .
{
  "producers": [
    {
      "version": "1.0.0-alpha+build.7.ea01a579",
      "http_port": 4151,
      "tcp_port": 4150,
      "broadcast_address": "a01",
      "hostname": "a01",
      "remote_address": "1.2.3.4:58636"
    },
    {
      "version": "1.0.0-alpha+build.7.ea01a579",
      "http_port": 4151,
      "tcp_port": 4150,
      "broadcast_address": "s01",
      "hostname": "s01",
      "remote_address": "1.2.3.5:34638"
    },
	// other hosts
  ],
  "channels": [
    "others....",
    "queuereader_m",
    "others..."
  ]
}
$ curl 'http://s02:4151/stats?format=json'
{
  "topics": [
	{
      "e2e_processing_latency": {
        "percentiles": null,
        "count": 0
      },
      "paused": false,
      "message_count": 1110,
      "backend_depth": 0,
      "depth": 0,
      "channels": [
		{....}, // other channels
        {
          "e2e_processing_latency": {
            "percentiles": null,
            "count": 0
          },
          "paused": false,
          "clients": [],
          "channel_name": "queuereader_m",
          "depth": 9517,
          "backend_depth": 8407,
          "in_flight_count": 0,
          "deferred_count": 0,
          "message_count": 1110,
          "requeue_count": 0,
          "timeout_count": 0
        },	
      ],
      "topic_name": "events"
    },
  ],
  "start_time": 1497479679,
  "health": "OK",
  "version": "1.0.0-alpha+build.7.ea01a579"
}

@mreiferson
Copy link
Member

@jehiah so you think this is a bug in nsqadmin query/parsing/aggregation or the frontend displaying the results?

@jehiah
Copy link
Member Author

jehiah commented Jun 25, 2017

I think it's a nsqadmin aggregation bug. I suspect the aggregation somehow only returns items that are in the first nsq stats output it parses.

Unfortunately I forgot to capture the matching nsqadmin API output.

@mreiferson
Copy link
Member

@jehiah you intend to take this one?

@jehiah
Copy link
Member Author

jehiah commented Jun 26, 2017

not immediately

@ploxiln
Copy link
Member

ploxiln commented Sep 24, 2017

I've run into something probably unrelated but possibly sometimes giving similar symptoms: nsqadmin seems to keep and re-use http client connections after a week or so of inactivity, but they don't work (presumably because the AWS EC2 security-groups firewall forgot the connection) and they time out. Then on the second (and further) attempts to load the same page, it uses a new connection and works instantly.

I started noticing this a month or two ago, and once the pattern was figured out, it is surprisingly consistent. I load up nsqadmin after leaving it alone for a full week or so, and go to the "Nodes" list page, and then go to individual an nsqd node page. (So it makes a stats request to just one nsqd for the page. This also affects topic and channel pages, but multiple nsqd at a time. So for isolating the issue I used nodes.) For every single nsqd, the first time, the request times out. Later times works.

I can curl nsqd stats, while on the nsqadmin server, just before loading the node page for that nsqd. Curl works first time, nsqadmin fails first time. I did tcpdump once, and was able to confirm that there was no SYN etc on the first attempt, just an http request on a theoretically existing connection, then a bunch of TCP retransmissions.

The crazy part is that there are multiple reasons why this should be impossible. First, I've ensured /proc/sys/net/ipv4/tcp_keepalive_{time,intvl,probes} is set (to {540,45,3} in my case but the default {7200,75,9} should also work). Second, I don't see this behavior after just 24 or 36 hours, I have to wait about a week to see it. Third, go added a work-around for this common problem in early 2016: https://go-review.googlesource.com/c/go/+/22670

I'm using an nsqadmin built from the master branch in early August, with go-1.8.3 on macOS cross-building for linux.

EDIT: fixed typo of missing critical "no" in "no SYN"

EDIT2: after reading more, I've learned that os/kernel configuration of tcp keepalive is not enough to enable it, the process must enable it on a socket with setsockopt(... SO_KEEPALIVE ...) as well. I'm still not sure why the golang Transport feature is not working though

EDIT3: that's because nsq/internal/http_api/api_request.go:NewDeadlineTransport() creates a new transport without the MaxIdle stuff. that stuff was added in go-1.7

@jehiah jehiah force-pushed the nsqadmin_fix_753 branch from 4961787 to 598e111 Compare July 31, 2018 06:04
@jehiah jehiah requested review from mreiferson and ploxiln July 31, 2018 06:05
@jehiah
Copy link
Member Author

jehiah commented Jul 31, 2018

RFR @mreiferson @ploxiln

I was able to reproduce this bug with the following setup

Node A: topic: T channels a,b,c
Node B: topic: T channels a.

With that setup the count was previously 3 on both nodes, and depending on the order /stats responses completed you would see all three channels listed on the topics page, or just one.

for _, aChannelStats := range a.Channels {
found := false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice fix 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it only took me about a hundred printf's to find it =)

@@ -81,7 +81,7 @@
{{#if ../graph_active}}
<td class="bold rate" target="{{rate "topic" node topic_name ""}}"></td>
{{/if}}
<td>{{commafy ../channels.length}}</td>
<td>{{commafy this/channels.length}}</td>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is greek to me :)
I wonder if "this/" is not needed and just "channels.length" would suffice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure... I confused myself by working in the wrong part of the template for a while, but there is a global channels var so this is at least more explicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough :) no objection

Copy link
Member

@mreiferson mreiferson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice catch

@mreiferson mreiferson merged commit f0dd3c3 into nsqio:master Jul 31, 2018
@mreiferson mreiferson deleted the nsqadmin_fix_753 branch July 31, 2018 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants