Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: message stalling with Snappy enabled #532

Closed
amattn opened this issue Jan 26, 2015 · 11 comments · Fixed by #830
Closed

nsqd: message stalling with Snappy enabled #532

amattn opened this issue Jan 26, 2015 · 11 comments · Fixed by #830

Comments

@amattn
Copy link

amattn commented Jan 26, 2015

Hi,

Root problem:

Most of the messages that go through our system do so rapidly (tens or hundreds of milliseconds) but some messages stall for multiple seconds. The length of the stall is tied to the nsqd client heartbeat.

The stalled messages don't really stand out (roughly 6k of json). We send other large messages that don't seem to stall.

My own investigation so far suggests that this area of the code in protocol_v2.go is related:

func (p *protocolV2) Send(client *clientV2, frameType int32, data []byte) error {
    client.Lock()

    var zeroTime time.Time
    if client.HeartbeatInterval > 0 {
        client.SetWriteDeadline(time.Now().Add(client.HeartbeatInterval))
    } else {
        client.SetWriteDeadline(zeroTime)
    }

    _, err := util.SendFramedResponse(client.Writer, frameType, data)
    if err != nil {
        client.Unlock()
        return err
    }

    if frameType != frameTypeMessage {
        err = client.Flush()
    }

    client.Unlock()

    return err
}

Specifically, I believe that certain messages aren't being flushed properly. The heartbeat hits the client and then that flushes the stalled messages through.

Have you seen this behavior before?
Is there any reason that a message might get stuck in util.SendFramedResponse?
Is there a reason we don't explicitly flush frameTypeMessage?

@mreiferson
Copy link
Member

hey @amattn

Can you paste the code you're using to configure the Consumer? There are a bunch of settings related to buffering and flushing...

@amattn
Copy link
Author

amattn commented Jan 26, 2015

Thanks! Here's our settings:
(we don't use magic numbers, but you get the idea)

config.HeartbeatInterval was higher, but we dropped it down to 3 seconds as a temporary workaround for force flushing.

config := nsq.NewConfig()
config.UserAgent = fmt.Sprintf("%s build %d", Version(), BuildNumber())
config.LookupdPollInterval = 11 * time.Second
config.Snappy = true
config.HeartbeatInterval = 3 * time.Second
config.MaxInFlight = 200

nsq_consumer, err := nsq.NewConsumer(topic, channel, config)
if err != nil {

}

// set handler
nsq_consumer.AddConcurrentHandlers(nr, 20)

@mreiferson
Copy link
Member

Here are the two settings that affect buffering: https://github.com/bitly/go-nsq/blob/master/config.go#L92-L99

You shouldn't (ever) see a delay that's longer than OutputBufferTimeout.

HeartbeatInterval is a red herring, you can return HeartbeatInterval to its default after we figure this out...

@mreiferson
Copy link
Member

It's possible this has something to do with Snappy = true, which is technically another layer of buffering.

Can you try disabling Snappy and see if the problem persists?

@mreiferson mreiferson added the bug label Jan 26, 2015
@mreiferson mreiferson changed the title message stalling nsqd: message stalling Jan 26, 2015
@amattn
Copy link
Author

amattn commented Jan 26, 2015

config.Snappy = false

Fixed!

Is this a snappy bug? For now we'll disable snappy and possibly try out deflate.

Any other recommendations?

@mreiferson
Copy link
Member

I don't think it's a Snappy bug per-se, but most likely our use of snappy 🔥

@mreiferson mreiferson changed the title nsqd: message stalling nsqd: message stalling with Snappy enabled Jan 26, 2015
@mreiferson
Copy link
Member

(updated title)

@mreiferson
Copy link
Member

@amattn is there some easy test case/code/app I can use to debug this? I'm having trouble reproducing 😞

@amattn
Copy link
Author

amattn commented Jan 31, 2015

I’ll see if I can reproduce in a test system…

@itsimba
Copy link

itsimba commented Dec 28, 2016

I also meet this problems
when consumer's snappy is true, I test for msgSize > 4096, it's stalled

https://github.com/nsqio/go-nsq/blob/master/protocol.go line 61

_, err = io.ReadFull(r, buf)

this code can not read full message, it's waiting for reading.
I test it, but I didn't know why.


also that, I replace nsqd's and consumer's snappy with google's snappy, it ok
https://github.com/golang/snappy


Just replacing the consumer's snappy code is correct too.

@mreiferson
Copy link
Member

@lauyoume thanks for the tip! I've opened #830 to handle this. I think it's good to defer to official "golang" packages when possible anyway!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants