nsqd: ephemeral topic/channel churn can live-lock #1251

Closed
ploxiln opened this issue May 9, 2020 · 1 comment · Fixed by #1314

ploxiln commented May 9, 2020

From #1246.

If consumers of ephemeral topics and channels connect and disconnect quickly enough, even for a short period of time, some subscribe attempts in nsqd can get stuck in indefinite retry loops.

Demo by @slayercat:

package main

import (
	"fmt"
	"log"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

func nsqSubscribe(addr string, topic string, channel string, hdlr nsq.HandlerFunc) error {
	consumer, err := nsq.NewConsumer(topic, channel, nsq.NewConfig())
	if err != nil {
		log.Printf("new consumer error: %v", err) // the print builtin would not show the error message
		time.Sleep(1 * time.Second) //wait 1s
		panic(err)
	}
	consumer.AddHandler(hdlr)
	for {
		err = consumer.ConnectToNSQD(addr)
		if err != nil {
			log.Printf("connect nsqd error: %v. retry\n", err)
			time.Sleep(1 * time.Second) //wait 1s
			continue
		} else {
			break
		}
	}
	// Block until the consumer stops, then crash loudly.
	<-consumer.StopChan
	panic("nsq conn dead topic=" + topic + " channel=" + channel)
}

func handlerFunc1(message *nsq.Message) error {
	message.Finish()
	return nil
}

func main() {
	for i := 0; i < 10; i++ {
		go nsqSubscribe("127.0.0.1:4150",
			fmt.Sprintf("testTopic%d#ephemeral", i),
			fmt.Sprintf("testChannel%d#ephemeral", i),
			handlerFunc1)
	}

	nsqSubscribe("127.0.0.1:4150", "testTopic#ephemeral", "testChannel#ephemeral", handlerFunc1)
}

Start a simple stand-alone nsqd, and then build and run the demo above in a bash loop:

go build -o demo demo.go

for I in $(seq 100); do
    ./demo &
    sleep 0.1
    kill $!
done

Then, nsqd will get stuck in the client topic/channel subscribe retry loop, seemingly indefinitely (long after the above loop finishes, and even after nsqd is signalled to exit gracefully):

[nsqd] 2020/05/09 13:13:17.694506 INFO: CHANNEL(testChannel6#ephemeral): deleting
[nsqd] 2020/05/09 13:13:17.694701 INFO: TOPIC(testTopic6#ephemeral): new channel(testChannel6#ephemeral)
[nsqd] 2020/05/09 13:13:17.694786 INFO: TOPIC(testTopic6#ephemeral): deleting channel testChannel6#ephemeral
[nsqd] 2020/05/09 13:13:17.694796 INFO: CHANNEL(testChannel6#ephemeral): deleting
[nsqd] 2020/05/09 13:13:17.695104 INFO: TOPIC(testTopic6#ephemeral): new channel(testChannel6#ephemeral)
[nsqd] 2020/05/09 13:13:17.695187 INFO: TOPIC(testTopic6#ephemeral): deleting channel testChannel6#ephemeral
[nsqd] 2020/05/09 13:13:17.695203 INFO: CHANNEL(testChannel6#ephemeral): deleting
[nsqd] 2020/05/09 13:13:17.695535 INFO: TOPIC(testTopic6#ephemeral): new channel(testChannel6#ephemeral)
[nsqd] 2020/05/09 13:13:17.695620 INFO: TOPIC(testTopic6#ephemeral): deleting channel testChannel6#ephemeral
[nsqd] 2020/05/09 13:13:17.695631 INFO: CHANNEL(testChannel6#ephemeral): deleting
[nsqd] 2020/05/09 13:13:17.696009 INFO: TOPIC(testTopic6#ephemeral): new channel(testChannel6#ephemeral)
[nsqd] 2020/05/09 13:13:17.696102 INFO: TOPIC(testTopic6#ephemeral): deleting channel testChannel6#ephemeral
[nsqd] 2020/05/09 13:13:17.696115 INFO: CHANNEL(testChannel6#ephemeral): deleting
[nsqd] 2020/05/09 13:13:17.696359 INFO: TOPIC(testTopic6#ephemeral): new channel(testChannel6#ephemeral)

Some ideas:

  • add 100ms of sleep before retry
  • add random sleep between 1ms and 250ms between retries
  • give up after 2 retries

mreiferson commented

see #1314
