Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic in tailing code #845

Closed
cyriltovena opened this issue Aug 2, 2019 · 11 comments · Fixed by #897
Closed

Panic in tailing code #845

cyriltovena opened this issue Aug 2, 2019 · 11 comments · Fixed by #897
Labels
component/loki good first issue These are great first issues. If you are looking for a place to start, start here! help wanted We would love help on these issues. Please come help us! type/bug Somehing is not working as expected

Comments

@cyriltovena
Copy link
Contributor

Stumble on this today

/go/src/github.com/grafana/loki/pkg/querier/tail.go:193 +0x295
created by github.com/grafana/loki/pkg/querier.(*Tailer).checkIngesterConnections	
/go/src/github.com/grafana/loki/pkg/querier/tail.go:219 +0xa8
github.com/grafana/loki/pkg/querier.(*Tailer).readTailClient(0xc020d72000, 0xc000504840, 0x10, 0x0, 0x0)
goroutine 95605 [running]:
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x105e078]
panic: runtime error: invalid memory address or nil pointer dereference

We should make sure this doesn't happen. Might be already fixed.

/cc @sandlis

@cyriltovena cyriltovena added kind/bug component/loki help wanted We would love help on these issues. Please come help us! good first issue These are great first issues. If you are looking for a place to start, start here! labels Aug 2, 2019
@sandeepsukhani
Copy link
Contributor

I was suspecting this might be happening due to server stopping before closing websockets, it seems it is not the case. If you see it again please let me know, will check it out.

@pracucci
Copy link
Contributor

pracucci commented Aug 14, 2019

@cyriltovena An hypothesis is the tailClients map - returned by Querier.tailDisconnectedIngesters() - contains a nil value (if you manually add it, the stack trace you get is the same of you, obviously with different memory addresses).

Looking at the tailDisconnectedIngesters() implementation, isn't the case possible if q.ring.GetAll() omits an ingester which is listed in tailDisconnectedIngesters input argument? I'm not much familiar with the ring implementation yet, but isn't possible if an ingester left the ring in the meanwhile?

@sandeepsukhani
Copy link
Contributor

I doubt that would be the case, otherwise code at other places which also use ring for querying ingesters would also have panicked. Would still check for surety.

@pracucci
Copy link
Contributor

I doubt that would be the case, otherwise code at other places which also use ring for querying ingesters would also have panicked.

@sandlis Can you elaborate on this, please? Do you mean it's not possible an ingester leave the ring?

@sandeepsukhani
Copy link
Contributor

Ring would most likely just not return ingester which is leaving or left.

@pracucci
Copy link
Contributor

Ring would most likely just not return ingester which is leaving or left.

Ok, that was my initial hypothesis. If so, then look at this

func (q *Querier) tailDisconnectedIngesters(ctx context.Context, req *logproto.TailRequest, connectedIngestersAddr []string) (map[string]logproto.Querier_TailClient, error) {
	tailClients := make(map[string]logproto.Querier_TailClient)
	for i := range connectedIngestersAddr {
		tailClients[connectedIngestersAddr[i]] = nil
	}

	disconnectedIngesters := []ring.IngesterDesc{}
	replicationSet, err := q.ring.GetAll()
	if err != nil {
		return nil, err
	}

	for _, ingester := range replicationSet.Ingesters {
		if _, isOk := tailClients[ingester.Addr]; isOk {
			delete(tailClients, ingester.Addr)
		} else {
			disconnectedIngesters = append(disconnectedIngesters, ingester)
		}
	}

	clients, err := q.forGivenIngesters(ring.ReplicationSet{Ingesters: disconnectedIngesters}, func(client logproto.QuerierClient) (interface{}, error) {
		return client.Tail(ctx, req)
	})
	if err != nil {
		return nil, err
	}

	for i := range clients {
		tailClients[clients[i].addr] = clients[i].response.(logproto.Querier_TailClient)
	}

	return tailClients, nil
}

If an ingester which was previously in the ring (passed as connectedIngestersAddr) now is no more returned by q.ring.GetAll(), then tailClients (which is returned by this function) will contain nil for that ingester address, causing the panic.

Am I missing anything?

@sandeepsukhani
Copy link
Contributor

sandeepsukhani commented Aug 14, 2019

This code only conn to disconnected ingesters. The second for loop removes those nil entries.

@pracucci
Copy link
Contributor

The second for loop removes those nil entries.

The second for loop iterates over the output of q.ring.GetAll(), so if a previous ingester is not returned anymore from q.ring.GetAll(), that ingester address will be kept in tailClients with nil value.

@sandeepsukhani
Copy link
Contributor

Ohh, you are right. Sorry, was reading code on my phone so missed that bit. Will fix it.

@pracucci
Copy link
Contributor

@sandlis I'm already working on a PR to fix it (while validating the idea with you). I will submit the PR in the next few hours 🙏

@sandeepsukhani
Copy link
Contributor

@pracucci Thank you so much for taking care of it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/loki good first issue These are great first issues. If you are looking for a place to start, start here! help wanted We would love help on these issues. Please come help us! type/bug Somehing is not working as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants