proxy: Make sure there is no race condition between sender and recv of resp channel #745
Conversation
👍 LGTM! But I have a question.
BTW, it should be added to the CHANGELOG! 😝
Force-pushed c1f824e to 4fb699a: "This fixes potential race condition. Bit more efficient version of #744. Signed-off-by: Bartek Plotka <[email protected]>"
Force-pushed 4fb699a to 3d8ccb1.
var storeDebugMsgs []string

// Allow to buffer max 10 series response.
// Each might be quite large (multi chunk long series given by sidecar).
respSender, respRecv, closeFn = newRespCh(gctx, 10)
Why hardcode 10?
Is that something that could impact performance (in both good and bad ways)? What if Thanos is running on big machines; might people want to be able to tune that parameter?
Good question.
This is a tradeoff between buffering memory and the time a connection stays open.
The problem right now is that this message can be fairly large, since we send THE whole series from the sidecar. Imagine a 5s scrape interval (roughly the worst case), a 6-week query, and a single series: a message might be up to 2MB, so a single query could buffer up to 20MB in that worst case. Now imagine multiple queries like this... (: So keeping it low makes more sense.
In terms of exposing control over it, we can do that later, but we need to make sure it is easy to understand and that nothing else limits/extends that buffer (e.g. gRPC itself on both ends).
I'm not an expert in goroutines, so I'm afraid I won't be able to give a good review on this, but I still found a comment to make 😄 The logic seems sound; do those changes deserve extra testing? Race conditions are hard to test, though.
Yea - we have quite a lot of tests for this, and I also ran it with . OK, I will merge this PR today if no objections arise - quite urgent. @domgreen @improbable-ludwik
LGTM - I still think we could easily end up OOMing with a large set of stores... however, this would at least restrict it.
@@ -171,48 +202,74 @@ func (s *ProxyStore) Series(r *storepb.SeriesRequest, srv storepb.Store_SeriesSe

}

type warnSender interface {
Nit - would move the declaration of this interface up near the ctxRespSender, which satisfies this interface.
Soo, I think it can be in both places, but the pattern is to have consumers of the interface define the interface, so I vote for leaving it near the consumer.
Example references to this pattern: https://stackoverflow.com/questions/53381694/where-should-we-define-go-interface-in-multiple-consumer-scenario-what-about-i#comment93639200_53381694 https://twitter.com/davecheney/status/942593128355192832?lang=en
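The "consumer defines the interface" pattern discussed above can be sketched like this. All names below (`sliceSender`, `mergeSeries`) are illustrative stand-ins, not Thanos code; only `warnSender` echoes the interface from the diff.

```go
package main

import "fmt"

// warnSender is declared next to its consumer (mergeSeries below), not
// next to the concrete type that happens to satisfy it. The consumer owns
// the contract and only asks for the narrow capability it needs.
type warnSender interface {
	send(warning string)
}

// sliceSender is a hypothetical producer satisfying warnSender by
// recording warnings in a slice.
type sliceSender struct{ warnings []string }

func (s *sliceSender) send(w string) { s.warnings = append(s.warnings, w) }

// mergeSeries is the consumer: it depends on warnSender, never on
// sliceSender, so any other sender implementation can be swapped in.
func mergeSeries(sets [][]string, ws warnSender) []string {
	if len(sets) == 0 {
		ws.send("No store matched for this query")
		return nil
	}
	var out []string
	for _, s := range sets {
		out = append(out, s...)
	}
	return out
}

func main() {
	s := &sliceSender{}
	merged := mergeSeries([][]string{{"a"}, {"b", "c"}}, s)
	fmt.Println(merged, s.warnings)
}
```

The payoff is testability: a fake sender like `sliceSender` is enough to unit-test the consumer, with no channels involved.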
	if len(seriesSet) == 0 {
		// This is indicates that configured StoreAPIs are not the ones end user expects
		err := errors.New("No store matched for this query")
		level.Warn(s.logger).Log("err", err, "stores", strings.Join(storeDebugMsgs, ";"))
-		respCh <- storepb.NewWarnSeriesResponse(err)
+		respSender.send(storepb.NewWarnSeriesResponse(err))
		return nil
	}

	mergedSet := storepb.MergeSeriesSets(seriesSet...)
	for mergedSet.Next() {
		var series storepb.Series
		series.Labels, series.Chunks = mergedSet.At()
Not needed in this PR, but it would be good to add documentation to the iterator methods of the interface... Next(), At(), etc.
Yes 👍 This came from TSDB, I think; it would be worth contributing such docs there. Or actually, hosting this in Thanos is important as well, I think, since we use it heavily. Where to put it, though? Maybe a blog post?
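As a sketch of what documenting the iterator contract could look like, here is a hypothetical, self-contained version; the real interface lives in storepb / TSDB and `SeriesSet`'s element type there is a series, not a string.

```go
package main

import "fmt"

// SeriesSet is a hypothetical, documented version of the merge iterator
// discussed above.
type SeriesSet interface {
	// Next advances the iterator to the next element. It must be called
	// before the first At, and it returns false once the set is exhausted.
	Next() bool
	// At returns the element the iterator currently points at. It is only
	// valid to call At after Next has returned true.
	At() string
	// Err returns the first error encountered during iteration, if any.
	// It should be checked once Next returns false.
	Err() error
}

// sliceSet is a trivial slice-backed implementation for illustration.
type sliceSet struct {
	items []string
	i     int
}

func (s *sliceSet) Next() bool { s.i++; return s.i <= len(s.items) }
func (s *sliceSet) At() string { return s.items[s.i-1] }
func (s *sliceSet) Err() error { return nil }

func main() {
	var set SeriesSet = &sliceSet{items: []string{"a", "b"}}
	for set.Next() {
		fmt.Println(set.At())
	}
	if err := set.Err(); err != nil {
		fmt.Println("iteration failed:", err)
	}
}
```

The Next/At/Err split mirrors the Prometheus/TSDB iterator style, where errors surface at the end of iteration rather than on every call.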
	closeFn()
}()

for _, st := range stores {
If there are a large number of stores, e.g. 100+, we would end up spinning up 100+ goroutines that would then all go and fetch 10 series from their stores... does this also risk OOMing us? That could then be 1000 series in memory on the querier as we are merging.
Hm... goroutines are not a problem; there can be thousands of them.
But you are right: the buffer of 10 for each store plus 10 for the global channel makes it more problematic for a large number of stores. I wonder if we should open a ticket to discuss potential optimizations for those. Having a pool of goroutines could work.
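The pool idea mentioned above could look roughly like this: cap concurrency with a fixed set of workers so 100+ stores never mean 100+ simultaneous in-flight fetches (and 100+ live per-store buffers). `fetchAll` and `fetch` are hypothetical stand-ins for fanning out `Store.Series` calls, not Thanos's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll queries every store through a bounded pool of workers and sums
// the results. fetch stands in for a per-store Series call.
func fetchAll(stores []string, workers int, fetch func(string) int) int {
	jobs := make(chan string)
	results := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker drains jobs until the channel is closed, so at
			// most `workers` fetches are in flight at once.
			for st := range jobs {
				results <- fetch(st)
			}
		}()
	}
	go func() {
		for _, st := range stores {
			jobs <- st
		}
		close(jobs)
	}()
	go func() {
		wg.Wait()
		close(results)
	}()

	total := 0
	for r := range results {
		total += r
	}
	return total
}

func main() {
	stores := []string{"s1", "s2", "s3", "s4", "s5"}
	// Only 2 fetches run concurrently, regardless of how many stores exist.
	total := fetchAll(stores, 2, func(st string) int { return len(st) })
	fmt.Println(total)
}
```

The memory bound then becomes workers × per-store buffer instead of stores × per-store buffer, which is the property the OOM concern is about.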
Merging if no other objection. @domgreen
This fixes a potential race condition. Thanks to @jojohappy, who highlighted the problem in this PR: #744
Sorry for the duplicate PR, but I found other race conditions as well (what if the sending stream returns an error!), and this fixes CI and edge cases for users, so it's quite urgent.
Particular race conditions and known issues this PR fixed:
@domgreen you were fixing something similar. This should fix all cases. More unit tests to be added tomorrow.
PTAL @jojohappy
Signed-off-by: Bartek Plotka [email protected]