-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
*: Add initial remote-write-receive component #811
Conversation
Finally <3 First of all, as you can see our CI does not like NOT checked errors (: and both:
looks like it would be nice at least log them. |
d042091
to
162c12f
Compare
Looks like that made CI happy 🙂 . |
I think having a simple e2e test added would be good. I'll look into adding one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, just super small nits, mostly around style. (:
Nice work!
cmd/thanos/remote-write-receive.go
Outdated
) | ||
|
||
func registerRemoteWriteReceive(m map[string]setupFunc, app *kingpin.Application, name string) { | ||
cmd := app.Command(name, "accept Prometheus remote write API requests (EXPERIMENTAL, this may change drastically without notice)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd := app.Command(name, "accept Prometheus remote write API requests (EXPERIMENTAL, this may change drastically without notice)") | |
cmd := app.Command(name, "Accept Prometheus remote write API requests (EXPERIMENTAL, this may change drastically without notice)") |
cmd/thanos/remote-write-receive.go
Outdated
return nil | ||
}, | ||
func(err error) { | ||
level.Debug(logger).Log("msg", "initial load group errored", "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I don't get this log messages here and on every group. This code is invoked even though this group did not errof (but e.g other group errored). I think we can remove those debug logs. We will print error for problematic group via g.Run
anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this is a leftover from some debugging, I'll remove it.
cmd/thanos/remote-write-receive.go
Outdated
}, | ||
func(err error) { | ||
level.Debug(logger).Log("msg", "tsdb group errored", "err", err) | ||
if err := localStorage.Close(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just checking .. localStorage.Close closes db
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does
cmd/thanos/remote-write-receive.go
Outdated
func(err error) { | ||
level.Debug(logger).Log("msg", "tsdb group errored", "err", err) | ||
if err := localStorage.Close(); err != nil { | ||
level.Error(logger).Log("msg", "Error stopping storage", "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all log lines should start lowercase
cmd/thanos/remote-write-receive.go
Outdated
|
||
level.Debug(logger).Log("msg", "setting up metric http listen-group") | ||
if err := metricHTTPListenGroup(g, logger, reg, httpMetricsBindAddr); err != nil { | ||
level.Debug(logger).Log("msg", "metric listener errored", "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replicated error handling. (rule "Handle error just once") (:
I think error wrap is the way to go (and removal of this log line).
pkg/receive/handler.go
Outdated
|
||
var ( | ||
m = cmux.New(listener) | ||
httpl = m.Match(cmux.HTTP1Fast()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment what this is for? It might be not straightforward.
pkg/receive/handler.go
Outdated
return | ||
} | ||
|
||
err = h.receiver.Receive(&wreq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can do if err := ... ; err != nil
as well
pkg/receive/receivewriter.go
Outdated
package receive | ||
|
||
import ( | ||
"github.com/go-kit/kit/log" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make format
needed (:
scripts/prometheus.yml
Outdated
@@ -0,0 +1,10 @@ | |||
# When the Thanos remote-write-receive component is started, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.. can we add this to remote-write-receive
command description instead? Might be easier to discover
@@ -136,4 +142,22 @@ do | |||
--cluster.peers 127.0.0.1:19391 & | |||
done | |||
|
|||
sleep 0.5 | |||
|
|||
if [ -n "${REMOTE_WRITE_ENABLED}" ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow I did not know that this script still works (:
And yea, some tests would be nice. (: |
cmd/thanos/remote-write-receive.go
Outdated
|
||
var metadata = &promMetadata{ | ||
// Start out with the full time range. The shipper will constrain it later. | ||
// TODO(fabxc): minimum timestamp is never adjusted if shipping is disabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this code copied from somewhere? Slightly confusing to see TODO(fabxc)
in here, but not a big deal I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha yea was surprised as well at the beginning (: It is copied indeed.
pkg/receive/handler.go
Outdated
// Options for the web Handler. | ||
type Options struct { | ||
Receiver *ReceiveWriter | ||
Context context.Context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed as it is not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a few minor things found next to what @bwplotka already discovered.
Overall really good stuff! 👍
pkg/receive/handler.go
Outdated
// Handler serves various HTTP endpoints of the Prometheus server | ||
type Handler struct { | ||
readyStorage *tsdb.ReadyStorage | ||
context context.Context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed as it is not used.
pkg/receive/handler.go
Outdated
ReadTimeout: h.options.ReadTimeout, | ||
} | ||
|
||
errCh := make(chan error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should be a run.Group
.
pkg/receive/handler.go
Outdated
func (h *Handler) receive(w http.ResponseWriter, req *http.Request) { | ||
defer func() { | ||
if r := recover(); r != nil { | ||
fmt.Println("panic recovered:", r, string(debug.Stack())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we want to have a counter for how many panics were recovered?
pkg/receive/receivewriter.go
Outdated
Appender() (storage.Appender, error) | ||
} | ||
|
||
type ReceiveWriter struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is spelled receive.ReceiveWriter
and should be renamed to receive.Writer
to be more idiomatic.
d7765cc
to
abf0029
Compare
@bwplotka @metalmatze I believe I addressed everything or answered the comments and added the receiver to the e2e test setup (the test is that a prometheus is successfully pushing an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks solid overall just a few nits. 👯♂️
Sorry again for taking so long to get around to this :(
cmd/thanos/main.go
Outdated
@@ -78,6 +78,7 @@ func main() { | |||
registerCompact(cmds, app, "compact") | |||
registerBucket(cmds, app, "bucket") | |||
registerDownsample(cmds, app, "downsample") | |||
registerRemoteWriteReceive(cmds, app, "remote-write-receive") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - would shorten this to just receive
/remote
/receiver
cmd/thanos/remote-write-receive.go
Outdated
cancel() | ||
peer.Close(2 * time.Second) | ||
}) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this whole block? Will receive need to peer via gossip? If we are removing gossip in the near future I think we should not add it to any new components and instead help people migrate to the new approach.
cmd/thanos/remote-write-receive.go
Outdated
level.Debug(logger).Log("msg", "setting up grpc server") | ||
{ | ||
var ( | ||
logger = log.With(logger, "component", "receiver") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
receiver
> receive
? or the verbose name? Let's keep it consistant.
pkg/component/component.go
Outdated
Rule = sourceStoreAPI{component: component{name: "rule"}} | ||
Sidecar = sourceStoreAPI{component: component{name: "sidecar"}} | ||
Store = sourceStoreAPI{component: component{name: "store"}} | ||
RemoteWrite = sourceStoreAPI{component: component{name: "remote_write"}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consistency
pkg/receive/handler.go
Outdated
|
||
reqBuf, err := snappy.Decode(nil, compressed) | ||
if err != nil { | ||
fmt.Println("snappy decode error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmt
> logger
?
scripts/quickstart.sh
Outdated
|
||
if [ -n "${REMOTE_WRITE_ENABLED}" ] | ||
then | ||
./thanos remote-write-receive \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this is great taht were adding it here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either we leave this or remove gossip 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwplotka what do you think on this ... personally, I think we should phase out gossip so (in a different PR) change this file to work via the new approach.
So for this PR I would remove receive
from this and once the above work has been done add it back in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we have the e2e tests, that works for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As there is no git-history of this, I'm going to comment this out for now, but leave it. Does that sound like an ok compromise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with that ... can you make sure a comment about why its commented is included 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
cmd/thanos/remote-write-receive.go
Outdated
dataDir string, | ||
peer cluster.Peer, | ||
) error { | ||
level.Info(logger).Log("msg", "setting up receiver") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth adding a warn log here saying that it is currently experimental?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very good idea 👍
abf0029
to
6c371a6
Compare
pkg/receive/handler.go
Outdated
var ( | ||
requestDuration = prometheus.NewHistogramVec( | ||
prometheus.HistogramOpts{ | ||
Name: "prometheus_http_request_duration_seconds", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want thanos
prefix for those metrics? Or do we want to keep it similar to get advantage of common dashboards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that was a copy paste mistake. Yes lets make it thanos_http_...
.
pkg/receive/handler.go
Outdated
return func(w http.ResponseWriter, r *http.Request) { | ||
if h.isReady() { | ||
f(w, r) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just return here, no need for else
pkg/receive/handler.go
Outdated
ReadTimeout: h.options.ReadTimeout, | ||
} | ||
|
||
errCh := make(chan error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still valid?
6c371a6
to
e1315c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still some unaddressed bit, plus some suggestions.. Overall LGTM, but bit major is not versioned remote read receive endpoint, thoughts? (:
IF all will be addressed -> LGTM from me.
cmd/thanos/receive.go
Outdated
Default("./data").String() | ||
|
||
m[name] = func(g *run.Group, logger log.Logger, reg *prometheus.Registry, tracer opentracing.Tracer, _ bool) error { | ||
peer, err := newPeerFn(logger, reg, false, "", false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can actualy remove gossip stuff if you want, up to you - making sure deprecated stuff is in new component is quite not necessary. (: Not a blocker
cmd/thanos/receive.go
Outdated
dataDir string, | ||
peer cluster.Peer, | ||
) error { | ||
level.Info(logger).Log("msg", "setting up receive") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's kill this message, below as setting up receive; this component is ...
cmd/thanos/receive.go
Outdated
select { | ||
case <-dbOpen: | ||
break | ||
// In case a shutdown is initiated before the dbOpen is released |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing trailing period in comment (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I would remove this, I think this is quite clear.
cmd/thanos/receive.go
Outdated
level.Debug(logger).Log("msg", "setting up grpc server") | ||
{ | ||
var ( | ||
logger = log.With(logger, "component", "receiver") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not doing this in very top?
@@ -204,6 +204,7 @@ github.com/onsi/ginkgo v1.6.0 h1:Ix8l273rp3QzYgXSR+c8d1fTG7UPgYkOSELPhiY/YGw= | |||
github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= | |||
github.com/onsi/gomega v1.4.1 h1:PZSj/UFNaVp3KxrzHOcS7oyuWA7LoOY/77yCTEFu21U= | |||
github.com/onsi/gomega v1.4.1/go.mod h1:C1qb7wdrVGGVU+Z6iS04AVkA3Q65CEZX59MT0QO5uiA= | |||
github.com/opentracing-contrib/go-stdlib v0.0.0-20170113013457-1de4cc2120e7 h1:8KbikWulLUcMM96hBxjgoo6gTmCkG6HYSDohv/WygYU= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modules ❤️
pkg/receive/handler.go
Outdated
} | ||
|
||
readyf := h.testReady | ||
router.Post("/receive", readyf(h.receive)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so this starts to be some API. We already keep Prometheus like API to guide people and be intuitive. This means that remote receiver would be nice to use the same or similar... does Prometheus serves remote write endpoint? What endpoint looks like? api/v1/write
? As a basics something like v1
in path is useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tend to agree but flexible on the url ... api/v1/receive
feels more correct to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed with the prefix, going with /api/v1/receive
, no strong opinion though
pkg/receive/handler.go
Outdated
ReadTimeout: h.options.ReadTimeout, | ||
} | ||
|
||
errCh := make(chan error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not addressed
1ab4011
to
1a7ea0c
Compare
57c0542
to
d459673
Compare
d459673
to
6cb54d4
Compare
I believe I addressed everything:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, all LGTM from my side.
@@ -17,25 +17,37 @@ import ( | |||
"gopkg.in/alecthomas/kingpin.v2" | |||
) | |||
|
|||
func regCommonServerFlags(cmd *kingpin.CmdClause) ( | |||
func regGRPCFlags(cmd *kingpin.CmdClause) ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hahah I was starring at monitor 1m and thinking why you removed Gossip for everyone in this PR... then seen that actually you added this function.... Github troll =D
Changes
This adds the very first pieces of the proposal to add receiving remote-write support to Thanos. The features are just having a single tsdb that remote write requests are written into and that the component joins a mesh network to enable querying. All other features are still outstanding but can be worked on in parallel once this lands.
Verification
As this is very small and primarily hooks up existing components, the extend of my verification is that I added this new component to the quickstart script and verified that indeed data replicated to the endpoint receiving remote-write calls can be queried via the Thanos querier. Happy to add this to the e2e test suite though.
@bwplotka @domgreen
cc @metalmatze @mxinden @squat @s-urbaniak @ant31