-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement stateless mode for Thanos Ruler (continue #4250) #4731
Conversation
prometheus/prometheus#9109 requires some modification on the vendored grafana agent wal package.
Maybe let's wait for the upgrade from the agent side? @rfratto |
Hey y'all, I wanted to give a quick update on this. I've put effort into updating the Grafana Agent's Prometheus dependency this week, but it's going to take some more time. The bump also requires a bump to github.com/prometheus/common, which is currently blocked by a few of the exporters we embed still using the (since removed) github.com/prometheus/common/log package. Such is the life of a project that has a lot of dependencies :) We've definitely made progress here, but it might take another week before I can update the WAL code. I'll still tag y'all when we have something. |
3a79087
to
2390e96
Compare
This pr is ready for review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some nits and questions. Good work! Do we still need pkg/rules/remotewrite/{wal,series}.go
with prometheus/prometheus#8785 merged? ;P It says:
// TODO(idoqo): Migrate to prometheus package when https://github.com/prometheus/prometheus/pull/8785 is ready.
// TestRule_CanPersistWALData checks that in stateless mode, Thanos Ruler can persist rule evaluations | ||
// which couldn't be sent to the remote write endpoint (e.g because receiver isn't available). | ||
func TestRule_CanPersistWALData(t *testing.T) { | ||
//TODO: Implement test with unavailable remote-write endpoint(receiver) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to implement this before merging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is not a blocker as we are using the upstream prom agent code already so data should be persisted. Not sure how to implement this right now though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
5979b10
to
23988c5
Compare
It seems that we can totally remove the |
be3ab7e
to
ae51c39
Compare
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
This removes the need to pass a separate `remote-write` flag to ruler to enable stateless mode. Instead, we now check if a remote-write config is provided and automatically enables stateless mode based off that. Ruler test is also cleaned up to remove unnecessary tests (i.e those that have been performed by other e2e suites). Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: yeya24 <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
ae51c39
to
fb0dfba
Compare
@GiedriusS Let me know if there is anything left in this pr. If it is small nits then I guess we can clean up later. 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few very small nits but after that, we can merge this IMO 👍
cmd/thanos/rule.go
Outdated
}, func(error) { | ||
close(done) | ||
}) | ||
db, err := agent.Open(logger, reg, remoteStore, walDir, agentOpts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please call this something other than db
to avoid shadowing the outer variable? This is confusing and I had to double check this place 😂 I suggest agentDB
or something like that, and to rename the outer db
variable into tsdbDB
.
cmd/thanos/rule.go
Outdated
grpcserver.WithTLSConfig(tlsCfg), | ||
} | ||
if db != nil { | ||
tsdbStore := store.NewTSDBStore(logger, db, component.Rule, conf.lset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future maybe we could also answer StoreAPI queries by using the data in the WAL? Perhaps worth adding a TODO here 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this. If the agent is able to query head data, then what's the difference between the agent and a prometheus with very low retention? I am not sure whether this is in their roadmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point 👍
// TestRule_CanPersistWALData checks that in stateless mode, Thanos Ruler can persist rule evaluations | ||
// which couldn't be sent to the remote write endpoint (e.g because receiver isn't available). | ||
func TestRule_CanPersistWALData(t *testing.T) { | ||
//TODO: Implement test with unavailable remote-write endpoint(receiver) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Signed-off-by: Ben Ye <[email protected]>
Hi @GiedriusS, I have addressed the comments in the latest commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the end, it became a very small pull request with all of the agent code in upstream Prometheus 💪 awesome!
Great work! 🎉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work @yeya24 !
Changes
This pr is based on top of the awesome pr #4250 by @idoqo and fixes E2E tests.
Verification
E2E tests should pass.