Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement stateless mode for Thanos Ruler #4250

Closed
wants to merge 18 commits into from

Conversation

idoqo
Copy link
Contributor

@idoqo idoqo commented May 19, 2021

This implements the stateless mode for ruler described in the Scalable Rule Storage proposal

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

test/e2e/rule_test.go Outdated Show resolved Hide resolved
}

//nolint:staticcheck
a.w.bufPool.Put(buf)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SA6002: argument should be pointer-like to avoid allocations
(at-me in a reply with help or ignore)

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid work!

This is actually super close to be done 💪🏽 🚀 Thanks!

test/e2e/rule_test.go Outdated Show resolved Hide resolved
test/e2e/rule_test.go Outdated Show resolved Hide resolved
test/e2e/rule_test.go Outdated Show resolved Hide resolved
test/e2e/rule_test.go Outdated Show resolved Hide resolved
test/e2e/rule_test.go Outdated Show resolved Hide resolved
}

func (rc *ruleRWConfig) registerFlag(cmd extflag.FlagClause) *ruleRWConfig {
cmd.Flag("remote-write", "If true, directs ruler to remote-write evaluated samples to the server configured by 'remote-write.config'.").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check if config is provided and only then enable remote write? ;p

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important! User will expect remote write can be added to the TSDB ruler also.. We need to make sure that is not the case, and if you provide remote write it will make ruler stateless (except the WAL buffer)

cmd/thanos/rule.go Outdated Show resolved Hide resolved
pkg/rules/remotewrite/remotewrite.go Outdated Show resolved Hide resolved
if err != nil {
return nil, fmt.Errorf("failed applying config to remote storage: %w", err)
}
fanoutStorage := storage.NewFanout(logger, walStore, remoteStore)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/rules/remotewrite/series.go Show resolved Hide resolved
test/e2e/rule_test.go Outdated Show resolved Hide resolved
@idoqo idoqo mentioned this pull request May 27, 2021
1 task
@idoqo idoqo force-pushed the remote-write-rule branch from e38865c to 17df499 Compare June 1, 2021 01:19
scripts/quickstart.sh Outdated Show resolved Hide resolved
series []record.RefSeries
}

func (c *walDataCollector) Append(samples []record.RefSample) bool {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

U1000: func (*walDataCollector).Append is unused
(at-me in a reply with help or ignore)

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just minor nits, tests, fixing build errors and metrics (:

@@ -75,6 +78,9 @@ type ruleConfig struct {
alertQueryURL *url.URL
alertRelabelConfigYAML []byte

rwConfig ruleRWConfig
rwConfigYAML []byte
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

@@ -75,6 +78,9 @@ type ruleConfig struct {
alertQueryURL *url.URL
alertRelabelConfigYAML []byte

rwConfig ruleRWConfig
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rwConfig ruleRWConfig
rwConfig *extflag.PathOrContent

@@ -165,6 +172,14 @@ func registerRule(app *extkingpin.App) {
return errors.New("--query/--query.sd-files and --query.config* parameters cannot be defined at the same time")
}

// Parse and check remote-write config and enable stateless mode for ruler.
if conf.rwConfig.configPath != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the user passes empty configuration?

cmd/thanos/rule.go Outdated Show resolved Hide resolved
@@ -165,6 +172,14 @@ func registerRule(app *extkingpin.App) {
return errors.New("--query/--query.sd-files and --query.config* parameters cannot be defined at the same time")
}

// Parse and check remote-write config and enable stateless mode for ruler.
if conf.rwConfig.configPath != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this

}
var rwCfg remotewrite.Config
if len(conf.rwConfigYAML) == 0 {
return errors.New("no --remote-write.config was given")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return errors.New("no --remote-write.config was given")
return errors.New("empty --remote-write.config or empty file via --remote-write.config-file was given")

pkg/rules/remotewrite/remotewrite.go Outdated Show resolved Hide resolved
GlobalConfig: config.DefaultGlobalConfig,
RemoteWriteConfigs: []*config.RemoteWriteConfig{rwConfig.RemoteStore},
}); err != nil {
return nil, fmt.Errorf("failed applying config to remote storage: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("failed applying config to remote storage: %w", err)
return nil, errros.Wrapf(err, "applying config to remote storage")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Thanos still prefering errors.Wrapf over fmt.Errorf with %w?

func newStorageMetrics(r prometheus.Registerer) *storageMetrics {
m := storageMetrics{r: r}
m.numActiveSeries = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "agent_wal_storage_active_series",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Name: "agent_wal_storage_active_series",
Name: "wal_storage_active_series",

Let's start with wal...? Let's remove agent

test/e2e/rule_test.go Outdated Show resolved Hide resolved
if err := removeLockfileIfAny(logger, conf.dataDir); err != nil {
return errors.Wrap(err, "remove storage lock files")
}
if len(rwCfgYAML) > 0 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@squat @bwplotka

In this case though, an empty --remote-write.config (or --remote-write.config-file)will fall back to using the TSDB-backed ruler. This is because len(rwCfgYAML) == 0 either means:

  • The flag was provided but it's an empty file or empty string OR
  • The flag was not provided at all.

I couldn't think of a reliable way to distinguish between them. I've also documented this behavior in the flag help, but I don't know if it is something we want. An alternative would be to provide a separate flag for enabling stateless mode (e.g --remote.write or --remote-write.enabled).

@idoqo idoqo force-pushed the remote-write-rule branch from c01154d to 7d6095c Compare June 10, 2021 14:00
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fix the conflict and linting issue.

@idoqo idoqo force-pushed the remote-write-rule branch from 7d6095c to 255f824 Compare July 6, 2021 00:33
}

//nolint:staticcheck
samplesPool.Put(v)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SA6002: argument should be pointer-like to avoid allocations
(at-me in a reply with help or ignore)

@bill3tt
Copy link
Contributor

bill3tt commented Jul 9, 2021

@idoqo feels like we are very close to being ready to rock & roll 🚀 just a couple of linters & conflicts to fix

@GiedriusS
Copy link
Member

@idoqo any updates?

This removes the need to pass a separate `remote-write` flag to ruler
to enable stateless mode. Instead, we now check if a remote-write config is provided
and automatically enables stateless mode based off that.

Ruler test is also cleaned up to remove unnecessary
tests (i.e those that have been performed by other e2e suites).

Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Signed-off-by: Michael Okoko <[email protected]>
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw I got panic at line 559. In the case of using remote write storage, db will be nil and cause panic later.

	tsdbStore := store.NewTSDBStore(logger, db, component.Rule, conf.lset)

We need to implement a TSDBReader based on the local Wal store and the remote storage as well. However, I don't think the two stores have real Queryable implementation so currently it is still not working.

cmd/thanos/rule.go Outdated Show resolved Hide resolved
@idoqo idoqo requested a review from yeya24 September 18, 2021 06:18
@yeya24
Copy link
Contributor

yeya24 commented Sep 18, 2021

@idoqo The failling E2E test is related to your changes. Can you take a look? Also please resolve the conflict.

@@ -522,7 +556,7 @@ func runRule(
)

// Start gRPC server.
{
if db != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I missed it last time. Even if it is stateless mode, we still need to start the gRPC server because we need the prober and the gRPC rules feature, we can just remove the StoreServer in this case.

Example code:

	// Start gRPC server.
	tlsCfg, err := tls.NewServerConfig(log.With(logger, "protocol", "gRPC"), conf.grpc.tlsSrvCert, conf.grpc.tlsSrvKey, conf.grpc.tlsSrvClientCA)
	if err != nil {
		return errors.Wrap(err, "setup gRPC server")
	}

	options := []grpcserver.Option{
		grpcserver.WithServer(thanosrules.RegisterRulesServer(ruleMgr)),
		grpcserver.WithListen(conf.grpc.bindAddress),
		grpcserver.WithGracePeriod(time.Duration(conf.grpc.gracePeriod)),
		grpcserver.WithTLSConfig(tlsCfg),
	}
	if db != nil {
		tsdbStore := store.NewTSDBStore(logger, db, component.Rule, conf.lset)
		options = append(options, grpcserver.WithServer(store.RegisterStoreServer(tsdbStore)))
	}
	// TODO: Add rules API implementation when ready.
	s := grpcserver.New(logger, reg, tracer, grpcLogOpts, tagOpts, comp, grpcProbe, options...)

	g.Add(func() error {
		statusProber.Ready()
		return s.ListenAndServe()
	}, func(err error) {
		statusProber.NotReady(err)
		s.Shutdown(err)
	})

@yeya24
Copy link
Contributor

yeya24 commented Oct 1, 2021

Hi @idoqo, are you still willing to finish this pr? Should be close to merge I think. If you are busy with other work then I am happy to take over.

@idoqo
Copy link
Contributor Author

idoqo commented Oct 1, 2021

Hi @idoqo, are you still willing to finish this pr? Should be close to merge I think. If you are busy with other work then I am happy to take over.

Hi @yeya24, I'm a bit held up by exams atm so, sure

GiedriusS pushed a commit that referenced this pull request Nov 8, 2021
* import wal package from grafana/agent (with string interning disabled)

Signed-off-by: Michael Okoko <[email protected]>

* Set up remote-write config and test skeleton

Signed-off-by: Michael Okoko <[email protected]>

* Setup fanout and related storages for stateless ruler

Signed-off-by: Michael Okoko <[email protected]>

* Optionally run ruler in stateless mode

Signed-off-by: Michael Okoko <[email protected]>

* Set up tests and implementations for configuring remote-write for ruler

Signed-off-by: Michael Okoko <[email protected]>

* Implement stub querier for WAL storage to fix nil pointer error

Signed-off-by: Michael Okoko <[email protected]>

* Setup e2e test for stateless ruler

Signed-off-by: Michael Okoko <[email protected]>

* Add copied code commentary to remotewrite packages

Signed-off-by: Michael Okoko <[email protected]>

* Use static addresses for am and querier

Signed-off-by: Michael Okoko <[email protected]>

* Remove need for separate remote-write flag for stateless ruler

This removes the need to pass a separate `remote-write` flag to ruler
to enable stateless mode. Instead, we now check if a remote-write config is provided
and automatically enables stateless mode based off that.

Ruler test is also cleaned up to remove unnecessary
tests (i.e those that have been performed by other e2e suites).

Signed-off-by: Michael Okoko <[email protected]>

* Generate docs for stateless ruler flags and fix tests

Signed-off-by: Michael Okoko <[email protected]>

* Use promauto for prometheus primitives

Signed-off-by: Michael Okoko <[email protected]>

* Group imports and satisfy go-lint

Signed-off-by: Michael Okoko <[email protected]>

* Always return empty series set from WAL storage

Signed-off-by: Michael Okoko <[email protected]>

* re-generate rule documentation

Signed-off-by: Michael Okoko <[email protected]>

* copyright headers to satisfy golint

Signed-off-by: Michael Okoko <[email protected]>

* Rename wal storage metrics

Signed-off-by: Michael Okoko <[email protected]>

* Use Prometheus' remote write config instead of rolling another

Signed-off-by: Michael Okoko <[email protected]>

* Fix E2E tests

Signed-off-by: Ben Ye <[email protected]>

* add changelog

Signed-off-by: yeya24 <[email protected]>

* remove wal related tests

Signed-off-by: Ben Ye <[email protected]>

* fix e2e

Signed-off-by: Ben Ye <[email protected]>

* remove unused structs in tests

Signed-off-by: Ben Ye <[email protected]>

* add licence header

Signed-off-by: Ben Ye <[email protected]>

* use upstream agent package

Signed-off-by: Ben Ye <[email protected]>

* address comments

Signed-off-by: Ben Ye <[email protected]>

Co-authored-by: Michael Okoko <[email protected]>
@yeya24
Copy link
Contributor

yeya24 commented Nov 8, 2021

Thanks for the work. Close it in favor of #4731

@yeya24 yeya24 closed this Nov 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants