Skip to content

Commit

Permalink
store: Start metric and status probe HTTP server as earlier as possib…
Browse files Browse the repository at this point in the history
…le (#1656)

* Start metric and status probe server as soon as possible

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update changelog

Signed-off-by: Kemal Akkoyun <[email protected]>

* Schedule a separate goroutine to start server

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add InitSync to the rungroup

Signed-off-by: Kemal Akkoyun <[email protected]>

* Fix linter pointed issues

Signed-off-by: Kemal Akkoyun <[email protected]>

* Move InitSync to alreay existed run.Group

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove unnecessary changes and update CHANGELOG

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add simple explanation for probes

Signed-off-by: Kemal Akkoyun <[email protected]>

* Make requested changes

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update CHANGELOG.md

Co-Authored-By: Martin Chodur <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Giedrius Statkevičius <[email protected]>
  • Loading branch information
2 people authored and GiedriusS committed Oct 28, 2019
1 parent e273cdf commit 1d8dc4e
Show file tree
Hide file tree
Showing 3 changed files with 54 additions and 35 deletions.
16 changes: 10 additions & 6 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,10 @@ We use *breaking* word for marking changes that are not backward compatible (rel

- [#1660](https://github.com/thanos-io/thanos/pull/1660) Add a new `--prometheus.ready_timeout` CLI option to the sidecar to set how long to wait until Prometheus starts up.

### Fixed

- [#1656](https://github.com/thanos-io/thanos/pull/1656) Thanos Store now starts metric and status probe HTTP server earlier in its start-up sequence. `/-/healthy` endpoint now starts to respond with success earlier. `/metrics` endpoint starts serving metrics earlier as well. Make sure to point your readiness probes to the `/-/ready` endpoint rather than `/metrics`.

## [v0.8.1](https://github.com/thanos-io/thanos/releases/tag/v0.8.1) - 2019.10.14

### Fixed
Expand All @@ -23,12 +27,12 @@ We use *breaking* word for marking changes that are not backward compatible (rel
* NOTE: `thanos_store_nodes_grpc_connections` metric is now per `external_labels` and `store_type`. It is a recommended metric for Querier storeAPIs. `thanos_store_node_info` is marked as obsolete and will be removed in next release.
* NOTE2: Store Gateway is now advertising artificial: `"@thanos_compatibility_store_type=store"` label. This is to have the current Store Gateway compatible with Querier pre v0.8.0.
This label can be disabled by hidden `debug.advertise-compatibility-label=false` flag on Store Gateway.

## [v0.8.0](https://github.com/thanos-io/thanos/releases/tag/v0.8.0) - 2019.10.10

Lot's of improvements this release! Noteworthy items:
- First Katacoda tutorial! 🐱
- Fixed Deletion order causing Compactor to produce not needed 👻 blocks with missing random files.
- Fixed Deletion order causing Compactor to produce not needed 👻 blocks with missing random files.
- Store GW memory improvements (more to come!).
- Querier allows multiple deduplication labels.
- Both Compactor and Store Gateway can be **sharded** within the same bucket using relabelling!
Expand All @@ -42,7 +46,7 @@ both Prometheus and sidecar with Thanos: https://prometheus.io/blog/2019/10/10/r

- [#1619](https://github.com/thanos-io/thanos/pull/1619) Thanos sidecar allows to limit min time range for data it exposes from Prometheus.
- [#1583](https://github.com/thanos-io/thanos/pull/1583) Thanos sharding:
- Add relabel config (`--selector.relabel-config-file` and `selector.relabel-config`) into Thanos Store and Compact components.
- Add relabel config (`--selector.relabel-config-file` and `selector.relabel-config`) into Thanos Store and Compact components.
Selecting blocks to serve depends on the result of block labels relabeling.
- For store gateway, advertise labels from "approved" blocks.
- [#1540](https://github.com/thanos-io/thanos/pull/1540) Thanos Downsample added `/-/ready` and `/-/healthy` endpoints.
Expand All @@ -55,8 +59,8 @@ Selecting blocks to serve depends on the result of block labels relabeling.
- [#1362](https://github.com/thanos-io/thanos/pull/1362) Optional `replicaLabels` param for `/query` and
`/query_range` querier endpoints. When provided overwrite the `query.replica-label` cli flags.
- [#1482](https://github.com/thanos-io/thanos/pull/1482) Thanos now supports Elastic APM as tracing provider.
- [#1612](https://github.com/thanos-io/thanos/pull/1612) Thanos Rule added `resendDelay` flag.
- [#1480](https://github.com/thanos-io/thanos/pull/1480) Thanos Receive flushes storage on hashring change.
- [#1612](https://github.com/thanos-io/thanos/pull/1612) Thanos Rule added `resendDelay` flag.
- [#1480](https://github.com/thanos-io/thanos/pull/1480) Thanos Receive flushes storage on hashring change.
- [#1613](https://github.com/thanos-io/thanos/pull/1613) Thanos Receive now traces forwarded requests.

### Changed
Expand All @@ -76,7 +80,7 @@ once for multiple deduplication labels like: `--query.replica-label=prometheus_r
- [#1544](https://github.com/thanos-io/thanos/pull/1544) Iterating over object store is resilient to the edge case for some providers.
- [#1469](https://github.com/thanos-io/thanos/pull/1469) Fixed Azure potential failures (EOF) when requesting more data then blob has.
- [#1512](https://github.com/thanos-io/thanos/pull/1512) Thanos Store fixed memory leak for chunk pool.
- [#1488](https://github.com/thanos-io/thanos/pull/1488) Thanos Rule now now correctly links to query URL from rules and alerts.
- [#1488](https://github.com/thanos-io/thanos/pull/1488) Thanos Rule now now correctly links to query URL from rules and alerts.

## [v0.7.0](https://github.com/thanos-io/thanos/releases/tag/v0.7.0) - 2019.09.02

Expand Down
59 changes: 33 additions & 26 deletions cmd/thanos/store.go
Original file line number Diff line number Diff line change
Expand Up @@ -126,7 +126,11 @@ func runStore(
selectorRelabelConf *extflag.PathOrContent,
advertiseCompatibilityLabel bool,
) error {
// Initiate HTTP listener providing metrics endpoint and readiness/liveness probes.
statusProber := prober.NewProber(component, logger, prometheus.WrapRegistererWithPrefix("thanos_", reg))
if err := scheduleHTTPServer(g, logger, reg, statusProber, httpBindAddr, nil, component); err != nil {
return errors.Wrap(err, "schedule HTTP server")
}

confContentYaml, err := objStoreConfig.Content()
if err != nil {
Expand Down Expand Up @@ -185,29 +189,35 @@ func runStore(
return errors.Wrap(err, "create object storage store")
}

begin := time.Now()
level.Debug(logger).Log("msg", "initializing bucket store")
if err := bs.InitialSync(context.Background()); err != nil {
return errors.Wrap(err, "bucket store initial sync")
}
level.Debug(logger).Log("msg", "bucket store ready", "init_duration", time.Since(begin).String())

ctx, cancel := context.WithCancel(context.Background())
g.Add(func() error {
defer runutil.CloseWithLogOnErr(logger, bkt, "bucket client")

err := runutil.Repeat(syncInterval, ctx.Done(), func() error {
if err := bs.SyncBlocks(ctx); err != nil {
level.Warn(logger).Log("msg", "syncing blocks failed", "err", err)
// bucketStoreReady signals when bucket store is ready.
bucketStoreReady := make(chan struct{})
{
ctx, cancel := context.WithCancel(context.Background())
g.Add(func() error {
defer runutil.CloseWithLogOnErr(logger, bkt, "bucket client")

level.Info(logger).Log("msg", "initializing bucket store")
begin := time.Now()
if err := bs.InitialSync(ctx); err != nil {
close(bucketStoreReady)
return errors.Wrap(err, "bucket store initial sync")
}
return nil
level.Info(logger).Log("msg", "bucket store ready", "init_duration", time.Since(begin).String())
close(bucketStoreReady)

err := runutil.Repeat(syncInterval, ctx.Done(), func() error {
if err := bs.SyncBlocks(ctx); err != nil {
level.Warn(logger).Log("msg", "syncing blocks failed", "err", err)
}
return nil
})

runutil.CloseWithLogOnErr(logger, bs, "bucket store")
return err
}, func(error) {
cancel()
})

runutil.CloseWithLogOnErr(logger, bs, "bucket store")
return err
}, func(error) {
cancel()
})
}

l, err := net.Listen("tcp", grpcBindAddr)
if err != nil {
Expand All @@ -221,17 +231,14 @@ func runStore(
s := newStoreGRPCServer(logger, reg, tracer, bs, opts)

g.Add(func() error {
level.Info(logger).Log("msg", "Listening for StoreAPI gRPC", "address", grpcBindAddr)
<-bucketStoreReady
level.Info(logger).Log("msg", "listening for StoreAPI gRPC", "address", grpcBindAddr)
statusProber.SetReady()
return errors.Wrap(s.Serve(l), "serve gRPC")
}, func(error) {
s.Stop()
})

if err := scheduleHTTPServer(g, logger, reg, statusProber, httpBindAddr, nil, component); err != nil {
return errors.Wrap(err, "schedule HTTP server")
}

level.Info(logger).Log("msg", "starting store node")
return nil
}
Expand Down
14 changes: 11 additions & 3 deletions docs/components/store.md
Original file line number Diff line number Diff line change
Expand Up @@ -122,11 +122,11 @@ Flags:

```

## Time based partioning
## Time based partitioning

By default Thanos Store Gateway looks at all the data in Object Store and returns it based on query's time range.

Thanos Store `--min-time`, `--max-time` flags allows you to shard Thanos Store based on constant time or time duration relative to current time.
Thanos Store `--min-time`, `--max-time` flags allows you to shard Thanos Store based on constant time or time duration relative to current time.

For example setting: `--min-time=-6w` & `--max-time==-2w` will make Thanos Store Gateway return metrics that fall within `now - 6 weeks` up to `now - 2 weeks` time range.

Expand All @@ -136,6 +136,14 @@ Thanos Store Gateway might not get new blocks immediately, as Time partitioning

We recommend having overlapping time ranges with Thanos Sidecar and other Thanos Store gateways as this will improve your resiliency to failures.

Thanos Querier deals with overlapping time series by merging them together.
Thanos Querier deals with overlapping time series by merging them together.

Filtering is done on a Chunk level, so Thanos Store might still return Samples which are outside of `--min-time` & `--max-time`.

## Probes

- Thanos Store exposes two endpoints for probing.
- `/-/healthy` starts as soon as initial setup completed.
- `/-/ready` starts after all the bootstrapping completed (e.g initial index building) and ready to serve traffic.

> NOTE: Metric endpoint starts immediately so, make sure you set up readiness probe on designated HTTP `/-/ready` path.

0 comments on commit 1d8dc4e

Please sign in to comment.