store: Start metric and status probe HTTP server as earlier as possible #1656

Merged (10 commits) on Oct 18, 2019
16 changes: 10 additions & 6 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -15,6 +15,10 @@ We use *breaking* word for marking changes that are not backward compatible (rel

- [#1660](https://github.com/thanos-io/thanos/pull/1660) Add a new `--prometheus.ready_timeout` CLI option to the sidecar to set how long to wait until Prometheus starts up.

### Fixed

- [#1656](https://github.com/thanos-io/thanos/pull/1656) Thanos Store now starts metric and status probe HTTP server earlier in its start-up sequence. `/-/healthy` endpoint now starts to respond with success earlier. `/metrics` endpoint starts serving metrics earlier as well. Make sure to point your readiness probes to the `/-/ready` endpoint rather than `/metrics`.

## [v0.8.1](https://github.com/thanos-io/thanos/releases/tag/v0.8.1) - 2019.10.14

### Fixed
@@ -23,12 +27,12 @@ We use *breaking* word for marking changes that are not backward compatible (rel
* NOTE: `thanos_store_nodes_grpc_connections` metric is now per `external_labels` and `store_type`. It is a recommended metric for Querier storeAPIs. `thanos_store_node_info` is marked as obsolete and will be removed in next release.
* NOTE2: Store Gateway is now advertising artificial: `"@thanos_compatibility_store_type=store"` label. This is to have the current Store Gateway compatible with Querier pre v0.8.0.
This label can be disabled by hidden `debug.advertise-compatibility-label=false` flag on Store Gateway.

## [v0.8.0](https://github.com/thanos-io/thanos/releases/tag/v0.8.0) - 2019.10.10

Lots of improvements in this release! Noteworthy items:
- First Katacoda tutorial! 🐱
- Fixed Deletion order causing Compactor to produce not needed 👻 blocks with missing random files.
- Store GW memory improvements (more to come!).
- Querier allows multiple deduplication labels.
- Both Compactor and Store Gateway can be **sharded** within the same bucket using relabelling!
@@ -42,7 +46,7 @@ both Prometheus and sidecar with Thanos: https://prometheus.io/blog/2019/10/10/r

- [#1619](https://github.com/thanos-io/thanos/pull/1619) Thanos sidecar allows to limit min time range for data it exposes from Prometheus.
- [#1583](https://github.com/thanos-io/thanos/pull/1583) Thanos sharding:
- Add relabel config (`--selector.relabel-config-file` and `selector.relabel-config`) into Thanos Store and Compact components.
Selecting blocks to serve depends on the result of block labels relabeling.
- For store gateway, advertise labels from "approved" blocks.
- [#1540](https://github.com/thanos-io/thanos/pull/1540) Thanos Downsample added `/-/ready` and `/-/healthy` endpoints.
@@ -55,8 +59,8 @@ Selecting blocks to serve depends on the result of block labels relabeling.
- [#1362](https://github.com/thanos-io/thanos/pull/1362) Optional `replicaLabels` param for `/query` and
`/query_range` querier endpoints. When provided overwrite the `query.replica-label` cli flags.
- [#1482](https://github.com/thanos-io/thanos/pull/1482) Thanos now supports Elastic APM as tracing provider.
- [#1612](https://github.com/thanos-io/thanos/pull/1612) Thanos Rule added `resendDelay` flag.
- [#1480](https://github.com/thanos-io/thanos/pull/1480) Thanos Receive flushes storage on hashring change.
- [#1613](https://github.com/thanos-io/thanos/pull/1613) Thanos Receive now traces forwarded requests.

### Changed
@@ -76,7 +80,7 @@ once for multiple deduplication labels like: `--query.replica-label=prometheus_r
- [#1544](https://github.com/thanos-io/thanos/pull/1544) Iterating over object store is resilient to the edge case for some providers.
- [#1469](https://github.com/thanos-io/thanos/pull/1469) Fixed Azure potential failures (EOF) when requesting more data than the blob has.
- [#1512](https://github.com/thanos-io/thanos/pull/1512) Thanos Store fixed memory leak for chunk pool.
- [#1488](https://github.com/thanos-io/thanos/pull/1488) Thanos Rule now correctly links to query URL from rules and alerts.

## [v0.7.0](https://github.com/thanos-io/thanos/releases/tag/v0.7.0) - 2019.09.02

59 changes: 33 additions & 26 deletions cmd/thanos/store.go
@@ -126,7 +126,11 @@ func runStore(
 	selectorRelabelConf *extflag.PathOrContent,
 	advertiseCompatibilityLabel bool,
 ) error {
+	// Initiate HTTP listener providing metrics endpoint and readiness/liveness probes.
 	statusProber := prober.NewProber(component, logger, prometheus.WrapRegistererWithPrefix("thanos_", reg))
+	if err := scheduleHTTPServer(g, logger, reg, statusProber, httpBindAddr, nil, component); err != nil {
+		return errors.Wrap(err, "schedule HTTP server")
+	}

 	confContentYaml, err := objStoreConfig.Content()
 	if err != nil {
@@ -185,29 +189,35 @@ func runStore(
 		return errors.Wrap(err, "create object storage store")
 	}

-	begin := time.Now()
-	level.Debug(logger).Log("msg", "initializing bucket store")
-	if err := bs.InitialSync(context.Background()); err != nil {
-		return errors.Wrap(err, "bucket store initial sync")
-	}
-	level.Debug(logger).Log("msg", "bucket store ready", "init_duration", time.Since(begin).String())
-
-	ctx, cancel := context.WithCancel(context.Background())
-	g.Add(func() error {
-		defer runutil.CloseWithLogOnErr(logger, bkt, "bucket client")
-
-		err := runutil.Repeat(syncInterval, ctx.Done(), func() error {
-			if err := bs.SyncBlocks(ctx); err != nil {
-				level.Warn(logger).Log("msg", "syncing blocks failed", "err", err)
+	// bucketStoreReady signals when bucket store is ready.
+	bucketStoreReady := make(chan struct{})
+	{
+		ctx, cancel := context.WithCancel(context.Background())
+		g.Add(func() error {
+			defer runutil.CloseWithLogOnErr(logger, bkt, "bucket client")
+
+			level.Info(logger).Log("msg", "initializing bucket store")
+			begin := time.Now()
+			if err := bs.InitialSync(ctx); err != nil {
+				close(bucketStoreReady)
+				return errors.Wrap(err, "bucket store initial sync")
 			}
-			return nil
+			level.Info(logger).Log("msg", "bucket store ready", "init_duration", time.Since(begin).String())
+			close(bucketStoreReady)
+
+			err := runutil.Repeat(syncInterval, ctx.Done(), func() error {
+				if err := bs.SyncBlocks(ctx); err != nil {
+					level.Warn(logger).Log("msg", "syncing blocks failed", "err", err)
+				}
+				return nil
+			})
+
+			runutil.CloseWithLogOnErr(logger, bs, "bucket store")
+			return err
+		}, func(error) {
+			cancel()
 		})
-
-		runutil.CloseWithLogOnErr(logger, bs, "bucket store")
-		return err
-	}, func(error) {
-		cancel()
-	})
+	}

 	l, err := net.Listen("tcp", grpcBindAddr)
 	if err != nil {
@@ -221,17 +231,14 @@ func runStore(
 	s := newStoreGRPCServer(logger, reg, tracer, bs, opts)

 	g.Add(func() error {
-		level.Info(logger).Log("msg", "Listening for StoreAPI gRPC", "address", grpcBindAddr)
+		<-bucketStoreReady
+		level.Info(logger).Log("msg", "listening for StoreAPI gRPC", "address", grpcBindAddr)
 		statusProber.SetReady()
 		return errors.Wrap(s.Serve(l), "serve gRPC")
 	}, func(error) {
 		s.Stop()
 	})

-	if err := scheduleHTTPServer(g, logger, reg, statusProber, httpBindAddr, nil, component); err != nil {
-		return errors.Wrap(err, "schedule HTTP server")
-	}
-
 	level.Info(logger).Log("msg", "starting store node")
 	return nil
 }
14 changes: 11 additions & 3 deletions docs/components/store.md
@@ -122,11 +122,11 @@ Flags:

```

## Time based partitioning

By default Thanos Store Gateway looks at all the data in Object Store and returns it based on query's time range.

Thanos Store `--min-time` and `--max-time` flags allow you to shard Thanos Store based on a constant time or a time duration relative to the current time.

For example, setting `--min-time=-6w` and `--max-time=-2w` makes Thanos Store Gateway return metrics that fall within the `now - 6 weeks` to `now - 2 weeks` time range.

@@ -136,6 +136,14 @@ Thanos Store Gateway might not get new blocks immediately, as Time partitioning

We recommend having overlapping time ranges with Thanos Sidecar and other Thanos Store gateways as this will improve your resiliency to failures.

Thanos Querier deals with overlapping time series by merging them together.

Filtering is done on a Chunk level, so Thanos Store might still return Samples which are outside of `--min-time` & `--max-time`.

## Probes

- Thanos Store exposes two endpoints for probing:
  - `/-/healthy` starts responding with success as soon as the initial setup has completed.
  - `/-/ready` starts responding with success after all bootstrapping has completed (e.g. initial index building) and the store is ready to serve traffic.

> NOTE: The metrics endpoint starts immediately, so make sure you set up your readiness probe on the designated HTTP `/-/ready` path.
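
The difference between the two endpoints can be sketched with the Go standard library. This is an illustrative example only: the `probe` type and its methods are hypothetical and are not the Thanos `prober` package, and returning `503 Service Unavailable` before readiness is an assumption made for the sketch.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// probe is a hypothetical stand-in for a status prober.
type probe struct {
	ready int32 // 0 = not ready, 1 = ready; accessed atomically.
}

// setReady marks the component as ready to serve traffic.
func (p *probe) setReady() { atomic.StoreInt32(&p.ready, 1) }

// handler serves /-/healthy (always OK once the server is up) and
// /-/ready (OK only after setReady has been called).
func (p *probe) handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/-/ready", func(w http.ResponseWriter, _ *http.Request) {
		if atomic.LoadInt32(&p.ready) == 0 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// status issues an in-process GET against h and returns the response code.
func status(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	p := &probe{}
	h := p.handler()

	// Before bootstrapping finishes: healthy, but not ready.
	fmt.Println(status(h, "/-/healthy"), status(h, "/-/ready")) // 200 503

	p.setReady()
	fmt.Println(status(h, "/-/ready")) // 200
}
```

An orchestrator's liveness check would target `/-/healthy` and its readiness check `/-/ready`, so traffic is only routed once bootstrapping has finished.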