Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Added /-/healthy and /-/ready endpoints to all thanos components #656

Closed
wants to merge 8 commits into from
Closed
45 changes: 23 additions & 22 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ This version moved tarballs to Golang 1.12.5 from 1.11 as well, so same warning
### Added

- [#1094](https://github.com/improbable-eng/thanos/pull/1094) Allow configuring the response header timeout for the S3 client.
- [#644](https://github.com/improbable-eng/thanos/issues/644) Added `/-/healthy` and `/-/ready` endpoints to all node types.

### Changed

Expand Down Expand Up @@ -105,28 +106,28 @@ Using cadvisor `container_memory_usage_bytes` metric could be misleading e.g: ht
New options:

New Store flags:

* `--store.grpc.series-sample-limit` limits the amount of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Considering making it lower or bigger depending on the scale of your deployment.

New Store metrics:

* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.

New Store tracing span:
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.
- [#1016](https://github.com/improbable-eng/thanos/pull/1016) Added option for another DNS resolver (miekg/dns client).

- [#1016](https://github.com/improbable-eng/thanos/pull/1016) Added option for another DNS resolver (miekg/dns client).
Note that this is required to have SRV resolution working on [Golang 1.11+ with KubeDNS below v1.14](https://github.com/golang/go/issues/27546)

New Querier and Ruler flag: `-- store.sd-dns-resolver` which allows to specify resolver to use. Either `golang` or `miekgdns`

- [#986](https://github.com/improbable-eng/thanos/pull/986) Allow to save some startup & sync time in store gateway as it is no longer needed to compute index-cache from block index on its own for larger blocks.
The store Gateway still can do it, but it first checks bucket if there is index-cached uploaded already.
In the same time, compactor precomputes the index cache file on every compaction.
The store Gateway still can do it, but it first checks bucket if there is index-cached uploaded already.
In the same time, compactor precomputes the index cache file on every compaction.

New Compactor flag: `--index.generate-missing-cache-file` was added to allow quicker addition of index cache files. If enabled it precomputes missing files on compactor startup. Note that it will take time and it's only one-off step per bucket.

Expand All @@ -143,31 +144,31 @@ Note that this is required to have SRV resolution working on [Golang 1.11+ with
- [#1021](https://github.com/improbable-eng/thanos/pull/1021) Query API `series` now supports POST method.
- [#939](https://github.com/improbable-eng/thanos/pull/939) Query API `query_range` now supports POST method.

### Changed
### Changed

- [#970](https://github.com/improbable-eng/thanos/pull/970) Deprecated `partial_response_disabled` proto field. Added `partial_response_strategy` instead. Both in gRPC and Query API.
No `PartialResponseStrategy` field for `RuleGroups` by default means `abort` strategy (old PartialResponse disabled) as this is recommended option for Rules and alerts.

Metrics:

* Added `thanos_rule_evaluation_with_warnings_total` to Ruler.
* DNS `thanos_ruler_query_apis*` are now `thanos_ruler_query_apis_*` for consistency.
* DNS `thanos_querier_store_apis*` are now `thanos_querier_store_apis__*` for consistency.
* Query Gate `thanos_bucket_store_series*` are now `thanos_bucket_store_series_*` for consistency.
* Most of thanos ruler metris related to rule manager has `strategy` label.

Ruler tracing spans:

* `/rule_instant_query HTTP[client]` is now `/rule_instant_query_part_resp_abort HTTP[client]"` if request is for abort strategy.

- [#1009](https://github.com/improbable-eng/thanos/pull/1009): Upgraded Prometheus (~v2.7.0-rc.0 to v2.8.1) and TSDB (`v0.4.0` to `v0.6.1`) deps.

Changes that affects Thanos:
* query:
* [ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
* [ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
* query:
* [ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
* [ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
* [BUGFIX] Fix panic when aggregator param is not a literal. #5290
* ruler:
* ruler:
* [ENHANCEMENT] Reduce time that Alertmanagers are in flux when reloaded. #5126
* [BUGFIX] prometheus_rule_group_last_evaluation_timestamp_seconds is now a unix timestamp. #5186
* [BUGFIX] prometheus_rule_group_last_duration_seconds now reports seconds instead of nanoseconds. Fixes our [issue #1027](https://github.com/improbable-eng/thanos/issues/1027)
Expand All @@ -179,26 +180,26 @@ Note that this is required to have SRV resolution working on [Golang 1.11+ with
* [CHANGE] *breaking* Renamed flag `--sync-delay` to `--consistency-delay` [#1053](https://github.com/improbable-eng/thanos/pull/1053)

For ruler essentially whole TSDB CHANGELOG applies beween v0.4.0-v0.6.1: https://github.com/prometheus/tsdb/blob/master/CHANGELOG.md

Note that this was added on TSDB and Prometheus: [FEATURE] Time-ovelapping blocks are now allowed. #370
Whoever due to nature of Thanos compaction (distributed systems), for safety reason this is disabled for Thanos compactor for now.

- [#868](https://github.com/improbable-eng/thanos/pull/868) Go has been updated to 1.12.
- [#1055](https://github.com/improbable-eng/thanos/pull/1055) Gossip flags are now disabled by default and deprecated.
- [#1055](https://github.com/improbable-eng/thanos/pull/1055) Gossip flags are now disabled by default and deprecated.
- [#964](https://github.com/improbable-eng/thanos/pull/964) repair: Repair process now sorts the series and labels within block.
- [#1073](https://github.com/improbable-eng/thanos/pull/1073) Store: index cache for requests. It now calculates the size properly (includes slice header), has anti-deadlock safeguard and reports more metrics.

### Fixed

- [#921](https://github.com/improbable-eng/thanos/pull/921) `thanos_objstore_bucket_last_successful_upload_time` now does not appear when no blocks have been uploaded so far.
- [#966](https://github.com/improbable-eng/thanos/pull/966) Bucket: verify no longer warns about overlapping blocks, that overlap `0s`
- [#966](https://github.com/improbable-eng/thanos/pull/966) Bucket: verify no longer warns about overlapping blocks, that overlap `0s`
- [#848](https://github.com/improbable-eng/thanos/pull/848) Compact: now correctly works with time series with duplicate labels.
- [#894](https://github.com/improbable-eng/thanos/pull/894) Thanos Rule: UI now correctly shows evaluation time.
- [#865](https://github.com/improbable-eng/thanos/pull/865) Query: now properly parses DNS SRV Service Discovery.
- [#889](https://github.com/improbable-eng/thanos/pull/889) Store: added safeguard against merging posting groups segfault
- [#941](https://github.com/improbable-eng/thanos/pull/941) Sidecar: added better handling of intermediate restarts.
- [#933](https://github.com/improbable-eng/thanos/pull/933) Query: Fixed 30 seconds lag of adding new store to query.
- [#962](https://github.com/improbable-eng/thanos/pull/962) Sidecar: Make config reloader file writes atomic.
- [#962](https://github.com/improbable-eng/thanos/pull/962) Sidecar: Make config reloader file writes atomic.
- [#982](https://github.com/improbable-eng/thanos/pull/982) Query: now advertises Min & Max Time accordingly to the nodes.
- [#1041](https://github.com/improbable-eng/thanos/issues/1038) Ruler is now able to return long time range queries.
- [#904](https://github.com/improbable-eng/thanos/pull/904) Compact: Skip compaction for blocks with no samples.
Expand Down
4 changes: 2 additions & 2 deletions cmd/thanos/bucket.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,13 +20,13 @@ import (
"github.com/oklog/run"
"github.com/oklog/ulid"
"github.com/olekukonko/tablewriter"
opentracing "github.com/opentracing/opentracing-go"
"github.com/opentracing/opentracing-go"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/tsdb/labels"
"golang.org/x/text/language"
"golang.org/x/text/message"
kingpin "gopkg.in/alecthomas/kingpin.v2"
"gopkg.in/alecthomas/kingpin.v2"
)

var (
Expand Down
24 changes: 13 additions & 11 deletions cmd/thanos/compact.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,15 +18,17 @@ import (
"github.com/improbable-eng/thanos/pkg/block/metadata"
"github.com/improbable-eng/thanos/pkg/compact"
"github.com/improbable-eng/thanos/pkg/compact/downsample"
"github.com/improbable-eng/thanos/pkg/component"
"github.com/improbable-eng/thanos/pkg/objstore"
"github.com/improbable-eng/thanos/pkg/objstore/client"
"github.com/improbable-eng/thanos/pkg/prober"
"github.com/improbable-eng/thanos/pkg/runutil"
"github.com/oklog/run"
opentracing "github.com/opentracing/opentracing-go"
"github.com/opentracing/opentracing-go"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/tsdb"
kingpin "gopkg.in/alecthomas/kingpin.v2"
"gopkg.in/alecthomas/kingpin.v2"
)

var (
Expand Down Expand Up @@ -125,7 +127,6 @@ func registerCompact(m map[string]setupFunc, app *kingpin.Application, name stri
compact.ResolutionLevel5m: time.Duration(*retention5m),
compact.ResolutionLevel1h: time.Duration(*retention1h),
},
name,
*disableDownsampling,
*maxCompactionLevel,
*blockSyncConcurrency,
Expand All @@ -147,7 +148,6 @@ func runCompact(
wait bool,
generateMissingIndexCacheFiles bool,
retentionByResolution map[compact.ResolutionLevel]time.Duration,
component string,
disableDownsampling bool,
maxCompactionLevel int,
blockSyncConcurrency int,
Expand All @@ -168,12 +168,12 @@ func runCompact(

confContentYaml, err := objStoreConfig.Content()
if err != nil {
return err
return errors.Wrap(err, "loading flag content")
}

bkt, err := client.NewBucket(logger, confContentYaml, reg, component)
bkt, err := client.NewBucket(logger, confContentYaml, reg, component.Compact.String())
if err != nil {
return err
return errors.Wrap(err, "initializing bucket")
}

// Ensure we close up everything properly.
Expand Down Expand Up @@ -207,6 +207,11 @@ func runCompact(
return errors.Wrap(err, "create compactor")
}

readinessProber := prober.NewProber(component.Compact, logger)
if err := defaultHTTPListener(g, logger, reg, httpBindAddr, readinessProber); err != nil {
return errors.Wrap(err, "create readiness prober")
FUSAKLA marked this conversation as resolved.
Show resolved Hide resolved
}

var (
compactDir = path.Join(dataDir, "compact")
downsamplingDir = path.Join(dataDir, "downsample")
Expand Down Expand Up @@ -313,11 +318,8 @@ func runCompact(
cancel()
})

if err := metricHTTPListenGroup(g, logger, reg, httpBindAddr); err != nil {
return err
}

level.Info(logger).Log("msg", "starting compact node")
readinessProber.SetReady()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true, as we are not rdy at this point. We are rdy once our HTTP listener is able to accept new connections ):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FUSAKLA what about this comment? It seems like this still hasn't been addressed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it was removed at one point but then added back again eventually.

There is the issue that if you have sidecar it uses the defaultHTTPListener which should have set it healthy and ready right away. But then the sidecar is not ready when it cannot get the labels from prometheus instance so from start is should not be ready and wait till the first successful labels fetch.

So I'm not sure how to do this. I need these two conditions to be met at once to say that the sidecar is ready right?

There were two variants of having each component handling it's ready status on it's own or move setting of the ready status to the generic defaultHTTPListener.

WDYT?

return nil
}

Expand Down
8 changes: 4 additions & 4 deletions cmd/thanos/downsample.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,12 @@ import (
"github.com/improbable-eng/thanos/pkg/runutil"
"github.com/oklog/run"
"github.com/oklog/ulid"
opentracing "github.com/opentracing/opentracing-go"
"github.com/opentracing/opentracing-go"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/tsdb"
"github.com/prometheus/tsdb/chunkenc"
kingpin "gopkg.in/alecthomas/kingpin.v2"
"gopkg.in/alecthomas/kingpin.v2"
)

func registerDownsample(m map[string]setupFunc, app *kingpin.Application, name string) {
Expand All @@ -50,12 +50,12 @@ func runDownsample(
) error {
confContentYaml, err := objStoreConfig.Content()
if err != nil {
return err
return errors.Wrap(err, "loading flag content")
}

bkt, err := client.NewBucket(logger, confContentYaml, reg, component.Downsample.String())
if err != nil {
return err
return errors.Wrap(err, "initializing bucket")
}

// Ensure we close up everything properly.
Expand Down
3 changes: 1 addition & 2 deletions cmd/thanos/flags.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ import (

"github.com/pkg/errors"
"github.com/prometheus/common/model"
kingpin "gopkg.in/alecthomas/kingpin.v2"
"gopkg.in/alecthomas/kingpin.v2"
)

func regGRPCFlags(cmd *kingpin.CmdClause) (
Expand Down Expand Up @@ -114,7 +114,6 @@ func regCommonObjStoreFlags(cmd *kingpin.CmdClause, suffix string, required bool
}
}


func regCommonTracingFlags(app *kingpin.Application) *pathOrContent {
fileFlagName := fmt.Sprintf("tracing.config-file")
contentFlagName := fmt.Sprintf("tracing.config")
Expand Down
14 changes: 9 additions & 5 deletions cmd/thanos/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,10 @@ import (
gprom "github.com/armon/go-metrics/prometheus"
"github.com/go-kit/kit/log"
"github.com/go-kit/kit/log/level"
"github.com/grpc-ecosystem/go-grpc-middleware"
grpc_middleware "github.com/grpc-ecosystem/go-grpc-middleware"
grpc_recovery "github.com/grpc-ecosystem/go-grpc-middleware/recovery"
"github.com/grpc-ecosystem/go-grpc-prometheus"
grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
"github.com/improbable-eng/thanos/pkg/prober"
"github.com/improbable-eng/thanos/pkg/runutil"
"github.com/improbable-eng/thanos/pkg/tracing"
"github.com/improbable-eng/thanos/pkg/tracing/client"
Expand Down Expand Up @@ -306,11 +307,12 @@ func defaultGRPCServerOpts(logger log.Logger, reg *prometheus.Registry, tracer o
return append(opts, grpc.Creds(credentials.NewTLS(tlsCfg))), nil
}

// metricHTTPListenGroup is a run.Group that servers HTTP endpoint with only Prometheus metrics.
func metricHTTPListenGroup(g *run.Group, logger log.Logger, reg *prometheus.Registry, httpBindAddr string) error {
// defaultHTTPListener is a run.Group that servers HTTP endpoint with only Prometheus metrics.
func defaultHTTPListener(g *run.Group, logger log.Logger, reg *prometheus.Registry, httpBindAddr string, readinessProber *prober.Prober) error {
mux := http.NewServeMux()
registerMetrics(mux, reg)
registerProfile(mux)
readinessProber.RegisterInMux(mux)

l, err := net.Listen("tcp", httpBindAddr)
if err != nil {
Expand All @@ -319,8 +321,10 @@ func metricHTTPListenGroup(g *run.Group, logger log.Logger, reg *prometheus.Regi

g.Add(func() error {
level.Info(logger).Log("msg", "Listening for metrics", "address", httpBindAddr)
readinessProber.SetHealthy()
FUSAKLA marked this conversation as resolved.
Show resolved Hide resolved
return errors.Wrap(http.Serve(l, mux), "serve metrics")
}, func(error) {
}, func(err error) {
readinessProber.SetNotHealthy(err)
runutil.CloseWithLogOnErr(logger, l, "metric listener")
})
return nil
Expand Down
79 changes: 79 additions & 0 deletions cmd/thanos/main_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
package main
FUSAKLA marked this conversation as resolved.
Show resolved Hide resolved

import (
"context"
"fmt"
"net/http"
"path"
"testing"
"time"

"github.com/go-kit/kit/log"
"github.com/improbable-eng/thanos/pkg/prober"
"github.com/improbable-eng/thanos/pkg/runutil"
"github.com/improbable-eng/thanos/pkg/testutil"
"github.com/oklog/run"
"github.com/prometheus/client_golang/prometheus"
)

type TestComponent struct {
name string
}

func (c TestComponent) String() string {
return c.name
}

func queryHTTPGetEndpoint(ctx context.Context, t *testing.T, logger log.Logger, url string) (*http.Response, error) {
req, err := http.NewRequest("GET", fmt.Sprintf("http://%s", url), nil)
testutil.Ok(t, err)
return http.DefaultClient.Do(req.WithContext(ctx))
}

func TestGenericHttpEndpoints(t *testing.T) {
var g run.Group
logger := log.NewNopLogger()
metricsRegistry := prometheus.NewRegistry()
component := TestComponent{name: "sidecar"}
ctx := context.Background()

freePort, err := testutil.FreePort()
testutil.Ok(t, err)

serverAddress := fmt.Sprintf("127.0.0.1:%d", freePort)

p := prober.NewProber(component, logger)
err = defaultHTTPListener(&g, logger, metricsRegistry, serverAddress, p)
testutil.Ok(t, err)
go func() { _ = g.Run() }()

testutil.Ok(t, runutil.Retry(time.Second, ctx.Done(), func() error {
resp, err := queryHTTPGetEndpoint(ctx, t, log.NewNopLogger(), path.Join(serverAddress, "/metrics"))
testutil.Ok(t, err)
testutil.Equals(t, 200, resp.StatusCode)
FUSAKLA marked this conversation as resolved.
Show resolved Hide resolved
return err
}))

testutil.Ok(t, runutil.Retry(time.Second, ctx.Done(), func() error {
resp, err := queryHTTPGetEndpoint(ctx, t, log.NewNopLogger(), path.Join(serverAddress, "/debug/pprof/"))
testutil.Ok(t, err)
testutil.Equals(t, 200, resp.StatusCode)
return err
}))

// Verify that defaultHTTPListener sets the prober healthy.
testutil.Ok(t, runutil.Retry(time.Second, ctx.Done(), func() error {
resp, err := queryHTTPGetEndpoint(ctx, t, log.NewNopLogger(), path.Join(serverAddress, "/-/healthy/"))
testutil.Ok(t, err)
testutil.Equals(t, 200, resp.StatusCode)
return err
}))

// Verify that defaultHTTPListener does not set the prober ready.
testutil.Ok(t, runutil.Retry(time.Second, ctx.Done(), func() error {
resp, err := queryHTTPGetEndpoint(ctx, t, log.NewNopLogger(), path.Join(serverAddress, "/-/ready/"))
testutil.Ok(t, err)
testutil.Equals(t, 503, resp.StatusCode)
return err
}))
}
Loading