Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingester: allow only POST method on /ingester/shutdown #7707

Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,7 @@
* [CHANGE] Distributor: the metric `cortex_distributor_sample_delay_seconds` has been deprecated and will be removed in Mimir 2.14. #7516
* [CHANGE] Query-frontend: The deprecated YAML setting `frontend.cache_unaligned_requests` has been moved to `limits.cache_unaligned_requests`. #7519
* [CHANGE] Querier: the CLI flag `-querier.minimize-ingester-requests` has been moved from "experimental" to "advanced". #7638
* [CHANGE] Ingester: allow only POST method on `/ingester/shutdown`, as previously it was too easy to accidentally trigger through GET requests. At the same time, add an option to keep the existing behavior by introducing an `-get-request-for-ingester-shutdown-enabled` flag. This flag will be removed in Mimir 2.15. #7707
* [FEATURE] Introduce `-server.log-source-ips-full` option to log all IPs from `Forwarded`, `X-Real-IP`, `X-Forwarded-For` headers. #7250
* [FEATURE] Introduce `-tenant-federation.max-tenants` option to limit the max number of tenants allowed for requests when federation is enabled. #6959
* [FEATURE] Cardinality API: added a new `count_method` parameter which enables counting active label names. #7085
Expand Down
11 changes: 11 additions & 0 deletions cmd/mimir/config-descriptor.json
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,17 @@
"fieldType": "boolean",
"fieldCategory": "deprecated"
},
{
"kind": "field",
"name": "get_request_for_ingester_shutdown_enabled",
"required": false,
"desc": "Enable GET requests to the /ingester/shutdown endpoint to trigger an ingester shutdown. This is a potentially dangerous operation and should only be enabled consciously.",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "get-request-for-ingester-shutdown-enabled",
"fieldType": "boolean",
"fieldCategory": "deprecated"
},
{
"kind": "field",
"name": "alertmanager_http_prefix",
Expand Down
2 changes: 2 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
Original file line number Diff line number Diff line change
Expand Up @@ -1249,6 +1249,8 @@ Usage of ./cmd/mimir/mimir:
Set to true to enable all Go runtime metrics, such as go_sched_* and go_memstats_*.
-flusher.exit-after-flush
Stop after flush has finished. If false, process will keep running, doing nothing. (default true)
-get-request-for-ingester-shutdown-enabled
[deprecated] Enable GET requests to the /ingester/shutdown endpoint to trigger an ingester shutdown. This is a potentially dangerous operation and should only be enabled consciously. (default false)
-h
Print basic help.
-help
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,12 @@ api:
# CLI flag: -distributor.enable-otlp-metadata-storage
[enable_otel_metadata_translation: <boolean> | default = true]

# (deprecated) Enable GET requests to the /ingester/shutdown endpoint to
# trigger an ingester shutdown. This is a potentially dangerous operation and
# should only be enabled consciously.
# CLI flag: -get-request-for-ingester-shutdown-enabled
[get_request_for_ingester_shutdown_enabled: <boolean> | default = false]

# (advanced) HTTP URL path under which the Alertmanager ui and api will be
# served.
# CLI flag: -http.alertmanager-http-prefix
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ Complete the following steps to scale down ingesters deployed in a single zone.

1. Scale down one ingester at a time:

a. Invoke the `/ingester/shutdown` API endpoint on the ingester to terminate.
a. Send a POST request to the `/ingester/shutdown` API endpoint on the ingester to terminate.

b. Wait until the API endpoint call has successfully returned and the ingester logged "finished flushing and shipping TSDB blocks".

Expand All @@ -134,7 +134,7 @@ To simplify the scale down process, you can leverage ingesters deployed in multi

For each zone, complete the following steps:

1. Invoke the `/ingester/shutdown` API endpoint on all ingesters that you want to terminate.
1. Send a POST request to the `/ingester/shutdown` API endpoint on all ingesters that you want to terminate.
1. Wait until the API endpoint calls have successfully returned and the ingester has logged "finished flushing and shipping TSDB blocks".
1. Send a `SIGINT` or `SIGTERM` signal to the processes of the ingesters that you want to terminate.
1. Wait until the blocks uploaded by terminated ingesters are available for querying before proceeding with the next zone.
Expand Down
4 changes: 2 additions & 2 deletions docs/sources/mimir/references/http-api/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ This document groups API endpoints by service. Note that the API endpoints are e
| [HA tracker status](#ha-tracker-status) | Distributor | `GET /distributor/ha_tracker` |
| [Flush chunks / blocks](#flush-chunks--blocks) | Ingester | `GET,POST /ingester/flush` |
| [Prepare for Shutdown](#prepare-for-shutdown) | Ingester | `GET,POST,DELETE /ingester/prepare-shutdown` |
| [Shutdown](#shutdown) | Ingester | `GET,POST /ingester/shutdown` |
| [Shutdown](#shutdown) | Ingester | `POST /ingester/shutdown` |
| [Ingesters ring status](#ingesters-ring-status) | Distributor,Ingester | `GET /ingester/ring` |
| [Ingester tenants](#ingester-tenants) | Ingester | `GET /ingester/tenants` |
| [Ingester tenant TSDB](#ingester-tenant-tsdb) | Ingester | `GET /ingester/tsdb/{tenant}` |
Expand Down Expand Up @@ -409,7 +409,7 @@ This API endpoint is usually used by Kubernetes-specific scale down automations
### Shutdown

```
GET,POST /ingester/shutdown
POST /ingester/shutdown
```

This endpoint flushes in-memory time series data from ingesters to the long-term storage, and then shuts down the ingester service.
Expand Down
9 changes: 8 additions & 1 deletion pkg/api/api.go
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,9 @@ type Config struct {
// TODO: Remove option in Mimir 2.14.
EnableOtelMetadataStorage bool `yaml:"enable_otel_metadata_translation" category:"deprecated"`

// TODO: Remove option in Mimir 2.15.
GETRequestForIngesterShutdownEnabled bool `yaml:"get_request_for_ingester_shutdown_enabled" category:"deprecated"`

AlertmanagerHTTPPrefix string `yaml:"alertmanager_http_prefix" category:"advanced"`
PrometheusHTTPPrefix string `yaml:"prometheus_http_prefix" category:"advanced"`

Expand All @@ -71,6 +74,7 @@ type Config struct {
func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.BoolVar(&cfg.SkipLabelNameValidationHeader, "api.skip-label-name-validation-header-enabled", false, "Allows to skip label name validation via X-Mimir-SkipLabelNameValidation header on the http write path. Use with caution as it breaks PromQL. Allowing this for external clients allows any client to send invalid label names. After enabling it, requests with a specific HTTP header set to true will not have label names validated.")
f.BoolVar(&cfg.EnableOtelMetadataStorage, "distributor.enable-otlp-metadata-storage", true, "If true, store metadata when ingesting metrics via OTLP. This makes metric descriptions and types available for metrics ingested via OTLP.")
f.BoolVar(&cfg.GETRequestForIngesterShutdownEnabled, "get-request-for-ingester-shutdown-enabled", false, "Enable GET requests to the /ingester/shutdown endpoint to trigger an ingester shutdown. This is a potentially dangerous operation and should only be enabled consciously.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On re-review, this should probably be:

Suggested change
f.BoolVar(&cfg.GETRequestForIngesterShutdownEnabled, "get-request-for-ingester-shutdown-enabled", false, "Enable GET requests to the /ingester/shutdown endpoint to trigger an ingester shutdown. This is a potentially dangerous operation and should only be enabled consciously.")
f.BoolVar(&cfg.GETRequestForIngesterShutdownEnabled, "ingester.get-request-for-shutdown-enabled", false, "Enable GET requests to the /ingester/shutdown endpoint to trigger an ingester shutdown. This is a potentially dangerous operation and should only be enabled consciously.")

And the corresponding changes to the docs.

Copy link
Contributor Author

@kevinmingtarja kevinmingtarja Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, should the yaml field name stay the same though?

Right now it's get_request_for_ingester_shutdown_enabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, initially I did as you suggested, but then Peter suggested to make it consistent so I changed it: #7707 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, I missed that conversation sorry. Looking at the distributor.enable-otlp-metadata-storage flag (just above) it exists as both distributor.enable-otlp-metadata-storage as the flag and enable_otel_metadata_translation as the yaml. @pstibrany Just want to double check your opinion here please.

Copy link
Member

@pstibrany pstibrany Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry for confusion here. The problem here comes from the fact that config option is defined in api config block in YAML, but we're trying to use ingester. prefix for CLI flag. At the same time, we're triyng to make CLI flag and yaml field name consistent (this was goal of my previous comment, without realizing it's causing more confusion).

I think we should settle on:

  • -api.get-request-for-ingester-shutdown-enabled for CLI flag
  • get_request_for_ingester_shutdown_enabled as YAML field name (it's already under api section)

WDYT?

If we want to use -ingester. prefix for CLI flag, then we should put this into ingester config struct.

Unfortunately I think enable_otel_metadata_translation was a mistake, as it doesn't follow our practices. ("enabled" suffix, consistent CLI and YAML names)

Copy link
Member

@pstibrany pstibrany Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhesketh thanks for catching this. My suggestion was not to remove -ingester. prefix (only to add ingester in the middle of the name), and I missed that this is what happened. 🤦

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry for the misunderstanding too. So should we keep going with the -ingester. prefix or -api. instead? As you mentioned, I think the latter makes more sense as it's already under the api section in the yaml.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Kevin, with Josh we agreed to go with -api.get-request-for-ingester-shutdown-enabled CLI flag and get_request_for_ingester_shutdown_enabled YAML field name. Thank you very much for your patience with us!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Peter, no worries! Let me quickly do that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

cfg.RegisterFlagsWithPrefix("", f)
}

Expand Down Expand Up @@ -296,7 +300,10 @@ func (a *API) RegisterIngester(i Ingester) {
a.RegisterRoute("/ingester/flush", http.HandlerFunc(i.FlushHandler), false, true, "GET", "POST")
a.RegisterRoute("/ingester/prepare-shutdown", http.HandlerFunc(i.PrepareShutdownHandler), false, true, "GET", "POST", "DELETE")
a.RegisterRoute("/ingester/prepare-partition-downscale", http.HandlerFunc(i.PreparePartitionDownscaleHandler), false, true, "GET", "POST", "DELETE")
a.RegisterRoute("/ingester/shutdown", http.HandlerFunc(i.ShutdownHandler), false, true, "GET", "POST")
a.RegisterRoute("/ingester/shutdown", http.HandlerFunc(i.ShutdownHandler), false, true, "POST")
if a.cfg.GETRequestForIngesterShutdownEnabled {
a.RegisterDeprecatedRoute("/ingester/shutdown", http.HandlerFunc(i.ShutdownHandler), false, true, "GET")
}
a.RegisterRoute("/ingester/tsdb_metrics", http.HandlerFunc(i.UserRegistryHandler), true, true, "GET")

a.indexPage.AddLinks(defaultWeight, "Ingester", []IndexPageLink{
Expand Down
61 changes: 61 additions & 0 deletions pkg/api/api_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,14 @@ import (
"fmt"
"net"
"net/http"
"net/http/httptest"
"strconv"
"testing"

"github.com/go-kit/log"
"github.com/gorilla/mux"
"github.com/grafana/dskit/server"
"github.com/prometheus/client_golang/prometheus"
"github.com/stretchr/testify/require"

"github.com/grafana/mimir/pkg/querier/tenantfederation"
Expand Down Expand Up @@ -193,6 +195,62 @@ func TestApiGzip(t *testing.T) {
})
}

type MockIngester struct {
Ingester
}

func (mi MockIngester) ShutdownHandler(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusNoContent)
}

func TestApiIngesterShutdown(t *testing.T) {
for _, tc := range []struct {
name string
setFlag bool
expectedStatusCode int
}{
{
name: "flag set to true, enable GET request for ingester shutdown",
setFlag: true,
expectedStatusCode: http.StatusNoContent,
},
{
name: "flag not set (default), disable GET request for ingester shutdown",
setFlag: false,
expectedStatusCode: http.StatusMethodNotAllowed,
},
} {
t.Run(tc.name, func(t *testing.T) {
cfg := Config{
GETRequestForIngesterShutdownEnabled: tc.setFlag,
}
serverCfg := getServerConfig(t)
federationCfg := tenantfederation.Config{}
srv, err := server.New(serverCfg)
require.NoError(t, err)

go func() { _ = srv.Run() }()
t.Cleanup(srv.Stop)

api, err := New(cfg, federationCfg, serverCfg, srv, log.NewNopLogger())
require.NoError(t, err)

api.RegisterIngester(&MockIngester{})

req := httptest.NewRequest("GET", "/ingester/shutdown", nil)
w := httptest.NewRecorder()
api.server.HTTP.ServeHTTP(w, req)
require.Equal(t, tc.expectedStatusCode, w.Code)

// for POST request, it should always return 204
req = httptest.NewRequest("POST", "/ingester/shutdown", nil)
w = httptest.NewRecorder()
api.server.HTTP.ServeHTTP(w, req)
require.Equal(t, http.StatusNoContent, w.Code)
})
}
}

// Generates server config, with gRPC listening on random port.
func getServerConfig(t *testing.T) server.Config {
grpcHost, grpcPortNum := getHostnameAndRandomPort(t)
Expand All @@ -206,6 +264,9 @@ func getServerConfig(t *testing.T) server.Config {
GRPCListenPort: grpcPortNum,

GRPCServerMaxRecvMsgSize: 1024,

// Use new registry to avoid using default one and panic due to duplicate metrics.
Registerer: prometheus.NewPedanticRegistry(),
}
require.NoError(t, cfg.LogLevel.Set("info"))
return cfg
Expand Down
2 changes: 1 addition & 1 deletion tools/migrate-ingester-statefulsets.sh
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ while [[ $INSTANCES_TO_DOWNSCALE -gt 0 ]]; do
# wget (BusyBox version) will fail, but we don't care ... important thing is that it has triggered shutdown.
# -T causes wget to wait only 5 seconds, otherwise /ingester/shutdown takes a long time.
# Preferably we would wait for /ingester/shutdown to return, but unfortunately that doesn't work (even with big timeout), wget complains with weird error.
kubectl exec "$POD_TO_SHUTDOWN" --namespace="$NAMESPACE" -- wget -T 5 http://localhost:8080/ingester/shutdown >/dev/null 2>/dev/null || true
kubectl exec "$POD_TO_SHUTDOWN" --namespace="$NAMESPACE" -- wget --method POST -T 5 http://localhost:8080/ingester/shutdown >/dev/null 2>/dev/null || true

# While request to /ingester/shutdown completes only after flushing has finished, it unfortunately returns 204 status code,
# which confuses wget. That is the reason why instead of waiting for /ingester/shutdown to complete, this script waits for
Expand Down
Loading