Error creating alert rule group #11

rayuduc · 2023-04-17T23:29:20Z

Terraform 1.2.3 and same issue with 1.4.5
on darwin_arm64

provider registry.terraform.io/fgouteroux/mimir v0.1.3

The Terraform module worked as expected until 04/03/2023.

2023-04-17T16:22:40.588-0700 [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState: removing state object for module.alerts_cloudSQL_dev.mimir_rule_group_alerting.CloudSQL_alerts["test_dev_alert"]
2023-04-17T16:22:40.588-0700 [TRACE] statemgr.Filesystem: not making a backup, because the new snapshot is identical to the old
2023-04-17T16:22:40.588-0700 [TRACE] statemgr.Filesystem: no state changes since last snapshot
2023-04-17T16:22:40.588-0700 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2023-04-17T16:22:40.594-0700 [ERROR] vertex "module.alerts_cloudSQL_dev.mimir_rule_group_alerting.CloudSQL_alerts["test_dev_alert"]" error: Cannot create alerting rule group 'CloudSqlAlerts_test_dev_alert' - unexpected response code '400': unable to decode rule group
2023-04-17T16:22:40.594-0700 [TRACE] vertex "module.alerts_cloudSQL_dev.mimir_rule_group_alerting.CloudSQL_alerts["test_dev_alert"]": visit complete, with errors
2023-04-17T16:22:40.594-0700 [TRACE] dag/walk: upstream of "module.alerts_cloudSQL_dev (close)" errored, so skipping
2023-04-17T16:22:40.594-0700 [TRACE] dag/walk: upstream of "provider["registry.terraform.io/fgouteroux/mimir"] (close)" errored, so skipping
2023-04-17T16:22:40.594-0700 [TRACE] dag/walk: upstream of "root" errored, so skipping
2023-04-17T16:22:40.594-0700 [TRACE] statemgr.Filesystem: not making a backup, because the new snapshot is identical to the old
2023-04-17T16:22:40.594-0700 [TRACE] statemgr.Filesystem: no state changes since last snapshot
2023-04-17T16:22:40.594-0700 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
╷
│ Error: Cannot create alerting rule group 'testapp_dev' - unexpected response code '400': unable to decode rule group

shybbko · 2023-04-20T10:46:21Z

I came here to report the same thing. This occurs for alerts with longer expressions and only for fgouteroux/mimir v0.1.3, so downgrading to 0.1.2 will work around it until the issue is fixed for good.
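For reference, the downgrade can be expressed as an exact version pin. This is only a minimal sketch: the local provider name `mimir` is an assumption inferred from the `mimir_*` resource prefix, and the block would need to be merged into whatever `required_providers` block your configuration already has.

```hcl
terraform {
  required_providers {
    # "mimir" as the local provider name is an assumption, inferred from the
    # mimir_* resource type prefix used in this thread.
    mimir = {
      source  = "fgouteroux/mimir"
      version = "0.1.2" # exact pin; v0.1.3 is the affected release
    }
  }
}
```

If a dependency lock file already records v0.1.3, running `terraform init -upgrade` should let Terraform re-select the pinned version.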

More details:

This will deploy fine for any version of the provider (including 0.1.3):

resource "mimir_rule_group_alerting" "testing1" {
  name      = "testing1"
  rule {
    alert = "testing1"
    expr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx > 600"
    for  = "10m"
  }
}

while this will deploy fine for any version of the provider except 0.1.3:

resource "mimir_rule_group_alerting" "testing2" {
  name      = "testing2"
  rule {
    alert = "testing2"
    expr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx > 600"
    for  = "10m"
  }
}

╷
│ Error: Cannot create alerting rule group 'testing2' - unexpected response code '400': unable to decode rule group
│
│
│   with module.observability-mimir.module.alerts.mimir_rule_group_alerting.testing2,
│   on ../module-observability-mimir/Alerts/testing2.tf line 1, in resource "mimir_rule_group_alerting" "testing2":
│    1: resource "mimir_rule_group_alerting" "testing2" {
│
╵

I believe the reason is that, for longer expressions, the newest version of the provider tries to prettify them by breaking them across multiple lines, which Mimir is not happy about. It is probably a formatting issue, because in general Mimir should accept multi-line alerts (but I might be wrong here).

Anyway here's the difference in plans. Shorter expressions:

Terraform will perform the following actions:

  # module.observability-mimir.module.alerts.mimir_rule_group_alerting.testing1 will be created
  + resource "mimir_rule_group_alerting" "testing1" {
      + id        = (known after apply)
      + name      = "testing1"
      + namespace = "default"

      + rule {
          + alert = "testing1"
          + expr  = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx > 600"
          + for   = "10m"
        }
    }

Longer expressions:

  # module.observability-mimir.module.alerts.mimir_rule_group_alerting.testing2 will be created
  + resource "mimir_rule_group_alerting" "testing2" {
      + id        = (known after apply)
      + name      = "testing2"
      + namespace = "default"

      + rule {
          + alert = "testing2"
          + expr  = <<-EOT
                  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                >
                  600
            EOT
          + for   = "10m"
        }
    }

fgouteroux · 2023-04-24T13:45:20Z

@rayuduc yes, version 0.1.3 introduced a bug caused by prettifying PromQL expressions.
@shybbko the beginning of your analysis is right: prettifying adds new lines to longer expressions, and multi-line strings are accepted by Mimir. The error comes from the spaces added at the beginning of the multi-line expression.
When the request is YAML-marshalled, it adds the pipe and the number of spaces to preserve for indentation: expr: |4-

Example of http request from debug mode:

2023-04-24T10:46:11.191+0200 [INFO]  provider.terraform-provider-mimir_v0.1.3: 2023/04/24 10:46:11 REQUEST:
POST /prometheus/config/v1/rules/test HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Go-http-client/1.1
Content-Length: 355
Content-Type: application/yaml
X-Scope-Orgid: test
Accept-Encoding: gzip

name: test3
rules:
    - alert: TestFGX3
      expr: |4-
          up == 0
        unless
          my_very_very_long_useless_metric_that_mean_nothing_but_necessary_to_check_that_test > 600
      for: 1m
      labels:
        priority: SEV-5
        severity: info
      annotations:
        description: testfgx 3
        summary: Summary for TestFGX3: timestamp=2023-04-24T10:46:11.191+0200
2023-04-24T10:46:11.192+0200 [INFO]  provider.terraform-provider-mimir_v0.1.3: 2023/04/24 10:46:11 RESPONSE:
HTTP/1.1 400 Bad Request
Content-Length: 28
Content-Type: text/plain; charset=utf-8
Date: Mon, 24 Apr 2023 08:46:11 GMT
Vary: Accept-Encoding
X-Content-Type-Options: nosniff

unable to decode rule group: timestamp=2023-04-24T10:46:11.192+0200
2023-04-24T10:46:11.194+0200 [ERROR] vertex "module.mimir_alerting_rule.mimir_rule_group_alerting.alert" error: Cannot update alerting rule group 'test3' - unexpected response code '400': unable to decode rule group

What I don't really understand is why there are 4 spaces instead of the default 2 (maybe related to go-yaml/yaml#864). I ran some tests with a custom encoder to force the indentation to 2, but the code became more complex and could bring other unexpected issues. So I decided to just remove the spaces at the beginning of the expression.

Release v0.1.4 should fix this issue.

shybbko · 2023-05-04T12:03:13Z

Thanks @fgouteroux !

fgouteroux mentioned this issue Apr 24, 2023

Root resource was present, but now absent. #10

Closed

fgouteroux closed this as completed Apr 26, 2023
