Error creating alert rule group #11

Closed
rayuduc opened this issue Apr 17, 2023 · 3 comments

rayuduc commented Apr 17, 2023

Terraform 1.2.3 (same issue with 1.4.5) on darwin_arm64

  • provider registry.terraform.io/fgouteroux/mimir v0.1.3

The Terraform module worked as expected until 04/03/2023.

2023-04-17T16:22:40.588-0700 [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState: removing state object for module.alerts_cloudSQL_dev.mimir_rule_group_alerting.CloudSQL_alerts["test_dev_alert"]
2023-04-17T16:22:40.588-0700 [TRACE] statemgr.Filesystem: not making a backup, because the new snapshot is identical to the old
2023-04-17T16:22:40.588-0700 [TRACE] statemgr.Filesystem: no state changes since last snapshot
2023-04-17T16:22:40.588-0700 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2023-04-17T16:22:40.594-0700 [ERROR] vertex "module.alerts_cloudSQL_dev.mimir_rule_group_alerting.CloudSQL_alerts["test_dev_alert"]" error: Cannot create alerting rule group 'CloudSqlAlerts_test_dev_alert' - unexpected response code '400': unable to decode rule group
2023-04-17T16:22:40.594-0700 [TRACE] vertex "module.alerts_cloudSQL_dev.mimir_rule_group_alerting.CloudSQL_alerts["test_dev_alert"]": visit complete, with errors
2023-04-17T16:22:40.594-0700 [TRACE] dag/walk: upstream of "module.alerts_cloudSQL_dev (close)" errored, so skipping
2023-04-17T16:22:40.594-0700 [TRACE] dag/walk: upstream of "provider["registry.terraform.io/fgouteroux/mimir"] (close)" errored, so skipping
2023-04-17T16:22:40.594-0700 [TRACE] dag/walk: upstream of "root" errored, so skipping
2023-04-17T16:22:40.594-0700 [TRACE] statemgr.Filesystem: not making a backup, because the new snapshot is identical to the old
2023-04-17T16:22:40.594-0700 [TRACE] statemgr.Filesystem: no state changes since last snapshot
2023-04-17T16:22:40.594-0700 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate

│ Error: Cannot create alerting rule group 'testapp_dev' - unexpected response code '400': unable to decode rule group


shybbko commented Apr 20, 2023

I came here to report the same thing. It occurs for alerts with longer expressions and only with fgouteroux/mimir v0.1.3, so downgrading to 0.1.2 will work around it until the issue is fixed for good.

More details:

This will deploy fine for any version of the provider (including 0.1.3):

resource "mimir_rule_group_alerting" "testing1" {
  name = "testing1"
  rule {
    alert = "testing1"
    expr  = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx > 600"
    for   = "10m"
  }
}

while this will deploy fine for any version of the provider except 0.1.3:

resource "mimir_rule_group_alerting" "testing2" {
  name = "testing2"
  rule {
    alert = "testing2"
    expr  = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx > 600"
    for   = "10m"
  }
}
╷
│ Error: Cannot create alerting rule group 'testing2' - unexpected response code '400': unable to decode rule group
│
│
│   with module.observability-mimir.module.alerts.mimir_rule_group_alerting.testing2,
│   on ../module-observability-mimir/Alerts/testing2.tf line 1, in resource "mimir_rule_group_alerting" "testing2":
│    1: resource "mimir_rule_group_alerting" "testing2" {
│
╵

I believe the reason is that, for longer expressions, the newest version of the provider tries to prettify them by breaking them across multiple lines, which Mimir rejects. It is probably a formatting issue, because in general Mimir should accept multi-line alert expressions (but I might be wrong here).

Anyway here's the difference in plans. Shorter expressions:

Terraform will perform the following actions:

  # module.observability-mimir.module.alerts.mimir_rule_group_alerting.testing1 will be created
  + resource "mimir_rule_group_alerting" "testing1" {
      + id        = (known after apply)
      + name      = "testing1"
      + namespace = "default"

      + rule {
          + alert = "testing1"
          + expr  = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx > 600"
          + for   = "10m"
        }
    }

Longer expressions:

  # module.observability-mimir.module.alerts.mimir_rule_group_alerting.testing2 will be created
  + resource "mimir_rule_group_alerting" "testing2" {
      + id        = (known after apply)
      + name      = "testing2"
      + namespace = "default"

      + rule {
          + alert = "testing2"
          + expr  = <<-EOT
                  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                >
                  600
            EOT
          + for   = "10m"
        }
    }

Owner

fgouteroux commented Apr 24, 2023

@rayuduc yes, version 0.1.3 introduced a bug caused by prettifying the PromQL expression.
@shybbko the start of your analysis is right: prettifying adds new lines to longer expressions, and Mimir accepts multi-line strings fine. The error comes from the spaces added at the beginning of the multi-line expression.
When the request is YAML-marshalled, the encoder adds the pipe and the number of spaces to preserve for indentation: expr: |4-
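
As a rough illustration of why the header gains an explicit number: in YAML, a literal block scalar needs an indentation indicator (the digit in `|4-`) when its first non-empty line starts with a space, because the indentation can no longer be auto-detected. The Go sketch below checks only that condition, using the standard library. `needsIndentIndicator` and the metric name are hypothetical, chosen for illustration; this is not code from the provider or from go-yaml.

```go
package main

import (
	"fmt"
	"strings"
)

// needsIndentIndicator reports whether a YAML literal block scalar holding s
// would need an explicit indentation indicator (the digit in "|4-").
// Hypothetical helper for illustration; not taken from the provider or go-yaml.
func needsIndentIndicator(s string) bool {
	for _, line := range strings.Split(s, "\n") {
		if strings.TrimSpace(line) == "" {
			continue // only the first non-empty line matters
		}
		return strings.HasPrefix(line, " ")
	}
	return false
}

func main() {
	// Shape of a prettified expression: the first line is indented.
	prettified := "  up == 0\nunless\n  some_metric > 600"
	// The same expression on a single line, with no leading space.
	plain := "up == 0 unless some_metric > 600"

	fmt.Println(needsIndentIndicator(prettified)) // true
	fmt.Println(needsIndentIndicator(plain))      // false
}
```

The prettified form trips the check because its first line starts with two spaces, which matches the request shown in the debug output.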

Example of http request from debug mode:

2023-04-24T10:46:11.191+0200 [INFO]  provider.terraform-provider-mimir_v0.1.3: 2023/04/24 10:46:11 REQUEST:
POST /prometheus/config/v1/rules/test HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Go-http-client/1.1
Content-Length: 355
Content-Type: application/yaml
X-Scope-Orgid: test
Accept-Encoding: gzip

name: test3
rules:
    - alert: TestFGX3
      expr: |4-
          up == 0
        unless
          my_very_very_long_useless_metric_that_mean_nothing_but_necessary_to_check_that_test > 600
      for: 1m
      labels:
        priority: SEV-5
        severity: info
      annotations:
        description: testfgx 3
        summary: Summary for TestFGX3: timestamp=2023-04-24T10:46:11.191+0200
2023-04-24T10:46:11.192+0200 [INFO]  provider.terraform-provider-mimir_v0.1.3: 2023/04/24 10:46:11 RESPONSE:
HTTP/1.1 400 Bad Request
Content-Length: 28
Content-Type: text/plain; charset=utf-8
Date: Mon, 24 Apr 2023 08:46:11 GMT
Vary: Accept-Encoding
X-Content-Type-Options: nosniff

unable to decode rule group: timestamp=2023-04-24T10:46:11.192+0200
2023-04-24T10:46:11.194+0200 [ERROR] vertex "module.mimir_alerting_rule.mimir_rule_group_alerting.alert" error: Cannot update alerting rule group 'test3' - unexpected response code '400': unable to decode rule group

What I don't really understand is why there are 4 spaces instead of 2 by default (maybe related to go-yaml/yaml#864). I ran some tests with a custom encoder to force the indentation to 2, but the code became more complex and could bring other unexpected issues. So I decided to just remove the spaces at the beginning of the expression.
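
A minimal sketch of that kind of fix, assuming it amounts to trimming leading whitespace from each line of the prettified expression before marshalling. `trimLeadingSpaces` and the metric name are hypothetical; the actual change in v0.1.4 may differ, for example by trimming only the first line.

```go
package main

import (
	"fmt"
	"strings"
)

// trimLeadingSpaces removes leading spaces and tabs from every line of expr,
// keeping the line breaks. Hypothetical sketch; the provider's actual
// v0.1.4 change may be implemented differently.
func trimLeadingSpaces(expr string) string {
	lines := strings.Split(expr, "\n")
	for i, line := range lines {
		lines[i] = strings.TrimLeft(line, " \t")
	}
	return strings.Join(lines, "\n")
}

func main() {
	prettified := "  up == 0\nunless\n  some_metric > 600"
	// %q prints the result with escaped newlines so the trimmed lines are visible.
	fmt.Printf("%q\n", trimLeadingSpaces(prettified))
}
```

With no leading space on the first line, the YAML encoder has no reason to emit an indentation indicator, so the expression should be written with a plain block header instead of `|4-`.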

Release v0.1.4 should fix this issue.


shybbko commented May 4, 2023

Thanks @fgouteroux !
