Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support partial Process Stats responses when the stats service (log-cache or metric-proxy) is unavailable #2250

Merged
merged 4 commits into from
Apr 30, 2021

Conversation

tcdowney
Copy link
Member

@tcdowney tcdowney commented Apr 28, 2021

This change rescues errors from log-cache and metric-proxy and passes them up as warnings to callers instead of exceptions so that the stats endpoint can continue to function on a best-effort basis. On /v3/processes/:guid/stats these warnings are included in a new "warnings" key on the response, on v2 they are ignored. Alternatively/additionally we could add these to the 'X-Cf-Warnings' header as @sethboyles mentioned on the issue.

The related issue has more information: #2227

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests

@tcdowney tcdowney changed the title support partial Process Stats responses Support partial Process Stats responses when the stats service (log-cache or metric-proxy) is unavailable Apr 28, 2021
@sethboyles
Copy link
Member

so far looks good to me. Is the non-idiomatic part of this you mentioned the multiple return values? If so I don't have any qualms with it.

@tcdowney
Copy link
Member Author

so far looks good to me. Is the non-idiomatic part of this you mentioned the multiple return values? If so I don't have any qualms with it.

Yeah, that was mostly it. I don't think we've done much of it in the codebase so wanted opinions. I think it's better than trying to juggle the exceptions all the way up to the controller in this case, though.

@tcdowney
Copy link
Member Author

@sethboyles do you think we should go with the warning header instead of the "warnings" field in the response? I could see that being a better UX out of the box since the CLI already supports it.

@sethboyles
Copy link
Member

sethboyles commented Apr 28, 2021

I think for now that might be the better option. I am uncertain about the "warnings" property as it has only been used for jobs in the past, and I am wary of introducing it to other parts of the API without giving it enough thought or discussion.
Clients probably don't have existing ways of consuming non-job related warnings, and I'm unsure its a pattern we want to adopt generally as we probably don't want to allow partial successes/failures when we can avoid it.

We are going to discuss warnings in general as a team tomorrow, but I don't want to block this PR--how about we go with the header warning for now, and if we decide later on to use the JSON field, we can put an issue for it in our backlog?

@tcdowney
Copy link
Member Author

Sounds good. I'll make that change. 👍

@tcdowney
Copy link
Member Author

tcdowney commented Apr 29, 2021

Validating Steps

Here's how I tested this out manually.

CF for VMs

  1. cf push dora and observe the metrics work
  2. cf restart dora and observe the metrics work
  3. bosh ssh doppler
  4. sudo su -
  5. monit stop log-cache
  6. cf restart dora and observe that the workflow succeeds even though metrics are missing. My output looked like this:
$ cf restart dora
Restarting app dora in org org / space space as admin...

Stopping app...

Waiting for app to start...

Instances starting...
Instances starting...
Instances starting...
Instances starting...

Stats server temporarily unavailable.
Stats server temporarily unavailable.
Stats server temporarily unavailable.
Stats server temporarily unavailable.
Stats server temporarily unavailable.
name:              dora
requested state:   started
routes:            dora.diamond-legend.capi.land
last uploaded:     Thu 29 Apr 09:20:20 PDT 2021
stack:             cflinuxfs3
buildpacks:
	name             version   detect output   buildpack name
	ruby_buildpack   1.8.37    ruby            ruby

type:           web
sidecars:
instances:      1/1
memory usage:   256M
     state     since                  cpu    memory   disk     details
#0   running   2021-04-29T16:22:54Z   0.0%   0 of 0   0 of 0

type:           worker
sidecars:
instances:      0/0
memory usage:   256M
There are no running instances of this process.

Directly hitting the endpoint returns:

cf curl /v3/processes/8f79dd94-2f93-427c-b66c-9290b81db79d/stats
{
   "resources": [
      {
         "type": "web",
         "index": 0,
         "state": "RUNNING",
         "host": "10.244.0.138",
         "uptime": 258,
         "mem_quota": null,
         "disk_quota": null,
         "fds_quota": 16384,
         "isolation_segment": null,
         "details": null,
         "instance_ports": [
            {
               "external": 61004,
               "internal": 8080,
               "external_tls_proxy_port": 61006,
               "internal_tls_proxy_port": 61001
            },
            {
               "external": 61005,
               "internal": 2222,
               "external_tls_proxy_port": 61007,
               "internal_tls_proxy_port": 61002
            }
         ],
         "usage": {}
      }
   ]
}

cf curl also prints out the Stats server temporarily unavailable. warning after it finishes.

Even though the API is returning null/empty values, CLI is reporting 0s for the metrics that is the default numeric value in Go. Since we're returning empty values on the API side, the CLI could be enhanced to display something more meaningful if it wanted to, but I think this is fine for now.

CF for Kubernetes

On Kubernetes the CLI/API experience is the same, just the repro steps are a little different.

  1. cf push dora and observe the metrics work (on cf-for-k8s I always see the quotas missing even without this change)
  2. cf restart dora and observe the metrics work (on cf-for-k8s I always see the quotas missing even without this change)
  3. kubectl -n cf-system scale deployment metric-proxy --replicas=0
  4. cf restart dora and observe that the workflow succeeds even though metrics are missing.

- this change rescues errors from log-cache and metric-proxy and passes
them up as warnings to callers instead of exceptions so that the stats
endpoint can continue to function on a best-effort basis
- on /v3/processes/:guid/stats these warnings are included in the
'X-Cf-Warnings' header
- addresses #2227

Authored-by: Tim Downey <[email protected]>
- this temporary config flag was used to ignore server unavailable
errors coming back from log-cache (or metric-proxy in cf-for-k8s)
- given the change proposed in #2227 to rescue
more errors and serve partial stats responses this config is no longer
necessary

Authored-by: Tim Downey <[email protected]>
In #2227 we
updated the Process Stats endpoint to be able to return partial
responses (with a warning) when the stats server is unavailable.

Authored-by: Tim Downey <[email protected]>
@tcdowney tcdowney force-pushed the handle-log-cache-unavailability branch from 87be0c7 to ed75cf6 Compare April 29, 2021 17:16
@tcdowney
Copy link
Member Author

@sethboyles I rebased the original commit to use the 'X-Cf-Warnings' header instead of adding "warnings" to the response. Also added a commit to remove a now unused config flag and some light-weight documentation to point out these fields can now be empty or null (the mem_quota and disk_quota ones could have been null before this change fwiw).

I'm gonna take this PR out of "draft" state now, I think it's ready. 🙂

@tcdowney tcdowney marked this pull request as ready for review April 29, 2021 17:18
@tcdowney tcdowney requested a review from jspawar April 29, 2021 17:55
Copy link
Contributor

@jspawar jspawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I don't quite understand why we ignore the errors in v2, although I'm guessing this is because v2 endpoints should not be modified and they don't support that warnings header?

app/controllers/runtime/stats_controller.rb Outdated Show resolved Hide resolved
**mem_quota** | _integer_ | The current maximum memory allocated for the instance
**disk_quota** | _integer_ | The current maximum disk allocated for the instance
**mem_quota** | _integer_ | The current maximum memory allocated for the instance; the value is `null` when memory quota data is unavailable
**disk_quota** | _integer_ | The current maximum disk allocated for the instance; the value is `null` when disk quota data is unavailable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏 thank you for updating the docs!

@sethboyles
Copy link
Member

LGTM!

± sb |handle-log-cache-unavailability → origin {2} ?:1 ✗| → cf curl v3/apps/$(cf app dora --guid)/processes/web/stats
{
   "resources": [
      {
         "type": "web",
         "index": 0,
         "state": "RUNNING",
         "host": "10.244.0.138",
         "uptime": 1525,
         "mem_quota": null,
         "disk_quota": null,
         "fds_quota": 16384,
         "isolation_segment": null,
         "details": null,
         "instance_ports": [
            {
               "external": 61028,
               "internal": 8080,
               "external_tls_proxy_port": 61030,
               "internal_tls_proxy_port": 61001
            },
            {
               "external": 61029,
               "internal": 2222,
               "external_tls_proxy_port": 61031,
               "internal_tls_proxy_port": 61002
            }
         ],
         "usage": {}
      }
   ]
}
Stats server temporarily unavailable.

@tcdowney Let me know if/when you add the warnings to v2, then I'll go ahead and merge this in. Thank you for the PR!

@tcdowney
Copy link
Member Author

19:04 $ cf curl /v2/apps/$(cf app dora --guid)/stats
{
   "0": {
      "state": "RUNNING",
      "isolation_segment": null,
      "stats": {
         "name": "dora",
         "uris": [
            "dora.diamond-legend.capi.land"
         ],
         "host": "10.244.0.138",
         "port": 61004,
         "uptime": 62,
         "fds_quota": 16384,
         "mem_quota": null,
         "disk_quota": null,
         "usage": {}
      }
   }
}
Stats server temporarily unavailable.

Done! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants