Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add driver.docker counter metric for OOM Killer events #4185

Merged

Conversation

jesusvazquez
Copy link
Contributor

Hi!

This PR is a proposal to add a new counter metric for OOM Killer events.

The current behaviour is that when a container is killed by the OOM Killer nomad logs the event. The changes I made emit a counter metric after that with some labels that I'd consider useful for later analysis.

  • Image
  • ImageID
  • ContainerID

I'm starting to learn golang so if I got something wrong give me some guidances.

Regards

@@ -17,12 +17,13 @@ import (
"time"

"github.com/armon/circbuf"
docker "github.com/fsouza/go-dockerclient"
"github.com/fsouza/go-dockerclient"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goimports suggested this change. Please let me know if you want me to rollback this line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its because the imported package in that folder is already called docker :)

@@ -1924,6 +1925,21 @@ func (h *DockerHandle) run() {
h.logger.Printf("[ERR] driver.docker: failed to inspect container %s: %v", h.containerID, ierr)
} else if container.State.OOMKilled {
werr = fmt.Errorf("OOM Killed")
labels := []metrics.Label{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for me the interesting labels would be job, group and task :)

Copy link
Contributor Author

@jesusvazquez jesusvazquez Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes for me too, but i don't know how to access them from here. I'm happy to implement a solution but I'd need some help.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jesusvazquez you could add those fields to the DockerHandle struct, and then populate them from the DriverContext (I'm preparing a PR so you can have those available).

h := &DockerHandle{
client: client,
waitClient: waitClient,
executor: exec,
pluginClient: pluginClient,
logger: d.logger,
Image: d.driverConfig.ImageName,
ImageID: d.imageID,
containerID: container.ID,
version: d.config.Version.VersionNumber(),
killTimeout: GetKillTimeout(task.KillTimeout, maxKill),
maxKillTimeout: maxKill,
doneCh: make(chan bool),
waitCh: make(chan *dstructs.WaitResult, 1),
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See: #4196

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh really nice @jvrplmlmn Ill update this PR as soon as I get back from my vacations.

jvrplmlmn pushed a commit to jvrplmlmn/nomad that referenced this pull request Apr 22, 2018
Adding this fields to the DriverContext object, will allow us to pass
them to the drivers.

An use case for this, will be to emit tagged metrics in the drivers,
which contain all relevant information:
- Job
- TaskGroup
- Task
- ...

Ref: hashicorp#4185
Value: h.ImageID,
},
{
Name: "ContainerID",
Copy link
Contributor

@dadgar dadgar Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a useful metric? Seems like the combination of image ID + job information captures all you would need and be a little less noisy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes now that #4196 introduces job information Ill update the labels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dadgar Please review the updated labels e7bd558

@jesusvazquez jesusvazquez force-pushed the add-counter-metric-for-oom-killer-events branch from 6f35d36 to 54e1788 Compare May 4, 2018 05:46
jobName: d.DriverContext.jobName,
taskGroupName: d.DriverContext.taskGroupName,
taskName: d.DriverContext.taskName,
allocID: d.DriverContext.allocID,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two notes on this:

  • AllocID: I've added the allocation id because then we have the full context from the job to the allocation. This is useful to provide user oriented alerts.

  • We're initializing DockerHandle with 4 out of 8 fields of the DriverContext struct. Then I question myself: Would it make sense to pass the whole struct to DockerHandle?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value: h.taskGroupName,
},
{
Name: "TaskName",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jippi this is what you meant right?

Copy link
Contributor

@jippi jippi May 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah - looks great!

I personally don't care for the allocation id below, its a very high cardinality field, making influxdb & friends unhappy fairly fast

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that from the database and query point of view it probably has no value but from the alerting point of view the user can quickly run nomad alloc-status <alloc-id> and see what is happening.

What would you say here? Should we keep it or drop it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my personal preference would be to drop it - sounds like an alert in your logging rather than telemetry if you ask me :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop it. As @jippi points out it will make many telemetry systems unhappy

Value: h.taskGroupName,
},
{
Name: "TaskName",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop it. As @jippi points out it will make many telemetry systems unhappy

@@ -1924,6 +1933,25 @@ func (h *DockerHandle) run() {
h.logger.Printf("[ERR] driver.docker: failed to inspect container %s: %v", h.containerID, ierr)
} else if container.State.OOMKilled {
werr = fmt.Errorf("OOM Killed")
labels := []metrics.Label{
{
Name: "JobName",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jobName: d.DriverContext.jobName,
taskGroupName: d.DriverContext.taskGroupName,
taskName: d.DriverContext.taskName,
allocID: d.DriverContext.allocID,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dadgar
Copy link
Contributor

dadgar commented May 21, 2018

Ping on comments

@jesusvazquez
Copy link
Contributor Author

@dadgar

Need to add to ID and Open method: https://github.com/hashicorp/nomad/blob/master/client/driver/docker.go#L1769-L1798

I checked the ID method but I see that the dockerPID struct does not have job, task and task_group values, should I create them? Or maybe I got it wrong, could you give me more detail?

@jesusvazquez
Copy link
Contributor Author

Ping @dadgar ⬆️

@dadgar dadgar merged commit b7a7caf into hashicorp:master Jun 4, 2018
@github-actions
Copy link

github-actions bot commented Mar 3, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants