
Add Docker Swarm Autoscale support #1497
Merged: nathanielc merged 1 commit into master on Aug 4, 2017

Conversation

nathanielc (Contributor) commented Jul 25, 2017

This PR builds off #1292 and #1425. Since significant changes have been made to the way nodes are written, the changes merited a new PR.

@adityacs Can you take a look at the PR and possibly test it out in your environment? The PR is basically the same as what you already had but updated to reflect some major changes being made in Kapacitor.

I have tested this with my local Docker swarm cluster.

adityacs (Contributor) commented Jul 26, 2017

@nathanielc I'm getting this error:

kapacitor define containerscale_alert_stream -type stream -tick /opt/kap/container_cpuswarm_alert.tick -dbrp telegraf.autogen
cannot use the swarmAutoscale node, could not create swarm client: unknown swarm cluster "", cannot get client

I see in the code that by default it assumes a TLS-enabled Docker cluster. Is it possible to get a client for a Docker cluster without TLS?

nathanielc (Contributor, Author) commented Jul 26, 2017

@adityacs It's complaining that it can't find the cluster with the ID of the empty string "". Either set id = "" in the config or specify the cluster to use in the TICKscript, like |swarmAutoscale().cluster('id of cluster').

TLS is controlled by the protocol specified in the list of servers, i.e. whether it's http or https.
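
For example, a minimal TICKscript sketch of the two options (the cluster ID 'my-swarm' is a placeholder, not something from this PR; it has to match the id of a [[swarm]] section in the Kapacitor config, and whether TLS is used follows from whether that section's servers entries start with http:// or https://; the node's other required properties are omitted):

|swarmAutoscale()
    // use the [[swarm]] config section whose id is 'my-swarm';
    // leave .cluster() off only if the config section sets id = ""
    .cluster('my-swarm')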

adityacs (Contributor) commented Jul 26, 2017

@nathanielc Now I'm getting the error below:

[containerscale_alert_stream:swarm_autoscale9] 2017/07/26 20:34:54 D! setting replicas to 4 was 1 for "traefik"
[log] 2017/07/26 20:34:54 D! {"Name":"traefik","TaskTemplate":{"ContainerSpec":{"Image":"traefik:latest@sha256:e138501b457d2f5f5a9b22e11a2c558939308867b67310a127665e4aa4de09e0","Args":["--docker","--docker.swarmmode","--docker.domain=traefik","--docker.watch","--web","--loglevel=INFO"],"Mounts":[{"Type":"bind","Source":"/var/run/docker.sock","Target":"/var/run/docker.sock"}],"DNSConfig":{}},"Resources":{"Limits":{},"Reservations":{}},"Placement":{"Constraints":["node.role==manager"]},"Networks":[{"Target":"lzrhwdc5z14dm8vwp2kknez3z"}],"ForceUpdate":0},"Mode":{"Replicated":{"Replicas":4}},"EndpointSpec":{"Mode":"vip","Ports":[{"Protocol":"tcp","TargetPort":80,"PublishedPort":80,"PublishMode":"ingress"},{"Protocol":"tcp","TargetPort":8080,"PublishedPort":8080,"PublishMode":"ingress"}]}}

[httpd] ::1 - - [26/Jul/2017:20:34:54 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e1d4c76d-7241-11e7-8830-000000000000 3996
[containerscale_alert_stream:swarm_autoscale9] 2017/07/26 20:34:54 E! failed to apply scaling event: failed to set new replica count for "traefik": failed to understand swarm server error response: Code: 200
[httpd] ::1 - - [26/Jul/2017:20:34:55 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e2782e8c-7241-11e7-8831-000000000000 1252
[httpd] ::1 - - [26/Jul/2017:20:34:56 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e2b81fbc-7241-11e7-8832-000000000000 751
[httpd] ::1 - - [26/Jul/2017:20:34:56 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e2e70396-7241-11e7-8833-000000000000 359
[httpd] ::1 - - [26/Jul/2017:20:34:56 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e2ff1227-7241-11e7-8834-000000000000 2877
[httpd] ::1 - - [26/Jul/2017:20:34:56 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e30312b2-7241-11e7-8835-000000000000 2599
[containerscale_alert_stream:log8] 2017/07/26 20:34:56 I!  {"name":"docker_container_cpu","group":"com.docker.swarm.service.name=traefik","dimensions":{"ByName":false,"TagNames":["com.docker.swarm.service.name"]},"fields":{"stat":0.069840101010101},"tags":{"com.docker.swarm.service.name":"traefik"},"time":"2017-07-26T20:37:20Z"}

[containerscale_alert_stream:swarm_autoscale9] 2017/07/26 20:34:56 D! setting replicas to 4 was 1 for "traefik"
[log] 2017/07/26 20:34:56 D! {"Name":"traefik","TaskTemplate":{"ContainerSpec":{"Image":"traefik:latest@sha256:e138501b457d2f5f5a9b22e11a2c558939308867b67310a127665e4aa4de09e0","Args":["--docker","--docker.swarmmode","--docker.domain=traefik","--docker.watch","--web","--loglevel=INFO"],"Mounts":[{"Type":"bind","Source":"/var/run/docker.sock","Target":"/var/run/docker.sock"}],"DNSConfig":{}},"Resources":{"Limits":{},"Reservations":{}},"Placement":{"Constraints":["node.role==manager"]},"Networks":[{"Target":"lzrhwdc5z14dm8vwp2kknez3z"}],"ForceUpdate":0},"Mode":{"Replicated":{"Replicas":4}},"EndpointSpec":{"Mode":"vip","Ports":[{"Protocol":"tcp","TargetPort":80,"PublishedPort":80,"PublishMode":"ingress"},{"Protocol":"tcp","TargetPort":8080,"PublishedPort":8080,"PublishMode":"ingress"}]}}

It's failing to read the Docker response and keeps retrying to apply the replica value. However, scaling is working in the Docker cluster. If I scale down manually in the Docker cluster, Kapacitor immediately retries the scale-up. I believe it should not try to apply the scale event if it has already scaled within the last 5 minutes or so.

nathanielc (Contributor, Author):

@adityacs What version of Docker are you running? It seems to be returning a different HTTP response code than expected. The code is written to consume the v1.30 API version.

I believe it should not try to apply the scale event if it has already scaled within last 5mins or so.

This is possible via the .increaseCooldown() and .decreaseCooldown() properties. See https://docs.influxdata.com/kapacitor/v1.3/nodes/k8s_autoscale_node/#decreasecooldown
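
For example, a hedged sketch of where those properties sit on the node (the 5m durations are illustrative only, and the rest of the node's configuration is omitted):

|swarmAutoscale()
    // suppress further scale-up events for 5m after an increase
    .increaseCooldown(5m)
    // suppress further scale-down events for 5m after a decrease
    .decreaseCooldown(5m)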

adityacs (Contributor) commented Jul 28, 2017

@nathanielc I am using Docker version 17.06. The API version is also v1.30.

Since the response is not handled correctly, irrespective of the .increaseCooldown() and .decreaseCooldown() settings it still tries to apply the event repeatedly for every .period().

nathanielc (Contributor, Author):

@adityacs I have pushed up an update that should fix the error.

adityacs (Contributor):

@nathanielc The error is fixed now. However, I am not seeing the expected behavior for scaling.

  1. Kapacitor scales the containers to a number of replicas, say 4.
  2. Then I manually scale the containers back down to 1.
  3. I expect the containers to be scaled up again since the "target" is reached.

Kapacitor should try to scale the containers, right? What is the expected behavior here without increase/decrease cooldowns?

nathanielc (Contributor, Author):

@adityacs Hmm, what is the reason for manually scaling the service? For k8s, Kapacitor assumes that it has full control over the scaling factor, so it caches it instead of making an API call to k8s on every evaluation of the node. As a result, if the autoscale node doesn't make any changes to the scale factor, no new API calls are made to the system to set a new scaling value. Docker Swarm has inherited that assumption.

Also, in a live system the autoscaling is probably set up as a feedback loop. If you manually change the scale factor, the feedback loop will likely adjust accordingly.

We could add some kind of timeout for the cached values and periodically check, but that seems like it would create a confusing user experience. Thoughts?

adityacs (Contributor):

@nathanielc Makes sense. It's good to give full control to Kapacitor. However, if we add a cache timeout, we could make it a node property the user can set, e.g. .cacheTimeout(). What do you say?
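
Hypothetically, the proposed property could read like this in a TICKscript (the name .cacheTimeout() is only the suggestion above, not something implemented in this PR, and the 5m value is illustrative):

|swarmAutoscale()
    // proposed: discard the cached replica count and re-read it from the cluster after 5m
    .cacheTimeout(5m)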

adityacs (Contributor):

@nathanielc It's working as expected, even when I try it with .increaseCooldown() and .decreaseCooldown().

Thanks :)

nathanielc (Contributor, Author):

However, if we include cache timeout, we can make it as node property where user can set .cacheTimeout(). What say?

Sounds reasonable to me. Let's do it in a separate PR. Do you want to take a crack at it?

I'll clean up this PR, rebase it, etc., and get it merged.

adityacs (Contributor) commented Aug 1, 2017

Sounds reasonable to me. Let's do it in a separate PR. Do you want to take a crack at it?

I need some time, but I will work on this.

Thanks for this PR.

adityacs (Contributor) commented Aug 1, 2017

@nathanielc One more thing: is there any plan to support autoscaling of EC2 or GCE instances, or do you think that's beyond the scope of Kapacitor?

nathanielc (Contributor, Author):

@adityacs There are no current plans, but if someone wanted to add them it's much easier now that the autoscale node is generalized.

nathanielc requested a review from desa on August 1, 2017 at 18:07
nathanielc (Contributor, Author):

@desa Could you give this a quick review?

adityacs (Contributor) commented Aug 1, 2017

There are no current plans but if someone wanted to add them its much easier now as the autoscale node is generalized.

@nathanielc Cool. I'd like to work on this and will let you know my progress.

desa (Contributor) commented Aug 2, 2017

@nathanielc I'm wondering if there should just be a generic autoscale node that acts like the alert node does.

It would look something like this:

|autoscale()
    .k8s()
        .resourceNameTag('deployment')

or

|autoscale()
    .swarm()
        .serviceNameTag('deployment')

The way it is currently implemented is a bit different from what I have seen elsewhere in the Kapacitor code base (though that could just be me not being 100% up to speed with how all of the nodes are implemented).

It'd also open up the ability to add more autoscaling "handlers" in a pattern similar to the alert handlers.

I think it would also be possible to make this change while keeping backwards compatibility with k8sAutoscale by implementing it as a macro, like deadman.

Thoughts?

nathanielc (Contributor, Author):

@desa I dislike the way the alert handlers work, because everything in TICKscript is at most two levels deep, node -> property, except for the alert handlers, which are three deep: node -> handler -> property. I dislike that handlers are different from everything else, and I don't want to add more like them.

Also, the difference here is that it makes sense to have multiple handlers for a single alert, while it doesn't make as much sense to have multiple autoscale handlers, since it's likely you are only using one.

So, in short I want it to remain the way it is.

desa (Contributor) commented Aug 2, 2017

Makes sense.

desa (Contributor) left a review:

Just a couple questions.

	New int
}

type AutoscaleNode struct {

Contributor: Is there a reason why ResourceID, Autoscaler, AutoscaleNode are exported?

nathanielc (Contributor, Author): Nope, other than that's how other nodes work.

In autoscale.go (outdated):
	currentField string
}

// Create a new AutoscaleNode which can trigger autoscale event for a Kubernetes cluster.

Contributor: Comment needs updating; it's not just Kubernetes any more :)

	ID string `toml:"id" override:"id"`
	Servers []string `toml:"servers" override:"servers"`
	// Path to CA file
	SSLCA string `toml:"ssl-ca" override:"ssl-ca"`

Contributor: Are all of these okay to expose via the API? I see that the other services do this as well, but it feels like there might be some security issue here. Am I wrong?

nathanielc (Contributor, Author): These are just paths to files on disk. That should not be sensitive information.

}

func (s *Cluster) Client() (client.Client, error) {
	config := s.configValue.Load().(Config)

Contributor: Nit: I like the pattern of having a method that does the casting for you, e.g.

func (s *Service) config() Config {
	return s.configValue.Load().(Config)
}

func NewCluster(c Config, l *log.Logger) (*Cluster, error) {
	clientConfig, err := c.ClientConfig()
	if err != nil {
		return nil, errors.Wrap(err, "failed to create k8s client config")

adityacs (Contributor), Aug 3, 2017: this should be "failed to create swarm client config"?

	}
	cli, err := client.New(clientConfig)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create k8s client")

Contributor: this should be "failed to create swarm client"?

nathanielc (Contributor, Author):

Thanks @desa and @adityacs for the review. I have addressed the comments.

nathanielc merged commit 5dd2183 into master on Aug 4, 2017.
nathanielc added a commit that referenced this pull request on Aug 4, 2017: Add Docker Swarm Autoscale support.
nathanielc deleted the nc-autoscale branch on August 4, 2017 at 15:25.
cfavacho commented:
@nathanielc I am using Kapacitor version 1.5.1, and swarmAutoscale() is not working. Whether I enable or disable the swarm service in the configuration file, I always get the message: unknown swarm cluster "".

My configuration file:

[[swarm]]
enabled = true
id = "CL"
servers = ["http://10.0.2.244:2375"]
ssl-ca = ""
ssl-cert = ""
ssl-key = ""
insecure-skip-verify = true
