
Add Docker Swarm Autoscale support #1497
Merged: nathanielc merged 1 commit into master on Aug 4, 2017

Conversation

nathanielc (Contributor) commented Jul 25, 2017

This PR builds off #1292 and #1425. Since significant changes have been made to the way nodes are written, the changes merited a new PR.

@adityacs Can you take a look at the PR and possibly test it out in your environment? The PR is basically the same as what you already had but updated to reflect some major changes being made in Kapacitor.

I have tested this with my local Docker swarm cluster.

adityacs (Contributor) commented Jul 26, 2017

@nathanielc I'm getting this error:

kapacitor define containerscale_alert_stream -type stream -tick /opt/kap/container_cpuswarm_alert.tick -dbrp telegraf.autogen
cannot use the swarmAutoscale node, could not create swarm client: unknown swarm cluster "", cannot get client

I see in the code that by default it assumes a TLS-enabled Docker cluster. Is it possible to get a client for a Docker cluster without TLS?

nathanielc (Contributor, Author) commented Jul 26, 2017

@adityacs It's complaining that it can't find the cluster with the ID of the empty string "". Either set id = "" in the config or specify the cluster to use in the TICKscript, like |swarmAutoscale().cluster('id of cluster').

TLS is controlled by the protocol specified in the list of servers, i.e. whether it's http or https.
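
For example, a minimal TICKscript sketch of the two options (the cluster ID 'my-swarm' is a placeholder, not something from this PR; it has to match the id of a [[swarm]] section in the Kapacitor config, and whether TLS is used follows from whether that section's servers entries start with http:// or https://; the node's other required properties are omitted):

|swarmAutoscale()
    // use the [[swarm]] config section whose id is 'my-swarm';
    // leave .cluster() off only if the config section sets id = ""
    .cluster('my-swarm')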

adityacs (Contributor) commented Jul 26, 2017

@nathanielc Now I'm getting the error below:

[containerscale_alert_stream:swarm_autoscale9] 2017/07/26 20:34:54 D! setting replicas to 4 was 1 for "traefik"
[log] 2017/07/26 20:34:54 D! {"Name":"traefik","TaskTemplate":{"ContainerSpec":{"Image":"traefik:latest@sha256:e138501b457d2f5f5a9b22e11a2c558939308867b67310a127665e4aa4de09e0","Args":["--docker","--docker.swarmmode","--docker.domain=traefik","--docker.watch","--web","--loglevel=INFO"],"Mounts":[{"Type":"bind","Source":"/var/run/docker.sock","Target":"/var/run/docker.sock"}],"DNSConfig":{}},"Resources":{"Limits":{},"Reservations":{}},"Placement":{"Constraints":["node.role==manager"]},"Networks":[{"Target":"lzrhwdc5z14dm8vwp2kknez3z"}],"ForceUpdate":0},"Mode":{"Replicated":{"Replicas":4}},"EndpointSpec":{"Mode":"vip","Ports":[{"Protocol":"tcp","TargetPort":80,"PublishedPort":80,"PublishMode":"ingress"},{"Protocol":"tcp","TargetPort":8080,"PublishedPort":8080,"PublishMode":"ingress"}]}}

[httpd] ::1 - - [26/Jul/2017:20:34:54 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e1d4c76d-7241-11e7-8830-000000000000 3996
[containerscale_alert_stream:swarm_autoscale9] 2017/07/26 20:34:54 E! failed to apply scaling event: failed to set new replica count for "traefik": failed to understand swarm server error response: Code: 200
[httpd] ::1 - - [26/Jul/2017:20:34:55 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e2782e8c-7241-11e7-8831-000000000000 1252
[httpd] ::1 - - [26/Jul/2017:20:34:56 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e2b81fbc-7241-11e7-8832-000000000000 751
[httpd] ::1 - - [26/Jul/2017:20:34:56 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e2e70396-7241-11e7-8833-000000000000 359
[httpd] ::1 - - [26/Jul/2017:20:34:56 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e2ff1227-7241-11e7-8834-000000000000 2877
[httpd] ::1 - - [26/Jul/2017:20:34:56 +0000] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" e30312b2-7241-11e7-8835-000000000000 2599
[containerscale_alert_stream:log8] 2017/07/26 20:34:56 I!  {"name":"docker_container_cpu","group":"com.docker.swarm.service.name=traefik","dimensions":{"ByName":false,"TagNames":["com.docker.swarm.service.name"]},"fields":{"stat":0.069840101010101},"tags":{"com.docker.swarm.service.name":"traefik"},"time":"2017-07-26T20:37:20Z"}

[containerscale_alert_stream:swarm_autoscale9] 2017/07/26 20:34:56 D! setting replicas to 4 was 1 for "traefik"
[log] 2017/07/26 20:34:56 D! {"Name":"traefik","TaskTemplate":{"ContainerSpec":{"Image":"traefik:latest@sha256:e138501b457d2f5f5a9b22e11a2c558939308867b67310a127665e4aa4de09e0","Args":["--docker","--docker.swarmmode","--docker.domain=traefik","--docker.watch","--web","--loglevel=INFO"],"Mounts":[{"Type":"bind","Source":"/var/run/docker.sock","Target":"/var/run/docker.sock"}],"DNSConfig":{}},"Resources":{"Limits":{},"Reservations":{}},"Placement":{"Constraints":["node.role==manager"]},"Networks":[{"Target":"lzrhwdc5z14dm8vwp2kknez3z"}],"ForceUpdate":0},"Mode":{"Replicated":{"Replicas":4}},"EndpointSpec":{"Mode":"vip","Ports":[{"Protocol":"tcp","TargetPort":80,"PublishedPort":80,"PublishMode":"ingress"},{"Protocol":"tcp","TargetPort":8080,"PublishedPort":8080,"PublishMode":"ingress"}]}}

It's failing to read the Docker response and keeps retrying to apply the replica value. However, scaling is working in the Docker cluster. If I scale down manually in the Docker cluster, Kapacitor immediately retries the scale-up. I believe it should not try to apply the scale event if it has already scaled within the last 5 minutes or so.

nathanielc (Contributor, Author):

@adityacs What version of Docker are you running? It seems to be returning a different HTTP response code than expected. The code is written to consume the v1.30 API version.

I believe it should not try to apply the scale event if it has already scaled within last 5mins or so.

This is possible via the .increaseCooldown() and .decreaseCooldown() properties. See https://docs.influxdata.com/kapacitor/v1.3/nodes/k8s_autoscale_node/#decreasecooldown
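
For example, a hedged sketch of where those properties sit on the node (the 5m durations are illustrative only, and the rest of the node's configuration is omitted):

|swarmAutoscale()
    // suppress further scale-up events for 5m after an increase
    .increaseCooldown(5m)
    // suppress further scale-down events for 5m after a decrease
    .decreaseCooldown(5m)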

adityacs (Contributor) commented Jul 28, 2017

@nathanielc I am using Docker version 17.06. The API version is also v1.30.

Since the response is not handled correctly, irrespective of the .increaseCooldown() and .decreaseCooldown() settings it still tries to apply the event repeatedly for every .period().

nathanielc (Contributor, Author):

@adityacs I have pushed up an update that should fix the error.

adityacs (Contributor):

@nathanielc The error is fixed now. However, I am not seeing the expected behavior for scaling.

  1. Kapacitor scales the containers to a number of replicas, say 4.
  2. Then I manually scale the containers back down to 1.
  3. I expect the containers to be scaled up again since the "target" is reached.

Kapacitor should try to scale the containers, right? What is the expected behavior here without increase/decrease cooldowns?

nathanielc (Contributor, Author):

@adityacs Hmm, what is the reason for manually scaling the service? For k8s, Kapacitor assumes that it has full control over the scaling factor, so it caches it instead of making an API call to k8s on every evaluation of the node. As a result, if the autoscale node doesn't make any changes to the scale factor, no new API calls are made to the system to set a new scaling value. Docker Swarm has inherited that assumption.

Also, in a live system the autoscaling is probably set up as a feedback loop. If you manually change the scale factor, the feedback loop will likely adjust accordingly.

We could add some kind of timeout for the cached values and periodically check, but that seems like it would create a confusing user experience. Thoughts?

adityacs (Contributor):

@nathanielc Makes sense. It's good to give full control to Kapacitor. However, if we add a cache timeout, we could make it a node property the user can set, e.g. .cacheTimeout(). What do you say?
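
Hypothetically, the proposed property could read like this in a TICKscript (the name .cacheTimeout() is only the suggestion above, not something implemented in this PR, and the 5m value is illustrative):

|swarmAutoscale()
    // proposed: discard the cached replica count and re-read it from the cluster after 5m
    .cacheTimeout(5m)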

adityacs (Contributor):

@nathanielc It's working as expected, even when I try it with .increaseCooldown() and .decreaseCooldown().

Thanks :)

nathanielc (Contributor, Author):

However, if we include cache timeout, we can make it as node property where user can set .cacheTimeout(). What say?

Sounds reasonable to me. Let's do it in a separate PR. Do you want to take a crack at it?

I'll clean up this PR, rebase it, etc., and get it merged.

adityacs (Contributor) commented Aug 1, 2017

Sounds reasonable to me. Let's do it in a separate PR. Do you want to take a crack at it?

I need some time, but I will work on this.

Thanks for this PR.

adityacs (Contributor) commented Aug 1, 2017

@nathanielc One more thing: is there any plan to support autoscaling of EC2 or GCE instances, or do you think that's beyond the scope of Kapacitor?

nathanielc (Contributor, Author):

@adityacs There are no current plans, but if someone wanted to add them it's much easier now that the autoscale node is generalized.

nathanielc requested a review from desa on August 1, 2017 at 18:07
nathanielc (Contributor, Author):

@desa Could you give this a quick review?

adityacs (Contributor) commented Aug 1, 2017

There are no current plans but if someone wanted to add them its much easier now as the autoscale node is generalized.

@nathanielc Cool. I'd like to work on this and will let you know my progress.

desa (Contributor) commented Aug 2, 2017

@nathanielc I'm wondering if there should just be a generic autoscale node that acts like the alert node does.

It would look something like this:

|autoscale()
    .k8s()
        .resourceNameTag('deployment')

or

|autoscale()
    .swarm()
        .serviceNameTag('deployment')

The way it is currently implemented is a bit different from what I have seen elsewhere in the Kapacitor code base (though that could just be me not being 100% up to speed with how all of the nodes are implemented).

It'd also open up the ability to add more autoscaling "handlers" in a pattern similar to the alert handlers.

I think it would also be possible to make this change while keeping backwards compatibility with k8sAutoscale by implementing it as a macro, like deadman.

Thoughts?

nathanielc (Contributor, Author):

@desa I dislike the way the alert handlers work, because everything in TICKscript is at most two levels deep, node -> property, except for the alert handlers, which are three deep: node -> handler -> property. I dislike that handlers are different from everything else, and I don't want to add more like them.

Also, the difference here is that it makes sense to have multiple handlers for a single alert, while it doesn't make as much sense to have multiple autoscale handlers, since it's likely you are only using one.

So, in short I want it to remain the way it is.

desa (Contributor) commented Aug 2, 2017

Makes sense.

desa (Contributor) left a review:

Just a couple questions.

	New int
}

type AutoscaleNode struct {

Contributor: Is there a reason why ResourceID, Autoscaler, AutoscaleNode are exported?

nathanielc (Contributor, Author): Nope, other than that's how other nodes work.

In autoscale.go (outdated):
	currentField string
}

// Create a new AutoscaleNode which can trigger autoscale event for a Kubernetes cluster.

Contributor: Comment needs updating; it's not just Kubernetes any more :)

	ID string `toml:"id" override:"id"`
	Servers []string `toml:"servers" override:"servers"`
	// Path to CA file
	SSLCA string `toml:"ssl-ca" override:"ssl-ca"`

Contributor: Are all of these okay to expose via the API? I see that the other services do this as well, but it feels like there might be some security issue here. Am I wrong?

nathanielc (Contributor, Author): These are just paths to files on disk. That should not be sensitive information.

}

func (s *Cluster) Client() (client.Client, error) {
	config := s.configValue.Load().(Config)

Contributor: Nit: I like the pattern of having a method that does the casting for you, e.g.

func (s *Service) config() Config {
	return s.configValue.Load().(Config)
}

func NewCluster(c Config, l *log.Logger) (*Cluster, error) {
	clientConfig, err := c.ClientConfig()
	if err != nil {
		return nil, errors.Wrap(err, "failed to create k8s client config")

adityacs (Contributor), Aug 3, 2017: this should be "failed to create swarm client config"?

	}
	cli, err := client.New(clientConfig)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create k8s client")

Contributor: this should be "failed to create swarm client"?

nathanielc (Contributor, Author):

Thanks @desa and @adityacs for the review. I have addressed the comments.

nathanielc merged commit 5dd2183 into master on Aug 4, 2017.
nathanielc added a commit that referenced this pull request on Aug 4, 2017: Add Docker Swarm Autoscale support.
nathanielc deleted the nc-autoscale branch on August 4, 2017 at 15:25.
cfavacho commented:
@nathanielc I am using Kapacitor version 1.5.1, and swarmAutoscale() is not working. Whether I enable or disable the swarm service in the configuration file, I always get the message: unknown swarm cluster "".

My configuration file:

[[swarm]]
enabled = true
id = "CL"
servers = ["http://10.0.2.244:2375"]
ssl-ca = ""
ssl-cert = ""
ssl-key = ""
insecure-skip-verify = true
