Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

sync-loop context deadline exceeded while pushing into git a lot of changes #1857

Closed
dananichev opened this issue Mar 25, 2019 · 36 comments
Closed
Labels

Comments

@dananichev
Copy link

Once in a while we are forced to recreate all manifests inside git repository (limits tweaks, new pods and containers and so on). This leads to performing automated release by flux based on tag annotation for every manifest inside git repository (i believe we have ~1000 manifests or so on; maybe less). Which in turn leads to GIT sync. Which fails.

git commit: running git command: git [commit --no-verify -a -m Auto-release multiple images\n\n - gitlab.***.ru:4567/***:latest-4985eb31bb68b72c377f44dd3905c4213eba2f01\n - gitlab.***.ru:4567/***:latest-e92483f1319148a21624249f472d8b885fbedbe4\n - gitlab.***.ru:4567/***:latest-10a671033011403ce0ebad6701ccb96d9a5a0fb6\n - gitlab.***.ru:4567/***:latest-dee8e41f3f2a1413954050356cfc3c68112715aa\n - gitlab.***.ru:4567/***:latest-ed6b6773ae051e9808ddd837b80a3e4177973cec\n - gitlab.***.ru:4567/***:latest-eb816707b76f8589dc4b667331e1c655f4966696\n - gitlab.***.ru:4567/***:latest-34eb94841504daa85785910e60c1e2e7cdae439b\n]: context deadline exceeded

So far the only workaround i found:

  • mark every deployment as ignored/not automated
  • wait for flux to sync this
  • enable for some deployments automation
  • wait for sync and release
  • repeat previous 2 steps until there are no ignored/not automated deployments left

This workaround takes a lot of time to "fix" this issue.

Is there anything i can do to permamently fix it?

Thanks,
Dmitry

@dananichev dananichev changed the title sync-loop context deadline exceeded while pushing into git sync-loop context deadline exceeded while pushing into git a lot of changes Mar 25, 2019
@squaremo
Copy link
Member

So it's a git invocation that's timing out? You can try setting --git-timeout to a value which is long enough to admit the problematic operation. That's git.timeout for the chart. I think you'll need fluxd 1.11.0 for it to affect all operations.

@dananichev
Copy link
Author

@squaremo i tried setting it up to 2-4 minutes with 1.10 and it wasn't working (i mean i still got context deadline exceeded). A week or so ago i updated flux to 1.11 but haven't tried --git-timeout again. I'll give it a try.

@hiddeco
Copy link
Member

hiddeco commented Mar 25, 2019

I think this may not work as to my knowledge all releases are made through jobs and the job provides its own context with 60 second timeout.

https://github.com/weaveworks/flux/blob/267e0115d1a2a74956c94d37e844ffa531ce9f9c/daemon/daemon.go#L255

https://github.com/weaveworks/flux/blob/267e0115d1a2a74956c94d37e844ffa531ce9f9c/daemon/daemon.go#L450

@dananichev
Copy link
Author

dananichev commented Mar 25, 2019

@squaremo @hiddeco yep it didn't work (at least 5 minutes timeout didn't help -- still got context deadline exceeded). Can i ask why is it happening? I mean, i can change all the files locally and commit & push them without any trouble (using 1 commit) but Flux, for some reason, can't?

@squaremo
Copy link
Member

It's a bit strange isn't it. I can speculate on some explanations:

  • fluxd is running with way fewer resources than you have locally -- maybe disk is really slow, for example
  • in the fluxd code, there's a timeout for a git operation that accidentally also includes some other I/O bound operation
  • or there's an O(n) batch of git operations, all of which individually don't exceed the timeout

Can we have the full log lines, please? (I mean with the files and line numbers -- you can still redact sensitive details :-)

@dananichev
Copy link
Author

Yeah, sure.

But i don't believe it is about disk. I mean it is running on NVMe at Hetzner.
But, still, i can't be sure.

ts=2019-03-25T11:16:40.532730785Z caller=images.go:79 component=sync-loop service=*-production:deployment/lk container=lk repo=gitlab.**.ru:4567/** pattern=glob:* current=gitlab.**.ru:4567/**:latest info="added update to automation run" new=gitlab.**.ru:4567/**:latest-e92483f1319148a21624249f472d8b885fbedbe4 reason="latest latest-e92483f1319148a21624249f472d8b885fbedbe4 (2019-03-07 07:18:08.484928475 +0000 UTC) > current latest (filtered out or missing)"

ts=2019-03-25T11:16:40.533037817Z caller=loop.go:111 component=sync-loop jobID=a93d73a8-ba94-39c2-e951-e5fb70a21488 state=in-progress

ts=2019-03-25T11:16:59.582899707Z caller=releaser.go:58 component=sync-loop jobID=a93d73a8-ba94-39c2-e951-e5fb70a21488 type=release updates=520

ts=2019-03-25T11:20:55.051384481Z caller=loop.go:121 component=sync-loop jobID=a93d73a8-ba94-39c2-e951-e5fb70a21488 state=done success=false err="git commit: running git command: git [commit --no-verify -a -m Auto-release multiple images\n\n - gitlab.**.ru:4567/**:latest-4985eb31bb68b72c377f44dd3905c4213eba2f01\n - gitlab.**.ru:4567/**:latest-e92483f1319148a21624249f472d8b885fbedbe4\n - gitlab.**.ru:4567/**:latest-10a671033011403ce0ebad6701ccb96d9a5a0fb6\n - gitlab.**.ru:4567/**:latest-dee8e41f3f2a1413954050356cfc3c68112715aa\n - gitlab.**.ru:4567/**:latest-ed6b6773ae051e9808ddd837b80a3e4177973cec\n - gitlab.**.ru:4567/**:latest-eb816707b76f8589dc4b667331e1c655f4966696\n - gitlab.**.ru:4567/**-34eb94841504daa85785910e60c1e2e7cdae439b\n]: context deadline exceeded"

ts=2019-03-25T11:20:56.333989043Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.**.ru:kubernetes/deployments.git branch=master HEAD=7442b949bedcc2baee2d1d104c627b38723886dd

ts=2019-03-25T11:20:56.33717172Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.**.ru:kubernetes/deployments.git branch=master HEAD=7442b949bedcc2baee2d1d104c627b38723886dd

ts=2019-03-25T11:21:07.944431291Z caller=sync.go:417 component=cluster method=Sync cmd=apply args= count=1793

ts=2019-03-25T11:22:31.308213673Z caller=sync.go:483 component=cluster method=Sync cmd="kubectl apply -f -" took=1m23.361746534s err=null output="..."

ts=2019-03-25T11:22:31.400132711Z caller=images.go:17 component=sync-loop msg="polling images"

@dananichev
Copy link
Author

dananichev commented Mar 25, 2019

@squaremo it is standard log output. If i can somehow increase verbosity let me know

@squaremo
Copy link
Member

OK, I have a better guess: as Hidde suggests, it's running into the job timeout. The reason it looks like git commit times out is that the timeout is implemented by using a context.Context, and the git commit is the first operation that uses the context after it's passed the deadline -- but it's not what's actually taking all the time.

I suspect it's invoking kubeyaml for each update that takes up the bulk of the time before it hits the deadline. We expose some Prometheus metrics on how much time each stage takes, so we can try looking at those. If you're not scraping the metrics @dananichev , just requesting http://<fluxd pod IP>:3030/metrics (or possibly http://<fluxd pod IP>:3031/metrics) would get the data.

@dananichev
Copy link
Author

@squaremo ok i'll into it in an hour. Prometheus is not running in our cluster yet. Initial setup of Prometheus and Grafana was quite expensive and we decided to roll it back for a time being.

@dananichev
Copy link
Author

@squaremo

# HELP flux_cache_request_duration_seconds Duration of cache requests, in seconds.
# TYPE flux_cache_request_duration_seconds histogram
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.005"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.01"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.025"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.05"} 115163
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.1"} 116256
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.25"} 116259
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.5"} 116259
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="1"} 116259
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="2.5"} 116259
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="5"} 116259
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="10"} 116259
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="+Inf"} 116259
flux_cache_request_duration_seconds_sum{method="GetKey",success="true"} 3097.0183910479727
flux_cache_request_duration_seconds_count{method="GetKey",success="true"} 116259
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.005"} 0
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.01"} 0
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.025"} 0
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.05"} 2855
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.1"} 3950
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.25"} 3955
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.5"} 3955
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="1"} 3955
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="2.5"} 3955
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="5"} 3955
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="10"} 3955
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="+Inf"} 3955
flux_cache_request_duration_seconds_sum{method="SetKey",success="true"} 142.7382281970004
flux_cache_request_duration_seconds_count{method="SetKey",success="true"} 3955
# HELP flux_client_fetch_duration_seconds Duration of remote image metadata requests, in seconds
# TYPE flux_client_fetch_duration_seconds histogram
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.005"} 0
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.01"} 0
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.025"} 1
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.05"} 8
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.1"} 31
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.25"} 239
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.5"} 683
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="1"} 915
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="2.5"} 930
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="5"} 930
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="10"} 930
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="+Inf"} 930
flux_client_fetch_duration_seconds_sum{kind="metadata",success="true"} 380.83237022999987
flux_client_fetch_duration_seconds_count{kind="metadata",success="true"} 930
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.005"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.01"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.025"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.05"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.1"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.25"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.5"} 2659
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="1"} 2661
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="2.5"} 3024
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="5"} 3024
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="10"} 3025
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="+Inf"} 3025
flux_client_fetch_duration_seconds_sum{kind="tags",success="true"} 1482.5760372919947
flux_client_fetch_duration_seconds_count{kind="tags",success="true"} 3025
# HELP flux_daemon_job_duration_seconds Duration of job execution, in seconds.
# TYPE flux_daemon_job_duration_seconds histogram
flux_daemon_job_duration_seconds_bucket{success="false",le="0.1"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="0.5"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="1"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="2"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="5"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="10"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="15"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="20"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="30"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="45"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="60"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="120"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="+Inf"} 32
flux_daemon_job_duration_seconds_sum{success="false"} 8493.444840978
flux_daemon_job_duration_seconds_count{success="false"} 32
# HELP flux_daemon_queue_duration_seconds Duration of time spent in the job queue before execution, in seconds.
# TYPE flux_daemon_queue_duration_seconds histogram
flux_daemon_queue_duration_seconds_bucket{le="0.1"} 19
flux_daemon_queue_duration_seconds_bucket{le="0.5"} 19
flux_daemon_queue_duration_seconds_bucket{le="1"} 19
flux_daemon_queue_duration_seconds_bucket{le="2"} 19
flux_daemon_queue_duration_seconds_bucket{le="5"} 19
flux_daemon_queue_duration_seconds_bucket{le="10"} 19
flux_daemon_queue_duration_seconds_bucket{le="15"} 19
flux_daemon_queue_duration_seconds_bucket{le="20"} 19
flux_daemon_queue_duration_seconds_bucket{le="30"} 19
flux_daemon_queue_duration_seconds_bucket{le="45"} 19
flux_daemon_queue_duration_seconds_bucket{le="60"} 19
flux_daemon_queue_duration_seconds_bucket{le="120"} 31
flux_daemon_queue_duration_seconds_bucket{le="+Inf"} 32
flux_daemon_queue_duration_seconds_sum 1406.877209523
flux_daemon_queue_duration_seconds_count 32
# HELP flux_daemon_queue_length_count Count of jobs waiting in the queue to be run.
# TYPE flux_daemon_queue_length_count gauge
flux_daemon_queue_length_count 1
# HELP flux_daemon_sync_duration_seconds Duration of git-to-cluster synchronisation, in seconds.
# TYPE flux_daemon_sync_duration_seconds histogram
flux_daemon_sync_duration_seconds_bucket{success="false",le="0.5"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="5"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="10"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="20"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="30"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="40"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="50"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="60"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="75"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="90"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="120"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="240"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="+Inf"} 1
flux_daemon_sync_duration_seconds_sum{success="false"} 0.000105993
flux_daemon_sync_duration_seconds_count{success="false"} 1
flux_daemon_sync_duration_seconds_bucket{success="true",le="0.5"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="5"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="10"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="20"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="30"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="40"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="50"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="60"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="75"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="90"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="120"} 22
flux_daemon_sync_duration_seconds_bucket{success="true",le="240"} 28
flux_daemon_sync_duration_seconds_bucket{success="true",le="+Inf"} 28
flux_daemon_sync_duration_seconds_sum{success="true"} 3415.8765745569995
flux_daemon_sync_duration_seconds_count{success="true"} 28
# HELP flux_fluxsvc_release_duration_seconds Release method duration in seconds.
# TYPE flux_fluxsvc_release_duration_seconds histogram
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.005"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.01"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.025"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.05"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.1"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.25"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.5"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="1"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="2.5"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="5"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="10"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="+Inf"} 32
flux_fluxsvc_release_duration_seconds_sum{release_kind="execute",release_type="automated",success="true"} 8484.14158376
flux_fluxsvc_release_duration_seconds_count{release_kind="execute",release_type="automated",success="true"} 32
# HELP flux_fluxsvc_release_stage_duration_seconds Duration in seconds of each stage of a release, including dry-runs.
# TYPE flux_fluxsvc_release_stage_duration_seconds histogram
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.005"} 12
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.01"} 31
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.025"} 31
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.05"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.1"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.25"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.5"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="1"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="2.5"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="5"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="10"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="+Inf"} 32
flux_fluxsvc_release_stage_duration_seconds_sum{stage="verify_changes"} 0.20723454300000005
flux_fluxsvc_release_stage_duration_seconds_count{stage="verify_changes"} 32
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.005"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.01"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.025"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.05"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.1"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.25"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.5"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="1"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="2.5"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="5"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="10"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="+Inf"} 32
flux_fluxsvc_release_stage_duration_seconds_sum{stage="write_changes"} 7564.896252170999
flux_fluxsvc_release_stage_duration_seconds_count{stage="write_changes"} 32
# HELP flux_registry_fetch_duration_seconds Duration of image metadata requests (from cache), in seconds.
# TYPE flux_registry_fetch_duration_seconds histogram
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.005"} 0
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.01"} 0
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.025"} 0
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.05"} 668
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.1"} 704
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.25"} 704
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.5"} 704
flux_registry_fetch_duration_seconds_bucket{success="true",le="1"} 704
flux_registry_fetch_duration_seconds_bucket{success="true",le="2.5"} 704
flux_registry_fetch_duration_seconds_bucket{success="true",le="5"} 704
flux_registry_fetch_duration_seconds_bucket{success="true",le="10"} 704
flux_registry_fetch_duration_seconds_bucket{success="true",le="+Inf"} 704
flux_registry_fetch_duration_seconds_sum{success="true"} 20.767516113000013
flux_registry_fetch_duration_seconds_count{success="true"} 704
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.988e-05
go_gc_duration_seconds{quantile="0.25"} 0.000105973
go_gc_duration_seconds{quantile="0.5"} 0.000150623
go_gc_duration_seconds{quantile="0.75"} 0.000236643
go_gc_duration_seconds{quantile="1"} 0.008118526
go_gc_duration_seconds_sum 1.385224013
go_gc_duration_seconds_count 2155
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 26
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.10.8"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.076602e+08
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 4.5355103328e+10
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 3.789842e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 4.59431307e+08
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0.0016254520271533808
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 8.531968e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.076602e+08
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 8.6253568e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.22970112e+08
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 1.189465e+06
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 1.4139392e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 2.0922368e+08
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5535255036875296e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 706759
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 4.60620772e+08
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 3472
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 2.16448e+06
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 2.932736e+06
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.47552016e+08
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.107174e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.277952e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.277952e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 2.26879736e+08
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 13
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 759.69
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 14
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.25943552e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.55351245634e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.55696896e+08
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

@squaremo
Copy link
Member

Nice, thank you @dananichev.

[...]
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="10"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="+Inf"} 32
flux_fluxsvc_release_stage_duration_seconds_sum{stage="write_changes"} 7564.896252170999
flux_fluxsvc_release_stage_duration_seconds_count{stage="write_changes"} 32

This is what I was looking for. This metric measures how long it takes to write the changes back into the files. The histogram buckets (the le attribute) top out at 10s, which is a bit unfortunate for us, since we won't be able to see anything more precise than "took longer than 10s". But it does at least confirm that every single job takes longer than 10s, and the mean is 7564.8/32 = 236 seconds, just about four minutes (which tallies with the log timestamps from above).

For now it'll be tricky to do anything about the writes individually taking about half a second -- the problem, briefly, is that fluxd execs a Python program to do it, because Python has the only round-tripping (i.e., comment-preserving) YAML parser I could find.

We could give you the opportunity to configure the job timeout (current hard-wired value: 60s). This would be a bit of a sticking plaster solution, and may have side-effects, like delaying syncs.
More vaguely, perhaps automated updates could be partitioned so it's never trying to do the whole set of files at a time. (But partitioned how? Unclear).

@2opremio
Copy link
Contributor

Some other potential solutions:

  1. Selectively set the job deadline based on the size of the task to be performed (setting the image of a workload is going to take much shorter than setting the image of 20 workloads)
  2. Make the jobs shorter (through parallelization). Ordering ranks aside, we could apply the upgrades and sync actions in parallel.

@2opremio 2opremio added the bug label Mar 25, 2019
@dananichev
Copy link
Author

I guess, for now the only solution is to use the same workaround i described in first post?

@squaremo
Copy link
Member

I guess, for now the only solution is to use the same workaround i described in first post?

Yes, in currently released Flux, I think that's the only workaround -- sorry for the lump of extra work :-(

If you can find some way to ratchet forward the images in manifests when they are created (when "we are forced to recreate all manifests inside git repository" happens), so they are less likely to have an upgrade when deployed, that would help too of course.

@squaremo
Copy link
Member

Some other potential solutions:

  1. Selectively set the job deadline based on the size of the task to be performed (setting the image of a workload is going to take much shorter than setting the image of 20 workloads)

We don't always know the size of a job before it's started, since calculating what to do is usually part of the job.

Automated updates are an exception -- it figures some things out before it queues the job; so it could give the job longer to run, if we wanted.

  1. Make the jobs shorter (through parallelization). Ordering ranks aside, we could apply the upgrades and sync actions in parallel.

I guess we could parallelise the update of distinct files, by adding a bit of complexity.

@dananichev
Copy link
Author

dananichev commented Mar 28, 2019

@squaremo i guess the proper way would be to use some kind of yaml parser and change only whose values we need but we have not enough resources for this for now. As a workaround i've set half-manually all images to latest versions. But i hope there will some kind of native solution (from Flux) in future. And for now i will try to mitigate this on my side also.

@dananichev
Copy link
Author

@squaremo now i've got new error:

...
ts=2019-03-31T17:31:08.168210362Z caller=images.go:79 component=sync-loop service=pgs-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.168262474Z caller=images.go:79 component=sync-loop service=trzh-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.168312371Z caller=images.go:79 component=sync-loop service=yaprav1-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.168505733Z caller=images.go:79 component=sync-loop service=simp-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.168610363Z caller=images.go:79 component=sync-loop service=yamalec-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)"
ts=2019-03-31T17:31:08.170085374Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.**.ru:*/*.git branch=master HEAD=bf833d219f0025ba21bc3efc977de99decb6aab8
ts=2019-03-31T17:31:08.170167927Z caller=loop.go:111 component=sync-loop jobID=ade62567-e227-47e9-2636-6b41c235775d state=in-progress
ts=2019-03-31T17:31:22.631248812Z caller=releaser.go:58 component=sync-loop jobID=ade62567-e227-47e9-2636-6b41c235775d type=release updates=136
ts=2019-03-31T17:32:03.452508149Z caller=loop.go:121 component=sync-loop jobID=ade62567-e227-47e9-2636-6b41c235775d state=done success=false err="fork/exec /usr/bin/git: argument list too long"

... - a lot of similar to ts=2019-03-31T17:31:08.168210362Z caller=images.go:79 component=sync-loop service=pgs-production:deployment/work container=work repo=gitlab.**.ru:4567/**/clients pattern=glob:* current=gitlab.**.ru:4567/**/clients:latest-34eb94841504daa85785910e60c1e2e7cdae439b info="added update to automation run" new=gitlab.**.ru:4567/**/clients:latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 reason="latest latest-3b4fa4cbce9dba2fa6ac0aa1b5cab25f4e7ebb74 (2019-03-31 15:20:54.332303657 +0000 UTC) > current latest-34eb94841504daa85785910e60c1e2e7cdae439b (2019-03-22 07:58:16.913909237 +0000 UTC)" messages

And an error is ts=2019-03-31T17:32:03.452508149Z caller=loop.go:121 component=sync-loop jobID=ade62567-e227-47e9-2636-6b41c235775d state=done success=false err="fork/exec /usr/bin/git: argument list too long"

I didn't reset git repo's state or anything like that. Just usual Flux workflow (build new image -> push into registry -> wait for an update).

Metrics:


# HELP flux_cache_request_duration_seconds Duration of cache requests, in seconds.
# TYPE flux_cache_request_duration_seconds histogram
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="0.005"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="0.01"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="0.025"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="0.05"} 51
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="0.1"} 51
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="0.25"} 51
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="0.5"} 51
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="1"} 51
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="2.5"} 54
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="5"} 54
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="10"} 54
flux_cache_request_duration_seconds_bucket{method="GetKey",success="false",le="+Inf"} 54
flux_cache_request_duration_seconds_sum{method="GetKey",success="false"} 4.32377223
flux_cache_request_duration_seconds_count{method="GetKey",success="false"} 54
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.005"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.01"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.025"} 0
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.05"} 2.122619e+06
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.1"} 2.142474e+06
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.25"} 2.14255e+06
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="0.5"} 2.142576e+06
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="1"} 2.142579e+06
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="2.5"} 2.142581e+06
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="5"} 2.142581e+06
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="10"} 2.142583e+06
flux_cache_request_duration_seconds_bucket{method="GetKey",success="true",le="+Inf"} 2.142584e+06
flux_cache_request_duration_seconds_sum{method="GetKey",success="true"} 56253.59087768201
flux_cache_request_duration_seconds_count{method="GetKey",success="true"} 2.142584e+06
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.005"} 0
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.01"} 0
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.025"} 0
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.05"} 38875
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.1"} 56605
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.25"} 56629
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="0.5"} 56631
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="1"} 56632
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="2.5"} 56632
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="5"} 56632
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="10"} 56632
flux_cache_request_duration_seconds_bucket{method="SetKey",success="true",le="+Inf"} 56632
flux_cache_request_duration_seconds_sum{method="SetKey",success="true"} 2078.227767002015
flux_cache_request_duration_seconds_count{method="SetKey",success="true"} 56632
# HELP flux_client_fetch_duration_seconds Duration of remote image metadata requests, in seconds
# TYPE flux_client_fetch_duration_seconds histogram
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.005"} 0
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.01"} 0
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.025"} 149
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.05"} 252
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.1"} 354
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.25"} 730
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="0.5"} 1666
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="1"} 2161
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="2.5"} 2185
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="5"} 2185
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="10"} 2185
flux_client_fetch_duration_seconds_bucket{kind="metadata",success="true",le="+Inf"} 2185
flux_client_fetch_duration_seconds_sum{kind="metadata",success="true"} 802.6345978859997
flux_client_fetch_duration_seconds_count{kind="metadata",success="true"} 2185
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="0.005"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="0.01"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="0.025"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="0.05"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="0.1"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="0.25"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="0.5"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="1"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="2.5"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="5"} 1
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="10"} 1
flux_client_fetch_duration_seconds_bucket{kind="tags",success="false",le="+Inf"} 1
flux_client_fetch_duration_seconds_sum{kind="tags",success="false"} 3.555996268
flux_client_fetch_duration_seconds_count{kind="tags",success="false"} 1
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.005"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.01"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.025"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.05"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.1"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.25"} 0
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="0.5"} 47897
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="1"} 47950
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="2.5"} 54436
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="5"} 54437
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="10"} 54441
flux_client_fetch_duration_seconds_bucket{kind="tags",success="true",le="+Inf"} 54446
flux_client_fetch_duration_seconds_sum{kind="tags",success="true"} 25808.558632211032
flux_client_fetch_duration_seconds_count{kind="tags",success="true"} 54446
# HELP flux_daemon_job_duration_seconds Duration of job execution, in seconds.
# TYPE flux_daemon_job_duration_seconds histogram
flux_daemon_job_duration_seconds_bucket{success="false",le="0.1"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="0.5"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="1"} 0
flux_daemon_job_duration_seconds_bucket{success="false",le="2"} 6
flux_daemon_job_duration_seconds_bucket{success="false",le="5"} 10
flux_daemon_job_duration_seconds_bucket{success="false",le="10"} 10
flux_daemon_job_duration_seconds_bucket{success="false",le="15"} 10
flux_daemon_job_duration_seconds_bucket{success="false",le="20"} 10
flux_daemon_job_duration_seconds_bucket{success="false",le="30"} 10
flux_daemon_job_duration_seconds_bucket{success="false",le="45"} 10
flux_daemon_job_duration_seconds_bucket{success="false",le="60"} 30
flux_daemon_job_duration_seconds_bucket{success="false",le="120"} 30
flux_daemon_job_duration_seconds_bucket{success="false",le="+Inf"} 30
flux_daemon_job_duration_seconds_sum{success="false"} 1140.3892538320001
flux_daemon_job_duration_seconds_count{success="false"} 30
flux_daemon_job_duration_seconds_bucket{success="true",le="0.1"} 0
flux_daemon_job_duration_seconds_bucket{success="true",le="0.5"} 0
flux_daemon_job_duration_seconds_bucket{success="true",le="1"} 0
flux_daemon_job_duration_seconds_bucket{success="true",le="2"} 0
flux_daemon_job_duration_seconds_bucket{success="true",le="5"} 16
flux_daemon_job_duration_seconds_bucket{success="true",le="10"} 24
flux_daemon_job_duration_seconds_bucket{success="true",le="15"} 33
flux_daemon_job_duration_seconds_bucket{success="true",le="20"} 35
flux_daemon_job_duration_seconds_bucket{success="true",le="30"} 35
flux_daemon_job_duration_seconds_bucket{success="true",le="45"} 35
flux_daemon_job_duration_seconds_bucket{success="true",le="60"} 35
flux_daemon_job_duration_seconds_bucket{success="true",le="120"} 35
flux_daemon_job_duration_seconds_bucket{success="true",le="+Inf"} 35
flux_daemon_job_duration_seconds_sum{success="true"} 264.4678252550001
flux_daemon_job_duration_seconds_count{success="true"} 35
# HELP flux_daemon_queue_duration_seconds Duration of time spent in the job queue before execution, in seconds.
# TYPE flux_daemon_queue_duration_seconds histogram
flux_daemon_queue_duration_seconds_bucket{le="0.1"} 44
flux_daemon_queue_duration_seconds_bucket{le="0.5"} 44
flux_daemon_queue_duration_seconds_bucket{le="1"} 44
flux_daemon_queue_duration_seconds_bucket{le="2"} 44
flux_daemon_queue_duration_seconds_bucket{le="5"} 44
flux_daemon_queue_duration_seconds_bucket{le="10"} 50
flux_daemon_queue_duration_seconds_bucket{le="15"} 51
flux_daemon_queue_duration_seconds_bucket{le="20"} 51
flux_daemon_queue_duration_seconds_bucket{le="30"} 51
flux_daemon_queue_duration_seconds_bucket{le="45"} 51
flux_daemon_queue_duration_seconds_bucket{le="60"} 51
flux_daemon_queue_duration_seconds_bucket{le="120"} 57
flux_daemon_queue_duration_seconds_bucket{le="+Inf"} 66
flux_daemon_queue_duration_seconds_sum 3151.204196682
flux_daemon_queue_duration_seconds_count 66
# HELP flux_daemon_queue_length_count Count of jobs waiting in the queue to be run.
# TYPE flux_daemon_queue_length_count gauge
flux_daemon_queue_length_count 0
# HELP flux_daemon_sync_duration_seconds Duration of git-to-cluster synchronisation, in seconds.
# TYPE flux_daemon_sync_duration_seconds histogram
flux_daemon_sync_duration_seconds_bucket{success="false",le="0.5"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="5"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="10"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="20"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="30"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="40"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="50"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="60"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="75"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="90"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="120"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="240"} 1
flux_daemon_sync_duration_seconds_bucket{success="false",le="+Inf"} 1
flux_daemon_sync_duration_seconds_sum{success="false"} 0.000790806
flux_daemon_sync_duration_seconds_count{success="false"} 1
flux_daemon_sync_duration_seconds_bucket{success="true",le="0.5"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="5"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="10"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="20"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="30"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="40"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="50"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="60"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="75"} 0
flux_daemon_sync_duration_seconds_bucket{success="true",le="90"} 7
flux_daemon_sync_duration_seconds_bucket{success="true",le="120"} 163
flux_daemon_sync_duration_seconds_bucket{success="true",le="240"} 694
flux_daemon_sync_duration_seconds_bucket{success="true",le="+Inf"} 984
flux_daemon_sync_duration_seconds_sum{success="true"} 201993.17092722794
flux_daemon_sync_duration_seconds_count{success="true"} 984
# HELP flux_fluxsvc_release_duration_seconds Release method duration in seconds.
# TYPE flux_fluxsvc_release_duration_seconds histogram
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.005"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.01"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.025"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.05"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.1"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.25"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="0.5"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="1"} 0
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="2.5"} 28
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="5"} 34
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="10"} 39
flux_fluxsvc_release_duration_seconds_bucket{release_kind="execute",release_type="automated",success="true",le="+Inf"} 65
flux_fluxsvc_release_duration_seconds_sum{release_kind="execute",release_type="automated",success="true"} 1293.7839178159998
flux_fluxsvc_release_duration_seconds_count{release_kind="execute",release_type="automated",success="true"} 65
# HELP flux_fluxsvc_release_stage_duration_seconds Duration in seconds of each stage of a release, including dry-runs.
# TYPE flux_fluxsvc_release_stage_duration_seconds histogram
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.005"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.01"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.025"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.05"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.1"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.25"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="0.5"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="1"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="2.5"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="5"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="10"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="verify_changes",le="+Inf"} 65
flux_fluxsvc_release_stage_duration_seconds_sum{stage="verify_changes"} 0.15697465600000007
flux_fluxsvc_release_stage_duration_seconds_count{stage="verify_changes"} 65
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.005"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.01"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.025"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.05"} 0
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.1"} 2
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.25"} 18
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="0.5"} 20
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="1"} 20
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="2.5"} 24
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="5"} 24
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="10"} 35
flux_fluxsvc_release_stage_duration_seconds_bucket{stage="write_changes",le="+Inf"} 55
flux_fluxsvc_release_stage_duration_seconds_sum{stage="write_changes"} 905.6846780740002
flux_fluxsvc_release_stage_duration_seconds_count{stage="write_changes"} 55
# HELP flux_registry_fetch_duration_seconds Duration of image metadata requests (from cache), in seconds.
# TYPE flux_registry_fetch_duration_seconds histogram
flux_registry_fetch_duration_seconds_bucket{success="false",le="0.005"} 0
flux_registry_fetch_duration_seconds_bucket{success="false",le="0.01"} 0
flux_registry_fetch_duration_seconds_bucket{success="false",le="0.025"} 0
flux_registry_fetch_duration_seconds_bucket{success="false",le="0.05"} 2
flux_registry_fetch_duration_seconds_bucket{success="false",le="0.1"} 2
flux_registry_fetch_duration_seconds_bucket{success="false",le="0.25"} 2
flux_registry_fetch_duration_seconds_bucket{success="false",le="0.5"} 2
flux_registry_fetch_duration_seconds_bucket{success="false",le="1"} 2
flux_registry_fetch_duration_seconds_bucket{success="false",le="2.5"} 3
flux_registry_fetch_duration_seconds_bucket{success="false",le="5"} 3
flux_registry_fetch_duration_seconds_bucket{success="false",le="10"} 3
flux_registry_fetch_duration_seconds_bucket{success="false",le="+Inf"} 3
flux_registry_fetch_duration_seconds_sum{success="false"} 1.0534498350000001
flux_registry_fetch_duration_seconds_count{success="false"} 3
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.005"} 0
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.01"} 0
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.025"} 0
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.05"} 26489
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.1"} 27800
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.25"} 27801
flux_registry_fetch_duration_seconds_bucket{success="true",le="0.5"} 27803
flux_registry_fetch_duration_seconds_bucket{success="true",le="1"} 27804
flux_registry_fetch_duration_seconds_bucket{success="true",le="2.5"} 27804
flux_registry_fetch_duration_seconds_bucket{success="true",le="5"} 27804
flux_registry_fetch_duration_seconds_bucket{success="true",le="10"} 27804
flux_registry_fetch_duration_seconds_bucket{success="true",le="+Inf"} 27804
flux_registry_fetch_duration_seconds_sum{success="true"} 786.7898362279985
flux_registry_fetch_duration_seconds_count{success="true"} 27804
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.292e-05
go_gc_duration_seconds{quantile="0.25"} 5.5112e-05
go_gc_duration_seconds{quantile="0.5"} 7.7427e-05
go_gc_duration_seconds{quantile="0.75"} 0.000120305
go_gc_duration_seconds{quantile="1"} 0.006348916
go_gc_duration_seconds_sum 363.538281969
go_gc_duration_seconds_count 45013
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 26
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.10.8"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 3.1497968e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.292975255832e+12
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 5.72345e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 1.2727555998e+10
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0.0018631623963349941
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 1.441792e+07
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 3.1497968e+07
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 3.14605568e+08
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 4.4138496e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 228836
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 1.00130816e+08
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 3.58744064e+08
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5540539119980552e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 8.92662e+06
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 1.2727784834e+10
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 3472
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 804992
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 5.095424e+06
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 3.8294048e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.520574e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.015808e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.015808e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 3.86533624e+08
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 13
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 13445.21
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 13
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.91745792e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.55367474126e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.15350784e+08
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

@fquinner
Copy link

I have hit this issue too unfortunately when building out a new ~40 deployment environment. A fluxd job timeout is something that feels like it should be configurable anyway if that's possible (sticky plaster or expected behaviour), but yes ideally it would be faster where it wouldn't hit this timeout.

My workaround was to release workloads individually until they one by one populated the patchfile. Note I'm using patchfile which I thought was initially the problem but in the target environment a build only takes a second or two. It's not clear though how many components would need to be updated though to make this a problem, but it at least seems to be tracking individual upstream images OK.

@tomjohnburton
Copy link

tomjohnburton commented Jan 16, 2020

Hi all, I am also hitting this problem

flux-784ff966b7-twl5b flux {"caller":"releaser.go:59","component":"sync-loop","jobID":"6ff2f3be-25ff-876f-7c37-bb9daf9d3106","ts":"2020-01-16T13:46:31.916289112Z","type":"release","updates":19}
error executing generator command \"mkdir -p /root/.config/kustomize/plugin/alteos/v1/;\\ncp -R ../../docker/kustomize-plugins/alteos/v1/sopssecret/ /root/.config/kustomize/plugin/alteos/v1/sopssecret;\\n\" from file \".flux.yaml\": context deadline exceeded\nerror

What is the general suggestion for a fix in the meantime?

@2opremio
Copy link
Contributor

Have you tried increasing --sync-timeout ?

@tomjohnburton
Copy link

Sync timeout is at 5m0s at the moment. Should I go even higher?

@2opremio
Copy link
Contributor

Maybe .... It would be good to know where and why it's getting stuck though.

@tomjohnburton
Copy link

Everytime I try add the --timeout flag to the patch.yaml it crashes. Is this command valid?

@2opremio
Copy link
Contributor

I am not sure what you mean

@tomjohnburton
Copy link

tomjohnburton commented Jan 16, 2020

        - name: flux
          args:
            - --manifest-generation=true
            - --memcached-hostname=memcached.${DEPLOYMENT_NAMESPACE}
            - --memcached-service=
            - --sync-timeout=5m0s
            - --log-format=json
            - --ssh-keygen-dir=/var/fluxd/keygen
            - --git-branch=${GITLAB_BRANCH}
            - --git-path=kustomize/preview
            - --git-user=preview-user
            - [email protected]:alteos/infrastructure/cluster.git
            - --sops
            - --k8s-default-namespace=${DEPLOYMENT_NAMESPACE}
            - --git-poll-interval=2m0s
            - --automation-interval=2m0s
            - --registry-exclude-image=*docker.io/*,*quay.io/*
            - --k8s-allow-namespace=${DEPLOYMENT_NAMESPACE}
            - --timeout=5m0s <----- is this valid?

@hiddeco
Copy link
Member

hiddeco commented Jan 16, 2020

@tomjohnburton the flag you are looking for is --sync-timeout, see also: https://docs.fluxcd.io/en/stable/references/daemon.html

@tomjohnburton
Copy link

Okay I've put a 10minute timeout,

@tomjohnburton
Copy link

AMAZING, that worked. Thanks all

@nabadger
Copy link

nabadger commented Apr 5, 2020

Hi all,

We're seeing a similar issue now, but it seems to be specific only to automated releases.

i.e. as a sync without any automated releases works fine with the default sync-timeout value of 1m, but by enabling automated releases on 3 existing workloads, we're finding we need to bump it to 5m to avoid:

flux-apps-7859587679-gdkg2 flux ts=2020-04-03T17:09:07.72427424Z caller=loop.go:151 component=sync-loop jobID=7708b94c-98f7-397b-a746-fa5267eecf8b state=done success=false err="loading resources after updates: error executing generator command \"cat *.yaml\" from file \"../../../../.flux.yaml\": context deadline exceeded\nerror output:\n\ngenerated output:\n"

I'm wondering whether #2805 might help, since we're using this. I'll try it anyway and report back.

@nabadger
Copy link

nabadger commented Apr 6, 2020

I was looking into this a bit more, and wanted to understand whether this is related to kubeyaml or not.

As far as I can tell, during an release (release.go), we call WriteUpdates, which calls SetWorkloadContainerImage which calls updateWorkloadContainer which can then make a sys call to kubeyaml.

I was wondering what was passed into stdin here for kubeyaml - is it just based on the generators in .flux.yaml?

In our case, that is basically a cat *.yaml of all our rendered manifests.

We have a monorepo, so there's a lot.

  • 2.7mb
  • 282 resources, of which 25 are deployments
  • Out of these workloads we are automating 4 workloads so far.

I'm not sure I've got the inputs to kubeyaml correct, but here's my timings so far:

time cat *.yaml \
  | python build/kubeyaml.py image --namespace=$MY_NAMESPACE --kind=Deployment --name=$MY_WORKLOAD --container=main --image=$MY_UPDATED_IMAGE" \
  > output.yaml
cat *.yaml  0.01s user 0.01s system 0% cpu 8.750 total
python build/kubeyaml.py image --namespace=portal --kind=Deployment    >   9.33s user 0.04s system 99% cpu 9.387 total

In this example it's 9 seconds for a single deployment - it's possible this runs multiple time right?

When I was hitting the context-deadline, it was updating 3 workloads.

Also, this was on my laptop, but our flux instance probably has less cpu than my laptop to be honest, so it could be slower in-cluster :)

@nabadger
Copy link

nabadger commented Apr 22, 2020

Looked at our input to kubeyaml a little more.

One issue I'm seeing is that we're including all our CustomResourceDefinitions (as part of the cat *.yaml).

These files are large - especially the upstream prometheus ones.

Here's one such example: https://github.com/coreos/prometheus-operator/blob/master/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml

With CRD's

cat *.yaml  0.01s user 0.01s system 0% cpu 12.381 total

Without it's much faster :)

cat *.yaml  0.01s user 0.00s system 0% cpu 4.508 total

In our case, CRD's make up only 16 out of 283~ manifests to parse.

Should be easy enough for us to ignore them, but not sure whether kubeyaml could explicitly ignore CRDs too.


Noted that whilst kubeyaml already tries to filter by the passed in kind value, loading of the document is required prior to testing for it (which is the slow bit). Not sure if this can be done with some kind of 'peek' behavior (i.e. check the kind without having to load the whole doc).

@alex-shpak
Copy link
Contributor

Seems like we also now hit this issue,
In first it was timeout, so we increased sync-timeout to 5m, now it is

ts=2020-06-16T14:06:12.217892013Z caller=loop.go:151 component=sync-loop jobID=2e650640-7405-54e9-1e77-43fa9663e01c state=done success=false err="fork/exec /usr/bin/git: argument list too long"

Seems like setting policy via CLI still happens, issue is with auto-updates.

@alex-shpak
Copy link
Contributor

Perhaps commit message gets too long 🤔 because it is generated

@lebenitza
Copy link

It is definitely a performance issue somewhere. I added 9 HelmReleases of a chart with 8 sub-charts and it had to do 72 updates in total, 8 tag updates per HelmRelease. With no CPU limit and 6 minutes it could not finish whatever I did.

ts=2020-06-25T16:20:58.154755447Z caller=loop.go:151 component=sync-loop jobID=54a507b1-f19c-92ec-9982-79581d722afc state=done success=false err="running git command: git [add -- .]: context deadline exceeded"

I'll try to investigate more and come back. For now disabling auto update on those 9 HelmReleases was the only thing that helped.

@jacksmith15
Copy link

Another workaround to this is to manually update the patchFile with the latest images as a one-off. Pushing this will cause flux to deploy the latest images, then the next batch of image updates should be smaller.

This is only helpful if you subsequently will be updating only a few images at a time.

@kingdonb
Copy link
Member

Possibly related to #3450

If we have an active report of this issue with someone who can reproduce it, we can follow it up, but according to the Migration Timetable Flux v1 is formally superseded since about this time last month. Bugs can still be fixed, but soon only CVE fixes will be accepted.

Closing for now. Please open a new report if you wish to pursue this. Thanks for using Flux!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests