Deploying harbor tile might fail due to canary timed out #3735

steven-zou · 2017-12-03T03:56:14Z

Always in the first deployment.

steven-zou · 2017-12-03T03:58:59Z

The root cause seems the timeout 10min is too short.

Although the bosh returns the error code, the harbor instance actually is already successfully started. We may need to increase the timeout threshold.

This failure depends on the opsman env If the network is poor, timeout will easily appear.

Issue is found in the local env and local VC. Not found in US env.

steven-zou · 2017-12-03T14:29:15Z

logs:

====

[Sun Dec 3 14:14:17 UTC 2017] Starting Harbor 1.2.0 at https://testing.harbor.vmware.com
[Sun Dec 3 14:14:17 UTC 2017] Check if docker daemon is running
[Sun Dec 3 14:14:17 UTC 2017] Docker daemon is running
[Sun Dec 3 14:14:17 UTC 2017] Loading docker images ...
Loaded image: vmware/nginx-photon:1.11.13
Loaded image: vmware/notary-photon:signer-0.5.1
Loaded image: vmware/clair:v2.0.1-photon
Loaded image: vmware/harbor-ui:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/mariadb-photon:10.2.10
Loaded image: vmware/postgresql:9.6.5-photon
Loaded image: vmware/harbor-log:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-jobservice:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-adminserver:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-db:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/registry:2.6.2-photon
Loaded image: vmware/photon:1.0
Loaded image: vmware/notary-photon:server-0.5.1
Loaded image: vmware/harbor-db-migrator:1.3
[Sun Dec 3 14:19:49 UTC 2017] Launching docker-compose up ...
[Sun Dec 3 14:19:49 UTC 2017] Waiting 120 seconds for Harbor Service to be ready ...
[Sun Dec 3 14:21:10 UTC 2017] Starting Harbor 1.2.0 at https://testing.harbor.vmware.com
[Sun Dec 3 14:21:10 UTC 2017] Check if docker daemon is running
[Sun Dec 3 14:21:10 UTC 2017] Docker daemon is running
[Sun Dec 3 14:21:10 UTC 2017] Loading docker images ...
[Sun Dec 3 14:21:46 UTC 2017] Error: Harbor Service failed to start in 120 seconds

steven-zou · 2017-12-03T14:29:58Z

Concourse ci log:

Task 608 | 14:08:07 | Preparing deployment: Preparing deployment (00:00:00)
Task 608 | 14:08:07 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 608 | 14:08:07 | Creating missing vms: harbor-app/1839982e-09a8-4ee9-ade4-984bd9d72223 (0) (00:01:34)
Task 608 | 14:09:41 | Updating instance harbor-app: harbor-app/1839982e-09a8-4ee9-ade4-984bd9d72223 (0) (canary) (00:09:40)
L Error: 'harbor-app/1839982e-09a8-4ee9-ade4-984bd9d72223 (0)' is not running after update. Review logs for failed jobs: dockerd, harbor
Task 608 | 14:19:22 | Error: 'harbor-app/1839982e-09a8-4ee9-ade4-984bd9d72223 (0)' is not running after update. Review logs for failed jobs: dockerd, harbor

Task 608 Started Sun Dec 3 14:08:07 UTC 2017
Task 608 Finished Sun Dec 3 14:19:22 UTC 2017
Task 608 Duration 00:11:15
Task 608 error

Updating deployment:
Expected task '608' to succeed but state is 'error'
Exit code 1
===== 2017-12-03 14:17:05 UTC Finished "bosh2 --no-color --non-interactive --tty --environment=10.112.122.21 --deployment=harbor-tile-82e62d65ebdecc1ca325 deploy /var/tempest/workspaces/default/deployments/harbor-tile-82e62d65ebdecc1ca325.yml"; Duration: 676s; Exit Status: 1
{"type": "step_finished", "id": "bosh.deploying.harbor-tile-82e62d65ebdecc1ca325"}
Exited with 1.
could not execute "apply-changes": installation was unsuccessful

steven-zou · 2017-12-03T14:31:35Z

Seems the timeout is always 10 minutes

steven-zou · 2017-12-03T14:32:52Z

The update watch time used in deploying harbor with tile seems is still the default values:

=====

update:
canaries: 1
canary_watch_time: 30000-300000
max_errors: 2
max_in_flight: 1
serial: true
update_watch_time: 30000-300000

steven-zou · 2017-12-04T05:07:00Z

Add more logs:

harbor-app/168b4d39-0633-4268-984e-3ad52887dca3:/var/vcap/sys/log/harbor$ cat ctl.stdout.log
[Mon Dec 4 04:46:56 UTC 2017] Starting Harbor 1.2.0 at https://testing.harbor.vmware.com
[Mon Dec 4 04:46:56 UTC 2017] Check if docker daemon is running
[Mon Dec 4 04:46:56 UTC 2017] Docker daemon is running
[Mon Dec 4 04:46:56 UTC 2017] Loading docker images ...
Loaded image: vmware/nginx-photon:1.11.13
Loaded image: vmware/notary-photon:signer-0.5.1
Loaded image: vmware/clair:v2.0.1-photon
Loaded image: vmware/harbor-ui:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/mariadb-photon:10.2.10
Loaded image: vmware/postgresql:9.6.5-photon
Loaded image: vmware/harbor-log:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-jobservice:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-adminserver:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-db:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/registry:2.6.2-photon
Loaded image: vmware/photon:1.0
Loaded image: vmware/notary-photon:server-0.5.1
Loaded image: vmware/harbor-db-migrator:1.3
[Mon Dec 4 04:52:38 UTC 2017] Launching docker-compose up ...
[Mon Dec 4 04:52:38 UTC 2017] Waiting 120 seconds for Harbor Service to be ready ...
[Mon Dec 4 04:53:49 UTC 2017] Starting Harbor 1.2.0 at https://testing.harbor.vmware.com
[Mon Dec 4 04:53:49 UTC 2017] Check if docker daemon is running
[Mon Dec 4 04:53:49 UTC 2017] Docker daemon is running
[Mon Dec 4 04:53:49 UTC 2017] Loading docker images ...
[Mon Dec 4 04:54:35 UTC 2017] Error: Harbor Service failed to start in 120 seconds
Loaded image: vmware/nginx-photon:1.11.13
Loaded image: vmware/notary-photon:signer-0.5.1
Loaded image: vmware/clair:v2.0.1-photon
Loaded image: vmware/harbor-ui:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/mariadb-photon:10.2.10
Loaded image: vmware/postgresql:9.6.5-photon
Loaded image: vmware/harbor-log:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-jobservice:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-adminserver:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-db:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/registry:2.6.2-photon
Loaded image: vmware/photon:1.0
Loaded image: vmware/notary-photon:server-0.5.1
Loaded image: vmware/harbor-db-migrator:1.3
[Mon Dec 4 04:57:15 UTC 2017] Launching docker-compose up ...
[Mon Dec 4 04:57:15 UTC 2017] Waiting 120 seconds for Harbor Service to be ready ...
[Mon Dec 4 04:57:34 UTC 2017] Harbor Service started successfully
[Mon Dec 4 04:57:34 UTC 2017] Add cron job for Harbor status check
[Mon Dec 4 04:57:34 UTC 2017] ctl start is successfully done!

jessehu · 2017-12-04T14:12:10Z

From the two ctl.stdout.log files, we can see that the time interval of two "Starting Harbor 1.2.0" log are exactly the same 6 minutes 53 seconds. There are probably some timeout setting which causes the 2nd invoke of harbor clt start while 1st harbor ctl start is still running.

reasonerjt · 2017-12-07T09:08:40Z

Per investigation,
The error happens when it takes very long to load docker images.
It seems in monit file, there is unnecessary checks:

  if failed port 80 for 3 cycles then restart
  if failed port 443 for 3 cycles then restart

Let's remove it and extend the timeout for ctl start

start program "/var/vcap/jobs/harbor/bin/ctl start" with timeout 600 seconds

extend it to 900 or 1200 seconds.

We believe with this change the "multiple start" problem will be fixed.

However there may be another issue that the canary timeout in bosh release will be overwritten by opsmanager to a hard coded 300 seconds

canary_watch_time: 30000-1200000

let's work with pivotal to see if there's a fix for that.

jessehu · 2017-12-08T11:24:25Z

This issue is resolved by these patches. We're running CI to verify this fix.

Load harbor docker images in harbor job prestart instead of job start.
https://git.eng.vmware.com/harbor/habo/commit/1d29f2875baccd1dce880b0a92d3fb0e28fe94b7
This can solve the canary watch timeout issue #3735, because the time
of prestart execution is not included in the canary_watch_time.
This patch moves docker related utils and bosh env vars to
src/common/utils.sh, and the key code for loading images is
in jobs/harbor/templates/bin/pre-start.erb.

Remove the restart logic in harbor monit spec
https://git.eng.vmware.com/harbor/habo/commit/73e2c7102e5acd64726bf8426d63fbda8f34575a

Wait for dockerd to start after starting the dockerd daemon
https://git.eng.vmware.com/harbor/habo/commit/53598784faf3801c886bdf98517e1f987a701be2
dockerd daemon needs 1 or 2 seconds to start accepting requests,
so add this wait logic, otherwise other jobs' ctl start script which
trying to connect to docker daemon might fail.

Restart harbor job when detecting harbor service failure
https://git.eng.vmware.com/harbor/habo/commit/bd9318e59e48ee72cf16bcafdc03296a8c5cd5e2
Because removing the harbor pid file only leads to harbor job start
rather than job restart, so the abnormal harbor containers can not
be destroyed and recreated.

jessehu · 2017-12-12T08:07:18Z

Verified in CI.

Unzip the harbor images tgz file into a tar file in packaging phase, then docker loading this tar file will be faster than loading the tgz file. This is to reduce the possibility of canary timeout issue #3735. Issue: goharbor/harbor#3735

This can solve the canary watch timeout issue #3735, because the time of prestart execution is not included in the canary_watch_time. This patch moves docker related utils and bosh env vars to src/common/utils.sh, and the key code for loading images is in jobs/harbor/templates/bin/pre-start.erb. Issue: goharbor/harbor#3735

steven-zou added area/tile kind/bug target/pks-0.8 labels Dec 3, 2017

steven-zou self-assigned this Dec 3, 2017

steven-zou added this to the Sprint 22 milestone Dec 3, 2017

reasonerjt mentioned this issue Dec 4, 2017

404 returned when calling harbor API #3734

Closed

reasonerjt assigned jessehu and unassigned steven-zou Dec 5, 2017

jessehu modified the milestones: Sprint 22, Sprint 23 Dec 6, 2017

jessehu mentioned this issue Dec 6, 2017

Deploying Harbor Tile sometimes failed #3719

Closed

jessehu changed the title ~~Failure might be occurred when deploying harbor with tile~~ Deploying harbor tile might fail due to canary timed out Dec 10, 2017

jessehu closed this as completed Dec 12, 2017

reasonerjt unassigned jessehu Aug 8, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deploying harbor tile might fail due to canary timed out #3735

Deploying harbor tile might fail due to canary timed out #3735

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017 •

edited

Loading

steven-zou commented Dec 4, 2017

jessehu commented Dec 4, 2017

reasonerjt commented Dec 7, 2017

jessehu commented Dec 8, 2017

jessehu commented Dec 12, 2017

Deploying harbor tile might fail due to canary timed out #3735

Deploying harbor tile might fail due to canary timed out #3735

Comments

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017

Concourse ci log:

steven-zou commented Dec 3, 2017

steven-zou commented Dec 3, 2017 • edited Loading

steven-zou commented Dec 4, 2017

Add more logs:

jessehu commented Dec 4, 2017

reasonerjt commented Dec 7, 2017

jessehu commented Dec 8, 2017

jessehu commented Dec 12, 2017

steven-zou commented Dec 3, 2017 •

edited

Loading