Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deploying harbor tile might fail due to canary timed out #3735

Closed
steven-zou opened this issue Dec 3, 2017 · 10 comments
Closed

Deploying harbor tile might fail due to canary timed out #3735

steven-zou opened this issue Dec 3, 2017 · 10 comments

Comments

@steven-zou
Copy link
Contributor

Always in the first deployment.

@steven-zou
Copy link
Contributor Author

The root cause seems the timeout 10min is too short.

Although the bosh returns the error code, the harbor instance actually is already successfully started. We may need to increase the timeout threshold.

This failure depends on the opsman env If the network is poor, timeout will easily appear.

Issue is found in the local env and local VC. Not found in US env.

@steven-zou steven-zou added this to the Sprint 22 milestone Dec 3, 2017
@steven-zou
Copy link
Contributor Author

logs:

====

[Sun Dec 3 14:14:17 UTC 2017] Starting Harbor 1.2.0 at https://testing.harbor.vmware.com
[Sun Dec 3 14:14:17 UTC 2017] Check if docker daemon is running
[Sun Dec 3 14:14:17 UTC 2017] Docker daemon is running
[Sun Dec 3 14:14:17 UTC 2017] Loading docker images ...
Loaded image: vmware/nginx-photon:1.11.13
Loaded image: vmware/notary-photon:signer-0.5.1
Loaded image: vmware/clair:v2.0.1-photon
Loaded image: vmware/harbor-ui:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/mariadb-photon:10.2.10
Loaded image: vmware/postgresql:9.6.5-photon
Loaded image: vmware/harbor-log:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-jobservice:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-adminserver:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-db:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/registry:2.6.2-photon
Loaded image: vmware/photon:1.0
Loaded image: vmware/notary-photon:server-0.5.1
Loaded image: vmware/harbor-db-migrator:1.3
[Sun Dec 3 14:19:49 UTC 2017] Launching docker-compose up ...
[Sun Dec 3 14:19:49 UTC 2017] Waiting 120 seconds for Harbor Service to be ready ...
[Sun Dec 3 14:21:10 UTC 2017] Starting Harbor 1.2.0 at https://testing.harbor.vmware.com
[Sun Dec 3 14:21:10 UTC 2017] Check if docker daemon is running
[Sun Dec 3 14:21:10 UTC 2017] Docker daemon is running
[Sun Dec 3 14:21:10 UTC 2017] Loading docker images ...
[Sun Dec 3 14:21:46 UTC 2017] Error: Harbor Service failed to start in 120 seconds

@steven-zou
Copy link
Contributor Author

Concourse ci log:

Task 608 | 14:08:07 | Preparing deployment: Preparing deployment (00:00:00)
Task 608 | 14:08:07 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 608 | 14:08:07 | Creating missing vms: harbor-app/1839982e-09a8-4ee9-ade4-984bd9d72223 (0) (00:01:34)
Task 608 | 14:09:41 | Updating instance harbor-app: harbor-app/1839982e-09a8-4ee9-ade4-984bd9d72223 (0) (canary) (00:09:40)
L Error: 'harbor-app/1839982e-09a8-4ee9-ade4-984bd9d72223 (0)' is not running after update. Review logs for failed jobs: dockerd, harbor
Task 608 | 14:19:22 | Error: 'harbor-app/1839982e-09a8-4ee9-ade4-984bd9d72223 (0)' is not running after update. Review logs for failed jobs: dockerd, harbor

Task 608 Started Sun Dec 3 14:08:07 UTC 2017
Task 608 Finished Sun Dec 3 14:19:22 UTC 2017
Task 608 Duration 00:11:15
Task 608 error

Updating deployment:
Expected task '608' to succeed but state is 'error'
Exit code 1
===== 2017-12-03 14:17:05 UTC Finished "bosh2 --no-color --non-interactive --tty --environment=10.112.122.21 --deployment=harbor-tile-82e62d65ebdecc1ca325 deploy /var/tempest/workspaces/default/deployments/harbor-tile-82e62d65ebdecc1ca325.yml"; Duration: 676s; Exit Status: 1
{"type": "step_finished", "id": "bosh.deploying.harbor-tile-82e62d65ebdecc1ca325"}
Exited with 1.
could not execute "apply-changes": installation was unsuccessful

@steven-zou
Copy link
Contributor Author

Seems the timeout is always 10 minutes

@steven-zou
Copy link
Contributor Author

steven-zou commented Dec 3, 2017

The update watch time used in deploying harbor with tile seems is still the default values:

=====

  • update:
  • canaries: 1
  • canary_watch_time: 30000-300000
  • max_errors: 2
  • max_in_flight: 1
  • serial: true
  • update_watch_time: 30000-300000

@steven-zou
Copy link
Contributor Author

Add more logs:

harbor-app/168b4d39-0633-4268-984e-3ad52887dca3:/var/vcap/sys/log/harbor$ cat ctl.stdout.log
[Mon Dec 4 04:46:56 UTC 2017] Starting Harbor 1.2.0 at https://testing.harbor.vmware.com
[Mon Dec 4 04:46:56 UTC 2017] Check if docker daemon is running
[Mon Dec 4 04:46:56 UTC 2017] Docker daemon is running
[Mon Dec 4 04:46:56 UTC 2017] Loading docker images ...
Loaded image: vmware/nginx-photon:1.11.13
Loaded image: vmware/notary-photon:signer-0.5.1
Loaded image: vmware/clair:v2.0.1-photon
Loaded image: vmware/harbor-ui:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/mariadb-photon:10.2.10
Loaded image: vmware/postgresql:9.6.5-photon
Loaded image: vmware/harbor-log:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-jobservice:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-adminserver:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-db:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/registry:2.6.2-photon
Loaded image: vmware/photon:1.0
Loaded image: vmware/notary-photon:server-0.5.1
Loaded image: vmware/harbor-db-migrator:1.3
[Mon Dec 4 04:52:38 UTC 2017] Launching docker-compose up ...
[Mon Dec 4 04:52:38 UTC 2017] Waiting 120 seconds for Harbor Service to be ready ...
[Mon Dec 4 04:53:49 UTC 2017] Starting Harbor 1.2.0 at https://testing.harbor.vmware.com
[Mon Dec 4 04:53:49 UTC 2017] Check if docker daemon is running
[Mon Dec 4 04:53:49 UTC 2017] Docker daemon is running
[Mon Dec 4 04:53:49 UTC 2017] Loading docker images ...
[Mon Dec 4 04:54:35 UTC 2017] Error: Harbor Service failed to start in 120 seconds
Loaded image: vmware/nginx-photon:1.11.13
Loaded image: vmware/notary-photon:signer-0.5.1
Loaded image: vmware/clair:v2.0.1-photon
Loaded image: vmware/harbor-ui:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/mariadb-photon:10.2.10
Loaded image: vmware/postgresql:9.6.5-photon
Loaded image: vmware/harbor-log:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-jobservice:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-adminserver:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/harbor-db:v1.3.0-rc1-11-gd13321f
Loaded image: vmware/registry:2.6.2-photon
Loaded image: vmware/photon:1.0
Loaded image: vmware/notary-photon:server-0.5.1
Loaded image: vmware/harbor-db-migrator:1.3
[Mon Dec 4 04:57:15 UTC 2017] Launching docker-compose up ...
[Mon Dec 4 04:57:15 UTC 2017] Waiting 120 seconds for Harbor Service to be ready ...
[Mon Dec 4 04:57:34 UTC 2017] Harbor Service started successfully
[Mon Dec 4 04:57:34 UTC 2017] Add cron job for Harbor status check
[Mon Dec 4 04:57:34 UTC 2017] ctl start is successfully done!

@jessehu
Copy link
Contributor

jessehu commented Dec 4, 2017

From the two ctl.stdout.log files, we can see that the time interval of two "Starting Harbor 1.2.0" log are exactly the same 6 minutes 53 seconds. There are probably some timeout setting which causes the 2nd invoke of harbor clt start while 1st harbor ctl start is still running.

@reasonerjt
Copy link
Contributor

Per investigation,
The error happens when it takes very long to load docker images.
It seems in monit file, there is unnecessary checks:

  if failed port 80 for 3 cycles then restart
  if failed port 443 for 3 cycles then restart

Let's remove it and extend the timeout for ctl start

start program "/var/vcap/jobs/harbor/bin/ctl start" with timeout 600 seconds

extend it to 900 or 1200 seconds.

We believe with this change the "multiple start" problem will be fixed.

However there may be another issue that the canary timeout in bosh release will be overwritten by opsmanager to a hard coded 300 seconds

canary_watch_time: 30000-1200000

let's work with pivotal to see if there's a fix for that.

@jessehu
Copy link
Contributor

jessehu commented Dec 8, 2017

This issue is resolved by these patches. We're running CI to verify this fix.

Load harbor docker images in harbor job prestart instead of job start.
https://git.eng.vmware.com/harbor/habo/commit/1d29f2875baccd1dce880b0a92d3fb0e28fe94b7
This can solve the canary watch timeout issue #3735, because the time
of prestart execution is not included in the canary_watch_time.
This patch moves docker related utils and bosh env vars to
src/common/utils.sh, and the key code for loading images is
in jobs/harbor/templates/bin/pre-start.erb.

Remove the restart logic in harbor monit spec
https://git.eng.vmware.com/harbor/habo/commit/73e2c7102e5acd64726bf8426d63fbda8f34575a

Wait for dockerd to start after starting the dockerd daemon
https://git.eng.vmware.com/harbor/habo/commit/53598784faf3801c886bdf98517e1f987a701be2
dockerd daemon needs 1 or 2 seconds to start accepting requests,
so add this wait logic, otherwise other jobs' ctl start script which
trying to connect to docker daemon might fail.

Restart harbor job when detecting harbor service failure
https://git.eng.vmware.com/harbor/habo/commit/bd9318e59e48ee72cf16bcafdc03296a8c5cd5e2
Because removing the harbor pid file only leads to harbor job start
rather than job restart, so the abnormal harbor containers can not
be destroyed and recreated.

@jessehu jessehu changed the title Failure might be occurred when deploying harbor with tile Deploying harbor tile might fail due to canary timed out Dec 10, 2017
@jessehu
Copy link
Contributor

jessehu commented Dec 12, 2017

Verified in CI.

@jessehu jessehu closed this as completed Dec 12, 2017
jessehu added a commit to vmware/harbor-boshrelease that referenced this issue Mar 30, 2018
Unzip the harbor images tgz file into a tar file in packaging phase,
then docker loading this tar file will be faster than loading the tgz
file. This is to reduce the possibility of canary timeout issue #3735.

Issue: goharbor/harbor#3735
jessehu added a commit to vmware/harbor-boshrelease that referenced this issue Mar 30, 2018
This can solve the canary watch timeout issue #3735, because the time
of prestart execution is not included in the canary_watch_time.
This patch moves docker related utils and bosh env vars to
src/common/utils.sh, and the key code for loading images is
in jobs/harbor/templates/bin/pre-start.erb.

Issue: goharbor/harbor#3735
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants