blobstore job fails to start after vm crash or reboot #25

Closed
sykesm opened this issue Jul 20, 2016 · 5 comments

@sykesm

sykesm commented Jul 20, 2016

Issue

The blobstore job fails to start after the VM is rebooted.

Context

The nginx.stderr.log from the failure shows the following line hundreds of times:

nginx: [emerg] open() "/var/vcap/sys/run/blobstore/nginx.pid" failed (2: No such file or directory)

The control script appears to rely on the pre-start script to set up that directory:

function setup_blobstore_directories {
  local run_dir=/var/vcap/sys/run/blobstore
  local log_dir=/var/vcap/sys/log/blobstore
  local data=/var/vcap/store/shared
  local tmp_dir=$data/tmp/uploads
  local nginx_webdav_dir=/var/vcap/packages/nginx_webdav

  mkdir -p $run_dir
  mkdir -p $log_dir
  mkdir -p $data
  mkdir -p $tmp_dir
  chown -R vcap:vcap $run_dir $log_dir $data $tmp_dir $nginx_webdav_dir "${nginx_webdav_dir}/.."
}

According to the timestamps on the log files, the pre-start script last ran three days before the reboot:

-rw-r--r-- 1 vcap vcap 41200 Jul 20 14:21 nginx.stderr.log
-rw-r--r-- 1 vcap vcap     0 Jul 17 17:23 nginx.stdout.log
-rw-r----- 1 vcap vcap   622 Jul 17 17:23 pre-start.stderr.log
-rw-r----- 1 vcap vcap     0 Jul 17 17:23 pre-start.stdout.log

Unfortunately, most of the directories created by that script live on temporary file systems that BOSH sets up. In particular, /var/vcap/sys/run:

# df /var/vcap/data/sys/run
Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs               1024    16      1008   2% /var/vcap/data/sys/run

Since it's a tmpfs, it's a memory-only file system and all of its data is lost on a reboot. That means the directory used for the nginx pidfile is gone when the blobstore control script starts.
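
One way to confirm this on a VM is to ask for the file system type behind the run directory. The following is a minimal sketch, assuming GNU coreutils `stat` is available; the helper names `fs_type_of`, `is_volatile_fs`, and `check_run_dir` are hypothetical and not part of any release:

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not taken from the release.

# Print the file system type backing a path (assumes GNU stat).
fs_type_of() {
  stat -f -c %T "$1"
}

# Succeed if the given file system type is memory-backed.
is_volatile_fs() {
  case "$1" in
    tmpfs|ramfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a warning if the directory will not survive a reboot.
check_run_dir() {
  local dir=$1
  if is_volatile_fs "$(fs_type_of "$dir")"; then
    echo "$dir is memory-backed; its subdirectories must be recreated on every start"
  fi
}
```

Calling `check_run_dir /var/vcap/sys/run` on the affected VM would be expected to print the warning, given the `df` output above.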

Steps to Reproduce

  1. Deploy Cloud Foundry
  2. Reboot the blobstore_z1 job

Expected result

The blobstore job recovers when the reboot is complete.

Current result

The blobstore job fails to recover. This causes the cloud controllers, cloud controller workers, and the runtimes to fail.

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/126686223

The labels on this github issue will be updated when the story is started.

@sykesm
Author

sykesm commented Jul 20, 2016

It appears the same problem exists with the cloud controller:

[2016-07-20 14:21:03+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:21:03 UTC 2016 --------------
[2016-07-20 14:21:03+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: No such file or directory
[2016-07-20 14:22:43+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:22:43 UTC 2016 --------------
[2016-07-20 14:22:43+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_1.pid: No such file or directory
[2016-07-20 14:23:13+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:23:13 UTC 2016 --------------
[2016-07-20 14:23:13+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: No such file or directory
[2016-07-20 14:24:53+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:24:53 UTC 2016 --------------
[2016-07-20 14:24:53+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_1.pid: No such file or directory
[2016-07-20 14:25:23+0000] ------------ STARTING cloud_controller_worker_ctl at Wed Jul 20 14:25:23 UTC 2016 --------------
[2016-07-20 14:25:23+0000] /var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_worker_ctl: line 40: /var/vcap/sys/run/cloud_controller_ng/cloud_controller_worker_2.pid: No such file or directory

@sykesm
Author

sykesm commented Jul 20, 2016

Similar issues in consul: cloudfoundry-attic/consul-release#31

@sax
Contributor

sax commented Jul 21, 2016

We've made fixes in these two commits:
cloudfoundry/cloud_controller_ng@232f748
2cc5a9b

This will ensure that the directories in /var/vcap/sys/run are recreated when services start after a reboot.

When we switched our start commands to run as non-root, we moved all directory creation into pre-start, because we ran into problems with /var/vcap/sys/log not giving write permission to non-root users. It turns out that /var/vcap/sys/run gives write permission to the vcap group, so we can make those directories in start scripts.
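
The approach described above can be sketched roughly as follows. This is an illustration only, not the code from the referenced commits: the function name `ensure_run_dir` is hypothetical, and the sketch assumes the start script runs as a user with write access to the parent of the run directory (the vcap group, as noted above).

```shell
#!/bin/bash
# Hypothetical sketch; not the code from the referenced commits.

# Recreate the pid directory on every start, because anything under a tmpfs
# is lost on reboot. mkdir -p succeeds even if the directory already exists,
# so this is safe to call on every start, not only after a reboot.
ensure_run_dir() {
  local dir=$1
  mkdir -p "$dir"
}

# Intended use near the top of a start script, before the pidfile is written:
#   ensure_run_dir /var/vcap/sys/run/blobstore
```

Doing this in the start script rather than in pre-start matters because pre-start does not run again after a reboot, as the log timestamps earlier in this thread show.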

Let us know if this fixes the issue for you, and close the issue if it does!

@sax && @adowns01

@sykesm
Author

sykesm commented Jul 23, 2016

Thanks.

sykesm closed this as completed Jul 23, 2016