Test coverage for healthcheck mechanism #303

philwinder · 2015-09-15T15:02:13Z

We have sporadic evidence that the healthcheck timeout doesn't work properly.

Seems to be caused by a few subtle bugs.

Refactor ClusterMonitor to allow testing
Ensure that a new item isn't added to the cluster when re-starting with old tasks (addNewTaskToCluster)
Test that the correct number of items is in the cluster and is being monitored
Refactor cluster state out of ClusterMonitor. ClusterMonitor should have a cluster state (has-a), not manage one
Remove the check-for-too-many-executors method. That check belongs in the scheduler
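
To make the has-a split and the restart case above concrete, here is a minimal sketch. The class shapes and method names (`ClusterState.addTask`, `ClusterMonitor.monitorTask`, `monitoredCount`) are assumptions for illustration only and are not the project's actual API. It shows cluster state held separately from the monitor, with an idempotent add so that re-registering an old task after a restart does not create a second entry.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and signatures are illustrative, not the real classes.
public class ClusterStateSketch {

    /** Holds the known tasks, keyed by task ID, with no monitoring logic. */
    static class ClusterState {
        private final Map<String, String> hostByTaskId = new LinkedHashMap<>();

        /** Adds the task only if it is not already known; returns true if it was added. */
        boolean addTask(String taskId, String hostname) {
            return hostByTaskId.putIfAbsent(taskId, hostname) == null;
        }

        int size() {
            return hostByTaskId.size();
        }
    }

    /** The monitor has a ClusterState (has-a) instead of owning the task list itself. */
    static class ClusterMonitor {
        private final ClusterState state;
        private final Set<String> monitoredTaskIds = new HashSet<>();

        ClusterMonitor(ClusterState state) {
            this.state = state;
        }

        /** Starts monitoring a task; safe to call again for a task restored after a restart. */
        void monitorTask(String taskId, String hostname) {
            state.addTask(taskId, hostname); // no-op if the task is already in the state
            monitoredTaskIds.add(taskId);    // a Set, so no duplicate monitors
        }

        int monitoredCount() {
            return monitoredTaskIds.size();
        }
    }

    public static void main(String[] args) {
        ClusterState state = new ClusterState();
        state.addTask("task-1", "host-a"); // simulates a task restored from persisted state

        ClusterMonitor monitor = new ClusterMonitor(state);
        monitor.monitorTask("task-1", "host-a"); // old task seen again after restart
        monitor.monitorTask("task-2", "host-b"); // genuinely new task

        System.out.println(state.size() + " tasks, " + monitor.monitoredCount() + " monitored");
    }
}
```

Because the state is injected, a unit test can pre-populate it and assert on both counts without starting a scheduler.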

Normally, if you kill an executor, Mesos detects it and sends a TASK_KILLED message straight away. Reconciliation works fine.

But sometimes, for example when ZooKeeper has the wrong state (the state says an executor is running, but it is not), the healthchecks seem to get stuck somewhere. I suggest writing a system test based on this scenario to see if we can replicate it.
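
Before a full system test, the stale-state scenario can be approximated in a unit test. The sketch below is an assumption about how a timeout check could be made testable, not a description of the existing healthcheck code: `HeartbeatTracker` and its methods are hypothetical, and the clock is injected so the test can advance time without sleeping. An executor restored from persisted state that never reports again should show up as timed out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch: not the project's healthcheck implementation.
public class HealthcheckTimeoutSketch {

    /** Tracks the last time each executor was seen and reports those past the timeout. */
    static class HeartbeatTracker {
        private final Map<String, Long> lastSeenMillis = new HashMap<>();
        private final long timeoutMillis;
        private final LongSupplier clock; // injected so tests control time

        HeartbeatTracker(long timeoutMillis, LongSupplier clock) {
            this.timeoutMillis = timeoutMillis;
            this.clock = clock;
        }

        /** Records a heartbeat, or the moment an executor was restored from persisted state. */
        void recordHeartbeat(String executorId) {
            lastSeenMillis.put(executorId, clock.getAsLong());
        }

        /** Returns the executors whose last heartbeat is older than the timeout. */
        List<String> timedOutExecutors() {
            long now = clock.getAsLong();
            List<String> timedOut = new ArrayList<>();
            for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
                if (now - entry.getValue() > timeoutMillis) {
                    timedOut.add(entry.getKey());
                }
            }
            return timedOut;
        }
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock, advanced manually
        HeartbeatTracker tracker = new HeartbeatTracker(1000L, () -> now[0]);

        // State claims the executor is running, but it never sends another heartbeat.
        tracker.recordHeartbeat("executor-1");
        now[0] = 1500L;

        System.out.println(tracker.timedOutExecutors()); // prints [executor-1]
    }
}
```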

Possibly seen in:
#285

Tymofii's test cluster.

philwinder · 2015-09-21T08:00:54Z

Fixed in #320.

philwinder added the bug label Sep 15, 2015

philwinder modified the milestones: Backlog, 0.4.2 Sep 15, 2015

philwinder self-assigned this Sep 16, 2015

This was referenced Sep 16, 2015

When executors are lost, the status is not propagated back to zookeeper #310

Closed

Executor timeout large value #318

Closed

philwinder closed this as completed Sep 21, 2015
