Mitigate "machine was allocated without proper switch connections" #21
Comments
@Gerrit91 was #31 related to this? I can't remember why. Maybe @mwindower has some helpful input as well.
IMHO we should validate the reported registration data and prevent the metal-hammer from entering the wait phase when, for example, the neighbor condition cannot be verified from the metal-api perspective.
It was not related to #31.
Also covered a bit by #256.
There are two possibilities to get into a state where you have a machine that you cannot reach over the network after the allocation:
In both cases, we cannot find out which switch a machine is connected to.
This can lead to the following failure state:
Can we prevent this state? It is confusing, and the resulting machines are unusable for a user.
For scenario (1), you can get the switch connection back by rebooting the machine, and everything would be fine.
Both problems can be mitigated with an assertion like this: the machine report should fail if there are not two switches visible from the machine.
This will cause the report to fail more often and the t1-small servers won't get to the waiting state any more.
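The assertion could look roughly like the following Go sketch. It is only an illustration under assumptions: `Neighbor`, `ValidateNeighbors` and `ErrInsufficientSwitches` are hypothetical names made up for this example and are not the actual metal-hammer or metal-api types.

```go
package main

import (
	"errors"
	"fmt"
)

// Neighbor is a hypothetical, simplified view of one switch neighbor
// seen on a machine NIC. It is NOT the real metal-api data model.
type Neighbor struct {
	SwitchID string // identifier of the switch seen on this NIC
	Port     string // switch port the NIC is attached to
}

// ErrInsufficientSwitches is returned when fewer than two distinct
// switches are visible from the machine.
var ErrInsufficientSwitches = errors.New("machine does not see two distinct switches")

// ValidateNeighbors fails unless the reported neighbors cover at least
// two distinct, non-empty switch IDs.
func ValidateNeighbors(neighbors []Neighbor) error {
	seen := map[string]struct{}{}
	for _, n := range neighbors {
		if n.SwitchID == "" {
			continue // a neighbor without a switch ID proves nothing
		}
		seen[n.SwitchID] = struct{}{}
	}
	if len(seen) < 2 {
		return fmt.Errorf("%w: found %d", ErrInsufficientSwitches, len(seen))
	}
	return nil
}

func main() {
	// Both NICs report the same switch, so the report should be rejected.
	report := []Neighbor{
		{SwitchID: "leaf01", Port: "swp1"},
		{SwitchID: "leaf01", Port: "swp2"},
	}
	if err := ValidateNeighbors(report); err != nil {
		fmt.Println("report rejected:", err)
	}
}
```

Counting distinct switch IDs instead of NICs matters here: two links to the same switch would otherwise pass the check.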
To be honest, it is not so likely to get into this state. The last time this happened was because we updated the metal-core, the metal-api and wiped the rethinkdb. However, it's better for the robustness if we prevent these states anyway as they are possibly easy to prevent.
The problem is: The metal-api does not care if there are two switch connections to the machine or not. It will allow machine allocation without this condition fulfilled. The metal-hammer could actually report some wild stuff about switch neighbors to the metal-api, the api would say "fine" and when you allocate it, you would end up with an unusable machine. And this is what happened: The "machine connections" got lost because we had new switches registered at the api, but the machines behind the switches were already in the waiting state. The metal-api should at least validate if it is actually able to construct a proper switch configuration before allowing machine allocation.
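A pre-allocation check on the API side might be sketched as follows. This is a minimal sketch under assumptions: `Switch`, `MachineConnections` as a plain map, and `CheckAllocatable` are hypothetical stand-ins for whatever the metal-api actually stores, and the threshold of two switches is taken from the condition described above.

```go
package main

import "fmt"

// requiredSwitches is the number of switch connections a machine needs
// before it may be allocated (assumption based on the issue text).
const requiredSwitches = 2

// Switch is a hypothetical registry entry. MachineConnections maps a
// machine ID to the switch ports that machine is wired to.
type Switch struct {
	ID                 string
	MachineConnections map[string][]string
}

// connectedSwitches returns the IDs of all switches that know at least
// one connection to the given machine.
func connectedSwitches(machineID string, switches []Switch) []string {
	var ids []string
	for _, sw := range switches {
		// Reading from a nil map is safe in Go and yields no ports,
		// which models a freshly registered switch.
		if len(sw.MachineConnections[machineID]) > 0 {
			ids = append(ids, sw.ID)
		}
	}
	return ids
}

// CheckAllocatable refuses allocation when the known switch connections
// are not sufficient to build a switch configuration for the machine.
func CheckAllocatable(machineID string, switches []Switch) error {
	ids := connectedSwitches(machineID, switches)
	if len(ids) < requiredSwitches {
		return fmt.Errorf("machine %s is connected to %d switch(es), need %d",
			machineID, len(ids), requiredSwitches)
	}
	return nil
}

func main() {
	// Switches were re-registered, so their connection tables are empty
	// although the machine is already waiting.
	fresh := []Switch{{ID: "leaf01"}, {ID: "leaf02"}}
	if err := CheckAllocatable("machine-1", fresh); err != nil {
		fmt.Println("allocation refused:", err)
	}
}
```

The example in `main` mirrors the incident described above: the switches exist, but the connections to the waiting machine are gone, so the allocation is refused instead of handing out an unreachable machine.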
--
Ideally, such a machine should not even be able to enter the wait table. This would cause the machine to reboot and re-report its connections, and it would keep users from allocating such a machine.