packet: disable syncing allowed SSH keys on nodes #471
Conversation
The big question here is: do we want to test it now, and if yes, how?
By "test", do you mean "test automatically"? AFAICT all we have to do to test this change manually is deploy a cluster with one key specified in Ignition, unless I've completely missed your point. |
Yes, I meant "test automatically"; testing manually does not count :) Anyway, I have an idea how to test that: CI takes the SSH key from the SSH_KEY environment variable.
Where will this logic sit? Inside a Go test?
Yes, in our e2e test suite.
Implemented tests for that; it was fun!
Force-pushed from 22d1aa3 to efd1e56
A couple of nits.
Force-pushed from efd1e56 to 9a73a36
This doesn't seem to be tested:
I think the right env var is…
Force-pushed from 9a73a36 to 48be61d
Oops, fixed, thanks!
Added some comments (the main one is about missing articles in comment strings).
I can't get the test to pass: the DS pods are crashlooping and there is no output in the logs. I've created a Packet cluster and used the following commands to run the test:
export PUB_KEY=$(cat key.pub)
go test -v -tags e2e,packet ./test/system -run TestNoExtraSSHKeysOnNodes
Output:
=== RUN TestNoExtraSSHKeysOnNodes
=== PAUSE TestNoExtraSSHKeysOnNodes
=== CONT TestNoExtraSSHKeysOnNodes
TestNoExtraSSHKeysOnNodes: util.go:65: using KUBECONFIG=/home/johannes/kinvolk/clusters/packet-ssh-keys/assets/cluster-assets/auth/kubeconfig
TestNoExtraSSHKeysOnNodes: util.go:127: daemonset: test-ssh-keys-b5p8f, replicas: 3/3
TestNoExtraSSHKeysOnNodes: util.go:127: daemonset: test-ssh-keys-b5p8f, replicas: 3/3
...
TestNoExtraSSHKeysOnNodes: util.go:134: error while waiting for the daemonset: timed out waiting for the condition
--- FAIL: TestNoExtraSSHKeysOnNodes (300.25s)
FAIL
FAIL github.com/kinvolk/lokomotive/test/system 300.257s
FAIL
test/components/util/util.go
Outdated
const (
	// RetryInterval is the default retry interval for tests using retry strategy, so they
	// don't have to specify their own local intervals.
	RetryInterval = time.Second * 5
Why put these consts here? They are used in test/system/ssh_keys_test.go, so I think they should live as close as possible to where they are used. I think we want to minimize our reliance on "util" packages, not extend it.
I'm assuming your rationale is to avoid duplication when new code is added in the future which may need to use retries. My suggestion is to do the refactoring if/when that happens, because then you will have the necessary info regarding the best place to put these consts.
This prepares the ground for better code by adding the constants and then actually using them. Refactoring is out of scope of this PR, and I want to avoid adding yet another duplication.
This way, newly added code should use those constants, meaning that once we actually do the refactoring, we have less work to do.
I'm not against adding the constants; I'm against adding them in a separate file because we might need to reuse them at some unknown point in the future.
Would you consider doing the change when it is actually needed rather than based on a theoretical prediction of the future? The price we are paying by doing this right now is that we expand our reliance on "util" packages. If we put every piece of code which we may be able to reuse in the future into a "util" package, we'd gradually be moving most of our logic there.
If you do the optimization once you encounter a concrete problem (e.g. "oh, I need to duplicate this code to make it work"), you will have a better idea where to put this code based on the concrete case you have.
I would have to duplicate this code already:
$ git grep 'time.Minute\*5' test/
test/components/cert-manager/cert-manager_test.go: testutil.WaitForDeployment(t, client, namespace, test.deployment, time.Second*5, time.Minute*5)
test/components/contour/contour_test.go: testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/contour/contour_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/coredns/coredns_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/dex/dex_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/external-dns/external-dns_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/flatcar-linux-update-operator/fluo_test.go: testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/flatcar-linux-update-operator/fluo_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/gangway/gangway_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/metallb/metallb_test.go: testutil.WaitForDaemonSet(t, client, namespace, daemonset, time.Second*5, time.Minute*5)
test/components/metallb/metallb_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/metrics-server/metrics_server_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/openebs-operator/openebs_operator_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
test/components/rook/rook_test.go: testutil.WaitForDeployment(t, client, namespace, deployment, time.Second*5, time.Minute*5)
This change will prevent exactly that.
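For illustration, a minimal sketch of the constants in question; RetryInterval and its comment come from the diff above, while the Timeout name is an assumption mirroring the repeated time.Minute*5 in the grep output:

package util

import "time"

const (
	// RetryInterval is the default retry interval for tests using a retry
	// strategy, so they don't have to specify their own local intervals.
	RetryInterval = time.Second * 5

	// Timeout is a hypothetical companion constant for the overall retry
	// budget, matching the time.Minute*5 repeated in the grep output above.
	Timeout = time.Minute * 5
)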
I understand you are bothered by "duplicating" the expression time.Minute*5 inside test/components. I don't understand how the test/components package is related to this PR (which touches SSH keys on Packet). I also don't think that repeating time.Minute*5 can be considered code duplication, unless we are assuming that all these tests must use the same value there. If the "duplication" in test/components is what we're discussing here, I'd say it's completely out of scope and should move into a separate PR.
FWIW, if we decide that all tests should use the same time.Minute*5 value, the way I would solve it is by defining the constant in test/components and avoiding the util package. To be precise, I'd put the following code in test/components/components.go (or even util.go if you insist):

package components

const (
	RetryInterval = time.Second * 5
)

For this PR I would not use a constant that's intended for use by components, because the code in this PR isn't related to components. In case you are arguing that we should use one global value for all tests, I would simply define the constant in the test package and import it into any "child" package under test/.
Lastly, introducing some duplication is IMO much cheaper than adding an import between two otherwise-unrelated packages.
In case I wasn't able to convince you with the above, please go ahead with whatever works for you. We need this change to get merged and I don't want to block it any further.
FWIW I agree with @johananl here.
I removed the constants.
test/system/ssh_keys_test.go
Outdated
// +build aws aws_edge baremetal packet
// +build e2e

package system_test
I suggest we call this package system. It's already under test/, so test/system_test repeats "test" unnecessarily (see the "do not stutter" recommendation). Also, underscores or camel case are usually discouraged in the Go community. If you feel it's absolutely necessary to include "test" in the package name, I would consider systest or systemtest to make the package name more idiomatic. Relevant info: https://blog.golang.org/package-names
This is to satisfy a new linter, which IMO we should also incorporate for other code: https://github.com/maratori/testpackage.
Have we agreed on using this new linter? I'm not familiar with this tool. Anyway, if that's what this linter is suggesting, I'd say we need to reconsider using it :-/
Generally speaking, we should be following best practices, and this is one of them. The linter enforces them. If we don't agree to follow some, we make a team decision and disable those linters. This allows writing more consistent code for everyone involved in the project.
In this particular case, using this package name for unit tests enforces testing only exported functions, which allows refactoring in the future. It does not make much sense for e2e tests, but I don't see a reason why we should make a distinction here.
> Generally speaking, we should be following best practices, and this is one of them. The linter enforces them.

Right. The relevant Go best practice is here: https://blog.golang.org/package-names
Where is the best practice which says system_test is preferred? :-/

> If we don't agree to follow some, we make a team decision and disable those linters. This allows writing more consistent code for everyone involved in the project.

Following that logic, shouldn't we have a team decision to add a linter, too? In any case, IMO the official Go recommendations are much stronger than the opinion of a community member who wrote a linter. Do you think otherwise?
I'd go without _test.
I see the point of the _test convention to force only exported functions to be tested (it's briefly mentioned here: https://tip.golang.org/cmd/go/#hdr-Test_packages, and it's used in the Go standard library).
In this case it doesn't really matter because we're in a separate package anyway and we're not doing unit tests.
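As a quick illustration of that convention: an external test package (note the _test suffix in the package clause) can only reference exported identifiers of the package under test. A minimal sketch against the standard library's strings package, as it would live in that package's own directory:

package strings_test

import (
	"strings"
	"testing"
)

// TestOnlyExported can use strings.HasPrefix because it is exported;
// unexported identifiers inside package strings would not compile here.
func TestOnlyExported(t *testing.T) {
	if !strings.HasPrefix("system_test", "system") {
		t.Fatal("expected prefix match")
	}
}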
Thanks @iaguis. I wasn't aware _test in package names was a thing, and couldn't figure this out from the discussion above. So IIUC _test is a special case, i.e. there is no recommendation to call a package cluster_configuration. Right?
Yes, IIUC this is only for tests, not for package names in general.
If I should go without _test (though I think it does not matter here and it's a good habit to keep it), what should I do with the linter? Do we want to disable the rule globally? Or should each new test file get //nolint:testpackage?
Just added //nolint:testpackage for now.
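For reference, the resulting file header presumably looks something like the following; the build tags are copied from the diff above, and the exact //nolint placement is an assumption about how golangci-lint matches the directive:

// +build aws aws_edge baremetal packet
// +build e2e

package system //nolint:testpackage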
Looks like the cleanup doesn't work: I still have the DS on the cluster following a failed test run. Could this be caused by the following? lokomotive/test/components/util/util.go, line 124 in 2962730
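For context, a rough sketch of what such a wait helper typically looks like with client-go; this is not the repo's actual util.WaitForDaemonSet (the line referenced above), and it assumes a client-go version whose Get takes a context:

package util

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSet polls the DaemonSet status until all desired pods are
// ready, or the timeout elapses (which is how a crashlooping checker pod
// turns into the "timed out waiting for the condition" failure above).
func waitForDaemonSet(client kubernetes.Interface, ns, name string, interval, timeout time.Duration) error {
	return wait.Poll(interval, timeout, func() (bool, error) {
		ds, err := client.AppsV1().DaemonSets(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // treat API errors as transient and retry
		}

		fmt.Printf("daemonset: %s, replicas: %d/%d\n", name, ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)

		return ds.Status.DesiredNumberScheduled > 0 &&
			ds.Status.NumberReady == ds.Status.DesiredNumberScheduled, nil
	})
}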
I've checked. Also, we can probably improve the test: right now we are determining pass/fail based on the convergence of the DS, which has a few problems, and I'd suggest some changes.
Ah, the problem is that the…
Ah, with the unit disabled, we no longer get this line. I was modifying the file manually to make the test pass. I didn't have time to look at the CI results. Will remove that then.
IIRC the cleanup only takes place if the test succeeds, so you can investigate it. Do you think we should always do the cleanup?
Regarding the tests failing, I think we must run on at least one worker and one controller node, as there are 2 knobs in the code which are responsible for it (#449). The easiest way to do that is via
Regarding the failing test reason: given that we run tests on CI, I wouldn't print the content of authorized_keys in the test logs, as it may leak e.g. email addresses. I think just reporting a SHA512 mismatch and skipping the cleanup is reasonable here, so you can go and investigate.
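As an aside, a minimal sketch of how the expected checksum can be derived from the key in the environment without ever logging the key material itself; PUB_KEY matches the variable from the manual run above, and the appended newline is an assumption about the exact authorized_keys content:

package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"os"
)

func main() {
	// Hash the public key as it would appear in authorized_keys; the
	// trailing newline is an assumption about the file layout.
	key := os.Getenv("PUB_KEY")
	sum := sha512.Sum512([]byte(key + "\n"))

	// Print only the checksum, never the key itself, to avoid leaking
	// e.g. email addresses in CI logs.
	fmt.Println(hex.EncodeToString(sum[:]))
}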
I wanted to avoid doing more interactive things, as they seem overly complicated to use in testing.
This would make sense, but then I need to iterate over all the nodes, create pods there, etc.
Force-pushed from 8048022 to 992b55b
Also, we should be optimizing for the passing case, not the failing one, as ideally it should never fail. If it does fail, IMO it's fine for that to happen slowly.
Force-pushed from 992b55b to ceb2e46
Tests are passing, except for the flaky ARM job.
Force-pushed from e459649 to e8df46e
Force-pushed from e8df46e to 859f9eb
Now all of them fail on bootkube 🤔
I disagree. A slow test affects the run time of all CI builds. If we have an easy way to fail fast, I don't see why we should wait 5 minutes.
I guess we can skip the cleanup on failure, as long as doing so doesn't affect other tests.
You won't have to iterate. AFAICT you only have to test one worker node and one controller node.
This means I need to find the node names and then run 2 independent, though very similar, tests. This seems fragile to me and unnecessarily complicated.
But unless you break things, you won't notice that, so it will only affect a small % of the builds.
I don't think it should affect anything.
@invidian please go ahead with whatever makes sense to you regarding the test. My comments regarding the DaemonSet weren't meant as a "hard block". They were rather suggestions on how to make the test better in case you're still working on improving it.
Force-pushed from 859f9eb to 6b6f3d8
lgtm when green.
This commit disables syncing authorized SSH keys for the core user on nodes from the Packet user's and project's keys, so only keys specified in the Lokomotive configuration are actually authorized, as allowing more keys than specified in the configuration might be a potential security threat.

Closes #465

Signed-off-by: Mateusz Gozdek <[email protected]>
This commit adds an e2e test to verify that no extra SSH keys are added to the Kubernetes nodes other than the one specified in the SSH_KEY environment variable, which is used by the CI for provisioning the clusters. Ideally, the test should pull the keys directly from the cluster configuration; however, this is currently not trivial to implement.

Refs #465

Signed-off-by: Mateusz Gozdek <[email protected]>
Force-pushed from 6b6f3d8 to 72079dd
@iaguis tests are passing.
lgtm
lgtm, with a suggestion.
// Set the right arguments from the manifest with the desired SHA512 sum.
ds.Spec.Template.Spec.Containers[0].Args = []string{
	"-c",
	fmt.Sprintf("sha512sum --status -c <(echo %s /home/core/.ssh/authorized_keys) && exec tail -f /dev/null", sum), //nolint:lll
fmt.Sprintf("sha512sum --status -c <(echo %s /home/core/.ssh/authorized_keys) && exec tail -f /dev/null", sum), //nolint:lll | |
fmt.Sprintf("sha512sum \" + | |
"--status -c <(echo %s /home/core/.ssh/authorized_keys) && \" | |
"exec tail -f /dev/null", sum | |
), |
Would something like this be better, to avoid //nolint:lll?
I'm not sure if that improves the readability 😄
@invidian please re-trigger the job; the baremetal job failed for some weird reason.
It's done, it runs here: https://yard.lokomotive-k8s.net/teams/main/pipelines/lokomotive-oss-bare-metal-pr/jobs/test-lokomotive-deployment/builds/611.1