-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Conversation
Related: #1810 |
Yes, though that won't be merged without substantial changes. I intend this PR to obviate the other. |
Is it possible to predict whether an image is stored in an ECR registry? (I know ECR images use a special URL, but I don't know if there are exceptions to that). If so, I think a better solution is to do the AWS detection lazily (only after we are sure we need to warm up an image registry in ECR). I think that would make all users happy:
Sorry for not coming up with this before. |
Absolutely; the hostname will be a subdomain of
Not a bad idea. The trade-off, I suppose, is that the warning should it come, will be in an unpredictable place in the logs. But probably fairly early, and, well, I would keep the |
ETOOMANYKNOBS ? I don't strongly oppose to it, but IMO having a flag just for that seems a bit overkill. |
The alternative is that it fails quietly, which is worse than it is now (since at least it'll log a messy error at the top of the log, as things stand). |
I think that's just the result of having treated ECR specially. Why is an AWS-credentials error more important than, say, any other registry access error? (those are still buried in the logs and not at the beginning). |
It'd be good to be able to escalate other registry access errors, though it's somewhat harder to do for the piece-wise imagePullSecrets. I'm all for adding The pivot is which of these you think is a better trade-off:
In a sense I'm leaning towards 2) by proposing The main argument against 2) is that it makes it harder to have a general purpose installation -- which Weave Cloud is heavily weighted towards (effectively, 2. is impossible to support for Weave Cloud installations, at present). |
Would it be an idea to put it behind a |
I would also go for 2 if I had to choose between them. However, to avoid flags, how about
Admittedly this won't cover the case of incorporating ECR images in later commits, but I think it's a good compromise between error reporting quality and adding extra knobs. As an additional advantage, we won't be giving special treatment to ECR or any other cloud registry. |
We don't have all known images at startup; that's done asynchronously in another loop, and left running independently. With some rearrangement though, an initial run could be worked into the startup sequence. But at that point I think we'd have a more serious problem: we won't know what's supposed to work, and what is not. At present, people can leave the flags at defaults, and most of their images will be scanned. If there's some that are expected to fail, those will fail and be missing, and the others will be OK. If we fail entirely for anything that can't be scanned at startup, it means everyone has to get their configuration exactly right. For example, there are some infrastructural images used by EKS that are published by the EKS team under a particular account. The ECR-specific code ignores that account by default. If we made all image scans mandatory, and that account changed, all the fluxd instances on EKS would start crashlooping. Another variation would be to scan for AWS access if there are any ECR images discovered at startup; but it's not a given that you must be running on AWS if you are using ECR images (or GCP if using GCR). That's why I proposed adding flags that assert what needs to be available. I am converging on
(and, in the fullness of time, similar arrangements for other environments, in other PRs) |
OK, I agree, let's do that |
I would also like to apply the
The only possible answer is "allow all regions", but this is inconsistent with the other situations, so may be surprising. Perhaps I shouldn't worry about that. |
Not quite true; I could
|
OK, I've done what I think will be the least trouble for users, whether they use AWS or not. Taking the different situations one by one: Using AWS and expecting ECR auth to just work
Not using AWS, not using ECR
Not using AWS, but using some ECR images
|
599a792
to
18abe4a
Compare
@2opremio I'm happy with how this works for the end user (though still open to other ideas); I am not convinced of the implementation, though it achieves the goals* and is fairly concise. *the goals:
|
) | ||
|
||
func optionalVar(fs *pflag.FlagSet, value ssh.OptionalValue, name, usage string) ssh.OptionalValue { | ||
fs.Var(value, name, usage) | ||
return value | ||
} | ||
|
||
type stringset []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify this with the StringSet
defined in warming.go
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, yes. They have different methods, and not quite compatible representations, but (for example) I could move StringSet
into a neutral package, add Contains
for main.go to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. There is also a stringset in policies.go . It would be good to unify all of them. However, I understand you may not want to do it in this PR, so I will leave it up to you.
@2opremio Were you happy aside from the |
Yes |
At present fluxd will check initially for access to the AWS API (actually the EC2 metadata service), and log the error it gets if it's not accessible. For people running outside AWS, this can look like something is failing, and cause undue anxiety. On the other hand, if you _are_ running in AWS you probably expect it to just work, and you _do_ want to know about any failure -- otherwise, you might not notice that, say, automation has stopped upgrading images. To cover that second case, and defang the first, the command-line argument `--registry-require=ecr` will make it a fatal error if fluxd can't reach the EC2 metadata service (which I'm using as an approximation to being able to access the AWS API in general). By using the flag, you can escalate a silent failure into a crash, which -- we can hope -- you are more likely to notice.
It is useful to be able to constrain the image repos scanned, even when you don't need or want to use the AWS authentication. This commit makes the constraints operate independently of AWS authentication, by using a "pre-flight check" to determine whether the AWS auth should be used, but applying the constraints either way. The preflight check is also used to exit from main() if `--registry-require=ecr` is set and the AWS API is not available.
b6d5d08
to
82e2f0c
Compare
At present fluxd will check initially for access to the AWS API (actually the EC2 metadata service), and log the error it gets if it's not accessible.
For people running outside AWS, this can look like something is failing, and cause undue anxiety. On the other hand, if you are running in AWS you probably expect it to just work, and you do want to know about any failure -- otherwise, you might not notice that, say, automation has stopped upgrading images.
To cover that second case, and defang the first, the flag
--registry-ecr-mandatory
makes it a fatal error if fluxd can't reach the EC2 metadata service (which I'm using as an approximation to being able to access the AWS API in general). By using the flag, you can escalate a silent failure into a crash, which -- we can hope -- you are more likely to notice.Without the flag, if AWS is not available, fluxd will just log the "pre-flight check failed" warning, and not the full error.
I consulted with Adam on this, since he's been through the experience of using fluxd with ECR repos, and in particular using
kiam
which can cause AWS to be unavailable from pods. He said he anticipated there would be problems withkiam
, so got the fix (using annotations) in ahead of time. However, making the ECR support mandatory means that if that ever breaks, we are more likely to find out.