-
Notifications
You must be signed in to change notification settings - Fork 533
feat: collect remote resource status when enabled #1292
feat: collect remote resource status when enabled #1292
Conversation
I'll have a look at the tests. |
fde81cd
to
710f39a
Compare
It seems the in-memory controllers test is failing 🔍 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few nits and questions along the way
``` | ||
|
||
An example of a `FederatedDeployment` federated in a `cluster1` where the feature | ||
collected the resource remote status `remoteStatus`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unclear...
- does the example below define the full schema of
remoteStatus
? - does the structure in
remoteStatus
depend on the targettype? - Is there a list of acceptable targetTypes?
- using 1 example...
status.clusters["cluster1"].replicas
(not exact jsonpath, but hopefully clear). What is the value here if the targettype does support replicas? - what does `conditions[*].status == "True" mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status: "True"
is part of the status of the DeploymentStatus core resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a first pass, it's mostly nits. On Monday I'll do another one as I think I understand the flow of the feature much better now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, did my second pass. I think the general direction makes sense (doing this in the sync controller where propagation status is already updated) - so 👍 .
I have couple of questions/concerns though and because of that not ready to approve.
@@ -207,18 +207,20 @@ func (f *FederatedTypeConfig) GetFederatedType() metav1.APIResource { | |||
return apiResourceToMeta(f.Spec.FederatedType, f.GetFederatedNamespaced()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please be a little bit more verbose with the docs here especially now that things are a bit more complicated
note that statustype is invalid when rawresourcecollection is enabled, also not that without rawresourcecollection, statusenabled works only for services. I would personally already go as far as marking the statustype as deprecated.
this kind of information should be available on the type level https://github.com/kubernetes-sigs/kubefed/blob/11cbdf9171b19eeabdc56a6c2e1f29f6c851eee9/pkg/apis/core/v1beta1/federatedtypeconfig_types.go#L49
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, you propose to mark statusType as deprecated, where we agreed on that. I don't think we mentioned in the KEP to remove the code from the old mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I just wanted for this PR to enhance the docs above that field. Not remove anything
|
||
d.unmanagedDispatcher.RemoveManagedLabel(clusterName, clusterObj) | ||
} | ||
|
||
func (d *managedDispatcherImpl) RecordClusterError(propStatus status.PropagationStatus, clusterName string, err error) { | ||
d.fedResource.RecordError(string(propStatus), err) | ||
d.RecordStatus(clusterName, propStatus) | ||
d.RecordStatus(clusterName, propStatus, "UnknownClusterError") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this good idea? That means that instead of a status resource of known structure we'll report string and that can break people parsing the status expecting it to have form of the resource status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resource status for the federated resource does not have a known structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats not right. The resource status will be an unstructured object which is supposed to be of status type of that particular resource. Users can be expected to decode it into a typed structure using the type informartion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does not have explicit schema but as a consumer of the Federated api you would expect the schema to be the same as schema of the status of the underlying resource, right?
} | ||
|
||
func (d *managedDispatcherImpl) recordOperationError(propStatus status.PropagationStatus, clusterName, operation string, err error) util.ReconciliationStatus { | ||
d.recordError(clusterName, operation, err) | ||
d.RecordStatus(clusterName, propStatus) | ||
d.RecordStatus(clusterName, propStatus, "UnknownOperationError") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, not a big fan of the string here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaving it empty? :-) why is propagation status not enough in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nil :-)
and the error should surface in conditions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I think empty is much better than string from consumer standpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see, yeah. I like the fact that conditions will be used for such cases! So +1
pkg/controller/sync/status/status.go
Outdated
Name string `json:"name"` | ||
Status PropagationStatus `json:"status,omitempty"` | ||
RemoteStatus interface{} `json:"remoteStatus,omitempty"` | ||
// Conditions []*metav1.Condition `json:"conditions,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so are conditions part of this PR or not? KEP still mentions it but I don't see them implemented and I think they are a bit redundant now with the full status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is part of the KEP, but how stated in the PR description This work is a first iteration
. Anyway I could remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, my second part of the question is whether they are still needed now that we federate the whole status. The original idea was to federate only conditions but we stepped away from that.
} | ||
return status.CollectedPropagationStatus{ | ||
StatusMap: statusMap, | ||
ResourcesUpdated: d.resourcesUpdated, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this rather be a field of CollectedPropagationStatus
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate more the reasoning behind this decision ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well it's a field with the same value. Actually it looks like ResourcesUpdated
on CollectedResourceStatus
is never used. Not sure if it's bug or if it's just redundant. Might be redundant because it always has the same value as the other one 🤷
c46c070
to
e5b8ce8
Compare
@hectorj2f I noticed that the default federated CRDs don't have |
|
||
Name string `json:"name"` | ||
Status PropagationStatus `json:"status,omitempty"` | ||
RemoteStatus interface{} `json:"remoteStatus,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This deviates from the proposal (KEP) that you originally posted, the intent of which was to be able to precisely capture the state of a given resource in each cluster here. In other words mainly derive at the state as ready
or notReady
. Its understood, that state can only be derived from individual resource status per cluster, I have my apprehensions in storing the status (in etcd) as part of the individual federated resources. What you have added in this PR can already be done via the status controller as I did mention in this comment on your KEP, but as a separate status resource. The only change that would be needed would be to remove this condition. I am quite fine with capturing the raw status as a different resource right now.
The concerns that I have with storing the raw status as part of the federated resource:
- Performance: This is more of a question then anything else (as I don't have concrete numbers myself). I guess the CRUD on individual resources will be much more expensive, given all the additional statuses stored in it. This can be a problem as the sync reconciler already does a lot and is a very regular process.
- Manually unreadable: Think getting a federated resource list and then trying to figure out federated fields among all the statuses.
- Confusing statuses/so many statuses at the same place: As a user what I am really interested in is:- if my resource was propagated fine to all clusters and then, is my resource in ready/not ready state in my clusters, with some place I can go an refer to (for example the separate resourceStatus) if I see a probalematic state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having said this, I would also add to please treat this as a trigger for a conversation to arrive at the best solution.
Another problem, that I do remember with the raw status collection is user reports of cluster etcd being filled up to its limits, the reason why we did put this condition in place.
@jimmidyson @RainbowMango
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, it will be collected only where you wanna collect it - you still have to turn it on per type. That said, maybe this perf problems will be partially solved by a solution we were also discussing and that's not federating whole status but just a selected subset via jsonpath. That at least makes it possible to lower the performance impact.
But overall I feel your concerns regarding performance even though this is opt-in so current installations won't be affected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @irfanurrehman
What you have added in this PR can already be done via the status controller as I did mention in this comment on your KEP, but as a separate status resource.
@irfanurrehman Yes, we've deviated from the original KEP due to some recent requirements, but we also updated the original KEP as part of this PR. The reason is to store the status of the federated resources in the target clusters instead of using the original approach FederatedServiceStatus
. As mentioned in the description, this PR is a first iteration of the original proposal, fetching the status from the federated resources does not determine its readiness, so there is some work to be done.
@irfanurrehman The main reason to exclude the current approach is the inefficiency of having to store/manage the status of each resource as an additional resource type and resource in the datastore. I will share our thoughts about your concerns in the following:
Performance: This is more of a question then anything else (as I don't have concrete numbers myself). I guess the CRUD on individual resources will be much more expensive, given all the additional statuses stored in it. This can be a problem as the sync reconciler already does a lot and is a very regular process.
Firstly, the feature is disabled by default for all the resources. When enabled, it'd be done for those specific resources. In addition to that, I decided to omit the status controller to carry on this new task, and instead rely on the dispatch controller which is already fetching the resource. We are already periodically fetching the resource, so we could simply reuse the same logic to consume its status.
Another problem, that I do remember with the raw status collection is user reports of cluster etcd being filled up to its limits, the reason why we did put this condition in place.
This sounds more like a problem we had with etcd v2 where bolt stored data locally on each node. Etcdv3 has improved this behavior and the size of the database is relatively improved with new boltdb improvements.
In any case we are planning to evaluate the performance cost of this change.
Manually unreadable: Think getting a federated resource list and then trying to figure out federated fields among all the statuses.
I agree but I assume this is a decision of the user to desire having the status of the resources as part of the federated resources.
Confusing statuses/so many statuses at the same place: As a user what I am really interested in is:- if my resource was propagated fine to all clusters and then, is my resource in ready/not ready state in my clusters, with some place I can go an refer to (for example the separate resourceStatus) if I see a problematic state.
Yes, I agree with the terms here. But I see this feature as an addition to have a better visibility of the status of the federated resources to be aware of consume their status in centralized manner. I can see value out of this feature when having the status as part of the FederatedResource.
pkg/controller/sync/controller.go
Outdated
collectedStatus := dispatcher.CollectedStatus() | ||
return s.setFederatedStatus(fedResource, status.AggregateSuccess, &collectedStatus) | ||
collectedStatus, collectedResourceStatus := dispatcher.CollectedStatus() | ||
klog.Infof("Setting the federating status '%v'", collectedResourceStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't make sense to have this at loglevel 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/federating/federated
pkg/controller/sync/controller.go
Outdated
return true, nil | ||
} | ||
|
||
klog.Infof("Updating status for %s %q", kind, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, we don't want a flood of messages at loglevel 0 for each resource status updated.
|
||
d.unmanagedDispatcher.RemoveManagedLabel(clusterName, clusterObj) | ||
} | ||
|
||
func (d *managedDispatcherImpl) RecordClusterError(propStatus status.PropagationStatus, clusterName string, err error) { | ||
d.fedResource.RecordError(string(propStatus), err) | ||
d.RecordStatus(clusterName, propStatus) | ||
d.RecordStatus(clusterName, propStatus, "UnknownClusterError") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats not right. The resource status will be an unstructured object which is supposed to be of status type of that particular resource. Users can be expected to decode it into a typed structure using the type informartion.
} | ||
|
||
func (d *managedDispatcherImpl) recordOperationError(propStatus status.PropagationStatus, clusterName, operation string, err error) util.ReconciliationStatus { | ||
d.recordError(clusterName, operation, err) | ||
d.RecordStatus(clusterName, propStatus) | ||
d.RecordStatus(clusterName, propStatus, "UnknownOperationError") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nil :-)
and the error should surface in conditions
e5b8ce8
to
331a14a
Compare
@hectorj2f I see that you have pushed new changes. Have you completed your review rework? |
@irfanurrehman Yes. But I want to share some graphs and numbers from a test we performed to compare existing vs current approach. I'll ping you this week, thanks for keeping one eye on the PR. |
331a14a
to
02f40ac
Compare
@hectorj2f Thank you so much for that in-depth performance benchmarking. I don't see anything of concern there. One thing that I don't see covered though is when resources are propagated to multiple clusters. It would be good to see what happens when propagating resources to 10, 50, 100, 200 clusters. Even without that, I am happy with this approach and would like to see it merged. |
Signed-off-by: Hector Fernandez <[email protected]>
02f40ac
to
154d5f6
Compare
@jimmidyson I agree those tests are missing. But the degradation would be proportional to the amount of clusters in a linear manner to how the system behaves today under these circumstances. |
I will look at the tests, since I rebased last time it is failing 🤔. |
Thanks @hectorj2f for doing this and apologies for missing your update earlier. |
75b03b2
to
1ea9fac
Compare
@irfanurrehman Yes, it is done. Please, feel free to re-review it. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I would love to do some follow-ups around naming stuff but otherwise this looks great. I also tested it locally and it's working 👏
Signed-off-by: Hector Fernandez <[email protected]>
1ea9fac
to
6c72bac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hectorj2f Thanks for doing this. I have one observation which remains open. Please update that and then we can merge.
Signed-off-by: Hector Fernandez <[email protected]>
29eee26
to
9e71719
Compare
Thanks @hectorj2f for doing this. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alenkacz, hectorj2f, irfanurrehman, makkes The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
This PR aims to implement this feature https://github.com/kubernetes-sigs/kubefed/blob/ba8f9a44e85951e460454ab7875cb0a551750124/docs/keps/20200619-federated-resource-status.md. This work is a first iteration towards improving the troubleshooting of the status of federated resources.
In this branch, we add a new property to the FederatedResources
remoteStatus
which stores the status of the created object on each federated cluster. An example of a FederatedDeployment and its status is shown below:To enable this feature, you need to add the feature gate in the kubefedconfig spec:
In addition to that step, you will need to enable the
statusCollection
for the FederatedTypeConfig resources you wish to collect theremoteStatus
. To do so, you set the propertystatusCollection
toEnabled
.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
I am testing some changes in the test suite so additional polishing actions are expected.
Obviously this feature could potentially cause some performance degradation if our scenario federates many resources and we enable the status collection for all them. We are working on measuring and improving this behavior.