Change to use the FilteredPodInformer to only watch Tekton relevant pods #3725
Conversation
Hi @zhangtbj. Thanks for your PR. I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
First of all, thanks for this contribution, and for sharing the thorough analysis that you collected. I think this is going to be really useful to users running Tekton on clusters alongside other workloads, which is exactly the situation that is hard to discover in our own tests.
@@ -109,6 +111,7 @@ func main() {
 		log.Fatal(http.ListenAndServe(":"+port, mux))
 	}()
 
+	ctx = filteredinformerfactory.WithSelectors(ctx, v1beta1.ManagedByLabelKey)
I'm a bit confused about how this works, so I'd like to check that my understanding is correct.
This WithSelectors modifies the ctx to make it possible to filter on the managed-by label, which we take advantage of later in taskrun/controller.go#55 to create an informer with that ability. Is that correct?
That's a bit... subtle... but I don't know of a better way myself either, and the benefit seems clear.
Is there any kind of test you could add that would demonstrate that the controller doesn't get informed of a Pod it's not managing? That would help ensure that the logic is correct, and guard against future regressions.
Thanks for the review Jason!
This WithSelectors modifies the ctx to make it possible to filter on the managed-by label, which we take advantage of later in taskrun/controller.go#55 to create an informer with that ability. Is that correct?
Yes, correct.
- First of all, we need to create a FilteredInformerFactory to set up the FilteredPodInformer; we can pass a label array so that the factory filters and caches resources for multiple labels.
- Then, in the taskrun controller logic, when getting the pod resource from the informer (taskrun/controller.go#55), the Tekton controller picks ONE of the label-keyed caches from the informer store, like the managed-by label we use (sketched below).
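To make those two steps concrete, here is a minimal sketch in Go. It assumes the usual knative.dev/pkg injection import paths for the filtered factory and the filtered pod informer; the function names setupSelectors and inController are illustrative only, not the actual Tekton code:

```go
package sketch

import (
	"context"

	filteredpodinformer "knative.dev/pkg/client/injection/kube/informers/core/v1/pod/filtered"
	filteredinformerfactory "knative.dev/pkg/client/injection/kube/informers/factory/filtered"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// Step 1 (cmd/controller/main.go): register the label key on the context
// before sharedmain sets up injection, so the filtered factory knows which
// label selectors to build per-selector informer caches for.
func setupSelectors(ctx context.Context) context.Context {
	return filteredinformerfactory.WithSelectors(ctx, v1beta1.ManagedByLabelKey)
}

// Step 2 (pkg/reconciler/taskrun/controller.go): ask for the informer cache
// that matches one specific label key, instead of an informer over all pods.
func inController(ctx context.Context) {
	podInformer := filteredpodinformer.Get(ctx, v1beta1.ManagedByLabelKey)
	_ = podInformer // used to build the pod lister and event handlers
}
```

The subtle part is that step 2 only works if step 1 ran first, because the filtered factory looks the selector list up from the context at injection time.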
Is there any kind of test you could add that would demonstrate that the controller doesn't get informed of a Pod it's not managing? That would help ensure that the logic is correct, and guard against future regressions.
Actually, I tried to add a unit test to cover the failure case, but I found it hard to do. :(
First of all, it is a framework change, not a functional change.
I tried to create the context without the selector by using the original SetupFakeContext function from knative.dev/pkg/reconciler/testing. Yes, it reports an error:
=== RUN TestFilteredPodInformer
logger.go:130: 2021-01-29T11:02:28.254+0800 PANIC fake/fake_filtered_factory.go:47 Unable to fetch labelkey from context.
--- FAIL: TestFilteredPodInformer (0.00s)
panic: Unable to fetch labelkey from context. [recovered]
panic: Unable to fetch labelkey from context.
But it is a panic; the SetupFakeContext function cannot return an error, so I cannot check that it is the "Unable to fetch labelkey from context." error.
So we gave up and assume that if, in the future, we miss the WithSelectors call in the context when setting up the filtered pod informer, we will see the same error and all related unit tests will fail too, like the tests I modified (see the sketch after this list):
- pkg/reconciler/pipelinerun/metrics_test.go
- pkg/reconciler/taskrun/metrics_test.go
- test/v1alpha1/wait_test.go
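As a purely hypothetical idea for the regression test discussed above (not something this PR does): since the failure surfaces as a panic rather than a returned error, a test could assert on the panic itself with recover. The helper name below is made up, and its panic is only a stand-in for the real one raised in fake_filtered_factory.go:

```go
package testing

import (
	"testing"
)

// setupFakeContextWithoutSelector is a made-up helper: imagine it doing what
// SetupFakeContext does today, but skipping the WithSelectors call, so the
// fake filtered factory cannot find a label key. The panic below stands in
// for the real one from fake_filtered_factory.go.
func setupFakeContextWithoutSelector(t *testing.T) {
	panic("Unable to fetch labelkey from context.")
}

// TestSelectorIsRequired turns the init-time panic into an assertable test
// failure instead of an abort of the whole test binary.
func TestSelectorIsRequired(t *testing.T) {
	defer func() {
		if recover() == nil {
			t.Fatal("expected a panic when no label selector is registered on the context")
		}
	}()
	setupFakeContextWithoutSelector(t)
}
```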
Or please let me know if you have any good ideas! Thanks! :)
/lgtm Thanks again for tracking this down, and for the thorough analysis. I updated the
/retest
Excuse me all, does anyone know what I can do about this failure? I already tried to run the generation script.
Has anyone seen this error before? Or does anyone know how to solve the problem?
Interesting, when you ran the generation script, for the openapi part, it generated it with a relative path 🤔. @zhangtbj where is your working copy cloned (in GOPATH or not)? There might be something weird happening because of that, and we need to fix it 😅
Hi @vdemeester, the GOPATH in my IDE is not the cloned folder in my workspace. I tried to provide another PR to switch the GOPATH: https://github.com/tektoncd/pipeline/pull/3732/files but it doesn't seem to work (same problem)... :( Do I need to change something on my side, or will you help fix something on the Tekton side? Thanks!
/test check-pr-has-kind-label
All checks have passed now. Thanks a lot all!
I find it weird that this is not merged yet 🤔
Yep, not sure if anyone can help with it :)
There is one more such PR, #3736. Both have the required checks and reviews, and both show as being in the merge pool, but they haven't merged for the past 13 or so hours 🤔 @tektoncd/core-maintainers any idea?
/lgtm cancel
Thanks for the solution, Vincent! I have fixed the conflict in my branch and the tests pass again. Please review whether it is OK now. Thanks!
@zhangtbj can you squash your commits into one 🙏🏼
Something weird happened with the commits. To be honest, it escapes me what exactly went wrong. So I (kind of) cherry-picked the commits again based on the latest
Interesting, I didn't see it either 🤔
This is interesting: the integration tests are failing to fetch
@vdemeester I am guessing all the PRs will face this failure. We will have to change the default branch to
Or maybe it's time to create a test repo under
oh my.. I missed that… 🤦🏼‍♂️ I will prepare a fix asap 😉
Let's try one more time after @vdemeester's fixes on the branch renaming 😄 /test pull-tekton-pipeline-integration-tests
There was an outage in our environment in which the Tekton controllers were impacted. Most of the controllers were killed by OOM and restarted continuously. The root cause was one end-user running a large number of pods; as a result there were ~8k completed pods in one namespace, and one pod was ~100K in size because of a big env variable in the pod spec. Most of the controllers loaded all pods in the cluster into the controller informer, which resulted in the aforementioned OOM scenario. Switch to filtered pod informers and use a Tekton-specific label to load only Tekton-managed pods into the informer cache.
Signed-off-by: Matthias Diester <[email protected]>
Rebased feature branch to pull in latest changes from master.
Thanks @HeavyWombat and all! We have squashed into one commit and the tests pass again. Please review.
All fixed, yay! Thanks @zhangtbj
Thank you a lot all! :) |
Fix for issue #3561.
We had an outage in our environment in which the Tekton controllers were impacted: most of the controllers were killed by OOM and restarted continuously.
The root cause was one end-user running a large number of pods; as a result there were ~8k completed pods in one namespace, and one pod was ~100K in size because of a big env variable in the pod spec. Currently, most of the controllers load all pods in the cluster into the controller informer, which is why the controllers ran into OOM. Besides the memory increase in the controllers, etcd usage also grew to 2 G at that time.
We already analyzed the problem in issue #3561 (for Tekton, we should filter and cache only the Tekton-related pods, not all pods), and the PR knative/pkg#1940 was merged on the Knative pkg side; we also need to adopt it on the Tekton side to avoid this kind of problem in the future.
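To make the "filter and cache only Tekton-related pods" idea concrete, here is a conceptual sketch (my paraphrase, not the knative.dev/pkg implementation): the label selector is applied to the list/watch requests sent to the API server, so non-matching pods never reach the local informer cache at all.

```go
package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newFilteredPodFactory builds a shared informer factory whose list/watch
// calls carry a label selector, so only matching objects are cached in memory.
func newFilteredPodFactory(client kubernetes.Interface, resync time.Duration, labelKey string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		client, resync,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			// A bare label key selects every object that carries that label,
			// regardless of its value (e.g. "app.kubernetes.io/managed-by").
			opts.LabelSelector = labelKey
		}),
	)
}
```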
@HeavyWombat and I provided the PR and verified that it works:
- [Original] The controller memory usage increased after a deployment with 500 pods (non-Tekton related)
- [Original] The controller memory usage increased after 500 TaskRun pods (Tekton related)
- [After using the new FilteredPodInformer] The controller memory usage did NOT increase after a deployment with 500 pods (non-Tekton related)
- [After using the new FilteredPodInformer] The controller memory usage increased after 500 TaskRun pods (Tekton related)
So the code works well.
Some comments for my PR:
- Updated the knative.dev/pkg dependency from v0.0.0-20210107022335-51c72e24c179 to v0.0.0-20210120200253-8cd47b5af35d to pick up knative/pkg#1940.
- Added ctx = filteredinformerfactory.WithSelectors(ctx, v1beta1.ManagedByLabelKey) in cmd/controller/main.go before the sharedmain.MainWithConfig call, to make sure the filtered label key is added to the context. Without it, the init reports the failure Unable to fetch labelkey from context. (see https://github.com/knative/pkg/blob/master/client/injection/kube/informers/factory/filtered/filtered_factory.go#L56)
- Updated the SetupFakeContext function in /pkg/reconciler/testing/logger.go for testing so that I can also add the LabelKey for the fake filtered informer: ctx = filteredinformerfactory.WithSelectors(ctx, v1beta1.ManagedByLabelKey). It is similar to what the Knative Serving team does: https://github.com/knative/serving/pull/10266/files (a sketch of this test-side setup follows below)
/kind bug
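For the last item, a rough sketch of what the test-side setup presumably ends up looking like, assuming the usual knative.dev/pkg fake-injection helpers (the exact wiring in pkg/reconciler/testing/logger.go may differ):

```go
package testing

import (
	"context"
	"testing"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/record"
	filteredinformerfactory "knative.dev/pkg/client/injection/kube/informers/factory/filtered"
	"knative.dev/pkg/controller"
	"knative.dev/pkg/injection"
	logtesting "knative.dev/pkg/logging/testing"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// SetupFakeContext builds the context used by reconciler tests. The key point
// of this PR is that WithSelectors runs *before* the fake injection informers
// are set up; otherwise the fake filtered factory panics with
// "Unable to fetch labelkey from context."
func SetupFakeContext(t *testing.T) (context.Context, []controller.Informer) {
	ctx := logtesting.TestContextWithLogger(t)
	ctx = controller.WithEventRecorder(ctx, record.NewFakeRecorder(1000))
	ctx = filteredinformerfactory.WithSelectors(ctx, v1beta1.ManagedByLabelKey)
	ctx, informers := injection.Fake.SetupInformers(ctx, &rest.Config{})
	return ctx, informers
}
```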
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
- If you've added a new binary to the cmd dir, please update the release Task to build and release this image.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.
Release Notes
Also cc @afrittoli
Thanks for the review!