
Change to use the FilteredPodInformer to only watch Tekton relevant pods #3725

Merged
merged 1 commit into tektoncd:master on Feb 4, 2021

Conversation

zhangtbj
Contributor

@zhangtbj zhangtbj commented Jan 28, 2021

Fix for issue #3561.

We had an outage in our environment where the Tekton controllers were impacted: most of the controllers were killed by OOM and restarted continuously.

The root cause is that one end-user ran a large number of pods; as a result, there were ~8k completed pods in one namespace, and one pod was ~100K in size because of a big env variable in its pod spec. Currently, most of the controllers load all pods in the cluster into the controller informer, which is why the controllers ran into OOM. Besides the memory increase in the controllers, etcd usage also grew to 2 GB at that time.
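For context, a minimal Go sketch of the pattern that causes this, assuming the stock knative.dev/pkg injection API (the package name and function name here are illustrative):

package sketch

import (
	"context"

	podinformer "knative.dev/pkg/client/injection/kube/informers/core/v1/pod"
)

func setup(ctx context.Context) {
	// The unfiltered injected informer lists and watches ALL Pods
	// cluster-wide, so every completed pod (Tekton-related or not)
	// ends up in the controller's in-memory cache.
	podInformer := podinformer.Get(ctx)
	_ = podInformer.Lister() // cache-backed lister over every Pod in the cluster
}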


We already analyzed the problem in issue #3561 (for Tekton, we should filter and cache only the Tekton-related pods, not all pods), and the corresponding PR knative/pkg#1940 was merged on the Knative pkg side; we now need to adopt it on the Tekton side to avoid this kind of problem in the future.

@HeavyWombat and I provided the PR and verified it works:

[Original] The controller memory usage increases after a deployment with 500 pods (non-Tekton related) is created
[screenshot: Screen Shot 2021-01-23 at 2 51 03 PM]

[Original] The controller memory usage increases after there are 500 taskrun pods (Tekton related)
[screenshot: Screen Shot 2021-01-23 at 2 18 05 PM]

[After using the new FilteredPodInformer] The controller memory usage does NOT increase after a deployment with 500 pods (non-Tekton related) is created
[screenshot: Screen Shot 2021-01-23 at 5 44 50 PM]

[After using the new FilteredPodInformer] The controller memory usage increases after there are 500 taskrun pods (Tekton related)
[screenshot: Screen Shot 2021-01-23 at 6 25 29 PM]

So the code works well.

Some comments for my PR:

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Performance improvement: Only watch Pods managed by Tekton, instead of all Pods on the cluster

Also cc @afrittoli

Thanks for the review!

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jan 28, 2021
@tekton-robot tekton-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 28, 2021
@tekton-robot
Collaborator

Hi @zhangtbj. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 28, 2021
@afrittoli
Member

/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 28, 2021
@tekton-robot tekton-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 28, 2021
Member

@imjasonh imjasonh left a comment

First of all, thanks for this contribution, and for sharing the thorough analysis that you collected. I think this is going to be really useful to users running Tekton on clusters alongside other workloads, a scenario that makes this kind of issue hard to discover in our tests.

@@ -109,6 +111,7 @@ func main() {
log.Fatal(http.ListenAndServe(":"+port, mux))
}()

ctx = filteredinformerfactory.WithSelectors(ctx, v1beta1.ManagedByLabelKey)
Member

I'm a bit confused about how this works, so I'd like to check that my understanding is correct.

This WithSelectors modifies the ctx to make it possible to filter on the managed-by label, which we take advantage of later in taskrun/controller.go#55 to create an informer with that ability. Is that correct?

That's a bit... subtle... but I don't know of a better way myself either, and the benefit seems clear.

Is there any kind of test you could add that would demonstrate that the controller doesn't get informed of a Pod it's not managing? That would help ensure that the logic is correct, and guard against future regressions.

Contributor Author

Thanks for the review Jason!

This WithSelectors modifies the ctx to make it possible to filter on the managed-by label, which we take advantage of later in taskrun/controller.go#55 to create an informer with that ability. Is that correct?

Yes, correct.

  • First, we create a FilteredInformerFactory to set up the FilteredPodInformer; we can pass a label array to filter and cache resources by multiple labels.
  • Then, in the taskrun controller logic, when getting the pod resource from the informer (taskrun/controller.go#55), the Tekton controller can choose ONE of the label-filtered caches from the informer store, like the managed-by label we use (a minimal sketch of this wiring follows below).
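A minimal Go sketch of that wiring, based on the diff in this PR and the knative/pkg#1940 API (the package name, import aliases, and function name are assumptions):

package sketch

import (
	"context"

	filteredpodinformer "knative.dev/pkg/client/injection/kube/informers/core/v1/pod/filtered"
	filteredinformerfactory "knative.dev/pkg/client/injection/kube/informers/factory/filtered"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

func setup(ctx context.Context) {
	// Register the label key(s) the filtered factory should select on; every
	// filtered informer created through injection caches only objects that
	// carry one of these labels (this is the line added in cmd/controller).
	ctx = filteredinformerfactory.WithSelectors(ctx, v1beta1.ManagedByLabelKey)

	// Later, in the taskrun controller constructor, fetch the informer that
	// is filtered on the managed-by label; its cache never sees unrelated Pods.
	podInformer := filteredpodinformer.Get(ctx, v1beta1.ManagedByLabelKey)
	_ = podInformer
}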

Is there any kind of test you could add that would demonstrate that the controller doesn't get informed of a Pod it's not managing? That would help ensure that the logic is correct, and guard against future regressions.

Actually, I tried to add a unit test to cover the failure case, but I found it hard to do. :(
First of all, it is a framework change, not a functional change.
I tried to create the context without the selector by using the original SetupFakeContext function from knative.dev/pkg/reconciler/testing. Yes, it reports an error:

=== RUN   TestFilteredPodInformer
    logger.go:130: 2021-01-29T11:02:28.254+0800	PANIC	fake/fake_filtered_factory.go:47	Unable to fetch labelkey from context.
--- FAIL: TestFilteredPodInformer (0.00s)
panic: Unable to fetch labelkey from context. [recovered]
	panic: Unable to fetch labelkey from context.

But it is a panic; the SetupFakeContext function cannot return an error that I could check against Unable to fetch labelkey from context. So we gave up and assume that if in the future we miss the WithSelectors call on the context when setting up the filtered pod informer, we will see the same error and all related unit tests will fail too, like the tests which I modified (see the sketch after this list):

  • pkg/reconciler/pipelinerun/metrics_test.go
  • pkg/reconciler/taskrun/metrics_test.go
  • test/v1alpha1/wait_test.go
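For reference, a hedged sketch of the context setup those tests need; newTestContext is a hypothetical helper name, and only WithSelectors and the label key come from this PR. The point is that the selector must be on the context before any fake filtered informer is constructed, otherwise the fake factory panics as shown above.

package testhelpers

import (
	"context"
	"testing"

	filteredinformerfactory "knative.dev/pkg/client/injection/kube/informers/factory/filtered"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// newTestContext seeds the managed-by label key into the context before any
// fake filtered informer is set up, so the fake factory can find the key
// instead of panicking with "Unable to fetch labelkey from context."
func newTestContext(t *testing.T) context.Context {
	t.Helper()
	return filteredinformerfactory.WithSelectors(context.Background(), v1beta1.ManagedByLabelKey)
}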

Or please let me know if you have any good ideas! Thanks! :)

@imjasonh
Member

/lgtm

Thanks again for tracking this down, and for the thorough analysis.

I updated the release-note section of your initial post; please feel free to update/reword it however you want.

@tekton-robot tekton-robot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Jan 29, 2021
@zhangtbj
Contributor Author

/retest

@zhangtbj
Contributor Author

zhangtbj commented Jan 29, 2021

Excuse me all,

Does anyone know what I can do about the pull-tekton-pipeline-build-tests failure? :(

I already tried running ./hack/update-codegen.sh several times; there is no code change now. I also ran ./hack/verify-codegen.sh, and there is no error either:

➜  pipeline git:(pod_informer) ./hack/verify-codegen.sh
Generating deepcopy funcs for resource:v1alpha1
Generating clientset for resource:v1alpha1 at github.com/tektoncd/pipeline/pkg/client/resource/clientset
Generating listers for resource:v1alpha1 at github.com/tektoncd/pipeline/pkg/client/resource/listers
Generating informers for resource:v1alpha1 at github.com/tektoncd/pipeline/pkg/client/resource/informers
Generating deepcopy funcs for pipeline:v1alpha1,v1beta1
Generating clientset for pipeline:v1alpha1,v1beta1 at github.com/tektoncd/pipeline/pkg/client/clientset
Generating listers for pipeline:v1alpha1,v1beta1 at github.com/tektoncd/pipeline/pkg/client/listers
Generating informers for pipeline:v1alpha1,v1beta1 at github.com/tektoncd/pipeline/pkg/client/informers
Generating injection for resource:v1alpha1 at github.com/tektoncd/pipeline/pkg/client/resource/injection
Generating injection for pipeline:v1alpha1,v1beta1 at github.com/tektoncd/pipeline/pkg/client/injection
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/controller/kodata/refs
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/controller/kodata/third_party
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/entrypoint/kodata/refs
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/entrypoint/kodata/third_party
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/git-init/kodata/refs
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/git-init/kodata/third_party
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/imagedigestexporter/kodata/refs
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/imagedigestexporter/kodata/third_party
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/kubeconfigwriter/kodata/refs
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/kubeconfigwriter/kodata/third_party
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/nop/kodata/refs
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/nop/kodata/third_party
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/pullrequest-init/kodata/refs
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/pullrequest-init/kodata/third_party
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/webhook/kodata/refs
warning: ignoring symlink /Users/jordan/gopath/src/github.com/tektoncd/pipeline/cmd/webhook/kodata/third_party
Generating OpenAPI specification ...
Generating swagger file ...
Diffing /Users/jordan/gopath/src/github.com/tektoncd/pipeline against freshly generated codegen
/Users/jordan/gopath/src/github.com/tektoncd/pipeline up to date.

Has anyone seen this error before? Or how can I solve the problem?
Thanks a lot!

@vdemeester
Member

@@ -761,7 +761,7 @@ func schema_pkg_apis_pipeline_v1beta1_Em
 								Schema: &spec.Schema{
 									SchemaProps: spec.SchemaProps{
 										Default: map[string]interface{}{},
-										Ref:     ref("github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1.WorkspaceDeclaration"),
+										Ref:     ref("./pkg/apis/pipeline/v1beta1.WorkspaceDeclaration"),
 									},
 								},
 							},

Interesting, when you ran the generation script, for the openapi part it generated a relative path 🤔. @zhangtbj where is your working copy cloned (in GOPATH or not)? There might be something weird happening because of that, and we need to fix it 😅

@zhangtbj zhangtbj changed the title Change to use the FilteredPodInformer to only watch Tekton relevant pods WIP: Change to use the FilteredPodInformer to only watch Tekton relevant pods Jan 29, 2021
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2021
@zhangtbj zhangtbj changed the title WIP: Change to use the FilteredPodInformer to only watch Tekton relevant pods Change to use the FilteredPodInformer to only watch Tekton relevant pods Jan 29, 2021
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2021
@zhangtbj
Contributor Author

zhangtbj commented Jan 29, 2021

Hi @vdemeester ,
Thanks for help!

The clone in my workspace is not under the GOPATH configured in my IDE.

And I tried to provide another PR to switch GOPATH: https://github.com/tektoncd/pipeline/pull/3732/files

It doesn't seem to work (I have the same problem)... :(

Do I need to change something on my side, or will you help fix something on the Tekton side?

Thanks!

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2021
@afrittoli
Member

/test check-pr-has-kind-label

@zhangtbj
Contributor Author

zhangtbj commented Feb 3, 2021

All checks have passed now.

Thanks a lot all!

@pritidesai
Member

I find it weird that this is not merged yet 🤔

@zhangtbj
Contributor Author

zhangtbj commented Feb 3, 2021

Yep, not sure if anyone can help with it :)

@pritidesai
Member

There is one more such PR, #3736; both have the required checks and reviews, both show as being in the merge pool, but neither has merged in the past 13 or so hours 🤔 @tektoncd/core-maintainers any idea?

@vdemeester
Member

vdemeester commented Feb 3, 2021

/lgtm cancel
This PR needs a rebase ("This branch cannot be rebased due to conflicts" appears above the merge button). Sometimes this happens: GitHub refuses to merge, but prow doesn't see it as a rebase problem, so… the queue gets "stuck".

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 3, 2021
@zhangtbj
Contributor Author

zhangtbj commented Feb 3, 2021

Thanks for the solution, Vincent!

I have fixed the conflict in my branch, and the tests passed again.

Please help review it again. Thanks!

@vdemeester
Member

@zhangtbj can you squash your commit into one 🙏🏼

@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 3, 2021
@HeavyWombat
Contributor

@zhangtbj can you squash your commit into one 🙏🏼

Something weird happened with the commits. To be honest, it escapes me what exactly went wrong. So I (kind of) cherry-picked the commits again based on the latest master and then squashed everything together. I manually verified that the diff shows no delta to the previous one and verify-codegen.sh also went through without issues on my local machine.

@pritidesai
Member

pritidesai commented Feb 3, 2021

("This branch cannot be rebased due to conflicts" appears above the merge button)

interesting, I didn't see it either 🤔

@pritidesai
Member

This is interesting: the integration tests are failing to fetch the master branch in the community repo, since there is no master branch anymore 🙃

+ /ko-app/git-init -url https://github.com/tektoncd/community.git -revision master -refspec  -path /workspace/output/application '-sslVerify=true' '-submodules=true' -depth 1
        {"level":"error","ts":1612372701.073239,"caller":"git/git.go:54","msg":"Error running git [fetch --recurse-submodules=yes --depth=1 origin --update-head-ok --force master]: exit status 128\nfatal: couldn't find remote ref master\n","stacktrace":"github.com/tektoncd/pipeline/pkg/git.run\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:54\ngithub.com/tektoncd/pipeline/pkg/git.Fetch\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:145\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/git-init/main.go:52\nruntime.main\n\truntime/proc.go:204"}
        {"level":"fatal","ts":1612372701.073441,"caller":"git-init/main.go:53","msg":"Error fetching git repository: failed to fetch [master]: exit status 128","stacktrace":"main.main\n\tgithub.com/tektoncd/pipeline/cmd/git-init/main.go:53\nruntime.main\n\truntime/proc.go:204"}

@vdemeester I am guessing all the PRs will face this failure. We will have to change the default branch to main in examples.

@pritidesai
Member

Or maybe it's time to create a test repo under tektoncd and use that instead of cloning the pipeline or community repo in our examples?

@vdemeester
Member

This is interesting: the integration tests are failing to fetch the master branch in the community repo, since there is no master branch anymore

@vdemeester I am guessing all the PRs will face this failure. We will have to change the default branch to main in examples.

oh my.. I missed that… 🤦🏼‍♂️ I will prepare a fix asap 😉

@pritidesai
Member

let's try one more time after @vdemeester's fix for the branch renaming 😄

/test pull-tekton-pipeline-integration-tests

There was an outage in our environment where the Tekton controllers were
impacted. Most of the controllers were killed by OOM and restarted
continuously. The root cause is that one end-user ran a large number of
pods; as a result there were ~8k completed pods in one namespace, and one
pod was ~100K in size because of a big env variable in the pod spec. Most
of the controllers loaded all pods in the cluster into the controller
informer, which resulted in the aforementioned OOM scenario.

Switch to filtered pod informers and use a Tekton-specific label to only
load Tekton-managed pods into the informer cache.

Signed-off-by: Matthias Diester <[email protected]>
@HeavyWombat
Contributor

Rebased feature branch to pull in latest changes from master.

@zhangtbj
Contributor Author

zhangtbj commented Feb 4, 2021

Thanks @HeavyWombat and all!

We have squashed everything into one commit and the tests passed again.

Please review again.

@vdemeester
Member

All fixed yay, thanks @zhangtbj
/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 4, 2021
@tekton-robot tekton-robot merged commit fbceaae into tektoncd:master Feb 4, 2021
@zhangtbj
Contributor Author

zhangtbj commented Feb 4, 2021

Thank you a lot all! :)
