Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-43106: [release-4.17] sriov-network-metrics-exporter fix for multiple nodes #1016

Merged

Conversation

zeeke
Copy link
Contributor

@zeeke zeeke commented Oct 11, 2024

4.17 fix for sriov-network-metrics-exporter when multiple nodes assign the same pciAddress to pods.

@openshift-ci openshift-ci bot requested review from pliurh and SchSeba October 11, 2024 16:22
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 11, 2024
@zeeke
Copy link
Contributor Author

zeeke commented Oct 11, 2024

/jira cherrypick OCPBUGS-41711
/test e2e-telco5g-sriov

@openshift-ci-robot
Copy link
Contributor

@zeeke: Jira Issue OCPBUGS-41711 has been cloned as Jira Issue OCPBUGS-43106. Will retitle bug to link to clone.
/retitle OCPBUGS-43106: [release-4.17] sriov-network-metrics-exporter fix for multiple nodes

In response to this:

/jira cherrypick OCPBUGS-41711
/test e2e-telco5g-sriov

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot changed the title [release-4.17] sriov-network-metrics-exporter fix for multiple nodes OCPBUGS-43106: [release-4.17] sriov-network-metrics-exporter fix for multiple nodes Oct 11, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 11, 2024
@openshift-ci-robot
Copy link
Contributor

@zeeke: This pull request references Jira Issue OCPBUGS-43106, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected dependent Jira Issue OCPBUGS-41711 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

4.17 fix for sriov-network-metrics-exporter when multiple nodes assign the same pciAddress to pods.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Oct 11, 2024
Exposed metrics can be verified by scraping the prometheus
endpoint on the `sriov-network-metrics-exporter` pod.
Add a test that spawns an SR-IOV consuming pod and verifies
its receiving counter increase when the interface is pinged from
outside.

Signed-off-by: Andrea Panattoni <[email protected]>
PrometheusRules allow recording pre-defined queries. Use
`sriov_kubepoddevice` metric to add `pod|namespace` pair
to the sriov metrics.

Feature is enabled via the `METRICS_EXPORTER_PROMETHEUS_DEPLOY_RULE`
environment variable.

Signed-off-by: Andrea Panattoni <[email protected]>
When the `metricsExporter` feature is turned off, deployed resources
should be removed. These changes fix the error:

```
│ 2024-08-28T14:07:57.699760017Z    ERROR    controller/controller.go:266    Reconciler error    {"controller": "sriovoperatorconfig", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovOperatorConfig", "SriovOperatorConfig": {"name":"default","namespace":"openshift-sriov-network-operator"},  │
│ "namespace": "openshift-sriov-network-operator", "name": "default", "reconcileID": "fa841c50-dbb8-4c4c-9ddd-b98624fd2a24", "error": "failed to delete object &{map[apiVersion:monitoring.coreos.com/v1 kind:ServiceMonitor metadata:map[name:sriov-network-metrics-exporter namespace:openshift-sriov-network-operator]  │
│ spec:map[endpoints:[map[bearerTokenFile:/var/run/secrets/kubernetes.io/serviceaccount/token honorLabels:true interval:30s port:sriov-network-metrics scheme:https tlsConfig:map[caFile:/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt insecureSkipVerify:false serverName:sriov-network-metrics-expor │
│ ter-service.openshift-sriov-network-operator.svc]]] namespaceSelector:map[matchNames:[openshift-sriov-network-operator]] selector:map[matchLabels:map[name:sriov-network-metrics-exporter-service]]]]} with err: could not delete object (monitoring.coreos.com/v1, Kind=ServiceMonitor) openshift-sriov-network-operato │
│ r/sriov-network-metrics-exporter: servicemonitors.monitoring.coreos.com \"sriov-network-metrics-exporter\" is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot delete resource \"servicemonitors\" in API group \"monitoring.coreos.com\" in the namespace \"ope │
│ nshift-sriov-network-operator\""}
```

Signed-off-by: Andrea Panattoni <[email protected]>
It might happen that two SR-IOV pods, deployed on different node, are using devices
with the same PCI address. In such cases, the query suggested [1] by the sriov-network-metrics-exporter produces the error:

```

Error loading values found duplicate series for the match group {pciAddr="0000:3b:02.4"} on the right hand-side of the operation:
    [
        {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.60:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-cnfdr22.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }, {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.230:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }
    ];many-to-many matching not allowed: matching labels must be unique on one side
```

Configure the ServiceMonitor resource to add a `node` label to all metrics.
The right query to get metrics, as updated in the PrometheusRule, will be:

```
sriov_vf_tx_packets * on (pciAddr,node) group_left(pod,namespace,dev_type) sriov_kubepoddevice
```

Also remove `pod`,  `namespace` and `container` label from the `sriov_vf_*` metrics, as they were
wrongly set to `sriov-network-metrics-exporter-zj2n9`, `openshift-sriov-network-operator`, `kube-rbac-proxy`

[1] https://github.com/k8snetworkplumbingwg/sriov-network-metrics-exporter/blob/0f6a784f377ede87b95f31e569116ceb9775b5b9/README.md?plain=1#L38

Signed-off-by: Andrea Panattoni <[email protected]>
When using a node selector with boolean values, e.g.:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
spec:
  configDaemonNodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
```

the value needs to be quoted before forwarding it to the metrics-exporter
node selector field.

Fixes openshift#766

Signed-off-by: Andrea Panattoni <[email protected]>
Make the operator creating PrometheusRules to browse
metrics in the Developer Console.

refs:
- k8snetworkplumbingwg/sriov-network-operator#732

Signed-off-by: Andrea Panattoni <[email protected]>
@zeeke zeeke force-pushed the ds/417/OCPBUGS-41711 branch from 6491c3c to 6361999 Compare December 10, 2024 15:47
@zeeke
Copy link
Contributor Author

zeeke commented Dec 10, 2024

/jira refresh
/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Dec 10, 2024
@openshift-ci-robot
Copy link
Contributor

@zeeke: This pull request references Jira Issue OCPBUGS-43106, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh
/label backport-risk-assessed

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zeeke
Copy link
Contributor Author

zeeke commented Dec 10, 2024

/retest

Copy link
Contributor

openshift-ci bot commented Dec 10, 2024

@zeeke: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@asood-rh
Copy link

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Dec 11, 2024
@zeeke
Copy link
Contributor Author

zeeke commented Dec 12, 2024

/jira refresh

@SchSeba please take a look at this backport

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 12, 2024
@openshift-ci-robot
Copy link
Contributor

@zeeke: This pull request references Jira Issue OCPBUGS-43106, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.z) matches configured target version for branch (4.17.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-41711 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-41711 targets the "4.18.0" version, which is one of the valid target versions: 4.18.0
  • bug has dependents

Requesting review from QA contact:
/cc @zhaozhanqi

In response to this:

/jira refresh

@SchSeba please take a look at this backport

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from zhaozhanqi December 12, 2024 10:29
@SchSeba
Copy link
Contributor

SchSeba commented Jan 6, 2025

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 6, 2025
Copy link
Contributor

openshift-ci bot commented Jan 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SchSeba, zeeke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 9a9ba1b into openshift:release-4.17 Jan 6, 2025
11 checks passed
@openshift-ci-robot
Copy link
Contributor

@zeeke: Jira Issue OCPBUGS-43106: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-43106 has been moved to the MODIFIED state.

In response to this:

4.17 fix for sriov-network-metrics-exporter when multiple nodes assign the same pciAddress to pods.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: sriov-network-webhook
This PR has been included in build sriov-network-webhook-container-v4.17.0-202501060837.p0.g9a9ba1b.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: sriov-network-operator
This PR has been included in build sriov-network-operator-container-v4.17.0-202501060837.p0.g9a9ba1b.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: sriov-network-config-daemon
This PR has been included in build sriov-network-config-daemon-container-v4.17.0-202501060837.p0.g9a9ba1b.assembly.stream.el9.
All builds following this will include this PR.

@zeeke
Copy link
Contributor Author

zeeke commented Feb 4, 2025

/jira backport release-4.16

@openshift-ci-robot
Copy link
Contributor

@zeeke: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.16

In response to this:

/jira backport release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #1054

In response to this:

@zeeke: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.16

In response to this:

/jira backport release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.