[release-4.16] OCPBUGS-49808: sriov-network-metrics-exporter fix for multiple nodes #1054

Open
wants to merge 7 commits into
base: release-4.16

openshift-cherrypick-robot

This is an automated cherry-pick of #1016

/assign openshift-ci-robot

zeeke added 7 commits February 4, 2025 11:14
Exposed metrics can be verified by scraping the Prometheus
endpoint on the `sriov-network-metrics-exporter` pod.
Add a test that spawns an SR-IOV consuming pod and verifies
that its receive counter increases when the interface is pinged
from outside.
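
The check described above can be sketched as follows. This is a minimal illustration, not the test added by this commit: the metric name `sriov_vf_rx_packets` is assumed by analogy with `sriov_vf_tx_packets`, the helper `counterValue` is hypothetical, and the two payloads are made-up samples standing in for scrapes taken before and after the ping.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// counterValue scans a Prometheus text-format payload and returns the value
// of the first sample named metric whose label block contains every fragment
// in wantLabels (for example `pciAddr="0000:3b:02.4"`).
func counterValue(payload, metric string, wantLabels ...string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blanks, comments (# HELP / # TYPE) and other metrics.
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metric+"{") {
			continue
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		labels := line[len(metric)+1 : end]
		matched := true
		for _, l := range wantLabels {
			if !strings.Contains(labels, l) {
				matched = false
				break
			}
		}
		if !matched {
			continue
		}
		// The sample value is the first field after the label block.
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil {
			return v, true
		}
	}
	return 0, false
}

func main() {
	// Illustrative payloads only; a real test would scrape the exporter pod.
	before := "# TYPE sriov_vf_rx_packets counter\n" +
		"sriov_vf_rx_packets{pciAddr=\"0000:3b:02.4\"} 10\n"
	after := "# TYPE sriov_vf_rx_packets counter\n" +
		"sriov_vf_rx_packets{pciAddr=\"0000:3b:02.4\"} 15\n"

	b, _ := counterValue(before, "sriov_vf_rx_packets", `pciAddr="0000:3b:02.4"`)
	a, _ := counterValue(after, "sriov_vf_rx_packets", `pciAddr="0000:3b:02.4"`)
	fmt.Println("receive counter increased:", a > b)
}
```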

Signed-off-by: Andrea Panattoni <[email protected]>
PrometheusRules allow recording pre-defined queries. Use the
`sriov_kubepoddevice` metric to add the `pod|namespace` pair
to the SR-IOV metrics.

Feature is enabled via the `METRICS_EXPORTER_PROMETHEUS_DEPLOY_RULE`
environment variable.
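
A feature gate of this kind is commonly read as a boolean-like string. The sketch below assumes that convention; the commit message does not state which values enable the feature, and the helper `deployRuleEnabled` is hypothetical rather than the operator's own code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Name of the variable as given in the commit message.
const deployRuleEnv = "METRICS_EXPORTER_PROMETHEUS_DEPLOY_RULE"

// deployRuleEnabled reports whether the variable holds a truthy value.
// Assumption: values accepted by strconv.ParseBool ("true", "1", ...) enable
// the feature; an unset or unparsable value leaves it disabled.
func deployRuleEnabled(lookup func(string) (string, bool)) bool {
	raw, ok := lookup(deployRuleEnv)
	if !ok {
		return false
	}
	enabled, err := strconv.ParseBool(raw)
	return err == nil && enabled
}

func main() {
	fmt.Println("deploy PrometheusRule:", deployRuleEnabled(os.LookupEnv))
}
```

Passing the lookup function as a parameter keeps the check testable without touching the process environment.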

Signed-off-by: Andrea Panattoni <[email protected]>
When the `metricsExporter` feature is turned off, deployed resources
should be removed. These changes fix the error:

```
2024-08-28T14:07:57.699760017Z    ERROR    controller/controller.go:266    Reconciler error    {"controller": "sriovoperatorconfig", "controllerGroup": "sriovnetwork.openshift.io", "controllerKind": "SriovOperatorConfig", "SriovOperatorConfig": {"name":"default","namespace":"openshift-sriov-network-operator"}, "namespace": "openshift-sriov-network-operator", "name": "default", "reconcileID": "fa841c50-dbb8-4c4c-9ddd-b98624fd2a24", "error": "failed to delete object &{map[apiVersion:monitoring.coreos.com/v1 kind:ServiceMonitor metadata:map[name:sriov-network-metrics-exporter namespace:openshift-sriov-network-operator] spec:map[endpoints:[map[bearerTokenFile:/var/run/secrets/kubernetes.io/serviceaccount/token honorLabels:true interval:30s port:sriov-network-metrics scheme:https tlsConfig:map[caFile:/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt insecureSkipVerify:false serverName:sriov-network-metrics-exporter-service.openshift-sriov-network-operator.svc]]] namespaceSelector:map[matchNames:[openshift-sriov-network-operator]] selector:map[matchLabels:map[name:sriov-network-metrics-exporter-service]]]]} with err: could not delete object (monitoring.coreos.com/v1, Kind=ServiceMonitor) openshift-sriov-network-operator/sriov-network-metrics-exporter: servicemonitors.monitoring.coreos.com \"sriov-network-metrics-exporter\" is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot delete resource \"servicemonitors\" in API group \"monitoring.coreos.com\" in the namespace \"openshift-sriov-network-operator\""}
```

Signed-off-by: Andrea Panattoni <[email protected]>
Two SR-IOV pods deployed on different nodes may use devices
with the same PCI address. In such cases, the query suggested [1] by the sriov-network-metrics-exporter produces the error:

```

Error loading values found duplicate series for the match group {pciAddr="0000:3b:02.4"} on the right hand-side of the operation:
    [
        {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.60:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-cnfdr22.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }, {
            __name__="sriov_kubepoddevice",
            container="test",
            dev_type="openshift.io/intelnetdevice",
            endpoint="sriov-network-metrics",
            instance="10.1.98.230:9110",
            job="sriov-network-metrics-exporter-service",
            namespace="cnf-4916",
            pciAddr="0000:3b:02.4",
            pod="pod-dhcp-98-230.telco5g.eng.rdu2.redhat.com",
            prometheus="openshift-monitoring/k8s",
            service="sriov-network-metrics-exporter-service"
        }
    ];many-to-many matching not allowed: matching labels must be unique on one side
```

Configure the ServiceMonitor resource to add a `node` label to all metrics.
The correct query to get metrics, as updated in the PrometheusRule, is:

```
sriov_vf_tx_packets * on (pciAddr,node) group_left(pod,namespace,dev_type) sriov_kubepoddevice
```
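
Why matching on `pciAddr` alone fails, and why adding `node` resolves it, can be illustrated with a small simulation of the uniqueness check on the right-hand side of the join. This is a sketch of the matching rule only, not Prometheus code; the node and pod names are hypothetical, and `duplicateKeys` is a helper invented for the illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// series is a label set, standing in for one sriov_kubepoddevice sample.
type series map[string]string

// matchKey builds the match group from the labels listed in on(...).
func matchKey(s series, on []string) string {
	parts := make([]string, 0, len(on))
	for _, l := range on {
		parts = append(parts, l+"="+s[l])
	}
	return strings.Join(parts, ",")
}

// duplicateKeys returns the match groups that occur more than once on the
// right-hand ("one") side of a group_left join. Any such group makes the
// match many-to-many, which is rejected.
func duplicateKeys(right []series, on []string) []string {
	seen := map[string]int{}
	var dups []string
	for _, s := range right {
		k := matchKey(s, on)
		seen[k]++
		if seen[k] == 2 {
			dups = append(dups, k)
		}
	}
	return dups
}

func main() {
	// Two pods on different nodes using VFs with the same PCI address.
	right := []series{
		{"pciAddr": "0000:3b:02.4", "node": "node-a", "pod": "pod-a", "namespace": "cnf-4916"},
		{"pciAddr": "0000:3b:02.4", "node": "node-b", "pod": "pod-b", "namespace": "cnf-4916"},
	}
	fmt.Println(duplicateKeys(right, []string{"pciAddr"}))         // prints [pciAddr=0000:3b:02.4]
	fmt.Println(duplicateKeys(right, []string{"pciAddr", "node"})) // prints []
}
```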

Also remove the `pod`, `namespace` and `container` labels from the `sriov_vf_*` metrics, as they were
wrongly set to `sriov-network-metrics-exporter-zj2n9`, `openshift-sriov-network-operator` and `kube-rbac-proxy` respectively.

[1] https://github.com/k8snetworkplumbingwg/sriov-network-metrics-exporter/blob/0f6a784f377ede87b95f31e569116ceb9775b5b9/README.md?plain=1#L38

Signed-off-by: Andrea Panattoni <[email protected]>
When using a node selector with boolean values, e.g.:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
spec:
  configDaemonNodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
```

the value needs to be quoted before forwarding it to the metrics-exporter
node selector field.
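
The quoting step can be sketched as below. It assumes the selector is rendered into a YAML manifest as text, where an unquoted `true` would be read as a boolean instead of a string; `renderNodeSelector` is a hypothetical helper and not the function changed by this commit.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// renderNodeSelector renders a label map as YAML mapping lines, always
// double-quoting the value so that "true" stays a string after parsing.
// Keys are sorted to keep the output deterministic.
func renderNodeSelector(sel map[string]string, indent string) string {
	keys := make([]string, 0, len(sel))
	for k := range sel {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s%s: %s\n", indent, k, strconv.Quote(sel[k]))
	}
	return b.String()
}

func main() {
	sel := map[string]string{
		"feature.node.kubernetes.io/network-sriov.capable": "true",
	}
	fmt.Print("nodeSelector:\n" + renderNodeSelector(sel, "  "))
}
```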

Fixes openshift#766

Signed-off-by: Andrea Panattoni <[email protected]>
Make the operator create PrometheusRules so that
metrics can be browsed in the Developer Console.

refs:
- k8snetworkplumbingwg/sriov-network-operator#732

Signed-off-by: Andrea Panattoni <[email protected]>
@openshift-ci-robot

@openshift-cherrypick-robot: Detected clone of Jira Issue OCPBUGS-43106 with correct target version. Will retitle the PR to link to the clone.
/retitle [release-4.16] OCPBUGS-49808: sriov-network-metrics-exporter fix for multiple nodes

In response to this:

This is an automated cherry-pick of #1016

/assign openshift-ci-robot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot changed the title from "[release-4.16] OCPBUGS-43106: sriov-network-metrics-exporter fix for multiple nodes" to "[release-4.16] OCPBUGS-49808: sriov-network-metrics-exporter fix for multiple nodes" on Feb 4, 2025
@openshift-ci-robot openshift-ci-robot added labels on Feb 4, 2025: jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting)
@openshift-ci-robot

@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-49808, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.z) matches configured target version for branch (4.16.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-43106 is in the state Closed (Done-Errata), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-43106 targets the "4.17.z" version, which is one of the valid target versions: 4.17.0, 4.17.z
  • bug has dependents

Requesting review from QA contact:
/cc @zhaozhanqi

The bug has been updated to refer to the pull request using the external bug tracker.


openshift-ci bot commented Feb 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: openshift-cherrypick-robot
Once this PR has been reviewed and has the lgtm label, please assign bn222 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zeeke

zeeke commented Feb 4, 2025

/test ?


openshift-ci bot commented Feb 4, 2025

@zeeke: The following commands are available to trigger required jobs:

/test api
/test ci-index-operator-bundle
/test controllers
/test gofmt
/test images
/test operator-e2e
/test pkg

The following commands are available to trigger optional jobs:

/test e2e-openstack-nfv
/test e2e-openstack-nfv-config-drive
/test e2e-telco5g-sriov
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-sriov-network-operator-release-4.16-api
pull-ci-openshift-sriov-network-operator-release-4.16-ci-index-operator-bundle
pull-ci-openshift-sriov-network-operator-release-4.16-controllers
pull-ci-openshift-sriov-network-operator-release-4.16-e2e-openstack-nfv
pull-ci-openshift-sriov-network-operator-release-4.16-e2e-openstack-nfv-config-drive
pull-ci-openshift-sriov-network-operator-release-4.16-gofmt
pull-ci-openshift-sriov-network-operator-release-4.16-images
pull-ci-openshift-sriov-network-operator-release-4.16-operator-e2e
pull-ci-openshift-sriov-network-operator-release-4.16-pkg
pull-ci-openshift-sriov-network-operator-release-4.16-security

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zeeke

zeeke commented Feb 4, 2025

/test e2e-telco5g-sriov

1 similar comment
@zeeke

zeeke commented Feb 4, 2025

/test e2e-telco5g-sriov


openshift-ci bot commented Feb 4, 2025

@openshift-cherrypick-robot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-nfv | 24c5e0b | link | false | `/test e2e-openstack-nfv` |
| ci/prow/e2e-telco5g-sriov | 24c5e0b | link | false | `/test e2e-telco5g-sriov` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@zeeke

zeeke commented Feb 5, 2025

/hold

Needs

to pass failing tests:

Summarizing 3 Failures:
  [FAIL] [sriov] Metrics Exporter [It] collects metrics regarding receiving traffic via VF
  /go/src/github.com/openshift/sriov-network-operator/test/conformance/tests/test_exporter_metrics.go:228
  [FAIL] [sriov] Metrics Exporter When Prometheus operator is available [It] PrometheusRule should provide namespaced metrics
  /go/src/github.com/openshift/sriov-network-operator/test/conformance/tests/test_exporter_metrics.go:129
  [FAIL] [sriov] Metrics Exporter When Prometheus operator is available [It] Metrics should have the correct labels
  /go/src/github.com/openshift/sriov-network-operator/test/conformance/tests/test_exporter_metrics.go:158

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Feb 5, 2025