feat(must-gather): Added must-gather scripts #261

Merged: 1 commit into sustainable-computing-io:v1alpha1 on Oct 5, 2023

Conversation

Collaborator

@vimalk78 vimalk78 commented Sep 29, 2023

This PR adds must-gather scripts for kepler-operator.

Sample execution on a single-node CRC cluster:

$ oc adm must-gather --image=$(oc -n openshift-operators get deployment.apps/kepler-operator-controller-manager -o jsonpath='{.spec.template.spec.containers[?(@.name == "manager")].image}')  -- /usr/bin/gather 
[must-gather      ] OUT Using must-gather plug-in image: quay.io/vimalkum/kepler-operator:0.0.0-must-gather
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1f7b621c-7e7a-4ca8-81d5-f49ad9d9acb6
ClusterVersion: Stable at "4.13.12"
ClusterOperators:
	clusteroperator/cloud-credential is missing
	clusteroperator/cluster-autoscaler is missing
	clusteroperator/insights is missing
	clusteroperator/monitoring is missing
	clusteroperator/storage is missing


[must-gather      ] OUT namespace/openshift-must-gather-t75p4 created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-llgwf created
[must-gather      ] OUT pod for plug-in image quay.io/vimalkum/kepler-operator:0.0.0-must-gather created
[must-gather-mjjlz] POD 2023-10-03T09:38:03.616361057Z must-gather logs are located at: '/must-gather/gather-debug.log'
[must-gather-mjjlz] POD 2023-10-03T09:38:03.616361057Z powermon must-gather started...
[must-gather-mjjlz] POD 2023-10-03T09:38:03.629416554Z 2023-10-03 09:38:03 getting kepler instance
[must-gather-mjjlz] POD 2023-10-03T09:38:05.108103126Z 2023-10-03 09:38:05 getting openshift-kepler-operator events
[must-gather-mjjlz] POD 2023-10-03T09:38:05.697635288Z 2023-10-03 09:38:05 getting kepler exporter daemonset
[must-gather-mjjlz] POD 2023-10-03T09:38:06.905154905Z 2023-10-03 09:38:06 getting kepler exporter config map
[must-gather-mjjlz] POD 2023-10-03T09:38:07.653189798Z 2023-10-03 09:38:07 getting kepler exporter service account
[must-gather-mjjlz] POD 2023-10-03T09:38:08.162324862Z 2023-10-03 09:38:08 getting kepler exporter service account
[must-gather-mjjlz] POD 2023-10-03T09:38:08.785281522Z 2023-10-03 09:38:08 running gather script for kepler pod: kepler-exporter-ds-h26gp
[must-gather-mjjlz] POD 2023-10-03T09:38:08.864821711Z 2023-10-03 09:38:08 collecting pod yaml for kepler pod: kepler-exporter-ds-h26gp
[must-gather-mjjlz] POD 2023-10-03T09:38:09.663837981Z 2023-10-03 09:38:09 collecting information from "cpuid" from kepler pod: kepler-exporter-ds-h26gp
[must-gather-mjjlz] POD 2023-10-03T09:38:10.371721220Z 2023-10-03 09:38:10 collecting environment variables from kepler pod: kepler-exporter-ds-h26gp
[must-gather-mjjlz] POD 2023-10-03T09:38:11.223976399Z 2023-10-03 09:38:11 collecting kernel version from kepler pod: kepler-exporter-ds-h26gp
[must-gather-mjjlz] POD 2023-10-03T09:38:12.969152179Z 2023-10-03 09:38:12 collecting ebpf information from kepler pod: kepler-exporter-ds-h26gp
[must-gather-mjjlz] POD 2023-10-03T09:38:14.204599060Z 2023-10-03 09:38:14 collecting logs from kepler pod: kepler-exporter-ds-h26gp
[must-gather-mjjlz] POD 2023-10-03T09:38:14.821072700Z 2023-10-03 09:38:14 running gather script for olm
[must-gather-mjjlz] POD 2023-10-03T09:38:14.914602992Z 2023-10-03 09:38:14 collecting olm info for kepler-operator
[must-gather-mjjlz] POD 2023-10-03T09:38:18.226907411Z 2023-10-03 09:38:18 collecting olm summary
[must-gather-mjjlz] POD 2023-10-03T09:38:19.342571734Z 2023-10-03 09:38:19 getting kepler-operator info
[must-gather-mjjlz] POD 2023-10-03T09:38:19.380959531Z 2023-10-03 09:38:19 collecting subscription info for kepler-operator
[must-gather-mjjlz] POD 2023-10-03T09:38:19.839823768Z 2023-10-03 09:38:19 collecting catalogsource info for kepler-operator
[must-gather-mjjlz] POD 2023-10-03T09:38:20.203048340Z 2023-10-03 09:38:20 collecting installplan info for kepler-operator
[must-gather-mjjlz] POD 2023-10-03T09:38:20.930806572Z 2023-10-03 09:38:20 collecting CSV for kepler-operator
[must-gather-mjjlz] POD 2023-10-03T09:38:24.948783859Z 2023-10-03 09:38:24 collecting deployment info for kepler-operator
[must-gather-mjjlz] POD 2023-10-03T09:38:29.524788569Z 2023-10-03 09:38:29 collecting pod info for kepler-operator
[must-gather-mjjlz] POD 2023-10-03T09:38:48.171054954Z powermon must-gather completed
[must-gather-mjjlz] OUT waiting for gather to complete
[must-gather-mjjlz] OUT downloading gather output
[must-gather-mjjlz] OUT receiving incremental file list
[must-gather-mjjlz] OUT ./
[must-gather-mjjlz] OUT gather-debug.log
[must-gather-mjjlz] OUT kepler-exporter-cm.yaml
[must-gather-mjjlz] OUT kepler-exporter-ds.yaml
[must-gather-mjjlz] OUT kepler-exporter-sa.yaml
[must-gather-mjjlz] OUT kepler-exporter-scc.yaml
[must-gather-mjjlz] OUT kepler.yaml
[must-gather-mjjlz] OUT openshift-kepler-operator_events
[must-gather-mjjlz] OUT kepler-info/
[must-gather-mjjlz] OUT kepler-info/kepler-exporter-ds-h26gp/
[must-gather-mjjlz] OUT kepler-info/kepler-exporter-ds-h26gp/ebpf-info
[must-gather-mjjlz] OUT kepler-info/kepler-exporter-ds-h26gp/env-variables
[must-gather-mjjlz] OUT kepler-info/kepler-exporter-ds-h26gp/kepler-pod.yaml
[must-gather-mjjlz] OUT kepler-info/kepler-exporter-ds-h26gp/kepler.log
[must-gather-mjjlz] OUT kepler-info/kepler-exporter-ds-h26gp/kernel-info
[must-gather-mjjlz] OUT kepler-info/kepler-exporter-ds-h26gp/node-cpuid-info
[must-gather-mjjlz] OUT kepler-operator-info/
[must-gather-mjjlz] OUT kepler-operator-info/kepler-operator-catalogsource.yaml
[must-gather-mjjlz] OUT kepler-operator-info/kepler-operator-csv.yaml
[must-gather-mjjlz] OUT kepler-operator-info/kepler-operator-deployment.yaml
[must-gather-mjjlz] OUT kepler-operator-info/kepler-operator-installplan.yaml
[must-gather-mjjlz] OUT kepler-operator-info/kepler-operator-subscription.yaml
[must-gather-mjjlz] OUT kepler-operator-info/kepler-operator.yaml
[must-gather-mjjlz] OUT kepler-operator-info/summary.txt
[must-gather-mjjlz] OUT olm-info/
[must-gather-mjjlz] OUT olm-info/olm-reources.yaml
[must-gather-mjjlz] OUT olm-info/summary.txt
[must-gather-mjjlz] OUT 
[must-gather-mjjlz] OUT sent 469 bytes  received 137,494 bytes  30,658.44 bytes/sec
[must-gather-mjjlz] OUT total size is 592,732  speedup is 4.30
[must-gather      ] OUT namespace/openshift-must-gather-t75p4 deleted
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-llgwf deleted


Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1f7b621c-7e7a-4ca8-81d5-f49ad9d9acb6
ClusterVersion: Stable at "4.13.12"
ClusterOperators:
	clusteroperator/cloud-credential is missing
	clusteroperator/cluster-autoscaler is missing
	clusteroperator/insights is missing
	clusteroperator/monitoring is missing
	clusteroperator/storage is missing


@vimalk78 vimalk78 requested review from husky-parul, sthaha and vprashar2929 and removed request for husky-parul September 29, 2023 15:32
@vimalk78 vimalk78 marked this pull request as draft September 29, 2023 17:57
@vimalk78 vimalk78 marked this pull request as ready for review October 2, 2023 08:42
@vimalk78 vimalk78 requested review from husky-parul and sthaha October 2, 2023 08:42
esac

get_kepler_instance() {
log "getting kepler instance" >> "$LOGFILE_PATH"
Collaborator

How about having log itself write / append to $LOGFILE?
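A minimal sketch of that suggestion, assuming a timestamp format like the one in the sample output above. The temp-file location for LOGFILE_PATH is an assumption made only so the example runs on its own; it is not taken from the PR.

```shell
#!/usr/bin/env bash
# Assumption for this sketch: the log file lives in a temp location.
LOGFILE_PATH="$(mktemp)"

# log writes a timestamped line to stdout and appends the same line to
# $LOGFILE_PATH, so callers no longer need their own redirect.
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $*" | tee -a "$LOGFILE_PATH"
}

get_kepler_instance() {
    log "getting kepler instance"
}

get_kepler_instance
```

With the redirect inside log, call sites shrink to a plain log "message".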

Dockerfile
WORKDIR /
COPY --from=builder /workspace/manager .
COPY --from=builder /workspace/must-gather/* /usr/bin/
COPY --from=origincli /usr/bin/oc /usr/bin
Collaborator

Does it also provide kubectl? Could we copy that as well?

Comment on lines +3 to +21
export KEPLER_NS="openshift-kepler-operator"
export LOGFILE_NAME="gather-debug.log"
Collaborator

To help with maintenance (and to make debugging easier), it is better to avoid side effects as much as possible when sourcing files. Ideally, a file should only define variables (especially globals) that are used by the functions in that file. Hence I would recommend moving KEPLER_NS to the individual files. Yes, we are repeating ourselves, but I believe that is the lesser evil :)

Note that LOGFILE_NAME is unused in this file, while the functions require LOGFILE_PATH, which must be defined elsewhere. I think it may be easier to encapsulate this by requiring the caller to invoke an init_log function that sets LOGFILE_PATH and makes it readonly soon after.
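One way the init_log idea could look, as a sketch only. The function signature (a single base-directory argument) and the temp directory used in the demo are assumptions; only the names init_log, LOGFILE_PATH and gather-debug.log come from the discussion.

```shell
#!/usr/bin/env bash
# Hypothetical init_log: the caller passes the collection directory, and the
# function fixes LOGFILE_PATH for the rest of the run.
init_log() {
    local base_dir="$1"
    mkdir -p "$base_dir"
    LOGFILE_PATH="$base_dir/gather-debug.log"
    readonly LOGFILE_PATH      # later assignments now fail loudly
    : > "$LOGFILE_PATH"        # start with an empty log file
}

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOGFILE_PATH"
}

# Demo call; the real script would pass its collection path instead.
init_log "$(mktemp -d)"
log "powermon must-gather started..."
```

Sourcing a file that contains only these function definitions has no side effects; the log path is set once, at the point where the caller chooses to call init_log.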

Collaborator

@sthaha sthaha left a comment

Looks good to me, but I will get @vprashar2929 to merge it after the validation is done.

Comment on lines +83 to +89
get_subscription
get_catalogsource
get_install_plan
get_csv
get_kepler_operator_deployment_info
get_kepler_operator_pod_info
get_summary
Collaborator

Should we add || true to every step?

Collaborator Author

Tested it. Even when the oc command throws an error, the execution continues.

Collaborator

@sthaha sthaha Oct 4, 2023

set -e should have stopped the script. Could you please try again?

Collaborator Author

good that i didn't add it then :)

Collaborator

I will verify this, but we should always ensure that these steps break (exit with a non-zero value) due to the nature of bash. E.g. say there is a typo in any of the commands above; we should be able to find that out. During development we would comment out # || true in the optional steps. get_catalog_source should be optional, since that will only be present if you run bundle.
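A small sketch of the pattern being described: set -e stays on so required steps stop the script, and only optional steps get || true. The function bodies are stand-ins invented for this example; the real steps call oc.

```shell
#!/usr/bin/env bash
set -e

# Stand-ins for the real gather steps (hypothetical bodies).
get_subscription()  { echo "subscription collected"; }
get_catalogsource() { echo "catalogsource not found" >&2; return 1; }
get_csv()           { echo "csv collected"; }

collect() {
    get_subscription
    get_catalogsource || true   # optional: only present when run from a bundle
    get_csv
}

output="$(collect 2>/dev/null)"
echo "$output"
```

Temporarily commenting out the || true during development makes the script stop at the optional step again, which is how a typo in that step would surface.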

Collaborator Author

if we have to deal with so much complexity, i would rather have must-gather written as a go program. oc adm must-gather does not care if it is a script or a binary.

get_kepler_sa
get_kepler_scc
gather_kepler_exporter_info
gather_olm_info
Collaborator

shouldn't we begin with capturing olm and operator info?

Collaborator

In the test scenario, I installed only the operator (imagine the operator installation failed) and ran the must gather.

❯ less after-install/quay-io-sthaha-kepler-operator-sha256-df656da3b9f7a0c4310f8a44cb7b11eb790a7be43794e67c1343e934953f6478/gather-debug.log
2023-10-05 01:23:52 getting kepler instance
2023-10-05 01:23:52 oc get keplers.kepler.system.sustainable.computing.io kepler -oyaml
Error from server (NotFound): keplers.kepler.system.sustainable.computing.io "kepler" not found

And that's all it captures.

Collaborator Author

the script is not stateful.

Collaborator

let me rephrase, in the scenario where the operator fails to install, i.e. no kepler CRD in cluster, what information should we capture?

Collaborator Author

if script executes all its steps, we will capture olm info.


get_kepler_instance() {
log "getting kepler instance"
run oc get keplers.kepler.system.sustainable.computing.io kepler -oyaml "$BASE_COLLECTION_PATH/kepler.yaml"
Collaborator

If the Kepler instance has not been created, execution terminates and the rest of the steps are skipped. I also believe we should add || true at every step.
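An alternative to sprinkling || true is to let the wrapper record the failure and leave the decision to the caller. The run function below is a hypothetical reconstruction: the PR's own run is not shown in full, so the "last argument is the output file" convention is inferred from the call sites, and false and echo stand in for oc calls.

```shell
#!/usr/bin/env bash
set -e
LOGFILE_PATH="$(mktemp)"   # assumption: temp file, for the demo only

# run treats the last argument as the output file and everything before it
# as the command. A failing command is logged and reported through the
# return status instead of silently ending the script.
run() {
    local out="${!#}"
    local -a cmd=("${@:1:$#-1}")
    echo "$(date '+%Y-%m-%d %H:%M:%S') ${cmd[*]}" >> "$LOGFILE_PATH"
    if ! "${cmd[@]}" >> "$out" 2>> "$LOGFILE_PATH"; then
        echo "command failed: ${cmd[*]}" >> "$LOGFILE_PATH"
        return 1
    fi
}

workdir="$(mktemp -d)"
run false "$workdir/kepler.yaml" || true          # stand-in for a missing Kepler instance
run echo "kind: Subscription" "$workdir/sub.yaml" # later steps still run
```

The failure still appears in the debug log, so a missing Kepler instance is visible in the gathered output without stopping the OLM collection that follows.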

echo -e "must-gather logs are located at: '${LOGFILE_PATH}'"

mkdir -p "${BASE_COLLECTION_PATH}/cache-dir"
export KUBECACHEDIR=${BASE_COLLECTION_PATH}/cache-dir
Collaborator

This copies cache-dir, which has all of kubectl's discovery and HTTP info. Do we need to copy the cache-dir?

Collaborator Author

the cache-dir is deleted at the end.

Collaborator Author

updated
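For illustration, one way to guarantee the cache directory never ends up in the collected output is an EXIT trap. This is a sketch, not the change that was actually pushed; the temp directory and the stub files are assumptions.

```shell
#!/usr/bin/env bash
BASE_COLLECTION_PATH="$(mktemp -d)"   # assumption: temp dir, for the demo only

# The function body runs in a subshell (note the parentheses), so the EXIT
# trap fires when gather finishes, even if a step fails midway.
gather() (
    mkdir -p "${BASE_COLLECTION_PATH}/cache-dir"
    export KUBECACHEDIR="${BASE_COLLECTION_PATH}/cache-dir"
    trap 'rm -rf "${BASE_COLLECTION_PATH}/cache-dir"' EXIT

    # Stand-ins for the real oc calls, which would populate the cache.
    touch "${KUBECACHEDIR}/discovery-stub"
    echo "kind: Kepler" > "${BASE_COLLECTION_PATH}/kepler.yaml"
)

gather
ls "$BASE_COLLECTION_PATH"
```

Only the gathered files remain under BASE_COLLECTION_PATH once gather returns.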

Comment on lines 83 to 93
case ${1-} in
--help | -h )
print_usage
exit 1
esac
Collaborator

Suggested change
- case ${1-} in
- --help | -h )
- print_usage
- exit 1
- esac
+ case ${1-} in
+ --help | -h )
+ print_usage
+ return 0
+ esac

Return success here, since help was explicitly requested.
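A self-contained sketch of the suggested behaviour, assuming the case statement sits inside a function so that return is valid. The main wrapper and the usage text are placeholders, not the script's real ones.

```shell
#!/usr/bin/env bash
# The usage text is a placeholder, not the real script's output.
print_usage() {
    echo "Usage: gather [--help | -h]"
}

main() {
    case ${1-} in
    --help | -h)
        print_usage
        return 0   # help was requested explicitly, so this is not an error
        ;;
    esac
    echo "powermon must-gather started..."
}
```

Calling main --help prints the usage and yields status 0, while main with no arguments falls through to the gather steps.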

}

get_summary() {
run oc -n "$KEPLER_OPERATOR_NS" get catalogsource kepler-operator-catalog -owide "$KEPLER_OPERATOR_INFO_DIR/summary.txt"
Collaborator

@sthaha sthaha Oct 5, 2023

In the scenario where the operator is installed from community-catalog, get_summary breaks here and the script terminates execution.

2023-10-05 01:24:48 oc -n openshift-operators get subscription -l operators.coreos.com/kepler-operator.openshift-operators= -oyaml
2023-10-05 01:24:48 collecting catalogsource info for kepler-operator
2023-10-05 01:24:48 oc -n openshift-operators get catalogsource kepler-operator-catalog -oyaml
Error from server (NotFound): catalogsources.operators.coreos.com "kepler-operator-catalog" not found
~
❯ ls kepler-operator-info
kepler-operator-catalogsource.yaml  kepler-operator-subscription.yaml

echo -e "\n" >> "$KEPLER_OPERATOR_INFO_DIR/summary.txt"

run oc -n "$KEPLER_OPERATOR_NS" get csv -owide "$KEPLER_OPERATOR_INFO_DIR/summary.txt"
echo -e "\n" >> "$KEPLER_OPERATOR_INFO_DIR/summary.txt"
Collaborator

Suggestion: how about we add \n in the run function itself?
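A sketch of that suggestion, under the same assumption that run takes the output file as its last argument. This variant is hypothetical and appends a single blank line after each command's output; echo stands in for the oc calls.

```shell
#!/usr/bin/env bash
# Hypothetical variant of run: the last argument is the output file, and a
# blank separator line is appended after each command's output.
run() {
    local out="${!#}"
    local status=0
    "${@:1:$#-1}" >> "$out" || status=$?
    printf '\n' >> "$out"
    return "$status"
}

summary="$(mktemp)"
run echo "catalogsource summary" "$summary"
run echo "csv summary" "$summary"
```

With the separator inside run, get_summary would no longer need an echo -e "\n" after every call.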

Collaborator

@sthaha sthaha left a comment

Requesting changes only to fix the common scenario where the operator is installed from community-catalog and must-gather fails to run to completion.

Collaborator Author

vimalk78 commented Oct 5, 2023

script runs to completion

@sthaha sthaha merged commit 76592a8 into sustainable-computing-io:v1alpha1 Oct 5, 2023
3 participants