Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[techsupport] improve robustness #2117

Merged
merged 8 commits into from
Apr 18, 2022

Conversation

stepanblyschak
Copy link
Contributor

@stepanblyschak stepanblyschak commented Mar 26, 2022

Signed-off-by: Stepan Blyschak [email protected]

What I did

  1. Execute the command in a separate bash process under timeout. Due to with '--foreground' the child processes are not killed and still have the file descriptor opened to piped process which will basically hang forever.
    From the "timeout" man page:
       --foreground

              when not running timeout directly from a shell prompt,

              allow COMMAND to read from the TTY and get TTY signals; in
              this mode, children of COMMAND will not be timed out

So, a command like this:

timeout --foreground 5s some_process_that_spawn_childs_that_produce_output | tee some_file

Will actually hang forever. The '--foreground' option is still needed. So the proposed solution is to run that construct in another bash process:

timeout --foreground 5s bash -c "some_process_that_spawn_childs_that_produce_output | tee some_file"
  1. Moved hw-mgmt-dump to collect_mellanox function and execute it under timeout.
  2. Handle global timeout: When global timeout is reached the SIGTERM is sent to generate_dump process, upon SIGTERM the generate_dump process will finalize its work by creating a dump of what already has been collected, providing a dump even when timeout is reached. The generate_dump will be killed in case it cannot create dump after initial SIGTERM.

How I did it

I implemented the above mentioned features.

How to verify it

Artificially hang the FW, execute "show techsupport" and verify even in case timeout is reached the dump is generated.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
…o robust-techsupport

Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
Signed-off-by: Stepan Blyschak <[email protected]>
@liat-grozovik liat-grozovik requested a review from yxieca March 30, 2022 12:13
# run 'hw-management-generate-dump.sh' script and save the result file
HW_DUMP_FILE=/usr/bin/hw-management-generate-dump.sh
if [ -f "$HW_DUMP_FILE" ]; then
${CMD_PREFIX}${timeout_cmd} /usr/bin/hw-management-generate-dump.sh $ALLOW_PROCESS_STOP
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if timeout expired and file does not exist, we should cont but IMO should have an error message identify timeout expired and no hw mgmt file. .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Signed-off-by: Stepan Blyschak <[email protected]>
liat-grozovik
liat-grozovik previously approved these changes Mar 30, 2022
@liat-grozovik
Copy link
Collaborator

@yxieca , @qiluo-msft could you please help to signoff?

@liat-grozovik liat-grozovik merged commit 29771e7 into sonic-net:master Apr 18, 2022
liat-grozovik pushed a commit that referenced this pull request Apr 22, 2022
- What I did
Fixed error message: "Failed to get routing stack: a bytes-like object is required, not 'str'" introduced by #2117

- How I did it
Add missing 'text' parameter

- How to verify it
Run on switch

Signed-off-by: Stepan Blyschak <[email protected]>
judyjoseph pushed a commit that referenced this pull request May 2, 2022
- What I did
Execute the command in a separate bash process under timeout. Due to with '--foreground' the child processes are not killed and still have the file descriptor opened to piped process which will basically hang forever.
From the "timeout" man page:
       --foreground

              when not running timeout directly from a shell prompt,

              allow COMMAND to read from the TTY and get TTY signals; in
              this mode, children of COMMAND will not be timed out
So, a command like this:

timeout --foreground 5s some_process_that_spawn_childs_that_produce_output | tee some_file
Will actually hang forever. The '--foreground' option is still needed. So the proposed solution is to run that construct in another bash process:

timeout --foreground 5s bash -c "some_process_that_spawn_childs_that_produce_output | tee some_file"
Moved hw-mgmt-dump to collect_mellanox function and execute it under timeout.
Handle global timeout: When global timeout is reached the SIGTERM is sent to generate_dump process, upon SIGTERM the generate_dump process will finalize its work by creating a dump of what already has been collected, providing a dump even when timeout is reached. The generate_dump will be killed in case it cannot create dump after initial SIGTERM.

- How I did it
I implemented the above mentioned features.

- How to verify it
Artificially hang the FW, execute "show techsupport" and verify even in case timeout is reached the dump is generated.

Signed-off-by: Stepan Blyschak <[email protected]>
judyjoseph pushed a commit that referenced this pull request May 2, 2022
- What I did
Fixed error message: "Failed to get routing stack: a bytes-like object is required, not 'str'" introduced by #2117

- How I did it
Add missing 'text' parameter

- How to verify it
Run on switch

Signed-off-by: Stepan Blyschak <[email protected]>
@Staphylo
Copy link
Contributor

@stepanblyschak @liat-grozovik
Could you clarify the rationale for moving the hw-dump (/usr/bin/hw-management-generate-dump.sh) collection to be specific to mellanox products?
We found it to be really useful and ended up implementing this hook for our platforms.
We only realized now that this data is no longer collected, thus affecting support capabilities.
Would it be fine to move this collection back to common code or should we implement another hook in generate_dump for our products?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants