Watchguard #547

Merged
merged 6 commits into from
Jul 16, 2018

Conversation

aionJoey
Contributor

@aionJoey aionJoey commented Jul 11, 2018

Notice

It is not allowed to submit your PR to the master branch directly, please submit your PR to the master-pre-merge branch.

Description

Please include a brief summary of the change that this pull request proposes. Include any relevant motivation and context. List any dependencies required for this change.

  • Watchguard toggle through command line argument "watch" (first argument)
  • Terminates kernel when rebounce condition is detected (samples every 30 seconds)
  • Rebounce conditions:
    1. Kernel stuck for > 1 minute OR Kernel state is ZOMBIE/DEAD
    2. Thread stuck for > 1 minute AND Thread state is BLOCKED
  • Rebounce timer - prevents rebounce from occurring if last rebounce was within 5 minutes
  • Thread dump to threadDump.txt
  • Added invalid UUID check before kernel boot
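
The rebounce rules listed above can be read as a small decision function. The sketch below is illustrative only and is not the code from this PR: the function names `should_rebounce` and `rebounce_allowed`, their argument order, and the use of seconds as the unit are assumptions, and it assumes that either numbered condition alone is enough to trigger a rebounce. Only the thresholds (1 minute stuck, 5 minutes between rebounces) and the state names come from the description.

```shell
# Hypothetical sketch of the rebounce rules; not the code from aion.sh.
TOLERANCE=60   # seconds a kernel/thread may be stuck (the "> 1 minute" rule)
WAIT=300       # minimum seconds between rebounces (the 5 minute timer)

# should_rebounce KERNEL_STUCK_SECS KERNEL_STATE THREAD_STUCK_SECS THREAD_STATE
# Succeeds (exit status 0) when either rebounce condition holds.
should_rebounce() {
  # Condition 1: kernel stuck for > 1 minute OR kernel state is ZOMBIE/DEAD
  if [ "$1" -gt "$TOLERANCE" ]; then return 0; fi
  case "$2" in
    ZOMBIE|DEAD) return 0 ;;
  esac
  # Condition 2: thread stuck for > 1 minute AND thread state is BLOCKED
  if [ "$3" -gt "$TOLERANCE" ] && [ "$4" = "BLOCKED" ]; then return 0; fi
  return 1
}

# rebounce_allowed NOW LAST_REBOUNCE   (both in epoch seconds)
# Succeeds only if the last rebounce was at least WAIT seconds ago.
rebounce_allowed() {
  [ $(( $1 - $2 )) -ge "$WAIT" ]
}
```

A watch loop would call `should_rebounce` once per sample and act only when `rebounce_allowed` also succeeds, so a kernel that fails right after a restart is not bounced again immediately.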

Fixes Issue #526

Type of change

Insert x into the following checkboxes to confirm (eg. [x]):

  • Bug fix.
  • New feature.
  • Enhancement.
  • Unit test.
  • Breaking change (a fix or feature that causes existing functionality to not work as expected).
  • Requires documentation update.

Testing

Please describe the tests you used to validate this pull request. Provide any relevant details for test configurations as well as any instructions to reproduce these results.

  • Stress tested the watchguard (150+ rebounces)

  • Test configuration: wait=0, sample=1, tolerance=1, threadRate=1

  • Rebounce conditions

    1. Test configuration: sample=1, tolerance=1, threadRate=10 (kernel dead)
    2. Test configuration: sample=1, threadRate=1 (thread dead) + thread state check commented out
  • Kill conditions: Ctrl-C in kernel and command line (aion.sh)

Verification

Insert x into the following checkboxes to confirm (eg. [x]):

  • I have self-reviewed my own code and conformed to the style guidelines of this project.
  • New and existing tests pass locally with my changes.
  • I have added tests for my fix or feature.
  • I have made appropriate changes to the corresponding documentation.
  • My code generates no new warnings.
  • Any dependent changes have been made.

Collaborator

@AionJayT AionJayT left a comment

LGTE. However, could you add a non-WatchGuard mode to the script, i.e. an argument that skips the WatchGuard?
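
One possible shape for such a toggle, as a minimal sketch: the PR description says the watchguard is switched on by passing `watch` as the first argument, so a script could branch on that and start the kernel directly otherwise. The helper name `watchguard_enabled` and the `run_with_watchguard` / `run_kernel` placeholders are hypothetical and do not come from aion.sh.

```shell
# Hypothetical sketch: the first argument "watch" turns the watchguard on;
# anything else (or no argument) starts the kernel without it.
watchguard_enabled() {
  [ "${1:-}" = "watch" ]
}

# Example wiring (both launch commands are placeholders for the real ones):
# if watchguard_enabled "$@"; then
#   shift                      # drop "watch" so the kernel never sees it
#   run_with_watchguard "$@"
# else
#   run_kernel "$@"
# fi
```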

Contributor

@aion-kelvin aion-kelvin left a comment

Hey thanks for working on this ... looks good to me, have a few small suggestions

aion.sh Outdated
env EVMJIT="-cache=1" ./rt/bin/java -Xms4g \
-cp "./lib/*:./lib/libminiupnp/*:./mod/*" org.aion.Aion "$@"
#env EVMJIT="-cache=1" ./rt/bin/java -Xms4g \
# -cp "./lib/*:./lib/libminiupnp/*:./mod/*" org.aion.Aion "$@" &
Contributor

delete commented code if not needed

aion.sh Outdated
#wait=30
#sample=1
#tolerance=1
#threadRate=20
Contributor

delete commented stuff if not needed

Contributor Author

woops forgot to remove these from testing

aion.sh Outdated
# Interrupts the Aion kernel and awaits shutdown complete
if $running; then
kill $kPID
temp=$(top -n1 -p $kPID | egrep -o "$kPID")
Contributor

Minor suggestion: I would do something like "ps --pid $kPID" instead of "top -n1 -p $kPID" (and everywhere else you use top)

The top command prints a bunch of memory info and other stuff that might get accidentally picked up by the egrep.
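
As a rough illustration of this suggestion, a liveness check built on `ps` could look like the sketch below. The helper name `is_running` is hypothetical. It uses `ps -p PID -o pid=`, the POSIX spelling of the selector (`--pid` is the procps long form), so the only possible output is the PID itself and no `egrep` filtering is needed.

```shell
# is_running PID: succeeds if a process with that PID currently exists.
# Hypothetical helper, not taken from aion.sh.
is_running() {
  # -p selects by PID; "-o pid=" prints only the PID column with no header,
  # so the output is empty exactly when no such process is found.
  [ -n "$(ps -p "$1" -o pid= 2>/dev/null)" ]
}

# Example: is_running "$kPID" && echo "kernel still alive"
```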

aion.sh Outdated
fi

# Removes remnant processes accessing kernel logfile
if $logging; then
Contributor

What remnant processes are you expecting to find with this?

Contributor Author

I had to use this when I was trying to kill the kernel from the command line; kill only killed the script but not the kernel. The issue should be fixed now, so I'll remove this.

Contributor

@aion-kelvin aion-kelvin left a comment

thanks for the changes!

* 00000000-0000-0000-0000-000000000000
*/
String UUID = cfg.getId();
if (! UUID.matches("[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")) {
Contributor

minor typo: space


# Shutsdown Aion kernel
echo "## Killing Kernel ##"
kill $kPID
Contributor

Question: is it possible for this script to get stuck if the kernel does not respond to a kill (SIGINT?) request?

Is there anywhere that would escalate to a SIGKILL (for example, if the kernel is still alive after 1 minute)?

Contributor Author

I'll add a shutdown timer to force the script to kill the process.
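
A shutdown timer of this kind might be sketched as follows. This is not the code that was committed: the function name `stop_with_timeout`, the one-second polling interval, and the return codes are assumptions. Note that `kill` with no signal argument sends SIGTERM, and `kill -0` only tests whether the process can still be signalled.

```shell
# stop_with_timeout PID TIMEOUT_SECS
# Hypothetical helper: asks the process to stop, polls once per second,
# and escalates to SIGKILL if it is still alive after the timeout.
# Returns 0 on a clean stop, 1 if SIGKILL had to be used.
stop_with_timeout() {
  pid=$1
  remaining=$2
  kill "$pid" 2>/dev/null || return 0      # already gone, nothing to do
  while kill -0 "$pid" 2>/dev/null; do     # still signalable, so still present
    if [ "$remaining" -le 0 ]; then
      kill -KILL "$pid" 2>/dev/null        # cannot be caught or ignored
      return 1
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 0
}

# Example: stop_with_timeout "$kPID" 60 || echo "kernel had to be force-killed"
```

Because the loop is bounded by the timeout, the script cannot hang on a kernel that ignores the first signal.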

aionJoey added 2 commits July 16, 2018 10:03
(default) Wait = 300, Sample = 30, Tolerance = 60, ThreadRate = 2
@AionJayT AionJayT merged commit cebc7f7 into master-pre-merge Jul 16, 2018
@AionJayT AionJayT deleted the watchguard branch July 19, 2018 16:13
4 participants