
chore(regTests): print logs when regTests timeout #2031

Merged
11 commits merged into main on Oct 20, 2023

Conversation


@kostasrim kostasrim commented Oct 17, 2023

This PR:

  • add a python script to print the most recent log
  • if the CI job times out, print the most recent log
  • replace the global timeout with the timeout command
  • upload all logs on failure()
  • print the uid + port + log file names for each newly created df instance

The reason I replaced the global timeout with the timeout command is that cancelled() does not work when a job times out, and therefore it can't be used to print the log.

I also tried the pytest-timeout module, but that won't work either because a) it only has a global setting, which sets the per-test timeout, and b) it continues running the remaining tests.

One more thing to consider: because we now use a separate step, we no longer get nice colored output.

Another, more over-engineered but probably better solution would be to register a handler for the signal that timeout sends and have pytest handle it before it exits. That way, pytest itself could print the logs instead of the extra step (a rough sketch of the idea follows).
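
(For reference, a minimal sketch of that signal-based idea, not what this PR implements: the handler name is hypothetical, and it assumes the glog-style INFO logs end up under /tmp.)

# conftest.py -- hypothetical sketch of the signal-based alternative.
import glob
import os
import signal
import sys

def _dump_latest_log_and_exit(signum, frame):
    # `timeout` sends SIGTERM by default; print the newest INFO log before exiting.
    logs = sorted(glob.glob("/tmp/*INFO*"), key=os.path.getmtime)
    if logs:
        with open(logs[-1]) as f:
            print(f.read())
    sys.exit(1)

signal.signal(signal.SIGTERM, _dump_latest_log_and_exit)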

@kostasrim kostasrim self-assigned this Oct 17, 2023

kostasrim commented Oct 18, 2023

Also, printing only the last log might not be the best idea, since there are tests that create multiple dragonfly instances and each of those has its own log. Maybe it makes sense to print the last X logs, where X is the largest pool of DF instances we create in one go. Upon a timeout we know which test failed, so we know how many logs we need to read (obviously this will add a little bit of noise)

Another issue is the default fixture's log. It is created first, so it could be the case that we never print it. But the tests that deadlock/fail are usually the replication tests, which all create a new DF instance, so this approach should be quite accurate most of the time and we won't need to fall back to downloading the logs

I also added:

  1. Upload all logs
  2. Each time we create a new DF instance, we give it a UID and print its log file names and the port it listens on to the CI output. That way, when we download the logs we know which log belongs to which instance (a rough sketch of the idea follows).
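
(A minimal sketch of the UID idea; the class and attribute names below are illustrative, not the actual fixture code in this PR.)

import itertools

uid_gen = itertools.count()  # monotonically increasing UID, one per started instance

class DflyInstanceSketch:
    def __init__(self, port, log_dir="/tmp"):
        self.uid = next(uid_gen)
        self.port = port
        self.log_dir = log_dir

    def start(self):
        # Announce which logs belong to this instance so the uploaded
        # artifacts can be matched back to it later.
        print(f"Starting DF uid={self.uid} on port {self.port}; "
              f"logs under {self.log_dir} (dragonfly*INFO* files)")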

@kostasrim kostasrim force-pushed the print_logs_when_job_timeouts branch 3 times, most recently from 7d5772b to 9aca418 on October 18, 2023 11:55
@kostasrim kostasrim requested review from dranikpg and romange October 18, 2023 11:57
@kostasrim kostasrim changed the title from "(Do not review -- testing it) chore(regTests): print logs when regTests timeout" to "chore(regTests): print logs when regTests timeout" on Oct 18, 2023
@kostasrim kostasrim force-pushed the print_logs_when_job_timeouts branch 2 times, most recently from 579009c to c13fab2 on October 18, 2023 12:36
@kostasrim kostasrim force-pushed the print_logs_when_job_timeouts branch from c13fab2 to fa9670c on October 18, 2023 12:37
@kostasrim kostasrim commented on the workflow diff:

timeout 20m pytest -m "${{inputs.filter}}" --json-report --json-report-file=rep1_report.json dragonfly/replication_test.py --df alsologtostderr --df enable_multi_shard_sync=true || code=$?; if [[ $code -eq 124 ]]; then echo "TIMEDOUT=1">> "$GITHUB_OUTPUT"; exit 1; fi


timeout 20m pytest -m "${{inputs.filter}}" --json-report --json-report-file=rep2_report.json dragonfly/replication_test.py --df alsologtostderr --df enable_multi_shard_sync=false || code=$?; if [[ $code -eq 124 ]]; then echo "TIMEDOUT=1">> "$GITHUB_OUTPUT"; exit 1; fi

Do we need --df alsologtostderr now that we upload the logs on failures?

dranikpg previously approved these changes Oct 18, 2023

@dranikpg dranikpg left a comment

I'll approve because it works, but it can be made much simpler

Comment on lines 45 to 52
def monotonic_integer_generator():
i = 0
while i < 10000000000:
yield i
i += 1


uid_iterator = monotonic_integer_generator()

dranikpg:

uid_gen = itertools.count() 🙂


kostasrim:

removed!

Comment on lines 1 to 4
#!/usr/bin/env python3

"""Extract the most recent INFO log from a directory."""


dranikpg:

Eh... What about ls *.log -tr | head -n 1 <- most recently modified file? 😆
Or ls -a | grep log | sort | head -n 1 for sorting by name like your script


kostasrim:

I like the first one! Also, we don't need -r here, since -t already sorts newest first by default

Comment on lines 64 to 67
if [[ "${{ env.TIMEDOUT_STEP_1 }}" -eq 1 ]] || [[ "${{ env.TIMEDOUT_STEP_2 }}" -eq 1 ]]; then
echo "🪵🪵🪵🪵🪵🪵 Latest log before timeout 🪵🪵🪵🪵🪵🪵\n\n"
${GITHUB_WORKSPACE}/tools/extract_latest_log.py --path /tmp/ | xargs cat
echo "🪵🪵🪵🪵🪵🪵 Latest log before timeout end 🪵🪵🪵🪵🪵🪵\n\n"

dranikpg:

See my comment about using the shell instead; also, you don't need the xargs | cat, doesn't it already print it?

I see you're a big fan of over-engineering 🙂


kostasrim:

I could have just printed the file from the python script, but honestly I should have used bash itself -- really, the python script was overkill -- so I removed it 😄

Yes, I needed the xargs cat because otherwise only the filename would get printed and not its contents.

+1, it was over-engineered for no reason


dranikpg:

Ah yes, I forgot the script only prints the name 😅
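
(For reference, a shell-only version of that step could look roughly like this; the /tmp location and the *INFO* file-name pattern are assumptions based on the snippets in this thread, not the exact code that was merged.)

if [[ "${{ env.TIMEDOUT_STEP_1 }}" -eq 1 ]] || [[ "${{ env.TIMEDOUT_STEP_2 }}" -eq 1 ]]; then
  echo "Latest log before timeout"
  # Pick the most recently modified INFO log and print its contents directly,
  # no helper script or xargs needed.
  latest=$(ls -t /tmp/*INFO* 2>/dev/null | head -n 1)
  [[ -n "$latest" ]] && cat "$latest"
fi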

timeout 20m pytest -m "${{inputs.filter}}" --json-report --json-report-file=rep1_report.json dragonfly/replication_test.py --log-cli-level=INFO --df alsologtostderr --df enable_multi_shard_sync=true || code=$?; if [[ $code -eq 124 ]]; then echo "TIMEDOUT=1">> "$GITHUB_OUTPUT"; exit 1; fi


timeout 20m pytest -m "${{inputs.filter}}" --json-report --json-report-file=rep2_report.json dragonfly/replication_test.py --log-cli-level=INFO --df alsologtostderr --df enable_multi_shard_sync=false || code=$?; if [[ $code -eq 124 ]]; then echo "TIMEDOUT=1">> "$GITHUB_OUTPUT"; exit 1; fi

dranikpg:

Also, if you're touching it anyway, maybe we can unify these somehow... they don't differ except for the enable_multi_shard_sync option


kostasrim:

done :)
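
(One way the two invocations could be unified, sketched under the assumption that the surrounding step stays the same; the report file naming and the loop variable are illustrative, not necessarily what was merged.)

# Run the replication suite once per enable_multi_shard_sync value.
for mode in true false; do
  code=0
  timeout 20m pytest -m "${{inputs.filter}}" --json-report \
    --json-report-file=rep_${mode}_report.json dragonfly/replication_test.py \
    --log-cli-level=INFO --df alsologtostderr --df enable_multi_shard_sync=${mode} || code=$?
  if [[ $code -eq 124 ]]; then
    echo "TIMEDOUT=1" >> "$GITHUB_OUTPUT"
    exit 1
  fi
done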

@kostasrim kostasrim requested a review from dranikpg October 19, 2023 10:53

@dranikpg dranikpg left a comment


Now it's much simpler 😎

@kostasrim kostasrim merged commit 64841ef into main Oct 20, 2023
@kostasrim kostasrim deleted the print_logs_when_job_timeouts branch October 20, 2023 07:50