-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent hanging when sample collection time > interval #124
base: master
Are you sure you want to change the base?
Conversation
Records information about the frames captured in `_stackprof.frames_buffer`, | ||
up to `frame_count-1` (buffer may contain more frames from prior sample), | ||
captured `timestamp_delta` microseconds after previous sample. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This is based on my understanding of the code, please correct if I'm misunderstanding. I'm also not sure what the canonical comment format should be here.)
if (_stackprof.raw) { | ||
gettimeofday(&_stackprof.last_sample_at, NULL); | ||
} | ||
gettimeofday(&_stackprof.last_sample_at, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It seems simpler, and trivial, to assign this always, rather than track the downstream paths that do or don't need it.)
ext/stackprof/stackprof.c
Outdated
struct timeval sampling_start, sampling_finish; | ||
|
||
gettimeofday(&sampling_start, NULL); | ||
long int time_since_last_sample_usec = diff_timevals_usec(&_stackprof.last_sample_at, &sampling_start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Note, this replaces the timestamp math fix in #122.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of long int, include stdint.h and use int64_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ext/stackprof/stackprof.c
Outdated
return MICROSECONDS_IN_SECOND * diff->tv_sec + diff->tv_usec; | ||
} | ||
|
||
long int diff_timevals_usec(struct timeval *start, struct timeval *end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I adapted this from an algorithm at https://www.guyrutenberg.com/2007/09/22/profiling-code-using-clock_gettime/, and tested it with this repl: https://repl.it/@BenBuckman/SanityCheck-Timevals)
@tmm1 Thanks for your feedback so far. |
Thinking about it some more, another option would be to track and expose |
I see an integer stackprof/ext/stackprof/stackprof.c Line 655 in 1d5ae02
missed_samples the same as I'm proposing here?never mind, the search results missed where it's used, stackprof/ext/stackprof/stackprof.c Line 260 in 1d5ae02
It looks like |
@tmm1 || @tenderlove || other maintainers, do you have time to review this? |
causes segfaults in tests 😞 This reverts commit ac49055.
- in `fprintf`s - `_stackprof.interval` is an int - `timestamp_delta` as `int64_t` - but doing the same with `_stackprof.raw_timestamp_deltas` makes it segfault!
Edit: this is fixed, I needed to also change the |
it appears that at 1μs, performance hit of profiling causes test to take >10s to run, and time out. try at 10μs instead.
ext/stackprof/stackprof.c
Outdated
|
||
if (_stackprof.debug) { | ||
printf("started with interval %d (%ld sec %d usec)\n", | ||
NUM2INT(interval), timer.it_interval.tv_sec, timer.it_interval.tv_usec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This highlights the separate issue of #125)
if (raw) { | ||
gettimeofday(&_stackprof.last_sample_at, NULL); | ||
if (_stackprof.debug) { | ||
printf("started with interval %d (%ld sec %d usec)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This highlights the separate issue of #125)
This attempts to fix the bug described in #123, in which, when
interval
is faster than the time required to record and analyze stack frames, the job queue piles up faster than it's flushed and the program hangs.Pieces:
in_signal_handler
lock to the global_stackprof
struct, and adds a check, and escape hatch, instackprof_signal_handler
.timeval_to_usec
anddiff_timevals_usec
for working with time durations more easily aslong int
s of microseconds, rather thantimeval
structs.debug
mode with stdout+stderr output recording the internal timings and flow.More context on the underlying bug, and consideration of alternative approaches, is at #123.
Motivation:
We would like to use StackProf on a large-scale Ruby application in production. The inability to set an interval that is both fast enough to produce useful data, and guaranteed to never hang the program, is a blocker to doing so.
Caveats:
stackprof_signal_handler
does to the quality of the final results. (At the end of the day, it's a kind of user or program error if the interval is so fast that the escape hatch is necessary. The question is how to deal with it.)wall
mode; I have not yet analyzed how this bug or fix work withcpu
orobject
modes (though I suspect they're similar).cc @NickLaMuro, thank you for your engagement with #123!