Program hangs if sample-collection time is > interval #123
OK, it's fairly easy to repro this. This uses recursion (to a max depth) because that causes the stack size to keep growing, which causes each sample to take longer to capture:

```ruby
def recurse(num = 1, depth = 1)
  if depth == 10000
    num
  else
    recurse(num * 2, depth + 1)
  end
end
```

At a high/slow sampling rate, this is fine:

```ruby
StackProf.run(mode: :wall, interval: 1000) {recurse}
```

and it prints out some results. But then down to 1 microsecond, this consistently hangs:

```ruby
StackProf.run(mode: :wall, interval: 1) {recurse}
```

I assume the exact cutoff where this hangs will depend on context-specific factors like CPU.
@benbuckman So, looking at this per your request over in #122. My initial thought after looking through your write-up and suggestions again was "why don't we just use a lock, so that when multiple job handlers are being acted on at the same time, the handler just ejects early" – but it turns out that is already implemented:

stackprof/ext/stackprof/stackprof.c (lines 548 to 558 in b886c3c)

...or at least, I think it is. If that code is indeed not working as intended, do we know why? Is it because the code isn't working as I described, or is something else at play? Would a more proper mutex help here? I will also try reproducing using the method you described above.

Edit: Lack of C experience is showing here.

Edit 2: After re-reading your "working theory", everything above is pointless to mention, as the lock doesn't prevent work from being piled up right after the handler code and outside of the lock... as you had already described. That said, I think I am able to provide some evidence supporting your theory, so I will do so in my next comment.
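For anyone (like me) not fluent in the C side, the "eject early" idea being discussed looks roughly like the sketch below. The `collecting_sample` flag and function name are made up for illustration; the actual code at the lines linked above may differ.

```c
/* Hypothetical flag (not stackprof's actual variable): nonzero while the
 * postponed-job handler is already in the middle of collecting a sample. */
static volatile int collecting_sample = 0;

static void
job_handler_sketch(void *data)
{
    /* "Eject early": if another invocation is already collecting a sample,
     * bail out instead of doing the work again. */
    if (collecting_sample)
        return;
    collecting_sample = 1;

    /* ... walk the stack and record the sample here ... */

    collecting_sample = 0;
}
```

As Edit 2 points out, though, a guard like this only protects the handler body itself; it does nothing about the signals (and postponed jobs) that keep piling up outside of it while a slow sample is being recorded.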
Alright, so I think I have some evidence to support @benbuckman's claims that we are creating a situation where the "queue" piles up when sample collection takes longer than the interval.

Most of the info for this is in a gist I created: https://gist.github.com/NickLaMuro/3b54c57e1ffeb21316f38529fa93ea67

Effectively it is a modified form of the reproducer that @benbuckman posted earlier. What I did was basically wrap his reproducer in a timeout. In addition, when the process was killed, I was also keeping track of how large the collected data was. I was surprised at how quickly I got consistent results, but once I started seeing somewhat consistent results when the process timed out, I used a general average to set the timeout. I think what this shows, indirectly, with respect to your second "TODO" item:

That there is a pileup, and that work is getting completed, though it is hindered specifically by the extra work the sampling itself keeps adding. So after going through this exercise, I am curious if we should possibly go about this in a similar (though opposite) fashion as I mentioned in a comment on Ben's previous PR:

Obviously, in this case, we are going in the complete opposite direction, but at some point the same concern applies. That said, you mentioned that you had this issue in production. Are you using a more reasonable interval there?
Thank you @NickLaMuro! My C knowledge is very weak, but I was able to add a bunch of timestamp diffs to the code – specifically, in the sample-recording path.

With the slower interval, the run completes (and then it prints out results and exits nicely.)

With the 1-microsecond interval it did not (it was going on forever, I had to sigterm the process.) My interpretation:
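A minimal sketch of that kind of timestamp-diff instrumentation, for readers following along – the `interval_usec` parameter and function names are placeholders, not the actual debugging code that was added:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* Current wall-clock time in microseconds. */
static int64_t
now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Time one sample capture and complain when it took at least as long as the
 * configured sampling interval (also in microseconds). */
static void
record_sample_timed(int64_t interval_usec)
{
    int64_t start = now_usec();

    /* ... the real sample-recording work would go here ... */

    int64_t elapsed = now_usec() - start;
    if (elapsed >= interval_usec)
        fprintf(stderr, "sample took %lld us (interval is %lld us)\n",
                (long long)elapsed, (long long)interval_usec);
}
```

If the working theory is right, with `interval: 1000` the elapsed time should stay under the interval, while with `interval: 1` it should exceed it almost immediately.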
One possible solution, which seems like it might work, is simply moving that call (see the WIP-PR linked below). I'm not sure what this would mean for the final results, in terms of lost frames; I'll have to think about that tomorrow. I'll put up a POC-PR with what I've got so far, shortly.

Edit: WIP-PR is at #124
One other note, re the interval – one challenge here is that the safe interval is not constant. It depends on the size of the stack at any given moment, whether the same frame is repeating, and so on.
Hi, I'll take a look at this too. Of course any solution to this will result in dropped samples. Is there a reason you need to make the sampling frequency so high? Or is that just to demonstrate the bug? I'm curious because I've never run into this, and I use this tool very frequently.
Hey all, we're still running into this with Ruby 3 despite #150. In our case we're profiling in a separate thread using something like this, which never returns since the thread is never given the opportunity to wake up:

```ruby
profile = StackProf.run(mode: :cpu, raw: true, interval: 1000) do
  sleep(duration)
end
```

I'm not sure exactly why the sampling takes so long in our situation, but I've managed to repro this by forcing sample collection to be slow:

```diff
index 98d4857..5f0f462 100644
--- i/ext/stackprof/stackprof.c
+++ w/ext/stackprof/stackprof.c
@@ -578,6 +578,9 @@ stackprof_record_sample_for_stack(int num, int64_t timestamp_delta)
 void
 stackprof_record_sample()
 {
+    volatile unsigned long long i;
+    for (i = 0; i < 1000000000ULL; ++i);
+
     int64_t timestamp_delta = 0;
     int num;
     if (_stackprof.raw) {
```

My hypothesis is that when the handler takes longer than the SIGPROF timer, by the time a SIGPROF handler is done, there is already another pending SIGPROF that preempts everything else. It's really the kernel/signal system that's doing the "queuing", since signals that happen during handling don't interrupt immediately, but rather are blocked and delivered at the next opportunity. Unfortunately when a process enters this state, it's impossible to recover, since the signal handling code always preempts other work (and seems to be holding the GVL).
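If that hypothesis is right, one way to keep pending SIGPROFs from stacking up would be to disarm the interval timer while a sample is being recorded and re-arm it afterwards. This is only a sketch of the idea, not stackprof's code; the helper names are invented, and the use of `ITIMER_PROF`/`SIGPROF` assumes `mode: :cpu` (wall mode would use a different timer and signal).

```c
#include <string.h>
#include <sys/time.h>

/* Saved timer settings so the profiling timer can be restored after the
 * sample has been recorded. */
static struct itimerval saved_timer;

/* Disarm the profiling timer (a zeroed it_value stops it) and remember the
 * previous settings. */
static void
pause_sampling_timer(void)
{
    struct itimerval off;
    memset(&off, 0, sizeof(off));
    setitimer(ITIMER_PROF, &off, &saved_timer);
}

/* Re-arm the profiling timer with the settings saved above. */
static void
resume_sampling_timer(void)
{
    setitimer(ITIMER_PROF, &saved_timer, NULL);
}

/* Usage sketch around the sample-recording path:
 *
 *     pause_sampling_timer();
 *     // ... walk the stack, record frames ...
 *     resume_sampling_timer();
 */
```

The trade-off is that time spent recording a sample is simply not profiled, so a slow sample shows up as dropped ticks rather than as a wedged process.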
Context
We are trying to enable StackProf in production on a REST API service. We would like to use `mode: :wall, raw: true, aggregate: true`, but it has been challenging to calibrate the `interval`. The segfault bug described in #81 makes very low/fast sampling rates concerning (especially because our hypothesis about the causes of those segfaults, which we have not verified, centered on overlapping signal handlers). But at an interval of (e.g.) 20ms the data is not particularly useful.

Observed behavior
At lower intervals, however, we encounter a separate bug in which the program doesn't crash, but rather it hangs – the CPU is pegged to 100% and the program being profiled does not advance. This specifically seems to occur when the time to capture a single sample is greater than the interval between samples. (The time it takes to capture a sample fluctuates, I assume, based on the size of the stack, CPU load, etc, so it could take a bunch of samples successfully and then encounter this race.) What I suspect is occurring – and I am still in the process of debugging and verifying the details – is that it starts piling up jobs with `rb_postponed_job_register_one`, and by the time it finishes one sample, it's started another (and so on forever), blocking any other instructions.
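One way to picture the suspected failure mode (and a possible mitigation at the registration site) is sketched below. The `sample_pending` flag and handler names are hypothetical; this is not stackprof's actual code, just an illustration of skipping registration while the previous sample is still unfinished.

```c
#include <ruby.h>
#include <ruby/debug.h>   /* rb_postponed_job_register_one() */
#include <signal.h>

/* Hypothetical flag: set when a sample job has been queued, cleared only
 * once that job has finished recording its sample. */
static volatile sig_atomic_t sample_pending = 0;

static void
sample_job_sketch(void *data)
{
    /* ... walk the stack and record the sample ... */
    sample_pending = 0;     /* the next timer tick may queue work again */
}

static void
timer_signal_sketch(int sig, siginfo_t *sinfo, void *ucontext)
{
    /* If the previous sample hasn't finished yet, drop this tick instead of
     * queueing more work behind it. */
    if (sample_pending)
        return;
    sample_pending = 1;
    rb_postponed_job_register_one(0, sample_job_sketch, (void *)0);
}
```

Dropping ticks this way loses samples, but it keeps the program making forward progress instead of spending every available cycle inside the profiler.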
Proposal

I think some code needs to be added, to `stackprof_record_sample` or `stackprof_record_sample_for_stack`, measuring the time it took to capture the sample. If that time is `>= interval`, it needs to handle that somehow. I can think of a few options at this point; one is to stop profiling and then, when `StackProf.results` is called, raise an exception that makes it clear that the profiling was stopped due to this situation. (We don't want to raise mid-profiling because that could bubble up anywhere in the program. This seems the easiest of the 3.)
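As a rough illustration of the "stop profiling and surface it later" option – the names (`profiling_truncated`, `interval_usec`) are hypothetical, `ITIMER_PROF` is an assumption (wall mode would use a different timer), and this is the shape of the check rather than a patch:

```c
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical flag: set when a sample took >= the interval, so a later
 * results call could raise/report instead of the process hanging. */
static int profiling_truncated = 0;

static int64_t
usec_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

static void
record_sample_checked(int64_t interval_usec)
{
    int64_t start = usec_now();

    /* ... capture the sample ... */

    if (usec_now() - start >= interval_usec) {
        /* Too slow: disarm the timer so no further samples are requested,
         * and remember why, so the results call can report it. */
        struct itimerval off;
        memset(&off, 0, sizeof(off));
        setitimer(ITIMER_PROF, &off, NULL);
        profiling_truncated = 1;
    }
}
```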
Next steps for me

Verify the details of what's happening, put together a fix around `start` or `run`, and confirm that it fixes the bug. (I'll update this issue as I do these.)
In the meantime, I'd be interested in anyone else's thoughts – have you also encountered this, do you know why it might be occurring, and what fix do you propose?
Thank you!