721 Trace flush bug fixes #722

lifflander · 2020-03-04T22:41:38Z

Fixes #721

codecov · 2020-03-04T23:44:49Z

Codecov Report

Merging #722 into develop will decrease coverage by 0.00%.
The diff coverage is 50.00%.

@@             Coverage Diff             @@
##           develop     #722      +/-   ##
===========================================
- Coverage    79.17%   79.17%   -0.01%     
===========================================
  Files          333      333              
  Lines        10162    10164       +2     
===========================================
+ Hits          8046     8047       +1     
- Misses        2116     2117       +1

Impacted Files	Coverage Δ
src/vt/messaging/active.h	`100.00% <ø> (ø)`
src/vt/vrt/collection/balance/lb_invoke/invoke.h	`53.84% <50.00%> (-0.70%)`	⬇️

src/vt/trace/trace.cc

src/vt/vrt/collection/balance/lb_invoke/invoke.cc

pnstickne · 2020-03-05T04:02:36Z

src/vt/trace/trace.h

@@ -219,7 +219,7 @@ struct Trace {
    vt_gzFile* file, NodeType const node, double const start
  );

-  void writeTracesFile(int flush);
+  void writeTracesFile(int flush, bool is_incremental_flush);


is_incremental sufficient intent.

pnstickne · 2020-03-05T04:03:28Z

src/vt/trace/trace.cc

+
+    if (is_incremental_flush) {
+      vt::trace::TraceScopedEvent scope(flush_event);
+      outputTraces(


Duplicate feels odd. Maybe always have an event? (incremental or full).

I agree the duplicate feels odd. However, I can't have an event when doing the final (non-incremental) flush because it will add an event after the trace gets finalized.

Can a TraceScopedEvent be a 'no-op'? Might not have that support.

vt::trace::TraceScopedEvent scope(is_incremental_flush ? flush_event_ : no_op_event_);

No, that is isn't supported. I could add that support

pnstickne · 2020-03-05T04:06:15Z

src/vt/trace/trace.cc

@@ -755,13 +759,6 @@ TraceEventIDType Trace::logEvent(LogType&& log) {
  TraceEventIDType event = log.event;
  traces_.push(std::move(log));

-  // If auto-flush, can flush immediately.


Implies tracing only flushed in during phase shift.

Seems fine, but this might be the last usage of vt_trace_flush_size other than as a flag detached from meaning.

As far as trace flushing, perhaps on BeginIdle might be a good location to always flush every trace that is present. If that was a reasonable place, it might be possible to eliminate the flush_size setting entirely, eg.

Unfortunately, BeginIdle is not a good location for flushing. For flushing to not impact performance, it must be synchronized across nodes. A short idle period will not cover the latency/bandwidth to disk.

On another note, I was thinking it could spread cost and possibly be more efficient overall of writing the traces directly to a byte-buffer (snprintf, eg) and then in the flush simply sending the prewritten buffers to gzip.

CPU time of printf is spread to trace sites; could explore alternative formatters (many options: https://stackoverflow.com/a/50260393/2864740) as such separate from zlibs formatting support. That said, not sure if the printf-family can be beat for simple non-formatted integers writing to some hunk of chars..

No need to keep around traces_ structure (would keep around around a deque of char[1MB] or whatever instead)

No need to copy the trace log events into said structure either.

Implicitly forms 'slab' allocations, with re-usable pool

Allows 'flush by size', if such was a request. (However, if flushing is only and always+completely done during the LB then this does not apply.)

Flush sites effectively limited to writing stream to gzip+IO (which has the cost of zipping, without the cost of formatting).

Cons:

Time of format may be more than time-to-queue, fsvo.

Write to buffer, write buffer to stream

Lose semantic of traces-to-write

https://www.boost.org/doc/libs/1_52_0/libs/spirit/doc/html/spirit/karma/performance_measurements/numeric_performance/int_performance.html -- very slight lead in some cases? :D (Actually, that compares itoa, not even printf.. and the numbers are older.)

I have an idea. Let's add an opt-in flag for flush at idle. I don't think in general we can tolerate that many writes to disk that often in performance-critical settings.

pnstickne · 2020-03-05T04:11:45Z

src/vt/trace/trace.cc

@@ -653,8 +657,8 @@ void Trace::setTraceEnabledCurrentPhase(PhaseType cur_phase) {

      // Go ahead and perform a trace flush when tracing is disabled (and was


Not in this change, but I still feel something is off with the enable/disable transition above. At the very least the "..we are disabling traces.." is a misleading statement when tracing is disabled -> enabled.

I was thinking about the problem and I think it might be good to, when tracing is disabled, still allow end processing events. Then the open events are not unwound and is guaranteed free from corruption.

Here is a case where stack imbalance might be able to occur:

event = theTrace()->beginProcessing(..); setTraceEnabledCurrentPhase(); // manipulates stack here theTrace->endProcessing(event); // no matching event to close = assert failure

The standard enable/disable are broken for the same manner.

bool isTraceEnabled(eventType) -> EndProcessing/EndIdle are always allowed if tracing is enabled via node/configuration. Other event types are also subject to the dynamic settings.

I sort of understand your point, but not completely. There isn't a bug because in the disable -> enable transition the event returned from beginProcessing is essentially nil. In the enable -> disable transition, Trace::setTraceEnabledCurrentPhase will actually forcibly unwind the stack and close all events at that point before it's disabled.

Scenario:

Given tracing is enabled.

A begin processing event is written, and the processing tag returned. This is done at the start of a nested scheduler loop in Tracing bug: begin/end idle events mixed with begin/end processing #703, and from some other message processing locations.

Somewhere after this call and before the paired end processing call, possibly on a nested stack frame, tracing is disabled (via normal disable, or via phase) when the previous processing event has not been end'ed by the taker of the tag.

The code unwinds the open event stack.

Later one the stack unwinds and the code that called begin processing calls end processing with the tag it acquired - except this tag is no longer valid.

So if there is ever a way that this could could run as part of a nested scheduler, or as some handler either now (or in the future) added begin end processing tags, surrounding the operation, then kapow.

As far as the comment:

if (trace_enabled_cur_phase_ != ret) { if (tracingWasEnabled) { // enable -> disabled // Close and pop everything, we are disabling traces at this point

Ah, I get the problem now. This is a definitely a real problem interacting with #703

pnstickne

I would check interactions with #703 which may inadvertently affect the open event unwinding. (I plan to do so later.)

Even when unwinding, I would ensure to only do so for the enabled -> disabled case (even if the stack is presumed to be empty going in the other direction) for clarity.

I can update #703 to ensure that close processing/idle events are allowed (trace statically enabled, dynamically disabled), which would remove the extra stack unwinding entirely with the very minor caveat that an open event, if any, wouldn't actually be written to the trace until the subsequent flush.

lifflander · 2020-03-06T06:34:30Z

I would check interactions with #703 which may inadvertently affect the open event unwinding. (I plan to do so later.)

Even when unwinding, I would ensure to only do so for the enabled -> disabled case (even if the stack is presumed to be empty going in the other direction) for clarity.

I can update #703 to ensure that close processing/idle events are allowed (trace statically enabled, dynamically disabled), which would remove the extra stack unwinding entirely with the very minor caveat that an open event, if any, wouldn't actually be written to the trace until the subsequent flush.

So will you address this in #703? Should I merge this?

pnstickne · 2020-03-06T09:04:57Z

@lifflander Changes are on #703 to beta.6 PR now. I need to git (get?!) those changes back on another branch to merge down.. haven't tried with the phase enable/disable, but the test scenario with broadcasts + barriers + random trace starts+stops appears successful.

pnstickne · 2020-03-16T04:04:32Z

Rebased onto current origin/develop and force pushed. Slight method reorderings.

lifflander · 2020-03-27T00:00:32Z

@pnstickne This is ready to go once it is rebased?

lifflander requested review from PhilMiller, pnstickne, nlslatt and uhetmaniuk March 4, 2020 22:41

pnstickne reviewed Mar 5, 2020

View reviewed changes

src/vt/trace/trace.cc Outdated Show resolved Hide resolved

pnstickne reviewed Mar 5, 2020

View reviewed changes

src/vt/vrt/collection/balance/lb_invoke/invoke.cc Outdated Show resolved Hide resolved

pnstickne reviewed Mar 5, 2020

View reviewed changes

pnstickne approved these changes Mar 5, 2020

View reviewed changes

lifflander force-pushed the 721-trace-flush-bug-fixes branch from d01ab3c to e62faeb Compare March 6, 2020 19:02

pnstickne force-pushed the 721-trace-flush-bug-fixes branch 2 times, most recently from 0e6794f to c31959d Compare March 16, 2020 04:03

pnstickne added the 1.0.0-beta.6 label Mar 16, 2020

pnstickne force-pushed the 721-trace-flush-bug-fixes branch from 76318a8 to c31959d Compare March 27, 2020 06:23

lifflander added 9 commits March 31, 2020 16:09

#721: trace: move flush down LB call stack

f3ef0fa

#721: trace: remove potential flush every logEevnt

2d1b0a6

#721: trace: add print on node 0 during flush

1e375cb

#721: trace: add use event for flushing

84bc7c8

#721: trace: add potential missing header file

8b3dd56

#721: trace: do not add barrier at LB (already one exists)

3900bbb

#721: trace: flush at trace disable boundary even when < limit

daaeb2b

#721: trace: flush_event -> flush_event_

b99db4c

#721: trace: split LB call to flush into new method

17826cb

lifflander force-pushed the 721-trace-flush-bug-fixes branch from c31959d to 17826cb Compare March 31, 2020 23:11

lifflander added 2 commits March 31, 2020 17:38

#721: trace: add new sentinel val for trace::UserEventIDType

5375be0

#721: trace: cleanup duplicated code

6e0935c

lifflander merged commit b44d3c7 into develop Apr 1, 2020

cz4rs mentioned this pull request Jun 15, 2021

move vt version number into a separate file #1467

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

721 Trace flush bug fixes #722

721 Trace flush bug fixes #722

lifflander commented Mar 4, 2020

codecov bot commented Mar 4, 2020 •

edited

Loading

pnstickne Mar 5, 2020

pnstickne Mar 5, 2020

lifflander Mar 5, 2020

pnstickne Mar 5, 2020 •

edited

Loading

lifflander Mar 6, 2020

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

lifflander Mar 5, 2020

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

lifflander Mar 6, 2020

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

lifflander Mar 5, 2020

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

lifflander Mar 6, 2020

pnstickne left a comment •

edited

Loading

lifflander commented Mar 6, 2020

pnstickne commented Mar 6, 2020 •

edited

Loading

pnstickne commented Mar 16, 2020 •

edited

Loading

lifflander commented Mar 27, 2020

		@@ -653,8 +657,8 @@ void Trace::setTraceEnabledCurrentPhase(PhaseType cur_phase) {

		// Go ahead and perform a trace flush when tracing is disabled (and was

721 Trace flush bug fixes #722

721 Trace flush bug fixes #722

Conversation

lifflander commented Mar 4, 2020

codecov bot commented Mar 4, 2020 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

pnstickne Mar 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pnstickne left a comment • edited Loading

Choose a reason for hiding this comment

lifflander commented Mar 6, 2020

pnstickne commented Mar 6, 2020 • edited Loading

pnstickne commented Mar 16, 2020 • edited Loading

lifflander commented Mar 27, 2020

codecov bot commented Mar 4, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne Mar 5, 2020 •

edited

Loading

pnstickne left a comment •

edited

Loading

pnstickne commented Mar 6, 2020 •

edited

Loading

pnstickne commented Mar 16, 2020 •

edited

Loading