Remove atomic ops in APM as updates and calculations are serialized. #753

keshonok · 2017-06-18T19:20:01Z

No description provided.

With the change of the concept atomic ops had became unnecessary. Please see the comment here: #712 (comment)

Make sure the function is not called when r equals zero (the first range the left boundary of which is fixed at 1ms).

After the counters are coalesced to the left half of the range, zero out the right half of the range when the range is extended.

The left boundary of a range expands using a special algorithm. When RTT value is big enough (but still within the range of unsigned short), the next value of the left boundary may exceed both the value of RTT and the maximum value of unsigned short. In that case a wraparound occurs which leads to an endless loop.

The percentiles were calculated only if there were updates of stats data. That lead to situations where percentiles were "stuck" at some previous values even though the time frame has moved forward.

TfwPcntCtl{} has supported RTT values up to about 65 seconds. It's possible that higher values may be seen in the wild. This patch makes higher values acceptable.

krizhanovsky

Seems well done, there are mostly questions to discuss in the chat and small adjustments.

krizhanovsky · 2017-06-22T22:55:33Z

tempesta_fw/apm.c

+			rng->cnt[TFW_STATS_RLAST][2 * i]
+			+ rng->cnt[TFW_STATS_RLAST][2 * i + 1];
+	for (i = TFW_STATS_BCKTS / 2; i < TFW_STATS_BCKTS; ++i)
+		rng->cnt[TFW_STATS_RLAST][i] = 0;


Good fix.

If cnt isn't atomic, then why not to use memset()?

Please update t/unit/user_space/percentiles.c. The reason why I ask you to update the test, is that the change is not obvious (I remember that I thought about zeroing), so it'd much easier to see a test which shows the need of zeroing.

One more question, also good to be covered by a test, is that in the loop above we increment order of the range, i.e. we can grow it 4, 8 or more times, but we coalesce only halves of the reange, which is correct for order 2 growing only. Seems like a bug, right?

krizhanovsky · 2017-06-23T13:40:14Z

tempesta_fw/apm.c

+		*r_time = end = prend;
+		if (end == pc->end)
+			return;
+	}


Why do we need to chop r_time value? Now end is 32-bit unsigned integer and can handle large values. Meantime, in fact, we can easily face real life scenario with response time larger than 65 seconds. Moreover, it'd be good to understand why do we have so long response time in our simple wrk workload from #752. I think it's due to too aggressive and unlimited server pipelining, but maybe I'm wrong.

krizhanovsky · 2017-06-23T13:45:18Z

tempesta_fw/apm.c

+	unsigned long prend, cnt = 0, sum = 0, max = 0, i_max = 0;
+
+	if (unlikely(r == 0))
+		return;


It's never called with r = 0, so there could be BUG_ON() or just remove the check.

As the value of 'order' increases, the range may grow by 2, 4, 8, etc. times. Yet just a half of the range was coalesced. Also, this patch removes the code that was intended to deal with an overflow of TfwPcntCtl{}->end. That's unnecessary now that it is of type unsigned int.

Remove atomic ops. Fix all bugs that have been found so far.

krizhanovsky · 2017-06-30T07:59:17Z

Good to merge

keshonok added 7 commits June 18, 2017 14:02

Remove atomic ops used in stats data storage and manipulations.

f1c7008

With the change of the concept atomic ops had became unnecessary. Please see the comment here: #712 (comment)

Additional cleanup due to serialization of updates and calculations.

a9310d9

Small optimization in tfw_apm_adjust().

870f051

Make sure the function is not called when r equals zero (the first range the left boundary of which is fixed at 1ms).

Small fix in tfw_stats_extend().

689a4ee

After the counters are coalesced to the left half of the range, zero out the right half of the range when the range is extended.

Always calculate new percentiles in tfw_apm_prcntl_tmfn().

1d4b916

The percentiles were calculated only if there were updates of stats data. That lead to situations where percentiles were "stuck" at some previous values even though the time frame has moved forward.

Expand TfwPcntCtl{} members in APM to unsigned int type.

1aacac7

TfwPcntCtl{} has supported RTT values up to about 65 seconds. It's possible that higher values may be seen in the wild. This patch makes higher values acceptable.

krizhanovsky requested changes Jun 23, 2017

View reviewed changes

keshonok added 2 commits June 28, 2017 23:50

Align the reference implementation with the current code in APM.

78253e8

Remove atomic ops. Fix all bugs that have been found so far.

keshonok force-pushed the ab-nonatomic-apm branch from 8bf8f3e to 78253e8 Compare June 28, 2017 21:46

keshonok merged commit d791708 into master Jun 30, 2017

keshonok deleted the ab-nonatomic-apm branch June 30, 2017 09:47

keshonok mentioned this pull request Jun 30, 2017

Crash in APM #752

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove atomic ops in APM as updates and calculations are serialized. #753

Remove atomic ops in APM as updates and calculations are serialized. #753

keshonok commented Jun 18, 2017

krizhanovsky left a comment

krizhanovsky Jun 22, 2017

krizhanovsky Jun 23, 2017

krizhanovsky Jun 23, 2017

krizhanovsky commented Jun 30, 2017

Remove atomic ops in APM as updates and calculations are serialized. #753

Remove atomic ops in APM as updates and calculations are serialized. #753

Conversation

keshonok commented Jun 18, 2017

krizhanovsky left a comment

Choose a reason for hiding this comment

krizhanovsky Jun 22, 2017

Choose a reason for hiding this comment

krizhanovsky Jun 23, 2017

Choose a reason for hiding this comment

krizhanovsky Jun 23, 2017

Choose a reason for hiding this comment

krizhanovsky commented Jun 30, 2017