tools: pingpong total latency #4757
Conversation
case st := <-pps.sentTxid:
	if len(txidList) < txidLatencySampleSize {
		index := len(txidList)
		txidList = append(txidList, st.txid)
		byTxid[st.txid] = txidSendTimeIndexed{
			st,
			index,
		}
	} else {
		// random replacement
		evict := rand.Intn(len(txidList))
		delete(byTxid, txidList[evict])
		txidList[evict] = st.txid
		byTxid[st.txid] = txidSendTimeIndexed{
			st,
			evict,
		}
	}
I really like this random replacement scheme; just thinking out loud -- if your sample size is smaller than the number of data points, why not just do a circular buffer? Advantages I see would be that the datapoints would still be well-ordered and you would not be missing any data for the range of time the sample was collected. The way it works now makes it so that the most recent datapoints are most likely to be included and the least recent datapoints are least likely to be included, which would also be the case with a circular buffer.
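For illustration, a minimal sketch of that circular-buffer alternative; txidRing, newTxidRing, and add are hypothetical names, not code from the PR:

// Hypothetical sketch of the circular-buffer alternative suggested above.
// The ring keeps the most recent sampleSize txids, overwriting the oldest
// entry in arrival order, so the retained window is contiguous in time.
type txidRing struct {
	txids []string
	next  int
	full  bool
}

func newTxidRing(sampleSize int) *txidRing {
	return &txidRing{txids: make([]string, sampleSize)}
}

// add stores txid and reports which txid was evicted, if any, so a caller
// could keep a byTxid map in sync the way the PR code does.
func (r *txidRing) add(txid string) (evicted string, ok bool) {
	if r.full {
		evicted, ok = r.txids[r.next], true
	}
	r.txids[r.next] = txid
	r.next = (r.next + 1) % len(r.txids)
	if r.next == 0 {
		r.full = true
	}
	return evicted, ok
}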
if the rate is larger than the buffer, then a circular buffer could lose almost all of the data. With a buffer of 10_000 but 26_000 transactions in a block, it would only know about the most recent transactions and only measure their latency. Better to measure over a longer duration.
sorry why not make it 26000 then?
Old habit from working in RAM-scarce environments. And to make up some more justification: maybe I don't even want to log all of the txns, but just a sample, because we also don't need to process a full 6000 TPS of this data.
shared/pingpong/pingpong.go
Outdated
func (pps *WorkerState) txidLatencyBlockWaiter(ctx context.Context, ac *libgoal.Client) {
	done := ctx.Done()
restart:
	// I wish Go had macros
// something something vim your way to Go Macros :)
shared/pingpong/pingpong.go
Outdated
	fmt.Fprintf(os.Stderr, "block waiter w: %v", err)
	time.Sleep(5 * time.Second)
	goto restart
Looks like this loop feeds blocks to the latencyBlocks handling, which in turn calls time.Now to figure out the latency from the recorded time to block time.
But, looking at this bit here, is it possible that this loop will be sleeping when goal publishes a new block for consumption? If that happens, the time.Now used for the calculation would contain that delay, right?
Since this is error handling, I suspect that we don't really expect small temporary errors like that, but wanted to check anyhow.
reduced the err restart time to 1 second; maybe the error condition will go away and we'll restart the API calls and not oversleep the round change.
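A minimal sketch of that retry shape after the change; blockWaiterLoop, waitForBlock, and handleBlock are placeholder names, not the PR's actual API:

// Hypothetical sketch of the block-waiter retry loop after the change.
// On error we back off for only 1 second, so a transient failure is less
// likely to make the waiter oversleep a round change and inflate the
// measured latency.
func blockWaiterLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
		}
		block, err := waitForBlock(ctx) // placeholder for the real API call
		if err != nil {
			fmt.Fprintf(os.Stderr, "block waiter: %v", err)
			time.Sleep(1 * time.Second) // was 5 * time.Second
			continue
		}
		handleBlock(block) // placeholder: feeds the latencyBlocks handling
	}
}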
Codecov Report
@@ Coverage Diff @@
## master #4757 +/- ##
==========================================
- Coverage 53.63% 52.88% -0.76%
==========================================
Files 432 432
Lines 54058 54166 +108
==========================================
- Hits 28996 28647 -349
- Misses 22813 23243 +430
- Partials 2249 2276 +27
--aout append path
after a bunch of updates I think this is again ready for review and it would be good to get this extra measurement into any new tests
out := pps.latencyOuts[len(pps.latencyOuts)-1]
for {
	select {
	case st := <-pps.sentTxid:
true sampling should be done on the pps.sentTxid writer side. Otherwise 10k samples will be fully overwritten in a few rounds under full TPS.
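For comparison, a minimal sketch of uniform reservoir sampling (Algorithm R) done on the writer side; txidReservoir and offer are hypothetical names, and this assumes math/rand is imported -- it is not code from the PR:

// Hypothetical sketch of uniform reservoir sampling (Algorithm R).
// Every txid offered so far stays in the sample with equal probability
// sampleSize/seen, so the sample is not biased toward recent rounds.
type txidReservoir struct {
	sample []string
	seen   int
}

func (r *txidReservoir) offer(txid string, sampleSize int) {
	r.seen++
	if len(r.sample) < sampleSize {
		r.sample = append(r.sample, txid)
		return
	}
	// keep the new txid with probability sampleSize/seen, replacing a
	// uniformly chosen existing slot
	if j := rand.Intn(r.seen); j < sampleSize {
		r.sample[j] = txid
	}
}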
circleci is dumb
circleci is still dumb
Summary
Measure the total latency of a transaction: from the moment the txn send API returns to the moment we see the txn in a committed block.
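Conceptually the measurement reduces to the following sketch; recordLatencies is a hypothetical helper, not the PR's actual code:

// Hypothetical sketch of the measurement described above: sendTime holds
// time.Now() captured when the send API returned for each txid; when a
// committed block arrives, the elapsed time for any txid found in it is
// that transaction's total latency.
func recordLatencies(sendTime map[string]time.Time, blockTxids []string) {
	for _, txid := range blockTxids {
		if t0, ok := sendTime[txid]; ok {
			fmt.Printf("%s total latency: %v\n", txid, time.Since(t0))
			delete(sendTime, txid)
		}
	}
}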
Test Plan
This is a test. Run on a local private cluster and maybe an AWS test cluster.