
tools: pingpong total latency #4757

Merged: 23 commits, Jan 19, 2023
Conversation

brianolson (Contributor):

Summary

Measure the total latency of a transaction: from the moment the txn send API returns to the moment we see the txn in a committed block.

Test Plan

This is a test; run on a local private cluster and maybe the AWS test cluster.

Comment on lines +1358 to +1375
case st := <-pps.sentTxid:
	if len(txidList) < txidLatencySampleSize {
		index := len(txidList)
		txidList = append(txidList, st.txid)
		byTxid[st.txid] = txidSendTimeIndexed{
			st,
			index,
		}
	} else {
		// random replacement
		evict := rand.Intn(len(txidList))
		delete(byTxid, txidList[evict])
		txidList[evict] = st.txid
		byTxid[st.txid] = txidSendTimeIndexed{
			st,
			evict,
		}
	}
Reviewer (Contributor):
I really like this random replacement scheme; just thinking out loud: if your sample size is smaller than the number of data points, why not just do a circular buffer? Advantages I see: the datapoints would still be well-ordered, and you would not be missing any data for the range of time the sample was collected. The way it works now, the most recent datapoints are most likely to be included and the least recent are least likely, which would also be the case with a circular buffer.

brianolson (Contributor, Author):

If the rate is larger than the buffer, then a circular buffer could lose almost all the data. With a buffer of 10,000 but 26,000 transactions in a block, it would only know about the most recent transactions and only measure their latency. Better to measure over a longer duration.

Reviewer (Contributor):

Sorry, why not make it 26,000 then?

brianolson (Contributor, Author):

Old habit from working in RAM-scarce environments. And to make up some more justification: maybe I don't even want to log all of the txns, just a sample, because we also don't need to process a full 6,000 TPS of this data.

func (pps *WorkerState) txidLatencyBlockWaiter(ctx context.Context, ac *libgoal.Client) {
	done := ctx.Done()
restart:
	// I wish Go had macros
Reviewer (Contributor):

// something something vim your way to Go Macros :)

Comment on lines 1457 to 1459
fmt.Fprintf(os.Stderr, "block waiter w: %v", err)
time.Sleep(5 * time.Second)
goto restart
Reviewer (Contributor):

Looks like this loop feeds blocks to the latencyBlocks handling, which in turn calls time.Now to figure out the latency from the recorded time to block time.

But looking at this bit here: is it possible that this loop will be sleeping when goal publishes a new block for consumption? If so, the time.Now used for the calculation would include that delay, right?

Since this is error handling, I suspect we don't really expect small temporary errors like that, but wanted to check anyhow.

brianolson (Contributor, Author):

Reduced the error restart time to 1 second; maybe the error condition will go away and we'll restart the API calls without oversleeping the round change.

@brianolson brianolson changed the title test: pingpong total latency tools: pingpong total latency Nov 22, 2022
@brianolson brianolson marked this pull request as ready for review November 22, 2022 21:43
@brianolson brianolson marked this pull request as draft December 8, 2022 20:37
codecov bot commented Dec 8, 2022

Codecov Report

Merging #4757 (315c271) into master (c41cd69) will decrease coverage by 0.76%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4757      +/-   ##
==========================================
- Coverage   53.63%   52.88%   -0.76%     
==========================================
  Files         432      432              
  Lines       54058    54166     +108     
==========================================
- Hits        28996    28647     -349     
- Misses      22813    23243     +430     
- Partials     2249     2276      +27     
Impacted Files Coverage Δ
shared/pingpong/config.go 0.00% <ø> (ø)
shared/pingpong/pingpong.go 0.00% <0.00%> (ø)
data/transactions/error.go 0.00% <0.00%> (-50.00%) ⬇️
tools/network/dnssec/client.go 44.11% <0.00%> (-38.24%) ⬇️
catchup/ledgerFetcher.go 13.48% <0.00%> (-25.85%) ⬇️
cmd/tealdbg/cdtdbg.go 63.52% <0.00%> (-18.83%) ⬇️
tools/network/dnssec/trustedzone.go 88.60% <0.00%> (-10.13%) ⬇️
util/codecs/json.go 0.00% <0.00%> (-9.81%) ⬇️
data/basics/overflow.go 43.05% <0.00%> (-9.73%) ⬇️
data/transactions/transaction.go 34.19% <0.00%> (-8.71%) ⬇️
... and 36 more


@brianolson brianolson marked this pull request as ready for review December 8, 2022 21:21
brianolson (Contributor, Author):

After a bunch of updates I think this is again ready for review, and it would be good to get this extra measurement into any new tests.

out := pps.latencyOuts[len(pps.latencyOuts)-1]
for {
	select {
	case st := <-pps.sentTxid:
Reviewer (Contributor):

True sampling should be done on the pps.sentTxid writer side. Otherwise, 10k samples will be fully overwritten in a few rounds under full TPS.

algobolson (Contributor):

circleci is dumb

@algobolson algobolson closed this Jan 18, 2023
algobolson (Contributor):

circleci is still dumb

@algobolson algobolson reopened this Jan 18, 2023
5 participants