Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include histogram in JSON output #395

Merged
merged 3 commits into from
Jun 24, 2019
Merged

Include histogram in JSON output #395

merged 3 commits into from
Jun 24, 2019

Conversation

fxkr
Copy link
Contributor

@fxkr fxkr commented Apr 24, 2019

Background

See #394

Interesting cases to test:

# Main use case
% vegeta report -type=json -hist='[0,1ms,2ms,3ms,4ms,5ms]' vegeta.bin
{"latencies": ..., "hist": [0,0,0,56,44,400], ...}

# If -hist is not given, "hist" is omitted
% vegeta report -type=json vegeta.bin

# -type=hist supports -hist too (for consistency)
% vegeta report -type=hist -hist='[0,1ms,2ms,3ms,4ms,5ms]' vegeta.bin

# Legacy -type=hist invocation continues to work unchanged
% vegeta report -type=hist'[0,1ms,2ms,3ms,4ms,5ms]' vegeta.bin

# Help should mention the new parameter
% vegeta --help

Checklist

  • Git commit messages conform to community standards.
  • Each Git commit represents meaningful milestones or atomic units of work.
  • Changed or added code is covered by appropriate tests.

@fxkr fxkr requested review from tsenart and xla as code owners April 24, 2019 07:39
@tsenart
Copy link
Owner

tsenart commented Apr 28, 2019

Thanks for the contribution! I have some thoughts that I'd like to share and get your input on.

I've had the feeling for a long time that the histogram report is under-used and its UX and utility could be substantially improved. Your use-case hints at the same conclusion.

Similarly to how vegeta plot became its own sub command when it started to deviate significantly from the other "reports", I think that the introduction of this --hist command line argument suggests that a vegeta hist(ogram) sub-command is up for promotion.

My vision for this sub-command could be captured like this in its help output:

Usage: vegeta histogram [options] [<file>...]

Outputs a histogram of request latencies in HTML, JSON or text format.

Arguments:
  <file>  A file with vegeta attack results encoded with one of
          the supported encodings (gob | json | csv) [default: stdin]

Options:
  --bins    How to bin request latencies. Defaults to auto
            logarithmic binning. Specific bins can also be specified
            with as a string of comma separated durations (e.g "0,1ms,2ms,3ms,4ms,5ms")
            [default: "auto"]

  --format  Format to output histogram in. Supported formats are
            html, json and text. [default: "text"]

Examples:
  echo "GET http://:80" | vegeta attack -name=50qps -rate=50 -duration=5s > results.50qps.bin
  cat results.50qps.bin | vegeta histogram

What are your thoughts on this?

@fxkr
Copy link
Contributor Author

fxkr commented Apr 29, 2019

I feel that histograms are underused in general, and I would like to see their use promoted.

I would go the opposite way and eventually include a histogram in the default report output. Having it in a separate command kind of implies it's something not normally needed. Whereas in reality averages alone can be misleading, so imo the histogram should always be shown at least for a quick look. As a user, I would also like to see a good enough summary with one command and no options, ie vegeta report.

The limit is that I also don't want to have to scroll the output. But the current text output is 7 lines plus errors, so some space is available. And actually I'd rather have the histogram than the errors, which are excessive for me due to #399, but that can be fixed with better aggregation (let's discuss that in the other issue).

I'm not against having subcommands for the individual reports. But:

a) I'd still want a combined report for ease of use and to have a standard, easily recognizable output text format (e.g. if I see iperf output I know what to look for right away. I do that a lot.). It's not mutually exclusive with subcommands, it's enough if it just calls all the individual commands with default params for me. But defaults matter, and good defaults can save a lot of time.

b) I'm fine with having a flexible report command with many options for individual "sections" of the report. For histograms, I don't foresee more than --bins. That can easily stay in report. We shouldn't even have too many parameters. Users with very specific requirements will build their own scripts around the JSON or API anyway.

Also I think it's very important to have a single command that outputs all the data in one JSON. This is kind of important for easy integration. It doesn't have to be report -format=json, but it can be.

@tsenart
Copy link
Owner

tsenart commented May 5, 2019

I feel that histograms are underused in general, and I would like to see their use promoted.

I echo this sentiment.

I would go the opposite way and eventually include a histogram in the default report output. Having it in a separate command kind of implies it's something not normally needed. Whereas in reality averages alone can be misleading, so imo the histogram should always be shown at least for a quick look.

It's is because averages can be misleading that the default text report includes percentiles. Do you think that many users fail to interpret percentiles, but would instead succeed with a histogram?

As a user, I would also like to see a good enough summary with one command and no options, ie vegeta report. The limit is that I also don't want to have to scroll the output. But the current text output is 7 lines plus errors, so some space is available.

a) I'd still want a combined report for ease of use and to have a standard, easily recognizable output text format (e.g. if I see iperf output I know what to look for right away. I do that a lot.). It's not mutually exclusive with subcommands, it's enough if it just calls all the individual commands with default params for me.

How are you envisioning this standard report output? Could you show me an illustrated example?

But defaults matter, and good defaults can save a lot of time.

Automatic binning, then, seems like the way to go. How would you approach this?

And actually I'd rather have the histogram than the errors, which are excessive for me due to #399, but that can be fixed with better aggregation (let's discuss that in the other issue).

Good call! These errors can should indeed be better aggregated. Thanks for opening #399.

b) I'm fine with having a flexible report command with many options for individual "sections" of the report. For histograms, I don't foresee more than --bins. That can easily stay in report. We shouldn't even have too many parameters. Users with very specific requirements will build their own scripts around the JSON or API anyway.

Also I think it's very important to have a single command that outputs all the data in one JSON. This is kind of important for easy integration. It doesn't have to be report -format=json, but it can be.

Agreed and agreed.

@fxkr
Copy link
Contributor Author

fxkr commented May 6, 2019

I would go the opposite way and eventually include a histogram in the default report output. Having it in a separate command kind of implies it's something not normally needed. Whereas in reality averages alone can be misleading, so imo the histogram should always be shown at least for a quick look.

It's is because averages can be misleading that the default text report includes percentiles. Do you think that many users fail to interpret percentiles, but would instead succeed with a histogram?

I've read that article before. I was actually wondering why vegeta only outputs the 99th percentile, given that (taking the numbers from there) 18% of users will see worse than the 99 .9th percentile. I suppose it's in the region where we need to be careful with introducing errors (e.g. due to GC).

Just to be clear, I wasn't arguing against percentiles, I was just arguing for adding histograms (and keeping the percentiles). I think percentiles and histograms serve different purposes.

Percentiles are an easier to track metric, and are easier to relate to end user experience. From a blackbox "how good is this system?" perspective, percentiles might even be fine.

Histograms help you understand the system. I know I find it hard to interpret percentiles if I know nothing over the distribution. From a few percentiles I can't even tell a single log-normal distribution from two overlaid ones (e.g. due to a cache that hits in x% cases. more motivation). So the histogram is useful from a whitebox "how do I tune/fix this system?" perspective.

(It would be different if we could look at the CDF, ie "all percentiles as a graph". In terms of information that's equivalent to the histogram. I find them less intuitive than histograms.)

How are you envisioning this standard report output? Could you show me an illustrated example?

Just all the sections one after another.

I wasn't asking to really change much in the format. What I meant was: if we have a single good output that doesn't rely on customization (via subcommands and parameters, e.g. to enable/disable individual sections) then most vegeta reports (incl. those copy/pasted by other people) will look similar. If we encourage users to parametrize the command, the reader will have to visually "parse" the output again every time.

Requests      [total, rate]            10000, 1000.08
Duration      [total, attack, wait]    9.999650299s, 9.999204067s, 446.232µs
Latencies     [mean, 50, 95, 99, max]  508.506µs, 505.419µs, 618.511µs, 733.938µs, 12.052322ms
Bytes In      [total, mean]            5159484, 515.95
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  99.99%
Status Codes  [code:count]             0:1  200:9999  

Bucket         #     %       Histogram
[0s,    1ms]   9952  99.52%  ###########################################################################
[1ms,   2ms]   40    0.40%   ##############################
[2ms,   4ms]   6     0.06%   ##############
[4ms,   8ms]   1     0.01%   
[8ms,   +Inf]  1     0.01%   

Error Set:
Get http://localhost:8000: net/http: timeout awaiting response headers

Automatic binning, then, seems like the way to go. How would you approach this?

I have a rough draft in my fork (branch/commit)

Usage: vegeta report -type=hist. It's only extending the hist report for now, I didn't touch the text report yet.

The problem I'm running into is vegeta has structs in the public API, which means I have to resort to conditionals instead of polymorphism if I want to have a Histogram that manages its own buckets. If Histogram was an Interface this would not be a problem. How attached are you to API compatibility (relative to clean code)?

I picked base 2 and didn't add a way to customize the base, which is probably ok.

It also includes another bugfix/feature that I should probably split out and propose separately (I'm rescaling the bars of the histogram so that the largest bar fits the available width.)

I was also wondering if linear search is best to find the bucket, but haven't benchmarked this yet. Quite possible binary search is slower since len(buckets) is smallish. Anyway that's also a pre-existing issue, so a fine micro-optimization project for later.

I would do a separate PR for the auto log binning after we merge this one here, but it's still good that we're discussing both here until we have an acceptable CLI.

Copy link
Owner

@tsenart tsenart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the proposed ideas to extend the default text and JSON report to include a histogram by default. Will you follow up with a PR for that? If we do that, we should deprecate the independent hist report.

README.md Outdated
@@ -363,6 +363,8 @@ Options:
--type Which report type to generate (text | json | hist[buckets]).
[default: text]

--hist Histogram buckets, e.g.: '[0,1ms,10ms]'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this flag --buckets instead and not make the output of the actual histogram optional. It should always be included in both the JSON and text report, with the --buckets value determined automatically from the data set when not specified. This, however, would mean that the algorithm for automatic binning must be appropriate for large data sets (i.e. worse-case O(n)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll change it to --buckets (won't get to it today, but soon).

I'll do the change to enable the histogram by default using automatic bins in a follow up PR once this one is in.

This, however, would mean that the algorithm for automatic binning must be appropriate for large data sets (i.e. worse-case O(n)).

Yes. It's already O(num of samples), and num of bucket is "small" (<=16 in practice, <= 64 technically) so that's ok too.

Your comment did make me realize I can just preallocate 64 buckets and remove the extension logic.

I wanted to benchmark the bucket search code, but with 16 buckets I expect linear search will be close to ideal.

report.go Outdated
if len(typ) < 6 {
return fmt.Errorf("bad buckets: '%s'", typ[4:])
}
histStr = typ[4:]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's return an error saying that specifying the buckets via the --type flag is deprecated and that one should use the --buckets flag instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@fxkr
Copy link
Contributor Author

fxkr commented May 12, 2019

I like the proposed ideas to extend the default text and JSON report to include a histogram by default. Will you follow up with a PR for that? If we do that, we should deprecate the independent hist report.

Yes, I'll do a follow up PR.

@fxkr
Copy link
Contributor Author

fxkr commented May 12, 2019

Having used the histogram data in the JSON a bit, I now feel that using an array of bucket counts was a huge mistake. I keep second guessing what the bins were in old JSON output. I think I will switch to using a dict. This will also work better with the automatic bins.

@fxkr
Copy link
Contributor Author

fxkr commented May 12, 2019

  • Change --hist to --buckets
  • In JSON output, change array of bucket counts to map[latency]count
  • Add deprecation warning to --type=hist ---> update: no, will be done only when text report includes histogram

@tsenart
Copy link
Owner

tsenart commented May 13, 2019

Add deprecation warning to --type=hist

Maybe the deprecation warning should only be added once we have the text report with the histogram included.

@ruiaraujo
Copy link

Would it be an option to use the HDR histogram log format so that a lot of visualization tooling for HDR can be used?

lib/histogram.go Outdated Show resolved Hide resolved
@fxkr
Copy link
Contributor Author

fxkr commented May 28, 2019

Would it be an option to use the HDR histogram log format so that a lot of visualization tooling for HDR can be used?

I like the idea, but I would say it's out of scope for this PR. I had a brief look at the source code (it appears there's no actual spec for the format) and it would definitely need to be implemented completely separately from vegeta's proprietary JSON format. I don't know what @tsenart thinks about it, but I would suggest filing a separate feature request issue for this.

@fxkr
Copy link
Contributor Author

fxkr commented May 28, 2019

@tsenart Apologies for the delay! Here's the new changeset. Functionally I'd say it's complete I would do separate PR's for more features (automatic bins, extending text report etc.). Let me know if that works for you or if you find anything wrong with the code. Thanks!

@tsenart
Copy link
Owner

tsenart commented Jun 16, 2019

Would it be an option to use the HDR histogram log format so that a lot of visualization tooling for HDR can be used?

Thanks for bringing this to our attention! I wasn't familiar with it. I think it's a great idea to make Vegeta able to output this.

d it would definitely need to be implemented completely separately from vegeta's proprietary JSON format.

Veget'a JSON format is not proprietary. Did you mean "custom"? :-)

README.md Show resolved Hide resolved
tsenart
tsenart previously approved these changes Jun 16, 2019
@ruiaraujo
Copy link

@tsenart I have a fork that outputs the format I described, I will clean it up and propose a PR later.
The main issue is that the golang implementation of hdr histogram is deprecated with no replacement.

@fxkr
Copy link
Contributor Author

fxkr commented Jun 17, 2019

Veget'a JSON format is not proprietary. Did you mean "custom"? :-)

Yes; that was badly worded. I meant "non-standard", to contrast with the hdrhistogram format (which is somewhat less non-standard in the way that some other tools probably use it, even though it's still totally not a standard)

@tsenart
Copy link
Owner

tsenart commented Jun 17, 2019

@ruiaraujo, @fxkr: Mind to take a look? #413

@fxkr
Copy link
Contributor Author

fxkr commented Jun 24, 2019

Rebased onto master and extended README with description of bucket field in JSON:

+In the `buckets` field, each key is a nanosecond value representing the lower bound of a bucket.
+The upper bound is implied by the next higher bucket.
+Upper bounds are non-inclusive.
+The highest bucket is the overflow bucket; it has no upper bound.
+The values are counts of how many requests fell into that particular bucket.

@tsenart tsenart merged commit 2870712 into tsenart:master Jun 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants