Commit f1e4eb2 (1 parent: 114c035)

Adds performance guide and documentation for TensorBoard integration

PiperOrigin-RevId: 291474661

Showing 7 changed files with 378 additions and 1 deletion.

# Performance Guide

The performance of TensorFlow Serving is highly dependent on the application it
runs, the environment in which it is deployed and other software with which it
shares access to the underlying hardware resources. As such, tuning its
performance is somewhat case-dependent and there are very few universal rules
that are guaranteed to yield optimal performance in all settings. With that
said, this document aims to capture some general principles and best practices
for running TensorFlow Serving.

Please use the [Profile Inference Requests with TensorBoard](tensorboard.md)
guide to understand the underlying behavior of your model's computation on
inference requests, and use this guide to iteratively improve its performance.

Note: If the following quick tips do not solve your problem, please read the
longer discussion to develop a deep understanding of what affects TensorFlow
Serving's performance.

## Quick Tips

* Latency of first request is too high? Enable
  [model warmup](saved_model_warmup). A sketch of writing warmup records
  follows this list.
* Interested in higher resource utilization or throughput? Configure
  [batching](serving_config.md#batching-configuration).
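
The sketch below shows one way to pre-package warmup requests: a TFRecord of
`PredictionLog` protos written to the SavedModel's `assets.extra/` directory
(see the [model warmup](saved_model_warmup) guide for the authoritative
details). The model name, signature, input tensor name, and paths here are
assumptions for illustration only.

```python
import tensorflow as tf
from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

# Hedged sketch: write warmup records next to the SavedModel so TensorFlow
# Serving can replay them at load time. Paths and names are hypothetical.
warmup_path = "/models/my_model/1/assets.extra/tf_serving_warmup_requests"

request = predict_pb2.PredictRequest(
    model_spec=model_pb2.ModelSpec(name="my_model",
                                   signature_name="serving_default"),
    inputs={"x": tf.make_tensor_proto([[1.0, 2.0, 3.0]])})
log = prediction_log_pb2.PredictionLog(
    predict_log=prediction_log_pb2.PredictLog(request=request))

with tf.io.TFRecordWriter(warmup_path) as writer:
    for _ in range(10):  # a handful of representative requests is enough
        writer.write(log.SerializeToString())
```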

## Performance Tuning: Objectives and Parameters

When fine-tuning TensorFlow Serving's performance, there are usually 2 types of
objectives you may have and 3 groups of parameters to tweak to improve upon
those objectives.

### Objectives

TensorFlow Serving is an *online serving system* for machine-learned models. As
with many other online serving systems, its primary performance objective is to
*maximize throughput while keeping tail-latency below certain bounds*. Depending
on the details and maturity of your application, you may care more about average
latency than
[tail-latency](https://blog.bramp.net/post/2018/01/16/measuring-percentile-latency/),
but some notion of **latency** and **throughput** are usually the metrics
against which you set performance objectives. Note that we do not discuss
availability in this guide as that is more a function of the deployment
environment.

### Parameters

We can roughly think about 3 groups of parameters whose configuration determines
observed performance: 1) the TensorFlow model, 2) the inference requests, and
3) the server (hardware & binary).

#### 1) The TensorFlow Model

The model defines the computation that TensorFlow Serving will perform upon
receiving each incoming request.

Under the hood, TensorFlow Serving uses the TensorFlow runtime to do the
actual inference on your requests. This means the **average latency** of serving
a request with TensorFlow Serving is _usually_ at least that of doing inference
directly with TensorFlow. In other words, if on a given machine inference on a
single example takes 2 seconds and you have a sub-second latency target, you
need to profile inference requests, understand what TensorFlow ops and
sub-graphs of your model contribute most to that latency, and re-design your
model with inference latency as a design constraint in mind.
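
To ground the discussion, you can establish a baseline by timing inference
directly against the TensorFlow runtime before involving TensorFlow Serving.
The sketch below assumes a hypothetical SavedModel exported to
`/tmp/my_model/1` whose serving signature takes a single input named `x`.

```python
import time
import tensorflow as tf

# Hedged sketch: measure baseline per-request latency of a SavedModel using
# the TensorFlow runtime directly. Path, input name, and shape are assumptions.
model = tf.saved_model.load("/tmp/my_model/1")
infer = model.signatures["serving_default"]
example = tf.constant([[1.0, 2.0, 3.0]])

_ = infer(x=example)  # warm-up call so one-time initialization is not timed

start = time.time()
for _ in range(100):
    _ = infer(x=example)
print("mean latency: %.2f ms" % ((time.time() - start) / 100 * 1e3))
```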

Please note, while the average latency of performing inference with TensorFlow
Serving is usually not lower than using TensorFlow directly, where TensorFlow
Serving shines is keeping the **tail latency** down for many clients querying
many different models, all while efficiently utilizing the underlying hardware
to maximize throughput.

#### 2) The Inference Requests

##### API Surfaces

TensorFlow Serving has two API surfaces (HTTP and gRPC), both of which implement
the
[PredictionService API](https://github.com/tensorflow/serving/blob/r2.0/tensorflow_serving/apis/prediction_service.proto#L15)
(with the exception of the HTTP Server not exposing a `MultiInference`
endpoint). Both API surfaces are highly tuned and add minimal latency but, in
practice, the gRPC surface is observed to be slightly more performant.
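
As a point of reference, a minimal gRPC client for the `Predict` method might
look like the sketch below; the server address, model name, and input tensor
name are assumptions for illustration.

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Hedged sketch of a gRPC Predict client; model and input names are hypothetical.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"
request.model_spec.signature_name = "serving_default"
request.inputs["x"].CopyFrom(tf.make_tensor_proto([[1.0, 2.0, 3.0]]))

response = stub.Predict(request, timeout=5.0)  # 5-second deadline
print(response.outputs)
```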

##### API Methods

In general, it is advised to use the Classify and Regress endpoints as they
accept
[tf.Example](https://github.com/tensorflow/serving/blob/r2.0/tensorflow_serving/apis/input.proto#L77),
which is a higher-level abstraction; however, in rare cases of large (O(Mb))
structured requests, savvy users may find that using PredictRequest, directly
encoding their data into a TensorProto, and skipping the serialization into and
deserialization from tf.Example yields a slight performance gain.
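
For comparison with the Predict sketch above, a Classify request built from
`tf.Example` might look like the following; the feature names and model name
are assumptions for illustration.

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import classification_pb2, prediction_service_pb2_grpc

# Hedged sketch of a Classify request carrying tf.Example; names are hypothetical.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

example = tf.train.Example(features=tf.train.Features(feature={
    "age": tf.train.Feature(float_list=tf.train.FloatList(value=[42.0])),
    "job": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"engineer"])),
}))

request = classification_pb2.ClassificationRequest()
request.model_spec.name = "my_classifier"
request.input.example_list.examples.append(example)

response = stub.Classify(request, timeout=5.0)
print(response.result)
```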

##### Batch Size

There are two primary ways batching can help your performance. You may configure
your clients to send batched requests to TensorFlow Serving, or you may send
individual requests and configure TensorFlow Serving to wait up to a
predetermined period of time, and perform inference on all requests that arrive
in that period in one batch. Configuring the latter kind of batching allows you
to hit TensorFlow Serving at extremely high QPS, while allowing it to
sub-linearly scale the compute resources needed to keep up. This is further
discussed in the [configuration guide](serving_config.md#batching-configuration)
and the
[batching README](https://github.com/tensorflow/serving/blob/r2.0/tensorflow_serving/batching/README.md).
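
As a hedged sketch of the server-side approach, you would typically enable
batching and point the model server at a batching parameters file; the flag
names follow the [configuration guide](serving_config.md#batching-configuration),
while the paths, model name, and parameter values below are illustrative only,
not recommendations.

```
# Write an illustrative batching parameters file (protobuf text format).
cat > /tmp/batching_parameters.txt <<EOF
max_batch_size { value: 128 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 100 }
EOF

# Start the model server with server-side batching enabled.
tensorflow_model_server --port=8500 --rest_api_port=8501 \
  --model_name=my_model --model_base_path=/models/my_model \
  --enable_batching=true \
  --batching_parameters_file=/tmp/batching_parameters.txt
```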

#### 3) The Server (Hardware & Binary)

The TensorFlow Serving binary does fairly precise accounting of the hardware
upon which it runs. As such, you should avoid running other compute- or
memory-intensive applications on the same machine, especially ones with dynamic
resource usage.

As with many other types of workloads, TensorFlow Serving is more efficient when
deployed on fewer, larger (more CPU and RAM) machines (i.e. a `Deployment` with
a lower `replicas` count in Kubernetes terms). This is due to a better potential
for multi-tenant deployment to utilize the hardware and lower fixed costs (RPC
server, TensorFlow runtime, etc.).

#### Accelerators

If your host has access to an accelerator, ensure you have implemented your
model to place dense computations on the accelerator. This should happen
automatically if you have used high-level TensorFlow APIs, but if you have
built custom graphs, or want to pin specific parts of graphs on specific
accelerators, you may need to manually place certain subgraphs on accelerators
(i.e. using `with tf.device('/device:GPU:0'): ...`).
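
A minimal sketch of such manual placement is shown below; the shapes and ops
are illustrative, and the device string assumes a single GPU is visible to the
process.

```python
import tensorflow as tf

# Hedged sketch of manual device placement; shapes and ops are illustrative.
with tf.device("/device:GPU:0"):
    # Pin the expensive dense computation to the accelerator.
    w = tf.random.normal([1024, 1024])
    x = tf.random.normal([1024, 1024])
    y = tf.matmul(x, w)

with tf.device("/device:CPU:0"):
    # Keep lightweight post-processing on the CPU.
    result = tf.reduce_mean(y)
```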

#### Modern CPUs

Modern CPUs have continuously extended the x86 instruction set architecture to
improve support for [SIMD](https://en.wikipedia.org/wiki/SIMD) (Single
Instruction Multiple Data) and other features critical for dense computations
(e.g. a multiply and an addition in one clock cycle). However, in order to run
on slightly older machines, TensorFlow and TensorFlow Serving are built with the
modest assumption that the newest of these features are not supported by the
host CPU.

`Your CPU supports instructions that this TensorFlow binary was not compiled to
use: AVX2 FMA`

If you see this log entry (possibly listing different extensions than the two
shown) at TensorFlow Serving start-up, it means you can rebuild TensorFlow
Serving to target your particular host's platform and enjoy better performance.
Building TensorFlow Serving from source is relatively easy using Docker and is
documented [here](building_with_docker.md).
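
As a rough sketch of that workflow (see [building_with_docker.md](building_with_docker.md)
for the authoritative, version-specific steps), building from source with
Docker might look like this; the image tag is arbitrary, and the exact
mechanism for passing extra compiler options (e.g. `-mavx2`, `-mfma`) varies by
release.

```
git clone https://github.com/tensorflow/serving /tmp/serving
cd /tmp/serving

# The devel image compiles TensorFlow Serving from source inside Docker.
docker build --pull -t $USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.devel .
```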

#### Binary Configuration

TensorFlow Serving offers a number of configuration knobs that govern its
runtime behavior, mostly set through
[command-line flags](https://github.com/tensorflow/serving/blob/r2.0/tensorflow_serving/model_servers/main.cc).
Some of these (most notably `tensorflow_intra_op_parallelism` and
`tensorflow_inter_op_parallelism`) are passed down to configure the TensorFlow
runtime and are auto-configured by default; savvy users may override them by
running many experiments and finding the right configuration for their specific
workload and environment.
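
For instance, a hedged sketch of overriding these two flags is shown below; the
values are illustrative starting points rather than recommendations, and the
remaining flags mirror a typical model server invocation.

```
tensorflow_model_server \
  --port=8500 --rest_api_port=8501 \
  --model_name=my_model --model_base_path=/models/my_model \
  --tensorflow_intra_op_parallelism=4 \
  --tensorflow_inter_op_parallelism=4
```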

## Life of a TensorFlow Serving inference request

Let's briefly go through the life of a prototypical TensorFlow Serving
inference request to see the journey that a typical request goes through. For
our example, we will dive into a Predict request being received by the 2.0.0
TensorFlow Serving gRPC API surface.

Let's first look at a component-level sequence diagram, and then jump into the
code that implements this series of interactions.

### Sequence Diagram

<!-- Note: sequence-diagram is not supported by GitHub's markdown engine.
To activate internally, uncomment the following block and remove the '|'
characters, which are precluding the dashed arrows from being interpreted
as end-comment tokens.-->

<!--
<style> .rendered-sequence-diagram { max-width: 900px; overflow: auto; }
.rendered-sequence-diagram svg { zoom: 0.70; } </style>
```sequence-diagram
participant Client as C
participant Prediction\nService as PS
participant TensorFlow Predictor as TP
participant Server\nCore as SC
participant TensorFlow\nRuntime as TF
C->PS: Predict
PS->TP: Predict
TP->SC: GetServableHandle
SC-|->TP: tensorflow::Session
TP->TF: tensorflow::Session::Run
TF-|->TP: Output Tensors from Session.Run
TP-|->PS: PredictResponse
PS-|->C: PredictResponse
```
-->

*(Image: component-level sequence diagram of the Predict request flow.)*

Note that the Client is a component owned by the user; the Prediction Service,
Servables, and Server Core are owned by TensorFlow Serving; and the TensorFlow
Runtime is owned by [Core TensorFlow](https://github.com/tensorflow/tensorflow).

### Sequence Details

1.  [`PredictionServiceImpl::Predict`](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/model_servers/prediction_service_impl.cc#L38)
    receives the `PredictRequest`.
2.  We invoke
    [`TensorflowPredictor::Predict`](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_impl.cc#L146),
    propagating the request deadline from the gRPC request (if one was set).
3.  Inside `TensorflowPredictor::Predict`, we
    [look up the Servable (model)](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_impl.cc#L165)
    the request is looking to perform inference on, from which we retrieve
    information about the SavedModel and, more importantly, a handle to the
    `Session` object in which the model graph is (possibly partially) loaded.
    This Servable object was created and committed in memory when the model was
    loaded by TensorFlow Serving. We then invoke
    [internal::RunPredict](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_util.cc#L181)
    to carry out the prediction.
4.  In `internal::RunPredict`, after validating and preprocessing the request,
    we use the `Session` object to perform the inference using a blocking call
    to
    [Session::Run](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_util.cc#L209),
    at which point we enter core TensorFlow's codebase. After `Session::Run`
    returns and our `outputs` tensors have been populated, we
    [convert](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_util.cc#L150)
    the outputs to a `PredictResponse` and return the result up the call
    stack.

# Profile Inference Requests with TensorBoard

After deploying TensorFlow Serving and issuing requests from your client, you
may notice that requests take longer than you expected, or you are not achieving
the throughput that you would have liked.

In this guide, we will use TensorBoard's Profiler, which you may already use to
[profile model training](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras),
to trace inference requests to help us debug and improve inference performance.

You should use this guide in conjunction with the best practices denoted in the
[Performance Guide](performance.md) to optimize your model, requests and
TensorFlow Serving instance.

## Overview

At a high level, we will point TensorBoard's Profiling tool at TensorFlow
Serving's gRPC server. When we send an inference request to TensorFlow Serving,
we will also simultaneously use the TensorBoard UI to ask it to capture the
traces of this request. Behind the scenes, TensorBoard will talk to TensorFlow
Serving over gRPC and ask it to provide a detailed trace of the lifetime of the
inference request. TensorBoard will then visualize the activity of every thread
on every compute device (running code integrated with
[`profiler::TraceMe`](https://github.com/tensorflow/tensorflow/blob/f65b09f9aedcd33d0703cbf3d9845ea2869c0aa8/tensorflow/core/profiler/lib/traceme.h#L73))
over the course of the lifetime of the request on the TensorBoard UI for us to
consume.

## Prerequisites

* `Tensorflow>=2.0.0`
* TensorBoard (should be installed if TF was installed via `pip`)
* Docker (which we'll use to download and run the TF Serving >= 2.1.0 image)

## Deploy model with TensorFlow Serving

For this example, we will use Docker, the recommended way to deploy TensorFlow
Serving, to host a toy model that computes `f(x) = x / 2 + 2`, found in the
[TensorFlow Serving GitHub repository](https://github.com/tensorflow/serving/tree/master/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_cpu/00000123).

Download the TensorFlow Serving source.

```
git clone https://github.com/tensorflow/serving /tmp/serving
cd /tmp/serving
```

Launch TensorFlow Serving via Docker and deploy the half_plus_two model.

```
docker pull tensorflow/serving
MODELS_DIR="$(pwd)/tensorflow_serving/servables/tensorflow/testdata"
docker run -it --rm -p 8500:8500 -p 8501:8501 \
  -v $MODELS_DIR/saved_model_half_plus_two_cpu:/models/half_plus_two \
  -e MODEL_NAME=half_plus_two \
  tensorflow/serving
```

In another terminal, query the model to ensure the model is deployed correctly:

```
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
  -X POST http://localhost:8501/v1/models/half_plus_two:predict
# Returns => { "predictions": [2.5, 3.0, 4.5] }
```

## Set up TensorBoard's Profiler

In another terminal, launch the TensorBoard tool on your machine, providing a
directory to save the inference trace events to:

```
mkdir -p ~/logs/inference_demo
tensorboard --logdir ~/logs/inference_demo/ --port 6006
```

Navigate to http://localhost:6006/ to view the TensorBoard UI. Use the drop-down
menu at the top to navigate to the Profile tab. Click Capture Profile and
provide the address of TensorFlow Serving's gRPC server.

*(Image: the "Capture Profile" dialog in the TensorBoard Profile tab.)*

As soon as you press "Capture," TensorBoard will start sending profile requests
to the model server. In the dialog above, you can set both the deadline for each
request and the total number of times TensorBoard will retry if no trace events
are collected. If you are profiling an expensive model, you may want to increase
the deadline to ensure the profile request does not time out before the
inference request completes.
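
As an alternative to clicking through the UI, recent TensorFlow 2.x releases
also let you trigger the same capture programmatically with the profiler
client; the sketch below is a hedged example (the gRPC address, log directory,
and duration are illustrative), and you still need to send an inference request
while the trace is active, as described in the next section.

```python
import os
import tensorflow as tf

# Hedged sketch: ask TensorFlow Serving for a trace over its gRPC port and
# write the result into the same logdir that TensorBoard is watching.
tf.profiler.experimental.client.trace(
    service_addr="grpc://localhost:8500",
    logdir=os.path.expanduser("~/logs/inference_demo"),
    duration_ms=3000)
```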

## Send and Profile an Inference Request

Press Capture on the TensorBoard UI and send an inference request to TF Serving
quickly thereafter.

```
curl -d '{"instances": [1.0, 2.0, 5.0]}' -X POST \
  http://localhost:8501/v1/models/half_plus_two:predict
```

You should see a "Capture profile successfully. Please refresh." toast appear at
the bottom of the screen. This means TensorBoard was able to retrieve trace
events from TensorFlow Serving and saved them to your `logdir`. Refresh the page
to visualize the inference request with the Profiler's Trace Viewer, as seen in
the next section.

Note: If you see `tensorflow.python.framework.errors_impl.UnimplementedError` in
your TensorBoard logs, it likely means you are running a TensorFlow Serving
version older than 2.1.

## Analyze the Inference Request Trace

*(Image: Trace Viewer overview of the inference request.)*

You can now easily see what computation is taking place as a result of your
inference request. You can zoom and click on any of the rectangles (trace
events) to get more information such as exact start time and wall duration.

At a high level, we see two threads belonging to the TensorFlow runtime and a
third one that belongs to the REST server, which handles receiving the HTTP
request and creates a TensorFlow Session.

We can zoom in to see what happens inside the `SessionRun`.

*(Image: Trace Viewer zoomed in on the SessionRun trace events.)*

In the second thread, we see an initial `ExecutorState::Process` call in which
no TensorFlow ops run but initialization steps are executed.

In the first thread, we see the call to read the first variable, and once the
second variable is also available, it executes the multiplication and add
kernels in sequence. Finally, the Executor signals that its computation is done
by calling the `DoneCallback`, and the Session can be closed.

## Next Steps

While this is a simple example, you can use the same process to profile much
more complex models, allowing you to identify slow ops or bottlenecks in your
model architecture and improve its performance.

Please refer to the
[TensorBoard Profiler Guide](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras#trace_viewer)
for a more complete tutorial on the features of TensorBoard's Profiler, and to
the [TensorFlow Serving Performance Guide](performance.md) to learn more about
optimizing inference performance.