Commit

Adds performance guide and documentation for TensorBoard integration
PiperOrigin-RevId: 291474661
peddybeats authored and tensorflow-copybara committed Jan 25, 2020
1 parent 114c035 commit f1e4eb2
Showing 7 changed files with 378 additions and 1 deletion.
5 changes: 4 additions & 1 deletion README.md
@@ -89,7 +89,10 @@ for detailed instructions on how to export SavedModels.

* [Follow a tutorial on Serving Tensorflow models](tensorflow_serving/g3doc/serving_basic.md)
* [Configure Tensorflow Serving to make it fit your serving use case](tensorflow_serving/g3doc/serving_config.md)
* Read the [REST API Guide](tensorflow_serving/g3doc/api_rest.md) or [gRPC API definition](https://github.com/tensorflow/serving/tree/master/tensorflow_serving/apis)
* Read the [Performance Guide](tensorflow_serving/g3doc/performance.md)
and learn how to [use TensorBoard to profile and optimize inference requests](tensorflow_serving/g3doc/tensorboard.md)
* Read the [REST API Guide](tensorflow_serving/g3doc/api_rest.md)
or [gRPC API definition](https://github.com/tensorflow/serving/tree/master/tensorflow_serving/apis)
* [Use SavedModel Warmup if initial inference requests are slow due to lazy initialization of graph](tensorflow_serving/g3doc/saved_model_warmup.md)
* [If encountering issues regarding model signatures, please read the SignatureDef documentation](tensorflow_serving/g3doc/signature_defs.md)
* If using a model with custom ops, [learn how to serve models with custom ops](tensorflow_serving/g3doc/custom_op.md)
(4 binary files not shown)
230 changes: 230 additions & 0 deletions tensorflow_serving/g3doc/performance.md
@@ -0,0 +1,230 @@
# Performance Guide

The performance of TensorFlow Serving is highly dependent on the application it
runs, the environment in which it is deployed and other software with which it
shares access to the underlying hardware resources. As such, tuning its
performance is somewhat case-dependent and there are very few universal rules
that are guaranteed to yield optimal performance in all settings. With that
said, this document aims to capture some general principles and best practices
for running TensorFlow Serving.

Please use the [Profile Inference Requests with TensorBoard](tensorboard.md)
guide to understand the underlying behavior of your model's computation on
inference requests, and use this guide to iteratively improve its performance.

Note: If the following quick tips do not solve your problem, please read the
longer discussion to develop a deep understanding of what affects TensorFlow
Serving's performance.

## Quick Tips

* Latency of first request is too high? Enable
  [model warmup](saved_model_warmup.md).
* Interested in higher resource utilization or throughput? Configure
  [batching](serving_config.md#batching-configuration).

## Performance Tuning: Objectives and Parameters

When fine-tuning TensorFlow Serving's performance, there are usually 2 types of
objectives you may have and 3 groups of parameters to tweak to improve upon
those objectives.

### Objectives

TensorFlow Serving is an *online serving system* for machine-learned models. As
with many other online serving systems, its primary performance objective is to
*maximize throughput while keeping tail-latency below certain bounds*. Depending
on the details and maturity of your application, you may care more about average
latency than
[tail-latency](https://blog.bramp.net/post/2018/01/16/measuring-percentile-latency/),
but **latency** and **throughput**, in some form, are usually the metrics
against which you set performance objectives. Note that we do not discuss
availability in this guide as that is more a function of the deployment
environment.

### Parameters

We can roughly think of 3 groups of parameters whose configuration determines
observed performance: 1) the TensorFlow model, 2) the inference requests, and
3) the server (hardware & binary).

#### 1) The TensorFlow Model

The model defines the computation that TensorFlow Serving will perform upon
receiving each incoming request.

Under the hood, TensorFlow Serving uses the TensorFlow runtime to do the
actual inference on your requests. This means the **average latency** of serving
a request with TensorFlow Serving is _usually_ at least that of doing inference
directly with TensorFlow. Consequently, if inference on a single example takes
2 seconds on a given machine and you have a sub-second latency target, you need
to profile inference requests, understand which TensorFlow ops and sub-graphs
of your model contribute most to that latency, and re-design your model with
inference latency as a design constraint in mind.
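
To see how close you are to that lower bound, it can help to time inference in
the TensorFlow runtime directly. The following is a minimal sketch (not part of
the official tooling) that assumes the `saved_model_half_plus_two_cpu` test
model from the Serving repository, with its `serving_default` signature and
input name `x`:

```python
import time
import tensorflow as tf

# Rough baseline: time inference directly in the TensorFlow runtime, without
# TensorFlow Serving in the path. The path and input name below assume the
# saved_model_half_plus_two_cpu test model checked into the Serving repository.
model = tf.saved_model.load(
    "/tmp/serving/tensorflow_serving/servables/tensorflow/testdata/"
    "saved_model_half_plus_two_cpu/00000123")
infer = model.signatures["serving_default"]

x = tf.constant([1.0, 2.0, 5.0])
infer(x=x)  # Warm-up call; the first run includes one-time initialization.

start = time.time()
for _ in range(100):
  infer(x=x)
print("mean direct-inference latency: %.3f ms" %
      ((time.time() - start) / 100 * 1000))
```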

Please note that while the average latency of performing inference with
TensorFlow Serving is usually not lower than using TensorFlow directly, where
TensorFlow Serving shines is in keeping the **tail latency** down for many
clients querying many different models, all while efficiently utilizing the
underlying hardware to maximize throughput.

#### 2) The Inference Requests

##### API Surfaces

TensorFlow Serving has two API surfaces (HTTP and gRPC), both of which implement
the
[PredictionService API](https://github.com/tensorflow/serving/blob/r2.0/tensorflow_serving/apis/prediction_service.proto#L15)
(with the exception of the HTTP Server not exposing a `MultiInference`
endpoint). Both API surfaces are highly tuned and add minimal latency, but in
practice the gRPC surface is observed to be slightly more performant.

##### API Methods

In general, it is advised to use the Classify and Regress endpoints, as they
accept
[tf.Example](https://github.com/tensorflow/serving/blob/r2.0/tensorflow_serving/apis/input.proto#L77),
which is a higher-level abstraction. However, in rare cases of large (O(Mb))
structured requests, savvy users may find that using PredictRequest and
encoding their data directly into a TensorProto, thereby skipping the
serialization into and deserialization from tf.Example, yields a slight
performance gain.
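
As a rough illustration, a minimal sketch of a gRPC `PredictRequest` that
encodes its payload directly as a `TensorProto` might look like the following.
The model name `half_plus_two`, signature `serving_default`, and tensor names
`x` and `y` are assumptions borrowed from the test model used in the
[TensorBoard guide](tensorboard.md), and the snippet assumes the
`tensorflow-serving-api` pip package is installed:

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Connect to TensorFlow Serving's gRPC endpoint (port 8500 in the Docker
# example from the TensorBoard guide).
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build a PredictRequest whose input is encoded directly as a TensorProto,
# skipping tf.Example serialization entirely.
request = predict_pb2.PredictRequest()
request.model_spec.name = "half_plus_two"
request.model_spec.signature_name = "serving_default"
request.inputs["x"].CopyFrom(tf.make_tensor_proto([1.0, 2.0, 5.0]))

response = stub.Predict(request, 5.0)  # 5-second deadline.
print(response.outputs["y"])
```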

##### Batch Size

There are two primary ways batching can help your performance. You may configure
your clients to send batched requests to TensorFlow Serving, or you may send
individual requests and configure TensorFlow Serving to wait up to a
predetermined period of time, and perform inference on all requests that arrive
in that period in one batch. Configuring the latter kind of batching allows you
to hit TensorFlow Serving at extremely high QPS while letting it scale the
compute resources needed to keep up sub-linearly. This is further
discussed in the [configuration guide](serving_config.md#batching-configuration)
and the
[batching README](https://github.com/tensorflow/serving/blob/r2.0/tensorflow_serving/batching/README.md).
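
As a sketch of the first approach (client-side batching), a single REST request
can simply carry several examples at once; the endpoint and values below reuse
the `half_plus_two` example from the [TensorBoard guide](tensorboard.md):

```python
import requests

# Client-side batching over the REST API: pack several examples into one
# request instead of issuing one request per example.
batch = {"instances": [1.0, 2.0, 5.0, 7.0]}
resp = requests.post(
    "http://localhost:8501/v1/models/half_plus_two:predict", json=batch)
print(resp.json())  # => {"predictions": [2.5, 3.0, 4.5, 5.5]}
```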

#### 3) The Server (Hardware & Binary)

The TensorFlow Serving binary does fairly precise accounting of the hardware
upon which it runs. As such, you should avoid running other compute- or
memory-intensive applications on the same machine, especially ones with dynamic
resource usage.

As with many other types of workloads, TensorFlow Serving is more efficient when
deployed on fewer, larger (more CPU and RAM) machines (i.e. a `Deployment` with
a lower `replicas` count, in Kubernetes terms). This is due to a better
potential for
multi-tenant deployment to utilize the hardware and lower fixed costs (RPC
server, TensorFlow runtime, etc.).

#### Accelerators

If your host has access to an accelerator, ensure you have implemented your
model to place dense computations on the accelerator. This should happen
automatically if you have used high-level TensorFlow APIs, but if you have
built custom graphs, or want to pin specific parts of graphs to specific
accelerators, you may need to manually place certain subgraphs on accelerators
(e.g. using `with tf.device('/device:GPU:0'): ...`).
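
For illustration only (this is not the half_plus_two test model, just a hedged
sketch of manual placement), exporting a SavedModel with part of the
computation pinned to a GPU might look like:

```python
import tensorflow as tf

class HalfPlusTwoGPU(tf.Module):

  @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
  def serve(self, x):
    # Manually pin the dense part of the computation to the first GPU.
    # This assumes a GPU is visible at serving time (or that soft device
    # placement is enabled so TensorFlow can fall back to the CPU).
    with tf.device("/device:GPU:0"):
      y = x * 0.5 + 2.0
    return {"y": y}

module = HalfPlusTwoGPU()
tf.saved_model.save(
    module, "/tmp/half_plus_two_gpu",
    signatures={"serving_default": module.serve.get_concrete_function()})
```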

#### Modern CPUs

Modern CPUs have continuously extended the x86 instruction set architecture to
improve support for [SIMD](https://en.wikipedia.org/wiki/SIMD) (Single
Instruction Multiple Data) and other features critical for dense computations
(e.g. a multiply and an addition in one clock cycle). However, in order to run on
slightly older machines, TensorFlow and TensorFlow Serving are built with the
modest assumption that the newest of these features are not supported by the
host CPU.

`Your CPU supports instructions that this TensorFlow binary was not compiled to
use: AVX2 FMA`

If you see this log entry (possibly with different extensions than the two
listed) at TensorFlow Serving start-up, it means you can rebuild TensorFlow
Serving to target your particular host's platform and enjoy better
performance. Building
TensorFlow Serving from source is relatively easy using Docker and is documented
[here](building_with_docker.md).

#### Binary Configuration

TensorFlow Serving offers a number of configuration knobs that govern its
runtime behavior, mostly set through
[command-line flags](https://github.com/tensorflow/serving/blob/r2.0/tensorflow_serving/model_servers/main.cc).
Some of these (most notably `tensorflow_intra_op_parallelism` and
`tensorflow_inter_op_parallelism`) are passed down to configure the TensorFlow
runtime and are auto-configured by default; savvy users may override them after
running many experiments to find the right configuration for their specific
workload and environment.

## Life of a TensorFlow Serving inference request

Let's briefly go through the life of a prototypical TensorFlow Serving
inference request to see the journey that a typical request goes through. For
our example, we will dive into a Predict Request being received by the 2.0.0
TensorFlow Serving gRPC API surface.

Let's first look at a component-level sequence diagram, and then jump into the
code that implements this series of interactions.

### Sequence Diagram

<!-- Note: sequence-diagram is not supported by GitHub's markdown engine.
To activate internally, uncomment the following block and remove the '|'
characters, which are precluding the dashed arrows from being interpreted
as end-comment tokens.-->

<!--
<style> .rendered-sequence-diagram { max-width: 900px; overflow: auto; }
.rendered-sequence-diagram svg { zoom: 0.70; } </style>
```sequence-diagram
participant Client as C
participant Prediction\nService as PS
participant TensorFlow Predictor as TP
participant Server\nCore as SC
participant TensorFlow\nRuntime as TF
C->PS: Predict
PS->TP: Predict
TP->SC: GetServableHandle
SC-|->TP: tensorflow::Session
TP->TF: tensorflow::Session::Run
TF-|->TP: Output Tensors from Session.Run
TP-|->PS: PredictResponse
PS-|->C: PredictResponse
```
-->

![Predict Sequence Diagram](images/predict_sequence_diagram.png)

Note that the Client is a component owned by the user; the Prediction Service,
Servables, and Server Core are owned by TensorFlow Serving; and the TensorFlow
Runtime is owned by [Core TensorFlow](https://github.com/tensorflow/tensorflow).

### Sequence Details

1. [`PredictionServiceImpl::Predict`](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/model_servers/prediction_service_impl.cc#L38)
receives the `PredictRequest`
2. We invoke the
[`TensorflowPredictor::Predict`](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_impl.cc#L146),
propagating the request deadline from the gRPC request (if one was set).
3. Inside `TensorflowPredictor::Predict`, we
[look up the Servable (model)](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_impl.cc#L165)
the request is looking to perform inference on, from which we retrieve
information about the SavedModel and more importantly, a handle to the
`Session` object in which the model graph is (possibly partially) loaded.
This Servable object was created and committed in memory when the model was
loaded by TensorFlow Serving. We then invoke
[internal::RunPredict](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_util.cc#L181)
to carry out the prediction.
4. In `internal::RunPredict`, after validating and preprocessing the request,
we use the `Session` object to perform the inference using a blocking call
to
[Session::Run](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_util.cc#L209),
at which point we enter core TensorFlow's codebase. After
`Session::Run` returns and our `outputs` tensors have been populated, we
[convert](https://github.com/tensorflow/serving/blob/b5a11f1e5388c9985a6fc56a58c3421e5f78149f/tensorflow_serving/servables/tensorflow/predict_util.cc#L150)
the outputs to a `PredictResponse` and return the result up the call
stack.
144 changes: 144 additions & 0 deletions tensorflow_serving/g3doc/tensorboard.md
@@ -0,0 +1,144 @@
# Profile Inference Requests with TensorBoard

After deploying TensorFlow Serving and issuing requests from your client, you
may notice that requests take longer than you expected, or that you are not
achieving the throughput you would like.

In this guide, we will use TensorBoard's Profiler, which you may already use to
[profile model training](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras),
to trace inference requests to help us debug and improve inference performance.

You should use this guide in conjunction with the best practices described in
the [Performance Guide](performance.md) to optimize your model, requests, and
TensorFlow Serving instance.

## Overview

At a high level, we will point TensorBoard's Profiling tool at TensorFlow
Serving's gRPC server. When we send an inference request to TensorFlow Serving,
we will simultaneously use the TensorBoard UI to ask it to capture the traces
of this request. Behind the scenes, TensorBoard will talk to TensorFlow Serving
over gRPC and ask it to provide a detailed trace of the lifetime of the
inference request. TensorBoard will then visualize the activity of every thread
on every compute device (running code integrated with
[`profiler::TraceMe`](https://github.com/tensorflow/tensorflow/blob/f65b09f9aedcd33d0703cbf3d9845ea2869c0aa8/tensorflow/core/profiler/lib/traceme.h#L73))
over the lifetime of the request in the TensorBoard UI for us to consume.
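
In this guide we trigger the capture from the TensorBoard UI. As a side note,
newer TensorFlow releases also expose a programmatic capture API; a hedged
sketch (assuming `tf.profiler.experimental.client` is available in your
TensorFlow version and that TensorFlow Serving is listening on
`localhost:8500`) might look like:

```python
import tensorflow as tf

# Ask the profiler service embedded in TensorFlow Serving's gRPC server for a
# 2-second trace. The logdir should match the directory passed to
# `tensorboard --logdir` so the trace shows up in the Profile tab.
tf.profiler.experimental.client.trace(
    service_addr="grpc://localhost:8500",
    logdir="/tmp/logs/inference_demo",
    duration_ms=2000)
```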

## Prerequisites

* `tensorflow>=2.0.0`
* TensorBoard (should be installed if TF was installed via `pip`)
* Docker (which we'll use to download and run the TF Serving>=2.1.0 image)

## Deploy model with TensorFlow Serving

For this example, we will use Docker, the recommended way to deploy TensorFlow
Serving, to host a toy model that computes `f(x) = x / 2 + 2`, found in the
[TensorFlow Serving GitHub repository](https://github.com/tensorflow/serving/tree/master/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_cpu/00000123).

Download the TensorFlow Serving source.

```
git clone https://github.com/tensorflow/serving /tmp/serving
cd /tmp/serving
```

Launch TensorFlow Serving via Docker and deploy the half_plus_two model.

```
docker pull tensorflow/serving
MODELS_DIR="$(pwd)/tensorflow_serving/servables/tensorflow/testdata"
docker run -it --rm -p 8500:8500 -p 8501:8501 \
-v $MODELS_DIR/saved_model_half_plus_two_cpu:/models/half_plus_two \
-e MODEL_NAME=half_plus_two \
tensorflow/serving
```

In another terminal, query the model to ensure the model is deployed correctly:

```
curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://localhost:8501/v1/models/half_plus_two:predict
# Returns => { "predictions": [2.5, 3.0, 4.5] }
```

## Set up TensorBoard's Profiler

In another terminal, launch the TensorBoard tool on your machine, providing a
directory to save the inference trace events to:

```
mkdir -p ~/logs/inference_demo
tensorboard --logdir ~/logs/inference_demo/ --port 6006
```

Navigate to http://localhost:6006/ to view the TensorBoard UI. Use the drop-down
menu at the top to navigate to the Profile tab. Click Capture Profile and
provide the address of TensorFlow Serving's gRPC server (e.g. `localhost:8500`).

![Profiling Tool](images/tb_profile_setup_dialog.png)

As soon as you press "Capture," TensorBoard will start sending profile requests
to the model server. In the dialog above, you can set both the deadline for each
request and the total number of times TensorBoard will retry if no trace events
are collected. If you are profiling an expensive model, you may want to increase
the deadline to ensure the profile request does not time out before the
inference request completes.

## Send and Profile an Inference Request

Press Capture in the TensorBoard UI and send an inference request to TF Serving
shortly thereafter.

```
curl -d '{"instances": [1.0, 2.0, 5.0]}' -X POST \
http://localhost:8501/v1/models/half_plus_two:predict
```

You should see a "Capture profile successfully. Please refresh." toast appear at
the bottom of the screen. This means TensorBoard was able to retrieve trace
events from TensorFlow Serving and save them to your `logdir`. Refresh the page
to visualize the inference request with the Profiler's Trace Viewer, as seen in
the next section.

Note: If you see `tensorflow.python.framework.errors_impl.UnimplementedError` in
your TensorBoard logs, it likely means you are running a TensorFlow Serving
version older than 2.1.

## Analyze the Inference Request Trace

![Inference Request Trace](images/tb_profile_overview.png)

You can now easily see what computation is taking place as a result of your
inference request. You can zoom and click on any of the rectangles (trace
events) to get more information such as exact start time and wall duration.

At a high level, we see two threads belonging to the TensorFlow runtime and a
third one that belongs to the REST server, which handles receiving the HTTP
request and creates a TensorFlow Session.

We can zoom in to see what happens inside the SessionRun.

![Inference Request Trace Zoomed-in](images/tb_profile_zoom.png)

In the second thread, we see an initial ExecutorState::Process call in which no
TensorFlow ops run but initialization steps are executed.

In the first thread, we see the call to read the first variable and, once the
second variable is also available, the execution of the multiplication and add
kernels in sequence. Finally, the Executor signals that its computation is done
by calling the DoneCallback, and the Session can be closed.

## Next Steps

While this is a simple example, you can use the same process to profile much
more complex models, allowing you to identify slow ops or bottlenecks in your
model architecture to improve its performance.

Please refer to the
[TensorBoard Profiler Guide](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras#trace_viewer)
for a more complete tutorial on the features of TensorBoard's Profiler, and to
the [TensorFlow Serving Performance Guide](performance.md) to learn more about
optimizing inference performance.
