[Docs] Add some new info on performance tuning

Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
24minFan · Feb 23, 2021 · 9e0ac10
1 parent 2decd11
commit 9e0ac10
Showing 1 changed file with 33 additions and 14 deletions.
diff --git a/Documentation/devel/performance.rst b/Documentation/devel/performance.rst
@@ -332,13 +332,13 @@ Modern Icelake machines remove many of the hardware bottlenecks of Intel SGX. If
 you must use an older machine (Skylake, Kaby Lake, Mehlow), you should be aware
 that they have severe SGX-hardware limitations. In particular:
 
-#. EPC size. You can think of EPC as a physical cache (just like L3 cache) for
-   enclave pages. On all currently available machines, EPC is only 128-256MB in
-   size. This means that if the application has a working set size of more than
-   100-200MB, enclave pages will be evicted from EPC into RAM.  Eviction of
-   enclave pages (also called EPC swapping or paging) is a very expensive
-   hardware operation. Some applications have a working set size of MBs/GBs of
-   data, so performance will be significantly impaired.
+#. :term:`EPC` size. You can think of EPC as a physical cache (just like L3
+   cache) for enclave pages. On all currently available machines, EPC is only
+   128-256MB in size. This means that if the application has a working set size
+   of more than 100-200MB, enclave pages will be evicted from EPC into RAM.
+   Eviction of enclave pages (also called EPC swapping or paging) is a very
+   expensive hardware operation. Some applications have a working set size of
+   MBs/GBs of data, so performance will be significantly impaired.
 
 #. RDTSC/RDTSCP instructions. These instructions are forbidden to execute in an
    SGX enclave on older machines. Unfortunately, many applications and runtimes
@@ -362,6 +362,13 @@ that they have severe SGX-hardware limitations. In particular:
    version that performs function calls inside Graphene instead of raw SYSCALL
    instructions and thus avoids this overhead).
 
+#. CPU topology. The CPU topology may negatively affect performance of Graphene.
+   For example, if the machine has several NUMA domains, it is important to
+   restrict Graphene runs to only one NUMA domain, e.g., via the command
+   ``numactl --cpunodebind=0 --membind=0``. Otherwise Graphene may spread
+   enclave threads and enclave memory across several NUMA domains, which will
+   lead to higher memory access latencies and overall worse performance.
+
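As a rough illustration of the NUMA advice above, the following shell sketch builds the ``numactl`` prefix quoted in the text for a chosen node. The helper name ``numa_prefix`` is hypothetical, and the Graphene command itself is left as a placeholder because it depends on your setup.

```shell
#!/bin/sh
# Print a numactl prefix that binds both CPUs and memory to one NUMA node.
# `numa_prefix` is a hypothetical helper name; the numactl flags are the
# ones quoted in the documentation text.
numa_prefix() {
    node="${1:-0}"   # NUMA node to bind to; defaults to node 0
    printf 'numactl --cpunodebind=%s --membind=%s\n' "$node" "$node"
}

# Inspect the NUMA layout first (nodes, CPUs and memory per node):
#   numactl --hardware
# Then prepend the prefix to your usual Graphene command line:
#   $(numa_prefix 0) <your Graphene command>
numa_prefix 0
```

Binding CPUs and memory to the same node keeps enclave threads close to the memory they access; choosing node 0 here is only an example, and any single node with enough cores and memory should serve the same purpose.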
 Other considerations
 --------------------
 
@@ -372,6 +379,17 @@ non-debug configuration (again, typically simple ``make SGX=1`` is sufficient).
 Finally, disable the debug log of Graphene by specifying the manifest option
 ``loader.log_level = "none"``.
 
+Several manifest options may improve the performance of some workloads. These
+options include:
+
+- ``libos.check_invalid_pointers = 0`` -- disable checks of invalid pointers on
+  system call invocations. Most real-world applications never provide invalid
+  arguments to system calls, so there is no need for additional checks.
+- ``sgx.preheat_enclave = 1`` -- pre-fault all enclave pages during enclave
+  initialization. This shifts the overhead of page faults on non-present enclave
+  pages from runtime to enclave startup time. Using this option makes sense only
+  if the whole enclave memory fits into :term:`EPC`.
+
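Taken together, the performance-related options mentioned in this section might look as follows in a manifest. This is only a sketch: it assumes the manifest accepts ``#`` comments, the values are the ones quoted in the text, and whether each option actually helps depends on the workload.

```toml
# Disable the Graphene debug log.
loader.log_level = "none"

# Skip invalid-pointer checks on system call invocations.
libos.check_invalid_pointers = 0

# Pre-fault all enclave pages at startup; useful only if the whole
# enclave memory fits into EPC.
sgx.preheat_enclave = 1

# Enclave size; the text suggests increasing it if the application
# fails for seemingly unrelated reasons.
sgx.enclave_size = "512M"
```

Because ``sgx.preheat_enclave`` moves page-fault cost to startup, it is worth measuring startup time and steady-state throughput separately before keeping it enabled.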
 If your application periodically fails and complains about seemingly irrelevant
 things, it may be due to insufficient enclave memory. Please try to increase
 enclave size by tweaking ``sgx.enclave_size = "512M"``,
@@ -510,8 +528,8 @@ Further reading
 SGX profiling
 -------------
 
-There is some experimental support for profiling the code inside the SGX
-enclave. Here is how to use it:
+There is support for profiling the code inside the SGX enclave. Here is how to
+use it:
 
 #. Compile Graphene with ``SGX=1 DEBUG=1``.
 
@@ -527,11 +545,12 @@ enclave. Here is how to use it:
 
 #. Run ``perf report -i <data file>`` (see :ref:`perf` above).
 
-*Note*: The accuracy of this tool is unclear. The SGX profiling works by
-measuring the value of instruction pointer on each asynchronous enclave exit
-(AEX), which happens on Linux scheduler interrupts, as well as other events such
-as page faults. While we attempt to measure time (and not only count
-occurrences), the results might be inaccurate.
+*Note*: The accuracy of this tool is unclear (though our experience with it has
+been positive so far). SGX profiling works by measuring the value of the
+instruction pointer on each asynchronous enclave exit (AEX), which happens on
+Linux scheduler interrupts as well as on other events such as page faults. While
+we attempt to measure time (and not only count occurrences), the results might
+be inaccurate.
 
 .. _sgx-profile-ocall: