gh-95913: Edit Faster CPython section in 3.11 WhatsNew #98429

Merged
merged 8 commits into from
Mar 7, 2023
186 changes: 109 additions & 77 deletions Doc/whatsnew/3.11.rst
@@ -1162,14 +1162,17 @@ Optimizations
Faster CPython
==============

CPython 3.11 is on average `25% faster <https://github.com/faster-cpython/ideas#published-results>`_
than CPython 3.10 when measured with the
CPython 3.11 is an average of
`25% faster <https://github.com/faster-cpython/ideas#published-results>`_
than CPython 3.10 as measured with the
`pyperformance <https://github.com/python/pyperformance>`_ benchmark suite,
and compiled with GCC on Ubuntu Linux. Depending on your workload, the speedup
could be up to 10-60% faster.
when compiled with GCC on Ubuntu Linux.
Depending on your workload, the overall speedup could be 10-60%.

This project focuses on two major areas in Python: faster startup and faster
runtime. Other optimizations not under this project are listed in `Optimizations`_.
This project focuses on two major areas in Python:
:ref:`whatsnew311-faster-startup` and :ref:`whatsnew311-faster-runtime`.
Optimizations not covered by this project are listed separately under
:ref:`whatsnew311-optimizations`.


.. _whatsnew311-faster-startup:
@@ -1182,8 +1185,8 @@ Faster Startup
Frozen imports / Static code objects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python caches bytecode in the :ref:`__pycache__<tut-pycache>` directory to
speed up module loading.
Python caches :term:`bytecode` in the :ref:`__pycache__ <tut-pycache>`
directory to speed up module loading.

Previously in 3.10, Python module execution looked like this:

@@ -1192,8 +1195,9 @@ Previously in 3.10, Python module execution looked like this:
Read __pycache__ -> Unmarshal -> Heap allocated code object -> Evaluate

In Python 3.11, the core modules essential for Python startup are "frozen".
This means that their code objects (and bytecode) are statically allocated
by the interpreter. This reduces the steps in module execution process to this:
This means that their :ref:`codeobjects` (and bytecode)
are statically allocated by the interpreter.
This reduces the steps in module execution process to:

.. code-block:: text

@@ -1202,7 +1206,7 @@ by the interpreter. This reduces the steps in module execution process to this:
Interpreter startup is now 10-15% faster in Python 3.11. This has a big
impact for short-running programs using Python.

(Contributed by Eric Snow, Guido van Rossum and Kumar Aditya in numerous issues.)
(Contributed by Eric Snow, Guido van Rossum and Kumar Aditya in many issues.)
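
A minimal sketch for checking whether a module is provided by the frozen importer, using ``importlib.machinery.FrozenImporter.find_spec``. The helper name ``is_frozen`` is hypothetical and not part of this change, and which standard-library modules are frozen depends on the Python version and build options.

```python
from importlib.machinery import FrozenImporter


def is_frozen(module_name):
    """Return True if the frozen importer can provide ``module_name``.

    ``is_frozen`` is a hypothetical helper used only for illustration.
    """
    return FrozenImporter.find_spec(module_name) is not None


if __name__ == "__main__":
    # The import machinery itself is frozen; whether modules such as
    # ``os`` are frozen depends on the interpreter version and build.
    for name in ("_frozen_importlib", "os", "json"):
        print(name, is_frozen(name))
```

Running this on both 3.10 and 3.11 is one way to see which startup modules moved to the frozen path on a given build.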


.. _whatsnew311-faster-runtime:
@@ -1215,17 +1219,19 @@ Faster Runtime
Cheaper, lazy Python frames
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python frames are created whenever Python calls a Python function. This frame
holds execution information. The following are new frame optimizations:
Python frames, holding execution information,
are created whenever Python calls a Python function.
The following are new frame optimizations:

- Streamlined the frame creation process.
- Avoided memory allocation by generously re-using frame space on the C stack.
- Streamlined the internal frame struct to contain only essential information.
Frames previously held extra debugging and memory management information.

Old-style frame objects are now created only when requested by debuggers or
by Python introspection functions such as ``sys._getframe`` or
``inspect.currentframe``. For most user code, no frame objects are
Old-style :ref:`frame objects <frame-objects>`
are now created only when requested by debuggers
or by Python introspection functions such as :func:`sys._getframe` and
:func:`inspect.currentframe`. For most user code, no frame objects are
created at all. As a result, nearly all Python function calls have sped
up significantly. We measured a 3-7% speedup in pyperformance.
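
As an illustration of the introspection functions named above, the sketch below requests frame objects explicitly. The helper names are hypothetical; the point is only that a full frame object is materialized when code asks for one, whereas ordinary calls never touch it.

```python
import inspect
import sys


def current_function_name():
    """Ask for the current frame object and read the code name from it."""
    frame = inspect.currentframe()
    try:
        return frame.f_code.co_name
    finally:
        # Drop the reference promptly to avoid keeping the frame alive.
        del frame


def caller_name():
    """Return the name of the function that called this one."""
    return sys._getframe(1).f_code.co_name


def outer():
    return caller_name()


if __name__ == "__main__":
    print(current_function_name())
    print(outer())
```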

@@ -1246,10 +1252,11 @@ In 3.11, when CPython detects Python code calling another Python function,
it sets up a new frame, and "jumps" to the new code inside the new frame. This
avoids calling the C interpreting function altogether.

Most Python function calls now consume no C stack space. This speeds up
most of such calls. In simple recursive functions like fibonacci or
factorial, a 1.7x speedup was observed. This also means recursive functions
can recurse significantly deeper (if the user increases the recursion limit).
Most Python function calls now consume no C stack space, speeding them up.
In simple recursive functions like fibonacci or
factorial, we observed a 1.7x speedup. This also means recursive functions
can recurse significantly deeper
(if the user increases the recursion limit with :func:`sys.setrecursionlimit`).
We measured a 1-3% improvement in pyperformance.
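
A small sketch of raising the recursion limit temporarily with ``sys.setrecursionlimit``. The context manager ``at_least_recursion_limit`` is a hypothetical helper, and the depth used here is deliberately modest; how deep a given build can safely recurse is not something this example establishes.

```python
import sys
from contextlib import contextmanager


@contextmanager
def at_least_recursion_limit(limit):
    """Temporarily ensure the recursion limit is at least ``limit``."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old_limit, limit))
    try:
        yield
    finally:
        # Always restore the previous limit, even if the body raised.
        sys.setrecursionlimit(old_limit)


def factorial(n):
    """Naive recursive factorial: one Python-to-Python call per level."""
    return 1 if n <= 1 else n * factorial(n - 1)


if __name__ == "__main__":
    with at_least_recursion_limit(10_000):
        print(factorial(3_000).bit_length())
```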

(Contributed by Pablo Galindo and Mark Shannon in :issue:`45256`.)
@@ -1260,7 +1267,7 @@ We measured a 1-3% improvement in pyperformance.
PEP 659: Specializing Adaptive Interpreter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:pep:`659` is one of the key parts of the faster CPython project. The general
:pep:`659` is one of the key parts of the Faster CPython project. The general
idea is that while Python is a dynamic language, most code has regions where
objects and types rarely change. This concept is known as *type stability*.

@@ -1269,17 +1276,18 @@ in the executing code. Python will then replace the current operation with a
more specialized one. This specialized operation uses fast paths available only
to those use cases/types, which generally outperform their generic
counterparts. This also brings in another concept called *inline caching*, where
Python caches the results of expensive operations directly in the bytecode.
Python caches the results of expensive operations directly in the
:term:`bytecode`.

The specializer will also combine certain common instruction pairs into one
superinstruction. This reduces the overhead during execution.
superinstruction, reducing the overhead during execution.

Python will only specialize
when it sees code that is "hot" (executed multiple times). This prevents Python
from wasting time for run-once code. Python can also de-specialize when code is
from wasting time on run-once code. Python can also de-specialize when code is
too dynamic or when the use changes. Specialization is attempted periodically,
and specialization attempts are not too expensive. This allows specialization
to adapt to new circumstances.
and specialization attempts are not too expensive,
allowing specialization to adapt to new circumstances.
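
One way to observe this behaviour is a sketch along the following lines: a type-stable loop is run repeatedly, then its instructions are listed with ``dis.get_instructions``. The ``adaptive`` keyword was added in Python 3.11, so the helper falls back to the plain listing on older versions. The helper names are hypothetical, and the exact specialized opcode names are an implementation detail that may differ between releases.

```python
import dis
import sys


def add_all(values):
    total = 0
    for value in values:
        total = total + value  # same operand types on every iteration
    return total


def observed_opnames(func):
    """List opcode names, requesting the adaptive (specialized) form
    when the running interpreter supports it."""
    if sys.version_info >= (3, 11):
        instructions = dis.get_instructions(func, adaptive=True)
    else:
        instructions = dis.get_instructions(func)
    return [instruction.opname for instruction in instructions]


if __name__ == "__main__":
    # Run the function enough times for it to be considered "hot".
    for _ in range(1_000):
        add_all([1, 2, 3])
    print(observed_opnames(add_all))
```

Comparing the listing before and after the warm-up loop shows whether any instructions were replaced on the interpreter in use.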

(PEP written by Mark Shannon, with ideas inspired by Stefan Brunthaler.
See :pep:`659` for more information. Implementation by Mark Shannon and Brandt
@@ -1292,32 +1300,32 @@ Bucher, with additional help from Irit Katriel and Dennis Sweeney.)
| Operation | Form | Specialization | Operation speedup | Contributor(s) |
| | | | (up to) | |
+===============+====================+=======================================================+===================+===================+
| Binary | ``x+x; x*x; x-x;`` | Binary add, multiply and subtract for common types | 10% | Mark Shannon, |
| operations | | such as ``int``, ``float``, and ``str`` take custom | | Dong-hee Na, |
| | | fast paths for their underlying types. | | Brandt Bucher, |
| Binary | ``x + x`` | Binary add, multiply and subtract for common types | 10% | Mark Shannon, |
| operations | | such as :class:`int`, :class:`float` and :class:`str` | | Dong-hee Na, |
| | ``x - x`` | take custom fast paths for their underlying types. | | Brandt Bucher, |
| | | | | Dennis Sweeney |
| | ``x * x`` | | | |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Subscript | ``a[i]`` | Subscripting container types such as ``list``, | 10-25% | Irit Katriel, |
| | | ``tuple`` and ``dict`` directly index the underlying | | Mark Shannon |
| | | data structures. | | |
| Subscript | ``a[i]`` | Subscripting container types such as :class:`list`, | 10-25% | Irit Katriel, |
| | | :class:`tuple` and :class:`dict` directly index | | Mark Shannon |
| | | the underlying data structures. | | |
| | | | | |
| | | Subscripting custom ``__getitem__`` | | |
| | | Subscripting custom :meth:`~object.__getitem__` | | |
| | | is also inlined similar to :ref:`inline-calls`. | | |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Store | ``a[i] = z`` | Similar to subscripting specialization above. | 10-25% | Dennis Sweeney |
| subscript | | | | |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Calls | ``f(arg)`` | Calls to common builtin (C) functions and types such | 20% | Mark Shannon, |
| | ``C(arg)`` | as ``len`` and ``str`` directly call their underlying | | Ken Jin |
| | | C version. This avoids going through the internal | | |
| | | calling convention. | | |
| | | | | |
| | | as :func:`len` and :class:`str` directly call their | | Ken Jin |
| | ``C(arg)`` | underlying C version. This avoids going through the | | |
| | | internal calling convention. | | |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Load | ``print`` | The object's index in the globals/builtins namespace | [1]_ | Mark Shannon |
| global | ``len`` | is cached. Loading globals and builtins require | | |
| variable | | zero namespace lookups. | | |
| Load | ``print`` | The object's index in the globals/builtins namespace | [#load-global]_ | Mark Shannon |
| global | | is cached. Loading globals and builtins require | | |
| variable | ``len`` | zero namespace lookups. | | |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Load | ``o.attr`` | Similar to loading global variables. The attribute's | [2]_ | Mark Shannon |
| Load | ``o.attr`` | Similar to loading global variables. The attribute's | [#load-attr]_ | Mark Shannon |
| attribute | | index inside the class/object's namespace is cached. | | |
| | | In most cases, attribute loading will require zero | | |
| | | namespace lookups. | | |
@@ -1329,14 +1337,15 @@ Bucher, with additional help from Irit Katriel and Dennis Sweeney.)
| Store | ``o.attr = z`` | Similar to load attribute optimization. | 2% | Mark Shannon |
| attribute | | | in pyperformance | |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Unpack | ``*seq`` | Specialized for common containers such as ``list`` | 8% | Brandt Bucher |
| Sequence | | and ``tuple``. Avoids internal calling convention. | | |
| Unpack | ``*seq`` | Specialized for common containers such as | 8% | Brandt Bucher |
| Sequence | | :class:`list` and :class:`tuple`. | | |
| | | Avoids internal calling convention. | | |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+

.. [1] A similar optimization already existed since Python 3.8. 3.11
specializes for more forms and reduces some overhead.
.. [#load-global] A similar optimization already existed since Python 3.8.
3.11 specializes for more forms and reduces some overhead.

.. [2] A similar optimization already existed since Python 3.10.
.. [#load-attr] A similar optimization already existed since Python 3.10.
3.11 specializes for more forms. Furthermore, all attribute loads should
be sped up by :issue:`45947`.

@@ -1346,49 +1355,72 @@ Bucher, with additional help from Irit Katriel and Dennis Sweeney.)
Misc
----

* Objects now require less memory due to lazily created object namespaces. Their
namespace dictionaries now also share keys more freely.
* Objects now require less memory due to lazily created object namespaces.
Their namespace dictionaries now also share keys more freely.
(Contributed by Mark Shannon in :issue:`45340` and :issue:`40116`.)

* "Zero-cost" exceptions are implemented, eliminating the cost
of :keyword:`try` statements when no exception is raised.
(Contributed by Mark Shannon in :issue:`40222`.)

* A more concise representation of exceptions in the interpreter reduced the
time required for catching an exception by about 10%.
(Contributed by Irit Katriel in :issue:`45711`.)

* :mod:`re`'s regular expression matching engine has been partially refactored,
and now uses computed gotos (or "threaded code") on supported platforms. As a
result, Python 3.11 executes the `pyperformance regular expression benchmarks
<https://pyperformance.readthedocs.io/benchmarks.html#regex-dna>`_ up to 10%
faster than Python 3.10.
(Contributed by Brandt Bucher in :gh:`91404`.)
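
A rough way to check the "zero-cost" claim on a particular interpreter is to time the same loop with and without a ``try`` block that never raises. This is only a sketch with hypothetical function names; it reports timings and makes no claim about what they will be.

```python
import timeit


def plain(values):
    total = 0
    for value in values:
        total += value
    return total


def guarded(values):
    total = 0
    for value in values:
        try:
            total += value  # never raises for integer input
        except TypeError:
            total += 0
    return total


def compare(repeat=5, number=2_000):
    """Return the best observed time in seconds for each variant."""
    data = list(range(100))
    return {
        name: min(timeit.repeat(lambda: func(data), repeat=repeat, number=number))
        for name, func in (("plain", plain), ("guarded", guarded))
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.6f}s")
```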


.. _whatsnew311-faster-cpython-faq:

FAQ
---

| Q: How should I write my code to utilize these speedups?
|
| A: You don't have to change your code. Write Pythonic code that follows common
best practices. The Faster CPython project optimizes for common code
patterns we observe.
|
|
| Q: Will CPython 3.11 use more memory?
|
| A: Maybe not. We don't expect memory use to exceed 20% more than 3.10.
This is offset by memory optimizations for frame objects and object
dictionaries as mentioned above.
|
|
| Q: I don't see any speedups in my workload. Why?
|
| A: Certain code won't have noticeable benefits. If your code spends most of
its time on I/O operations, or already does most of its
computation in a C extension library like numpy, there won't be significant
speedup. This project currently benefits pure-Python workloads the most.
|
| Furthermore, the pyperformance figures are a geometric mean. Even within the
pyperformance benchmarks, certain benchmarks have slowed down slightly, while
others have sped up by nearly 2x!
|
|
| Q: Is there a JIT compiler?
|
| A: No. We're still exploring other optimizations.
.. _faster-cpython-faq-my-code:

How should I write my code to utilize these speedups?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Write Pythonic code that follows common best practices;
you don't have to change your code.
The Faster CPython project optimizes for common code patterns we observe.


.. _faster-cpython-faq-memory:

Will CPython 3.11 use more memory?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Maybe not; we don't expect memory use to exceed 20% higher than 3.10.
This is offset by memory optimizations for frame objects and object
dictionaries as mentioned above.


.. _faster-cpython-ymmv:

I don't see any speedups in my workload. Why?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Certain code won't have noticeable benefits. If your code spends most of
its time on I/O operations, or already does most of its
computation in a C extension library like NumPy, there won't be significant
speedups. This project currently benefits pure-Python workloads the most.

Furthermore, the pyperformance figures are a geometric mean. Even within the
pyperformance benchmarks, certain benchmarks have slowed down slightly, while
others have sped up by nearly 2x!
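
To see how a geometric mean can hide individual slowdowns, consider this sketch using ``statistics.geometric_mean``. The benchmark names and ratios are hypothetical and are not pyperformance results.

```python
from statistics import geometric_mean

# Hypothetical per-benchmark speedup ratios (old time / new time).
# A value below 1.0 means that benchmark became slower.
HYPOTHETICAL_SPEEDUPS = {
    "bench_a": 0.97,
    "bench_b": 1.10,
    "bench_c": 1.90,
}


def overall_speedup(ratios):
    """Summarize per-benchmark ratios as a single geometric mean."""
    return geometric_mean(ratios)


if __name__ == "__main__":
    print(overall_speedup(HYPOTHETICAL_SPEEDUPS.values()))
```

A single summary figure therefore says little about any one workload: a benchmark that got slower and one that got much faster both feed into the same average.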


.. _faster-cpython-jit:
Member

Just for my own understanding, these are so that the hyperlink generates nicer links right?

Member Author

These are ref target labels, which allow linking to and cross-referencing this FAQ answer directly and stably using the :ref: role (:ref:`faster-cpython-jit`), both within this document (as with implicit hyperlinks, but much more robustly) and from anywhere else within the CPython documentation, as well as from any other docs that link to this one via Intersphinx (as many do). This reference will continue to work even if this section is later moved, renamed, combined with another section, split into multiple, etc.

Furthermore, it means external links to this fragment id (i.e. using #faster-cpython-jit) will continue to work if the section is renamed on the same page, and it also allows automatically setting up external-link redirects to different pages via some tooling I'm working on. Finally, if it were to break for whatever reason, it will generate an optional warning rather than simply breaking silently.


Is there a JIT compiler?
^^^^^^^^^^^^^^^^^^^^^^^^

No. We're still exploring other optimizations.


.. _whatsnew311-faster-cpython-about: