
sys.settrace suffers quadratic behavior for large dictionary literals on 3.12+ #127953

Open
nedbat opened this issue Dec 14, 2024 · 15 comments
Labels: 3.12, 3.13, 3.14, interpreter-core (Objects, Python, Grammar, and Parser dirs), type-bug

Comments

@nedbat (Member) commented Dec 14, 2024

Bug report

Bug description:

With a trace function installed via sys.settrace, performance suffers when importing a large dictionary literal: the import time exhibits quadratic complexity on 3.12 and up. It is fine on 3.11.

This originated with a coverage.py bug report: nedbat/coveragepy#1906

To reproduce, create make_mapping.py:

"""
Create a really large dict to import.
"""

import random

print("MAPPING = {")
for n in range(25_000):
    print(f"    '{n:08d}': {n},")
print("}")

and trace_it.py:

import linecache, sys, time

t0 = tlast = time.monotonic_ns()

def trace(frame, event, arg):
    global tlast
    # Only report events from the generated mapping.py module.
    if "mapping.py" in frame.f_code.co_filename:
        lineno = frame.f_lineno
        now = time.monotonic_ns()
        print("{:12d} ({}): {} {:05d}".format(
            now - t0, now - tlast,
            event[:4], lineno,
        ))
        tlast = now
    return trace  # keep tracing lines inside the new frame

print(sys.version)
sys.settrace(trace)

import mapping

Then create the large mapping, run trace_it.py under multiple versions, and compare the timings:

python3 make_mapping.py > mapping.py
python3.11 trace_it.py > 11.out
python3.12 trace_it.py > 12.out
python3.13 trace_it.py > 13.out
python3.14 trace_it.py > 14.out
grep 'line [0-9][05]000' *.out

These are the results I see. The parenthetical numbers are nanoseconds since the last output line:

11.out:    76131833 (1917): line 05000
11.out:    86116875 (917): line 10000
11.out:    96115250 (917): line 15000
11.out:   106099958 (958): line 20000
11.out:   116161416 (875): line 25000

12.out:   803728958 (148833): line 05000
12.out:  3009835500 (293042): line 10000
12.out:  6750690750 (444833): line 15000
12.out: 11966781625 (596292): line 20000
12.out: 18720473667 (739000): line 25000

13.out:   443250333 (75125): line 05000
13.out:  1565451416 (147583): line 10000
13.out:  3447408625 (220625): line 15000
13.out:  6130466958 (294458): line 20000
13.out:  9455595416 (370500): line 25000

14.out:   445441625 (89291): line 05000
14.out:  1556189584 (152250): line 10000
14.out:  3410512125 (220625): line 15000
14.out:  6007919334 (366584): line 20000
14.out:  9369414792 (375083): line 25000

CPython versions tested on:

3.11, 3.12, 3.13, 3.14

Operating systems tested on:

macOS


@nedbat added the type-bug label Dec 14, 2024
@picnixz added the interpreter-core, 3.12, 3.13, and 3.14 labels Dec 14, 2024
@nedbat (Member, Author) commented Dec 15, 2024

I bisected CPython and it is perhaps no surprise that the problem started with:

commit 411b1692811b2ecac59cb0df0f920861c7cf179a
Author: Mark Shannon <[email protected]>
Date:   Wed Apr 12 12:04:55 2023 +0100

    GH-103082: Implementation of PEP 669: Low Impact Monitoring for CPython (GH-103083)

/cc @markshannon

@gaogaotiantian (Member) commented:
This is a known issue, which I tried to alleviate in #118127. The problem is getting the line number: with the new mechanism, if a function (or module) is very long, calculating the line number becomes expensive. We could solve this for good, but we would need extra run-time memory (plenty of it) to remember the line numbers. Considering how rare this case is, the trade-off might not be worth it.

@nedbat (Member, Author) commented Dec 15, 2024

Considering how rare this case is, the trade-off might not be worth it.

Do you have a suggestion for the person who cannot move up from 3.11?

@gaogaotiantian (Member) commented:
Is ruling out that specific file a possible solution? If coverage.py does not access f_lineno for events on that file, the speed should be normal. If it's a file that contains a giant mapping, I assume coverage for that specific file is not needed.
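For example, a .coveragerc like this (a minimal sketch, assuming the generated file is named mapping.py) uses coverage.py's standard omit option to skip the file entirely:

[run]
omit =
    */mapping.py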

I know this is not a perfect solution; maybe there is a way to solve this better. I did not manage to do it without sacrificing too much memory, but that might just be because of my limited knowledge of the topic. Anyone should feel free to try it out - I think I explained the reason for this slowdown in the PR mentioned above.

@nedbat (Member, Author) commented Dec 16, 2024

Dup of #107674 (which I also wrote, oops!)

@nedbat closed this as completed Dec 16, 2024
@nedbat (Member, Author) commented Dec 16, 2024

(Continuing here for a moment...) If I change the trace_it.py code to return None, then 11.out and 12.out both are very short:

3.11.11 (main, Dec  3 2024, 19:47:16) [Clang 16.0.0 (clang-1600.0.26.4)]
    91913042 (91913042): call 00000

and

3.12.8 (main, Dec  3 2024, 19:49:11) [Clang 16.0.0 (clang-1600.0.26.4)]
    81839750 (81839750): call 00000

but 3.12 still takes a long time to complete. Are we calculating the line number even when we don't need to call the trace function? This simulates omitting the large file from coverage.py, so that workaround doesn't solve the problem.
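For reference, the modification is just the trace function's return value:

def trace(frame, event, arg):
    ...  # same timing/printing body as before
    return None  # no local trace function, so no per-line events for this frame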

@nedbat (Member, Author) commented Dec 16, 2024

BTW, the same is true of tracing very long functions: even returning None from the trace function, 3.12 takes much longer than 3.11. The effect isn't as pronounced as with a dict literal.

@gaogaotiantian (Member) commented:
That's because the optimization was not backported (unfortunately, it's not a bug fix). It should be different on 3.13 - and we can't really do anything about 3.12.

@gaogaotiantian (Member) commented Dec 16, 2024

And yes, as I mentioned in the comment above, it's not about the literal, it's about the line number. Any function- or module-level code that's really long will trigger the issue. Calculating the line number of the current bytecode is an O(N) operation as of 3.12.

@nedbat (Member, Author) commented Dec 16, 2024

Thanks, 3.13 does go quickly when not tracing. Can we talk more about ways to prevent the problem when tracing? Can you point me to the code so I can try to understand the underlying issue? Would increased memory use be a problem? Wouldn't it be a small fraction of the code's memory footprint?

@gaogaotiantian (Member) commented:
The fundamental issue is that you need a mapping from an instruction to a line number. If you take a look at PyCode_Addr2Line, that's how we get the line number of a specific instruction based on its offset, and that's O(N). So, without any optimization, getting the line number of an instruction would be O(N), unless you propose to store a huge mapping from address to line number for each bytecode.
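In pure-Python terms the walk looks roughly like this (a sketch built on the documented co_lines() API, not CPython's internal C code):

def addr_to_line(code, offset):
    # Walk the line table from the start, the way PyCode_Addr2Line does:
    # each (start, end, line) triple covers a range of bytecode offsets,
    # so a lookup near the end of a huge code object scans almost all of it.
    for start, end, line in code.co_lines():
        if start <= offset < end:
            return line
    return None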

The original optimization is to calculate it once (we need to do it anyway for instrumentation) and store the data somewhere. First of all, we don't want to impact memory use when there's no instrumentation, so it does not live with the code object directly. For each instrumented line of code, we have a _PyCoLineInstrumentationData, which stores the original opcode behind INSTRUMENTED_LINE and an offset from the code object's co_firstlineno, both as uint8. A heuristic algorithm does a linear estimate and then adds this offset, so in most cases (where a function is less than a few hundred lines) we can get the line number in O(1).

However, this estimate can be off when there are too many lines in a code object, in which case we fall back to the O(N) calculation.
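A rough sketch of that scheme, with illustrative names and an assumed sentinel value (the real fields live in _PyCoLineInstrumentationData in C):

FALLBACK = 255  # assumed sentinel: the correction did not fit in one byte

def fast_line(code, instr_index, deltas, lines_per_instr):
    # Linear estimate from the instruction index, corrected by the
    # per-instruction uint8 delta recorded at instrumentation time.
    estimate = code.co_firstlineno + int(instr_index * lines_per_instr)
    delta = deltas[instr_index]
    if delta == FALLBACK:
        # Estimate too far off: fall back to the O(N) walk (addr_to_line above).
        return addr_to_line(code, instr_index * 2)
    return estimate + delta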

One way to solve this is to widen the offset from uint8 to uint16 so it can cover tens of thousands of lines of code. However, that would make _PyCoLineInstrumentationData 50% larger (not counting padding; it may be more, since it would be 3 bytes of data).

In the best case, it would be 1 byte per line with instrumentation, which probably results in a KB-level run-time memory cost. I don't think that's unacceptable, but the benefit is also insignificant - it only helps tracing speed when a function is more than a few hundred lines, which is rare (and not recommended). Probably the only real-life scenario is a coverage tool over a constant mapping. Note that a debugger will not trigger the issue after the optimization I made (because it does not access the line number for every line event).

So like I mentioned above, I just don't know if this is worth the effort. And of course, if we decide to fix it, the fix would be trivial. It's more about whether we should do it.

I can add @markshannon to the discussion about the issue.

@devdanzin (Contributor) commented Dec 17, 2024

So, without any optimization, getting the line number of an instruction would be O(N), unless you propose to store a huge mapping from address to line number for each bytecode.

Playing with ways to map instructions to line numbers, I figured that it seemed cheaper (and faster to build) to store a tuple of size N-instructions where the value at each index is the line number. So retrieval should be O(1) and probably more compact than a mapping, right?

# Instruction -> line number:
instr_to_line = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 4}

# (line_number, line_number, ...):
as_tuple = (1, 1, 2, 2, 2, 3, 4)

Edit: I found a gist where I recorded this idea: https://gist.github.com/devdanzin/591394b798b38a332ec3ecdef0f6f082

@gaogaotiantian (Member) commented:
This actually costs more than what I mentioned above (by mapping I did not mean a hashmap, simply a math-level mapping, a one-to-one relation).

The method I mentioned above costs 1 byte per line; the tuple solution takes a Python integer per bytecode, which is probably 100x larger on average :)
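A back-of-the-envelope comparison of just the container costs (the tuple additionally keeps the int objects themselves alive; values above 255 are capped here only so the byte version is constructible):

import sys

n = 25_000
as_tuple = tuple(i // 3 for i in range(n))            # one 8-byte pointer per entry
as_bytes = bytes(min(i // 3, 255) for i in range(n))  # one byte per entry

print(sys.getsizeof(as_tuple))  # ~200 KB, before counting the ints
print(sys.getsizeof(as_bytes))  # ~25 KB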

@markshannon (Member) commented:
This is still an issue.
I don't think it is a duplicate of #107674, as that issue appears to be fixed but this one isn't.

@markshannon reopened this Dec 30, 2024
@markshannon (Member) commented:
Possible solutions:

  1. Increase the size of the line number table for sys.monitoring to 2 bytes. This works up to 64k lines, but still exhibits the extreme slowdown for files larger than 64k lines (pytest can generate very large files).
  2. Add markers at intervals to the locations table. This allows ~O(log n) lookup of line numbers at a ~10% increase in the size of the locations table for all code objects (see the sketch below this list).
  3. Vary the size of the line number table for sys.monitoring, using as many bytes as necessary. This uses no extra memory in most cases, but fixes the slowdown (see the sketch at the end of this comment).
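A pure-Python illustration of option 2's marker idea (operating on the documented co_lines() ranges; the real change would live inside the C locations table, and INTERVAL is an assumed spacing):

import bisect

INTERVAL = 16  # assumed marker spacing

def make_markers(code):
    # Extract the (start, end, line) ranges once, and record the starting
    # offset of every INTERVAL-th range as a marker.
    ranges = list(code.co_lines())
    markers = [ranges[i][0] for i in range(0, len(ranges), INTERVAL)]
    return ranges, markers

def line_at(offset, ranges, markers):
    # Binary-search the markers, then scan at most INTERVAL ranges:
    # ~O(log n) per lookup instead of a full O(n) walk from offset 0.
    i = (bisect.bisect_right(markers, offset) - 1) * INTERVAL
    for start, end, line in ranges[i:i + INTERVAL]:
        if start <= offset < end:
            return line
    return None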

Of the above, 1 is the simplest but still has the same fundamental flaw. 2 is possibly the most complex, but would speed up all line number lookups, not just those for sys.monitoring. 3 is of intermediate complexity and solves the problem.

I think the deciding factor in favor of 3 is that solution 2 cannot be backported.
Once we've implemented and backported 3, we could still implement 2 if we wanted.
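And a sketch of option 3's variable-width table (an illustrative layout, not CPython's actual instrumentation data: each line's offset from co_firstlineno is stored in the smallest entry width that fits the whole code object):

def build_line_table(line_offsets):
    # Small code objects still pay only 1 byte per instrumented line;
    # huge modules pay 2+ bytes per line but keep O(1) lookups.
    width = 1
    while max(line_offsets) >= 1 << (8 * width):
        width += 1
    table = b"".join(off.to_bytes(width, "little") for off in line_offsets)
    return width, table

def table_line(code, width, table, index):
    entry = table[index * width:(index + 1) * width]
    return code.co_firstlineno + int.from_bytes(entry, "little")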
