
sys.settrace suffers quadratic behavior for large dictionary literals on 3.12+ #127953

Open
nedbat opened this issue Dec 14, 2024 · 15 comments
Labels: 3.12, 3.13, 3.14, interpreter-core (Objects, Python, Grammar, and Parser dirs), type-bug

Comments

@nedbat (Member) commented Dec 14, 2024

Bug report

Bug description:

With a trace function installed via sys.settrace, performance suffers when importing a large dictionary literal: the import time exhibits quadratic complexity on 3.12 and up. It is fine on 3.11.

This originated with a coverage.py bug report: nedbat/coveragepy#1906

To reproduce, create make_mapping.py:

"""
Create a really large dict to import.
"""

import random

print("MAPPING = {")
for n in range(25_000):
    print(f"    '{n:08d}': {n},")
print("}")

and trace_it.py:

import linecache, sys, time

t0 = tlast = time.monotonic_ns()

def trace(frame, event, arg):
    global tlast
    # Only report events from the generated mapping.py module.
    if "mapping.py" in frame.f_code.co_filename:
        lineno = frame.f_lineno
        now = time.monotonic_ns()
        print("{:12d} ({}): {} {:05d}".format(
            now - t0, now - tlast,
            event[:4], lineno,
        ))
        tlast = now
    return trace  # keep tracing lines inside the new frame

print(sys.version)
sys.settrace(trace)

import mapping

Then create the large mapping, run trace_it.py under multiple versions, and compare the timings:

python3 make_mapping.py > mapping.py
python3.11 trace_it.py > 11.out
python3.12 trace_it.py > 12.out
python3.13 trace_it.py > 13.out
python3.14 trace_it.py > 14.out
grep 'line [0-9][05]000' *.out

These are the results I see. The parenthetical numbers are nanoseconds since the last output line:

11.out:    76131833 (1917): line 05000
11.out:    86116875 (917): line 10000
11.out:    96115250 (917): line 15000
11.out:   106099958 (958): line 20000
11.out:   116161416 (875): line 25000

12.out:   803728958 (148833): line 05000
12.out:  3009835500 (293042): line 10000
12.out:  6750690750 (444833): line 15000
12.out: 11966781625 (596292): line 20000
12.out: 18720473667 (739000): line 25000

13.out:   443250333 (75125): line 05000
13.out:  1565451416 (147583): line 10000
13.out:  3447408625 (220625): line 15000
13.out:  6130466958 (294458): line 20000
13.out:  9455595416 (370500): line 25000

14.out:   445441625 (89291): line 05000
14.out:  1556189584 (152250): line 10000
14.out:  3410512125 (220625): line 15000
14.out:  6007919334 (366584): line 20000
14.out:  9369414792 (375083): line 25000

CPython versions tested on:

3.11, 3.12, 3.13, 3.14

Operating systems tested on:

macOS


@nedbat added the type-bug label Dec 14, 2024
@picnixz added the interpreter-core, 3.12, 3.13, and 3.14 labels Dec 14, 2024
@nedbat (Member, Author) commented Dec 15, 2024

I bisected CPython and it is perhaps no surprise that the problem started with:

commit 411b1692811b2ecac59cb0df0f920861c7cf179a
Author: Mark Shannon <[email protected]>
Date:   Wed Apr 12 12:04:55 2023 +0100

    GH-103082: Implementation of PEP 669: Low Impact Monitoring for CPython (GH-103083)

/cc @markshannon

@gaogaotiantian (Member) commented:
This is a known issue, which I tried to alleviate in #118127. The problem is getting the line number: with the new mechanism, if a function (or module) is very long, calculating the line number becomes expensive. We could solve this for good, but we would need extra run-time memory (plenty of it) to remember the line numbers. Considering how rare this case is, the trade-off might not be worth it.

@nedbat (Member, Author) commented Dec 15, 2024

Considering how rare this case is, the trade-off might not be worth it.

Do you have a suggestion for the person who cannot move up from 3.11?

@gaogaotiantian (Member) commented:
Is ruling out that specific file a possible solution? If coverage.py does not access f_lineno for events on that file, the speed should be normal. If it's a file that contains a giant mapping, I assume coverage for that specific file is not needed.
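For example, a .coveragerc like this (a minimal sketch, assuming the generated file is named mapping.py) uses coverage.py's standard omit option to skip the file entirely:

[run]
omit =
    */mapping.py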

I know this is not a perfect solution; maybe there is a way to solve this better. I did not manage to do it without sacrificing too much memory, but that might just be because of my limited knowledge of the topic. Anyone should feel free to try it out - I think I explained the reason for this slowdown in the PR mentioned above.

@nedbat (Member, Author) commented Dec 16, 2024

Dup of #107674 (which I also wrote, oops!)

@nedbat closed this as completed Dec 16, 2024
@nedbat (Member, Author) commented Dec 16, 2024

(Continuing here for a moment...) If I change the trace_it.py code to return None, then 11.out and 12.out both are very short:

3.11.11 (main, Dec  3 2024, 19:47:16) [Clang 16.0.0 (clang-1600.0.26.4)]
    91913042 (91913042): call 00000

and

3.12.8 (main, Dec  3 2024, 19:49:11) [Clang 16.0.0 (clang-1600.0.26.4)]
    81839750 (81839750): call 00000

but 3.12 still takes a long time to complete. Are we calculating the line number even when we don't need to call the trace function? This simulates omitting the large file from coverage.py, so that workaround doesn't solve the problem.
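For reference, the modification is just the trace function's return value:

def trace(frame, event, arg):
    ...  # same timing/printing body as before
    return None  # no local trace function, so no per-line events for this frame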

@nedbat (Member, Author) commented Dec 16, 2024

BTW, the same is true of tracing very long functions: even returning None from the trace function, 3.12 takes much longer than 3.11. The effect isn't as pronounced as with a dict literal.

@gaogaotiantian (Member) commented:
That's because the optimization was not backported (unfortunately, it's not a bug fix). It should be different on 3.13 - and we can't really do anything about 3.12.

@gaogaotiantian (Member) commented Dec 16, 2024

And yes, as I mentioned in the comment above, it's not about the literal, it's about the line number. Any function- or module-level code that's really long will trigger the issue. Calculating the line number of the current bytecode is an O(N) operation as of 3.12.

@nedbat (Member, Author) commented Dec 16, 2024

Thanks, 3.13 does go quickly when not tracing. Can we talk more about ways to prevent the problem when tracing? Can you point me to the code so I can try to understand the underlying issue? Would increased memory use be a problem? Wouldn't it be a small fraction of the code's memory footprint?

@gaogaotiantian (Member) commented:
The fundamental issue is that you need a mapping from an instruction to a line number. If you take a look at PyCode_Addr2Line, that's how we get the line number of a specific instruction based on its offset, and that's O(N). So, without any optimization, getting the line number of an instruction would be O(N), unless you propose to store a huge mapping from address to line number for each bytecode.
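In pure-Python terms the walk looks roughly like this (a sketch built on the documented co_lines() API, not CPython's internal C code):

def addr_to_line(code, offset):
    # Walk the line table from the start, the way PyCode_Addr2Line does:
    # each (start, end, line) triple covers a range of bytecode offsets,
    # so a lookup near the end of a huge code object scans almost all of it.
    for start, end, line in code.co_lines():
        if start <= offset < end:
            return line
    return None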

The original optimization is to calculate it once (we need to do it anyway for instrumentation) and store the data somewhere. First of all, we don't want to impact memory use when there's no instrumentation, so it does not live with the code object directly. For each instrumented line of code, we have a _PyCoLineInstrumentationData, which stores the original opcode behind INSTRUMENTED_LINE and an offset from the code object's co_firstlineno, both as uint8. A heuristic algorithm does a linear estimate and then adds this offset, so in most cases (where a function is less than a few hundred lines) we can get the line number in O(1).

However, this estimate can be off when there are too many lines in a code object, in which case we fall back to the O(N) calculation.
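A rough sketch of that scheme, with illustrative names and an assumed sentinel value (the real fields live in _PyCoLineInstrumentationData in C):

FALLBACK = 255  # assumed sentinel: the correction did not fit in one byte

def fast_line(code, instr_index, deltas, lines_per_instr):
    # Linear estimate from the instruction index, corrected by the
    # per-instruction uint8 delta recorded at instrumentation time.
    estimate = code.co_firstlineno + int(instr_index * lines_per_instr)
    delta = deltas[instr_index]
    if delta == FALLBACK:
        # Estimate too far off: fall back to the O(N) walk (addr_to_line above).
        return addr_to_line(code, instr_index * 2)
    return estimate + delta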

One way to solve this is to widen the offset from uint8 to uint16 so it can cover tens of thousands of lines of code. However, that would make _PyCoLineInstrumentationData 50% larger (not counting padding; it may be more, since it would be 3 bytes of data).

In the best case, it would be 1 byte per line with instrumentation, which probably results in a KB-level run-time memory cost. I don't think that's unacceptable, but the benefit is also insignificant - it only helps tracing speed when a function is more than a few hundred lines, which is rare (and not recommended). Probably the only real-life scenario is a coverage tool over a constant mapping. Note that a debugger will not trigger the issue after the optimization I made (because it does not access the line number for every line event).

So like I mentioned above, I just don't know if this is worth the effort. And of course, if we decide to fix it, the fix would be trivial. It's more about whether we should do it.

I can add @markshannon to the discussion about the issue.

@devdanzin (Contributor) commented Dec 17, 2024

So, without any optimization, getting the line number of an instruction would be O(N), unless you propose to store a huge mapping from address to line number for each bytecode.

Playing with ways to map instructions to line numbers, I figured that it seemed cheaper (and faster to build) to store a tuple of size N-instructions where the value at each index is the line number. So retrieval should be O(1) and probably more compact than a mapping, right?

# Instruction -> line number:
instr_to_line = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 4}

# (line_number, line_number, ...):
as_tuple = (1, 1, 2, 2, 2, 3, 4)

Edit: I found a gist where I recorded this idea: https://gist.github.com/devdanzin/591394b798b38a332ec3ecdef0f6f082

@gaogaotiantian (Member) commented:
This actually costs more than what I mentioned above (by mapping I did not mean a hashmap, simply a math-level mapping, a one-to-one relation).

The method I mentioned above costs 1 byte per line; the tuple solution takes a Python integer per bytecode, which is probably 100x larger on average :)
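A back-of-the-envelope comparison of just the container costs (the tuple additionally keeps the int objects themselves alive; values above 255 are capped here only so the byte version is constructible):

import sys

n = 25_000
as_tuple = tuple(i // 3 for i in range(n))            # one 8-byte pointer per entry
as_bytes = bytes(min(i // 3, 255) for i in range(n))  # one byte per entry

print(sys.getsizeof(as_tuple))  # ~200 KB, before counting the ints
print(sys.getsizeof(as_bytes))  # ~25 KB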

@markshannon (Member) commented:
This is still an issue.
I don't think it is a duplicate of #107674, as that issue appears to be fixed but this one isn't.

@markshannon reopened this Dec 30, 2024
@markshannon (Member) commented:
Possible solutions:

  1. Increase the size of the line number table for sys.monitoring to 2 bytes. This works up to 64k lines, but still exhibits the extreme slowdown for files larger than 64k lines (pytest can generate very large files).
  2. Add markers at intervals to the locations table. This allows ~O(log n) lookup of line numbers at a ~10% increase in the size of the locations table for all code objects (see the sketch below this list).
  3. Vary the size of the line number table for sys.monitoring, using as many bytes as necessary. This uses no extra memory in most cases, but fixes the slowdown (see the sketch at the end of this comment).
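A pure-Python illustration of option 2's marker idea (operating on the documented co_lines() ranges; the real change would live inside the C locations table, and INTERVAL is an assumed spacing):

import bisect

INTERVAL = 16  # assumed marker spacing

def make_markers(code):
    # Extract the (start, end, line) ranges once, and record the starting
    # offset of every INTERVAL-th range as a marker.
    ranges = list(code.co_lines())
    markers = [ranges[i][0] for i in range(0, len(ranges), INTERVAL)]
    return ranges, markers

def line_at(offset, ranges, markers):
    # Binary-search the markers, then scan at most INTERVAL ranges:
    # ~O(log n) per lookup instead of a full O(n) walk from offset 0.
    i = (bisect.bisect_right(markers, offset) - 1) * INTERVAL
    for start, end, line in ranges[i:i + INTERVAL]:
        if start <= offset < end:
            return line
    return None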

Of the above, 1 is the simplest but still has the same fundamental flaw. 2 is possibly the most complex, but would speed up all line number lookups, not just those for sys.monitoring. 3 is of intermediate complexity and solves the problem.

I think the deciding factor in favor of 3 is that solution 2 cannot be backported.
Once we've implemented and backported 3, we could still implement 2 if we wanted.
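And a sketch of option 3's variable-width table (an illustrative layout, not CPython's actual instrumentation data: each line's offset from co_firstlineno is stored in the smallest entry width that fits the whole code object):

def build_line_table(line_offsets):
    # Small code objects still pay only 1 byte per instrumented line;
    # huge modules pay 2+ bytes per line but keep O(1) lookups.
    width = 1
    while max(line_offsets) >= 1 << (8 * width):
        width += 1
    table = b"".join(off.to_bytes(width, "little") for off in line_offsets)
    return width, table

def table_line(code, width, table, index):
    entry = table[index * width:(index + 1) * width]
    return code.co_firstlineno + int.from_bytes(entry, "little")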
