-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More compact range iterator #89189
Comments
The proposed PR provides more compact implementation of the range iterator. It consumes less memory and produces smaller pickles. It is presumably faster because it performs simpler arithmetic operations on iteration (no multiplications). |
Currently the range iterator contains four integers: index, start, step and len. On iteration index is increased until achieve len, the result is calculated as start+index*step. In the proposed PR the range iterator contains only three integers: index was removed. Instead len counts the number of iterations that left, and start is increased by step on every iteration. Less memory, simpler calculation. Pickle no longer contains index, but __setstate__ is kept for compatibility. The only incompatible change is that calling __setstate__ repeatedly will have different effect. Currently __setstate__ with the same values does not have effect, with this PR it will advance iterator. But Python only calls __setstate__ once for just created object when unpickle or copy. |
Is it worth removing the len field as well and lazily using get_len_of_range() as needed? Then the hot function can look something like: static PyObject *
rangeiter_next(rangeiterobject *r)
{
long result = r->start
if (result < r->stop) {
r->start += r->step;
return PyLong_FromLong(result);
}
return NULL;
} |
step can be negative. So the condition should be more complex: ((r->stop > 0) ? (result < r->stop) : (result > r->stop)). And it would look much more complex for longrangeiterobject. |
Microbenchmarks show some speed up: Iterating large range: $ ./python -m timeit -s 'r = range(0, 10**20, 3**35)' 'list(r)'
Before: 2000 loops, best of 5: 199 usec per loop
After: 2000 loops, best of 5: 113 usec per loop Unpickling: $ ./python -m timeit -s 'from pickle import dumps, loads; p = dumps([iter(range(i)) for i in range(1000)])' 'loads(p)'
Before: 500 loops, best of 5: 476 usec per loop
After: 500 loops, best of 5: 363 usec per loop I did not observe any difference in iterating small ranges and pickling. Smaller size in memory, smaller pickles. |
Iterating large integers: $ ./python -m pyperf timeit -s 'r = range(0, 10**20, 3**35)' 'for i in r: pass' baseline: Mean +- std dev: 223 us +- 10 us $ ./python -m pyperf timeit -s 'r = range(0, 10**20, 3**35)' 'list(r)'
baseline: Mean +- std dev: 191 us +- 13 us
PR 27986: Mean +- std dev: 107 us +- 7 us
PR 28176: Mean +- std dev: 91.3 us +- 2.4 us Unpickling: $ ./python -m pyperf timeit -s 'from pickle import dumps, loads; p = dumps([iter(range(i)) for i in range(1000)])' 'loads(p)'
baseline: Mean +- std dev: 535 us +- 29 us
PR 27986: Mean +- std dev: 420 us +- 15 us
PR 28176: Mean +- std dev: 418 us +- 17 us
$ ./python -m pyperf timeit -s 'from pickle import dumps, loads; p = dumps([iter(range(i*10**10)) for i in range(1000)])' 'loads(p)'
baseline: Mean +- std dev: 652 us +- 37 us
PR 27986: Mean +- std dev: 530 us +- 43 us
PR 28176: Mean +- std dev: 523 us +- 17 us Seems PR 28176 is slightly faster than PR 27986 in iterating long integers. |
Looks like #72363 is faster across the board. One missing benchmark here is comparing |
length_hint(), not len(). Its cost is included in microbenchmark for list(), where it is followed by iterating 2000 items. Calling operator.length_hint() in Python: $ ./python -m pyperf timeit -s 'it = iter(range(1000)); from operator import length_hint' 'length_hint(it)'
baseline: Mean +- std dev: 109 ns +- 6 ns
PR 27986: Mean +- std dev: 109 ns +- 5 ns
PR 28176: Mean +- std dev: 115 ns +- 5 ns
$ ./python -m pyperf timeit -s 'it = iter(range(0, 10**20, 3**35)); from operator import length_hint' 'length_hint(it)'
baseline: Mean +- std dev: 114 ns +- 6 ns
PR 27986: Mean +- std dev: 95.6 ns +- 4.3 ns
PR 28176: Mean +- std dev: 285 ns +- 13 ns Indirect call from C (it includes overhead for calling list() and iter() in Python): $ ./python -m pyperf timeit -s 'r = range(10**20, 10**20+1, 3**35)' 'list(iter(r))'
baseline: Mean +- std dev: 331 ns +- 16 ns
PR 27986: Mean +- std dev: 300 ns +- 16 ns
PR 28176: Mean +- std dev: 391 ns +- 18 ns With few experiments I have found that PR 28176 is faster than PR 27986 for list(iter(range(...))) if a range is larger than 40-100 items. |
May the best PR win. :-) |
Since len timings for ranges of 100 items are negligible anyway, I personally still favor #72363 which is clearly faster during iteration. |
I benchmarked #72173 and #72363 on "for i in range(10000): pass" and found that #72173 was faster for this (likely most common) case of relatively small integers. Mean +- std dev: [main] 204 us +- 5 us -> [GH-27986] 194 us +- 4 us: 1.05x faster It's possible to have different implementations for small/large integers, but IMO it's probably best to keep consistency and go with #72173. |
Good point benchmarking small iterations too, Dennis. I missed that. Agreed then, #72173 looks like a winner. |
I do not see any difference in iterating small integers: $ ./python -m pyperf timeit -s 'r = range(10000)' 'for i in r: pass'
PR 27986: Mean +- std dev: 174 us +- 9 us
PR 28176: Mean +- std dev: 172 us +- 10 us |
Serhiy, right: looks like the difference stems from recreating the range objects, not from iteration. But Dennis' observation still stands: using `for i in range(...):` is a very popular idiom. If that becomes slower than the status quo, we will be making existing Python programs slower as well. |
There is nothing in code that could explain a measureable difference in creating the range objects or the range object iterators. And indeed, it is in the range of the standard deviation, so it is non-existent. |
I did more benchmarks on my Windows laptop, and it seems the difference goes away after using PGO. The benchmarking program: ################################# from pyperf import Runner
runner = Runner()
for n in [10, 100, 1000, 10_000, 100_000]:
runner.timeit(f"for i in range({n}): pass",
stmt=f"for i in range({n}): pass")
################################### The results (without PGO): PS C:\Users\sween\Source\Repos\cpython2\cpython> .\python.bat -m pyperf compare_to .\20e3149c175a24466c7d1c352f8ff2c11effc489.json .\cffa90a8b0057d7e7456571045f2fb7b9ceb426f.json -G
Faster (2):
Benchmark hidden because not significant (3): for i in it_100: pass, for i in it_10000: pass, for i in it_100000: pass Geometric mean: 1.03x slower ################################### The results (with PGO): PS C:\Users\sween\Source\Repos\cpython2\cpython> .\python.bat -m pyperf compare_to .\PGO-20e3149c175a24466c7d1c352f8ff2c11effc489.json .\PGO-cffa90a8b0057d7e7456571045f2fb7b9ceb426f.json -G
Faster (7):
Benchmark hidden because not significant (6): deque(it_10), list(iter(range(10))), for i in range(100): pass, deque(it_100), deque(it_1000), deque(it_10000) Geometric mean: 1.01x faster |
Dennis, run your benchmarks with --rigorous to avoid "Benchmark hidden because not significant". I note that the second and third benchmarks aren't useful as written because the iterators are exhausted after first repetition. I could see this in my results, note how the values don't rise with the iterator size: for i in it_10: pass: Mean +- std dev: 25.0 ns +- 0.3 ns
for i in it_100: pass: Mean +- std dev: 25.1 ns +- 0.5 ns
for i in it_1000: pass: Mean +- std dev: 25.0 ns +- 0.3 ns
for i in it_10000: pass: Mean +- std dev: 25.0 ns +- 0.3 ns
for i in it_100000: pass: Mean +- std dev: 25.6 ns +- 0.5 ns deque(it_10): Mean +- std dev: 334 ns +- 8 ns When I modified those to recreate the iterator on every run, the story was much different. |
Benchmarks for PGO builds on macOS 10.15 Catalina, Intel MBP 2018. Like in Dennis' case, 20e3149 is #72173 and cffa90a is #72363. The difference is that ################
Faster (8):
Benchmark hidden because not significant (1): for i in range(10): pass Geometric mean: 1.00x slower The results look like a wash here. Let me compare both to |
Well, this is kind of disappointing on my end. I attach the full result of a three-way comparison between Geometric mean 20e3149-2: 1.01x slower At least on my Macbook Pro, all PGO builds, looks like the status quo is on average faster than any of the candidate PRs. |
Given the benchmark results, are we abandoning this? |
The difference of 1% is not significant. You can get larger difference from run to run with the same binary. But for large integers the difference is ~2x. And the difference in the memory and the pickle sizes is not insignificant. The original code was changed, in particular there is a special path in the eval loop for range iteration, so all benchmarks should be rerun. It sounds funny, but the small difference in iterating small ranges in #27986 disappeared after applying a simple manual optimization: long result = r->start;
- r->start += r->step;
+ r->start = result + r->step;
r->len--;
return PyLong_FromLong(result); I thought that the compiler is smart enough for this. |
#27986 is now the same as the baseline in iterating small integers range. #28176 is slightly slower.
But it is the fastest in iterating large integers range.
|
I would vote for the faster PR when the values are small -- perf for large ranges is not a priority (these occur too rare and everything else is slower with them as well). |
I agree. Thank you for review. |
* main: (112 commits) pythongh-99894: Ensure the local names don't collide with the test file in traceback suggestion error checking (python#99895) pythongh-99612: Fix PyUnicode_DecodeUTF8Stateful() for ASCII-only data (pythonGH-99613) Doc: Add summary line to isolation_level & autocommit sqlite3.connect params (python#99917) pythonGH-98906 ```re``` module: ```search() vs. match()``` section should mention ```fullmatch()``` (pythonGH-98916) pythongh-89189: More compact range iterator (pythonGH-27986) bpo-47220: Document the optional callback parameter of weakref.WeakMethod (pythonGH-25491) pythonGH-99905: Fix output of misses in summarize_stats.py execution counts (pythonGH-99906) pythongh-99845: PEP 670: Convert PyObject macros to functions (python#99850) pythongh-99845: Use size_t type in __sizeof__() methods (python#99846) pythonGH-99877) Fix typo in exception message in `multiprocessing.pool` (python#99900) pythongh-87092: move all localsplus preparation into separate function called from assembler stage (pythonGH-99869) pythongh-99891: Fix infinite recursion in the tokenizer when showing warnings (pythonGH-99893) pythongh-99824: Document that sqlite3.connect implicitly open a transaction if autocommit=False (python#99825) pythonGH-81057: remove static state from suggestions.c (python#99411) Improve zip64 limit error message (python#95892) pythongh-98253: Break potential reference cycles in external code worsened by typing.py lru_cache (python#98591) pythongh-99127: Allow some features of syslog to the main interpreter only (pythongh-99128) pythongh-82836: fix private network check (python#97733) Docs: improve accuracy of socketserver reference (python#24767) ...
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: