
display summary and breakdown of timing tests #41

Merged
merged 4 commits into gvanrossum:pyco on Jul 12, 2021

Conversation

iritkatriel

This change makes the classic version slower. There was something unfair about the previous test when names were repeated between the functions. I think marshal was able to benefit from caching or interning or something which the new method was not.
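The suspected interning effect can be illustrated directly: marshal emits back-references for repeated (interned) strings, so many functions sharing the names `a`/`b` serialize smaller than the same functions with unique names. This is a rough illustration, not the PR's test code:

```python
import marshal

# 50 functions sharing the argument names a/b ...
shared = compile(
    "\n".join(f"def f{i}(a, b): return a + b" for i in range(50)),
    "<bench>", "exec",
)
# ... versus 50 functions with per-function-unique names.
unique = compile(
    "\n".join(f"def f{i}(a{i}_, b{i}_): return a{i}_ + b{i}_" for i in range(50)),
    "<bench>", "exec",
)
# marshal back-references repeated interned strings, so the shared
# version serializes to fewer bytes than the unique one.
print(len(marshal.dumps(shared)), len(marshal.dumps(unique)))
```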

@iritkatriel iritkatriel mentioned this pull request Jul 10, 2021
@iritkatriel
Author

Starting speed test: test_speed_few_locals
Classic load: 0.076
Classic exec #1: 0.024
Classic exec #2: 0.028
Classic exec #3: 0.025
Classic exec #4: 0.025
Classic total: 0.178
PYCO: 0.128
New PYC load: 0.000
New PYC exec #1: 0.080
New PYC exec #2: 0.044
New PYC exec #3: 0.027
New PYC exec #4: 0.027
New PYC total: 0.182
Classic-to-new ratio: 0.98 (new is -2% faster)

Starting speed test: test_speed_few_locals_with_call
Classic load: 0.079
Classic exec #1: 0.201
Classic exec #2: 0.190
Classic exec #3: 0.187
Classic exec #4: 0.200
Classic total: 0.857
PYCO: 0.127
New PYC load: 0.001
New PYC exec #1: 0.418
New PYC exec #2: 0.201
New PYC exec #3: 0.199
New PYC exec #4: 0.202
New PYC total: 1.025
Classic-to-new ratio: 0.84 (new is -16% faster)

Starting speed test: test_speed_many_constants
Classic load: 8.591
Classic exec #1: 0.033
Classic exec #2: 0.026
Classic exec #3: 0.026
Classic exec #4: 0.026
Classic total: 8.703
PYCO: 2.679
New PYC load: 0.001
New PYC exec #1: 0.797
New PYC exec #2: 0.040
New PYC exec #3: 0.030
New PYC exec #4: 0.027
New PYC total: 0.897
Classic-to-new ratio: 9.70 (new is 870% faster)

Starting speed test: test_speed_many_globals
Classic load: 1.814
Classic exec #1: 0.645
Classic exec #2: 0.644
Classic exec #3: 0.645
Classic exec #4: 0.645
Classic total: 4.393
PYCO: 0.362
New PYC load: 0.001
New PYC exec #1: 2.011
New PYC exec #2: 0.793
New PYC exec #3: 0.749
New PYC exec #4: 0.750
New PYC total: 4.307
Classic-to-new ratio: 1.02 (new is 2% faster)

Starting speed test: test_speed_many_locals
Classic load: 10.680
Classic exec #1: 0.027
Classic exec #2: 0.577
Classic exec #3: 0.026
Classic exec #4: 0.026
Classic total: 11.335
PYCO: 1.466
New PYC load: 0.000
New PYC exec #1: 0.897
New PYC exec #2: 0.032
New PYC exec #3: 0.028
New PYC exec #4: 0.028
New PYC total: 0.989
Classic-to-new ratio: 11.47 (new is 1047% faster)

Starting speed test: test_speed_many_locals_with_call
Classic load: 2.495
Classic exec #1: 0.240
Classic exec #2: 0.241
Classic exec #3: 0.240
Classic exec #4: 0.243
Classic total: 3.460
PYCO: 0.357
New PYC load: 0.000
New PYC exec #1: 5.504
New PYC exec #2: 0.322
New PYC exec #3: 0.272
New PYC exec #4: 0.275
New PYC total: 7.680
Classic-to-new ratio: 0.45 (new is -55% faster)
test_new_pyc passed in 1 min 4 sec

@@ -93,7 +93,7 @@ def test_consts(self):
     def do_test_speed(self, body, call=False):
         NUM_FUNCS = 100
         functions = [
-            f"def f{num}(a, b):\n{body}"
+            f"def f{num}(a, b):\n{body}".replace("a", f"a{num}_").replace("b", f"b{num}_")
Owner

The replace() call bothers me a bit, since it's so context-free. Maybe we should invest a little in a better way of generating N functions with M lines each following a pattern that's a function of N and M?

Author

Agree. We probably want both tests, unique and not unique, to measure with and without the interning benefit.
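One possible shape for the generator discussed above, covering both the unique and shared-name variants without a context-free str.replace(). This is a hypothetical sketch, not the PR's code; all names in it are illustrative:

```python
def make_functions(num_funcs, num_lines, unique=True):
    """Generate num_funcs function sources with num_lines body lines each.

    With unique=True every variable name gets a per-function suffix,
    removing the interning advantage; with unique=False all functions
    share the names a/b/xN.
    """
    sources = []
    for n in range(num_funcs):
        sfx = f"{n}_" if unique else ""      # per-function name suffix
        args = f"a{sfx}, b{sfx}"
        body = "\n".join(
            f"    x{sfx}{i} = a{sfx} + b{sfx} + {i}" for i in range(num_lines)
        )
        sources.append(f"def f{n}({args}):\n{body}\n    return x{sfx}0")
    return sources

# Sanity check: the generated functions compile and run.
ns = {}
exec("\n".join(make_functions(3, 2, unique=True)), ns)
print(ns["f1"](1, 2))  # x1_0 = 1 + 2 + 0
```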

@gvanrossum
Owner

Are those test results significantly better than before? I find the speed tests pretty noisy (two runs may vary rather wildly).

@gvanrossum
Owner

PS. Maybe the 4 consecutive execs don't actually help much, and we should just do the first exec? (Only the first one hydrates anything.)

@iritkatriel
Author

> PS. Maybe the 4 consecutive execs don't actually help much, and we should just do the first exec? (Only the first one hydrates anything.)

The second exec is interesting because we want it to be the same in both cases once hydrated.

@gvanrossum
Owner

> The second exec is interesting because we want it to be the same in both cases once hydrated.

Oh, right. NM then.

@iritkatriel
Author

The results I get seem fairly stable, but it doesn't make sense to me that test_speed_many_locals is slower than test_speed_many_locals_with_call for Classic.

@iritkatriel
Author

To your question: with the new test, Classic is significantly slower, so New looks better in comparison. The load+1 exec time for test_speed_many_locals_with_call is 2x slower rather than 5x slower.

@iritkatriel
Author

I just got this for the third time, it's weird:
test_speed_many_locals
Classic load: 11.324

test_speed_many_locals_with_call
Classic load: 2.541

@iritkatriel
Author

Ah, of course: test_speed_many_locals_with_call has for i in range(100): and test_speed_many_locals has for i in range(300):

@iritkatriel
Author

iritkatriel commented Jul 11, 2021

So if I increase it to 300 in test_speed_many_locals_with_call then I get for the old test:

Starting speed test: test_speed_many_locals_with_call
Classic load: 0.388
Classic exec #1: 1.018
Classic exec #2: 1.011
Classic exec #3: 1.014
Classic exec #4: 1.009
Classic total: 4.441
PYCO: 0.824
New PYC load: 0.000
New PYC exec #1: 5.552
New PYC exec #2: 1.130
New PYC exec #3: 1.130
New PYC exec #4: 1.199
New PYC total: 9.039
Classic-to-new ratio: 0.49 (new is -51% faster)

and for the new test:

Starting speed test: test_speed_many_locals_with_call
Classic load: 8.153
Classic exec #1: 1.018
Classic exec #2: 1.022
Classic exec #3: 1.029
Classic exec #4: 1.026
Classic total: 12.248
PYCO: 1.123
New PYC load: 0.000
New PYC exec #1: 8.639
New PYC exec #2: 1.178
New PYC exec #3: 1.119
New PYC exec #4: 1.052
New PYC total: 12.057
Classic-to-new ratio: 1.02 (new is 2% faster)

@iritkatriel iritkatriel marked this pull request as draft July 11, 2021 00:05
@gvanrossum
Owner

Hm, of course this is the problem with micro-benchmarks -- you can use them to prove anything. :-)

We need to think more about this. Maybe we need to produce a little mathematical model to predict the approximate cost of loading a code object using the classic and new approach. Something like a constant K plus some factor times the number of instructions plus another factor times the number of unique constants plus yet another factor times the number of unique names. (Did I forget anything?) Microbenchmarking can give us K and the other factors, assuming the model is not too far off. (We need two models, one for classic marshal and another for the new approach.)

We can then attempt to do a static analysis of a known code corpus (e.g. the top 100 or top 4000 PyPI packages) to estimate the average number of instructions, unique constants, names, etc. This can then give us an approximate cut-off point for the one thing that's harder to estimate: the fraction of functions that are never called. For example if 50% of the functions in the corpus (selected randomly) are called, we'll obtain an X speedup.

We can also try to measure X (or more directly the classic and new times) for the modules that are imported during a typical standard run before the first line of user code is run -- this tells us whether we're doing anything about the "startup is slow" problem.
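The inputs to the linear model sketched above (instruction count, unique constants, unique names per code object) are cheap to extract for a corpus scan. A minimal sketch, assuming the linear-cost model described; the coefficients K, c1, c2, c3 would come from microbenchmark fits:

```python
import dis

def load_cost_features(code):
    """Features for a hypothetical linear model of load cost:
    cost ~ K + c1*instructions + c2*unique_consts + c3*unique_names."""
    n_instr = len(list(dis.get_instructions(code)))
    # repr() makes unhashable constants (e.g. nested code objects) countable
    n_consts = len(set(map(repr, code.co_consts)))
    n_names = len(set(code.co_names) | set(code.co_varnames))
    return (n_instr, n_consts, n_names)

code = compile("def f(a, b):\n    x = a + b\n    return x * 2", "<est>", "exec")
print(load_cost_features(code))
```

Running this over the code objects of a package corpus would give the distributions needed to estimate the cut-off point.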

@iritkatriel
Author

The micro benchmarks at this point are useful to see what exactly is going on. For our worst result (many locals with call, with repeated variable names) we now know that the difference is probably the interning that marshal does. So if we implement interning as you suggested here, then the story will improve for that case.

Once we're done with this and all the other optimizations, then I agree a model can help understand the tradeoffs.

@gvanrossum
Owner

It's simpler than interning even. There was an important missed case, where a string found via a redirect would be loaded even if the redirect target was already loaded. I have pushed a fix.
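The missed case might look roughly like this. This is a hypothetical Python reconstruction of the logic described, not the actual pyco code; Redirect, deserialize, and the table/cache layout are all illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Redirect:
    target: int  # index of the real entry this one points at

def deserialize(entry):
    # stand-in for the real string-decoding work
    return str(entry)

def load_string(index, table, cache):
    entry = table[index]
    if isinstance(entry, Redirect):
        index = entry.target        # follow the redirect to its target...
        entry = table[index]
    if cache[index] is None:        # ...and only deserialize on a cache miss
        cache[index] = deserialize(entry)
    return cache[index]

table = ["hello", Redirect(0)]
cache = [None, None]
load_string(0, table, cache)
# The bug was loading via the redirect even though index 0 is cached;
# with the fix, the redirect reuses the cached value.
print(load_string(1, table, cache) is cache[0])
```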

@iritkatriel
Author

Cool, I'm trying it now.

(rebased this and pushed some stuff I did on the tests)

@iritkatriel
Author

iritkatriel commented Jul 11, 2021

This prints for each test a breakdown like this:

Classic-to-new comparison:
>> load+exec ratio: 0.41 (new is -59% faster)
>> steady state ratio: 0.95 (new is -5% faster)

And then a summary in the end:

                               load+exec   steady state
many_locals                        5.823          0.881
many_constants                     1.487          1.301
many_globals                       1.319          0.840
few_locals                         0.946          0.884
few_locals_with_call               0.737          0.959
many_locals_with_call              0.413          0.951

@gvanrossum
Owner

Ah, sorry about the typo in ceval.c. Fixed in HEAD.

These summaries are very useful, but I keep getting confused about whether larger ratios mean faster or slower. I propose to invert the ratios (so tn/tc instead of tc/tn) so that we can stick to "smaller numbers are better" (which seems to be the mantra of PyPerformance and pyperf).
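The inverted report might read like this (hypothetical helper, names illustrative; not the harness code):

```python
def format_ratio(t_new, t_classic):
    """Report t_new/t_classic so smaller numbers are better,
    matching the pyperf/PyPerformance convention."""
    ratio = t_new / t_classic
    pct = round((ratio - 1) * 100)
    word = "slower" if pct > 0 else "faster"
    return f"{ratio:.2f} (new is {abs(pct)}% {word})"

# Using the few_locals totals from the first run above:
print(format_ratio(0.182, 0.178))  # -> 1.02 (new is 2% slower)
```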

I still see a lot of noise. E.g. two consecutive runs (with inverted ratios):

                               load+exec   steady state
many_locals_with_call              3.106          1.128
few_locals_with_call               1.935          0.804
few_locals                         1.232          0.898
many_globals                       0.885          0.900
many_constants                     0.726          0.887
many_locals                        0.181          1.239

                               load+exec   steady state
many_locals_with_call              3.179          1.133
few_locals_with_call               1.985          0.883
few_locals                         1.099          1.067
many_constants                     0.761          0.839
many_globals                       0.751          0.987
many_locals                        0.159          1.085

But the trend is still clear: steady state is sometimes faster, sometimes slower, three tests are clearly faster (the many constants, globals, locals) and the tests with calls are still dramatically slower.

If I comment out the hydration of local variable names (as shown before), many_locals_with_call load+exec time goes down to 2.5, and few_locals_with_call to 1.6. I don't think interning will help (since in the faster version names aren't loaded at all).

@iritkatriel iritkatriel marked this pull request as ready for review July 11, 2021 22:05
@iritkatriel changed the title from "make test functions use unique variable names" to "display summary and breakdown of timing tests" on Jul 11, 2021
@gvanrossum left a comment (Owner)

Looks good!

@iritkatriel
Author

Almost... need to flip the direction of the prose.

@iritkatriel
Author

It's either what I just committed, or it needs to be "new is X% slower" (which sounds very negative) or "classic is X% faster" (which sounds like the wrong focus).

@iritkatriel
Author

No, that was wrong, let me revert the last commit.

@gvanrossum gvanrossum mentioned this pull request Jul 12, 2021
@gvanrossum gvanrossum merged commit 89e264d into gvanrossum:pyco Jul 12, 2021
@gvanrossum
Owner

Thanks again!

@iritkatriel iritkatriel deleted the fix-test branch August 2, 2021 14:13