-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gc: findHead scan optimization #3899
gc: findHead scan optimization #3899
Conversation
e282884
to
565a24d
Compare
I wonder if we want to leave an old copy of the code disabled unless the runtime is built with Also, there is a similar forward scanning loop in |
9a315e8
to
565a24d
Compare
I need to take a closer look at the code still, but I like how it's kept small. Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer
I'd rather not have an old version lying around that isn't getting much testing. Such an old version is likely to bitrot. Instead, I'd prefer we do a ton of testing (corpus, etc), merge it, and if there is a bug we'll know it soon enough (it's easy to verify a potential bug by simply reverting the commit and seeing whether the bug goes away).
Perhaps this is something for a future PR? Just to keep this PR focused: it's already providing a really nice performance improvement. |
32bit version would require much more considerations regarding edge cases and platform differences so I suspect it would grow much more complex than 8bit version, but I agree we can try to implement it later in another PR because of significant performance gain it could provide.
It also looked to me that first Regarding the code size, I suspect it is due to use of |
@aykevl you were right, it can be simplified, I just needed to subtract number of blocks from other objects at the end :) |
It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620 On armv7m LLVM emits a single instruction: https://godbolt.org/z/5aeG8bPfc |
Thanks for the info, it is good to know.
It probably makes sense, on those targets there will probably not be big amount of heap allocations anyway. |
Any outstanding comments here? Is this ready to review/merge? Does it need another pass or implementation attempt? @HattoriHanzo031 If you don't have time, I can probably pick this up and get it across the finish line. |
16f759a
to
47d0cb1
Compare
@dgryski sorry for the delay. I rebased the branch and tried to run some tests and benchmarks on embedded target. I built test program that creates some heap variables in the loop and ran it on rp2040 board. I also added toggling of one pin at the start and the end of the |
Here are some numbers from https://github.com/dgryski/trifles/tree/master/tinyjsonbench
So we can see an improvement of about 10ms or about 3%. ( Data analysis via https://github.com/codahale/tinystat ) |
@dgryski Here are some results from benchmarking. I also ran both algorithms side by side (also comparing the output): One strange thing I noticed is that on RP-2040 board Still didn't figure out what is the cause for this since only change is in the |
47d0cb1
to
9a1ca3c
Compare
Given this is slower for very short scans, and that most allocations are only for a single block, I wonder if we want to have an additional test to see if it's a one-block allocation before jumping into the larger logic? Or maybe the extra test outweighs the difference. |
Also, ran this code against the corpus with the following patch to check for any mismatches:
|
This is a short-circuit patch to return early if we're already on a head block.
|
@dgryski sorry for the late response, I think I tried adding check for single block allocations, but it got complicated because of assert logic. I like how your suggestion avoids dealing with assert since short circuit check does not decrements
Did you found any mismatch when running these tests? I did similar for my test programs and never got different value between old and new algorithm. |
I never got any differences between your code and the old version running tinygo through the test corpus on either Linux or WASI. (Pretty sure, although I might double-check this claim...) |
builder/sizes_test.go
Outdated
{"hifive1b", "examples/echo", 4560, 280, 0, 2268}, | ||
{"microbit", "examples/serial", 2868, 388, 8, 2272}, | ||
{"wioterminal", "examples/pininterrupt", 6104, 1484, 116, 6832}, | ||
{"hifive1b", "examples/echo", 4688, 280, 0, 2268}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current echo
size should be 4660
. Then this should pass CI and I think we'll be good to merge.
@aykevl Aside from the above CI issue with binary sizes, do you have any comments before this is merged? |
9a1ca3c
to
a48056d
Compare
@dgryski is there a way I can run tinyjsonbench myself? I'd like to play a bit with |
This is similar to #3899, but smaller and hopefully just as efficient.
This is similar to #3899, but smaller and hopefully just as efficient.
Did some experimenting, and wrote some code that's a bit smaller (typically 20 byte increase in binary size, as opposed to 44-84 bytes in this PR) and I think just as fast or maybe slightly faster:
PR is at #4574. The main difference is that I don't try complicated XOR and
@dgryski as you can see, I did get tinyjsonbench to work. However I did find that run.sh is locale dependent: I had to replace a bunch of commas with dots in old.times and new.times for it to show something sensible (I've set my system to a weird mixture of English language and Dutch formatting of dates and numbers and stuff). |
@aykevl I like the simplicity of your suggestion. I did some benchmarking on RP2040 and the code from this PR is slightly (but consistently) faster than your suggestion. I'm not sure if this small speedup justifies so much size increase, especially if benchmarks on linux/wasm show similar/better performance. |
@HattoriHanzo031 can you share the benchmark code? |
@aykevl I'm definitely a fan of your approach, if only because I could immediately see that it was correct. I'm willing to trade off a little speed for smaller code size and simpler code. I'd be curious to see the performance difference between the two approaches, rather than comparing against the |
a48056d
to
b5a6ca7
Compare
The benchmark that I am doing on embedded targets does not test the speed of user code, but compares the speed of execution of different versions of
I agree, I think we should abandon I have few ideas for micro optimisations of the approach from @aykevl PR.
It is essentially the same code as |
I just ran tinyjsonbench for the micro optimisations and it looks like they give comparable performance (new.times), while producing few bites smaller binary. |
Here are the results from RP2040: Total runtime for each channel for the sample I recorded is:
Implementation with XOR logic and Given this results and binary size difference I am in favour of approach from #4574 with micro optimisations. P.S. this is the code used to generate timing signals on embedded targets:
|
This is similar to #3899, but smaller and hopefully just as efficient. Thanks to @HattoriHanzo031 for starting this work, benchmarking, and for improving the performance of the code even further.
This is similar to #3899, but smaller and hopefully just as efficient. Thanks to @HattoriHanzo031 for starting this work, benchmarking, and for improving the performance of the code even further.
@HattoriHanzo031 updated #4574 with your suggestions, and refactored the code a bit to hopefully be easier to read. I didn't re-run the benchmarks but the changes seem reasonable and should indeed improve performance.
Also, I expect most pointers to point to the start of an allocation even for large allocations. For example, a slice will typically be pointing to the start of the buffer unless the code explicitly uses |
b5a6ca7
to
47f5d3c
Compare
Looks very clear and readable. I am closing this PR since it is redundant now. |
This is similar to #3899, but smaller and hopefully just as efficient. Thanks to @HattoriHanzo031 for starting this work, benchmarking, and for improving the performance of the code even further.
Draft PR to continue discussion with @dgryski and @aykevl about optimizing findHead algorithm in case of large object allocation.
The idea is to scan the block state byte at the time. State byte is first XOR-ed with all tails byte which turns all tails in state byte to 0 and then bits.LeadingZeros8 is used to get how many tails are in the byte. Also, if XOR-ed byte is 0 that means it is all tails so we can skip to the next. For the first byte we first shift out the bits that are not part of current object.