Add neat #487

Merged
merged 2 commits into from
Dec 27, 2023

Conversation

jinyus
Owner

@jinyus jinyus commented Dec 27, 2023

Neat:

        Processing time (w/o IO): 177.988000ms
        total: 0.77s memory: 108904k
        Processing time (w/o IO): 197.040000ms
        total: 0.79s memory: 108904k
        Processing time (w/o IO): 177.947000ms
        total: 0.77s memory: 108900k
        Processing time (w/o IO): 196.555000ms
        total: 0.79s memory: 108904k
        Processing time (w/o IO): 180.425000ms
        total: 0.77s memory: 108900k
        Processing time (w/o IO): 178.002000ms
        total: 0.77s memory: 108772k
        Processing time (w/o IO): 196.600000ms
        total: 0.79s memory: 108776k
        Processing time (w/o IO): 178.117000ms
        total: 0.77s memory: 108900k
        Processing time (w/o IO): 196.533000ms
        total: 0.79s memory: 108772k
        Processing time (w/o IO): 178.200000ms
        total: 0.77s memory: 108900k

Neat:

        Processing time (w/o IO): 2656.659000ms
        total: 5.18s memory: 424464k
        Processing time (w/o IO): 2657.508000ms
        total: 5.17s memory: 424464k
        Processing time (w/o IO): 2953.203000ms
        total: 5.35s memory: 424336k

Neat:

        Processing time (w/o IO): 25958.542000ms
        total: 33.93s memory: 1660584k
        Processing time (w/o IO): 25959.446000ms
        total: 34.16s memory: 1660736k
        Processing time (w/o IO): 25995.728000ms
        total: 33.79s memory: 1660860k

@jinyus jinyus merged commit 42e50a7 into main Dec 27, 2023
@jinyus
Owner Author

jinyus commented Dec 27, 2023

Hey @FeepingCreature, hope you have the time to take a look to make sure everything was set up correctly. I just copied the code from your gist. It doesn't seem to scale too well, but it's very young, so that isn't surprising. It might be worthwhile to do some profiling, though.

@FeepingCreature

I'll check more tomorrow, but note: -optimize is not actually a flag, I have no idea why it works. You want -O -release, like D.

@FeepingCreature

Oh damn I'm an idiot, of course it's a flag, I specifically added support for checking -flag against longflags before shortflags.

Bleh.

At any rate, try -release. Array index checks are really hitting the example hard.
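The long-before-short lookup mentioned above can be sketched roughly as follows. This is a minimal illustration in Python, not Neat's actual parser: the flag tables and the `resolve_flag` helper are hypothetical names chosen for the example.

```python
# Hypothetical flag tables for illustration; not Neat's real option list.
LONG_FLAGS = {"optimize", "release"}
SHORT_FLAGS = {"O"}


def resolve_flag(token):
    """Classify a single-dash token, trying long flag names before short ones."""
    if not token.startswith("-") or token == "-":
        return ("positional", token)
    name = token[1:]
    if name in LONG_FLAGS:
        # "-optimize" matches here, which is why it is accepted as a flag.
        return ("long", name)
    if name in SHORT_FLAGS:
        return ("short", name)
    return ("unknown", name)


for token in ("-optimize", "-O", "-release"):
    print(token, resolve_flag(token))
```

Checking the long table first means a single-dash spelling of a long flag is never misread as a short flag, at the cost of short flags not being allowed to shadow long names.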

@jinyus
Owner Author

jinyus commented Dec 27, 2023

Disabling bounds checks is against the rules, but it does give a decent speedup: 160ms -> 105ms

@FeepingCreature

Huh, now I'm wondering how D avoids the bounds check. Probably because the array size is known statically.

@jinyus
Owner Author

jinyus commented Dec 27, 2023

Yea, languages without fixed-size arrays are at a disadvantage.

@FeepingCreature

Hm. Well at any rate, I see some good places I can optimize the refcounter with this, there's some bad things going on with array appends spamming reference inc/decs even if it's single-owner. I'll go look into it.

@FeepingCreature

Okay, the array append issue (though fixed) wasn't actually on the critical path. But there is something fishy going on with the range checks; LLVM should easily be able to erase those, it's all inlined anyways. For some reason it assigns a number to a field and then doesn't realize that the number stays constant throughout. I'm on it.

@FeepingCreature

Rules clarification question: We can change the language to remove specific internal bounds checks that are demonstrably redundant, right?

@jinyus
Owner Author

jinyus commented Jan 1, 2024

Yes, that's fine... as long as it's a general improvement and not only useful in this benchmark.

@FeepingCreature

FeepingCreature commented Jan 1, 2024

Great, cause I just noticed that I was doing bounds checks for element loads in array loops. Which are, uh, impossible to ever violate. I.e., for (key, value in array) was doing a bounds check on every iteration.

I'm pretty sure that's the main reason Neat was slow.

(This optimization is safe because if you're doing a loop and you append to the variable, the loop still only runs to the original length; similarly if you truncate)
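The safety argument above can be modelled in a few lines. This is only a sketch of the described semantics in Python, not Neat's implementation: tuples stand in for array values, and `loop_with_snapshot` is a hypothetical helper. The assumption is that the loop captures the array and its length once, so rebinding the variable inside the body cannot change what the loop reads.

```python
def loop_with_snapshot(values):
    """Iterate over a snapshot of the array while the loop body rebinds the variable."""
    current = tuple(values)   # the variable the loop body is allowed to change
    snapshot = current        # what the loop actually iterates over
    length = len(snapshot)    # read once, before the first iteration
    visited = []
    for index in range(length):
        # index < length always holds here, so checking it again on every
        # element load would be redundant.
        visited.append(snapshot[index])
        if index == 0:
            current = current + (99,)   # "append": rebinds the variable only
        if index == 1:
            current = current[:1]       # "truncate": rebinds the variable only
    return visited, current


visited, current = loop_with_snapshot([1, 2, 3])
print(visited, current)
```

The loop visits exactly the original elements regardless of what happens to `current`, which is why the per-iteration check carries no information.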

Could you retry with 0.5.1 please?

@jinyus
Owner Author

jinyus commented Jan 1, 2024

I'm seeing a 2x speedup.

0.5.0:

Neat | 185.74 ms | 2.76 s | 25.97 s

0.5.1:

Neat:

        Processing time (w/o IO): 84.985234ms
        total: 0.37s memory: 59620k
        Processing time (w/o IO): 85.004227ms
        total: 0.37s memory: 59752k
        Processing time (w/o IO): 85.150147ms
        total: 0.39s memory: 59748k
        Processing time (w/o IO): 85.087656ms
        total: 0.37s memory: 59752k
        Processing time (w/o IO): 85.107234ms
        total: 0.37s memory: 59880k
        Processing time (w/o IO): 85.150078ms
        total: 0.37s memory: 59620k
        Processing time (w/o IO): 85.060516ms
        total: 0.37s memory: 59880k
        Processing time (w/o IO): 84.853219ms
        total: 0.37s memory: 59748k
        Processing time (w/o IO): 84.757633ms
        total: 0.37s memory: 59880k
        Processing time (w/o IO): 84.843406ms
        total: 0.37s memory: 59748k

Neat:

        Processing time (w/o IO): 1154.516875ms
        total: 2.30s memory: 227980k
        Processing time (w/o IO): 1156.726875ms
        total: 2.32s memory: 227988k
        Processing time (w/o IO): 1154.512250ms
        total: 2.43s memory: 227988k

Neat:

        Processing time (w/o IO): 9915.724000ms
        total: 13.54s memory: 539836k
        Processing time (w/o IO): 9916.659000ms
        total: 13.59s memory: 540088k
        Processing time (w/o IO): 11525.339000ms
        total: 15.33s memory: 539964k
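For anyone wanting to check the "2x" figure against the logs, here is a small sketch that parses the processing times and compares their median with the 0.5.0 summary row. It assumes that the second and third columns of that row correspond to the second and third log blocks; the variable and function names are made up for the example.

```python
import re
from statistics import median

# Second and third 0.5.1 log blocks from this comment (processing-time lines only).
LOG_SECOND_BLOCK = """
Processing time (w/o IO): 1154.516875ms
Processing time (w/o IO): 1156.726875ms
Processing time (w/o IO): 1154.512250ms
"""
LOG_THIRD_BLOCK = """
Processing time (w/o IO): 9915.724000ms
Processing time (w/o IO): 9916.659000ms
Processing time (w/o IO): 11525.339000ms
"""

# 0.5.0 summary row converted to milliseconds (2.76 s and 25.97 s).
# Assumption: these columns line up with the two blocks above.
BASELINE_SECOND_MS = 2760.0
BASELINE_THIRD_MS = 25970.0

TIME_PATTERN = re.compile(r"Processing time \(w/o IO\): ([0-9.]+)ms")


def processing_times_ms(log_text):
    """Extract every processing time, in milliseconds, from a log excerpt."""
    return [float(value) for value in TIME_PATTERN.findall(log_text)]


def speedup(baseline_ms, log_text):
    """Ratio of the baseline time to the median of the new timings."""
    return baseline_ms / median(processing_times_ms(log_text))


second_speedup = speedup(BASELINE_SECOND_MS, LOG_SECOND_BLOCK)
third_speedup = speedup(BASELINE_THIRD_MS, LOG_THIRD_BLOCK)
print(f"second block: {second_speedup:.2f}x, third block: {third_speedup:.2f}x")
```

The median is used rather than the mean so that a single slow run, like the last one in the third block, does not skew the ratio.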

@FeepingCreature

Hooray! \o/ I think I'd need to look at D's assembly to see where to get more, but it's a start.

@jinyus
Owner Author

jinyus commented Jan 1, 2024

The LLVM backend yielded a greater improvement, though installation is a bit cumbersome as it expects LLVM to be in a specific location. I think using which llvm-config would be better in this case.

Neat:

        Processing time (w/o IO): 58.419998ms
        total: 0.30s memory: 59756k
        Processing time (w/o IO): 58.604000ms
        total: 0.33s memory: 59624k
        Processing time (w/o IO): 58.535999ms
        total: 0.30s memory: 59752k
        Processing time (w/o IO): 50.978001ms
        total: 0.29s memory: 59884k
        Processing time (w/o IO): 50.511002ms
        total: 0.29s memory: 59628k
        Processing time (w/o IO): 58.518002ms
        total: 0.30s memory: 59756k
        Processing time (w/o IO): 58.595001ms
        total: 0.30s memory: 59632k
        Processing time (w/o IO): 58.584999ms
        total: 0.30s memory: 59752k
        Processing time (w/o IO): 50.693001ms
        total: 0.29s memory: 59752k
        Processing time (w/o IO): 50.667000ms
        total: 0.29s memory: 59624k

Neat:

        Processing time (w/o IO): 832.085022ms
        total: 1.86s memory: 227876k
        Processing time (w/o IO): 833.153015ms
        total: 1.93s memory: 227880k
        Processing time (w/o IO): 833.741028ms
        total: 1.91s memory: 227876k

Neat:

        Processing time (w/o IO): 7259.328125ms
        total: 10.39s memory: 529236k
        Processing time (w/o IO): 6195.353027ms
        total: 9.35s memory: 529368k
        Processing time (w/o IO): 7255.119141ms
        total: 10.11s memory: 529236k

@FeepingCreature

FeepingCreature commented Jan 2, 2024

Oh, you were using the gcc backend? Yeah, the LLVM one is the one I use and optimize the most; the gcc one is mostly a fallback for bootstrapping.

The locations are hardcoded cause I need a specific LLVM version on multi-LLVM systems. How to resolve this is completely unstandardized, sadly.
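One way to combine the PATH lookup suggested earlier with the need for a specific version is to try a list of candidate binary names in order. The sketch below uses Python's `shutil.which`; the `find_llvm_config` helper and the versioned name in the usage line are hypothetical, and versioned binary names differ between distributions.

```python
import shutil


def find_llvm_config(candidates=("llvm-config",)):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


# Hypothetical usage: prefer a versioned binary name, then the generic one.
# The version number here is only an example.
llvm_config = find_llvm_config(("llvm-config-15", "llvm-config"))
print(llvm_config or "llvm-config not found on PATH")
```

This does not solve the standardization problem, since the caller still has to know which versioned names exist on a given system, but it avoids hardcoding a single install location.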

Heh, looking at the diff this was pretty much the biggest possible speedup I could have got without advancing in the ranking, lol.

@FeepingCreature

Hey, um. You may want to try 0.5.2 as well :) I think the speedup should be impressive.

Turns out it wasn't the bounds checking at all; I was just doing stupid things with argument passing.

@jinyus
Owner Author

jinyus commented Jan 5, 2024

Nice! Right up there with go and java.

Neat:

        Processing time (w/o IO): 18.870001ms
        total: 0.11s memory: 59616k
        Processing time (w/o IO): 18.753000ms
        total: 0.15s memory: 59616k
        Processing time (w/o IO): 18.900000ms
        total: 0.11s memory: 59616k
        Processing time (w/o IO): 18.914000ms
        total: 0.11s memory: 59488k
        Processing time (w/o IO): 19.087000ms
        total: 0.11s memory: 59616k
        Processing time (w/o IO): 18.981001ms
        total: 0.11s memory: 59488k
        Processing time (w/o IO): 18.892000ms
        total: 0.11s memory: 59572k
        Processing time (w/o IO): 19.103001ms
        total: 0.11s memory: 59484k
        Processing time (w/o IO): 19.546000ms
        total: 0.11s memory: 59364k
        Processing time (w/o IO): 19.062000ms
        total: 0.11s memory: 59616k

Neat:

        Processing time (w/o IO): 251.744995ms
        total: 0.68s memory: 227720k
        Processing time (w/o IO): 250.667007ms
        total: 0.81s memory: 227724k
        Processing time (w/o IO): 252.091995ms
        total: 0.75s memory: 227592k

Neat:

        Processing time (w/o IO): 2141.670898ms
        total: 3.81s memory: 534540k
        Processing time (w/o IO): 2144.925049ms
        total: 3.69s memory: 534412k
        Processing time (w/o IO): 2140.822998ms
        total: 3.84s memory: 534288k

@FeepingCreature

That's what I like to see :)

Turns out, if you have three-pointer (24-byte) arrays, you really don't want to pass them as structs on AMD64 - as opposed to D's 16 byte arrays, 24-byte arrays force an alloca and having a few dozen allocas rolling around being assembled and disassembled really messes up LLVM's ability to optimize register moves something fierce. Luckily since it's not a C type, I can just decide to pass them differently. And suddenly we're twice as fast. :)

There's some other perf things in 0.5.2 - also technically I could switch to static arrays now - but "just pass arrays as individual arguments rather than stack pointers" was the big one.
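The size difference behind this can be illustrated with `ctypes`. This is a sketch under stated assumptions: the field names and layouts below are hypothetical (the thread only gives the sizes, 16 and 24 bytes), and the 16-byte threshold is a simplified reading of the System V AMD64 calling convention, where larger aggregates are passed through memory rather than in registers.

```python
import ctypes


class TwoWordArray(ctypes.Structure):
    # Hypothetical layout of a 16-byte array value: length plus data pointer.
    _fields_ = [("length", ctypes.c_size_t), ("ptr", ctypes.c_void_p)]


class ThreeWordArray(ctypes.Structure):
    # Hypothetical layout of a three-pointer array value; field names are made up.
    _fields_ = [
        ("length", ctypes.c_size_t),
        ("ptr", ctypes.c_void_p),
        ("base", ctypes.c_void_p),
    ]


# Simplified rule: aggregates above 16 bytes are passed through memory.
REGISTER_LIMIT_BYTES = 16


def fits_in_registers(struct_type):
    """True if the struct is small enough to be passed in registers under the rule above."""
    return ctypes.sizeof(struct_type) <= REGISTER_LIMIT_BYTES


for struct_type in (TwoWordArray, ThreeWordArray):
    print(struct_type.__name__, ctypes.sizeof(struct_type), fits_in_registers(struct_type))
```

Passing the three words as separate scalar arguments sidesteps the threshold entirely, since each word can then be assigned to a register on its own.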

@FeepingCreature

FeepingCreature commented Jan 5, 2024

To clarify, there's basically no point to static arrays, it gives no appreciable speedup. With the extraneous bounds checks being removed in the last version, it now only has to check the posts array anyways - LLVM can already tell that the 5-entry array has five entries and elide the bounds.

Also, -release has basically no effect anymore.

@FeepingCreature

Just for fun, I would also like to draw your attention to this commit. Note the disparity between the size of the debugging effort and the size of the fix. I had 0.5.2 ready this morning if not for that. :)

(Was it worth it? Absolutely. Moving up in benchmark placement makes the brain meats secrete the endorphins something fierce. - I think I'll leave it there though, D v2 is a bit silly.)

@jinyus
Owner Author

jinyus commented Jan 5, 2024

Haha, I love to see it, thanks for the effort. I realized that this benchmark has helped several language designers find inefficiencies in their implementations: Dart, Lobster, Inko.

The D and Rust guys had a little tit for tat going on, but D ultimately came out on top. As for Julia HO, it puzzles me how it's so fast, but I plan on doing a deep dive on whatever wizardry is going on under the hood.

@jinyus jinyus deleted the neat branch February 9, 2024 14:03