
Implement ResourceArc::make_binary #487

Merged Sep 18, 2022 (6 commits)

Conversation

@filmor (Member) commented Sep 11, 2022:

No description provided.

@filmor (Member, Author) commented Sep 11, 2022:

@josevalim FYI

@josevalim (Contributor):

@filmor looks great! 👍

[Resolved review comment on rustler/src/resource.rs]
@josevalim (Contributor):

Btw, for completeness, I don't think this would work for my use case, so we would still need the low-level APIs. I am allocating multiple binaries from the same resource:

https://github.com/elixir-nx/explorer/pull/346/files#diff-3cb6b1439fef8fae9e6949e097296b4a3b17c8761ec9d8e98c0b4fa813d362f3R224

Well, maybe you could allow a Vec of slices to be returned? But also note in the example above that I proceed to create sub binaries from the resource binary, and even if you changed this function to create multiple binaries, it would require me doing an additional pass on the data.

In any case, I think this is still solid for a convenient, safer API. :) Even though I don't think my opinion is worth much here.

@hansihe (Member) commented Sep 11, 2022:

Couldn't you just call make_binary first, returning array.values().as_slice(), then call make_subbinary on that inside your iter.rfold? Aside from some bounds checks, I think that should be equivalent to what your code does today.

@josevalim (Contributor):

@hansihe that's a chunked array, so array.values() returns a vector of resource binaries. I can't convert array.values().as_slice() to a resource binary.

[Resolved review comment on rustler/src/resource.rs]
@hansihe (Member) commented Sep 11, 2022:

This API is a really good idea by the way @filmor!

@hansihe (Member) commented Sep 11, 2022:

@josevalim Ahh, I see what you mean

@filmor (Member, Author) commented Sep 11, 2022:

How about an unsafe variant of the same function without the lifetime bound? That way you at least don't have to fiddle with so many pointers.

@hansihe (Member) commented Sep 11, 2022:

I think adding a fn make_binary_unsafe<'env, 'a>(&self, env: Env<'env>, data: &'a [u8]) -> Binary might be better. If we don't enforce the lifetime, I don't think there would be any need to have it go through a callback?

We need to make the documentation for that function very scary.

@josevalim (Contributor):

@filmor that would work but I may send a PR for Term::make_sub_binary so we skip the binary overhead in the tight loop, if that's ok?

@filmor (Member, Author) commented Sep 12, 2022:

> I think adding a fn make_binary_unsafe<'env, 'a>(&self, env: Env<'env>, data: &'a [u8]) -> Binary might be better. If we don't enforce the lifetime, I don't think there would be any need to have it go through a callback?

Having the same signature helps to clarify the difference, IMO. It also makes it clear that the binary's lifetime should be connected to the referenced object.

> @filmor that would work but I may send a PR for Term::make_sub_binary so we skip the binary overhead in the tight loop, if that's ok?

What overhead are you referring to? Creating a Binary from a Term in an unsafe way (like we are doing here) should have no measurable overhead, it's just an additional stack operation. And it clarifies for make_sub_binary at compile-time that we know that the referenced term is a binary. The only additional overhead I could think of is from the bounds-checking in Binary::make_subbinary, but that could also be lifted with an unsafe variant.
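The bounds-check trade-off discussed here can be illustrated in plain Rust. This is a sketch only: an ordinary byte slice stands in for a NIF binary, and the function names are hypothetical, not rustler's actual API.

```rust
// Checked variant: returns None when offset + len exceeds the data,
// mirroring what a safe make_subbinary must verify on every call.
fn sub_checked(data: &[u8], offset: usize, len: usize) -> Option<&[u8]> {
    data.get(offset..offset + len)
}

/// Unchecked variant: skips the bounds check entirely.
///
/// # Safety
/// The caller must guarantee `offset + len <= data.len()`.
unsafe fn sub_unchecked(data: &[u8], offset: usize, len: usize) -> &[u8] {
    data.get_unchecked(offset..offset + len)
}
```

In a tight loop where the caller already knows the offsets are valid (as in the rfold below, where offsets come from the array itself), the unchecked variant removes a per-iteration branch at the cost of an unsafe contract.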

@josevalim (Contributor):

Let’s merge this with the unsafe version and I can give it a try and benchmark. But last I checked it was measurable. :)

@josevalim (Contributor) commented Sep 12, 2022:

> Let’s merge this with the unsafe version and I can give it a try and benchmark. But last I checked it was measurable. :)

I was actually able to reproduce the benchmark with the current code. Here is the current code:

https://github.com/elixir-nx/explorer/blob/ab0ff323c2bd0d5bae69424dabac83268a50c2f6/native/explorer/src/encoding.rs#L235-L249

Here is the patched version:

        iter.rfold(acc, |acc, uncast_offset| {
            let offset = *uncast_offset as NIF_TERM;

            let term = if validity_iter.next_back().unwrap_or(true) {
                unsafe {
                    rustler_sys::enif_make_sub_binary(
                        env_as_c_arg,
                        binary_as_c_arg,
                        offset,
                        last_offset - offset,
                    )
                }
            } else {
                nil_atom.as_c_arg()
            };

            last_offset = offset;
            unsafe { list::make_list_cell(env_as_c_arg, term, acc) }
        })

Here are the benchmark numbers:

Name               ips        average  deviation         median         99th %
new            15.02 K       66.57 μs    ±69.59%       67.08 μs      109.05 μs
old             6.41 K      155.93 μs    ±42.30%      152.58 μs      227.04 μs

I have also removed the bounds checks to see if they are the root cause, and there is no meaningful difference in the numbers:

Name               ips        average  deviation         median         99th %
nobound         6.38 K      156.62 μs    ±38.62%      152.19 μs      289.57 μs

And for completeness, here is the benchmark:

alias Explorer.Series, as: S

strings =
  for _ <- 1..1000 do
    :crypto.strong_rand_bytes(8) |> Base.encode64()
  end

series = S.from_list(strings)

Benchee.run(
  %{"new" => fn -> S.to_list(series) end},
  time: 5
)

In a nutshell, each call to to_list will create 1k binaries. In the benchmark, we are creating 6M-15M subbinaries per second, and that's why those small changes, such as Term and Binary wrapping, show up, which is why I have been asking for more direct rustler::wrapper functions. :) That would make a difference for us.
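The throughput figures quoted above follow directly from the benchmark numbers; a quick sanity check (with a hypothetical helper name, just for illustration):

```rust
// Iterations per second times binaries created per call gives the
// sub-binaries-per-second rate quoted in the comment above.
fn subbinaries_per_sec(ips: f64, binaries_per_call: f64) -> f64 {
    ips * binaries_per_call
}
```

With the numbers from the benchmark, subbinaries_per_sec(15.02e3, 1000.0) is about 15M/s for the patched version and subbinaries_per_sec(6.41e3, 1000.0) about 6.4M/s for the old one.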

@filmor (Member, Author) commented Sep 12, 2022:

Well, we have this pending PR to introduce inlining, maybe that does it already?

@josevalim (Contributor):

@filmor I have just tried the PR; it actually made things slower?

Name               ips        average  deviation         median         99th %
inline          5.56 K      179.91 μs   ±163.61%      152.54 μs      857.17 μs 

My understanding from other languages is that inlining can sometimes have a negative impact, by making the code larger so it no longer fits in the cache, or by making code reuse harder (so code gets purged more frequently). I am not sure how much this applies to Rust.

@hansihe (Member) commented Sep 12, 2022:

I looked into this a little bit, and I think the main difference in performance between using enif_make_sub_binary directly vs the make_subbinary function is due to make_subbinary also calling enif_inspect_binary.

You can see that Binary::make_subbinary calls Binary::from_term, which again calls enif_inspect_binary to get the ErlNifBinary data from it.

As far as I can see, we only really use the ErlNifBinary to get the data pointer and length. If we instead stored those directly in the Binary struct, we could avoid calling enif_inspect_binary when making subbinaries.
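The idea of caching the data pointer and length at construction can be sketched as follows. This is a mock over a Rust slice, not rustler's real Binary struct; the point is that the "inspect" step happens once, and sub-binary creation only checks against the cached length.

```rust
// Mock Binary that caches what enif_inspect_binary would yield
// (data pointer and length), represented here as a slice.
#[derive(Clone, Copy)]
struct Binary<'a> {
    buf: &'a [u8],
}

impl<'a> Binary<'a> {
    fn new(buf: &'a [u8]) -> Self {
        // In rustler, this is where the one-time inspection would happen.
        Binary { buf }
    }

    // No re-inspection needed: bounds are validated against the cached length.
    fn make_subbinary(&self, offset: usize, len: usize) -> Option<Binary<'a>> {
        self.buf.get(offset..offset + len).map(|b| Binary { buf: b })
    }
}
```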

@josevalim (Contributor):

@hansihe nice find! We could apply the same idea to this PR as well. There is no reason to inspect it if we have the term, pointer and size upfront.

@hansihe (Member) commented Sep 12, 2022:

Made a PR for the change

#488

@josevalim (Contributor):

@hansihe awesome! It brings the benchmark up to:

Name               ips        average  deviation         median         99th %
new branch     10.05 K       99.54 μs    ±47.18%         100 μs      270.91 μs

@filmor (Member, Author) commented Sep 12, 2022:

> @hansihe nice find! We could apply the same idea to this PR as well. There is no reason to inspect it if we have the term, pointer and size upfront.

True, will do that later today. We just need to add an additional unsafe private constructor to Binary for the (very specific :)) case of having term, pointer and size available.

@hansihe (Member) commented Sep 12, 2022:

I think the remaining overhead might be code bloat from panic handling for when unwrap is called on the result from make_subbinary.

Introducing and using a make_subbinary_unchecked seems to result in pretty much the exact same assembly code output for that function.

@filmor filmor force-pushed the make-resource-binary branch from fc9a709 to 4c92d8f Compare September 13, 2022 06:44
@filmor filmor marked this pull request as ready for review September 13, 2022 08:44
@hansihe (Member) left a review comment:

Looks good to me in general!

I would want some additional info in the docs for the unsafe variant.

[Resolved review comment on rustler/src/resource.rs]
@filmor (Member, Author) commented Sep 13, 2022:

@josevalim Can you check with your benchmark whether this is now acceptable without using the direct enif_ functions? :)

@hansihe (Member) commented Sep 13, 2022:

I think we might still need to have a make_subbinary_unchecked to match the performance. I think the current state should be roughly equivalent to the "new branch" benchmark posted earlier?

@josevalim (Contributor):

@hansihe correct. I am on 10k with this branch, the code looks super clean though. :)

@filmor (Member, Author) commented Sep 14, 2022:

@josevalim I added the unchecked variant and extended the comments a bit, will that do now? :)

@filmor (Member, Author) commented Sep 14, 2022:

Looking at your code from the branch again, you can make it even safer by using the direct term methods: instead of list::make_list_cell you could keep everything in a Term and do acc.list_prepend(term). That should allow you to ditch pretty much all unsafe code.
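The suggested shape, fold from the right while prepending to the accumulator, can be sketched in plain Rust. A Vec of integers stands in for Erlang terms here; acc.insert(0, x) plays the role of acc.list_prepend(term).

```rust
// Fold from the right, prepending each element to the accumulator.
// Because rfold visits elements last-to-first, prepending preserves
// the original order in the result.
fn build_list(items: &[i32]) -> Vec<i32> {
    items.iter().rfold(Vec::new(), |mut acc, &x| {
        acc.insert(0, x); // analogue of acc.list_prepend(term)
        acc
    })
}
```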

@josevalim (Contributor):

With the new check:

                unsafe {
                    binary.make_subbinary_unchecked(offset, last_offset - offset)
                }.to_term(env).as_c_arg()

I get this:

sub_unchecked       11.65 K       85.83 μs    ±82.26%       80.71 μs      200.61 μs

It is an improvement but more than 20% slower than the previous 15k. I will convert this to rustler_sys just to be double sure that the difference is exclusively in this function.


Using Term.list_prepend on top of the above, I got this:

list_prepend       10.23 K       97.72 μs   ±124.10%       89.88 μs      304.94 μs

However, I just realized I have been benchmarking in MIX_ENV=dev and I assume MIX_ENV=prod will give different results. So give me a sec. :)

@josevalim (Contributor):

Yeah, I can actually measure those in MIX_ENV=prod too:

Name                ips        average  deviation         median         99th %
unsafe_list     37.44 K       26.71 μs   ±172.22%       22.83 μs      101.17 μs
term_list       23.86 K       41.92 μs   ±123.37%       40.71 μs      105.82 μs

@josevalim (Contributor):

Yes, after more benchmarks the wrapper/rustler_sys version is still ~30% faster. So I would say having something like pub use rustler_sys::enif_make_sub_binary as make_subbinary; in the wrapper/binary.rs module makes sense for this particular use case. WDYT?

I am not sure if it makes sense to keep the unchecked version though if we add the wrapper. :)

Thanks btw for all the back and forth in this, I am learning a lot!

@hansihe (Member) commented Sep 15, 2022:

I am definitely for exporting enif_make_sub_binary for when that is needed.

It would be nice however if we could try to iron out the remaining potential performance issues and see if we can get the higher level APIs on par with the enif_ functions. I could also probably try to get the benchmarks set up myself so we don't have to go back and forth on this here?

The version of rustler you are using doesn't have the #[inline] attribute on the Term.list_prepend function, right? If that is the case then this might be what slows things down.

In Rust, functions not marked with #[inline] will never be inlined across crates (with the exception of LTO). That is why the inline attribute on small functions like this in a hot loop could potentially have a big impact on performance.
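The attribute in question is just an annotation on the function definition. A minimal sketch, with a hypothetical stand-in for a tiny wrapper like Term::list_prepend:

```rust
// Without #[inline] (or LTO), this function could not be inlined when
// called from another crate; since its body is trivial, the call overhead
// would dominate in a hot loop. #[inline] makes the body available for
// inlining at cross-crate call sites.
#[inline]
pub fn make_list_cell(head: u64, tail: u64) -> (u64, u64) {
    (head, tail) // cheap wrapper; hypothetical example
}
```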

@josevalim (Contributor):

This is the branch/PR: elixir-explorer/explorer#346 - notice I am pointing to a local Rustler for now.

Here is the benchmark:

alias Explorer.Series, as: S

strings =
  for _ <- 1..1000 do
    :crypto.strong_rand_bytes(8) |> Base.encode64()
  end

series = S.from_list(strings)

Benchee.run(
  %{
    "flat_map" => fn -> S.to_list(series) end
  },
  time: 5
)

Run it with mix run foo.exs. I will try inlining.

@josevalim (Contributor):

I have added inline to Term.as_c_arg, Binary.to_term, and Binary.make_subbinary_unchecked and saw no difference. :(

@hansihe (Member) commented Sep 18, 2022:

OK, I think we should do the following:

  1. Expose the enif_ functions you need directly for now. This makes it possible for you to make progress on your end @josevalim.
  2. Merge this PR for others to use.
  3. Open a separate performance tracking issue that links to this discussion so that this does not get lost. I would like to try to iron out the performance issues from these APIs where possible.

Does this make sense to you @josevalim @filmor?

@filmor filmor merged commit d89a3d1 into rusterlium:master Sep 18, 2022
@filmor filmor deleted the make-resource-binary branch September 18, 2022 10:41
@josevalim (Contributor):

Beautiful, thank you! I will send a PR for the missing bit!
