
Implement ResourceArc::make_binary #487

Merged Sep 18, 2022 (6 commits)

Conversation

@filmor (Member) commented Sep 11, 2022:

No description provided.

@filmor (Member, Author) commented Sep 11, 2022:

@josevalim FYI

@josevalim (Contributor):

@filmor looks great! 👍

[Resolved review comment on rustler/src/resource.rs]
@josevalim (Contributor):

Btw, for completeness, I don't think this would work for my use case, so we would still need the low-level APIs. I am allocating multiple binaries from the same resource:

https://github.com/elixir-nx/explorer/pull/346/files#diff-3cb6b1439fef8fae9e6949e097296b4a3b17c8761ec9d8e98c0b4fa813d362f3R224

Well, maybe you could allow a Vec of slices to be returned? But also note in the example above that I proceed to create sub binaries from the resource binary, and even if you changed this function to create multiple binaries, it would require me doing an additional pass on the data.

In any case, I think this is still solid for a convenient, safer API. :) Even though I don't think my opinion is worth much here.

@hansihe (Member) commented Sep 11, 2022:

Couldn't you just call make_binary first, returning array.values().as_slice(), then call make_subbinary on that inside your iter.rfold? Aside from some bounds checks, I think that should be equivalent to what your code does today.

@josevalim (Contributor):

@hansihe that's a chunked array, so array.values() returns a vector of resource binaries. I can't convert array.values().as_slice() to a resource binary.

[Resolved review comment on rustler/src/resource.rs]
@hansihe (Member) commented Sep 11, 2022:

This API is a really good idea by the way @filmor!

@hansihe (Member) commented Sep 11, 2022:

@josevalim Ahh, I see what you mean

@filmor (Member, Author) commented Sep 11, 2022:

How about an unsafe variant of the same function without the lifetime bound? That way you at least don't have to fiddle with so many pointers.

@hansihe (Member) commented Sep 11, 2022:

I think adding a fn make_binary_unsafe<'env, 'a>(&self, env: Env<'env>, data: &'a [u8]) -> Binary might be better. If we don't enforce the lifetime, I don't think there would be any need to have it go through a callback?

We need to make the documentation for that function very scary.

@josevalim (Contributor):

@filmor that would work but I may send a PR for Term::make_sub_binary so we skip the binary overhead in the tight loop, if that's ok?

@filmor (Member, Author) commented Sep 12, 2022:

> I think adding a fn make_binary_unsafe<'env, 'a>(&self, env: Env<'env>, data: &'a [u8]) -> Binary might be better. If we don't enforce the lifetime, I don't think there would be any need to have it go through a callback?

Having the same signature helps to clarify the difference, IMO. It also makes it clear that the binary's lifetime should be connected to the referenced object.

> @filmor that would work but I may send a PR for Term::make_sub_binary so we skip the binary overhead in the tight loop, if that's ok?

What overhead are you referring to? Creating a Binary from a Term in an unsafe way (like we are doing here) should have no measurable overhead, it's just an additional stack operation. And it clarifies for make_sub_binary at compile-time that we know that the referenced term is a binary. The only additional overhead I could think of is from the bounds-checking in Binary::make_subbinary, but that could also be lifted with an unsafe variant.
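The bounds-check trade-off discussed here can be illustrated in plain Rust. This is a sketch only: an ordinary byte slice stands in for a NIF binary, and the function names are hypothetical, not rustler's actual API.

```rust
// Checked variant: returns None when offset + len exceeds the data,
// mirroring what a safe make_subbinary must verify on every call.
fn sub_checked(data: &[u8], offset: usize, len: usize) -> Option<&[u8]> {
    data.get(offset..offset + len)
}

/// Unchecked variant: skips the bounds check entirely.
///
/// # Safety
/// The caller must guarantee `offset + len <= data.len()`.
unsafe fn sub_unchecked(data: &[u8], offset: usize, len: usize) -> &[u8] {
    data.get_unchecked(offset..offset + len)
}
```

In a tight loop where the caller already knows the offsets are valid (as in the rfold below, where offsets come from the array itself), the unchecked variant removes a per-iteration branch at the cost of an unsafe contract.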

@josevalim (Contributor):

Let’s merge this with the unsafe version and I can give it a try and benchmark. But last I checked it was measurable. :)

@josevalim (Contributor) commented Sep 12, 2022:

> Let’s merge this with the unsafe version and I can give it a try and benchmark. But last I checked it was measurable. :)

I was actually able to reproduce the benchmark with the current code. Here is the current code:

https://github.com/elixir-nx/explorer/blob/ab0ff323c2bd0d5bae69424dabac83268a50c2f6/native/explorer/src/encoding.rs#L235-L249

Here is the patched version:

        iter.rfold(acc, |acc, uncast_offset| {
            let offset = *uncast_offset as NIF_TERM;

            let term = if validity_iter.next_back().unwrap_or(true) {
                unsafe {
                    rustler_sys::enif_make_sub_binary(
                        env_as_c_arg,
                        binary_as_c_arg,
                        offset,
                        last_offset - offset,
                    )
                }
            } else {
                nil_atom.as_c_arg()
            };

            last_offset = offset;
            unsafe { list::make_list_cell(env_as_c_arg, term, acc) }
        })

Here are the benchmark numbers:

Name               ips        average  deviation         median         99th %
new            15.02 K       66.57 μs    ±69.59%       67.08 μs      109.05 μs
old             6.41 K      155.93 μs    ±42.30%      152.58 μs      227.04 μs

I have also removed the bounds checks to see if they are the root cause, and there is no meaningful difference in the numbers:

Name               ips        average  deviation         median         99th %
nobound         6.38 K      156.62 μs    ±38.62%      152.19 μs      289.57 μs

And for completeness, here is the benchmark:

alias Explorer.Series, as: S

strings =
  for _ <- 1..1000 do
    :crypto.strong_rand_bytes(8) |> Base.encode64()
  end

series = S.from_list(strings)

Benchee.run(
  %{"new" => fn -> S.to_list(series) end},
  time: 5
)

In a nutshell, each call to to_list will create 1k binaries. In the benchmark, we are creating 6M-15M subbinaries per second, and that's why those small changes, such as Term and Binary wrapping, show up, which is why I have been asking for more direct rustler::wrapper functions. :) That would make a difference for us.
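The throughput figures quoted above follow directly from the benchmark numbers; a quick sanity check (with a hypothetical helper name, just for illustration):

```rust
// Iterations per second times binaries created per call gives the
// sub-binaries-per-second rate quoted in the comment above.
fn subbinaries_per_sec(ips: f64, binaries_per_call: f64) -> f64 {
    ips * binaries_per_call
}
```

With the numbers from the benchmark, subbinaries_per_sec(15.02e3, 1000.0) is about 15M/s for the patched version and subbinaries_per_sec(6.41e3, 1000.0) about 6.4M/s for the old one.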

@filmor (Member, Author) commented Sep 12, 2022:

Well, we have this pending PR to introduce inlining, maybe that does it already?

@josevalim (Contributor):

@filmor I have just tried the PR; it actually made things slower?

Name               ips        average  deviation         median         99th %
inline          5.56 K      179.91 μs   ±163.61%      152.54 μs      857.17 μs 

My understanding from other languages is that inlining can sometimes have a negative impact, by making the code larger so it no longer fits in the cache, or by making code reuse harder (so code gets purged more frequently). I am not sure how much this applies to Rust.

@hansihe (Member) commented Sep 12, 2022:

I looked into this a little bit, and I think the main difference in performance between using enif_make_sub_binary directly vs the make_subbinary function is due to make_subbinary also calling enif_inspect_binary.

You can see that Binary::make_subbinary calls Binary::from_term, which again calls enif_inspect_binary to get the ErlNifBinary data from it.

As far as I can see, we only really use the ErlNifBinary to get the data pointer and length. If we instead stored those directly in the Binary struct, we could avoid calling enif_inspect_binary when making subbinaries.
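The idea of caching the data pointer and length at construction can be sketched as follows. This is a mock over a Rust slice, not rustler's real Binary struct; the point is that the "inspect" step happens once, and sub-binary creation only checks against the cached length.

```rust
// Mock Binary that caches what enif_inspect_binary would yield
// (data pointer and length), represented here as a slice.
#[derive(Clone, Copy)]
struct Binary<'a> {
    buf: &'a [u8],
}

impl<'a> Binary<'a> {
    fn new(buf: &'a [u8]) -> Self {
        // In rustler, this is where the one-time inspection would happen.
        Binary { buf }
    }

    // No re-inspection needed: bounds are validated against the cached length.
    fn make_subbinary(&self, offset: usize, len: usize) -> Option<Binary<'a>> {
        self.buf.get(offset..offset + len).map(|b| Binary { buf: b })
    }
}
```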

@josevalim (Contributor):

@hansihe nice find! We could apply the same idea to this PR as well. There is no reason to inspect it if we have the term, pointer and size upfront.

@hansihe (Member) commented Sep 12, 2022:

Made a PR for the change

#488

@josevalim (Contributor):

@hansihe awesome! It brings the benchmark up to:

Name               ips        average  deviation         median         99th %
new branch     10.05 K       99.54 μs    ±47.18%         100 μs      270.91 μs

@filmor (Member, Author) commented Sep 12, 2022:

> @hansihe nice find! We could apply the same idea to this PR as well. There is no reason to inspect it if we have the term, pointer and size upfront.

True, will do that later today. We just need to add an additional unsafe private constructor to Binary for the (very specific :)) case of having term, pointer and size available.

@hansihe (Member) commented Sep 12, 2022:

I think the remaining overhead might be code bloat from panic handling for when unwrap is called on the result from make_subbinary.

Introducing and using a make_subbinary_unchecked seems to result in pretty much the exact same assembly code output for that function.

@filmor filmor force-pushed the make-resource-binary branch from fc9a709 to 4c92d8f Compare September 13, 2022 06:44
@filmor filmor marked this pull request as ready for review September 13, 2022 08:44
@hansihe (Member) left a review comment:

Looks good to me in general!

I would want some additional info in the docs for the unsafe variant.

[Resolved review comment on rustler/src/resource.rs]
@filmor (Member, Author) commented Sep 13, 2022:

@josevalim Can you check with your benchmark whether this is now acceptable without using the direct enif_ functions? :)

@hansihe (Member) commented Sep 13, 2022:

I think we might still need to have a make_subbinary_unchecked to match the performance. I think the current state should be roughly equivalent to the "new branch" benchmark posted earlier?

@josevalim (Contributor):

@hansihe correct. I am on 10k with this branch, the code looks super clean though. :)

@filmor (Member, Author) commented Sep 14, 2022:

@josevalim I added the unchecked variant and extended the comments a bit, will that do now? :)

@filmor (Member, Author) commented Sep 14, 2022:

Looking at your code from the branch again, you can make it even safer by using the direct term methods: instead of list::make_list_cell you could keep everything in a Term and do acc.list_prepend(term). That should allow you to ditch pretty much all unsafe code.
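The suggested shape, fold from the right while prepending to the accumulator, can be sketched in plain Rust. A Vec of integers stands in for Erlang terms here; acc.insert(0, x) plays the role of acc.list_prepend(term).

```rust
// Fold from the right, prepending each element to the accumulator.
// Because rfold visits elements last-to-first, prepending preserves
// the original order in the result.
fn build_list(items: &[i32]) -> Vec<i32> {
    items.iter().rfold(Vec::new(), |mut acc, &x| {
        acc.insert(0, x); // analogue of acc.list_prepend(term)
        acc
    })
}
```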

@josevalim (Contributor):

With the new check:

                unsafe {
                    binary.make_subbinary_unchecked(offset, last_offset - offset)
                }.to_term(env).as_c_arg()

I get this:

sub_unchecked       11.65 K       85.83 μs    ±82.26%       80.71 μs      200.61 μs

It is an improvement but more than 20% slower than the previous 15k. I will convert this to rustler_sys just to be double sure that the difference is exclusively in this function.


Using Term.list_prepend on top of the above, I got this:

list_prepend       10.23 K       97.72 μs   ±124.10%       89.88 μs      304.94 μs

However, I just realized I have been benchmarking in MIX_ENV=dev and I assume MIX_ENV=prod will give different results. So give me a sec. :)

@josevalim (Contributor):

Yeah, I can actually measure those in MIX_ENV=prod too:

Name                ips        average  deviation         median         99th %
unsafe_list     37.44 K       26.71 μs   ±172.22%       22.83 μs      101.17 μs
term_list       23.86 K       41.92 μs   ±123.37%       40.71 μs      105.82 μs

@josevalim (Contributor):

Yes, after more benchmarks the wrapper/rustler_sys version is still ~30% faster. So I would say having something like pub use rustler_sys::enif_make_sub_binary as make_subbinary; in the wrapper/binary.rs module makes sense for this particular use case. WDYT?

I am not sure if it makes sense to keep the unchecked version though if we add the wrapper. :)

Thanks btw for all the back and forth in this, I am learning a lot!

@hansihe (Member) commented Sep 15, 2022:

I am definitely for exporting enif_make_sub_binary for when that is needed.

It would be nice however if we could try to iron out the remaining potential performance issues and see if we can get the higher level APIs on par with the enif_ functions. I could also probably try to get the benchmarks set up myself so we don't have to go back and forth on this here?

The version of rustler you are using doesn't have the #[inline] attribute on the Term.list_prepend function, right? If that is the case then this might be what slows things down.

In Rust, functions not marked with #[inline] will never be inlined across crates (with the exception of LTO). That is why the inline attribute on small functions like this in a hot loop could potentially have a big impact on performance.
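The attribute in question is just an annotation on the function definition. A minimal sketch, with a hypothetical stand-in for a tiny wrapper like Term::list_prepend:

```rust
// Without #[inline] (or LTO), this function could not be inlined when
// called from another crate; since its body is trivial, the call overhead
// would dominate in a hot loop. #[inline] makes the body available for
// inlining at cross-crate call sites.
#[inline]
pub fn make_list_cell(head: u64, tail: u64) -> (u64, u64) {
    (head, tail) // cheap wrapper; hypothetical example
}
```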

@josevalim (Contributor):

This is the branch/PR: elixir-explorer/explorer#346 - notice I am pointing to a local Rustler for now.

Here is the benchmark:

alias Explorer.Series, as: S

strings =
  for _ <- 1..1000 do
    :crypto.strong_rand_bytes(8) |> Base.encode64()
  end

series = S.from_list(strings)

Benchee.run(
  %{
    "flat_map" => fn -> S.to_list(series) end
  },
  time: 5
)

Run it with mix run foo.exs. I will try inlining.

@josevalim (Contributor):

I have added inline to Term.as_c_arg, Binary.to_term, and Binary.make_subbinary_unchecked and saw no difference. :(

@hansihe (Member) commented Sep 18, 2022:

OK, I think we should do the following:

  1. Expose the enif_ functions you need directly for now. This makes it possible for you to make progress on your end @josevalim.
  2. Merge this PR for others to use.
  3. Open a separate performance tracking issue that links to this discussion so that this does not get lost. I would like to try to iron out the performance issues from these APIs where possible.

Does this make sense to you @josevalim @filmor?

@filmor filmor merged commit d89a3d1 into rusterlium:master Sep 18, 2022
@filmor filmor deleted the make-resource-binary branch September 18, 2022 10:41
@josevalim (Contributor):

Beautiful, thank you! I will send a PR for the missing bit!
