Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create sections encoding for multi-value regions #760

Merged
merged 5 commits into from
Feb 2, 2021
Merged

Conversation

webmaster128
Copy link
Member

Closes #602

TL:DR:

  • Before: value || key || keylen
  • After: section1 || section1_len || section2 || section2_len || section3 || section3_len || …

It turned out that splitting the memory of a Vec into multiple vectors is heavility unsafe and it is better to keep the decoding in two steps:

  1. consume Region into one vector using the exact raw components it was created with,
  2. split the resulting vector into multiple values.

This means we need to do copies. If we split off from right to left, we can keep the original vector as the first element without reallocation, saving 50% of copies for the two element case.

@webmaster128 webmaster128 added this to the 0.14.0 milestone Feb 2, 2021
@ethanfrey
Copy link
Member

Ah, so the length after value is so we can parse from left-to-right.
Makes sense but definitely needs to be documented (also with the method names).

Will look at the code now

Copy link
Member

@ethanfrey ethanfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Added some suggestions to polish

packages/std/src/imports.rs Show resolved Hide resolved
#[allow(dead_code)] // used in Wasm and tests only
pub fn decode_sections2(data: Vec<u8>) -> (Vec<u8>, Vec<u8>) {
let section2_len: usize = if data.len() >= 4 {
u32::from_be_bytes([
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not encode with 2 bytes as now?
We expect/support > 64KB passed as a value for each section?

Although I guess it is a trivial cost to support this and allows us to never worry about size

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not make unnecessary assumptions on data length.

data[data.len() - 1],
]) as usize
} else {
panic!("Cannot read section2 length");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you pull this into a helper function (it is used twice)?
The less code that needs to change to go from decode_sections2 -> decode_sections3 -> decode_sectionsN the better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why not return error instead of panic?
Not that anyone would handle them, but panics don't provide any useful message to the caller, errors help debugging if something did go wrong in the vm side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because Rust's iterator interface does not support errors in next()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 Good point

}

let mut first = data;
let mut second = first.split_off(section1_len_end);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of an lower-level api that would let one easily create decodeN wrappers.

split_tail(data: Vec<u8>) -> StdResult<(Vec<u8>, Vec<u8>)> which would parse out the last section, allocate memory and truncate the input return (head, tail).

one_element(data: Vec<u8>) -> StdResult<Vec<u8>> after splitting off all the N args, this is called for the last (first) item. It parses out the length bytes, ensures they are correct, and then returns the truncated input without a realloc.

We would end up with something like:

pub fn decode_sections2(mut data: Vec<u8>) -> StdResult<(Vec<u8>, Vec<u8>)> {
  let (head, second) = split_tail(data)?;
  let first = one_element(head)?;
  Ok((first, second))
}

pub fn decode_sections3(mut data: Vec<u8>) -> StdResult<(Vec<u8>, Vec<u8>)> {
  let (head, third) = split_tail(data)?;
  let (head, second) = split_tail(head)?;
  let first = one_element(head)?;
  Ok((first, second, third))
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make sense if we intend 3, 4, 5 variants and avoids any bugs in writing those. Not sure if we will ever need those however

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty cool. Not because we need it not but because it explains the algorithm. Implemented. The caller code got even nicer:

pub fn decode_sections2(data: Vec<u8>) -> (Vec<u8>, Vec<u8>) {
    let (rest, second) = split_tail(data);
    let (_, first) = split_tail(rest);
    (first, second)
}

Copy link
Member

@ethanfrey ethanfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice updates. Looking good

]) as usize
let (rest, mut tail) = if rest_len_end == 0 {
// i.e. all data is the tail
(Vec::new(), data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this is 0 alloc, right? so you don't even need to handle the head case differently and still just as efficient. nice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jupp. From Vec::new:

The vector will not allocate until elements are pushed onto it.

More general, a vector only allocates heap when capacity > 0.

@webmaster128 webmaster128 merged commit 8a60668 into main Feb 2, 2021
@webmaster128 webmaster128 deleted the multi-return branch February 2, 2021 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More generic encoding for complex return value
2 participants