Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CStr::from_bytes_with_nul #30614

Merged
merged 4 commits into from
Feb 23, 2016
Merged

CStr::from_bytes_with_nul #30614

merged 4 commits into from
Feb 23, 2016

Conversation

arcnmx
Copy link
Contributor

@arcnmx arcnmx commented Dec 29, 2015

I'm a bit iffy on the return type, but Result would also be a bit weird... The two error cases are Unterminated and InteriorNul(usize).

I considered from_chars(&[c_char]) but this meshes better with as_bytes_with_nul() and Rust code in general.

Should there also be a corresponding unsafe from_bytes_unchecked variant?

@rust-highfive
Copy link
Collaborator

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@arcnmx arcnmx force-pushed the cstr-bytes branch 2 times, most recently from 641ac9e to c5d9806 Compare December 29, 2015 19:06
@brson
Copy link
Contributor

brson commented Dec 29, 2015

cc @rust-lang/libs

@nagisa
Copy link
Member

nagisa commented Dec 29, 2015

See rust-lang/rfcs#1264

@arcnmx
Copy link
Contributor Author

arcnmx commented Dec 29, 2015

@nagisa aha! I do feel like an explicit error type is probably better.

A Cow conversion would actually be quite nice too, but that one maybe belongs outside the standard library. from_bytes -> Result<Cow<CStr>, usize> seems like a little too much magic to me personally, but I wouldn't be entirely opposed to it. An additional method could work too, but I'd have no idea what to call it... from_bytes_cow is a terrible name.

EDIT: On second thought, I can't imagine a case when you wouldn't know beforehand whether your byte slice is terminated or not, and we already have CString::new for that.

@arcnmx arcnmx force-pushed the cstr-bytes branch 4 times, most recently from c9c189e to 9c46d43 Compare December 31, 2015 22:52
@alexcrichton
Copy link
Member

cc #20475, another issue that I believe this would handle.

I'm gonna tag this with T-libs so it comes up during normal triage. I'm not so sure about this API personally, could you elaborate on some of the use cases you have in mind?

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jan 11, 2016
@arcnmx
Copy link
Contributor Author

arcnmx commented Jan 11, 2016

Well, I'd break it into...

  1. It's currently "impossible" to create a &CStr from a byte slice. You can use a transmute but that makes assumptions about the memory layout of the CStr type, which may change in the future to a thin pointer. So from_bytes_unchecked is pretty important to have as a basic building block.
  2. from_bytes is the safe checked version of the above. It's the borrowed version of CString::new, which I find to be useful.
  3. It's essentially a lifetime-safe version of from_ptr, so similar usecases apply. Mapping data inside another slice to a &CStr (for example when zero-copy parsing some binary formats) is a concrete one that's come up for me.
  4. It feels nicer than CStr::from_ptr(b"hello\0".as_ptr() as *const _) for literals. Making it const unsafe would be really nice too, though I'm not sure if that's possible to implement in today's rust?

@alexcrichton
Copy link
Member

It's currently "impossible" to create a &CStr from a byte slice

This is actually somewhat intentional, because the ideal representation of &CStr is just a pointer. The CStr type itself should continue to be unsized, but it's just an implementation detail that the only way to do that today is via [u8]. The intention was that CStr doesn't keep track of the length, and hence it architecturally is indeed impossible to go from &[u8] to &CStr in a "lossless fashion".


I do agree that literal construction is pretty difficult, but that's more along the lines of needs-to-be-a-compiler-plugin than anything else I think to ensure that the nul byte is always at the end. In theory the from_ptr function is a noop (and hence a const candidate), but due to the reasons mentioned above it needs to calculate the length and hence isn't a const candidate.

@arcnmx
Copy link
Contributor Author

arcnmx commented Jan 12, 2016

This is actually somewhat intentional, because the ideal representation of &CStr is just a pointer. The CStr type itself should continue to be unsized, but it's just an implementation detail that the only way to do that today is via [u8].

Whether the &CStr type encodes the length as part of its type or calculates it when necessary is an implementation detail though! That's exactly why it helps to have that method as a forward-compatible way of converting, since an unsafe transmute won't necessarily work in future Rust versions.

The intention was that CStr doesn't keep track of the length, and hence it architecturally is indeed impossible to go from &[u8] to &CStr in a "lossless fashion".

My view would be that as long as you can reconstruct the &[u8] from the &CStr (even if a length calculation is involved), the conversion can be considered to be lossless and the types can be considered semantically equivalent (plus the extra invariants guaranteed by the &CStr type).

I do agree that literal construction is pretty difficult, but that's more along the lines of needs-to-be-a-compiler-plugin than anything else I think to ensure that the nul byte is always at the end.

Well, if we assume a rustc with a more useful const fn story, we could implement it as a macro like so:

macro_rules! cstr {
    ($s:expr) => {
        unsafe { ::std::ffi::CStr::from_bytes_unchecked(concat!($s, "\0").as_bytes()) }
    }
}

The same could be implemented via from_ptr if we can get rid of the length calculation. It's true that's a discussion not really that related to this PR.


Another thought is that we could make from_ptr not involve a length calculation today, and defer it to when it's needed (if someone calls .len(), .as_bytes() or similar). Simply use 0 length as a sentinel that means the length isn't known. Or -1 as usize. And if you feel super adventurous, update the length with a volatile store (on platforms where that's guaranteed to be atomic) once it's calculated upon a call of one of those methods to avoid the check in the future.


It's true that from_bytes_unchecked would be unnecessary from a performance perspective if from_ptr didn't involve a length calculation, as they'll potentially be equivalent in the future. I still feel like the lifetime tie-in is a nice advantage that makes this a useful API in any case. The safe version that checks for interior and trailing null is mainly for convenience.

Also, another case where this API is useful is when creating C strings on the stack in a [u8; MAXPATHLEN], both for the lifetime borrow checking metadata and length check bypass.

@alexcrichton
Copy link
Member

I think one part I'm worried about is that both of these assertions may fail

assert_eq!(CStr::from_bytes(slice).unwrap().to_bytes(), slice);
assert_eq!(CStr::from_bytes_unchecked(slice).to_bytes(), slice);

The first line will truncate slice to the first nul byte when calling to_bytes, as will the second. The second also runs the risk of memory unsafety (e.g. reading extra memory) if there's no nul byte.

Out of curiosity, what's the use case of creating a CStr from a &[u8] in the first place? In general Rust uses slices for passing data around, so the &CStr type should only rarely be used at FFI boundaries at most.

@arcnmx
Copy link
Contributor Author

arcnmx commented Jan 12, 2016

I think one part I'm worried about is that both of these assertions may fail

Ah, do you mean the naming is inconsistent vs .to_bytes_with_nul(), since those assertions will pretty much always fail?

The second also runs the risk of memory unsafety (e.g. reading extra memory) if there's no nul byte.

It's an unsafe method that runs the risk of memory unsafety if you provide it data that doesn't uphold the invariants, that's pretty much the point.

Out of curiosity, what's the use case of creating a CStr from a &[u8] in the first place? In general Rust uses slices for passing data around, so the &CStr type should only rarely be used at FFI boundaries at most.

Mm, FFI is the main place to be using &CStrs. One example out there right now is in nix.

@alexcrichton
Copy link
Member

Oh I'm not worried about naming inconsistency, it just seems that if you construct a CStr from a slice, when you convert it back I would expect the same slice to come back. In this implementation, however, that's not always the case.

@arcnmx
Copy link
Contributor Author

arcnmx commented Jan 15, 2016

Well, if we break it apart...

assert_eq!(CStr::from_bytes(slice).unwrap().to_bytes_with_nul(), slice);

In the current implementation, this will always succeed, if slice has a trailing null byte and no interior bytes. If neither of those are true, the from_bytes/unwrap will fail instead. I'm partial to changing this to only check for a trailing null for performance reasons, which would make it act like you suggest, potentially returning a shorter slice than was originally provided. I'd consider that "defined undefined behaviour", but if you really care about performance that's what the unchecked version is for anyway.

assert_eq!(CStr::from_bytes_unchecked(slice).to_bytes(), slice);

Again, this is unsafe so you have to assert you're maintaining the invariants, or you will indeed get a different slice back, read past allocated memory, etc.

@alexcrichton
Copy link
Member

Ah yes that's correct, I think I misread the implementation of from_bytes as truncating if a zero were found sooner.

Ok, thinking about this some more, I think my confused stemmed from perhaps misconstruing this PR as sort of reinterpreting a CStr as a "general purpose container of bytes." To me &[u8] is indeed this sort of container, but CStr is rather a general purpose container of non-0 bytes. It seems like the motivation here is a safe API that can convert to and from these containers, which seems reasonable to me.

Along those lines, however, I wonder if perhaps a better name would be from_bytes_with_nul instead of just from_bytes. That would clearly match up with to_bytes_with_nul, and those I believe are indeed the symmetric values here.

@arcnmx
Copy link
Contributor Author

arcnmx commented Jan 19, 2016

CStr is rather a general purpose container of non-0 bytes

Basically this, and it still also applies if in the future the type is changed to not literally store its length alongside the pointer. Though I'd argue that CStr is a general purpose container of non-0 bytes that ends in a terminating 0 byte and that to_bytes is the wrong default; to_bytes()/to_inner_bytes() would've made more sense IMO, but it's too late to argue that now.

I wonder if perhaps a better name would be from_bytes_with_nul instead of just from_bytes. That would clearly match up with to_bytes_with_nul

Unfortunate naming, but I agree that it would be more consistent/symmetrical. I mean damn, from_bytes_with_nul_unchecked(data) is the same amount of typing as from_ptr(data.as_ptr() as *const _)... Unless from_bytes_unchecked can still sort of apply maybe...

@alexcrichton
Copy link
Member

Ah well, we'll discuss this at the next libs meeting regardless and I'm sure we'll have some feedback as well :)

@alexcrichton
Copy link
Member

The libs team discussed this during triage yesterday and the conclusion was that we're ok with these methods so long as they have the with_nul suffix. @arcnmx would you be ok implementing that?

@arcnmx
Copy link
Contributor Author

arcnmx commented Jan 25, 2016

@alexcrichton like this? ab136bf623e291a51067c70ec6c8a17cab0ac8e8

/// ```
#[unstable(feature = "cstr_from_bytes", reason = "recently added", issue = "0")]
pub fn from_bytes_with_nul<'a>(bytes: &'a [u8]) -> Option<&'a CStr> {
if bytes.is_empty() || memchr::memchr(0, &bytes) != Some(bytes.len() - 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't bytes.is_empty() redundant?

EDIT: I see, it's needed to avoid an underflow in bytes.len() - 1

@alexcrichton
Copy link
Member

Thanks @arcnmx! Could you also hook this up to #31190 (the tracking issue) and add a few tests as well?

/// Creates a C string wrapper from a byte slice.
///
/// This function will cast the provided `bytes` to a `CStr` wrapper after
/// ensuring that it is null terminated but does not contain any interior
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"and does not contain"?

/// # }
/// ```
#[unstable(feature = "cstr_from_bytes", reason = "recently added", issue = "0")]
pub fn from_bytes_with_nul<'a>(bytes: &'a [u8]) -> Option<&'a CStr> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does lifetime elision not work here?

@alexcrichton alexcrichton changed the title CStr::from_bytes CStr::from_bytes_with_nul Jan 27, 2016
@alexcrichton alexcrichton removed the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jan 27, 2016
@arcnmx
Copy link
Contributor Author

arcnmx commented Feb 23, 2016

Sorry that took so long...

@alexcrichton
Copy link
Member

@bors: r+ 71f29cd

@bors
Copy link
Contributor

bors commented Feb 23, 2016

⌛ Testing commit 71f29cd with merge 281f9d8...

bors added a commit that referenced this pull request Feb 23, 2016
I'm a bit iffy on the return type, but `Result` would also be a bit weird... The two error cases are `Unterminated` and `InteriorNul(usize)`.

I considered `from_chars(&[c_char])` but this meshes better with `as_bytes_with_nul()` and Rust code in general.

Should there also be a corresponding unsafe `from_bytes_unchecked` variant?
@bors bors merged commit 71f29cd into rust-lang:master Feb 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants