Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allows ZendStr to contain null bytes #202

Merged
merged 4 commits into from
Dec 9, 2022

Conversation

ju1ius
Copy link
Contributor

@ju1ius ju1ius commented Nov 26, 2022

Closes #200

Rationale

In PHP zend_strings are binary strings with no encoding information. They can contain any byte at any position.
The current implementation use CString to transfer zend_strings between Rust and PHP, which prevents zend_strings containing null-bytes to roundtrip through the ffi layer. Moreover, ZendStr::new() accepts only a &str, which is incorrect since a zend_string is not required to be valid UTF8.

When reading the PHP source code, it is apparent that most functions marked with ZEND_API that accept a const *char are convenience wrappers that convert the const *char to a zend_string and delegate to another function. For example zend_throw_exception() takes a const *char message, and just converts it to a zend_string before delegating to zend_throw_exception_zstr().

I kept this PR focused around ZendStr and it's usages in the library, but it should be seen as the first step of a more global effort to remove usages of CString everywhere possible.

Also, I didn't change the return type of the string related methods of Zval (e.g. I could have made Zval::set_string()
accept an impl AsRef<[u8]> instead of &str and return () instead of Result<()>). If I get feedback that it should be done in this PR, I'll do it.

Summary of the changes:

ZendStr

  • [BC break]: ZendStr::new() and ZendStr::new_interned() now accept an impl AsRef<[u8]> instead of just &str, and are therefore infaillible (outside of the cases where we panic, e.g. when allocation fails). This is a BC break, but it's impact shouldn't be huge (users will most likely just have to remove a bunch of ? or add a few Ok()).
  • [BC break]: Conversely, ZendStr::as_c_str() now returns a Result<&CStr> since it can fail on strings containing null bytes.
  • [BC break]: ZensStr::as_str() now returns a Result<&str> instead of an Option<&str> since we have to return an error in case of invalid UTF8.
  • adds method ZendStr::as_bytes() to return the underlying byte slice.
  • adds convenience methods ZendStr::as_ptr() and ZendStr::as_mut_ptr() to return raw pointers to the zend_string.

ZendStr conversion traits

  • adds impl AsRef<[u8]> for ZendStr
  • [BC break]: replaces impl TryFrom<String> for ZBox<ZendStr> by impl From<String> for ZBox<ZendStr>.
  • [BC break]: replaces impl TryFrom<&str> for ZBox<ZendStr> by impl From<&str> for ZBox<ZendStr>.
  • [BC break]: replaces impl From<&ZendStr> for &CStr by impl TryFrom<&ZendStr> for &CStr.

Error

  • adds new enum member Error::InvalidUtf8 used when converting a ZendStr to String or &str

@ju1ius
Copy link
Contributor Author

ju1ius commented Nov 26, 2022

A few off-topic remarks:

  1. since persistent zend_strings are almost never what you want, and there is a ZendStr::new_interned constructor, shouldn't we add a ZendStr::new_persistent() constructor and remove the persistent parameter in ZendStr::new() ?
  2. we don't currently tackle the special cases: when the length of the string is either 0 or 1 like in zend_string_init_fast()
  3. I've seen a few well-known frameworks reserving a memory buffer for their error handlers (like so). I guess the goal is for them to be able to run the handlers even when the allowed memory limit is reached. I don't know if this applies for extensions, but maybe panicking on failed allocations is not good citizenship?

WDYT?

@ju1ius
Copy link
Contributor Author

ju1ius commented Nov 26, 2022

@ptondereau I overlooked some of the impls, so this still need a bit of work.

@ju1ius ju1ius marked this pull request as draft November 26, 2022 13:19
@ptondereau ptondereau self-requested a review November 26, 2022 13:22
@ptondereau
Copy link
Collaborator

  1. since persistent zend_strings are almost never what you want, and there is a ZendStr::new_interned constructor, shouldn't we add a ZendStr::new_persistent() constructor and remove the persistent parameter in ZendStr::new() ?

IMHO, this is a code smell to bring a builder, but I agree that we can have a specific constructor for a not so used context.

  1. we don't currently tackle the special cases: when the length of the string is either 0 or 1 like in zend_string_init_fast()

It should be done then.

  1. I've seen a few well-known frameworks reserving a memory buffer for their error handlers (like so). I guess the goal is for them to be able to run the handlers even when the allowed memory limit is reached. I don't know if this applies for extensions, but maybe panicking on failed allocations is not good citizenship?

I guess PHP will crash before Rust have the allocation state and this leads to an unresolved error (that I need to create issue) that leak memory of Rust's allocated stuff. I would go for a wrapping with zend_try+zend_catch in our C wrapper, but maybe this isn't the best solution.

@ju1ius ju1ius force-pushed the ju1ius/zend-str branch 2 times, most recently from 81928ad to f279440 Compare November 26, 2022 14:40
@ju1ius ju1ius marked this pull request as ready for review November 26, 2022 14:56
@ju1ius ju1ius marked this pull request as draft November 26, 2022 15:58
@ju1ius ju1ius marked this pull request as ready for review November 26, 2022 16:12
pub fn as_str(&self) -> Option<&str> {
self.as_c_str().to_str().ok()
pub fn as_str(&self) -> Result<&str> {
std::str::from_utf8(self.as_bytes()).map_err(|_| Error::InvalidUtf8)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wary that we might have a Schlemiel the Painter problem here.
If I remember correctly, there's a flag somewhere at the C-level to indicate that the string has already been utf8-validated (i.e. by the mbstring or pcre extensions), need to investigate...

Copy link
Contributor Author

@ju1ius ju1ius Nov 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it's here and also here.
So we could do something like:

if GC_FLAGS(self) & IS_STR_VALID_UTF8 {
   return std::str::from_utf8_unchecked(...);
}
let s = std::str::from_utf8(self.as_bytes()).map_err(|_| Error::InvalidUtf8)?;
GC_ADD_FLAG(self, IS_STR_VALID_UTF8):
return s;

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires adding stuff to the ffi bindings though (the gc flags related stuff in zend_types.h).
I'm going to leave this off for now and do it in a separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dors this require mbstring to be installed

Copy link
Contributor Author

@ju1ius ju1ius Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dors this require mbstring to be installed

No, the flag is used by mbstring and PCRE but it is defined in zend_types.h.

See this PR for historical context. 😉

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, I think this is a requirement here because It can lead to the same perf regression as in the PHP issue.

Copy link
Contributor Author

@ju1ius ju1ius Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think so too! I proposed to do this in another PR to keep this one focused (and maximize the chances to get it merged quickly), but if you think it should be done here I'm OK with that.
Do you confirm?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was my intention. Sorry and thank you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shall be done then! 😉

@ju1ius ju1ius force-pushed the ju1ius/zend-str branch 2 times, most recently from 2d3cf78 to 19c67be Compare November 27, 2022 05:56
@ju1ius
Copy link
Contributor Author

ju1ius commented Nov 28, 2022

IMHO, this is a code smell to bring a builder, but I agree that we can have a specific constructor for a not so used context.

The idea wasn't to bring a builder, just to remove the persistent parameter and introduce a new constructor function:

impl ZendStr {
  pub fn new(value: impl AsRef<[u8]>) -> Self {}
  pub fn new_persistent(value: impl AsRef<[u8]>) -> Self {}
  pub fn new_interned(value: impl AsRef<[u8]>) -> Self {}
}

This is more a DX improvement. For example this would allow removing the persistent parameter from the conversion traits (FromZval et al.). Since it is only ever use for strings and even then rarely so, I find it cumbersome to have to specify it hen it is never needed (like true.into_zval(false)...).

But this is not needed for this PR, so let's keep things the way they are for now.

@ju1ius
Copy link
Contributor Author

ju1ius commented Nov 28, 2022

3. we don't currently tackle the special cases: when the length of the string is either 0 or 1 like in zend_string_init_fast()

It should be done then.

This requires adding symbols to the ffi bindings (see related constants in zend_string.h).
I'm going to leave this off for now and do it in a separate PR.

Copy link
Collaborator

@ptondereau ptondereau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ptondereau ptondereau changed the title allows ZendStr to contain null bytes feat: allows ZendStr to contain null bytes Dec 9, 2022
@ptondereau ptondereau merged commit d52a878 into davidcole1340:master Dec 9, 2022
@ju1ius ju1ius deleted the ju1ius/zend-str branch December 9, 2022 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZendStr should accept strings containing null bytes.
2 participants