Avoid double allocation when passing strings via `IntoParam` (#1713)
```diff
@@ -45,7 +45,7 @@ unsafe impl Abi for PCWSTR {
 #[cfg(feature = "alloc")]
 impl<'a> IntoParam<'a, PCWSTR> for &str {
     fn into_param(self) -> Param<'a, PCWSTR> {
-        Param::Boxed(PCWSTR(heap_string(&self.encode_utf16().collect::<alloc::vec::Vec<u16>>())))
+        Param::Boxed(PCWSTR(unsafe { string_from_iter(self.encode_utf16(), self.len()) }))
     }
 }
 #[cfg(feature = "alloc")]
```

---

What if you just embrace the double encoding and pass the exact count: `string_from_iter(self.encode_utf16(), self.encode_utf16().count())`?

---

We would need to iterate over it twice, which provides some memory savings at the cost of run-time computation. The amount of memory we waste is 0% if the string consists of ASCII only. Given that most strings are ASCII or contain mostly ASCII, I believe over-allocating is a reasonable choice.

When considering memory waste, we also have to look at the intended use of this API. In a heavily resource-constrained Windows environment such as Nano Server, with a minimum requirement of 512 MB, heap allocation is presumably frowned upon anyway, as is the case in most embedded environments. I don't think this convenience API is intended for Windows Nano environments, and even if it were, for small strings the waste is still comparatively low.

Given the 2 GB memory requirement of Windows, the only way in which string allocation would become worrisome is with strings approaching 512 MB in length. Aside from the fact that it would be ludicrous to keep such huge strings in memory and not maintain/convert them yourself for performance reasons, these strings are also only allocated for the duration of the function call; they are freed immediately afterwards.

Given the use case, the small share of PCs with less than 4 GB of memory, and the fact that most strings don't occupy some ludicrous amount of space, I'd say that over-allocating here is a reasonable choice to make. The main practical source of BMP or supplementary code points is text in foreign languages, which, thanks to their larger character repertoires, convey more entropy per character and therefore tend to be shorter anyway.

That being said, allocating the optimal length is still a desirable goal. Additionally, reading the memory while decoding the Unicode code points will likely pull it into the processor cache, which should make a second iteration faster. I'll rewrite something later that takes advantage of this.
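To put a number on that waste argument, here is a small self-contained check (mine, not part of the PR) comparing the `len()`-sized reservation against the exact `encode_utf16()` count:

```rust
fn main() {
    // Reserving s.len() u16 units (the UTF-8 byte count) versus the number
    // of u16 units actually produced: 0% waste for pure ASCII, more for
    // multi-byte text.
    for s in ["pure ascii", "Grüße", "日本語"] {
        let reserved = s.len(); // upper bound used by the PR
        let used = s.encode_utf16().count(); // exact count
        println!("{s:?}: reserved {reserved}, used {used}, wasted {}", reserved - used);
    }
}
```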
```diff
@@ -58,7 +58,7 @@ impl<'a> IntoParam<'a, PCWSTR> for alloc::string::String {
 impl<'a> IntoParam<'a, PCWSTR> for &::std::ffi::OsStr {
     fn into_param(self) -> Param<'a, PCWSTR> {
         use ::std::os::windows::ffi::OsStrExt;
-        Param::Boxed(PCWSTR(heap_string(&self.encode_wide().collect::<alloc::vec::Vec<u16>>())))
+        Param::Boxed(PCWSTR(unsafe { string_from_iter(self.encode_wide(), self.len()) }))
     }
 }
 #[cfg(feature = "alloc")]
```
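The `string_from_iter` helper itself isn't visible in these hunks. As a rough sketch of the shape under discussion, assuming the second argument is an upper bound on the number of `u16` units the iterator yields (this uses `std::alloc` for self-containment; the crate's real helper presumably allocates via `HeapAlloc` and may differ in detail):

```rust
use std::alloc::{alloc, Layout};

/// Sketch only, not the PR's actual code. Allocates room for `len + 1`
/// u16 units, copies the iterator's output, and null-terminates.
///
/// Safety: `len` must be an upper bound on the number of items `iter`
/// yields; otherwise the writes would run past the allocation.
unsafe fn string_from_iter<I: Iterator<Item = u16>>(iter: I, len: usize) -> *const u16 {
    let layout = Layout::array::<u16>(len + 1).unwrap();
    let buffer = alloc(layout) as *mut u16;
    assert!(!buffer.is_null(), "allocation failed");
    let mut offset = 0;
    for unit in iter {
        debug_assert!(offset < len, "iterator exceeded the promised bound");
        buffer.add(offset).write(unit);
        offset += 1;
    }
    buffer.add(offset).write(0); // null terminator, as PCWSTR expects
    buffer
}
```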
---

Will `self.len() >= self.encode_utf16().count()` always hold true? Intuitively it seems like it should: a 3-byte UTF-8 sequence representing two code points could be represented by either two `u16` values if they combine, or three `u16` values if they map 1:1. Worst case is that it over-allocates.

Edit: Based on everything I've been able to find, the worst case is over-allocation.
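A quick empirical spot-check of the inequality (my example, not from the thread), covering ASCII, 2- and 3-byte UTF-8, and a supplementary code point:

```rust
fn main() {
    for s in ["hello", "héllo", "日本語", "𝄞 clef"] {
        let utf8_len = s.len(); // bytes
        let utf16_len = s.encode_utf16().count(); // u16 code units
        // Holds for all four classes; '𝄞' takes 4 UTF-8 bytes but only
        // 2 UTF-16 units (a surrogate pair).
        assert!(utf16_len <= utf8_len);
    }
}
```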
---

I'd been banging my head against this for a while now, too. It seems to hold true, so long as `encode_utf16()` doesn't do anything crazy, like decomposing code points.

The rationale I came up with is this: each Unicode code point can be encoded as no more than two UTF-16 code units. For the inequality to hold, we can thus ignore all UTF-8 sequences that consist of two or more code units (bytes). That leaves the set of code points that require a single UTF-8 code unit; those are the 128 ASCII characters, which are encoded as a single code unit in UTF-16 as well. In summary: for every code point, the number of UTF-16 code units never exceeds the number of UTF-8 bytes.

While I couldn't find any formal documentation on what `encode_utf16()` does and doesn't do, the example listed under `encode_utf16()` at least seems to confirm the reasoning above: `assert!(utf16_len <= utf8_len);`.

Still, I'm a bit uneasy about all of this. Personally, I'd prefer a `string_from_iter()` that takes an iterator only, wastes some cycles on `count()` (ultimately doing the de-/encoding twice), and drops the `unsafe` along the way (sketched below).

Thoughts?
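A sketch of that unsafe-free variant (hypothetical; `string_from_iter_safe` and its closure-based signature are my invention to make the idea compile stand-alone):

```rust
// Take something that can produce the iterator twice: once to count,
// once to fill. No caller-supplied length bound, hence no `unsafe`.
fn string_from_iter_safe<I>(make_iter: impl Fn() -> I) -> Vec<u16>
where
    I: Iterator<Item = u16>,
{
    let len = make_iter().count(); // first pass: exact u16 count
    let mut buf = Vec::with_capacity(len + 1);
    buf.extend(make_iter()); // second pass: fill
    buf.push(0); // null terminator
    buf
}

fn main() {
    let s = "héllo";
    let wide = string_from_iter_safe(|| s.encode_utf16());
    assert_eq!(wide.len(), s.encode_utf16().count() + 1);
}
```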
---

Yes, it will, given three assumptions:

1. Each UTF-8 sequence produces at most as many UTF-16 code units as it has bytes:
   - 1 UTF-8 byte => 1 UTF-16 code unit
   - 2 UTF-8 bytes => 1 UTF-16 code unit
   - 3 UTF-8 bytes => 1 UTF-16 code unit
   - 4 UTF-8 bytes => 2 UTF-16 code units
2. No (de)normalization takes place, i.e. no decomposing or composing character transformations (e.g. A + ¨: LATIN CAPITAL LETTER A (U+0041) + COMBINING DIAERESIS (U+0308)). There is no indication in the Rust standard library documentation that any such normalization or transformation takes place, and looking at the source code validates that assumption.
3. The input is valid UTF-8. We might have to look into edge cases around `OsStr`, since it consists of WTF-8 strings; however, even WTF-8 strings uphold this invariant, because an unpaired surrogate (3 WTF-8 bytes) still shrinks to 1 UTF-16 code unit, and supplementary code points take 2 UTF-16 code units but require 4 bytes.

I strongly believe an overrun should not be possible, but if there is concern we could also safeguard against it, either by truncating at the maximum number of UTF-16 code units or by panicking if it were to occur.
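The per-code-point mapping in assumption 1 can be checked directly with `char::len_utf8` and `char::len_utf16`:

```rust
fn main() {
    // One code point per UTF-8 length class:
    // 1 byte -> 1 unit, 2 bytes -> 1, 3 bytes -> 1, 4 bytes -> 2.
    for (c, utf8_bytes, utf16_units) in [('A', 1, 1), ('é', 2, 1), ('€', 3, 1), ('😀', 4, 2)] {
        assert_eq!(c.len_utf8(), utf8_bytes);
        assert_eq!(c.len_utf16(), utf16_units);
    }
}
```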
---

Is there a guarantee that an iterator can be iterated twice?

As for normalization: I doubt there's any de-/normalization taking place in `encode_utf16()`; I'd be willing to bet dozens of dollars on it! Survey says... there's no denormalization, just bit twiddling as expected: https://doc.rust-lang.org/stable/src/core/char/methods.rs.html#1706

Also, the example in the Rust documentation for `encode_utf16()` indicates that the UTF-16 `u16` count is never greater than the UTF-8 byte count: https://doc.rust-lang.org/stable/src/core/str/mod.rs.html#998
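The no-normalization claim is easy to demonstrate with the decomposed A + diaeresis pair mentioned earlier (my example):

```rust
fn main() {
    // Decomposed "Ä": LATIN CAPITAL LETTER A + COMBINING DIAERESIS.
    let s = "A\u{0308}";
    let units: Vec<u16> = s.encode_utf16().collect();
    // encode_utf16 performs no composition: both code points survive 1:1.
    assert_eq!(units, [0x0041, 0x0308]);
    // And the inequality holds: 2 u16 units vs. 3 UTF-8 bytes.
    assert!(units.len() <= s.len());
}
```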
---

Nice find. Even though it's just an example, I'd say it was intended as a guarantee.

The way I see it, there are three ways in which we can resolve this. To be honest, I'm not very happy about this function to begin with. I designed it the way I did in order to remain "backwards compatible" with `heap_string`. We are currently generic over `T`, with an iterator we can only iterate once.

Why do we need to be generic over `T` at all? The only strings the Windows API uses are UTF-8 and UTF-16, and some day there might be a use case for UTF-32; these have an alignment of 1, 2 and 4, respectively. Because the code is generic, we assert that the alignment of the output buffer matches the alignment of `T`. However, given that `HeapAlloc` guarantees an alignment of at least 8 and that we only ever need an alignment of 4, that check is fairly redundant.

As it stands, this API has two inputs, `&OsStr` and `&str`, and two outputs, `*const u8` and `*const u16`. We could create two functions instead, `heap_string_narrow` and `heap_string_wide`, for `*const u8` and `*const u16` respectively (see the sketch below). This would allow us to iterate over the string any number of times and ensure we stay within bounds.

As I was writing this I actually had an even better idea; hold on, let me implement it.
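A hypothetical sketch of that two-function split (names taken from the comment above; the real helpers would hand out `HeapAlloc`'d raw pointers, and `Vec` is used here only to keep the sketch safe and self-contained):

```rust
// Taking the &str directly means it can be re-encoded any number of
// times: once to count, once to fill, always within bounds.
fn heap_string_wide(s: &str) -> Vec<u16> {
    let len = s.encode_utf16().count();
    let mut buf = Vec::with_capacity(len + 1);
    buf.extend(s.encode_utf16());
    buf.push(0); // null terminator, as PCWSTR expects
    buf
}

fn heap_string_narrow(s: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(s.len() + 1);
    buf.extend_from_slice(s.as_bytes());
    buf.push(0); // null terminator, as PCSTR expects
    buf
}
```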