Avoid double allocation when passing strings via IntoParam
#1713
Changes from 6 commits
@@ -45,7 +45,7 @@ unsafe impl Abi for PCWSTR {
 #[cfg(feature = "alloc")]
 impl<'a> IntoParam<'a, PCWSTR> for &str {
     fn into_param(self) -> Param<'a, PCWSTR> {
-        Param::Boxed(PCWSTR(heap_string(&self.encode_utf16().collect::<alloc::vec::Vec<u16>>())))
+        Param::Boxed(PCWSTR(unsafe { string_from_iter(self.encode_utf16(), self.len()) }))
Will …?

Edit: Based on everything I've been able to find, the worst case is over-allocation.

I'd been banging my head at this for a while now, too. It seems to hold true, so long as the length passed in is the UTF-8 length in bytes (which is what str::len() returns). The rationale I came up with is this: each Unicode code point can be encoded as no more than two UTF-16 code units. For the inequality to hold, we can thus ignore all UTF-8 sequences that consist of two or more code units, since for those the UTF-16 encoding can never be longer than the UTF-8 one. That leaves the set of code points that require a single UTF-8 code unit. Those are the 128 ASCII characters, which are all encoded using a single code unit in UTF-16 as well. In summary: for any string, the number of UTF-16 code units never exceeds the number of UTF-8 code units.
While I couldn't find any formal documentation on what … I'm still a bit uneasy about all of this. Personally, I'd prefer a … Thoughts?

Yes, it will, given three assumptions:

1 UTF-8 character => 1 UTF-16 character
…

I strongly believe that exceeding the length should never happen in practice, but if there is concern we could also safeguard against it, either by truncating at the maximum number of UTF-16 code units or by panicking if it were to occur.
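To make the case analysis above concrete, here is a small standalone check (an illustration, not part of the PR) covering the three kinds of UTF-8 sequence lengths:

```rust
fn main() {
    // 1-byte UTF-8 (ASCII): one UTF-8 code unit, one UTF-16 code unit.
    assert_eq!("a".len(), 1);
    assert_eq!("a".encode_utf16().count(), 1);

    // 2- and 3-byte UTF-8 sequences still need only one UTF-16 code unit.
    assert_eq!("é".len(), 2);
    assert_eq!("é".encode_utf16().count(), 1);
    assert_eq!("語".len(), 3);
    assert_eq!("語".encode_utf16().count(), 1);

    // 4-byte UTF-8 (supplementary plane) needs a surrogate pair: two units,
    // which is still no more than the four UTF-8 bytes.
    assert_eq!("😀".len(), 4);
    assert_eq!("😀".encode_utf16().count(), 2);
}
```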
Is there a guarantee that an iterator can be iterated twice? I doubt there's any de/normalization taking place in encode_utf16. Also, the Rust documentation for encode_utf16 contains an example that relies on exactly this inequality.
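Presumably the documentation example being referred to is the one on str::encode_utf16, which asserts exactly this relationship (paraphrased here from the std docs):

```rust
let text = "Zażółć gęślą jaźń";

let utf8_len = text.len();
let utf16_len = text.encode_utf16().count();

// The UTF-16 code unit count never exceeds the UTF-8 byte length.
assert!(utf16_len <= utf8_len);
```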
Nice find. Even though it's just an example, I'd say it was intended as a guarantee. The way I see it, there are three ways in which we can resolve this: …
To be honest, I'm not very happy about this function to begin with. I designed it the way I did in order to remain "backwards compatible" with … Why do we need to be generic over T? The only strings that the Windows API uses are UTF-8 and UTF-16, and some day there might be a use case for UTF-32. These have an alignment of 1, 2 and 4 respectively. As we are generic currently, we assert that the alignment of the output buffer matches the alignment of T. However, given that … As it stands, this API has the following inputs: … As I was writing this I actually had an even better idea; hold on, let me implement it.

What if you just embrace the fact that you know the exact count:

string_from_iter(self.encode_utf16(), self.encode_utf16().count())

We would need to iterate over it twice, which provides some memory savings at the cost of run-time computation. The amount of memory we waste is 0% if the string consists of ASCII only. Given that most strings are ASCII or contain mostly ASCII, I believe over-allocating is a reasonable choice. When considering memory waste, we also have to look at the intended use. In a heavily resource-constrained Windows environment such as Nano Server, with a minimum requirement of 512 MB, heap allocation is presumably frowned upon, as is the case in most embedded environments. But I don't think this convenience API is intended for Windows Nano environments, and even if it were, for small strings the waste is still comparatively low. Given the 2 GB memory requirement of Windows, the only way in which string allocation would be worrisome would be strings with a length of around 512 MB. Aside from the fact that it would be ludicrous to keep such huge strings in memory and not maintain/convert them yourself for performance reasons, these strings are also only allocated for the duration of the function call; they are freed immediately afterwards. Given the use case, the very small share of PCs with less than 4 GB of memory, and the fact that most strings aren't of some ludicrous size, I'd say that over-allocating here is a reasonable choice to make. The main practical source of supplementary code points, or non-ASCII BMP code points, is foreign-language text, which due to its larger character repertoire conveys more entropy per character anyway and is hence shorter. That being said, allocating the optimal length is still a desirable goal. Additionally, reading the memory while decoding the Unicode code points will likely pull it into the processor cache, which should make the second iteration faster. I'll rewrite something later that takes advantage of this.
     }
 }
 #[cfg(feature = "alloc")]
@@ -58,7 +58,7 @@ impl<'a> IntoParam<'a, PCWSTR> for alloc::string::String {
 impl<'a> IntoParam<'a, PCWSTR> for &::std::ffi::OsStr {
     fn into_param(self) -> Param<'a, PCWSTR> {
         use ::std::os::windows::ffi::OsStrExt;
-        Param::Boxed(PCWSTR(heap_string(&self.encode_wide().collect::<alloc::vec::Vec<u16>>())))
+        Param::Boxed(PCWSTR(unsafe { string_from_iter(self.encode_wide(), self.len()) }))
     }
 }
 #[cfg(feature = "alloc")]
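For readers following the thread below, here is a rough sketch of the kind of helper the new call sites rely on. It is a guess at the shape of string_from_iter based on this discussion, with a made-up name; it is not the PR's actual code, which presumably hands back a raw heap pointer for PCWSTR to wrap rather than a Vec:

```rust
use std::iter::once;

// Hypothetical stand-in for `string_from_iter`: build a NUL-terminated UTF-16
// buffer from an iterator, sized by the caller-supplied upper bound `len`.
fn wide_buffer_from_iter(iter: impl Iterator<Item = u16>, len: usize) -> Vec<u16> {
    // Over-allocate to `len` code units plus one for the NUL terminator.
    let mut buffer = Vec::with_capacity(len + 1);
    // `chain(once(0))` appends the terminator; `Once` is fused, so the front
    // iterator is never polled again after it first returns `None`.
    buffer.extend(iter.chain(once(0)));
    // If `len` really is an upper bound (e.g. `str::len()` paired with
    // `encode_utf16()`), this never reallocates. A hard `assert!` here is the
    // panic debated below; `debug_assert!` or silently accepting the extra
    // allocation are the softer alternatives.
    debug_assert!(buffer.len() <= len + 1);
    buffer
}
```

The exact-count alternative floated above would pass self.encode_utf16().count() instead of self.len(), trading a second pass over the string for zero over-allocation.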
Won't this be fused because the chaining of once() is guaranteed to be fused? iter might not be fused, but the implementation will never call next() on it again, because the chain moved on to the next one.
Nice catch, I didn't consider that. Once is indeed guaranteed to be fused, so that issue is solved!
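To illustrate the behavior being relied on here (a sketch, not from the PR; Flaky is a made-up, deliberately non-fused iterator):

```rust
use std::iter::once;

// A deliberately non-fused iterator: after returning `None` once, it would
// yield again if polled.
struct Flaky(u8);

impl Iterator for Flaky {
    type Item = u16;
    fn next(&mut self) -> Option<u16> {
        self.0 += 1;
        match self.0 {
            1 => Some(b'a' as u16),
            2 => None,              // first exhaustion
            _ => Some(b'!' as u16), // would "resume" if polled again
        }
    }
}

fn main() {
    // `chain` switches to `once(0)` after the first `None` from `Flaky` and
    // does not poll it again, so the stray '!' never reaches the buffer.
    let units: Vec<u16> = Flaky(0).chain(once(0)).collect();
    assert_eq!(units, vec![b'a' as u16, 0]);
}
```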
asserting here seems wrong... I'm not sure we want to panic if the encoder contains more than len. If anything, a debug_assert! might be appropriate (though I think silently ignoring is likely ok).
I'd also like to avoid panics in general. I have customers for whom this is inappropriate. If it means we have to embed this inside PCWSTR and PCSTR directly to avoid the issues with generality, then so be it.
Should this panic:
string_from_iter("hello world".encode_utf16(), 5)
That should never happen. My concern is that we're trying to harden a function that is only ever used internally, so if there's a general concern about the safety of this function, we can either mark it unsafe or just get rid of it entirely.