Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce unsafe usage #28

Merged
merged 2 commits into from
Jul 10, 2020
Merged

Reduce unsafe usage #28

merged 2 commits into from
Jul 10, 2020

Conversation

CAD97
Copy link
Contributor

@CAD97 CAD97 commented Jul 10, 2020

by storing Box<str> rather than &'static str for the owned values in the SymbolTable.

This is mostly a proof-of-concept (as it does not adjust the bstr interner or any documentation that is now innacurate), and not strictly necessary for soundness or any feature addition. Rather, it's primarily because you don't need to do the Box::leak/Box::from_raw dance, and this reduces room for error as Slice no longer requires a manual Drop implementation.

I'm collecting some comparisons between some of the top string interner crates in the vein of Amos's comparison of small strings, and was surprised that an interner that says it works via leaking strings actually doesn't leak anything and properly cleans up after itself.

(Also, you're the only interner of the ones I'm checking that support interning &'static str special without copying the string, so that's cool!)

@lopopolo lopopolo added A-security Area: Security vulnerabilities and unsoundness issues. I-str Interner: UTF-8 string SymbolTable. S-speculative Status: This is just an idea. labels Jul 10, 2020
@lopopolo
Copy link
Member

This is a neat trick. Let me work through how I think this works.

  • Slice::Leaked always stored an owned Box<str>
  • Slice::Leaked serialized that Box<str> to a &'static str with Box::leak to extend the lifetime for storing in the Vec.
  • dropping the owner slice requires a custom drop impl with unsafe that is a bit scary.

Instead:

  • store an owned Box<str> directly. This does not increase the size of Slice since &str are already fat pointers.
  • Extend the lifetime “locally” in Slice::as_static_slice.
  • Drop normally.

To summarize, size of interner stays the same and much less unsafe by only changing an internal representation. Sounds great.

Good catch that the docs don’t make it clear that leaking was an implementation detail and intaglio cleans up after itself. Thanks.

I’d like to take this change. How would you like to proceed @CAD97?

One change is like to make is to move the constructors for Slice to two associated functions on Slice and remove the alloc method to hide any other changes we might make to the internals inside of Slice

@CAD97
Copy link
Contributor Author

CAD97 commented Jul 10, 2020

Yes, you've got it right.

I've pushed a second commit to do the same improvment for the bytes interner. The implementation work is now ready to be merged, though I'd definitely suggest going through another pass to clean up documentation (internal and external) to make sure that it's not talking about leaking strings anymore. You've much more familiarity with the library to do the work than I.

My quick tour of the bytes module to apply this diff suggests you might be better off just not using BStr. I've had to use Box<[u8]> for the owned variant rather than Box<BStr>, and this caused no (direct) issues, as the Slice is used through as_slice() which returns &[u8] anyway. If you want to keep using bstr, then you can get Box<BStr> by a transmute, implement Debug manually to use &BStr, or wait for upstream to provide the necessary API to create Box<BStr>.

TL;DR: merge this PR, work from there.

@lopopolo
Copy link
Member

One quick note before I review:

The only reason BStr is included as a dependency in this crate is for its fmt::Debug implementation.

In Artichoke, while symbols are byte strings, they are conventionally UTF-8 and almost always real UTF-8 strings. Without bstr doing dbg!(symbol_table) is mostly unreadable.

Copy link
Member

@lopopolo lopopolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @CAD97! This pointer wizardry is easier to understand than the wizardry it replaces.

I'm pleased that the changes are very localized in scope.

🎉 💎

@@ -818,8 +795,7 @@ impl<S> SymbolTable<S> {
#[must_use]
fn alloc(contents: Vec<u8>) -> Slice {
let boxed_slice = contents.into_boxed_slice();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going from Vec<u8> to BString would require activating the std feature.

I'd like to avoid doing that. One way we can work around this is by manually implementing Debug and calling as_ref().as_bstr() on the Owned boxed slice.

@@ -79,19 +78,31 @@ use crate::{Symbol, SymbolOverflowError, DEFAULT_SYMBOL_TABLE_CAPACITY};
enum Slice {
/// True `'static` references.
Static(&'static BStr),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll move away from storing BStrs in the Slice and only convert to impl Debug. The reason to store the slices as &BStr is that it lets me be lazy and derive debug.

@lopopolo lopopolo merged commit 5416e70 into artichoke:trunk Jul 10, 2020
@CAD97 CAD97 deleted the less-unsafe branch July 10, 2020 18:56
@lopopolo lopopolo added the I-bytes Interner: Byte string SymbolTable. label Jul 10, 2020
@CAD97
Copy link
Contributor Author

CAD97 commented Jul 11, 2020

I'm collecting some comparisons between some of the top string interner crates in the vein of Amos's comparison of small strings,

If you're interested, the post is out now: https://dev.to/cad97/string-interners-in-rust-797

@lopopolo
Copy link
Member

Thanks for checking out intaglio and contributing back some wonderful improvements.

@lopopolo lopopolo mentioned this pull request Aug 8, 2020
@lopopolo lopopolo added the A-interner Area: String interners, data structures, and types. label Mar 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-interner Area: String interners, data structures, and types. A-security Area: Security vulnerabilities and unsoundness issues. I-bytes Interner: Byte string SymbolTable. I-str Interner: UTF-8 string SymbolTable. S-speculative Status: This is just an idea.
Development

Successfully merging this pull request may close these issues.

2 participants