-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data ownership for zero-copy data structs #667
Comments
My initial response is that neither of those options really work optimally unfortunately and I don't know if we can have an optional architecture, but I'd like to try. For Rust clients, in my ideal world DataProvider owns data that is loaded. The problem of data growth is real and would be addressed either by FIFO cache, or some Rc/Arc. The reason I would prefer that is that I imagine clients wanting to "just" load DateTimeFormat, or NumberFormat and use it. Such formatter constructor would then ask DataProvider for data and on the first request, the provider loads the data, and on the consecutive the data is already there. This is also important for runtime processing. I imagine that we may load unparsed patterns, and parse them on the first load, and the consecutive loads would get parsed patterns unless they got GCed out. I can see a claim that when The story may not cater well to the FFI scenario where we may need to return owned data over the FFI and allow it to live in the instance. So, seems like I'm also nudging in the direction of option1+2, but I'm a bit stronger on option 1 than you are. |
Option 1 is incompatible with a FIFO cache on the FsDataProvider layer, because the data provider cannot know whether a downstream formatter is retaining references to data. Rc/Arc might work; I just don't have a design for it yet. I think it would require that the DataProvider
What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?
+1. This is an easier problem to solve, and it is compatible with both Option 1 and Option 2. I don't want to conflate that problem with the problem of raw data ownership (e.g., strings, ZeroVecs). |
When writing that sentence my thinking was to make it impossible in Rust for a DTF instance to outlive the provider. It would just require that the lifetime of the |
Okay. This is actually already the case; DateTimeFormat has a |
@Manishearth: An option 3 would be to put a |
2021-04-22: The only option really off the table is Option 1 without reference counting. |
We just finished 2yr long project of removing something like Option 1 from our code base - it would be best if we could avoid cache + reference/pointers combination. Major reason is that it's not really a cache, because you can't evict data. Idea where we have locale based provider is better, but it's just a thin wrapper around general data provider. Users can still shoot themselves in the foot by creating large number of locale based providers, but in that case it's user mistake. I would fast forward couple iterations and land on a hybrid model, where in some cases we return a reference to a large dataset (like timezone names), and in other cases we return a value (date patterns). Returning a value works great with caching, and we can implement efficient cache eviction rules in that case. This approach would require having two types of data providers - reference and value based. We also need to denote, possibly in data, which data set is to be returned as reference and which as value. It can also be combined with locale data providers to limit the data growth. The question is - are we more concerned with performance hit from complicating data provider logic or potential memory inflation due to user error? |
I and @sffc have been discussing RecapFirstly, a recap of existing approaches so that we have a list in one place, and their downsides
I won't discuss option 3.1 as much since option 3.2 is better in every way, but we did discuss ideas in that space today. @sffc may write some more about the various options available to us later. Design of Option 3.2On to the actual design of Option 3.2, which I may later turn into a design doc to check in. The rough ideaThe typical data struct will look something like this: struct MyDataStruct<'s> {
cow: Cow<'s, str>,
vec: ZeroVec<'s, u32>
integer: u64,
// ...
} This struct can contain fully owned data, but it may also contain data that has been zero-copy deserialized from a Ideally, we'd be able to carry around a struct like struct SharedMyDataStruct {
rc: Rc<[u8]>, // or some kind of reference counting
data: MyDataStruct<'self>
} such that the Of course, Transforming lifetimesNow, we can implement struct SharedMyDataStruct {
rc: Rc<[u8]>, // or some kind of reference counting
data: MyDataStruct<'static> // the 'static is a lie
}
impl SharedMyDataStruct {
fn get<'a>(&'a self) -> &'a MyDataStruct<'a> { // note the lifetime!
// transmute the lifetime of &self.data
}
} It's not safe to return This is all well and good. Except this can't be done generically -- we will have many data struct types and it would be a huge pain to write this struct for each of them. What we want is: struct SharedData<T> {
rc: Rc<[u8]>, // or some kind of reference counting
data: T<'static> // the 'static is a lie
}
impl<T> SharedData<T> {
fn get<'a>(&'a self) -> &'a T<'a> {..}
fn get_mut<'a>(&'a mut self) -> &'a mut T<'a> {..}
}
// some kind of `Deserialize` impl that can construct this object zero-copy? except of course this isn't expressible, not without GATs. In current Rust there is no way to generically talk about " Build our own GATsWe can, however, manually build this scaffolding: unsafe trait LifetimeTransform<'a>: 'static {
type Output: 'a; // MUST be `Self` with the lifetime swapped
// used by `SharedData::get()`
fn transform(&'a self) -> &'a Self::Output;
fn transform_mut(&'a mut self) -> &'a mut Self::Output;
// used for deserialization. Safety constraint: `Self` must be
// destroyed before the data `Self::Output` was deserialized
// from is
unsafe fn make(Self::Output) -> Self;
}
unsafe impl<'a> LifetimeTransform<'a> for MyDataStruct<'static> {
type Output = MyDataStruct<'a>
// impls are transmutes
} The trait is
struct SharedData<T> {
rc: Rc<[u8]>, // or some kind of reference counting
data: T<'static> // the 'static is a lie
}
impl<T: for<'a> LifetimeTransform<'a>> SharedData<T> {
fn get<'a>(&'a self) -> &'a T::Output {..}
fn get_mut<'a>(&'a mut self) -> &'a mut T::Output {..}
}
// some kind of `Deserialize` impl that can construct this object zero-copy using `LifetimeTransform::make()`
// where you zero-copy deserialize a `T`, `LifetimeTransform::make()` it, and wrap it in a `SharedData<T>` with
// the correct `Rc` Make it pleasantFirstly we can swap the manual impl of We do care about patching, and We can allow for non-Rc-backed data by simply making We may eventually need a Major benefits
Lingering concernsIs |
Fixed by #745 |
We use
's
in data structs to allow them to borrow data via zero-copy deserialization. If data is statically allocated, we simply point to that pre-loaded buffer, as ICU4C does now. However, how do we handle ownership for dynamically loaded data, such as from a filesystem, RPC service, or network request?Currently, we always return data structs that own their data (
's: 'static
), due to #632. When we fix #632, we will need to have a story for how data providers that dynamically load data, like FsDataProvider, fit into the picture.We also must think about how
'd
, the lifetime of the data structs themselves, fits into the picture.I see
twothree directions.Option 1: FsDataProvider owns the data
Option 1 means FsDataProvider would lazy-load data into memory and then return pointers to it.
's
equals the lifetime of the FsDataProvider.This is challenging because a singleton FsDataProvider may have unbounded memory growth, since we cannot "delete" old data since we can't know if someone still has a reference to it. (See Option 3 for a solution based on reference counting.)
We could limit the scope of this problem by recommending or requiring that:
The other problem with this option is that it basically turns FsDataProvider into an (unbounded) data cache. I was hoping to keep the data cache as a separate component.
Option 2: Data struct owns the data
Option 2 means FsDataProvider would only ever return data that is owned by the data struct, such that the data provider is not tied to the lifetime of the data structs it returns.
This is challenging because after #632 is fixed, we would need a way to programmatically convert a data struct to owned (
's: 'static
). We would need to make a trait, sayIntoOwned
, that converts data structs into'static
, and implement that trait on all data structs and all types that are encapsulated by data structs (like ZeroVec). We would want to make sure thatIntoOwned
is as efficient as possible, since this would be on a critical path in data loading.Option 2 could be paired with a cache to help with performance and avoiding repeated loads of data. However, doing so would preclude the possibility of returning
'd
(borrowing the data struct), because then we fall into the same trap as Option 1.Option 3: Reference count the data
With this solution, we remove the
's
lifetime from the data provider and "replace" it with a reference-counted buffer. @Manishearth explains how this could work in a reply farther down in this thread.With reference counted buffers, data caches can safely drop data without invalidating downstream users of that data, because the downstream users will continue to retain an
Rc
to the data. Once the downstream users are dropped, the reference count will reach 0 and the buffers will be deallocated.There are two locations where we can add the
Rc
s:Rc<[u8]>
Rc<[u8]>
alongside the data struct and use self-referencesOption 3.1 has several problems, not the least of which is that it forces the
Rc
s to "infect" all fields of the data struct, which likely comes with a performance impact. The only real disadvantage of Option 3.2 is that self-references require a bit of unsafe code in Rust that may be difficult to reason about. I prefer Option 3.2 and taking the time to verify its correctness.With this solution, we would also drop
'd
, because calling.clone()
on a data struct will be a cheap operation, since it won't require allocating any memory.Summary
Option 2 fits best with my mental model. However, we may want to support both Options 1 and 2, because they are useful in different cases. For example, a data provider that downloads a whole language pack may wish to adopt Option 1, whereas one that pulls from an RPC service may wish to use Option 2.UPDATE: I now think that Option 3.2 is a workable solution that solves our known problems.
CC @Manishearth @nciric
The text was updated successfully, but these errors were encountered: