rust: define shared Commit type #4351

Merged · 19 commits · Nov 21, 2020
1 change: 1 addition & 0 deletions tensorboard/data/server/BUILD
@@ -27,6 +27,7 @@ rust_library(
    name = "rustboard_core",
    srcs = [
        "lib.rs",
        "commit.rs",
        "data_compat.rs",
        "event_file.rs",
        "masked_crc.rs",
138 changes: 138 additions & 0 deletions tensorboard/data/server/commit.rs
@@ -0,0 +1,138 @@
/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

//! Shared state for sampled data available to readers.

use std::collections::HashMap;
use std::sync::RwLock;

use crate::proto::tensorboard as pb;
use crate::reservoir::Basin;
use crate::types::{Run, Step, Tag, WallTime};

/// Current state of in-memory sampled data.
///
/// A commit is an internally mutable structure. All readers and writers should keep a shared
/// reference to a single commit. When writers need to update it, they grab an exclusive lock to
/// the contents.
///
/// Deadlock safety: any thread should obtain the outer lock (around the hash map) before an inner
/// lock (around the run data), and should obtain at most one `RunData` lock at once.
#[derive(Debug, Default)]
pub struct Commit {
    pub runs: RwLock<HashMap<Run, RwLock<RunData>>>,
}

impl Commit {
    /// Creates a new, empty commit.
    pub fn new() -> Self {
        Commit::default()
    }
}

/// Data for a single run.
///
/// This contains all data and metadata for a run. For now, that data includes only scalars;
/// tensors and blob sequences will come soon.
#[derive(Debug, Default)]
pub struct RunData {
    /// The time of the first event recorded for this run.
Contributor:

FWIW, I've wanted for a while to opportunistically parse this from the filename so we don't have to open a file for each run in order to serve the initial list of runs, which is especially bad for remote filesystems. WDYT? Wouldn't be implemented here of course, but might affect the semantics we describe for this.

Contributor Author:

That’d be fine with me, yep. I think that fits naturally into the logdir loader structure.

    ///
    /// Used to define an ordering on runs that is stable as new runs are added, so that existing
    /// runs aren't constantly changing color.
    pub start_time: Option<WallTime>,

    /// Scalar time series for this run.
    pub scalars: TagStore<ScalarValue>,
}

pub type TagStore<V> = HashMap<Tag, TimeSeries<V>>;

#[derive(Debug)]
pub struct TimeSeries<V> {
    /// Summary metadata for this time series.
    pub metadata: Box<pb::SummaryMetadata>,
Contributor:

Just wondering, what's the rationale for this being Boxed? The Book suggests 3 main reasons one might use Box and I'm guessing the reason here is "When you have a large amount of data and you want to transfer ownership but ensure the data won’t be copied when you do so" but then looking at SummaryMetadata I just see data that's already on the heap (except one i32): https://cs.opensource.google/tensorflow/tensorboard/+/master:tensorboard/data/server/tensorboard.pb.rs;l=294;drc=34fe0bfd9d41c4e0da175df16cf716436a4ae263

Contributor Author:

You’re right that it’s not that bad. Here are the ones that I’m
concerned about, with output from rustc -Zprint-type-sizes:

Type size information
print-type-size type: `proto::tensorboard::summary::Value`: 536 bytes, alignment: 8 bytes
print-type-size     field `.node_name`: 24 bytes
print-type-size     field `.tag`: 24 bytes
print-type-size     field `.metadata`: 104 bytes
print-type-size     field `.value`: 384 bytes
print-type-size type: `proto::tensorboard::TensorProto`: 376 bytes, alignment: 8 bytes
print-type-size     field `.tensor_shape`: 32 bytes
print-type-size     field `.tensor_content`: 24 bytes
print-type-size     field `.half_val`: 24 bytes
print-type-size     field `.float_val`: 24 bytes
print-type-size     field `.double_val`: 24 bytes
print-type-size     field `.int_val`: 24 bytes
print-type-size     field `.string_val`: 24 bytes
print-type-size     field `.scomplex_val`: 24 bytes
print-type-size     field `.int64_val`: 24 bytes
print-type-size     field `.bool_val`: 24 bytes
print-type-size     field `.dcomplex_val`: 24 bytes
print-type-size     field `.resource_handle_val`: 24 bytes
print-type-size     field `.variant_val`: 24 bytes
print-type-size     field `.uint32_val`: 24 bytes
print-type-size     field `.uint64_val`: 24 bytes
print-type-size     field `.dtype`: 4 bytes
print-type-size     field `.version_number`: 4 bytes
print-type-size type: `proto::tensorboard::SummaryMetadata`: 104 bytes, alignment: 8 bytes
print-type-size     field `.plugin_data`: 48 bytes
print-type-size     field `.display_name`: 24 bytes
print-type-size     field `.summary_description`: 24 bytes
print-type-size     field `.data_class`: 4 bytes
print-type-size     end padding: 4 bytes

So yes, “large amount of data” can definitely go higher than 104 bytes,
but that’s the only reason here. It should also make hash map resizes of
the TagStore a bit faster, though that shouldn’t matter too much.

Boxing all three of these together improved end-to-end performance by
~25%, and I would believe that most of that is due to TensorProto and
Value since those directly affect the reservoir size, but I also don’t
mind doing this. If we want to unbox this, that’s okay with me.
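For a rough sense of the size trade-off being discussed, a self-contained sketch using a hypothetical 104-byte stand-in for SummaryMetadata (not the real proto type):

use std::mem::size_of;

// `Big` is a hypothetical stand-in for the 104-byte `SummaryMetadata`.
// Boxing it keeps the containing struct one pointer wide, so hash-map
// entries and moves stay small, at the cost of one indirection.
#[allow(dead_code)]
struct Big([u8; 104]);
#[allow(dead_code)]
struct Unboxed {
    meta: Big,
}
#[allow(dead_code)]
struct Boxed {
    meta: Box<Big>,
}

fn main() {
    assert_eq!(size_of::<Boxed>(), size_of::<usize>()); // one (thin) pointer
    assert!(size_of::<Unboxed>() >= 104);
    println!("unboxed: {}, boxed: {}", size_of::<Unboxed>(), size_of::<Boxed>());
}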

Contributor:

Okay, it was primarily a curiosity question - fine by me to leave it boxed. I didn't realize it'd be 24 bytes per Vec aka per String so it does add up to more than I'd have thought; at least one reason is that in my head I'm still thinking about immutable string types where there isn't an extra 8 bytes for capacity.

I doubt that it makes a big difference in our case, but in general it would seem like some applications with a lot of immutable strings would save space overhead by having an immutable owned string representation with pointer + length only? I guess that must just not come up very often since I don't see much evidence that this exists.

Contributor Author (@wchargin), Nov 21, 2020:

Right: you basically have:

  • growable vector: Vec<u8>, String; 3 words (ptr, len, cap)
  • slice reference: &[u8], &str; 2 words (ptr, len)
  • owned slice: Box<[u8]>, Box<str>; 2 words (ptr, len)

where in all cases the string layout is the same as the vector layout, just with a “valid UTF-8” invariant.

So if you want “immutable owned string with no capacity penalty”, you
can take a Box<str>. And you can actually mutate it, too; you just
can’t resize it without reallocating.

Some protobuf-like APIs use this form: iirc, capnproto arrays aren’t
resizable after you construct them. But that’s a lot less ergonomic when
it comes to code that constructs the things, so I’m okay with prost
choosing to use Vecs and Strings. The TensorProto case is
definitely unfortunate, though. If protobuf had repeated oneofs, we
wouldn’t have this problem, since an enum of a bunch of Vec<_>s is
still just 32 bytes (24 + 8 for discriminant).

Rust also does have fixed-size arrays, so a Box<[u8; 12]> is even
smaller since the size is known at compile time. I was considering using
that for the TFRecord reader header rather than a Vec<u8>, but it’d be
a bit annoying, so would definitely need to be benchmarked.
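The layout sizes above are easy to check directly; a small sketch:

use std::mem::size_of;

fn main() {
    let word = size_of::<usize>();
    // Growable vector: 3 words (ptr, len, cap).
    assert_eq!(size_of::<String>(), 3 * word);
    // Slice reference and owned slice: 2 words (ptr, len).
    assert_eq!(size_of::<&str>(), 2 * word);
    assert_eq!(size_of::<Box<str>>(), 2 * word);
    // An immutable owned string with no capacity field:
    let s: Box<str> = String::from("hello").into_boxed_str();
    println!("{} ({} bytes of data)", s, s.len());
}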

Contributor Author:

And, of course, not all languages with immutable strings fit them into 8
or 16 bytes:

>>> import sys
>>> sys.getsizeof("")
49

Contributor:

Interesting, I think Box<str> is what I was looking for, and now that you mention it I see there's even https://doc.rust-lang.org/std/string/struct.String.html#method.into_boxed_str.

It hadn't occurred to me since str on its own seems a little exotic, but this explanation helps. What didn't help was the fact that https://doc.rust-lang.org/book/ch04-03-slices.html starts by saying "Another data type that does not have ownership is the slice" when I think what they're really doing is referring to &[_] as a "slice" informally and eliding the distinction between that and [_]. So whether or not a "slice" in the strict meaning of [_] has ownership depends on the type of pointer being used to hold it (namely if it's &, &mut, or Box).


    /// Reservoir basin for data points in this time series.
    ///
    /// See [`TimeSeries::valid_values`] for a client-friendly view that omits `DataLoss` points
    /// and transposes `Step`s into the tuple.
    pub basin: Basin<(WallTime, Result<V, DataLoss>)>,
}

impl<V> TimeSeries<V> {
    /// Creates a new time series from the given summary metadata.
    pub fn new(metadata: Box<pb::SummaryMetadata>) -> Self {
        TimeSeries {
            metadata,
            basin: Basin::new(),
        }
    }

    /// Gets an iterator over `self.values` that omits `DataLoss` points.
    pub fn valid_values(&self) -> impl Iterator<Item = (Step, WallTime, &V)> {
        self.basin
            .as_slice()
            .iter()
            .filter_map(|(step, (wall_time, v))| Some((*step, *wall_time, v.as_ref().ok()?)))
    }
}

/// A value in a time series is corrupt and should be ignored.
///
/// This is used when a point looks superficially reasonable when it's offered to the reservoir,
/// but at commit time we realize that it can't be enriched into a valid point. This might happen
/// if, for instance, a point in a scalar time series has a tensor value containing a string. We
/// don't care too much about what happens to these invalid values. Keeping them in the commit as
/// `DataLoss` tombstones is convenient, and [`TimeSeries::valid_values`] offers a view that
/// abstracts over this detail by only showing valid data.
#[derive(Debug)]
pub struct DataLoss;

/// The value of a scalar time series at a single point.
#[derive(Debug, Copy, Clone, PartialEq)]
pub struct ScalarValue(pub f64);

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_valid_values() {
        let mut ts = TimeSeries::<&str>::new(Box::new(pb::SummaryMetadata::default()));

        let mut rsv = crate::reservoir::StageReservoir::new(10);
        let wall_time = WallTime::new(0.0).unwrap(); // don't really care
        rsv.offer(Step(0), "zero");
        rsv.offer(Step(1), "one");
        rsv.offer(Step(2), "two");
        rsv.offer(Step(3), "three");
        rsv.offer(Step(5), "five");
        rsv.commit_map(&mut ts.basin, |s| {
            (wall_time, if s == "three" { Err(DataLoss) } else { Ok(s) })
        });

        assert_eq!(
            ts.valid_values().collect::<Vec<_>>(),
            vec![
                (Step(0), wall_time, &"zero"),
                (Step(1), wall_time, &"one"),
                (Step(2), wall_time, &"two"),
                // missing: Step(3)
                (Step(5), wall_time, &"five")
            ]
        );
    }
}
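The lock-ordering rule from the `Commit` doc comment, sketched outside the diff with simplified stand-ins for `Run` and `RunData` (illustrative only, not the real types): a writer takes the outer map lock first, then at most one inner run lock at a time.

use std::collections::HashMap;
use std::sync::RwLock;

type Run = String; // stand-in for the real `Run` newtype
#[derive(Debug, Default)]
struct RunData {
    start_time: Option<f64>, // stand-in for `Option<WallTime>`
}

fn write_run(runs: &RwLock<HashMap<Run, RwLock<RunData>>>, run: &str) {
    // Exclusive outer lock, held only long enough to ensure the entry exists.
    let mut map = runs.write().unwrap();
    map.entry(run.to_string()).or_default();
    drop(map);
    // Shared outer lock plus exclusive inner lock to mutate this one run.
    let map = runs.read().unwrap();
    let mut data = map[run].write().unwrap();
    if data.start_time.is_none() {
        data.start_time = Some(1234.5);
    }
}

fn main() {
    let runs = RwLock::new(HashMap::new());
    write_run(&runs, "train");
    let map = runs.read().unwrap();
    println!("{:?}", *map["train"].read().unwrap());
}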
1 change: 1 addition & 0 deletions tensorboard/data/server/lib.rs
@@ -15,6 +15,7 @@ limitations under the License.

//! Core functionality for TensorBoard data loading.

pub mod commit;
pub mod data_compat;
pub mod event_file;
pub mod masked_crc;
24 changes: 24 additions & 0 deletions tensorboard/data/server/types.rs
@@ -70,6 +70,19 @@ impl Borrow<str> for Tag {
    }
}

/// The name of a TensorBoard run.
///
/// Run names are derived from directory names relative to the logdir, but are lossily converted to
/// valid Unicode strings.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Hash, Clone)]
pub struct Run(pub String);

impl Borrow<str> for Run {
    fn borrow(&self) -> &str {
        &self.0
    }
}

#[cfg(test)]
mod tests {
    use super::*;
@@ -85,6 +98,17 @@ mod tests {
        assert_eq!(m.get("xent"), None);
    }

    #[test]
    fn test_run_hash_map_str_access() {
        use std::collections::HashMap;
        let mut m: HashMap<Run, i32> = HashMap::new();
        m.insert(Run("train".to_string()), 1);
        m.insert(Run("test".to_string()), 2);
        // We can call `get` given only a `&str`, not an owned `Run`.
Contributor:

I think you said on the other PR that Borrow is what enables this. Do you mind explaining the intuition I should use for why that's the case?

Contributor Author (@wchargin), Nov 21, 2020:

Correct. To start, the docs for Borrow have this as an example
case, which I found helpful.

Probably the most foolproof way to see what’s happening is to look at
the type signature for HashMap<K, V>::get:

pub fn get<Q: ?Sized>(&self, k: &Q) -> Option<&V>
where
    K: Borrow<Q>,
    Q: Hash + Eq, 

So here we’re calling get with an &str, so Q = str, so it needs to
be the case that impl Borrow<str> for Run to satisfy the where
constraint. And internally, this calls k.eq(x.0.borrow()) where
(presumably) x.0 is the map-owned key, so it boils down to checking
whether "train" == Run("train").borrow().

Does this help?
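If it helps to see the pattern in isolation, a minimal self-contained version (the `Key` type here is just illustrative; the derived Hash on a one-field tuple struct hashes the inner String, which hashes the same as the matching &str, so the lookup stays consistent):

use std::borrow::Borrow;
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq, Hash)]
struct Key(String);

impl Borrow<str> for Key {
    fn borrow(&self) -> &str {
        &self.0
    }
}

fn main() {
    let mut m: HashMap<Key, i32> = HashMap::new();
    m.insert(Key("train".to_string()), 1);
    // Q = str here, so the `K: Borrow<Q>` bound is met by the impl above,
    // and the lookup compares "train" == Key("train").borrow().
    assert_eq!(m.get("train"), Some(&1));
    assert_eq!(m.get("val"), None);
}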

Contributor:

Ah, thanks! I didn't read far enough down the Borrow page to see the HashMap example but now it's quite clear, my bad.

Contributor Author:

No problem! Pointing you to the right docs is a good use of my time. :-)

        assert_eq!(m.get("train"), Some(&1));
        assert_eq!(m.get("val"), None);
    }

    #[test]
    fn test_wall_time() {
        assert_eq!(WallTime::new(f64::INFINITY), None);