rust: implement ListScalars and ReadScalars (#4386)
Conversation
Summary: This module performs conversions from legacy on-disk data formats. It will correspond to both the `data_compat` and `dataclass_compat` modules from Python TensorBoard. For now, we have just one function, to determine the initial summary metadata of a time series. And, for now, we only implement this for scalars, to keep things simple.

Test Plan: Unit tests included.

wchargin-branch: rust-data-compat
wchargin-source: ef510b422b00a66a23efc1406947844fa8d1a798

Summary: Our errors now have nice string formatting, easier `From` conversions, and `std::error::Error` implementations.

Test Plan: Included some tests for the `Display` implementations. They’re often not necessary—one benefit of deriving traits is that you can be confident in the implementation without manually testing it. But sometimes, if the format string is non-trivial, it can be nice to actually see the full text written out.

wchargin-branch: rust-use-thiserror
wchargin-source: 0f8009c3b96619f2eb4649353487dc996b45a712

Summary: We’d defined a `Step` [newtype] in the `reservoir` module, but that type will be useful more broadly. We thus move it into a new `types` module, and define `WallTime` and `Tag` siblings, which we’ll need shortly.

[newtype]: https://doc.rust-lang.org/rust-by-example/generics/new_types.html

Test Plan: Most behavior is just `#[derive]`d; unit tests included for the rest.

wchargin-branch: rust-types-module
wchargin-source: b0f5475b247ef8f9cb8bf2e0bf94d2b648b8009e
…rigin/wchargin-rust-use-thiserror' and 'origin/wchargin-rust-types-module' into HEAD
wchargin-branch: rust-run-loader wchargin-source: 35359d6e233139973edac8b2c1440cc48ed4e710
wchargin-branch: rust-data-compat wchargin-source: 467974fa4e0e55dea0872bb1394a768a38811ac5
wchargin-branch: rust-run-loader wchargin-source: d190e3d45c304d976a3fc14a73d99572a6ad377f
Summary: This patch refactors `run`’s `StagePayload` type into an `EventValue` type in `data_compat`. This way, we can define enriching conversions from `EventValue`s to committed values in a layer that’s not specific to the run loader. We also implement `initial_metadata` for the `GraphDef` variant, for symmetry and to make the abstraction clearer. The run loader still doesn’t handle graphs.

Test Plan: Unit tests expanded for the new code, and continue to cover the old.

wchargin-branch: rust-promote-eventvalue
wchargin-source: 96af44556af064cf9ef933154ca3a522a8829798
wchargin-branch: rust-run-loader wchargin-source: 170c77aae53df6716520df779e4838d3fc3b93b9 # Conflicts: # tensorboard/data/server/data_compat.rs
wchargin-branch: rust-run-loader wchargin-source: 170c77aae53df6716520df779e4838d3fc3b93b9
wchargin-branch: rust-promote-eventvalue wchargin-source: 106b4519aba1a966d8273b6b2509f1798131dc5a
Summary: This patch introduces the *commit*, which is the link between loading threads and serving threads. The commit structure is designed for concurrency: many threads can read from it at once, and when the commit is updated, a loader acquires exclusive access only to its own run. This improves latency for both readers and writers. The commit itself doesn’t have much logic; it’s mostly just the correct composition of substructures, most notably the reservoir basins. See also the RustBoard technical design doc (#4350) for discussion.

Test Plan: There’s one method on `commit::TimeSeries`; it has tests.

wchargin-branch: rust-commit-type
wchargin-source: d69b4e0034b68d1d8dadc6aa0f6dc3cc48efccde

Summary: This patch integrates `RunLoader` and `Commit`. A run loader now commits all changes at the end of its load cycle. This isn’t sufficient—we also need it to periodically commit while it’s still loading. But it’s a good start for now. Much of the weight is actually in `data_compat`, which owns the logic for “enriching” raw events into their class-specific representations. This way, that logic can be reused: e.g., for an alternate consumer that exposes a stream of all events, without downsampling.

Test Plan: The new enrichment functionality has dedicated tests, and existing tests for the run loader have been updated to operate on a `Commit` rather than poking into the internals.

wchargin-branch: rust-commit-run-loader
wchargin-source: 1e0ae52eb9cd9ead6f53c9716306a5fcce178e06
wchargin-branch: rust-run-loader wchargin-source: 793ff3e287c2571fd1bc06d5c1940ed3b3bdc796
wchargin-branch: rust-promote-eventvalue wchargin-source: 432923653efb6fe768bdd507acac833672ec3e06
wchargin-branch: rust-commit-type wchargin-source: 1850f510bac1f515fa4b7f73c1ec9a7a47fea283
wchargin-branch: rust-commit-run-loader wchargin-source: 14baca16a82293a24a1b7f58869305f3b4ff7a6f # Conflicts: # tensorboard/data/server/run.rs
wchargin-branch: rust-commit-run-loader wchargin-source: 14baca16a82293a24a1b7f58869305f3b4ff7a6f
wchargin-branch: rust-promote-eventvalue wchargin-source: dd7917e2bd9084f8ff048657d94572168a221a7e # Conflicts: # tensorboard/data/server/data_compat.rs # tensorboard/data/server/run.rs
wchargin-branch: rust-promote-eventvalue wchargin-source: dd7917e2bd9084f8ff048657d94572168a221a7e
wchargin-branch: rust-commit-type wchargin-source: 38179a4b9af7538f133ee04f771ac895a8db21a9
wchargin-branch: rust-commit-run-loader wchargin-source: f84d5842282dba99385fba34ece75f6ac7f82d2e
wchargin-branch: rust-commit-type wchargin-source: b62b0669a50b5086a1f6af18effe62fbfd1e5431
wchargin-branch: rust-commit-run-loader wchargin-source: a8130d52ce4b02636f59cab02ebef5cb27055fc9
wchargin-branch: rust-commit-run-loader wchargin-source: 869544fd17ac2d740e60f1501b9540490d06906b # Conflicts: # tensorboard/data/server/commit.rs
wchargin-branch: rust-commit-run-loader wchargin-source: 869544fd17ac2d740e60f1501b9540490d06906b
wchargin-branch: rust-commit-run-loader wchargin-source: e0ee69aea5631ccc6f2197922249d766aaa84cb6
wchargin-branch: rust-commit-run-loader wchargin-source: 3f18bbb73a892785e6dfe5255b2831289ac3e789 # Conflicts: # tensorboard/data/server/run.rs
wchargin-branch: rust-logdir-loader-discovery wchargin-source: 761b320491045f4ceb6ced9eb596242ba860e7a9
wchargin-branch: rust-logdir-loader-data wchargin-source: 5f6ccd3318e76347bc3f07d78c459070fe15e65d
Summary: This patch connects the `//tensorboard/data/server` main module to the `Commit` and `LogdirLoader` infrastructure, and implements `ListRuns` as a proof of concept. For now, we just take a logdir path as a positional command-line argument. A future patch will bring nicer command-line argument parsing.

Test Plan: Run both the data server and a copy of TensorBoard concurrently:

```
bazel run //tensorboard/data/server -- ~/tensorboard_data/mnist &
bazel run //tensorboard -- \
  --bind_all \
  --generic_data true \
  --grpc_data_provider localhost:6806 \
  ;
```

Then navigate to `localhost:6006/data/runs` and note that the response includes a correct list of runs. Or, for a manual `grpc_cli` test, start just the data server, and run:

```
$ grpc_cli --channel_creds_type=insecure \
>     --protofiles tensorboard/data/proto/data_provider.proto \
>     call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
  name: "lr_1E-03,conv=1,fc=2"
  start_time: 1563406327
}
// etc.
```

Unit tests also included. Note that testing these data provider methods is nice and easy: you just call them. No mocking or networking required.

wchargin-branch: rust-listruns
wchargin-source: eb471eca310905eb45d730936279b57bdae70927
Summary: The Rust gRPC server now implements the RPCs required to serve scalar data to the TensorBoard frontend.

There is some repetition in the implementation: `ListScalars` and `ReadScalars` have very similar bodies. There will be more repetition with the tensor RPCs, and to a lesser degree with those for blob sequences. I plan to consider refactoring that later, once all the raw materials are there, but I’m open to feedback.

Test Plan: Unit tests included. For an end-to-end test, launch the data server (`bazel run //tensorboard/data/server -- LOGDIR`) and concurrently launch TensorBoard with the `--grpc_data_provider localhost:6806` and `--generic_data true` flags.

wchargin-branch: rust-listscalars-readscalars
wchargin-source: 00ca5a4502bbb5e7a2f71fc2d48e262a60d69191
wchargin-branch: rust-logdir-loader-discovery wchargin-source: 44cd7a803ba04324ce11a892d8a8414792dae95c
wchargin-branch: rust-logdir-loader-data wchargin-source: 29289362b8ff8294883bd381401613a6269e2ebc
wchargin-branch: rust-listruns wchargin-source: 8187059ac28210897b1c90e4a31cd8237993ec4b
wchargin-branch: rust-listscalars-readscalars wchargin-source: d13d984fbe01390aaeb19c9322969289f899661f
wchargin-branch: rust-logdir-loader-data wchargin-source: ab88209f4c0f5874b7c0e80a7c670a8f7cef810d # Conflicts: # tensorboard/data/server/logdir.rs
wchargin-branch: rust-logdir-loader-data wchargin-source: ab88209f4c0f5874b7c0e80a7c670a8f7cef810d
wchargin-branch: rust-listruns wchargin-source: bf3865887dddcdf516bc73806ea40f1f51ca0738
wchargin-branch: rust-listscalars-readscalars wchargin-source: ad93125f2159cbc427c29dcbdbd3470d3247dcfb
wchargin-branch: rust-listruns wchargin-source: 805e652280c9c72e2f1ea5cd0be2ae74de1538bb
wchargin-branch: rust-listscalars-readscalars wchargin-source: bead71240b2e82e0b81ad8c11598b6a0ce9e8261
wchargin-branch: rust-listruns wchargin-source: 4314f515f85db55f912bccc1ea0eb22078e6d783
wchargin-branch: rust-listscalars-readscalars wchargin-source: 865b3367329b12a9fc915f1848c82d3a596260dc
tensorboard/data/server/server.rs (outdated):

```rust
/// Parses a request plugin filter. Returns the desired plugin name, or an error if that's empty.
fn parse_plugin_filter(pf: Option<data::PluginFilter>) -> Result<String, Status> {
    let want_plugin = pf.map(|x| x.plugin_name).unwrap_or_default();
```
This could also be `pf.unwrap_or_default().plugin_name`, right? (akin to `parse_rtf` below). That seems somewhat more straightforward, and if I have Done The Thing correctly, it looks like it optimizes to the same code: https://rust.godbolt.org/z/YWsrf3
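For concreteness, the two spellings discussed here can be compared side by side. The `PluginFilter` struct below is a minimal, hypothetical stand-in for the generated proto type, with free functions in place of the server method:

```rust
// Hypothetical stand-in for the generated `data::PluginFilter` proto type.
#[derive(Default)]
struct PluginFilter {
    plugin_name: String,
}

// Spelling 1: map over the Option, then default the String.
fn via_map(pf: Option<PluginFilter>) -> String {
    pf.map(|x| x.plugin_name).unwrap_or_default()
}

// Spelling 2: default the whole message, then project the field.
fn via_unwrap(pf: Option<PluginFilter>) -> String {
    pf.unwrap_or_default().plugin_name
}

fn main() {
    let pf = PluginFilter { plugin_name: "scalars".to_string() };
    assert_eq!(via_map(Some(pf)), "scalars");
    // Both spellings agree on the `None` case: the field defaults to "".
    assert_eq!(via_unwrap(None), "");
    println!("ok");
}
```

Both rely on the `Default` derive that Prost-style bindings provide on every message, which is why the two agree on `None`.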
tensorboard/data/server/server.rs (outdated):

```rust
/// Converts a list-of-lists back into a map, for easy assertions that don't depend on
```
So I get why we have this: to abstract over the various run-tag-indexed response types and avoid having to do something like `res.runs.iter().find(|r| r.run_name == "train").unwrap()` inside the test, which would certainly be ugly. That said, I'm a bit lukewarm about the map conversion approach too; it just feels a bit clunky to have to pass in the conversion specifier closures each time in order to unwrap each response, and it's a little hard for me to look at the callsite of `unroll_map()` and immediately understand what the parameters are doing.

What if instead we tucked the cumbersome indexing expression inside a tests-only `Index` impl for the protobuf response types?

```rust
impl Index<&str> for data::ListScalarsResponse {
    type Output = data::list_scalars_response::RunEntry;
    fn index(&self, key: &str) -> &Self::Output {
        self.runs.iter().find(|r| r.run_name == key).unwrap()
    }
}
```

This would allow us in the test itself to just go `let train_run = &rsp["train"]`.

It's true that we would need to repeat this boilerplate once per Run + Tag for each response type, but for test code I don't mind boilerplate so much, and we would only need to write the impls once; then, no matter how many test methods we have, they all get the nice concise lookup syntax.

Thoughts?
That’s quite clever. I will also note that it’s also “barely” allowed: the orphan rules say that you can only `impl Trait for Type` if either `Trait` or `Type` is defined in your crate, to ensure trait coherence (i.e., that there aren’t ever conflicting impls). And the Prost bindings are in our crate, but I don’t view that as set in stone. It’s true that that’s the formulation that typical `prost-build`/`tonic-build` users will use, but a more Bazel-idiomatic workflow would probably put each `proto_library` in its own crate with proper dependencies among them. And indeed `rules_rust` provides `rust_proto_library` of this form, and bazelbuild/rules_rust#479 implements similar functionality for Prost/Tonic (written by a Googler).

So I’m a little hesitant to depend on this. We could newtype it, but that starts getting less ergonomic.

Here’s an approach that avoids having to pass in the conversion specifier closures, and removes the type parameters entirely. We take advantage of the shared `run_name`/`tag_name` structure via a macro.

Very Pythonic. What do you think?
Heh, I'll take a "clever" :) Yeah, I realize it depends on the orphan rule; originally I was thinking it would just be a new test-only trait with `fn get()`, but then I realized `Index` could be used and makes it even pithier.

But I think the macro version is quite nice and addresses the main things I was lukewarm about with `unroll_map`, so I'm perfectly happy to keep it the way you have it now.
Yeah, I would also be totally fine with a test-only extension trait (no orphan rule issues since the trait is in-crate). And I would probably use a macro to implement that trait for each response type.

I still might switch to that. It has type information earlier, so it’s easier to return a `HashMap<Run, HashMap<Tag, V>>` given a mapping function `FnMut(Self::TagEntry) -> V`. And that would be nice to avoid the newly introduced `.as_ref()` dances. You can do it with the macro, but it’s a bit annoying because of inference details. (Basically: you can’t write `(|x| x.foo())(some_x)` without explicitly specifying the type for `x`, but you can work around this by capturing `x`’s type, Java style: `std::iter::once(some_x).map(|x| x.foo()).nth(0).unwrap()` is one awkward way to do so.)
tensorboard/data/server/server.rs (outdated):

```rust
        }),
        ..Default::default()
    });
    handler.read_scalars(req).await.unwrap();
```
Do we not want to actually assert anything on the response? Seems like it might not hurt to check that there are indeed no points.
Well, there are points, because we don’t actually downsample yet. :-) We can assert on the keys, though, and I’ll add a draft assertion to enable once downsampling is implemented.
```rust
    });
    res.runs.push(run);
    for (run, data) in runs.iter() {
        if !run_filter.want(run) {
```
FWIW, iterating over the filter instead of the keyset would seem likely to save effort on net (assuming single-run lookups against a big experiment are fairly plausible, while sending a filter with many non-existent runs is unlikely; I guess most optimal would be to iterate over the shorter one). Thoughts? It probably does make the code a little messier so maybe better to wait.
Agreed. It does complicate the code. My first attempt is:

```rust
// call site: nice and easy, modulo `&*` (for the `RwLockReadGuard`)
for (run, data) in run_filter.pluck(&*runs) {
    // ...
}

// implementation
impl<T: Hash + Eq> Filter<T> {
    fn pluck<'f, 'm: 'f, V: 'm>(
        &'f self,
        map: &'m HashMap<T, V>,
    ) -> impl Iterator<Item = (&'m T, &'m V)> + 'f
    where
        T: Eq + Hash + 'm,
    {
        match self {
            Filter::All => Either::Left(map.iter()),
            Filter::Just(which) => {
                let it = which.iter().filter_map(move |k| map.get_key_value(k));
                Either::Right(it)
            }
        }
    }
}

enum Either<A, B> {
    Left(A),
    Right(B),
}

impl<A, B, I> Iterator for Either<A, B>
where
    A: Iterator<Item = I>,
    B: Iterator<Item = I>,
{
    type Item = I;
    fn next(&mut self) -> Option<Self::Item> {
        match self {
            Either::Left(i) => i.next(),
            Either::Right(i) => i.next(),
        }
    }
}
```

…where the `Either` layer is needed to form the output type of `pluck`, since the iterators in the two branches are of different types: one is a `hash_map::Iter<'m, T, V>`, and the other is something like a `iter::FilterMap<hash_set::Iter<'f, T>, C>` where `C` is the anonymous type of the closure.

(edited: in an earlier version of this comment, I had the right structure, but forgot to actually make the important change and iterate over `which` rather than just iterating over the input map. The main point stands.)

Another approach involves not using `Iterator::filter_map` and instead just implementing our own iterator struct explicitly. This might be simpler (I haven’t written it out). But it’s still not trivial.

Another approach uses `Box<dyn Iterator<Item = (&K, &V)>>` in place of the `Either` layer, pushing more dispatching to runtime.

I suspect that generators would make this a lot easier, but they’re not yet stable (and not really close, afaict).

It’s hard for me to write that and then claim with a straight face that Rust reads like Python. Given that this only affects RPC times (not load throughput) and that even “large” experiments probably won’t have more than a few thousand entries in these hash maps, I’m inclined to punt, at least until we can measure it.
Thanks for the detailed thoughts! Didn't mean to make this too much of a digression, totally fine to leave as just a potential TODO for now. Maybe if I have any particularly good ideas I'll take a stab at this myself :)
Yep, TODO included. And, sure, feel free to grab it. :-)
```rust
    None => continue,
    Some((step, _, _)) => step,
};
let max_wall_time = ts
```
Maybe worth a TODO to track this on `TimeSeries`? Would be nice not to have to rescan every single scalar wall time each time we hit `ListScalars`. I guess it's hard to get this entirely for "free" on `TimeSeries` due to preemptions, but still seems overall cheaper to compute it there (where we could take a cheap path if no preemption happened).
Done (as TODO). Agreed on all counts.
wchargin-branch: rust-listscalars-readscalars wchargin-source: 4ce71f913bc11f4dbf262405d0e3603630a33358 # Conflicts: # tensorboard/data/server/server.rs
wchargin-branch: rust-listscalars-readscalars wchargin-source: 4ce71f913bc11f4dbf262405d0e3603630a33358
wchargin-branch: rust-listscalars-readscalars wchargin-source: 9989e5c9f2f0c7197a756c4169afa1a194d94193
PTAL re: `run_tag_map!`; the rest is mostly just done.
Summary: The Rust gRPC server now implements the RPCs required to serve scalar data to the TensorBoard frontend.

There is some repetition in the implementation: `ListScalars` and `ReadScalars` have very similar bodies. There will be more repetition with the tensor RPCs, and to a lesser degree with those for blob sequences. I plan to consider refactoring that later, once all the raw materials are there, but I’m open to feedback.

Test Plan: Unit tests included. For an end-to-end test, cherry-pick #4356, launch the data server (`bazel run //tensorboard/data/server -- LOGDIR`), and concurrently run TensorBoard with `--grpc_data_provider localhost:6806` and `--generic_data true` flags.

wchargin-branch: rust-listscalars-readscalars