data: implement gRPC data provider for scalars #4356
Conversation
Summary:
The data provider API uses `list_plugins` to populate the list of
active dashboards. I forgot to add a corresponding RPC in #4314, since
I’d hacked around it in my prototypes, but we do want to be able to
implement this cleanly.

Test Plan:
It builds: `bazel build //tensorboard/data/proto:protos_all_py_pb2_grpc`.

wchargin-branch: data-list-plugins-rpc
wchargin-source: 6ab2a2acb154faac3310c4bac690554d7a9e1e74
Summary:
The `//tensorboard/data/server` binary now provides a gRPC server that
implements the `TensorBoardDataProvider` service. It only supports
scalars, and it has entirely fake data. But it’s enough to be wired up
to TensorBoard and show real charts in the UI.

Test Plan:
Run the server with `bazel run -c opt //tensorboard/data/server`.
(Running with `-c opt` is not necessary, but the server is notably
faster that way.) Then, in a different shell, use `grpc_cli`:

```
$ grpc_cli --channel_creds_type=insecure \
>     --protofiles tensorboard/data/proto/data_provider.proto \
>     call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
  name: "train"
  start_time: 1605752017
}
```

Interestingly, this takes 13.9 ± 4.9 ms on my machine, whereas the demo
server added in #4318 took only 5.2 ± 2.6 ms. Both are basically doing
no work on the server, so I suspect that the difference may be due to
`grpc_cli` having to do more work to parse our real proto files. And,
indeed, making the calls from Python instead takes only 0.8–1.1 ms.

wchargin-branch: rust-real-fake-data-provider
wchargin-source: 67e69e8512a96443568ca916dd9559f4457ddf27
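For reference, a minimal sketch of making the same `ListRuns` call from Python. The generated module and message names below are assumptions based on the `protos_all_py_pb2_grpc` build target, so adjust the imports if the generated layout differs:

```python
# Minimal sketch, assuming these generated modules exist under
# tensorboard/data/proto; not part of this PR.
import grpc

from tensorboard.data.proto import data_provider_pb2
from tensorboard.data.proto import data_provider_pb2_grpc

channel = grpc.insecure_channel("localhost:6806")
stub = data_provider_pb2_grpc.TensorBoardDataProviderStub(channel)
response = stub.ListRuns(data_provider_pb2.ListRunsRequest())
for run in response.runs:
    print(run.name, run.start_time)
```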
Summary:
A new `GrpcDataProvider` talks over the `TensorBoardDataProvider` gRPC
interface to query for data. This enables it to talk to a RustBoard
process running concurrently. Pass `--grpc_data_provider ADDR` to
connect to a server running on the given address, like `localhost:6806`.

For now, this data provider only supports scalars. We’ll add more data
classes as we go along. (Easier to review and test this way.)

Test Plan:
Unit tests included. For an end-to-end test, run:

```
bazel run -c opt //tensorboard/data/server &
bazel run -c opt //tensorboard -- \
    --bind_all \
    --generic_data true \
    --grpc_data_provider localhost:6806 \
    --verbosity 0 \
    ;
```

…then navigate to your TensorBoard in the usual fashion. You should see
one run, one chart, and 5 scalar points.

Also tested that normal TensorBoard still works both in open-source and
after a test sync into Google.

wchargin-branch: data-grpc-provider
wchargin-source: 29d4b4d13cdf79e963476b8fd937dcc47f40940e
wchargin-branch: data-grpc-provider wchargin-source: efbe7082d4307747ff3784f268152a2f5d0bd87c
wchargin-branch: rust-real-fake-data-provider wchargin-source: f681817f2483ced7ea49b5757fad20ae4fbd0655
wchargin-branch: rust-real-fake-data-provider wchargin-source: 8b0ae806f174d87b71211f679fa67099b2a22062
wchargin-branch: data-grpc-provider wchargin-source: 4d1796f16daeada6a9add1ba3c5eae2d477eee36
wchargin-branch: rust-real-fake-data-provider wchargin-source: d19fc0c8657e93aad6f3b8a426cc6c4e87b92d44
wchargin-branch: data-grpc-provider wchargin-source: 62905530ad664a99fe799191088f5f2496d83e33
wchargin-branch: rust-real-fake-data-provider wchargin-source: 298acfd7f14bdc2896f04ac37a87481673564e8f
wchargin-branch: data-grpc-provider wchargin-source: 84616afaf867100915850823f58ad6f2594d7099
Despite the core:rustboard tag, this is just Python. :-) Still of course happy to answer any questions, Rust- or otherwise.
wchargin-branch: data-grpc-provider wchargin-source: b274c0db89bc70dd5b94337c5e7e9bc6c109b0b1
Summary:
This patch connects the `//tensorboard/data/server` main module to the
`Commit` and `LogdirLoader` infrastructure, and implements `ListRuns`
as a proof of concept. For now, we just take a logdir path as a
positional command line argument. A future patch will bring nicer
command-line argument parsing.

Test Plan:
Unit tests included. Note that testing these data provider methods is
nice and easy: you just call them. No mocking or networking required.

For a manual `grpc_cli` test, start just the data server, and run:

```
$ grpc_cli --channel_creds_type=insecure \
>     --protofiles tensorboard/data/proto/data_provider.proto \
>     call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
  name: "lr_1E-03,conv=1,fc=2"
  start_time: 1563406327
}
// etc.
```

For an end-to-end test, cherry-pick #4356, then run both the data
server and a copy of TensorBoard concurrently:

```
bazel run //tensorboard/data/server -- ~/tensorboard_data/mnist &
bazel run //tensorboard -- \
    --bind_all \
    --generic_data true \
    --grpc_data_provider localhost:6806 \
    ;
```

Then navigate to `localhost:6006/data/runs` and note that the response
includes a correct list of runs.

wchargin-branch: rust-listruns
```python
if run_tag_filter is None:
    return
if run_tag_filter.runs is not None:
    rtf_proto.runs.names[:] = sorted(run_tag_filter.runs)
```
Does traditional TB also sort the runs/tags, or is this sorting logic new?
Traditional TensorBoard doesn’t send an RPC at all here, so there’s no
corresponding code. This sorting logic just makes the wire format—which
is logically a set—deterministic in representation. I think that that’s
a good habit when performance isn’t critical. It doesn’t really matter.
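As a tiny, hypothetical (non-TensorBoard) illustration of the determinism point: Python set iteration order is not stable across processes, while a repeated proto field keeps whatever order it is filled in, so sorting first pins down the encoding for logically equal filters.

```python
# Hypothetical illustration, not TensorBoard code: sorting the set gives a
# stable order regardless of hash randomization, so the serialized filter
# is byte-for-byte identical across processes.
runs = {"validation", "train", "eval"}
print(sorted(runs))  # always ['eval', 'train', 'validation']
```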
```python
res = self._stub.ListRuns(req)
return [
    provider.Run(
        run_id=run.name, run_name=run.name, start_time=run.start_time,
```
Are `run_id`s a new concept introduced in this grpc design, which is supposed to be unique "across all experiments"?
If the plan is to support globally unique run ids in a world with multiple experiments (e.g. upgrade `run_id=run.name` into `run_id=get_run_id(exp_id, run.name)` or something), it might be worth adding a comment saying that this is intended to change in the future.
No, on the contrary: the gRPC design does not have run IDs, because we
do not yet know what the right semantics are. It is the Python data
provider API that (optimistically) has run IDs, which are a bit of a
wart: they’re returned from `list_runs`, but ignored, and not used in
any of the other methods.
Gotcha, I also see that `MultiplexerDataProvider` also uses the `run_name` as the `run_id`.
Right. (And same for TensorBoard.dev, fwiw.)
"""Returns `(data_provider, deprecated_multiplexer)`.""" | ||
grpc_addr = self.flags.grpc_data_provider | ||
if grpc_addr: | ||
return (self._make_grpc_provider(grpc_addr), None) |
For my own understanding, how does the `grpc_data_provider` know about the logdir, if we don't pass it the CLI flags?
The `grpc_data_provider` doesn’t know about the logdir, nor does it need
to. The data location is populated as `grpc://localhost:6801` based on
the address to which it connects. The logdir that the RPC server reads
from is given as a command-line flag to the server, which is a separate
process entirely.
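For context, a minimal sketch of that idea; the helper name is an assumption, not the actual TensorBoard code:

```python
# Minimal sketch, assuming a hypothetical helper; the real provider may
# store this differently. The point is that only the address is needed,
# not the logdir.
def data_location(grpc_addr):
    """E.g. data_location("localhost:6806") == "grpc://localhost:6806"."""
    return "grpc://%s" % grpc_addr
```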
Ahh right, the technical design doc even clarifies that the data loading server is in a separate process. I forgot about that.
In the longer term, is the plan to keep these 2 processes launched separately? Or will we have the main TB process spawn the RPC serving process, and pass it a subset of the command line flags?
In at least the medium term, I plan for them to remain separate
processes (rather than an in-process C extension), but I do plan for
TensorBoard to launch the RPC server. A sibling of LocalDataIngester
will probably take care of that.
Thanks! Responded, but I don’t see any actionable changes. PTAL.
Summary:
The Rust gRPC server now implements the RPCs required to serve scalar
data to the TensorBoard frontend.

There is some repetition in the implementation: `ListScalars` and
`ReadScalars` have very similar bodies. There will be more repetition
with the tensor RPCs, and to a lesser degree with those for blob
sequences. I plan to consider refactoring that later, once all the raw
materials are there.

Test Plan:
Unit tests included. For an end-to-end test, cherry-pick #4356, launch
the data server (`bazel run //tensorboard/data/server -- LOGDIR`), and
concurrently run TensorBoard with `--grpc_data_provider localhost:6806`
and `--generic_data true` flags.

wchargin-branch: rust-listscalars-readscalars
wchargin-branch: data-grpc-provider wchargin-source: 1cc98475f3cd1f4f198a73792775d5e708266d8e
I tried it out manually, and it works!
Note: if copying the description into the commit message, I would change the first line from
`bazel run -c opt //tensorboard/data/server &`
to
`bazel run -c opt //tensorboard/data/server <logdir> &`
It was correct when the PR was originally sent, and then it was
Summary:
A new `GrpcDataProvider` talks over the `TensorBoardDataProvider` gRPC
interface to query for data. This enables it to talk to a RustBoard
process running concurrently. Pass `--grpc_data_provider ADDR` to
connect to a server running on the given address, like `localhost:6806`.

For now, this data provider only supports scalars. We’ll add more data
classes as we go along. (Easier to review and test this way.)

Test Plan:
Unit tests included. For an end-to-end test, run:
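(Presumably the same command sequence as in the PR description above, with the `<logdir>` argument the reviewer suggested; reproduced here for convenience:)

```
bazel run -c opt //tensorboard/data/server <logdir> &
bazel run -c opt //tensorboard -- \
    --bind_all \
    --generic_data true \
    --grpc_data_provider localhost:6806 \
    --verbosity 0 \
    ;
```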
…and you should see a working scalars dashboard for the given logdir.
Also tested that normal TensorBoard still works both in open-source and
after a test sync into Google.
wchargin-branch: data-grpc-provider