
data: implement gRPC data provider for scalars #4356

Merged: 15 commits merged into master from wchargin-data-grpc-provider on Nov 30, 2020

Conversation

@wchargin (Contributor) commented on Nov 19, 2020:

Summary:
A new `GrpcDataProvider` talks over the `TensorBoardDataProvider` gRPC
interface to query for data. This enables it to talk to a RustBoard
process running concurrently. Pass `--grpc_data_provider ADDR` to
connect to a server running on the given address, like `localhost:6806`.

For now, this data provider only supports scalars. We’ll add more data
classes as we go along. (Easier to review and test this way.)

Test Plan:
Unit tests included. For an end-to-end test, run:

```
bazel run -c opt //tensorboard/data/server -- LOGDIR &
bazel run -c opt //tensorboard -- \
    --bind_all \
    --generic_data true \
    --grpc_data_provider localhost:6806 \
    --verbosity 0 \
    ;
```

…and you should see a working scalars dashboard for the given logdir.

Also tested that normal TensorBoard still works both in open-source and
after a test sync into Google.
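
For the curious, the wiring behind `--grpc_data_provider` is small. A minimal sketch in Python follows; the module path and constructor signature here are assumptions based on this PR's description, not verified against the final code:

```python
import grpc

from tensorboard.data import grpc_provider  # module added in this PR
from tensorboard.data.proto import data_provider_pb2_grpc

# Sketch only: the GrpcDataProvider constructor signature is assumed.
addr = "localhost:6806"
channel = grpc.insecure_channel(addr)
stub = data_provider_pb2_grpc.TensorBoardDataProviderStub(channel)
data_provider = grpc_provider.GrpcDataProvider(addr, stub)
```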

wchargin-branch: data-grpc-provider

Summary:
The data provider API uses `list_plugins` to populate the list of active
dashboards. I forgot to add a corresponding RPC in #4314, since I’d
hacked around it in my prototypes, but we do want to be able to
implement this cleanly.

Test Plan:
It builds: `bazel build //tensorboard/data/proto:protos_all_py_pb2_grpc`.
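
Once a server implements the RPC, it could be exercised from Python along these lines; the request/response message and field names are assumptions based on the RPC name, so check `data_provider.proto`:

```python
import grpc

from tensorboard.data.proto import data_provider_pb2
from tensorboard.data.proto import data_provider_pb2_grpc

channel = grpc.insecure_channel("localhost:6806")
stub = data_provider_pb2_grpc.TensorBoardDataProviderStub(channel)
# Message and field names assumed from the RPC name.
res = stub.ListPlugins(data_provider_pb2.ListPluginsRequest())
print([plugin.name for plugin in res.plugins])
```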

wchargin-branch: data-list-plugins-rpc
wchargin-source: 6ab2a2acb154faac3310c4bac690554d7a9e1e74
Summary:
The `//tensorboard/data/server` binary now provides a gRPC server that
implements the `TensorBoardDataProvider` service. It only supports
scalars, and it has entirely fake data. But it’s enough to be wired up
to TensorBoard and show real charts in the UI.

Test Plan:
Run the server with `bazel run -c opt //tensorboard/data/server`.
(Running with `-c opt` is not necessary, but the server is notably
faster that way.) Then, in a different shell, use `grpc_cli`:

```
$ grpc_cli --channel_creds_type=insecure \
>   --protofiles tensorboard/data/proto/data_provider.proto \
>   call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
  name: "train"
  start_time: 1605752017
}
```

Interestingly, this takes 13.9 ± 4.9 ms on my machine, whereas the demo
server added in #4318 took only 5.2 ± 2.6 ms. Both are basically doing
no work on the server, so I suspect that the difference may be due to
`grpc_cli` having to do more work to parse our real proto files. And,
indeed, making the calls from Python instead takes only 0.8–1.1 ms.
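
For reference, a rough way to take that Python-side measurement; module and message names are assumed, as above:

```python
import timeit

import grpc

from tensorboard.data.proto import data_provider_pb2
from tensorboard.data.proto import data_provider_pb2_grpc

channel = grpc.insecure_channel("localhost:6806")
stub = data_provider_pb2_grpc.TensorBoardDataProviderStub(channel)
req = data_provider_pb2.ListRunsRequest()
stub.ListRuns(req)  # warm up the channel before measuring

# Median wall time of a single ListRuns call, in milliseconds.
times = timeit.repeat(lambda: stub.ListRuns(req), number=1, repeat=100)
print("%.2f ms" % (1000 * sorted(times)[len(times) // 2]))
```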

wchargin-branch: rust-real-fake-data-provider
wchargin-source: 67e69e8512a96443568ca916dd9559f4457ddf27
Summary:
A new `GrpcDataProvider` talks over the `TensorBoardDataProvider` gRPC
interface to query for data. This enables it to talk to a RustBoard
process running concurrently. Pass `--grpc_data_provider ADDR` to
connect to a server running on the given address, like `localhost:6806`.

For now, this data provider only supports scalars. We’ll add more data
classes as we go along. (Easier to review and test this way.)

Test Plan:
Unit tests included. For an end-to-end test, run:

```
bazel run -c opt //tensorboard/data/server &
bazel run -c opt //tensorboard -- \
    --bind_all \
    --generic_data true \
    --grpc_data_provider localhost:6806 \
    --verbosity 0 \
    ;
```

…then navigate to your TensorBoard in the usual fashion. You should see
one run, one chart, and 5 scalar points.

Also tested that normal TensorBoard still works both in open-source and
after a test sync into Google.

wchargin-branch: data-grpc-provider
wchargin-source: 29d4b4d13cdf79e963476b8fd937dcc47f40940e
wchargin added the core:rustboard //tensorboard/data/server/... label on Nov 19, 2020
google-cla bot added the cla: yes label on Nov 19, 2020
wchargin requested a review from psybuzz on November 24, 2020
@wchargin (Contributor, Author) commented:

Despite the core:rustboard tag, this is just Python. :-)

Still of course happy to answer any questions, Rust- or otherwise.

Base automatically changed from wchargin-rust-real-fake-data-provider to master November 24, 2020 03:53
wchargin added a commit that referenced this pull request Nov 25, 2020
Summary:
This patch connects the `//tensorboard/data/server` main module to the
`Commit` and `LogdirLoader` infrastructure, and implements `ListRuns` as
a proof of concept.

For now, we just take a logdir path as a positional command line
argument. A future patch will bring nicer command-line argument parsing.

Test Plan:
Unit tests included. Note that testing these data provider methods is
nice and easy: you just call them. No mocking or networking required.

For a manual `grpc_cli` test, start just the data server, and run:

```
$ grpc_cli --channel_creds_type=insecure \
>   --protofiles tensorboard/data/proto/data_provider.proto \
>   call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
  name: "lr_1E-03,conv=1,fc=2"
  start_time: 1563406327
}
// etc.
```

For an end-to-end test, cherry-pick #4356, then run both the data server
and a copy of TensorBoard concurrently:

```
bazel run //tensorboard/data/server -- ~/tensorboard_data/mnist &
bazel run //tensorboard -- \
    --bind_all \
    --generic_data true \
    --grpc_data_provider localhost:6806 \
    ;
```

Then navigate to `localhost:6006/data/runs` and note that the response
includes a correct list of runs.
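
That last check can also be scripted; a minimal sketch using only the standard library, with the endpoint path taken from the test plan above:

```python
import json
import urllib.request

# Fetch the same endpoint the test plan inspects in the browser.
with urllib.request.urlopen("http://localhost:6006/data/runs") as resp:
    print(json.load(resp))
```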

wchargin-branch: rust-listruns
```python
if run_tag_filter is None:
    return
if run_tag_filter.runs is not None:
    rtf_proto.runs.names[:] = sorted(run_tag_filter.runs)
```
@psybuzz (Contributor) commented:
Does traditional TB also sort the runs/tags, or is this sorting logic new?

@wchargin (Contributor, Author) replied:

Traditional TensorBoard doesn’t send an RPC at all here, so there’s no
corresponding code. This sorting logic just makes the wire format—which
is logically a set—deterministic in representation. I think that's
a good habit when performance isn't critical. It doesn't really matter.
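
For context, a sketch of how such a conversion helper might look as a whole; the helper name is hypothetical, and the `tags` branch is assumed symmetric to the `runs` branch quoted above:

```python
def _populate_run_tag_filter(run_tag_filter, rtf_proto):
    """Copies a `provider.RunTagFilter` into its proto representation.

    Sorting the logically set-valued fields makes the serialized
    request deterministic byte-for-byte.
    """
    if run_tag_filter is None:
        return
    if run_tag_filter.runs is not None:
        rtf_proto.runs.names[:] = sorted(run_tag_filter.runs)
    if run_tag_filter.tags is not None:  # assumed symmetric to `runs`
        rtf_proto.tags.names[:] = sorted(run_tag_filter.tags)
```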

```python
res = self._stub.ListRuns(req)
return [
    provider.Run(
        run_id=run.name, run_name=run.name, start_time=run.start_time,
    )
    for run in res.runs
]
```
@psybuzz (Contributor) commented:

Are run_ids a new concept introduced in this gRPC design, which is supposed to be unique "across all experiments"?

If the plan is to support globally unique run IDs in a world with multiple experiments (e.g., upgrade `run_id=run.name` into `run_id=get_run_id(exp_id, run.name)` or something), it might be worth adding a comment saying that this is intended to change in the future.

@wchargin (Contributor, Author) replied:

No, on the contrary: the gRPC design does not have run IDs, because we
do not yet know what the right semantics are. It is the Python data
provider API that (optimistically) has run IDs, which are a bit of a
wart: they’re returned from list_runs, but ignored, and not used in
any of the other methods.

@psybuzz (Contributor) replied:

Gotcha; I see that `MultiplexerDataProvider` also uses the `run_name` as the `run_id`.

@wchargin (Contributor, Author) replied:

Right. (And same for TensorBoard.dev, fwiw.)

"""Returns `(data_provider, deprecated_multiplexer)`."""
grpc_addr = self.flags.grpc_data_provider
if grpc_addr:
return (self._make_grpc_provider(grpc_addr), None)
@psybuzz (Contributor) commented:

For my own understanding, how does the grpc_data_provider know about the logdir, if we don't pass it the CLI flags?

@wchargin (Contributor, Author) replied:

The grpc_data_provider doesn’t know about the logdir, nor does it need
to. The data location is populated as grpc://localhost:6801 based on
the address to which it connects. The logdir that the RPC server reads
from is given as a command-line flag to the server, which is a separate
process entirely.

@psybuzz (Contributor) replied:

Ahh right, the technical design doc even clarifies that the data loading server is in a separate process. I forgot about that.

In the longer term, is the plan to keep these 2 processes launched separately? Or will we have the main TB process spawn the RPC serving process, and pass it a subset of the command line flags?

@wchargin (Contributor, Author) replied:

In at least the medium term, I plan for them to remain separate
processes (rather than an in-process C extension), but I do plan for
TensorBoard to launch the RPC server. A sibling of LocalDataIngester
will probably take care of that.
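
A purely hypothetical sketch of that idea; none of these names exist yet, and the binary path is a placeholder (only the `--logdir` flag is attested, per #4397):

```python
import subprocess

class GrpcServerDataIngester:
    """Hypothetical sibling of LocalDataIngester.

    Spawns the data server as a subprocess, then dials it over gRPC
    instead of reading the logdir in-process.
    """

    def __init__(self, flags):
        self._logdir = flags.logdir
        self._address = "localhost:6806"
        self._proc = None

    def start(self):
        # Launch the data server against the logdir; TensorBoard
        # itself only ever sees self._address.
        self._proc = subprocess.Popen(
            ["tensorboard-data-server", "--logdir", self._logdir]
        )
```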

@wchargin (Contributor, Author) left a review:

Thanks! Responded, but I don’t see any actionable changes. PTAL.


wchargin requested a review from psybuzz on November 25, 2020
wchargin added a commit that referenced this pull request Nov 25, 2020
Summary:
The Rust gRPC server now implements the RPCs required to serve scalar
data to the TensorBoard frontend.

There is some repetition in the implementation: `ListScalars` and
`ReadScalars` have very similar bodies. There will be more repetition
with the tensor RPCs, and to a lesser degree with those for blob
sequences. I plan to consider refactoring that later, once all the raw
materials are there.

Test Plan:
Unit tests included. For an end-to-end test, cherry-pick #4356, launch
the data server (`bazel run //tensorboard/data/server -- LOGDIR`), and
concurrently run TensorBoard with `--grpc_data_provider localhost:6806`
and `--generic_data true` flags.
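
As a further sanity check, scalar data can be pulled straight through the provider API; a hedged example, assuming `data_provider` is a connected `GrpcDataProvider` and using placeholder run/tag names (keyword arguments per the generic data provider API of the time):

```python
from tensorboard.data import provider

rtf = provider.RunTagFilter(runs=["train"], tags=["loss"])
data = data_provider.read_scalars(
    experiment_id="",
    plugin_name="scalars",
    downsample=1000,
    run_tag_filter=rtf,
)
# Result shape: {run_name: {tag_name: [ScalarDatum, ...]}}.
for run_name, tag_to_data in data.items():
    for tag, points in tag_to_data.items():
        print(run_name, tag, len(points))
```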

wchargin-branch: rust-listscalars-readscalars
@psybuzz (Contributor) left a review:

Tried it out manually, and it works!

Note: if copying the description into the commit message, I would change the first line from `bazel run -c opt //tensorboard/data/server &` to `bazel run -c opt //tensorboard/data/server <logdir> &`.


@wchargin (Contributor, Author) replied:

> Note: if copying the description into the commit message, I would change the first line from `bazel run -c opt //tensorboard/data/server &` to `bazel run -c opt //tensorboard/data/server <logdir> &`.

It was correct when the PR was originally sent, and then it was
`<logdir>`; as of #4397, it's `--logdir <logdir>`. Progress is fast. :-)

wchargin merged commit 8bec40d into master on Nov 30, 2020
wchargin deleted the wchargin-data-grpc-provider branch on November 30, 2020