Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for transferring compressed blobs via ByteStream #168

Merged
merged 1 commit into from
Mar 9, 2021

Conversation

mostynb
Copy link
Contributor

@mostynb mostynb commented Sep 3, 2020

In many cases, it is desirable for blobs to be sent in compressed form to and from the cache. While gRPC supports channel-level compression, the generated bindings APIs require that implementers provide data in unserialized and uncompressed form.

By allowing compressed data at the REAPI level instead, we can avoid re-compressing the same data on each request. The ByteStream API stands to benefit the most from this, with the least amount of effort.

Thanks to Eric Burnett and Grzegorz Lukasik for helping with this.

Implements #147.

@googlebot googlebot added the cla: yes Pull requests whose authors are covered by a CLA with Google. label Sep 3, 2020
@@ -207,6 +207,20 @@ service ActionCache {
// `{instance}/uploads/{uuid}/blobs/{hash}/{size}/foo/bar/baz.cc`. Anything
// after the `size` is ignored.
//
// Clients can upload compressed data with a `resource_name` of the form
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you restructure this a little bit to make it 100% clear that the same comments on uuid and trailing filename also apply to compressed-blobs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. While doing so, I dropped the part which described a potential use for the optional metadata.

// Clients can upload compressed data with a `resource_name` of the form
// `{instance_name}/uploads/{uuid}/compressed-blobs/{compressor}/{uncompressed_hash}/{uncompressed_size}{/optional_metadata}`
// where `compressor` is the lowercase string form of a `Compressor.Value` enum
// other than `identity` which is advertised by the server in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this explicitly saying that 'identity' cannot be served over compressed-blobs? I think probably so--the 'rules' for uncompressed data are slightly different (e.g., in some cases using gRPC compression might make sense)--but want to be sure this is intentional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, so clients don't try to use compressed-blobs resource names with old servers that do not populate CacheCapabilities.supported_compressor.

// keywords: `blobs`, `uploads`, `actions`, `actionResults`, `operations` and
// `capabilities`.
// keywords: `blobs`, `uploads`, `actions`, `actionResults`, `operations`,
// `capabilities` and `compressed-blobs`.
//
// When attempting an upload, if another client has already completed the upload
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uploads of the same data may occur on both 'blobs' and 'compressed-blobs' concurrently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added such a statement in the restructuring.

@@ -235,6 +249,27 @@ service ActionCache {
// `instance_name` is the instance name (see above), and `hash` and `size` are
// the [Digest][build.bazel.remote.execution.v2.Digest] of the blob.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same rules about verifying data and retrying apply here, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I covered this in the restructuring.

// once decompressed. Clients SHOULD verify that the `uncompressed_hash` and
// `uncompressed_size` match that of the downloaded data once uncompressed, and
// take appropriate steps in the case of failure such as retrying a limited
// number of times or surfacing an error to the user. Servers MUST provide
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably make this a bit more general: 'servers MUST provide both compressed and uncompressed data for any blobs which is advertised as being available, regardless of how it was uploaded' or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a statement to this effect.

// the case of failure such as retrying a limited number of times or
// surfacing an error to the user.
//
// Servers MAY use any compression level they choose, including different
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be explicit wording that clients may also use any compression level they choose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't hurt to mention it- added in the most recent fixup commit.

// [BatchReadBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.BatchReadBlobs].
//
// For large downloads, the client must use the
// [Read method][google.bytestream.ByteStream.Read] of the ByteStream API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should be explicit that the read_offset, read_limit (and similarly, write_offset on WriteRequest) all refer to the uncompressed data? I don't see a way where it makes sense for it to refer to the compressed data, given the server may choose any compression level it wants, but it sounds useful to be explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great questions.

First, some points to consider:

  • Different implementations of the same compression algorithm can give different compressed data with the same compression level.
  • There is not necessarily a direct mapping between a position in uncompressed data and a position in the compressed data.
  • Servers can provide different compressed representations of the same data for any Read, but this must not change within a given Read.
  • Servers want to store data in compressed form, and probably not also in uncompressed form.

Let's consider downloads first.

  • The server can't cheaply find an arbitrary read_offset in the uncompressed data, because it's likely to be stored in compressed form. (This would require uncompressing some or all of the data, then re-compressing it from an offset.)
  • The client can't start with an arbitrary read_offset in compressed data, because it does not know which compressed representation it will get.
  • So clients must always start with a read_offset of 0, and it should refer to the specific compressed representation of the data the server is currently providing, and must not be larger than the size of the stream sent by the server so far.
  • Clients don't know the size of the compressed data ahead of time, to they must always specify 0 ("no limit").

Unfortunately, this means that clients cannot start a new Read to complete a previous interrupted/incomplete Read.

Now consider uploads, which are a bit simpler:

  • The write_offset should refer to the compressed data representation that the client is sending.
  • The write_offset should either always begin at 0, or perhaps servers could choose whether or not to allow resumed uploads based on the uniqueness of the guid and uncompressed_hash pair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EricBurnett: my reasoning above differs from yours IIRC, for downloads in particular. It mostly hinges on how difficult it is to find a particular offset in the uncompressed data given only a compressed version of it.

In the worst case scenario I suspect this might devolve into decompressing and recompressing almost the entire blob, in which case you could instead use gRPC level compression with the uncompressed bytestream resource.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I follow your reasoning correctly, it's conditioned on the concern:

The server can't cheaply find an arbitrary read_offset in the uncompressed data, because it's likely to be stored in compressed form.

This is not quite true.

  • A basic server can start simply as you suggest, and only support reads and writes from offset 0. Totally reasonable imho.
  • One step up from that, a server can seek and discard server-side for offset reads (better than sending it over the wire unnecessarily), and can append new chunks to the end of the compressed contents for resumable writes.
    • Reads are not super efficient, but semantically appropriate, and gracefully transition to more complicated implementations if usage is high enough to require the efficiency.
    • Resuming broken writes requires recompressing from offset N. (Note that the server must close the compressor and write it out cleanly for the initial partial upload - appending to a truncated stream would not work). Also works for e.g. LogStream - each upload is an independently compressed chunk of bytes, with no knowledge of the compression of the ones that came before required.
  • Finally, servers can store data compressed in chunks, to permit efficient offset reads also. (Decompress nearest chunk and recompress desired portion of it).
    • Of note here, for most compressors (including zstd and gzip) the output is concatenatable - decompress(compress("foo")+compress("bar"))=="foobar". So servers can instead of storing compress(file), store [compress(file.chunk1), compress(file.chunk2), ...] . And then to return the whole stream to the user, return the concatenation of all these chunks; to return contents from a later offset, recompress the one partial chunk (compress(decompress(file.chunkK)[N:])) and return that followed by the concatenation of later chunks in sequence.

Compare with the semantics of offsets expressed in compressed bytes:

  • Arbitrary offset reads are meaningless - clients will not be able to decode from byte N without knowing the state of the bytes preceding.
  • Offset reads to resume a broken stream are possible, assuming the server guarantees to return exactly the same byte sequence. Without that guarantee, not usable.
  • Offset writes are similarly useful to resume a broken stream. Observing a broken write and re-compressing bytes starting at offset N wouldn't work; the exact compressor output must be kept, so it must be literal resumption of the exact output.
  • This is all considerably harder if the server supports multiple compression schemes and recompresses on the fly for any of them - it would have to be able to reproduce state without simply writing to or reading from the compressed bytes as provided by the user.

Overall, we can see offsets into compressed bytes as more of a "resumption token" than anything else - use of them doesn't permit any other use-cases.

To me, then, offsets into the uncompressed bytes are much better: they can be used to permit efficient and semantically-meaningful offset reads, broken and multi-party writes, and work well with recompression between compressors. They're slightly harder to use for resuming broken writes, but in all other respects seem like a clear win to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I follow your reasoning correctly, it's conditioned on the concern:

The server can't cheaply find an arbitrary read_offset in the uncompressed data, because it's likely to be stored in compressed form.

This is not quite true.

  • A basic server can start simply as you suggest, and only support reads and writes from offset 0. Totally reasonable imho.

This is reasonable, if there's a clear way to tell the client "seeking not supported". OUT_OF_RANGE kind of makes sense, but technically the request is in range but unsupported. OUT_OF_RANGE feels even less accurate for discontiguous ReadRequests.

If we can find a clean solution to this, then I'm fine with offsets referring to uncompressed data.

  • Finally, servers can store data compressed in chunks, to permit efficient offset reads also. (Decompress nearest chunk and recompress desired portion of it).

    • Of note here, for most compressors (including zstd and gzip) the output is concatenatable - decompress(compress("foo")+compress("bar"))=="foobar". So servers can instead of storing compress(file), store [compress(file.chunk1), compress(file.chunk2), ...] . And then to return the whole stream to the user, return the concatenation of all these chunks; to return contents from a later offset, recompress the one partial chunk (compress(decompress(file.chunkK)[N:])) and return that followed by the concatenation of later chunks in sequence.

One annoying part here is that (I suspect) splitting compressed data into concatenatable chunks is probably difficult. In practice this probably means that the server would need to recompress the blob before storing it in chunked form. Though I supposed it would need to be done at most once per blob, and an optional implementation detail.

  • This is all considerably harder if the server supports multiple compression schemes and recompresses on the fly for any of them - it would have to be able to reproduce state without simply writing to or reading from the compressed bytes as provided by the user.

Right, but isn't the goal to minimise server side recompression?

Overall, we can see offsets into compressed bytes as more of a "resumption token" than anything else - use of them doesn't permit any other use-cases.

To me, then, offsets into the uncompressed bytes are much better: they can be used to permit efficient and semantically-meaningful offset reads, broken and multi-party writes, and work well with recompression between compressors. They're slightly harder to use for resuming broken writes, but in all other respects seem like a clear win to me.

I agree re semantic meaningfulness, and to a lesser extent re efficiency (still OK with me if servers can tell the clients to just start over).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reasonable, if there's a clear way to tell the client "seeking not supported".

Good point! Yes, we should definitely clarify that if it's to be allowed. Though that said, that also puts the burden on clients to have a codepath for falling back if an offset read is rejected, and would be considered a backwards-incompatible API change (since clients presumably don't handle this today). The more I think about it, the more I realize it's probably better to require servers at least implement an inefficient read-and-discard path. That's arguably simpler on net (the client-side fallback would be uglier and embed read-and-discard within it), and in most cases it'll be rarely enough hit that inefficient is probably fine?

One annoying part here is that (I suspect) splitting compressed data into concatenatable chunks is probably difficult. In practice this probably means that the server would need to recompress the blob before storing it in chunked form.

Yeah, it means the server needs to recompress, but that's just a combination of two things it has to do anyways - for compressed uploads it must be at least decompressing so it can check the digest matches; for uncompressed uploads it will be compressing as it puts the file into storage; now it needs to chain both. And yeah, chunking is more complicated if you choose to implement that, but it's not too bad. (I've already written it for our own compressed storage implementation fwiw - didn't take me too long).

Though I supposed it would need to be done at most once per blob

And for most use cases I'd expect reads to outweigh writes by a decent margin (we observe 9:1), so in practice only a small fraction of the bytes being handled by the server require compression. I found it to have a negligible impact on net server CPU usage, when we switched from uncompressed storage to chunk-compressed zstd-1. YMMV.

This is all considerably harder if the server supports multiple compression schemes and recompresses on the fly for any of them - it would have to be able to reproduce state without simply writing to or reading from the compressed bytes as provided by the user.

Right, but isn't the goal to minimise server side recompression?

I realized my example was over-complicated - we don't even need to bring other compressors into it. Any uncompressed read of a blob that's stored compressed is exactly the same - a read with an optional offset into the uncompressed contents - which is something everyone has to implement anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reasonable, if there's a clear way to tell the client "seeking not supported".

Good point! Yes, we should definitely clarify that if it's to be allowed. Though that said, that also puts the burden on clients to have a codepath for falling back if an offset read is rejected, and would be considered a backwards-incompatible API change (since clients presumably don't handle this today). The more I think about it, the more I realize it's probably better to require servers at least implement an inefficient read-and-discard path. That's arguably simpler on net (the client-side fallback would be uglier and embed read-and-discard within it), and in most cases it'll be rarely enough hit that inefficient is probably fine?

I guess we should make an explicit statement here, one of:

  1. Servers MUST support compressed reads from arbitrary offsets.
  2. Servers MAY return error code X for non-zero offset reads, if they are not supported. Maybe ABORTED makes sense, based on the discussion in https://developers.google.com/maps-booking/reference/grpc-api/status_codes#failed_precondition_vs_aborted_vs_unavailable

One annoying part here is that (I suspect) splitting compressed data into concatenatable chunks is probably difficult. In practice this probably means that the server would need to recompress the blob before storing it in chunked form.

Yeah, it means the server needs to recompress, but that's just a combination of two things it has to do anyways - for compressed uploads it must be at least decompressing so it can check the digest matches; for uncompressed uploads it will be compressing as it puts the file into storage; now it needs to chain both. And yeah, chunking is more complicated if you choose to implement that, but it's not too bad. (I've already written it for our own compressed storage implementation fwiw - didn't take me too long).

Though I supposed it would need to be done at most once per blob

And for most use cases I'd expect reads to outweigh writes by a decent margin (we observe 9:1), so in practice only a small fraction of the bytes being handled by the server require compression. I found it to have a negligible impact on net server CPU usage, when we switched from uncompressed storage to chunk-compressed zstd-1. YMMV.

After some experimentation this seems OK (though I haven't benchmarked at scale yet). I'll try to update the PR tomorrow.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reasonable, if there's a clear way to tell the client "seeking not supported".

Good point! Yes, we should definitely clarify that if it's to be allowed. Though that said, that also puts the burden on clients to have a codepath for falling back if an offset read is rejected, and would be considered a backwards-incompatible API change (since clients presumably don't handle this today). The more I think about it, the more I realize it's probably better to require servers at least implement an inefficient read-and-discard path. That's arguably simpler on net (the client-side fallback would be uglier and embed read-and-discard within it), and in most cases it'll be rarely enough hit that inefficient is probably fine?

I guess we should make an explicit statement here, one of:

  1. Servers MUST support compressed reads from arbitrary offsets.
  2. Servers MAY return error code X for non-zero offset reads, if they are not supported. Maybe ABORTED makes sense, based on the discussion in https://developers.google.com/maps-booking/reference/grpc-api/status_codes#failed_precondition_vs_aborted_vs_unavailable
  1. Should we add support for offset reads to the capabilities? That would make choosing the client side codepath dead easy, and would save servers from needing to implement offset reads. I'm leaning against it--the throw-bytes-away approach is simple enough--but wondering how others feel.

  2. If we make not supporting (non-zero) offset reads an option, then I think that INVALID_INPUT is the correct error code. The non-zero offset is not a supported input.

One annoying part here is that (I suspect) splitting compressed data into concatenatable chunks is probably difficult. In practice this probably means that the server would need to recompress the blob before storing it in chunked form.

Yeah, it means the server needs to recompress, but that's just a combination of two things it has to do anyways - for compressed uploads it must be at least decompressing so it can check the digest matches; for uncompressed uploads it will be compressing as it puts the file into storage; now it needs to chain both. And yeah, chunking is more complicated if you choose to implement that, but it's not too bad. (I've already written it for our own compressed storage implementation fwiw - didn't take me too long).

Though I supposed it would need to be done at most once per blob

And for most use cases I'd expect reads to outweigh writes by a decent margin (we observe 9:1), so in practice only a small fraction of the bytes being handled by the server require compression. I found it to have a negligible impact on net server CPU usage, when we switched from uncompressed storage to chunk-compressed zstd-1. YMMV.

After some experimentation this seems OK (though I haven't benchmarked at scale yet). I'll try to update the PR tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Should we add support for offset reads to the capabilities? That would make choosing the client side codepath dead easy, and would save servers from needing to implement offset reads. I'm leaning against it--the throw-bytes-away approach is simple enough--but wondering how others feel.
  2. If we make not supporting (non-zero) offset reads an option, then I think that INVALID_INPUT is the correct error code. The non-zero offset is not a supported input.

I don't think it's worth adding a compressed-offset-reads boolean to the capabilities.

If we allow servers to return INVALID_INPUT for positive offset compressed reads, clients will have to do some guesswork to figure out which field was considered invalid. Whereas I think throw-the-bytes-away for compressed offset reads is simple enough for servers to implement, and uncommon enough to not spend time worrying about the performance impact.

Should we add some text to state that servers must support compressed reads from positive (in range) offsets?

build/bazel/remote/execution/v2/remote_execution.proto Outdated Show resolved Hide resolved
// * `instance_name` and `compressor` are defined as for uploads.
// * `uncompressed_hash` and `uncompressed_size` refer to the
// [Digest][build.bazel.remote.execution.v2.Digest] of the data being
// downloaded, once uncompressed. Clients MUST verify that these match
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand making verification a requirement for the server, because clearly specifying the error behavior is necessary. Is it really an API requirement that clients verify the data? What does it defend against?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I added this in response to your earlier issue: d342599#r483648050 "The same rules about verifying data and retrying apply here, too". In commit ed21a18 this text was changed from "Clients SHOULD" to "Clients MUST". Perhaps I misunderstood your earlier comment?

Verifying these hashes protects against data corruption (on disk, or in transit, or perhaps in compression).

It is undesirable for servers to perform this check, because that would require that they decompress + hash these compressed blobs- whereas one of the goals of this feature is to avoid this work and stream compressed blobs directly.

Clients on the other hand need to decompress this data anyway, and if servers aren't verifying then I think the clients should be. I'm not sure if they "must", though.

rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Oct 30, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Oct 30, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 6, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 6, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 9, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 9, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 11, 2020
It follows the specs specified in bazelbuild/remote-apis#168, and
it is similar to #232. Note that while the API still has room to
change, it is mostly finalized and worth implementing.

A caveat of this implementation is that while the `offset` in reads
refers to the uncompressed bytes, the `limit` refers to the compressed
bytes.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 12, 2020
It follows the specs specified in bazelbuild/remote-apis#168, and
it is similar to #232. Note that while the API still has room to
change, it is mostly finalized and worth implementing.

A caveat of this implementation is that while the `offset` in reads
refers to the uncompressed bytes, the `limit` refers to the compressed
bytes.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 12, 2020
It follows the specs specified in bazelbuild/remote-apis#168, and
it is similar to #232. Note that while the API still has room to
change, it is mostly finalized and worth implementing.

A caveat of this implementation is that while the `offset` in reads
refers to the uncompressed bytes, the `limit` refers to the compressed
bytes.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 13, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 13, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
@mostynb
Copy link
Contributor Author

mostynb commented Nov 16, 2020

Here is an early prototype of compressed-blobs support for bazel-remote:
https://github.com/mostynb/bazel-remote/tree/compressed_blobs_prototype_1

I haven't finalized the on-disk format for cas blobs, so please do not test this on important/production caches yet (existing CAS blobs will be rewritten to a new format on startup). To avoid accidents, this version of bazel-remote will refuse to start unless you specify the --destructive_compression_prototype command line flag.

rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 16, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 17, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 17, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 17, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 17, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 23, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 23, 2020
It follows the specs specified in bazelbuild/remote-apis#168, and
it is similar to #232. Note that while the API still has room to
change, it is mostly finalized and worth implementing.

A caveat of this implementation is that while the `offset` in reads
refers to the uncompressed bytes, the `limit` refers to the compressed
bytes.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 23, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 24, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 24, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 24, 2020
It follows the specs specified in bazelbuild/remote-apis#168, and
it is similar to #232. Note that while the API still has room to
change, it is mostly finalized and worth implementing.

A caveat of this implementation is that while the `offset` in reads
refers to the uncompressed bytes, the `limit` refers to the compressed
bytes.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 24, 2020
This follows the current tentative API being worked on in
bazelbuild/remote-apis#168. While there's technically room for it to
change, it has reached a somewhat stable point worth implementing.
rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 24, 2020
It follows the specs specified in bazelbuild/remote-apis#168, and
it is similar to #232. Note that while the API still has room to
change, it is mostly finalized and worth implementing.

A caveat of this implementation is that while the `offset` in reads
refers to the uncompressed bytes, the `limit` refers to the compressed
bytes.
@mostynb
Copy link
Contributor Author

mostynb commented Nov 24, 2020

Squashed the commits, undid the changes to the generated go bindings, rebased on master and added a fixup with a note about the bytestream ReadRequest.read_limit referring to the compressed data and warning clients not to assume that this value is <= the uncompressed data size.

(I can regenerate the go bindings and squash everything down to a single commit when this is ready to land.)

rubensf added a commit to bazelbuild/remote-apis-sdks that referenced this pull request Nov 25, 2020
It follows the specs specified in bazelbuild/remote-apis#168, and
it is similar to #232. Note that while the API still has room to
change, it is mostly finalized and worth implementing.

A caveat of this implementation is that while the `offset` in reads
refers to the uncompressed bytes, the `limit` refers to the compressed
bytes.
//
// Note that when reading compressed blobs, the `ReadRequest.read_offset`
// refers to the offset in the uncompressed form of the blob, and the
// `ReadRequest.read_limit` refers to the offset in the specific compressed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this comment. read_limit is not an offset akin to read_offset, it's a limit on the number of bytes to be read from that point. (I read this sentence as implying offset from beginning, though I'm not sure if that's how it was intended).

When would a caller want to set a read_limit to constrain the number of compressed bytes returned? I suppose I could imagine it as a defensive mechanism to avoid unbounded reading (since technically the compressed size could be higher than the uncompressed size), but the client can also just abort the read if they're getting too many raw bytes back. But it seems like it'd be more semantically meaningful for it to also refer to the uncompressed bytes, so that one can express "read me 1MB of bytes starting at offset 100MB", independent of compression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this comment. read_limit is not an offset akin to read_offset, it's a limit on the number of bytes to be read from that point. (I read this sentence as implying offset from beginning, though I'm not sure if that's how it was intended).

They are only akin in the sense that they're siblings in the ReadRequest message which I think deserve clarification for our use case. Perhaps we should split these into distinct sentences, or dot points?

While it would be easier to understand from the client side if read_limit was specified in uncompressed bytes, I don't think it's possible to implement well on the server side without sacrificing more important qualities or only providing an approximate limit (eg at the chunk-size resolution if the server implements that).

When would a caller want to set a read_limit to constrain the number of compressed bytes returned? I suppose I could imagine it as a defensive mechanism to avoid unbounded reading (since technically the compressed size could be higher than the uncompressed size), but the client can also just abort the read if they're getting too many raw bytes back. But it seems like it'd be more semantically meaningful for it to also refer to the uncompressed bytes, so that one can express "read me 1MB of bytes starting at offset 100MB", independent of compression.

I suspect read_limit is something that is useful for other bytestream use cases, eg a defensive mechanism as you mentioned, when the size of the thing being requested isn't specified in the resource_name. I do not think it's useful for us here.

I would be OK with instead stating that server should ignore this field. Should we do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait- I meant for this new text to say something like "refers to a limit of the number of compressed bytes in the specific compressed form."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are only akin in the sense that they're siblings in the ReadRequest message which I think deserve clarification for our use case. Perhaps we should split these into distinct sentences, or dot points?

Dot points like the other arguments above sound useful.

I see a couple other semantic problems with having read_limit refer to the uncompressed size:

  • In that case, Bytestream.Read now has to be aware of two limits: the one to advertise in the request, and the one to enforce for aborting the connection.
  • If the client is aborting the connection as a way of limitting, read_limit becomes a somewhat redundant argument when reading uncompressed bytes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it would be easier to understand from the client side if read_limit was specified in uncompressed bytes, I don't think it's possible to implement well on the server side without sacrificing more important qualities or only providing an approximate limit (eg at the chunk-size resolution if the server implements that).

It would be a bit fiddly, but not atrocious: uncompress, truncate, and recompress the final block, similar to what we have to do for the initial block of an offset read. I can definitely see the argument it's not worth the effort though.

Rubens:

In that case, Bytestream.Read now has to be aware of two limits: the one to advertise in the request, and the one to enforce for aborting the connection.

I'm not sure I understand this point - are you referring to a client implementation of Read? Presumably a client will pick one or the other and not both.

If the client is aborting the connection as a way of limitting, read_limit becomes a somewhat redundant argument when reading uncompressed bytes.

TBH, whether you want to limit on compressed bytes or uncompressed bytes, you can do it by just reading till you've gotten enough and then aborting. Which is maybe an argument towards Mostyn's proposal of just ignoring the field?

The only use-case I actually have in mind for read_limit is UIs - where e.g. "Read 1MB at offset 100MB" is theoretically useful for paginating large files, and would require the limit to be expressed in terms of uncompressed bytes to be useful. Limiting compressed bytes is kinda bizarre - it either requires you to guess how many compressed bytes will be needed to get the right number of uncompressed bytes, or to arbitrarily truncate the read after some arbitrary number of logical bytes have been read, as a way to defend against...something. (But it'll decompress into arbitrarily large output, so not a very good defence against memory blowup?).

But again, I don't know that it matters enough - if you have a preference here, SGTM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't want servers to implement this, should we require servers to return INVALID_ARGUMENT if the value is non-zero, or to ignore the field completely? I think the former would be easier to gracefully migrate from if we decide to add support later.

Copy link
Contributor

@rubensf rubensf Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this point - are you referring to a client implementation of Read? Presumably a client will pick one or the other and not both.

I do mean a client implementation of Read. If ByteStream.ReadRequest.read_limit refers to the uncompressed bytes, and the Read call will abort after a certain number of raw bytes, that's two limits to be aware about. I guess you could argue that this is implementation specific, or not exactly an issue, but this would likely means refactoring of some existing bytestream client implementations to support this - the go remote-apis-sdks being one of them. I'm not sure if that's a significant issue.

I think that, plus having to rework the last block of a blob in the server side, make it slightly harder to support read_limit referring to the uncompressed bytes. But I also agree that this control can be much more useful.

If we don't want servers to implement this, should we require servers to return INVALID_ARGUMENT if the value is non-zero, or to ignore the field completely? I think the former would be easier to gracefully migrate from if we decide to add support later.

Being explicit with INVALID_ARGUMENT sounds better than silent ignores IMO - it saves people trouble debugging if trying to set a read_limit and receiving more bytes than expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a fixup that reorganises the "when reading compressed-blobs" statements into dot points, and require servers to return INVALID_ARGUMENT for non-zero read_limit values.

@mostynb
Copy link
Contributor Author

mostynb commented Jan 21, 2021

@peterebden: here are some compressed-blobs server implementation notes (feel free to add questions/comments over there):
https://docs.google.com/document/d/1aGjg3TghOeNUzZzHG_gywXPhREJTT3Xei6WZBYDDv1U/edit?usp=sharing

In many cases, it is desirable for blobs to be sent in compressed form
to and from the cache. While gRPC supports channel-level compression,
the generated bindings APIs require that implementers provide data in
unserialized and uncompressed form.

By allowing compressed data at the REAPI level instead, we can avoid
re-compressing the same data on each request. The ByteStream API stands
to benefit the most from this, with the least amount of effort.

Thanks to Eric Burnett and Grzegorz Lukasik for helping with this.

Implements bazelbuild#147.
@mostynb mostynb force-pushed the compressed_blobs branch from eaf2779 to 8cb2f1b Compare March 2, 2021 07:02
@mostynb
Copy link
Contributor Author

mostynb commented Mar 2, 2021

Squashed and rebased this on master.

(Also: I landed support for this on bazel-remote's master branch this morning.)

@bergsieker bergsieker merged commit 0943dc4 into bazelbuild:master Mar 9, 2021
@mostynb mostynb deleted the compressed_blobs branch March 9, 2021 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Pull requests whose authors are covered by a CLA with Google.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants