Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Proposal: Adding ArraySegment-based methods to the System.IO namespace #16635

Closed
jamesqo opened this issue Mar 8, 2016 · 18 comments
Closed
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.IO
Milestone

Comments

@jamesqo
Copy link
Contributor

jamesqo commented Mar 8, 2016

Background

To avoid unnecessary allocations, lots of code that works with Streams and I/O passes around a 'portion' of an array containing the relevant data instead of copying it to a new one. Since .NET does not yet support multiple return values, this is commonly represented as an ArraySegment<T>, where T is the type of the element contained in the array. Unfortunately, the System.IO namespace does not have overly sophisticated support for ArraySegments (aside from the recently-added TryGetBuffer for MemoryStream), forcing people who use them to 'unravel' them like this:

stream.Write(segment.Array, segment.Offset, segment.Count);

I propose adding convenience methods to the existing System.IO APIs that will make this less verbose, and prevent accidents like e.g. the programmer mixing up Offset and Count.

Proposed API

namespace System.IO
{
    public class BinaryReader : IDisposable
    {
        public int Read(ArraySegment<byte> buffer);
        public int Read(ArraySegment<char> buffer);
    }

    public class BinaryWriter : IDisposable
    {
        public void Write(ArraySegment<byte> buffer);
        public void Write(ArraySegment<char> buffer);
    }

    public class MemoryStream : Stream
    {
        public MemoryStream(ArraySegment<byte> buffer);
        public MemoryStream(ArraySegment<byte> buffer, bool writable);
        public MemoryStream(ArraySegment<byte> buffer, bool writable, bool publiclyVisible);
    }

    public class Stream : IDisposable
    {
        public int Read(ArraySegment<byte> buffer);
        public Task<int> ReadAsync(ArraySegment<byte> buffer);
        public Task<int> ReadAsync(ArraySegment<byte> buffer, CancellationToken cancellationToken);
        public void Write(ArraySegment<byte> buffer);
        public Task WriteAsync(ArraySegment<byte> buffer);
        public Task WriteAsync(ArraySegment<byte> buffer, CancellationToken cancellationToken);
    }

    public abstract class TextReader : IDisposable
    {
        public int Read(ArraySegment<char> buffer);
        public int ReadBlock(ArraySegment<char> buffer);
        public Task<int> ReadBlockAsync(ArraySegment<char> buffer);
    }

    public abstract class TextWriter : IDisposable
    {
        public void Write(ArraySegment<char> buffer);
        public Task WriteAsync(ArraySegment<char> buffer);
        public void WriteLine(ArraySegment<char> buffer);
        public Task WriteLineAsync(ArraySegment<char> buffer);
    }
}
@justinvp
Copy link
Contributor

justinvp commented Mar 8, 2016

Before adding a bunch of additional ArraySegment<T> helper overloads in System.IO, it might be prudent to wait until Span<T>/ReadOnlySpan<T> lands in CoreFX, and then adopt that in System.IO along these lines.

dotnet/roslyn#120
https://github.com/dotnet/corefxlab/blob/master/src/System.Slices/System/Span.cs
https://github.com/dotnet/corefxlab/blob/master/src/System.Slices/System/ReadOnlySpan.cs

@KrzysztofCwalina
Copy link
Member

I think it's fair to say that we will either add support for slices or arraysegment. Let's hope that for slices :-)

@jamesqo
Copy link
Contributor Author

jamesqo commented Mar 10, 2016

@justinvp @KrzysztofCwalina How would that work in terms of the existing API though? As far as I can see, Span<T> does not expose the underlying array it's based on.

@KrzysztofCwalina
Copy link
Member

I think we will need unsafe API on Span to access T*

@benaadams
Copy link
Member

Return ValueTask<int> rather than Task<int> take the opportunity since the return types on the current methods can't be changed.

@jamesqo
Copy link
Contributor Author

jamesqo commented Sep 5, 2016

@benaadams These are just wrappers over the existing methods.

@benaadams
Copy link
Member

@jamesqo Task<int> will convert to ValueTask<int> so they can still be a wrapper; however if they are virtual then derived classes can override them to return actual ValueTask<int>

@karelz
Copy link
Member

karelz commented Oct 11, 2016

UPDATE (old 'obsolete' ask/suggestion deleted):
Please wait on boader design discussion around Span<T> and ArraySegment to happen first.

@jamesqo
Copy link
Contributor Author

jamesqo commented Oct 12, 2016

@karelz, is it possible to Span<T> efficiently to a (T[] array, int offset, int count)? If not then it may not be possible to update this proposal w/o making all of the methods virtual and performing copying by default like is done is Encoding, which would be rather unfortunate...

@jkotas

@karelz
Copy link
Member

karelz commented Oct 13, 2016

@KrzysztofCwalina will kick off much broader public discussion around design and usage of Span<T>, ArraySegment, etc. early next week. That design discussion will probably go on for a while.
Until there is consensus from that discussion we will put on hold all API additions which are likely to be impacted by the outcome of that discussion (like this one). Please stay tuned.

@davidfowl
Copy link
Member

Span may not really work well for some of these APIs because of the stack only nature of it (specifically the async APIs)

@ayende
Copy link
Contributor

ayende commented Apr 23, 2017

@davidfowl but if I have some unmanged memory, I really want to just pass a span to it rather than cipu to managed array. It isn't allocated on the stack

@stephentoub
Copy link
Member

I'm going to close this in favor of dotnet/corefx#21281. @jamesqo, if anything would be covered by this issue that's not covered functionally by the other, please comment over there. Thanks!

@ddotlic
Copy link

ddotlic commented Nov 28, 2019

@stephentoub Let me try to convince you that the item you linked to and all the - very useful! - changes made to the .NET APIs WRT Span and friends (Memory etc.) have almost nothing to do with this proposal 😉

An example: am loading bytes from a Stream (it's a file, but no matter) and processing them (decryption and decompression); this is a .NET Standard 2.1 class library. Am trying to minimize allocations. This is somewhat hard because lots of FX code relies on Stream API, though the changes in dotnet/corefx#21281 help quite a bit.

Ultimately, everything works OK-ish except that MemoryStream I am writing into (will come back to this) allocates way too much because when it doesn't have enough space it doubles and copies.

Sample code:

private static byte[] Decompress(byte[] data) {
    using var output = new MemoryStream();
    using var input = new MemoryStream(data);
    using var gzip = new GZipStream(input, CompressionMode.Decompress);

    zip.CopyTo(output);
    output.Flush();
    return output.ToArray();
}

In practice, I don't need to write into this MemoryStream - the APIs kinda force my hand, that's the only reason. In fact, both byte[] on input and byte[] on output of this method are wasteful.

I could do things differently: the byte[] that is sent to this method was read from a Stream using a "classic" Read(byte[]) overload. I could have used Read(Span<byte>) multiple times (I don't necessarily know the length of the source stream) and have in my possession a list of Span<byte> (or rather list of Memory<byte> because I'd use MemoryPool<byte>.Shared.Rent)

Exactly in the same way in the middle of the above method, instead of CopyTo I could have used Read(Span<byte>) on gzip (in a loop) until I read everything.

In both cases, I now have blocks of memory which logically represent a single contiguous array of bytes. I could create a single block and copy all the bytes, but that's just wasteful.

Since most FX APIs still only work with Streams, it would be ideal if I could "wrap" this list of Memory<byte> with a MemoryStream, very much like the original proposal in this item. Then the rest of the app reading from this MemoryStream would incur close to zero further allocations (and when MemoryStream is disposed, I could return the rented arrays to the pool), assuming the reading code also used Read(Span<byte>) overload.

I've looked almost everywhere, and have seen a few things which could help me, but not quite:

  1. System.IO.Pipelines seems useful, but I'd need adapters (not available?) for Stream wrapping pipe reader; additionally, it looks like a lot of code for something which can be made simpler
  2. Along the same lines, it looks like ReadOnlySequence and BufferSegment could be used to partially model what I have above but only in isolation - there are no classes "virtualizing" a list of BufferSegment or a single ReadOnlySequence(it models several segments but its API is frankly very difficult to understand) such that I can give the virtual Span<byte> or Memory<byte> to other .NET APIs which expect more or less flat array of bytes (if I could do that then I wouldn't need a new version of MemoryStream, the existing one would work fine)
  3. I looked at a lot of extension methods, source of FX and CoreCLR and cannot find anything.

So... am I missing something? It seems to me that all that I need would be an implementation of MemoryStream - read only variant - which would accept in its constructor a list of Memory<byte> or a list of BufferSegment. That could cover a lot of scenarios when combined with existing APIs. Or something which could facade into a single Memory<byte> from a list of the Memory<byte> then use normal MemoryStream.

Is there something in FX that I have missed? I'm sure I have, please advise 😊

Thanks for reading and thanks for the great work everyone is doing in .NET Core!

@ddotlic
Copy link

ddotlic commented Nov 28, 2019

@stephentoub Please disregard all references to BufferSegment: turns out it was only used as an example in @davidfowl article 😕 it's not an FX type...

@davidfowl
Copy link
Member

I think a better solution would be to expose the innards of those Stream implementation as flat APIs (we've discussed this internally for a while now).

public static class GZip
{
    public OperationStatus TryDecompress(ReadOnlySpan<byte> input, Span<byte> output, out int bytesWritten);
}

Something like the above, I haven't thought about it deeply though (it might need to be non static since zlib is stateful)

@stephentoub
Copy link
Member

Let me try to convince you

Thanks. Almost all of the APIs proposed in this thread are addressed, in that they're all Write/Read methods that take ArraySegment<T>, which is effectively superceded by Memory<T>, and are addressed as mentioned by https://github.com/dotnet/corefx/issues/21281. The one set you're commenting on are the MemoryStream constructors, which actually are also mentioned by that issue (see the section titled "System.IO.BufferStream and System.IO.ReadOnlyBufferStream"... that issue was created when Memory<T> was called Buffer<T>). That portion of the issue was split out into https://github.com/dotnet/corefx/issues/22404, which is still open; feel free to provide feedback on it there. There is also https://github.com/dotnet/corefx/issues/21380 about array pooling with MemoryStream.

@ddotlic
Copy link

ddotlic commented Dec 2, 2019

@davidfowl Of course, having a more "modern" API would be much better than struggling with the aging Stream (we are all well aware of its shortcomings) but the reality is that a crapload of code, including FX, still works "best" (meaning most functionality provided out-of-the-box) with the Stream, hence my comments. It's good to know that folks inside Microsoft are aware of this (tiny) aspect of the FX and are looking into ways to improve it. You say you've discussed internally, are there any "in the open" discussions the community at large may contribute to?

@stephentoub Yes, after I wrote all that I stumbled upon dotnet/corefx#21380 which is much more fitting for what I'm trying to do here, in the absence of pervasive changes to the FX along the lines of @davidfowl proposal above. Sorry for not finding that other item sooner. In fact, when I think about it a bit more - the dotnet/corefx#22404 looks exactly like what I would expect to have one day - one stream to read values from and other to write into; in practice most of my memory streams are either read-only or write-only. I do share @KrzysztofCwalina sentiment though - how the hell are we supposed to name this thing? 😉I actually have a working implementation for portions of both items you linked to; I have some concerns which I will raise on there.

Thank you both for your comments clearly made in your own time, over the WE, it's much appreciated!

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 2, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.IO
Projects
None yet
Development

No branches or pull requests

10 participants