[C#] - Batch / Data Size Can't Exceed 2 gigs #23776

Open
asfimport opened this issue Jan 7, 2020 · 1 comment

Comments

@asfimport
Collaborator

While the Arrow spec does not forbid batches larger than 2 gigs, the C# library cannot support them in its current form: it tries to put the whole batch into a single Span/Memory, and that runs into the limits on managed memory.

It is possible to fix this by not using Memory/Span/byte[] for the entire batch, and instead moving the memory mapping down to the ArrowBuffers. This only moves the problem 'lower', as it would still limit the data of a single column in a single batch to 2 gigs.

This seems like plenty of memory, but think of string columns: the data is just one giant string of all the values appended together, plus offsets, and it can get very large quickly.
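To make the string-column point concrete, here is a rough, language-neutral sketch in Python (not the C# library's code). It lays out a few strings as one concatenated data buffer plus offsets, then estimates how many values of a given average size fit under a 32-bit length. The helper names `build_string_column` and `rows_until_limit` are made up for illustration, and the 100-byte average is just an assumed example figure.

```python
INT32_MAX = 2**31 - 1  # largest length a 32-bit signed int can express


def build_string_column(values):
    """Lay out strings as one concatenated data buffer plus offsets.

    The offsets list has len(values) + 1 entries; value i occupies
    data[offsets[i]:offsets[i + 1]].
    """
    encoded = [v.encode("utf-8") for v in values]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    return b"".join(encoded), offsets


def rows_until_limit(avg_bytes_per_value, limit=INT32_MAX):
    """How many values of the given average size fit under the limit."""
    return limit // avg_bytes_per_value


data, offsets = build_string_column(["foo", "", "barbaz"])
print(data, offsets)          # b'foobarbaz' [0, 3, 3, 9]
print(rows_until_limit(100))  # 21474836
```

So under the assumed 100 bytes per value, a single column's data buffer reaches the limit at roughly 21 million rows.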

I think the unfortunate problem is that memory management in the C# managed world is always going to hit the 2 gig limit somewhere. (Please correct me if I am wrong on this, but I thought I read somewhere that Memory / Span are limited to int and that changing to long would require major framework rewrites. I may be conflating that with arrays.)

That ultimately means the C# library either has to reject files with certain characteristics (i.e. validation checks on opening), or the spec needs to put upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to eliminate the need for more than 2 gigs of contiguous memory for the smallest Arrow object.
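As a rough illustration of the "validation checks on opening" option, combined with moving the limit down to the buffers, the sketch below checks each buffer length of a batch against the 32-bit limit and lets the batch total exceed it. It is written in Python only to keep it short; `validate_batch` and `BufferTooLargeError` are hypothetical names, not part of any Arrow library, and the buffer lengths are assumed to be available up front (for example from the file metadata).

```python
INT32_MAX = 2**31 - 1  # Span<T>/Memory<T> lengths in .NET are 32-bit ints


class BufferTooLargeError(ValueError):
    """Raised when a buffer cannot fit in one int-indexed memory block."""


def validate_batch(buffer_lengths, limit=INT32_MAX):
    """Check every buffer length (in bytes) of one batch against the limit.

    Returns the total batch size, which is allowed to exceed the limit
    as long as no single buffer does.
    """
    for index, length in enumerate(buffer_lengths):
        if length > limit:
            raise BufferTooLargeError(
                f"buffer {index} is {length} bytes; limit is {limit}"
            )
    return sum(buffer_lengths)


# Three 1 GiB buffers: the batch is 3 GiB, yet each buffer is addressable.
total = validate_batch([2**30, 2**30, 2**30])
print(total > INT32_MAX)  # True

# A single 3 GiB buffer is rejected up front.
try:
    validate_batch([3 * 2**30])
except BufferTooLargeError as exc:
    print("rejected:", exc)
```

With a per-buffer check like this, the failure shows up when the file is opened rather than as an overflow somewhere deeper in the reader.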

However, if the spec was indeed designed for the smallest buffer object to be larger than 2 gigs, or for the entire memory buffer of Arrow to be contiguous, one has to wonder whether at some point it might just make sense for the C# library to use the C++ library as its memory manager, since replicating the handling of very large blocks of memory is more work than it's worth.

In any case, this issue is more about 'deferring' the 2 gig size problem by moving it down to the buffer objects. This might require some rewrite of the batch data structures.

Reporter: Anthony Abate / @abbotware

Note: This issue was originally created as ARROW-7511. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Anthony Abate / @abbotware:
Now I remember why I thought Memory and Span can't support more than 2 gigs:

The .Slice() function only takes int32 arguments.

https://docs.microsoft.com/en-us/dotnet/api/system.memory-1.slice?view=netcore-3.1#System_Memory_1_Slice_System_Int32_System_Int32_

 
