feat: modify how multipart bytes header is built (no space) on http source #1018
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
After doing some benchmarks for the http and fsspec sources, I noticed a significant difference between my "requests-only" implementation and using uproot http source. Mine was correctly doing the multipart request while uproot wasn't. After some debugging I found the reason why: uproot separates the byte ranges with a ", " (comma + space) while I was just using a single comma. This space in the header apparently results in some servers not responding as expected.
At first I though that this failure in the request was due to the significantly large header (just adding the space results in 150 additional characters when retrieving 151 tbaskets as my use-case). But then I realised that the server just didn't understand the request if it had the space.
I am using the following endpoint:
https://ncsmith.web.cern.ch/rootio/Run2012B_DoubleMuParked.root
Not using space:
curl -H "Range: bytes=0-100,200-300" https://ncsmith.web.cern.ch/rootio/Run2012B_DoubleMuParked.root --output /tmp/output
. Works fine.Using space:
curl -H "Range: bytes=0-100, 200-300" https://ncsmith.web.cern.ch/rootio/Run2012B_DoubleMuParked.root --output /tmp/output
. Doesn't work, server never replies.In this case uproot doesn't hang and will fallback to not use multipart (after the request times out). This may have caused uproot to work slower as expected since we were not aware that it was falling back in some cases. I am using a cern server here (@nsmith- knows the details of the server better than I do).
I could not clear information about how the actual "Range" header should be formatted. In https://svn.apache.org/repos/asf/labs/webarch/tags/draft-fielding-http/draft00/p5-range.html it appears without the space, but in https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests it appears with the space...
This PR removes the space from the header, which may result in some servers not understanding the request, but removing the space is necessary for other servers to understand it. Removing the space has some advantages: the request will be smaller (which may be the differece between accepting it or not for some servers that have a header length limit).
We could also handle both cases: right now a multipart range request is sent, if it fails we assume server doesn't support multipart. We could send two requests instead (one with space, one without) and if both fails, assume multipart is not supported).
@jpivarski please let me know what do you prefer:
1 - Request with space in range, then without space in range, then fallback
2 - Request without space in range, then with space in range, then fallback
3 - Request without space in range, then fallback
4 - Request with space in range, then fallback (current)
If we want to support servers that can only parse the comma-only separator and servers that can only parse the comma and space separator then we should do (2 -). This would however slightly increase the time to trigger the fallback when multipart is not supported.
If we assume there are no servers that do not allow the comma-only separator (or we just don't care to support them), then we should do (3 -). I would be inclined to choose this at the moment.