Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: modify how multipart bytes header is built (no space) on http source #1018

Merged
merged 3 commits into from
Nov 13, 2023

Conversation

lobis
Copy link
Collaborator

@lobis lobis commented Nov 13, 2023

After doing some benchmarks for the http and fsspec sources, I noticed a significant difference between my "requests-only" implementation and using uproot http source. Mine was correctly doing the multipart request while uproot wasn't. After some debugging I found the reason why: uproot separates the byte ranges with a ", " (comma + space) while I was just using a single comma. This space in the header apparently results in some servers not responding as expected.

At first I though that this failure in the request was due to the significantly large header (just adding the space results in 150 additional characters when retrieving 151 tbaskets as my use-case). But then I realised that the server just didn't understand the request if it had the space.

I am using the following endpoint: https://ncsmith.web.cern.ch/rootio/Run2012B_DoubleMuParked.root

Not using space: curl -H "Range: bytes=0-100,200-300" https://ncsmith.web.cern.ch/rootio/Run2012B_DoubleMuParked.root --output /tmp/output. Works fine.

Using space: curl -H "Range: bytes=0-100, 200-300" https://ncsmith.web.cern.ch/rootio/Run2012B_DoubleMuParked.root --output /tmp/output. Doesn't work, server never replies.

In this case uproot doesn't hang and will fallback to not use multipart (after the request times out). This may have caused uproot to work slower as expected since we were not aware that it was falling back in some cases. I am using a cern server here (@nsmith- knows the details of the server better than I do).

I could not clear information about how the actual "Range" header should be formatted. In https://svn.apache.org/repos/asf/labs/webarch/tags/draft-fielding-http/draft00/p5-range.html it appears without the space, but in https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests it appears with the space...

This PR removes the space from the header, which may result in some servers not understanding the request, but removing the space is necessary for other servers to understand it. Removing the space has some advantages: the request will be smaller (which may be the differece between accepting it or not for some servers that have a header length limit).

We could also handle both cases: right now a multipart range request is sent, if it fails we assume server doesn't support multipart. We could send two requests instead (one with space, one without) and if both fails, assume multipart is not supported).

@jpivarski please let me know what do you prefer:
1 - Request with space in range, then without space in range, then fallback
2 - Request without space in range, then with space in range, then fallback
3 - Request without space in range, then fallback
4 - Request with space in range, then fallback (current)

If we want to support servers that can only parse the comma-only separator and servers that can only parse the comma and space separator then we should do (2 -). This would however slightly increase the time to trigger the fallback when multipart is not supported.

If we assume there are no servers that do not allow the comma-only separator (or we just don't care to support them), then we should do (3 -). I would be inclined to choose this at the moment.

@lobis lobis requested review from jpivarski and nsmith- November 13, 2023 00:29
@lobis lobis marked this pull request as ready for review November 13, 2023 00:31
@lobis lobis mentioned this pull request Nov 13, 2023
@agoose77
Copy link
Collaborator

agoose77 commented Nov 13, 2023

@lobis this sniped me!

I looked at the RFCs, and found RFC9110 which defines the semantics of HTTP. In that document, it appears that the grammar for a list is given as:

1#element => element *( OWS "," OWS element )

thus the spacing is arbitrary. It seems that the recipients are making too strong assumptions about the grammar if they can't parse this.

On a second read, it's not 100% clear to me whether this refers to the protocol or to the ABNF grammar used to define it, so let me check back once I'm back from running an errand.

@agoose77
Copy link
Collaborator

Yes, I took another look and Range header satisifies ranges-specifier

range-resp = incl-range "/" ( complete-length / "*" )
range-set = range-spec *( OWS "," OWS range-spec )
range-spec = int-range / suffix-range / other-range
range-unit = token
ranges-specifier = range-unit "=" range-set

@lobis
Copy link
Collaborator Author

lobis commented Nov 13, 2023

In conclusion:

whitespaces (on both sides of the comma) are supported and optional. The server should be able to process the requests with spaces but it isn't (server's fault). We can safely drop the spaces since they are optional and we would produce a smaller request in doing so, which be a good thing even if this issue didn't exist.

@jpivarski
Copy link
Member

That same RFC defines OWS as optional whitespace, so the specification is explicitly allowing spaces before and after the commas in a ranges-specifier. HTTP servers that enable multi-part GET only when there are no spaces are just wrong—it's a bug.

However, those servers are out there and it costs us nothing to drop the space. (I don't know how we would report this, either.)

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it! Drop the space!

@lobis lobis merged commit 983b986 into main Nov 13, 2023
@lobis lobis deleted the http-multipart branch November 13, 2023 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants