Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Split HTTP::Client::Response in two #9901

Open
asterite opened this issue Nov 11, 2020 · 39 comments
Open

[RFC] Split HTTP::Client::Response in two #9901

asterite opened this issue Nov 11, 2020 · 39 comments

Comments

@asterite
Copy link
Member

Context

I think it's a shame that having a compiled and type-safe language we are using the same type for non-streaming (consumed) and streaming responses:

require "http/client"

response = HTTP::Client.get("...")
response.body # okay
response.body_io # not okay

HTTP::Client.get("...") do |response|
  response.body # not okay
  response.body_io # okay
end

The "okay" and "not okay" here mean that when it's "not okay" we get a runtime error. This is just a minor thing because you'll probably notice it right away and change it to use the other. However, it can be quite confusing to understand why you get that error. And also, we could catch this at compile-time.

Proposal

I propose we have this hierarchy:

  • abstract HTTP::Client::Response
    • HTTP::Client::Response::Consumed: returned in the non-block version and has a String body.
    • HTTP::Client::Response::Streaming: returned in the block version and has an IO body

Some decisions to make

  1. Should we do this? I personally think so.
  2. Do we use the same name for the body property in both cases? That means you can always use body (nice) but the type changes depending on whether it's in a block or not. The downsides are that it can be a bit confusing, and it's also a breaking change. Alternatively we can keep it as body and body_io so it's not a breaking change.
@jhass
Copy link
Member

jhass commented Nov 11, 2020

I was having the same thought process for it actually, so I'm all for it. body and body_io always felt weird and out of place. So also 👍 to calling it body on both.

If wanted Streaming could maintain body_io as deprecated alias to body for a while. Consumed will not need body_io as that was never working for its usecases in the first place and body keeps working as expected.

The biggest breakage here will be people typing stuff as Response which in this approach will loose its body and body_io properties, so people will have to update their type declarations to continue using it.

The alternative to subclassing is making body be String|IO and have people type check that... which still is a runtime assertion and barely better than body and body_io.

Third option I see would be to make body always be IO, a prefetched IO::Memory instance for the non-streaming case, to not leak FDs

@asterite
Copy link
Member Author

Another idea: let both have body and body_io:

  • for non-streaming we have body and we could expose body_io as an IO::Memory over body.
  • for streaming we could make a call to body consume the body_io into a String, then replace body_io with IO::Memory over body. The problem with this is if someone consumes a bit of the IO before calling body.

But maybe the downsides are not that bad. And we could keep a single Response type.

But I still feel having two types is better.

@jhass
Copy link
Member

jhass commented Nov 11, 2020

Yeah, that seems like it's equally brittle to use as the current interface. I would prefer forcing users to always call .body.gets_to_end if they want a string explicitly (building on the always IO option outlined above).

@straight-shoota
Copy link
Member

Yeah, this part of the API is really shady and I'm all in for improving that. I'm not to sure about having two separate types, though. Feels a bit overhead and IMO might not be that user-friendly.

There's another solution I've already mentioned in #8461 (comment): Yield the IO as a second argument to the block in the streaming scenario.

I'd like to inspect some real examples to see how this is used in the field. That should help us discuss an optimal improvement for this API.

@jhass
Copy link
Member

jhass commented Nov 11, 2020

Hm, yielding it as second argument still seems to have potential runtime errors or weird behavior when mixing using that IO with calling Response#body at some point.

@straight-shoota
Copy link
Member

According to skimming the code available on Github, by far the most common scenario is a quick and dirty call which doesn't really need a response object and just retrieves the body as a string:

HTTP::Client.get("http://example.com").body

(more elaborate versions also include a .success? check or - rarely - some crude redirect logic)

Maybe there should be a convenience method for this which also handles some HTTP protocol context (like status code).
If we had this, the majority of use cases (the simple ones) would be dealt with. Then there would actually be little reason to have a non-streaming variant of Reponse... All more complex use case which still only want the body as string would just have to call gets_to_end.

I think that might really be a great solution. Essentially, it just means removing Response#body and providing an easy-to-use alternative for the most common use case.

@j8r
Copy link
Contributor

j8r commented Nov 11, 2020

Why there is one that consumes the body to a string, and not the other way around? It's not obvious at a first glance, so I think having a single behavior for both will be less confusing.

We could have:

getter body_io : IO
# Consummes the body IO
def body!
  @body_io.gets_to_end
end

@jhass
Copy link
Member

jhass commented Nov 11, 2020

Why there is one that consumes the body to a string, and not the other way around? It's not obvious at a first glance, so I think having a single behavior for both will be less confusing.

We could have:

getter body_io : IO
# Consummes the body IO
def body!
  @body_io.gets_to_end
end

That's #9901 (comment) with different names, so 👎

@bcardiff
Copy link
Member

If we want to do something for 1.0 I am inclined to:

  • keep a single response type
  • allow the non-streaming to have a body_io with a IO::Memory over body
  • make the streaming response raise on runtime when body is called

That way:

  • body_io is always available if needed as lower level fetch of the body, but potentially only once.
  • There is no ambiguity if some data from the body_io is consumed in the streamed response

If we want to tweak further the design I think is after 1.0.

@j8r
Copy link
Contributor

j8r commented Nov 11, 2020

The one-liner argument is not a good enough one, because we can do:

p HTTP::Client.get("http://example.com").body
p HTTP::Client.get "http://example.com", &.body_io.gets_to_end
# Perhaps even p HTTP::Client.get "http://example.com", &.body!

@straight-shoota
Copy link
Member

@j8r I don't understand your last comment. What point are you trying to make? Is it hat the yielding variant is only marginally longer and the "simple" one isn't necessary?

@j8r
Copy link
Contributor

j8r commented Nov 11, 2020

Yes, I mean that all could be streaming, and as user-friendly as non-streaming. Both kind of users - those needing more low level/performance and those just wanting something easy - will be happy then.

@straight-shoota
Copy link
Member

@bcardiff Sounds good to me. That would allow a later discussion about bigger changes, possibly in conjuction with #6011.

@jhass
Copy link
Member

jhass commented Nov 11, 2020

If we want to do something for 1.0 I am inclined to:

I'm not sure whether that really improves the situation, it just seems to shift the inconsistency away from the particular case that triggered this discussion, so I think I would then rather prefer to leave things as is. (Personally I think 1.0 is the perfect opportunity for a breaking change like this and starting with a clean API into the stable phase).

To me this is about getting rid of having both body and body_io, which is where all the confusion, weirdness and inconsistency stems from. I don't see benefit in changes that don't achieve that goal.

@asterite
Copy link
Member Author

@bcardiff

make the streaming response raise on runtime when body is called

With this RFC I'm trying to avoid runtime errors and move them to compile-time errors.

@straight-shoota

If the most common case is fetching a response without a block and getting the body as a string then it shouldn't matter if we introduce two response types, because most don't have to deal with a specific type name. The body property will continue working for them. The advantage is that when you use the streaming version you will get a compile-time error if you type body instead of body_io (same for the other way around).

@straight-shoota
Copy link
Member

@asterite

If the most common case is fetching a response without a block and getting the body as a string then it shouldn't matter if we introduce two response types

But what's the point in having two separate types when one of them is rarely used and can simply be emulated using the other one?

@asterite
Copy link
Member Author

But what's the point in having two separate types when one of them is rarely used and can simply be emulated using the other one?

Type safety.

@straight-shoota
Copy link
Member

straight-shoota commented Nov 11, 2020

I know that's the goal for this RFC. My suggestion to remove Response#body would be completely type safe. A second type for type safety is only required if there's a real need for a non-streaming response type. And that's what I'm questioning.

@asterite
Copy link
Member Author

A second type for type safety is only required if there's a real need for a non-streaming response type

I think in most cases you want a non-streaming response because responses are small. However, in a few cases you'd like a streaming response, for example when downloading a large file, or when the response is meant to be streaming (like sending lines of JSON that you parse individually, and the connection is kept open).

The way we design it, for the streaming way you have to do it in a block so you can't forget to close the IO. In Go you always get an "IO" and you have to remember to close it, and we didn't like that.

So, we have two use cases and the way you use them in the same except that in one case you deal with a String, in the other you deal with an IO. We could extract that outside of the Response object, but then it would be a bit weird:

response, body = HTTP::Client.get("...")

When you talk about an HTTP response body, the body really belongs to the response, it's not a separate entity. And it's just nice to be able to pass the response to other functions.

What I'm wondering is: if we split it into two types, in which way that produces friction?

@jhass
Copy link
Member

jhass commented Nov 11, 2020

I feel like the most simple breaking change we could do right now is removing body_io, making body return IO always and let that be IO::Memory for the non streaming case (avoiding Go's pitfall). This increases API usage verbosity for the moment, but would also provide a simple baseline for adding convenience API in a non-breaking manner, say for example HTTP.get_body(...) : String.

@straight-shoota
Copy link
Member

@asterite

I totally agree to your assessment of the typical use cases. I already mentioned that in #9901 (comment)
My point is, that when you want the response as a string, chances are high you don't really care about the response object. The vast majority of use cases are literally just HTTP::Client.get("http://example.com").body.
So why not make it easier for these use cases and just return the body (as String) and not the Response with cached body?
For example, we could end up with two API methods like this:

HTTP::Client.get("http://example.com") # returns String
HTTP::Client.get("http://example.com") do |response|
  # here we consume response body io
end

If you just want a simple response body, you use the non-yielding method and get a string result. The method would also provide basic status code handling (for example raise or return nil unless successful). That's more included features for robustness and simpler to use than the current API.

If you have more advanced use case, like responding to specific status codes, handling response headers etc. you use the yielding method and get a response object (which only provides a streaming IO). If you need the body as string, you have to consume the IO.

@asterite
Copy link
Member Author

asterite commented Nov 11, 2020

The vast majority of use cases are literally just HTTP::Client.get("http://example.com").body

Then the vast majority of use cases are wrong. Consuming the body without checking that the status is okay It seems like a latent bug to me.

So why not make it easier for these use cases and just return the body (as String) and not the Response with cached body?

Because a body without an HTTP status code can't be trusted.

@straight-shoota
Copy link
Member

Totally agree and that's why I mentioned that the method should check the status code and fail if it's not a success.

@asterite
Copy link
Member Author

Ouch, I should finish reading before replying 😊

@asterite
Copy link
Member Author

Guess what someone just asked in gitter? https://gitter.im/crystal-lang/crystal?at=5fb01edab4283c208a656378

I can't believe there's so much resistance to make this trivial change.

@asterite
Copy link
Member Author

And I say it's trivial because I already have this implemented in my machine, I'm just waiting for an "okay" so I can send a PR.

@straight-shoota
Copy link
Member

I don't see any resistance to improving this. We just need to agree on the solution.

IMO having to separate types is completely unnecessary. We really need only a single type with IO. I think the best improvement we can do right now is to add IO::Memory for non-streaming instance. So that's pretty much what @bcardiff and @jhass have been proposing as well. There's only different opinions on whether the IO should be named body_io or body (and subsequently what to do with the string-returning method).

Just adding a fallback for body_io is the easiest change. It doesn't break anything. It doesn't improve much either, but it would fix the issue discussed on gitter.
Renaming body_io to body OTOH is a breaking change. It's easy to fix though (just insert to_s). And when it's used in a place where either IO or string do, there's no need to change any code either. So maybe it's fine. However, it's not necessarily better than renaming to body. IMO it doesn't matter whether the property ends up being named body or body_io. Staying with the latter won't break code that is already using the streaming interface. And it would allow to properly deprecate body before removing/breaking its return type.

@asterite
Copy link
Member Author

If you take a look at newer languages like Rust, Elm... even Haskell has this... you define all possible states a type can be (algebraic data type) and then there's no way to represent an invalid state. In my proposal, when you fetch a non streaming response, you can only read it as a string. If you are streaming it, you can only treat it as an IO. Of course you can explicitly go from one to the other (build an IO memory or read everything into a string) but as it's been avoided all the time here, implicit behavior is not good.

I sincerely can't see the downside of having two types other than hypothetical situations.

@straight-shoota
Copy link
Member

straight-shoota commented Nov 14, 2020

A very practical downside from the API design is that IMO a non-streaming type is simply not necessary at all. So we can eventually have just a single type with streaming IO. No type with buffered string body.
It means we can't have a proper fix right now. That's a benefit of the two-type solution. But the final solution to have only one type sounds much better to me.

@asterite
Copy link
Member Author

How can you make it so that users don't forget to close the IO? If you have a solution for that, I can concede.

@straight-shoota
Copy link
Member

I'd suggest to remove the overload that returns a Response and only keep the one that yields the instance.

That would also free up the non-yielding overload for a convenience method that directly returns String as suggested in #9901 (comment). Note that it doesn't have to be that way. The convenience method needs extra discussion. But it would be an option.

@asterite
Copy link
Member Author

I think we are going in circles here so I'll just stop. Sorry :-)

@straight-shoota
Copy link
Member

You're probably right.

I'll still add this to my last response because I might have been thinking to far ahead: We can obviously also keep the non-yielding overload and have it return a Response instance. That one would just not provide access to the network IO, but the buffered body via IO::Memory. So the data situation and the responsibility to close a response are exactly the same as now. What changes is just that you can't access the body as a string directly. Instead you always have to consume an IO, either memory or network depending on how the response is obtained.

@jhass
Copy link
Member

jhass commented Nov 15, 2020

Which is what I suggested in #9901 (comment) and #9901 (comment) already :P

@straight-shoota
Copy link
Member

straight-shoota commented Nov 15, 2020

Yes and I in #9901 (comment) and @bcardiff in #9901 (comment). It essentially just differs on whether to stay with body_io or move to body for the streaming IO.

@asterite
Copy link
Member Author

Oh, I really like that. @jhass I think I missed that third option, sorry.

So, to see if I understand things correctly, the idea is:

  • have a single HTTP::Client::Response type whose body is always an IO
  • In the case of a streaming response, the IO is directly connected to the socket, and it will be closed once the block ends
  • In the case of a non-streaming response, we read the HTTP body, close the response/connection, and set body to an IO::Memory over the string body

The nice thing about this approach is that you always deal with a response in the same way: it's an IO. The not-so-nice thing about this is that for the non-streaming version you have to do response.body.gets_to_end to get a string, which has three "problems":

  • it's a bit more verbose than just body: but I'm fine with
  • IO::Memory#to_s and IO::Memory#gets_to_end will create a duplicate string: however, we can make IO::Memory know that it was initialized from a string, and in that case return the original string. There's still one more memory allocation for the IO::Memory, but it's fine (it's just 80 extra bytes).
  • if the body is always an IO one could pass it down to JSON.parse or X.from_json, or X.from_yaml, and parsing an IO is slower than parsing a String: but you can always read the IO into a string if you want to avoid that performance issue.

So I think I'm 👍 for this approach. It's a breaking change, but it's probably better to do it before 1.0 is released.

@jhass
Copy link
Member

jhass commented Nov 15, 2020

Yes, that's the idea.

  • As mentioned previously it's easy to build helpers on top of it to get rid of the verbosity, both first class in stdlib or second class as or inside a shard.
  • A "read only" mode/flag sounds good, as a generally useful performance optimization too. This way we can use IO::Memory as an IOish view over a string, even a static one (think file embedding).
  • If the parsing slow down due to this turns out to be a real world issue, it seems easy to have JSON and YAML detect an IO::Memory in "read only" mode and access the underlying string instead.

@straight-shoota
Copy link
Member

straight-shoota commented Nov 15, 2020

Great 👍 Sounds like we're pretty much agreed then. It seems body vs. body_io is still to be settled.

body is the better alternative because the _io suffix is simply unnecessary. But the change to a bit more concise name has to match the cost of a breaking change.

Staying with body_io has the benefit of not breaking any existing code using the streaming response. body returning string (for non-streaming response) could be properly deprecated and continue to work for now. So effectively, we wouldn't need to break anything immediately.
IMO it's better to go into 1.0 with a deprecated method than a broken one.

I'm in favor of body_io but going for body is fine with me either.

@jhass
Copy link
Member

jhass commented Nov 15, 2020

I would strongly prefer body, making body_io a deprecated alias to it is fine for me, but not necessary. Turning Response#body from returning a String to returning an IO is a breaking change already anyhow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants