Mint.TransportError: :closed #62
Definitely. On checkout, if the connection is closed, we need to automatically reconnect without pushing errors upstream. My understanding is that transferring between processes will fail if the connection is closed. Is this correct @whatyouhide? |
Actually, it seems we already reconnect if the socket is closed. Can you provide an example that reproduces the issue? I am thinking it is one of the issues where it only says closed when we write to it. |
I am actually seeing this error in my application too, but only around 50 times per second while making > 100,000 requests/s, so it hasn't been a very high priority for me. I definitely agree that we should be able to prevent it, though. I'll take another stab at making a reproducible example this evening. |
Do you know if it is happening when we send the request? That would be my bet. If we can identify the source, then it should be a (hopefully) easy fix. |
I assume that is where it is happening as well. I can try whipping up a branch, wrapping the errors, and deploying that temporarily to nail down exactly where this comes from. |
I think it’s possible that the connection is being closed after it’s checked back into the pool. Then the next time we try to use it the connection returns an error. That’s my hypothesis |
Looks like this error is always happening (in my application at least) when we are waiting for the response actually, here: https://github.com/keathley/finch/pull/65/files#diff-c3126e24dc4c2613caa9bec21852ec5aR109 |
I would expect Finch to try to checkout a new worker/connection from NimblePool if an error occurred during the handle_checkout callback |
@sneako is it happening in the response before it sends the headers and such? If so... then yeah, Finch is correct in returning the error. |
I can do another test later today to check if we have received any part of the response yet when this happens. |
I deployed this branch: https://github.com/keathley/finch/compare/track-closed-error and can confirm that in my case, the closed error is returned before any part of the response has been received at all (always getting "req_recv_empty"), so it seems that Finch is behaving correctly. |
I will try to carve out some time to deploy that branch with some additional tracing/logging information. In the short term, we had to switch to |
We have confirmed that this is the error we're seeing as well with the linked branch. While it may seem that it's behaving correctly, under identical workloads HTTPoison does not exhibit this behavior, and Finch has a much more forgiving default timeout for receive ( Is it possible that there are some timing issues around the connection management? If I check out a connection and it is due to be reclaimed due to inactivity before I actually send the data, for example, it would be closed by the time I try to send my actual GET request. |
@NeilMenne are your errors happening before sending the request, before receiving the response, or after receiving the response? |
I am working on putting it back in place with more thorough telemetry hooks to tell the difference. When I have a sample, I will have an answer! |
Here is the order of events as telemetry dumped them, so it looks like it's while waiting for the response, but well within the response window as the whole thing happens in about 3 ms. I did try to find relevant logs on the other service as well to correlate, but it appears to have never gotten there. All of this on a re-used connection too... |
Beautiful. Did it send any response at all? Like headers and statuses? |
There was nothing that I could find. If you have any additional information that I can track, I'm maintaining a branch for this now. |
Thanks for looking into this with us @NeilMenne ! I have put together this branch here: https://github.com/keathley/finch/compare/track-closed-error Using this branch, Finch will return a 3-element tuple on error, where the second element is an atom or binary telling us whether any part of the response was returned before the :closed error. |
It's the req_recv_empty error |
Ah right ok, so that means that none of the response was received at all, just like what I saw in my application. 🤔 |
Indeed. It only happens on reused connections so far |
So I guess that means that both setting the mode on the connection and transferring the owner of the connection can succeed even if the socket has been closed... |
Yeah, that sounds feasible. |
It may be that the connection was closed for writes on the other side, which is why it happens. We could automatically retry, but we may end up sending the request twice in some cases. We would need to investigate what the best way forward is. |
Yeah, I don't think you can reasonably do that; however, if we had some insight into the TCP protocol for this, we could learn more. Things like keep-alive (which is ostensibly being stripped before I can log it, or is not being requested by default) would be contributing factors. At the TCP level, we should get a FIN packet or similar graceful disconnect (depending on the varying levels of abstraction we're interacting with) under normal circumstances. As the remote service is not dying or thrashing when this is happening, it seems like that kind of information could be ignored or delayed because the connection is checked out in the client's code. |
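As an illustration of the point above about a remote close only being noticed on a later socket operation, here is a minimal sketch using raw sockets (this is not Finch or Mint code): after the peer sends its FIN, the first write usually still succeeds and the error only surfaces on a subsequent read or write.

# Locally connected socket pair where the "server" side closes first.
{:ok, listen} = :gen_tcp.listen(0, [:binary, active: false, reuseaddr: true])
{:ok, port} = :inet.port(listen)
{:ok, client} = :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false])
{:ok, server} = :gen_tcp.accept(listen)

# The server sends a FIN, as a remote HTTP server enforcing keep-alive limits would.
:ok = :gen_tcp.close(server)
Process.sleep(50)

# The first send after the FIN typically still returns :ok; only a subsequent
# read (or a second write) reports an error such as :closed or :econnreset.
IO.inspect(:gen_tcp.send(client, "GET / HTTP/1.1\r\nhost: localhost\r\n\r\n"), label: "send")
IO.inspect(:gen_tcp.recv(client, 0, 1_000), label: "recv")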
Oh, connection: close! Can you please try this branch? https://github.com/keathley/finch/pull/66 |
Yeah, I'll get it deployed tomorrow and I'll post results if @sneako doesn't get there first |
Ah that is interesting. We do have a test case for 'connection: close' here: https://github.com/keathley/finch/blob/master/test/finch_test.exs#L202 We should probably increase the iterations there as well, because it only does 10 iterations with a pool size of 10, so we don't see what happens when we try to reuse a closed connection, but even when increasing it to 100, it always passes locally. I wonder if we actually need to look for the header here: https://github.com/keathley/finch/pull/66/files#diff-c3126e24dc4c2613caa9bec21852ec5aR143 or if |
It does seem that Mint is closing the connection itself when it encounters this header: https://github.com/elixir-mint/mint/blob/9b12d2f3cd53b45864460203f953c202ae6c6a54/lib/mint/http1.ex#L812 I am still trying out the branch though and will post results shortly |
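For context, a rough sketch of the kind of header check being discussed (not the code from the PR), assuming Mint-style response headers, i.e. a list of {name, value} binaries with lowercased names:

# Returns false when the server asked for the connection to be closed.
defp reusable?(headers) do
  case List.keyfind(headers, "connection", 0) do
    {"connection", value} -> String.downcase(value) != "close"
    nil -> true
  end
end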
I've found an issue in how Mint handles connections after streaming. It's possible that this is related to the issues we've been seeing. I can definitely hypothesize ways that this would result in issues on our end, although I'm not convinced that this is the root issue. |
I also may have found an issue in how the http1 pool is handling errors after a receive. I'm going to do some experimentation to confirm. |
Re-opening this until the issues in Mint are fixed. |
To follow up on this, my hypothesis was that if a server finishes a response and closes a connection, Mint could return an error (since it detects that the connection is closed). In that scenario, we would have access to the responses in the error tuple that Mint returns. At this point, I'm going to work on building a mock HTTP/1 web server so I get a little more control over the underlying TCP connections. Hopefully that'll make it a little easier to induce some of these failures. |
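A sketch of the kind of mock HTTP/1 server described above (the module name and details are hypothetical, not part of Finch's test suite): it serves exactly one response per connection and then closes the socket without advertising connection: close, which forces a client that reuses the connection onto the failure path.

defmodule OneShotServer do
  def start(port) do
    {:ok, listen} =
      :gen_tcp.listen(port, [:binary, packet: :raw, active: false, reuseaddr: true])

    spawn_link(fn -> accept_loop(listen) end)
  end

  defp accept_loop(listen) do
    {:ok, socket} = :gen_tcp.accept(listen)
    {:ok, _request} = :gen_tcp.recv(socket, 0)

    # Minimal valid HTTP/1.1 response.
    :ok = :gen_tcp.send(socket, "HTTP/1.1 200 OK\r\ncontent-length: 2\r\n\r\nok")

    # Close immediately after responding, without sending `connection: close`.
    :gen_tcp.close(socket)
    accept_loop(listen)
  end
end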
Closing the loop on my end, elixir-mint/mint#280 seems to have done the trick for me. Thanks all! 🙏 |
Alright, going to re-close this now that the fix has landed in Mint. |
Hoping the following code snippet can help the next person who stumbles upon this issue while googling:

require Logger

# This value is an assumption; pick whatever retry window makes sense for your app.
@default_transport_retry_timeout 1_000

defp do_request(request, pool_name, opts, start_time \\ :erlang.monotonic_time(:millisecond)) do
  case Finch.request(request, pool_name, opts) do
    {:ok, response} ->
      {:ok, response}

    {:error, %Mint.TransportError{} = error} ->
      transport_retry_timeout =
        Keyword.get(opts, :transport_retry_timeout, @default_transport_retry_timeout)

      if :erlang.monotonic_time(:millisecond) < start_time + transport_retry_timeout do
        Logger.warn(fn -> "retrying request, reason: #{inspect(error)}" end)
        do_request(request, pool_name, opts, start_time)
      else
        {:error, error}
      end

    {:error, error} ->
      {:error, error}
  end
end |
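For completeness, a hypothetical call site for the snippet above (the pool name and URL are placeholders, not from the thread):

# Assumes a Finch pool was started with `name: MyFinch` and that the snippet
# above lives in the same module as this call.
request = Finch.build(:get, "https://example.com/health")
do_request(request, MyFinch, receive_timeout: 5_000)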
I'm having exactly this problem with the newest versions of Finch (0.13.0) and Mint (1.4.2). We're using Finch in the default configuration to call all sorts of APIs. A lot of calls fail with |
Just wanted to add that we are too. Interestingly, it seems to be a cold start issue as the second request (we implement job retries) seems to always succeed. |
For example, these are the logs from an Oban job that calls an external service: |
Thanks for the additional report. If it is happening often, would you be able to grab a packet capture of it happening? That would be much more insightful than these application logs. We are regularly sending over 1.5 million requests/s with Finch and only see a very small number of these errors here and there. |
The issue seems to be correlated to a setting on the called server. It happened for a specific API most of the time, which is why we at first thought it was an issue on the API side. Since we switched to hackney, we're not having any issues, without any changes on the API side. |
I believe Hackney does automatic retries on connection closed errors. The problem is when to retry: that could be dangerous on POST requests for example. However, we could try to see if the error happened before we sent the body (or the empty body). In such cases, it should be safe to retry. |
Hackney is reconnecting closed connections as far as I understand it. It is not retrying requests. |
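To make the "only retry when it is safe" idea above concrete, here is a rough sketch (not Finch's API; it assumes Finch.Request stores the method as an upcased string and the body as nil or a binary):

# Only consider retrying transport errors for requests that are safe to repeat:
# idempotent methods with no request body.
defp retry_on_closed?(%Finch.Request{method: method, body: body}) do
  method in ["GET", "HEAD", "OPTIONS"] and body in [nil, ""]
end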
Just chiming in that this is happening a lot for us as well on one particular API. The API does not have any problems like this from other clients, but pretty much always kicks back |
I wonder if you would be able to reproduce this issue by hitting this API using Mint directly. That could give us a clue where a possible culprit might be. |
Good suggestion. I don't know if I can. I thought I did, but I can't replicate. I got this, but it honestly may have been user error as I was interactively getting to grips with querying directly with Mint in Livebook. But it's reliably returning with no problems now.

** (MatchError) no match of right hand side value: {:error, %Mint.HTTP1{host: redacted, port: 9200, request: nil, streaming_request: nil, socket: #Port<0.21>, transport: Mint.Core.Transport.TCP, mode: :active, scheme_as_string: "http", requests: {[], []}, state: :closed, buffer: "", proxy_headers: [], private: %{}, log: false}, %Mint.TransportError{reason: :closed}}
    (stdlib 4.0.1) erl_eval.erl:496: :erl_eval.expr/6
    #cell:p7jckfdyq4mpx32ntkhlnbc57sm5zb5u:1: (file)

This is the Livebook: https://gist.github.com/cigrainger/a8c876e0f11db16d2b5de15c65fdf351 |
Okay, I can get that reliably if I wait a while and try another request. Should a

** (MatchError) no match of right hand side value: {:error, %Mint.HTTP1{host: redacted, port: 9200, request: nil, streaming_request: nil, socket: #Port<0.12>, transport: Mint.Core.Transport.TCP, mode: :active, scheme_as_string: "http", requests: {[], []}, state: :closed, buffer: "", proxy_headers: [], private: %{}, log: false}, %Mint.TransportError{reason: :closed}}
    (stdlib 4.0.1) erl_eval.erl:496: :erl_eval.expr/6
    /Users/chris/Library/Application Support/livebook/autosaved/2023_06_15/14_57_38_mint_closed_gg2j.livemd#cell:ol4eldoq352ibnvzyszealwcmtjvs27g:1: (file)

Edit: Maybe this is a problem with handling TCP keepalive? |
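One way to experiment with the keepalive hypothesis above, as a sketch rather than a confirmed fix: Finch's documented conn_opts pool option is passed to Mint.HTTP.connect/4, and Mint forwards transport_opts to the underlying socket, so OS-level TCP keepalive can be enabled per pool. MyFinch and the pool settings here are placeholders.

# In your supervision tree; enabling TCP keepalive on the pool's sockets.
# Note this does not change how the remote server times out idle connections.
children = [
  {Finch,
   name: MyFinch,
   pools: %{
     :default => [size: 10, conn_opts: [transport_opts: [keepalive: true]]]
   }}
]

Supervisor.start_link(children, strategy: :one_for_one)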
In case you're using Req and finding yourself here (like we were): try upgrading to the latest version! |
If you are on AWS this comment may help you: #272 (comment) |
I am using Finch in a service that relies on a remote service for data, but I am doing a fair amount of caching and other work at my service to avoid having to make this remote call too often for performance reasons. As a result, we are periodically seeing this socket close error. A related issue on Mint points to different defaults being necessary for :conn_opts as part of the initial pool configuration. I don't know if it's actually the fix, but is there any possibility that this socket being closed can be detected behind the scenes in Finch? The ideal would be that if a connection gets closed due to inactivity, it wouldn't be surfaced when making a later request.

Version Information:
Finch: 0.2.0
Mint: 1.1.0
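For anyone wanting to poke at this directly, a tiny illustration (not Finch's internals) of the kind of liveness check a pool can make before reusing a Mint connection. Note that Mint.HTTP.open?/1 only reflects what the client process has already seen, so a server-side close that has not been read yet can still slip through, which is exactly the race discussed in this thread. The function name and arguments are illustrative.

# `conn` is a checked-out Mint connection struct.
defp reuse_or_reconnect(conn, scheme, host, port) do
  if Mint.HTTP.open?(conn) do
    # Reuse the existing connection.
    {:ok, conn}
  else
    # Reconnect instead of surfacing :closed on the next request.
    Mint.HTTP.connect(scheme, host, port)
  end
end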