Mint.TransportError: :closed #62

Closed
NeilMenne opened this issue Jun 18, 2020 · 64 comments

@NeilMenne

I am using Finch in a service that relies on a remote service for data, but I am doing a fair amount of caching and other work at my service to avoid having to make this remote call too often for performance reasons.

As a result, we are periodically seeing this socket close error. A related issue on Mint points to different defaults being necessary for :conn_opts as part of the initial pool configuration. I don't know if it's actually the fix, but any possibility that this socket being closed can be detected behind the scenes in Finch? The ideal would be that if a connection gets closed due to inactivity that it wouldn't be surfaced when making a later request.

Version Information:
Finch: 0.2.0
Mint: 1.1.0
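
For reference, adjusting :conn_opts through the pool configuration, as the linked Mint issue suggests, would look roughly like the sketch below. The pool URL and timeout value are illustrative only, not a confirmed fix:

Finch.start_link(
  name: MyFinch,
  pools: %{
    "https://redacted.example.com" => [
      size: 10,
      # Passed through to Mint when the connection is opened; the 30 s
      # connect timeout here is just an example value.
      conn_opts: [transport_opts: [timeout: 30_000]]
    ]
  }
)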

@josevalim
Contributor

Definitely. On checkout, if the connection is closed, we need to automatically reconnect without pushing errors upstream. My understanding is that transferring between processes will fail if the connection is closed. Is this correct @whatyouhide?

@josevalim
Contributor

Actually, it seems we already reconnect if the socket is closed. Can you provide an example that reproduces the issue? I am thinking it is one of those cases where it only says closed when we write to it.

@sneako
Owner

sneako commented Jun 23, 2020

I am actually seeing this error in my application too, but only around 50 times per second while currently making > 100,000 requests/s, so it hasn't been a very high priority for me. I definitely agree that we should be able to prevent it, though. I'll take another stab at making a reproducible example this evening.

@josevalim
Contributor

Do you know if it is happening when we send the request? That would be my bet. If we can identify the source, then it should be a (hopefully) easy fix.

@sneako
Owner

sneako commented Jun 23, 2020

I assume that is where it is happening as well. I can try whipping up a branch that wraps the errors and deploying it temporarily to nail down exactly where this comes from.

@keathley
Collaborator

I think it’s possible that the connection is being closed after it’s checked back into the pool. Then the next time we try to use it, the connection returns an error. That’s my hypothesis.

@sneako
Owner

sneako commented Jun 23, 2020

Looks like this error is always happening (in my application at least) when we are waiting for the response actually, here: https://github.com/keathley/finch/pull/65/files#diff-c3126e24dc4c2613caa9bec21852ec5aR109

@sneako
Owner

sneako commented Jun 23, 2020

I think it’s possible that the connection is being closed after it’s checked back into the pool. Then the next time we try to use it, the connection returns an error. That’s my hypothesis.

I would expect Finch to try to check out a new worker/connection from NimblePool if an error occurred during the handle_checkout callback.

@josevalim
Contributor

@sneako is it happening in the response before it sends the headers and stuff? If so... then yeah, Finch is correct in returning :closed.

@sneako
Owner

sneako commented Jun 24, 2020

I can do another test later today to check if we have received any part of the response yet when this happens.

@sneako
Owner

sneako commented Jun 24, 2020

I deployed this branch: https://github.com/keathley/finch/compare/track-closed-error and can confirm that in my case, the closed error is returned before any part of the response has been received at all (always getting "req_recv_empty"), so it seems that Finch is behaving correctly.

@NeilMenne
Author

I will try to carve out some time to deploy that branch with some additional tracing/logging information.

In the short term, we had to switch to HTTPoison since this appears to be unique to Finch in our deployed environment.

@NeilMenne
Author

We have confirmed with the linked branch that this is the error we're seeing as well. While it may seem that Finch is behaving correctly, under identical workloads HTTPoison does not exhibit this behavior, and Finch has a much more forgiving default receive timeout (15_000 ms, I believe) that we are not approaching.

Is it possible that there are some timing issues around the connection management? If I check out a connection and it is due to be reclaimed for inactivity before I actually send the data, for example, it would be closed by the time I try to send my actual GET request.

@josevalim
Contributor

@NeilMenne are your errors happening before sending the request, before receiving the response, or after receiving the response?

@NeilMenne
Author

I am working on putting it back in place with more thorough telemetry hooks to tell the difference. When I have a sample, I will have an answer!

@NeilMenne
Author

NeilMenne commented Jun 25, 2020

Here is the order of events as telemetry dumped them. It looks like it happens while waiting for the response, but well within the response window, as the whole thing takes about 3 ms. I did try to find relevant logs on the other service as well to correlate, but the request appears to have never gotten there.

All of this on a re-used connection too...

21:56:10.482 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :queue, :start] | %{system_time: 1593122170482380015} | %{host: "redacted", pool: #PID<0.5073.0>, port: 443, scheme: :https}
21:56:10.482 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :queue, :stop] | %{duration: 395655, idle_time: 558151263389} | %{host: "redacted", pool: #PID<0.5073.0>, port: 443, scheme: :https}
21:56:10.482 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :reused_connection] | %{} | %{host: "redacted", port: 443, scheme: :https}
21:56:10.484 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :request, :start] | %{system_time: 1593122170482943249} | %{host: "redacted", path: "/redactedpath", port: 443, scheme: :https}
21:56:10.485 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :request, :stop] | %{duration: 1857550} | %{host: "redacted", path: "/redactedpath", port: 443, scheme: :https}
21:56:10.485 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :response, :start] | %{system_time: 1593122170485031014} | %{host: "redacted", path: "/redactedpath", port: 443, scheme: :https}
21:56:10.485 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :response, :stop] | %{duration: 714219} | %{error: %Mint.TransportError{reason: :closed}, host: "redacted", path: "/redactedpath", port: 443, scheme: :https}
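
For completeness, the lines above come out of a catch-all telemetry handler along these lines; a minimal sketch, where the handler id and log format are just what was used here:

require Logger

events = [
  [:finch, :queue, :start],
  [:finch, :queue, :stop],
  [:finch, :reused_connection],
  [:finch, :request, :start],
  [:finch, :request, :stop],
  [:finch, :response, :start],
  [:finch, :response, :stop]
]

:telemetry.attach_many("finch-debug-logger", events, fn event, measurements, metadata, _config ->
  # Log every Finch event so the ordering around the :closed error is visible.
  Logger.info("#{inspect(event)} | #{inspect(measurements)} | #{inspect(metadata)}")
end, nil)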

@josevalim
Contributor

Beautiful. Did it send any response at all? Like headers and statuses?

@NeilMenne
Author

Beautiful. Did it send any response at all? Like headers and statuses?

There was nothing that I could find. If you have any additional information that I can track, let me know; I'm maintaining a branch for this now.

@sneako
Owner

sneako commented Jun 25, 2020

Thanks for looking into this with us @NeilMenne! I have put together this branch here: https://github.com/keathley/finch/compare/track-closed-error

Using this branch, Finch will return a 3-element tuple on error, where the second element will be an atom or binary telling us whether any part of the response was returned before the :closed error.

@NeilMenne
Author

It's the req_recv_empty error.

@sneako
Owner

sneako commented Jun 25, 2020

Ah right ok, so that means that none of the response was received at all, just like what I saw in my application. 🤔

@NeilMenne
Author

Indeed. It only happens to reused connections so far.

@sneako
Owner

sneako commented Jun 25, 2020

So I guess that means that both setting the mode on the connection and transferring ownership of the connection can succeed even if the socket has been closed...

@NeilMenne
Author

Yeah, that sounds feasible.

@josevalim
Contributor

It may be that the connection was closed for writing on the other side, which is why this happens. We could automatically retry, but we may end up sending the request twice in some cases. We would need to investigate to determine the best way forward.

@NeilMenne
Author

Yeah, I don't think you can reasonably do that; however, if we had some insight into the TCP layer here, we could tell more. Things like keep-alive (which is ostensibly being stripped before I can log it, or is not being requested by default) would be contributing factors.

At the TCP level, we should get a FIN packet or a similar graceful disconnect (depending on the level of abstraction we're interacting with) under normal circumstances. Since the remote service is not dying or thrashing when this happens, it seems like that kind of information could be getting ignored or delayed because the connection is checked out in the client's code.

@josevalim
Contributor

Oh, connection: close! Can you please try this branch? https://github.com/keathley/finch/pull/66

@NeilMenne
Author

Yeah, I'll get it deployed tomorrow, and I'll post results if @sneako doesn't get there first.

@sneako
Owner

sneako commented Jun 26, 2020

Ah, that is interesting. We do have a test case for 'connection: close' here: https://github.com/keathley/finch/blob/master/test/finch_test.exs#L202 We should probably increase the iterations there as well: it only does 10 iterations with a pool size of 10, so we never see what happens when we try to reuse a closed connection. But even after increasing it to 100, it always passes locally.

I wonder if we actually need to look for the header here: https://github.com/keathley/finch/pull/66/files#diff-c3126e24dc4c2613caa9bec21852ec5aR143 or if Mint.HTTP.open?/1 would be enough, since I think that Mint is already checking for that header. I'll try out both.
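
For reference, the Mint.HTTP.open?/1 approach would amount to something like this rough sketch before a connection goes back into the pool; checkin_or_close/1 is a made-up helper name for illustration, not the actual pool code:

defp checkin_or_close(conn) do
  if Mint.HTTP.open?(conn) do
    # Mint still considers the connection open, so it is safe to reuse.
    {:ok, conn}
  else
    # Mint already marked it closed (e.g. after a "connection: close"
    # response), so drop it rather than handing it back to the pool.
    Mint.HTTP.close(conn)
    :closed
  end
end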

@sneako
Owner

sneako commented Jun 26, 2020

It does seem that Mint is closing the connection itself when it encounters this header: https://github.com/elixir-mint/mint/blob/9b12d2f3cd53b45864460203f953c202ae6c6a54/lib/mint/http1.ex#L812

I am still trying out the branch though and will post results shortly

@keathley
Collaborator

I've found an issue in how Mint handles connections after streaming. It's possible that this is related to the issues we've been seeing. I can definitely hypothesize ways that this would result in issues on our end, although I'm not convinced that it is the root issue.

@keathley
Collaborator

I also may have found an issue in how the http1 pool is handling errors after a receive. I'm going to do some experimentation to confirm.

@keathley
Collaborator

Re-opening this until the issues in Mint are fixed.

@keathley keathley reopened this Sep 28, 2020
@keathley
Collaborator

I also may have found an issue in how the http1 pool is handling errors after a receive. I'm going to do some experimentation to confirm.

To follow up on this, my hypothesis was that if a server finishes a response and then closes the connection, Mint could return an error (since it detects that the connection is closed). In that scenario, we would have access to the responses in the error tuple that Mint returns from recv, and we would need to check whether the response was complete and return it if so. I haven't been able to show that this actually happens, although looking at Mint's code I still believe it can. If this were the cause, we would expect the errors to manifest after consuming most of the response. Based on the notes above, when @NeilMenne and @sneako saw this error they received none of the response body, which seems to point to the Mint issue after performing a recv as the main culprit.

At this point, I'm going to work on building a mock HTTP/1 web server so I get a little more control over the underlying TCP connections. Hopefully that will make it a little easier to induce some of these failures.
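
The mock server could start out as simple as the sketch below: a bare :gen_tcp server that serves one canned HTTP/1.1 response and then closes the socket without announcing it, which is roughly the situation the pool needs to survive. The module name and response body are illustrative:

defmodule MockServer do
  def start(port) do
    {:ok, listen} = :gen_tcp.listen(port, [:binary, active: false, reuseaddr: true])
    accept_loop(listen)
  end

  defp accept_loop(listen) do
    {:ok, socket} = :gen_tcp.accept(listen)
    # Good enough for a small GET: read whatever arrives first.
    {:ok, _request} = :gen_tcp.recv(socket, 0)
    :ok = :gen_tcp.send(socket, "HTTP/1.1 200 OK\r\ncontent-length: 2\r\n\r\nok")
    # No "connection: close" header, so the client assumes keep-alive,
    # but close the socket anyway to simulate the server dropping it.
    :gen_tcp.close(socket)
    accept_loop(listen)
  end
end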

@acco

acco commented Oct 1, 2020

Closing the loop on my end, elixir-mint/mint#280 seems to have done the trick for me. Thanks all! 🙏

@keathley
Collaborator

keathley commented Oct 2, 2020

Alright, going to re-close this now that the fix has landed in Mint.

@derekkraan

Hoping the following code snippet can help the next person who stumbles upon this issue while googling:

  # Assumes this sits in a module that requires Logger and defines
  # @default_transport_retry_timeout (e.g. 5_000 ms). Retries are immediate,
  # with no backoff, and apply to any method, so a non-idempotent request
  # may be sent twice (see the discussion above).
  defp do_request(request, pool_name, opts, start_time \\ :erlang.monotonic_time(:millisecond)) do
    case Finch.request(request, pool_name, opts) do
      {:ok, response} ->
        {:ok, response}

      {:error, %Mint.TransportError{} = error} ->
        transport_retry_timeout =
          Keyword.get(opts, :transport_retry_timeout, @default_transport_retry_timeout)

        if :erlang.monotonic_time(:millisecond) < start_time + transport_retry_timeout do
          Logger.warn(fn -> "retrying request, reason: #{inspect(error)}" end)
          do_request(request, pool_name, opts, start_time)
        else
          {:error, error}
        end

      {:error, error} ->
        {:error, error}
    end
  end

@jadlr

jadlr commented Sep 20, 2022

I'm having exactly this problem with the newest versions of finch (0.13.0) and mint (1.4.2). We're using finch in default configuration to call all sorts of APIs. A lot of calls fail with %Mint.TransportError{reason: :closed}

@AHBruns

AHBruns commented Dec 9, 2022

Just wanted to add that we are seeing this too. Interestingly, it seems to be a cold-start issue, as the second request (we implement job retries) seems to always succeed.

@AHBruns

AHBruns commented Dec 9, 2022

For example, these are the logs from an Oban job that calls an external service:

19:44:24.549 [info] {"args":{"contact_id":632851,"portal_id":7712148},"attempt":1,"event":"job:start","id":923,"max_attempts":2,"meta":{},"queue":"default","source":"oban","system_time":1670615064549808298,"tags":[],"worker":"Grimoire.LeadRanking.Workers.RankContactWorker"}
19:44:24.558 [info] {"args":{"contact_id":632851,"portal_id":7712148},"attempt":1,"duration":2333,"error":"** (MatchError) no match of right hand side value: {:error, %Mint.TransportError{reason: :closed}}","event":"job:exception","id":923,"max_attempts":2,"meta":{},"queue":"default","queue_time":-1104483,"source":"oban","state":"failure","tags":[],"worker":"Grimoire.LeadRanking.Workers.RankContactWorker"}
19:44:39.213 request_id=Fy83kDDzUB2AxBAAACdh [info] GET /status/health
19:44:39.214 request_id=Fy83kDDzUB2AxBAAACdh [info] Sent 200 in 193µs
19:44:39.216 request_id=Fy83kDEcv3J9GUMAADBi [info] GET /status/health
19:44:39.216 request_id=Fy83kDEcv3J9GUMAADBi [info] Sent 200 in 234µs
19:44:39.236 request_id=Fy83kDJF32WPO7kAACdx [info] GET /status/health
19:44:39.236 request_id=Fy83kDJF32WPO7kAACdx [info] Sent 200 in 222µs
19:44:40.217 request_id=Fy83kGzETxlraM8AADBy [info] GET /status/health
19:44:40.217 request_id=Fy83kGzETxlraM8AADBy [info] Sent 200 in 178µs
19:44:40.247 request_id=Fy83kG6PrhQxypEAADCC [info] GET /status/health
19:44:40.247 request_id=Fy83kG6PrhQxypEAADCC [info] Sent 200 in 314µs
19:44:40.266 request_id=Fy83kG-n-wV1D8MAADCS [info] GET /status/health
19:44:40.266 request_id=Fy83kG-n-wV1D8MAADCS [info] Sent 200 in 364µs
19:44:42.479 [info] {"args":{"contact_id":632851,"portal_id":7712148},"attempt":2,"event":"job:start","id":923,"max_attempts":2,"meta":{},"queue":"default","source":"oban","system_time":1670615082479194634,"tags":[],"worker":"Grimoire.LeadRanking.Workers.RankContactWorker"}
19:44:43.231 [info] {"args":{"contact_id":632851,"portal_id":7712148},"attempt":2,"duration":747632,"event":"job:stop","id":923,"max_attempts":2,"meta":{},"queue":"default","queue_time":923030,"source":"oban","state":"success","tags":[],"worker":"Grimoire.LeadRanking.Workers.RankContactWorker"}

@sneako
Owner

sneako commented Dec 11, 2022

Thanks for the additional report. If it is happening often, would you be able to grab a packet capture of it happening? That would be much more insightful than these application logs. We are regularly sending over 1.5 million requests/s with Finch and only see a very small number of these errors here and there.

@jadlr

jadlr commented Dec 11, 2022

The issue seems to be correlated with a setting on the called server. It happened for one specific API most of the time, which is why we at first thought it was an issue on the API side. Since we switched to hackney we haven't had any issues, without any changes on the API side.

@josevalim
Contributor

I believe Hackney does automatic retries on connection-closed errors. The problem is when to retry: that could be dangerous on POST requests, for example. However, we could try to see if the error happened before we sent the body (or the empty body). In such cases, it should be safe to retry.
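
Until something like that exists in Finch itself, a caller can approximate it. The sketch below only retries a bodyless GET once on a closed connection, where resending is harmless; get_with_retry/3 is a hypothetical helper, not part of Finch's API:

def get_with_retry(url, finch_name, opts \\ []) do
  request = Finch.build(:get, url)

  case Finch.request(request, finch_name, opts) do
    {:error, %Mint.TransportError{reason: :closed}} ->
      # The GET carried no body, so sending it a second time is safe.
      Finch.request(request, finch_name, opts)

    other ->
      other
  end
end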

@jadlr

jadlr commented Dec 12, 2022

Hackney is reconnecting closed connections as far as I understand it. It is not retrying requests.

@cigrainger

Just chiming in that this is happening a lot for us as well on one particular API. The API does not have any problems like this from other clients, but pretty much always kicks back %Mint.TransportError{reason: :closed} on the first attempt with Finch. The second attempt is pretty much always fine. We've built retries into the app but it's certainly imperfect. If I can be of any assistance in providing more information that might help figure this out, it happens reliably enough for us that that wouldn't be a problem. Please let me know if I can help.

@wojtekmach
Contributor

The API does not have any problems like this from other clients, but pretty much always kicks back %Mint.TransportError{reason: :closed} on the first attempt with Finch

I wonder if you would be able to reproduce this issue by hitting this API using Mint directly. That could give us a clue where a possible culprit might be.
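
For anyone who wants to try that, a bare-bones Mint session looks like the sketch below; the host, port, path, and 60-second wait are placeholders:

{:ok, conn} = Mint.HTTP.connect(:http, "redacted.example.com", 9200)

# First request: usually succeeds.
{:ok, conn, _ref} = Mint.HTTP.request(conn, "GET", "/", [], nil)

# Drain whatever the server sends back for the first request.
conn =
  receive do
    message ->
      {:ok, conn, _responses} = Mint.HTTP.stream(conn, message)
      conn
  end

# Wait long enough for the server to drop the idle connection...
Process.sleep(60_000)

# ...then reuse the same conn. This second request is where the reported
# {:error, conn, %Mint.TransportError{reason: :closed}} would surface.
{:ok, _conn, _ref} = Mint.HTTP.request(conn, "GET", "/", [], nil)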

@cigrainger

cigrainger commented Jun 15, 2023

Good suggestion. I don't know if I can. I thought I did, but I can't replicate. I got this, but it honestly may have been user error as I was interactively getting to grips with querying directly with Mint in Livebook. But it's reliably returning with no problems now.

** (MatchError) no match of right hand side value: {:error, %Mint.HTTP1{host: redacted, port: 9200, request: nil, streaming_request: nil, socket: #Port<0.21>, transport: Mint.Core.Transport.TCP, mode: :active, scheme_as_string: "http", requests: {[], []}, state: :closed, buffer: "", proxy_headers: [], private: %{}, log: false}, %Mint.TransportError{reason: :closed}}
    (stdlib 4.0.1) erl_eval.erl:496: :erl_eval.expr/6
    #cell:p7jckfdyq4mpx32ntkhlnbc57sm5zb5u:1: (file)

This is the Livebook: https://gist.github.com/cigrainger/a8c876e0f11db16d2b5de15c65fdf351

@cigrainger

cigrainger commented Jun 15, 2023

Okay, I can get that reliably if I wait a while and try another request. Should a conn 'expire'?

** (MatchError) no match of right hand side value: {:error, %Mint.HTTP1{host: redacted, port: 9200, request: nil, streaming_request: nil, socket: #Port<0.12>, transport: Mint.Core.Transport.TCP, mode: :active, scheme_as_string: "http", requests: {[], []}, state: :closed, buffer: "", proxy_headers: [], private: %{}, log: false}, %Mint.TransportError{reason: :closed}}
    (stdlib 4.0.1) erl_eval.erl:496: :erl_eval.expr/6
    /Users/chris/Library/Application Support/livebook/autosaved/2023_06_15/14_57_38_mint_closed_gg2j.livemd#cell:ol4eldoq352ibnvzyszealwcmtjvs27g:1: (file)

Edit: Maybe this is a problem with handling TCP keepalive?
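
If keepalive were the culprit, one cheap experiment would be enabling OS-level TCP keepalive on the pool's sockets via the transport options. This is purely a guess rather than a known fix, and the pool shape is illustrative:

Finch.start_link(
  name: MyFinch,
  pools: %{
    default: [
      # Forwarded to the underlying :gen_tcp/:ssl socket.
      conn_opts: [transport_opts: [keepalive: true]]
    ]
  }
)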

@acco

acco commented Jan 16, 2024

In case you're using Req and finding yourself here (like we were): try upgrading to the latest version!

@MrYawe

MrYawe commented Jun 4, 2024

If you are on AWS, this comment may help you: #272 (comment)
