Mint.TransportError: :closed #62

Closed
NeilMenne opened this issue Jun 18, 2020 · 64 comments

@NeilMenne

I am using Finch in a service that relies on a remote service for data, but I am doing a fair amount of caching and other work at my service to avoid having to make this remote call too often for performance reasons.

As a result, we are periodically seeing this socket close error. A related issue on Mint points to different defaults being necessary for :conn_opts as part of the initial pool configuration. I don't know if it's actually the fix, but any possibility that this socket being closed can be detected behind the scenes in Finch? The ideal would be that if a connection gets closed due to inactivity that it wouldn't be surfaced when making a later request.

Version Information:
Finch: 0.2.0
Mint: 1.1.0
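
For reference, adjusting :conn_opts through the pool configuration, as the linked Mint issue suggests, would look roughly like the sketch below. The pool URL and timeout value are illustrative only, not a confirmed fix:

Finch.start_link(
  name: MyFinch,
  pools: %{
    "https://redacted.example.com" => [
      size: 10,
      # Passed through to Mint when the connection is opened; the 30 s
      # connect timeout here is just an example value.
      conn_opts: [transport_opts: [timeout: 30_000]]
    ]
  }
)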

@josevalim
Contributor

Definitely. On checkout, if the connection is closed, we need to automatically reconnect without pushing errors upstream. My understanding is that transferring between processes will fail if the connection is closed. Is this correct @whatyouhide?

@josevalim
Contributor

Actually, it seems we already reconnect if the socket is closed. Can you provide an example that reproduces the issue? I am thinking it is one of those cases where it only says closed when we write to it.

@sneako
Owner

sneako commented Jun 23, 2020

I am actually seeing this error in my application too, but only around 50 times per second while currently making > 100,000 requests/s, so it hasn't been a very high priority for me. I definitely agree that we should be able to prevent it, though. I'll take another stab at making a reproducible example this evening.

@josevalim
Contributor

Do you know if it is happening when we send the request? That would be my bet. If we can identify the source, then it should be a (hopefully) easy fix.

@sneako
Owner

sneako commented Jun 23, 2020

I assume that is where it is happening as well. I can try whipping up a branch that wraps the errors and deploying it temporarily to nail down exactly where this comes from.

@keathley
Collaborator

I think it’s possible that the connection is being closed after it’s checked back into the pool. Then the next time we try to use it, the connection returns an error. That’s my hypothesis.

@sneako
Owner

sneako commented Jun 23, 2020

Looks like this error is always happening (in my application at least) when we are waiting for the response actually, here: https://github.com/keathley/finch/pull/65/files#diff-c3126e24dc4c2613caa9bec21852ec5aR109

@sneako
Owner

sneako commented Jun 23, 2020

I think it’s possible that the connection is being closed after it’s checked back into the pool. Then the next time we try to use it, the connection returns an error. That’s my hypothesis.

I would expect Finch to try to check out a new worker/connection from NimblePool if an error occurred during the handle_checkout callback.

@josevalim
Contributor

@sneako is it happening in the response before it sends the headers and stuff? If so... then yeah, Finch is correct in returning :closed.

@sneako
Owner

sneako commented Jun 24, 2020

I can do another test later today to check if we have received any part of the response yet when this happens.

@sneako
Owner

sneako commented Jun 24, 2020

I deployed this branch: https://github.com/keathley/finch/compare/track-closed-error and can confirm that in my case, the closed error is returned before any part of the response has been received at all (always getting "req_recv_empty"), so it seems that Finch is behaving correctly.

@NeilMenne
Author

I will try to carve out some time to deploy that branch with some additional tracing/logging information.

In the short term, we had to switch to HTTPoison since this appears to be unique to Finch in our deployed environment.

@NeilMenne
Author

We have confirmed with the linked branch that this is the error we're seeing as well. While it may seem that Finch is behaving correctly, under identical workloads HTTPoison does not exhibit this behavior, and Finch has a much more forgiving default receive timeout (15_000 ms, I believe) that we are not approaching.

Is it possible that there are some timing issues around the connection management? If I check out a connection and it is due to be reclaimed for inactivity before I actually send the data, for example, it would be closed by the time I try to send my actual GET request.

@josevalim
Contributor

@NeilMenne are your errors happening before sending the request, before receiving the response, or after receiving the response?

@NeilMenne
Author

I am working on putting it back in place with more thorough telemetry hooks to tell the difference. When I have a sample, I will have an answer!

@NeilMenne
Author

NeilMenne commented Jun 25, 2020

Here is the order of events as telemetry dumped them. It looks like it happens while waiting for the response, but well within the response window, as the whole thing takes about 3 ms. I did try to find relevant logs on the other service as well to correlate, but the request appears to have never gotten there.

All of this on a re-used connection too...

21:56:10.482 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :queue, :start] | %{system_time: 1593122170482380015} | %{host: "redacted", pool: #PID<0.5073.0>, port: 443, scheme: :https}
21:56:10.482 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :queue, :stop] | %{duration: 395655, idle_time: 558151263389} | %{host: "redacted", pool: #PID<0.5073.0>, port: 443, scheme: :https}
21:56:10.482 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :reused_connection] | %{} | %{host: "redacted", port: 443, scheme: :https}
21:56:10.484 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :request, :start] | %{system_time: 1593122170482943249} | %{host: "redacted", path: "/redactedpath", port: 443, scheme: :https}
21:56:10.485 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :request, :stop] | %{duration: 1857550} | %{host: "redacted", path: "/redactedpath", port: 443, scheme: :https}
21:56:10.485 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :response, :start] | %{system_time: 1593122170485031014} | %{host: "redacted", path: "/redactedpath", port: 443, scheme: :https}
21:56:10.485 request_id=FhvoLFbo0yed2sUAAAlk [info] [:finch, :response, :stop] | %{duration: 714219} | %{error: %Mint.TransportError{reason: :closed}, host: "redacted", path: "/redactedpath", port: 443, scheme: :https}
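
For completeness, the lines above come out of a catch-all telemetry handler along these lines; a minimal sketch, where the handler id and log format are just what was used here:

require Logger

events = [
  [:finch, :queue, :start],
  [:finch, :queue, :stop],
  [:finch, :reused_connection],
  [:finch, :request, :start],
  [:finch, :request, :stop],
  [:finch, :response, :start],
  [:finch, :response, :stop]
]

:telemetry.attach_many("finch-debug-logger", events, fn event, measurements, metadata, _config ->
  # Log every Finch event so the ordering around the :closed error is visible.
  Logger.info("#{inspect(event)} | #{inspect(measurements)} | #{inspect(metadata)}")
end, nil)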

@josevalim
Contributor

Beautiful. Did it send any response at all? Like headers and statuses?

@NeilMenne
Author

Beautiful. Did it send any response at all? Like headers and statuses?

There was nothing that I could find. If you have any additional information that I can track, let me know; I'm maintaining a branch for this now.

@sneako
Owner

sneako commented Jun 25, 2020

Thanks for looking into this with us @NeilMenne! I have put together this branch here: https://github.com/keathley/finch/compare/track-closed-error

Using this branch, Finch will return a 3-element tuple on error, where the second element will be an atom or binary telling us whether any part of the response was returned before the :closed error.

@NeilMenne
Author

It's the req_recv_empty error.

@sneako
Owner

sneako commented Jun 25, 2020

Ah right ok, so that means that none of the response was received at all, just like what I saw in my application. 🤔

@NeilMenne
Author

Indeed. It only happens to reused connections so far.

@sneako
Owner

sneako commented Jun 25, 2020

So I guess that means that both setting the mode on the connection and transferring ownership of the connection can succeed even if the socket has been closed...

@NeilMenne
Author

Yeah, that sounds feasible.

@josevalim
Contributor

It may be that the connection was closed for writing on the other side, which is why this happens. We could automatically retry, but we may end up sending the request twice in some cases. We would need to investigate to determine the best way forward.

@NeilMenne
Author

Yeah, I don't think you can reasonably do that; however, if we had some insight into the TCP layer here, we could tell more. Things like keep-alive (which is ostensibly being stripped before I can log it, or is not being requested by default) would be contributing factors.

At the TCP level, we should get a FIN packet or a similar graceful disconnect (depending on the level of abstraction we're interacting with) under normal circumstances. Since the remote service is not dying or thrashing when this happens, it seems like that kind of information could be getting ignored or delayed because the connection is checked out in the client's code.

@josevalim
Contributor

Oh, connection: close! Can you please try this branch? https://github.com/keathley/finch/pull/66

@NeilMenne
Author

Yeah, I'll get it deployed tomorrow, and I'll post results if @sneako doesn't get there first.

@sneako
Owner

sneako commented Jun 26, 2020

Ah, that is interesting. We do have a test case for 'connection: close' here: https://github.com/keathley/finch/blob/master/test/finch_test.exs#L202 We should probably increase the iterations there as well: it only does 10 iterations with a pool size of 10, so we never see what happens when we try to reuse a closed connection. But even after increasing it to 100, it always passes locally.

I wonder if we actually need to look for the header here: https://github.com/keathley/finch/pull/66/files#diff-c3126e24dc4c2613caa9bec21852ec5aR143 or if Mint.HTTP.open?/1 would be enough, since I think that Mint is already checking for that header. I'll try out both.
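
For reference, the Mint.HTTP.open?/1 approach would amount to something like this rough sketch before a connection goes back into the pool; checkin_or_close/1 is a made-up helper name for illustration, not the actual pool code:

defp checkin_or_close(conn) do
  if Mint.HTTP.open?(conn) do
    # Mint still considers the connection open, so it is safe to reuse.
    {:ok, conn}
  else
    # Mint already marked it closed (e.g. after a "connection: close"
    # response), so drop it rather than handing it back to the pool.
    Mint.HTTP.close(conn)
    :closed
  end
end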

@sneako
Owner

sneako commented Jun 26, 2020

It does seem that Mint is closing the connection itself when it encounters this header: https://github.com/elixir-mint/mint/blob/9b12d2f3cd53b45864460203f953c202ae6c6a54/lib/mint/http1.ex#L812

I am still trying out the branch though and will post results shortly

@keathley
Collaborator

I've found an issue in how Mint handles connections after streaming. It's possible that this is related to the issues we've been seeing. I can definitely hypothesize ways that this would result in issues on our end, although I'm not convinced that it is the root issue.

@keathley
Collaborator

I also may have found an issue in how the http1 pool is handling errors after a receive. I'm going to do some experimentation to confirm.

@keathley
Collaborator

Re-opening this until the issues in Mint are fixed.

@keathley keathley reopened this Sep 28, 2020
@keathley
Collaborator

I also may have found an issue in how the http1 pool is handling errors after a receive. I'm going to do some experimentation to confirm.

To follow up on this, my hypothesis was that if a server finishes a response and then closes the connection, Mint could return an error (since it detects that the connection is closed). In that scenario, we would have access to the responses in the error tuple that Mint returns from recv, and we would need to check whether the response was complete and return it if so. I haven't been able to show that this actually happens, although looking at Mint's code I still believe it can. If this were the cause, we would expect the errors to manifest after consuming most of the response. Based on the notes above, when @NeilMenne and @sneako saw this error they received none of the response body, which seems to point to the Mint issue after performing a recv as the main culprit.

At this point, I'm going to work on building a mock HTTP/1 web server so I get a little more control over the underlying TCP connections. Hopefully that will make it a little easier to induce some of these failures.
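
The mock server could start out as simple as the sketch below: a bare :gen_tcp server that serves one canned HTTP/1.1 response and then closes the socket without announcing it, which is roughly the situation the pool needs to survive. The module name and response body are illustrative:

defmodule MockServer do
  def start(port) do
    {:ok, listen} = :gen_tcp.listen(port, [:binary, active: false, reuseaddr: true])
    accept_loop(listen)
  end

  defp accept_loop(listen) do
    {:ok, socket} = :gen_tcp.accept(listen)
    # Good enough for a small GET: read whatever arrives first.
    {:ok, _request} = :gen_tcp.recv(socket, 0)
    :ok = :gen_tcp.send(socket, "HTTP/1.1 200 OK\r\ncontent-length: 2\r\n\r\nok")
    # No "connection: close" header, so the client assumes keep-alive,
    # but close the socket anyway to simulate the server dropping it.
    :gen_tcp.close(socket)
    accept_loop(listen)
  end
end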

@acco

acco commented Oct 1, 2020

Closing the loop on my end, elixir-mint/mint#280 seems to have done the trick for me. Thanks all! 🙏

@keathley
Collaborator

keathley commented Oct 2, 2020

Alright, going to re-close this now that the fix has landed in Mint.

@derekkraan

Hoping the following code snippet can help the next person who stumbles upon this issue while googling:

  # Assumes this sits in a module that requires Logger and defines
  # @default_transport_retry_timeout (e.g. 5_000 ms). Retries are immediate,
  # with no backoff, and apply to any method, so a non-idempotent request
  # may be sent twice (see the discussion above).
  defp do_request(request, pool_name, opts, start_time \\ :erlang.monotonic_time(:millisecond)) do
    case Finch.request(request, pool_name, opts) do
      {:ok, response} ->
        {:ok, response}

      {:error, %Mint.TransportError{} = error} ->
        transport_retry_timeout =
          Keyword.get(opts, :transport_retry_timeout, @default_transport_retry_timeout)

        if :erlang.monotonic_time(:millisecond) < start_time + transport_retry_timeout do
          Logger.warn(fn -> "retrying request, reason: #{inspect(error)}" end)
          do_request(request, pool_name, opts, start_time)
        else
          {:error, error}
        end

      {:error, error} ->
        {:error, error}
    end
  end

@jadlr

jadlr commented Sep 20, 2022

I'm having exactly this problem with the newest versions of finch (0.13.0) and mint (1.4.2). We're using finch in default configuration to call all sorts of APIs. A lot of calls fail with %Mint.TransportError{reason: :closed}

@AHBruns

AHBruns commented Dec 9, 2022

Just wanted to add that we are seeing this too. Interestingly, it seems to be a cold-start issue, as the second request (we implement job retries) seems to always succeed.

@AHBruns

AHBruns commented Dec 9, 2022

For example, these are the logs from an Oban job that calls an external service:

19:44:24.549 [info] {"args":{"contact_id":632851,"portal_id":7712148},"attempt":1,"event":"job:start","id":923,"max_attempts":2,"meta":{},"queue":"default","source":"oban","system_time":1670615064549808298,"tags":[],"worker":"Grimoire.LeadRanking.Workers.RankContactWorker"}
19:44:24.558 [info] {"args":{"contact_id":632851,"portal_id":7712148},"attempt":1,"duration":2333,"error":"** (MatchError) no match of right hand side value: {:error, %Mint.TransportError{reason: :closed}}","event":"job:exception","id":923,"max_attempts":2,"meta":{},"queue":"default","queue_time":-1104483,"source":"oban","state":"failure","tags":[],"worker":"Grimoire.LeadRanking.Workers.RankContactWorker"}
19:44:39.213 request_id=Fy83kDDzUB2AxBAAACdh [info] GET /status/health
19:44:39.214 request_id=Fy83kDDzUB2AxBAAACdh [info] Sent 200 in 193µs
19:44:39.216 request_id=Fy83kDEcv3J9GUMAADBi [info] GET /status/health
19:44:39.216 request_id=Fy83kDEcv3J9GUMAADBi [info] Sent 200 in 234µs
19:44:39.236 request_id=Fy83kDJF32WPO7kAACdx [info] GET /status/health
19:44:39.236 request_id=Fy83kDJF32WPO7kAACdx [info] Sent 200 in 222µs
19:44:40.217 request_id=Fy83kGzETxlraM8AADBy [info] GET /status/health
19:44:40.217 request_id=Fy83kGzETxlraM8AADBy [info] Sent 200 in 178µs
19:44:40.247 request_id=Fy83kG6PrhQxypEAADCC [info] GET /status/health
19:44:40.247 request_id=Fy83kG6PrhQxypEAADCC [info] Sent 200 in 314µs
19:44:40.266 request_id=Fy83kG-n-wV1D8MAADCS [info] GET /status/health
19:44:40.266 request_id=Fy83kG-n-wV1D8MAADCS [info] Sent 200 in 364µs
19:44:42.479 [info] {"args":{"contact_id":632851,"portal_id":7712148},"attempt":2,"event":"job:start","id":923,"max_attempts":2,"meta":{},"queue":"default","source":"oban","system_time":1670615082479194634,"tags":[],"worker":"Grimoire.LeadRanking.Workers.RankContactWorker"}
19:44:43.231 [info] {"args":{"contact_id":632851,"portal_id":7712148},"attempt":2,"duration":747632,"event":"job:stop","id":923,"max_attempts":2,"meta":{},"queue":"default","queue_time":923030,"source":"oban","state":"success","tags":[],"worker":"Grimoire.LeadRanking.Workers.RankContactWorker"}

@sneako
Owner

sneako commented Dec 11, 2022

Thanks for the additional report. If it is happening often, would you be able to grab a packet capture of it happening? That would be much more insightful than these application logs. We are regularly sending over 1.5 million requests/s with Finch and only see a very small number of these errors here and there.

@jadlr

jadlr commented Dec 11, 2022

The issue seems to be correlated with a setting on the called server. It happened for one specific API most of the time, which is why we at first thought it was an issue on the API side. Since we switched to hackney we haven't had any issues, without any changes on the API side.

@josevalim
Contributor

I believe Hackney does automatic retries on connection-closed errors. The problem is when to retry: that could be dangerous on POST requests, for example. However, we could try to see if the error happened before we sent the body (or the empty body). In such cases, it should be safe to retry.
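
Until something like that exists in Finch itself, a caller can approximate it. The sketch below only retries a bodyless GET once on a closed connection, where resending is harmless; get_with_retry/3 is a hypothetical helper, not part of Finch's API:

def get_with_retry(url, finch_name, opts \\ []) do
  request = Finch.build(:get, url)

  case Finch.request(request, finch_name, opts) do
    {:error, %Mint.TransportError{reason: :closed}} ->
      # The GET carried no body, so sending it a second time is safe.
      Finch.request(request, finch_name, opts)

    other ->
      other
  end
end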

@jadlr

jadlr commented Dec 12, 2022

Hackney is reconnecting closed connections as far as I understand it. It is not retrying requests.

@cigrainger

Just chiming in that this is happening a lot for us as well on one particular API. The API does not have any problems like this from other clients, but pretty much always kicks back %Mint.TransportError{reason: :closed} on the first attempt with Finch. The second attempt is pretty much always fine. We've built retries into the app but it's certainly imperfect. If I can be of any assistance in providing more information that might help figure this out, it happens reliably enough for us that that wouldn't be a problem. Please let me know if I can help.

@wojtekmach
Contributor

The API does not have any problems like this from other clients, but pretty much always kicks back %Mint.TransportError{reason: :closed} on the first attempt with Finch

I wonder if you would be able to reproduce this issue by hitting this API using Mint directly. That could give us a clue where a possible culprit might be.
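
For anyone who wants to try that, a bare-bones Mint session looks like the sketch below; the host, port, path, and 60-second wait are placeholders:

{:ok, conn} = Mint.HTTP.connect(:http, "redacted.example.com", 9200)

# First request: usually succeeds.
{:ok, conn, _ref} = Mint.HTTP.request(conn, "GET", "/", [], nil)

# Drain whatever the server sends back for the first request.
conn =
  receive do
    message ->
      {:ok, conn, _responses} = Mint.HTTP.stream(conn, message)
      conn
  end

# Wait long enough for the server to drop the idle connection...
Process.sleep(60_000)

# ...then reuse the same conn. This second request is where the reported
# {:error, conn, %Mint.TransportError{reason: :closed}} would surface.
{:ok, _conn, _ref} = Mint.HTTP.request(conn, "GET", "/", [], nil)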

@cigrainger

cigrainger commented Jun 15, 2023

Good suggestion. I don't know if I can. I thought I did, but I can't replicate. I got this, but it honestly may have been user error as I was interactively getting to grips with querying directly with Mint in Livebook. But it's reliably returning with no problems now.

** (MatchError) no match of right hand side value: {:error, %Mint.HTTP1{host: redacted, port: 9200, request: nil, streaming_request: nil, socket: #Port<0.21>, transport: Mint.Core.Transport.TCP, mode: :active, scheme_as_string: "http", requests: {[], []}, state: :closed, buffer: "", proxy_headers: [], private: %{}, log: false}, %Mint.TransportError{reason: :closed}}
    (stdlib 4.0.1) erl_eval.erl:496: :erl_eval.expr/6
    #cell:p7jckfdyq4mpx32ntkhlnbc57sm5zb5u:1: (file)

This is the Livebook: https://gist.github.com/cigrainger/a8c876e0f11db16d2b5de15c65fdf351

@cigrainger

cigrainger commented Jun 15, 2023

Okay, I can get that reliably if I wait a while and try another request. Should a conn 'expire'?

** (MatchError) no match of right hand side value: {:error, %Mint.HTTP1{host: redacted, port: 9200, request: nil, streaming_request: nil, socket: #Port<0.12>, transport: Mint.Core.Transport.TCP, mode: :active, scheme_as_string: "http", requests: {[], []}, state: :closed, buffer: "", proxy_headers: [], private: %{}, log: false}, %Mint.TransportError{reason: :closed}}
    (stdlib 4.0.1) erl_eval.erl:496: :erl_eval.expr/6
    /Users/chris/Library/Application Support/livebook/autosaved/2023_06_15/14_57_38_mint_closed_gg2j.livemd#cell:ol4eldoq352ibnvzyszealwcmtjvs27g:1: (file)

Edit: Maybe this is a problem with handling TCP keepalive?
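
If keepalive were the culprit, one cheap experiment would be enabling OS-level TCP keepalive on the pool's sockets via the transport options. This is purely a guess rather than a known fix, and the pool shape is illustrative:

Finch.start_link(
  name: MyFinch,
  pools: %{
    default: [
      # Forwarded to the underlying :gen_tcp/:ssl socket.
      conn_opts: [transport_opts: [keepalive: true]]
    ]
  }
)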

@acco

acco commented Jan 16, 2024

In case you're using Req and finding yourself here (like we were): try upgrading to the latest version!

@MrYawe

MrYawe commented Jun 4, 2024

If you are on AWS, this comment may help you: #272 (comment)
