Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed issue #643 and memory leak in hackney_pool #661

Merged
merged 2 commits into from
Nov 20, 2020

Conversation

SergeTupchiy
Copy link
Contributor

@SergeTupchiy SergeTupchiy commented Oct 13, 2020

Fixes: #643

This is similar to PR #656, but using hackney_manager:cancel_request/1 ensures that request state is also erased from process dictionary and deleted from hackney_manager_refs ETS table.

Additionally, this PR includes a memory leak fix (pending client not removed in hackney_pool:dequeue) and some code cleanup.

@@ -127,8 +127,7 @@ do_checkout(Requester, Host, _Port, Transport, #client{options=Opts,
{error, Reason} ->
{error, Reason};
{'EXIT', {timeout, _}} ->
% socket will still checkout so to avoid deadlock we send in a cancellation
gen_server:cast(Pool, {checkout_cancel, Connection, RequestRef}),
%% checkout should be canceled by the caller via hackney_manager
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to cancel it here, as we now can rely on eventually caneling it by hackney_manager.

%------------------------------------------------------------------------------
%% @private
%%------------------------------------------------------------------------------
del_from_queue(Connection, Ref, Queues) ->
Copy link
Contributor Author

@SergeTupchiy SergeTupchiy Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed anymore. Moreover, there is remove_pending which can be used instead and is implemented in a more efficient manner: first check pending dict and (if found) filter queue.

@isaacsanders
Copy link

This is awesome, you have all but one test passing.

@SergeTupchiy
Copy link
Contributor Author

SergeTupchiy commented Oct 15, 2020

This is awesome, you have all but one test passing.

Looks like the failed test (hackney_integration_tests:'-test_frees_manager_ets_when_body_is_in_client/0-fun-2-'/2 ) has a non-deterministic behavior, At least two reasons for it are as follows:

 {ok, _unusedBody} = hackney:body(Client),
    AfterCount = ets:info(hackney_manager_refs, size),
   ...

I think it's worth fixing this test, but it's not directly related to the issue.
Anyway, I re-trigered CI and all tests have passed. :)

@isaacsanders
Copy link

@benoitc @IceDragon200 I think we might have a solution here.

@IceDragon200
Copy link

IceDragon200 commented Oct 15, 2020

@isaacsanders @SergeTuchiy I guess the last test is to run a request loop and see if it still experiences checkout_timeout, I'll do that shortly and let you know how that goes

EDIT:

iex(25)> results = for _ <- 1..200, do: :hackney.request(:get, "http://localhost:6573"); Enum.uniq(results)
[error: :econnrefused]

Looks like it works as intended, so 👍

I'll close my PR and reference this one

@ijunaid8989
Copy link

do anyone else faced this issue on this PR.

[%HTTPoison.Error{id: nil, reason: :checkout_failure}]

@SergeTupchiy
Copy link
Contributor Author

do anyone else faced this issue on this PR.

[%HTTPoison.Error{id: nil, reason: :checkout_failure}]

@ijunaid8989
Could you please share some details to reproduce it?

@ijunaid8989
Copy link

do anyone else faced this issue on this PR.

[%HTTPoison.Error{id: nil, reason: :checkout_failure}]

@ijunaid8989
Could you please share some details to reproduce it?

I have been using this branch in latest HTTPoison , and I am sending 500 per second post requests to a third party API. my code looks like this

  def post(url, file_path, image) do
    HTTPoison.post(
      url,
      {:multipart, [{file_path, image, []}]},
      [],
      hackney: [pool: :seaweedfs_upload_pool, recv_timeout: 15_000]
    )
  end

and I have these pool settings

      :hackney_pool.child_spec(:snapshot_pool, [timeout: 50_000, max_connections: 10_000]),
      :hackney_pool.child_spec(:seaweedfs_upload_pool, [timeout: 50_000, max_connections: 10_000]),
      :hackney_pool.child_spec(:seaweedfs_download_pool, [timeout: 50_000, max_connections: 10_000]),

in my application tree.

@SergeTupchiy
Copy link
Contributor Author

@ijunaid8989
Are you using https?
It's hard to guess what exactly went wrong under your conditions, but I tried to make many concurrent https multipart requests using your pool settings and got checkout_failure, caused by:

exit, {timeout,
                            {gen_server,call,
                                [application_controller,
                                 {set_env,crypto,'$curves$',
                                      ...

My code snippet:

1> application:ensure_all_started(hackney).                                                                                                                                              {ok,[crypto,asn1,public_key,ssl,unicode_util_compat,idna,                                      
     mimerl,certifi,syntax_tools,parse_trans,ssl_verify_fun,
     metrics,hackney]}
2> hackney_pool:start_pool(multipart, [{max_connections, 10_000}, {timeout, 50_000}]).
ok
3> [spawn(fun() -> case hackney:post("https://example.com", [], {multipart, [{file, <<"/tmp/img.jpg">>, <<"image">>, []}]}, [with_body, {pool, multipart}, {recv_timeout, 5000}]) of {error, Reason} -> io:format("error: ~p ~n", [Reason]); _ -> ok end end) || _ <- lists:seq(1,300)].

This not introduced by this PR, as it's also reproducible on master.
If you reason is the same, I would recommend calling crypto:supports() before making any https requests. This will 'warm up' crypto by pre-populating its app config.

@ijunaid8989
Copy link

@SergeTupchiy Actually we are not making any HTTPS request,

the origin of the request is HTTPS but the remote URL to which request is going is not HTTPs. do you still suggest the same?

@SergeTupchiy
Copy link
Contributor Author

@ijunaid8989,
This is the place where it occurs:
https://github.com/SergeTupchiy/hackney/blob/master/src/hackney_pool.erl#L74

You can add some logging there and share the output. It would help to locate the issue.
For example:

try
  do_checkout(Requester, Host, Port, Transport, Client, ConnectTimeout, CheckoutTimeout) 
catch Err:Res:St ->
  io:format("checkout failure: ~p, ~p, ~p~n", [Err, Res, St]),
  {error, checkout_failure}
end

@ijunaid8989
Copy link

@SergeTupchiy we are actually trapped with many errors. checkout timeout, checkout failure, closed.

edgurgel/httpoison#326

@SergeTupchiy
Copy link
Contributor Author

@ijunaid8989,
There is no surprise that you (occasionally?) get all the errors you mentioned under some still not completely clear conditions.
This PR is intended to fix a rather clear and defined problem, see #643.
I'd be happy to at least try to fix the issues you face but it would be too time consuming (if feasible at all) for me, considering the details you shared.
Saying that one experiences checkout_timeout or closed doesn't really helps, since those errors are natural to happen in some cases .
I have no idea how to reproduce the issue you linked. It's have been open for two years and I guess I know why it's still not resolved: someone says that one http client works fine with service B but fails with server A, while another works with both. It doesn't look like an exhaustive description.

@benoitc
Copy link
Owner

benoitc commented Nov 1, 2020

sorry for the late reply. I have been quite busy these days.

Thanks for the patch! Looking at it.

@isaacsanders
Copy link

@benoitc Running into this issue currently. Any status update on your end?

@lasernite
Copy link

lasernite commented Nov 11, 2020

@benoitc Also getting bit hard by this, merging would be much appreciated so we don't have to further litter our codebase with catches and hard resets of the pool everytime we make a request with HTTPoison that may timeout. Thank you!

@ijunaid8989
Copy link

@lasernite how do you hard reset your pool?

@benoitc benoitc added this to the 1.17.0 milestone Nov 12, 2020
@lasernite
Copy link

@ijunaid8989 we're doing stuff like:

{:error, %HTTPoison.Error{reason: :checkout_timeout}} ->
    :hackney_pool.stop_pool(:default)

@ijunaid8989
Copy link

@ijunaid8989 we're doing stuff like:

{:error, %HTTPoison.Error{reason: :checkout_timeout}} ->
    :hackney_pool.stop_pool(:default)

is it helpful? we also getting timeout/checkout_timeout/checkout_failure/closed?

Would that be helpful in such condition?

@lasernite
Copy link

@ijunaid8989 Yeh once we catch the timeouts and reset the pool the error goes away—although I assume this has some negative performance implications as unnecessarily clearing the pool frequently, but nothing noticeable so far.

@ijunaid8989
Copy link

@ijunaid8989 Yeh once we catch the timeouts and reset the pool the error goes away—although I assume this has some negative performance implications as unnecessarily clearing the pool frequently, but nothing noticeable so far.

Okay thanks.

FYI: we have started hackney pools such as

      :hackney_pool.child_spec(:snapshot_pool, [timeout: 50_000, max_connections: 10_000]),
      :hackney_pool.child_spec(:seaweedfs_upload_pool, [timeout: 50_000, max_connections: 10_000]),
      :hackney_pool.child_spec(:seaweedfs_download_pool, [timeout: 50_000, max_connections: 10_000]),

in application.ex file.

@lasernite
Copy link

@ijunaid8989 Assuming those child_specs are getting started in a supervision tree in your application, I'm guessing you have to stop the pools by the atoms you customized above (:snapshot_pool, etc.) instead of :default when calling stop_pool. But don't take my word for it—I'm just using a library around hackney (HTTPoison) instead of accessing the methods directly like this, so I'll leave it up to you to see what works!

@ijunaid8989
Copy link

@ijunaid8989 Assuming those child_specs are getting started in a supervision tree in your application, I'm guessing you have to stop the pools by the atoms you customized above (:snapshot_pool, etc.) instead of :default when calling stop_pool. But don't take my word for it—I'm just using a library around hackney (HTTPoison) instead of accessing the methods directly like this, so I'll leave it up to you to see what works!

Thanks, I am also using HTTPoison, but this is how we are starting the pools.

can you tell me how you have started the pools? I mean in the config file? or with each HTTP request?

Maybe your way of using pools with HTTPoison is different (correct).?

@lasernite
Copy link

@ijunaid8989 We're not setting up any custom pools. As HTTPoison is a dependency in mix.exs when we start the server the normal application lifecycle methods are hit to automatically start HTTPoison and add to the supervision tree. Pooling reduces latency on subsequent requests from the same host, but I haven't investigated whether our patch is stopping pooling altogether or whether when the pools get stopped the supervision tree is automatically restarting them again. Doesn't matter except for saving a few hundred milliseconds on subsequent requests in the best case, which if we are losing will be fixed when this library is updated and we remove the temporary patch.

remove pending by the same ref that was taken from the front of the queue
Replace casting checkout_cancel msg in hackney_pool with a hackney_manager:cancel_request/1 call from hackney_connect:
this must ensure that pool checkout is canceled as well as client state is erased
@SergeTupchiy
Copy link
Contributor Author

SergeTupchiy added 2 commits on Oct 13
@SergeTupchiy
Fix a memory leak in hackney_pool
c9979cf
@SergeTupchiy
Fix checkout_timeout error in hackney_pool caused by connection errors
2bf38f9

This is just to reformat my commit messages 😄

@ijunaid8989
Copy link

@SergeTupchiy You got me, I was thinking its a fix for something!

@benoitc
Copy link
Owner

benoitc commented Nov 19, 2020

late update but testing the patch right now. thanks for the refactoring

@ijunaid8989
Copy link

late update but testing the patch right now.

It would be great to see some progress on this. I had to clone HTTPoison as well as Hackney to use my own sources.

@benoitc
Copy link
Owner

benoitc commented Nov 20, 2020

late update but testing the patch right now.

It would be great to see some progress on this. I had to clone HTTPoison as well as Hackney to use my own sources.

this release is beeing tested. It will be normally shipped later today.

@benoitc benoitc merged commit 91763df into benoitc:master Nov 20, 2020
@benoitc
Copy link
Owner

benoitc commented Nov 20, 2020

@SergeTupchiy thanks for the patch. Just merged your changes. Will finish the testing later today of the release today. Normally a release should land today as well.

@RaphSfeir
Copy link

Is this now part of the 1.16.0 release? I'm also using my own sources and would like to switch back to official release but I keep getting the same issue. Thanks!

@benoitc
Copy link
Owner

benoitc commented Nov 27, 2020 via email

digitalcora added a commit to mbta/alerts_concierge that referenced this pull request Mar 5, 2021
This version fixes a memory/connection pool leak that has resulted in
app crashes on some of our other projects (and maybe this one?).

ref: benoitc/hackney#661
digitalcora added a commit to mbta/alerts_concierge that referenced this pull request Mar 5, 2021
This version fixes a memory/connection pool leak that has resulted in
app crashes on some of our other projects (and maybe this one?).

Since Hackney is not explicitly depended-on anywhere in `concierge_site`
it can be removed from that app's dependencies entirely.

ref: benoitc/hackney#661
digitalcora added a commit to mbta/alerts_concierge that referenced this pull request Mar 5, 2021
This version fixes a memory/connection pool leak that has resulted in
app crashes on some of our other projects (and maybe this one?).

Since Hackney is not explicitly depended-on anywhere in `concierge_site`
it can be removed from that app's dependencies entirely.

ref: benoitc/hackney#661
digitalcora added a commit to mbta/alerts_concierge that referenced this pull request Mar 5, 2021
This version fixes a memory/connection pool leak that has resulted in
app crashes on some of our other projects (and maybe this one?).

Since Hackney is not explicitly depended-on anywhere in `concierge_site`
it can be removed from that app's dependencies entirely.

ref: benoitc/hackney#661
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

connections not returned to the pool on connection errors
7 participants