Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Replaced GAIResolver with twisted.names.client for the default resolver #5053

Closed
wants to merge 4 commits into from

Conversation

olivia-fl
Copy link

@olivia-fl olivia-fl commented Apr 12, 2019

I profiled synapse for a relatively small homeserver that is in several large rooms, and found that a huge chunk of time was being taken up by calls to getaddrinfo:

before changes

It turns out that twisted defaults to using twisted.internet._resolver.GAIResolver for resolving hostnames. This is implemented with blocking getaddrinfo calls on a thread pool, but python appears to not allow running concurrent getaddrinfo anyway. This pull request replaces the default relsover with twisted.names.client, which is a user space non-blocking resolver.

after changes

Signed-off-by: Benjamin Lee [email protected]

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file
  • Pull request includes a sign off

@olivia-fl olivia-fl force-pushed the avoid-getaddrinfo branch 2 times, most recently from bd0ce97 to 834290c Compare April 12, 2019 23:14
@hawkowl
Copy link
Contributor

hawkowl commented Apr 16, 2019

@Benjamin-L Hi, thanks for the patch!

I'm going to hold off on merging this for two reasons:

  1. This uncovered a bug in Twisted's DNS implementation on Python 3. (https://twistedmatrix.com/trac/ticket/9627#ticket)
  2. The reason why Twisted (and Synapse) uses GAI as the default is that it makes DNS issues more debuggable for the average user. By using GAI, and therefore the operating system's implementation, cache, and that sort of thing, a user can debug DNS issues with operating system tools. If Synapse (or Twisted) quietly uses its own implementation, they require knowledge of Twisted's DNS system to debug issues.

As such, I would like if this was put behind a configuration option, so people can opt into it instead.

@olivia-fl
Copy link
Author

That seems pretty reasonable. Amusingly, I ended up discovering that exact issue when trying to debug the sytest failures, but didn't find anything on the issue tracker for it. It's good to know it's already known.

Another thing that I've found out since submitting this is that whether or not getaddrinfo releases the GIL varies depending on OS. On OS's where it doesn't lock, I'm not even sure it's actually a bottleneck, since the profiler samples would be on different threads that wouldn't block the main thread. For that we'd need a good way of measuring how much time is actually spend waiting on DNS queries, and I'm not familiar enough with python to know what a good way of doing that is.

Putting this behind a config option seems like a good choice. I'm going to try to implement that and add a new commit. I'll call the option no_getaddrinfo for now, but there's probably a better name.

@codecov
Copy link

codecov bot commented Apr 16, 2019

Codecov Report

Merging #5053 into develop will decrease coverage by <.01%.
The diff coverage is 50%.

@@             Coverage Diff             @@
##           develop    #5053      +/-   ##
===========================================
- Coverage    61.57%   61.56%   -0.01%     
===========================================
  Files          332      332              
  Lines        34265    34267       +2     
  Branches      5648     5648              
===========================================
+ Hits         21097    21098       +1     
- Misses       11654    11655       +1     
  Partials      1514     1514

@codecov
Copy link

codecov bot commented Apr 16, 2019

Codecov Report

Merging #5053 into develop will decrease coverage by 0.58%.
The diff coverage is 50%.

@@             Coverage Diff             @@
##           develop    #5053      +/-   ##
===========================================
- Coverage    62.15%   61.56%   -0.59%     
===========================================
  Files          336      332       -4     
  Lines        34659    34269     -390     
  Branches      5694     5649      -45     
===========================================
- Hits         21541    21099     -442     
- Misses       11583    11656      +73     
+ Partials      1535     1514      -21

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what erik said. Please could you rename the option.

Also looks like a conflict has crept in - please could you merge in latest develop?

Otherwise this looks good to me.

@olivia-fl olivia-fl force-pushed the avoid-getaddrinfo branch from ca00a32 to c0aff07 Compare May 13, 2019 23:39
@olivia-fl
Copy link
Author

I've pushed the renamed option, as well as adding it to the default config. Do you want me to squash the commits?

@richvdh richvdh self-requested a review May 14, 2019 10:00
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this. No need to squash the commits (we can do that when we merge the PR), but please take a look at the comments below.

I have a concern, however:

I note that the CI is failing (which is partly because the condition at https://github.com/matrix-org/synapse/pull/5053/files#diff-80846f8c47b65e25b3c6716a3eb1bbc8R135 is reversed, so that the option is being used when it shouldn't, but even so: I wouldn't expect the integration tests to fail).

It looks like, when we switch to twisted.names.client, outbound requests to https://localhost:port are only made to the IPv6 version of localhost (::1), and it seems that sytest is only listening on IPv4. My expectation would be that we would attempt to connect to both IPv4 and IPv6. @hawkowl: is this the manifestation of twisted #9627?

@@ -46,6 +46,12 @@ pid_file: DATADIR/homeserver.pid
#
#cpu_affinity: 0xFFFFFFFF

# Whether to use the default system resolver (getaddrinfo) instead of the twisted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that this is a generated file (the script to update it is scripts-dev/generate-sample-config). Please could you update the source for it (in synapse/config/server.py) and then run the script to update the sample config?

# asynchronous dns resolver. Using the async resolver can potentially improve
# performance, but may also behave different from getaddrinfo.
# Defaults to True. Uncomment to use the twisted resolver instead.
#use_getaddrinfo_for_dns: False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while you're changing the source, please could you add a line with just # between the text and the option, for consistency with the rest of the file.

@@ -0,0 +1 @@
Replace blocking getaddrinfo hostname resolver with the non-blocking resolver from twisted.names.client.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Replace blocking getaddrinfo hostname resolver with the non-blocking resolver from twisted.names.client.
Add an option to replace the blocking `getaddrinfo` hostname resolver with the non-blocking resolver from `twisted.names.client`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(could call this a feature rather than a misc?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, feel free to add yourself as a contributor here, along the lines of the other entries in CHANGES.md.

@@ -127,6 +131,10 @@ def run():
change_resource_limit(soft_file_limit)
if gc_thresholds:
gc.set_threshold(*gc_thresholds)

if use_getaddrinfo_for_dns:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surely this condition is reversed?

@richvdh
Copy link
Member

richvdh commented Jun 27, 2019

@Benjamin-L any chance of taking another look at this?

@richvdh
Copy link
Member

richvdh commented Jul 22, 2019

I'm going to assume this is abandoned. Please reopen it if not!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants