Unvendor requests and use urllib3 directly #1495

joguSD · 2018-07-04T00:13:37Z

This pull request begins to implement the changes discussed in #1486

This builds on some of the work started in this PR: #1466

The work is not yet done but this branch should largely be functional for the majority of use cases. In particular exception cases have been hard to verify, and there will be cases where different exceptions are now raised. This is an unavoidable consequence of moving away from our vendored dependencies. To continue building on this I would like to minimize dependency exception uses to single modules, and ensure that we don't leak exceptions from our dependencies to simplify both the exceptions generated from botocore and make it easier to verify botocore's exception behavior.

Additionally, these changes impact our dependent packages s3-transfer, boto3, and aws-cli as all have references to our vendored dependencies (mostly exception classes).

Downstream Updates:
Boto3: boto/boto3#1668
CLI: aws/aws-cli#3513

Closes #756, #1248, #1258, #1300, #1370, #1385, #1464, #1466,
Closes aws/aws-cli#2994

codecov-io · 2018-07-04T00:26:09Z

Codecov Report

Merging #1495 into develop will decrease coverage by 6.15%.
The diff coverage is 94.62%.

@@             Coverage Diff             @@
##           develop    #1495      +/-   ##
===========================================
- Coverage    72.82%   66.67%   -6.16%     
===========================================
  Files          128      129       +1     
  Lines        14236    14462     +226     
===========================================
- Hits         10367     9642     -725     
- Misses        3869     4820     +951

Impacted Files	Coverage Δ
botocore/stub.py	`97.8% <100%> (-0.08%)`	⬇️
botocore/exceptions.py	`100% <100%> (ø)`	⬆️
botocore/response.py	`92.64% <100%> (+0.58%)`	⬆️
botocore/compat.py	`93.92% <100%> (ø)`	⬆️
botocore/retryhandler.py	`99.36% <100%> (-0.01%)`	⬇️
botocore/parsers.py	`99.78% <100%> (ø)`	⬆️
botocore/utils.py	`97.43% <86.36%> (-1.01%)`	⬇️
botocore/httpsession.py	`92.3% <92.3%> (ø)`
botocore/endpoint.py	`98.37% <94.44%> (+0.95%)`	⬆️
botocore/awsrequest.py	`99.27% <99.15%> (+0.63%)`	⬆️
... and 28 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0cb551e...51f5b5d. Read the comment docs.

jamesls · 2018-07-06T20:24:35Z

I'm starting to look at this now, and one of the things that would really help is putting a description of what/how/why you changed certain things. Comparing this to what requests does, it's not quite the same and there's not really an explanation of why anywhere (and it's not immediately obvious from looking at your code). That and docstrings would help out.

While this isn't always the case for small PRs, ripping out our existing http client is going to require careful scrutiny and the more insight you can give into the changes you've made, the better (and more timely) feedback you're going to get.

jamesls

I haven't look at this in depth, I mostly just have high level questions for now.

One of the hardest parts to follow was the proxy logic, which at first glance seems to be split across http session functions, the Urllib3Session class, and some stuff in utils.py. While it would be preferable to consolidate them in a single class, I think there should be some docstring/comment somewhere that explains how proxies work now that we have to do more of the work than when we relied on requests.

And lastly, I'm wondering how to handle the connection errors leaking out from requests. I think a lot of code uses this (chalice also catches this), and this would be a pretty big breaking change for customers. Maybe we figure out some way to roll out the exception changes first, get code updated where possible, then merge the urllib3 changes. Worst case is we keep the exception around and just alias it to a new botocore exception.

jamesls · 2018-07-06T20:30:26Z

botocore/http_session.py

+DEFAULT_TIMEOUT = 60
+MAX_POOL_CONNECTIONS = 10
+# TODO: move the vendored cacert when we drop requests
+DEFAULT_CA_BUNDLE = os.path.join(


I'd just copy it over to its new location now so we can just remove vendored/ entirely when we're ready to ship this.

I didn't copy it over as the CA_BUNDLE kinda muddies the diff. I was planning on moving this at the same time we remove the vendored requests directory.

kinda muddies the diff

Given we won't likely won't remove the vendored requests directory for a while, I'd prefer if this new stuff had minimal dependencies on anything in vendored/ unless necessary. That way we can just remove vendored/ when appropriate. It will also give customers something to switch to if for whatever reason they're depending on our cacert.pem.

Copied the vendored CA Cert and updated this.

jamesls · 2018-07-06T20:32:11Z

botocore/http_session.py

+        return None, None
+
+
+def get_proxy_headers(proxy_url):


All of the proxy and auth stuff seems like it would be better to encapsulate in a single class rather than a collection of module level functions.

I'll see what I can do. This logic is more or less directly ported from what requests does.

This logic is more or less directly ported from what requests does.

It's still confusing to follow and I'd prefer we come up with a better abstraction for proxies if we're going to be maintaining/supporting this.

jamesls · 2018-07-06T20:33:01Z

botocore/http_session.py

+    return headers
+
+
+def fix_proxy_url(proxy_url):


I know this PR is an early draft, but function like fix_proxy_url without any explanation or docs is hard to review.

Yeah, I haven't touched docstrings yet as I didn't know how much churn there would be. I'll try adding some documentation for the things that I think are little more solid.

jamesls · 2018-07-06T20:51:34Z

botocore/http_session.py

+    return where()
+
+
+class Urllib3Session(object):


For this class, general comments:

It seems like part of this class's logic for proxies is contained in the class itself, and part is in module functions above. It would be helpful to consolidate all the proxy stuff to a single class.

URL not Url.

It looks like this takes the place of the requests session and adapter. What was the rationale for consolidating the two classes?

All of the proxy logic inside the class directly relates to taking a properly formed proxy url and turning that into a urllib3 proxied connection. The logic outside the class don't relate to anything urllib3 specific.

URLLib3?

This correlates to the adapter. It's less of consolidating the two classes and more-so removing the part we didn't use. We effectively passed every parameter every time we made a request so the session didn't provide much. The parts it did give us we either didn't rely on (retries, cookies) or are now our handling in our own retry handler (redirects).

Looking into refactoring the proxy code.

jamesls · 2018-07-06T20:56:23Z

botocore/http_session.py

+        self._timeout = timeout
+        self._max_pool_connections = max_pool_connections
+        self._proxy_managers = {}
+        self._http_pool = PoolManager(


If this is designed to be tied to a single client (i.e one urllib3 session per botocore client) what's the rationale for using a pool manager instead of a connection pool? If we're going to use a pool manager, wouldn't it make more sense to tie it to the botocore session and share it across all clients?

Yep, this is potentially a change I've thought about exploring as well. I've kept it this way as it more closely correlates to what requests was going previously and I haven't explored what implications switching from a pool manager to a connection pool would entail.

What's interesting is that the max_pool_connections is really "max persistent connections", and in the case of s3 with virtual hosted addressing, this number is potentially 100 max persistent connections by default (10 per pool, 10 pool total, 1 pool used per bucket because it's a different hostname). A connection pool would make that closer to what you'd expect max_pool_connections to mean, but could potentially result in worse performance. I suppose we should keep this as is for now to minimize risky changes in the switchover.

Yep, that was the big difference I noticed as well.

jamesls · 2018-07-06T21:35:34Z

botocore/awsrequest.py

+            # Read the contents.
+            try:
+                data_generator = self.raw.stream()
+            except AttributeError:


When would raw not have a stream method on it?

When it's not a raw response from urllib3.

Currently, this will never happen in practice but we relied on this functionality from requests in some tests where we mock the raw response with a file-like object.

Yeah I'd prefer to simplify and remove this fallback if our tests are the only thing relying on it.

I removed it and just updated the tests that were dependent on it.

jamesls · 2018-07-06T21:36:09Z

botocore/awsrequest.py

+        if not self.content:
+            return str('')
+
+        return self.content.encode('utf-8')


Shouldn't this take into account encodings provided via response headers?

It could, requests had some fairly complex logic for trying to determine the content encoding. The only place the text property is used is in the ContainerMetadataFetcher. I think I might just remove this and do the decoding there.

Switched this to just use content and decode.

jamesls · 2018-07-06T21:37:12Z

botocore/awsrequest.py

+        try:
+            length = len(self.body)
+            self.headers['Content-Length'] = str(length)
+        except Exception as e:


What would raise here? The len() call?

Why's that? We should scope this to something more specific than Exception.

jamesls · 2018-07-06T21:40:11Z

botocore/awsrequest.py

@@ -54,7 +50,7 @@ def _read_status(self):
            return HTTPResponse._read_status(self)


-class AWSHTTPConnection(HTTPConnection):
+class AWSConnection(object):
    """HTTPConnection that supports Expect 100-continue.


docstring is out of date.

Kind of? The functionality and inheritance described is still true, it just doesn't actually get inherited until it's used as a mixin.

jamesls · 2018-07-06T21:42:49Z

botocore/awsrequest.py

+            try:
+                data_generator = self.raw.stream()
+            except AttributeError:
+                def stream():


Is there a reason not to just have this be a separate method and always use it (vs. using raw.stream() if it's available).

This is another case of 'it's what requests does'. I think the motivation here was that the Response object is generic and the underlying response could come from a different library other than urllib3 that may or may not have a stream method.

In that case, I'd prefer to remove this if nothing's relying on it. If we are going to keep it at least extract it out to a method instead of a nested function.

joguSD · 2018-07-06T22:02:53Z

@jamesls What do you think would be the best format for tracking these types of comparisons? I don't necessarily think it makes sense to put these in-code (as the difference are mostly removals). My workflow on this has been less about "this is what requests does" and more about "this is what botocore expects" (passing all the tests). In general my answer to "Why is this different from requests?" will be "because botocore doesn't use it, afaik".

At a really high level we're basically taking two things from requests and re-implementing the subset that botocore expects. These two pieces are:

requests model objects that represent a request & response (AWSRequest, AWSPreparedRequest, AWSResponse)
Requests adapter class for urllib3

As for the Urllib3Session which parallels to the HTTPAdapter class in requests there shouldn't be much difference. The big difference is:

Our vendored version of requests implemented chunked encoding for request bodies. Newer versions of urllib3 support this by setting chunked=True when calling urlopen, so we don't need it in our http session class. Additionally, I've not enabled this flag because afaik we never use chunked encoding when making requests.

The model objects do have a lot of changes as I basically gutted them to have next to no logic and only implemented the things that we relied on requests for. This turns out to be not a whole lot of functionality as botocore is already producing / preparing generally well structured requests before handing them off to requests. The behavior that I've found we needed is:

Converting the modeled params dictionary to the query string
Determining content length of the body
Converting the headers to a case-insensitive dictionary
The biggest differences here are url preparation and determining the content length of the body. Requests parses the url does some validations and conversions on it and reconstructs it. From what I've gathered this process basically just fixes any issues the url may have and I've never ever seen it actually change a url after it was done. I verified this by running all of the tests asserting that the url before and after are the same. As for determining content-length of the body I tried to keep it simple and implement the logic that we document (it's bytes or a file-like).

jamesls · 2018-07-09T16:15:30Z

What do you think would be the best format for tracking these types of comparisons?

If you're asking where's the best place to document where we deviate from what requests does, I think it's worth putting in the code somewhere as either a comment or a docstring.

A lot of your previous comment would have been really helpful to read as a comment/docstring while reviewing the code.

JordonPhillips · 2018-07-12T20:15:46Z

botocore/awsrequest.py

+    there are the following differences:
+        This class does not heavily prepare the URL. Requests performed many
+        validations and corrections to ensure the URL is properly formatted.
+        Botocore either performs this validations elsewhere or otherwise


s/this/these

JordonPhillips · 2018-07-12T20:54:40Z

botocore/awsrequest.py

+        self.stream_output = stream_output
+
+        if headers is not None:
+            for key, value in headers.items():


self.headers.update(headers)

I'll check if this works but self.headers isn't a dict it's a HTTPHeaders object which has some weird behavior and this is how we were doing this previously so I just left it.

Yeah this doesn't work.

JordonPhillips · 2018-07-12T21:04:37Z

botocore/awsrequest.py

+        return hash(self._lower)
+
+    def __eq__(self, other):
+        return self._lower == other._lower


Probably not needed, but what about putting a try/except here in case this is compared against something else?

I can do an isinstance check I suppose.

JordonPhillips · 2018-07-12T21:06:56Z

botocore/cacert.pem

@@ -0,0 +1,4433 @@
+


While we're at it, why not just go certifi?

I'm trying to keep changes to a minimum for now. I think this is something we could come back to.

jamesls

Looks good, I mostly focused on the http_session code. I'd like to look over the new exceptions in more detail, I'll have more feedback tomorrow.

jamesls · 2018-08-14T00:14:21Z