
Updated iter_bucket to use concurrent futures. #368

Merged: 8 commits into piskvorky:master on Mar 11, 2020

Conversation

derpferd (Contributor)

This PR addresses issue #340.
AWS Lambda environments do not support multiprocessing.Queue or
multiprocessing.Pool, which are used by iter_bucket to optimize the
pulling of files from s3.

Solution: Switch to using concurrent.futures.ThreadPoolExecutor instead.
This still optimizes the pulling of files from s3 without using new
processes.
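
For illustration, a minimal sketch of the thread-based approach (simplified, not the PR's exact code; the worker count and wiring are illustrative, though the `_download_key` helper and `bucket.download_fileobj` call match the traceback further down this thread):

```python
import concurrent.futures
import io

import boto3


def _download_key(bucket, key_name):
    # Download a single key into memory and return (name, contents).
    buf = io.BytesIO()
    bucket.download_fileobj(key_name, buf)
    return key_name, buf.getvalue()


def iter_bucket(bucket_name, workers=16):
    # Threads, unlike multiprocessing workers, are available on AWS Lambda.
    bucket = boto3.resource('s3').Bucket(bucket_name)
    key_names = [obj.key for obj in bucket.objects.all()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(_download_key, bucket, name)
                   for name in key_names]
        # Yield (key, content) pairs as downloads complete,
        # not in submission order.
        for future in concurrent.futures.as_completed(futures):
            yield future.result()
```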

@piskvorky (Owner) commented Oct 29, 2019

Interesting! How do we (unit-)test this?

@mpenkov (Collaborator) commented Oct 29, 2019

Thank you for your contribution. This looks like the right thing to do.

The existing tests are failing because of a missing dependency: `flask`. Why do we depend on this package?

@derpferd Could you please:

  • Get the existing tests to pass. You may need to tweak setup.py, unless there is a better way.
  • Add new unit tests. Check both the concurrent and non-concurrent code paths and make sure they behave identically on machines that support `concurrent.futures` (see the sketch below).
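
A hedged sketch of such a test, assuming `iter_bucket` exposes a `workers` argument where `workers=1` forces the sequential path, and using `moto` to fake the bucket:

```python
import boto3
import moto

import smart_open.s3


@moto.mock_s3
def test_threaded_matches_sequential():
    # Populate a fake in-memory bucket with a few known keys.
    bucket = boto3.resource('s3').create_bucket(Bucket='mybucket')
    expected = {'key_%d' % i: ('contents of key_%d' % i).encode('utf-8')
                for i in range(10)}
    for name, body in expected.items():
        bucket.put_object(Key=name, Body=body)

    # Both code paths should yield exactly the same (key, content) pairs.
    sequential = dict(smart_open.s3.iter_bucket('mybucket', workers=1))
    threaded = dict(smart_open.s3.iter_bucket('mybucket', workers=8))
    assert sequential == threaded == expected
```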

@derpferd (Contributor, Author)

@mpenkov Will do. I saw the flask import error in Travis CI, but I am not getting it locally. I will try a couple of things to get it working.

As for the unit tests, I will push those. For AWS Lambda testing, I successfully ran the updated version of the function without problems.

@derpferd force-pushed the master branch 2 times, most recently from 9d52d14 to a69c7fb on October 29, 2019
This commit addresses issue piskvorky#340.
AWS Lambda environments do not support multiprocessing.Queue or
multiprocessing.Pool, which are used by iter_bucket to optimize the
pulling of files from s3.

Solution: Switch to using concurrent.futures.ThreadPoolExecutor instead.
This still optimizes the pulling of files from s3 without using new
processes.
@derpferd (Contributor, Author)

So, it appears the moto framework is not reliable: the tests randomly fail with the following error. This looks like an issue with moto, not with the change in this PR.

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/moto/core/models.py", line 71, in wrapper
    result = func(*args, **kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/tests/test_s3.py", line 497, in test_old
    for k, c in smart_open.s3.iter_bucket(mybucket):
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/s3.py", line 692, in iter_bucket
    for key_no, (key, content) in enumerate(result_iterator):
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/s3.py", line 786, in <lambda>
    return map(lambda future: future.result(), concurrent.futures.as_completed(futures))
  File "/opt/python/3.7.1/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/opt/python/3.7.1/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/python/3.7.1/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/s3.py", line 749, in _download_key
    content_bytes = _download_fileobj(bucket, key_name)
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/s3.py", line 766, in _download_fileobj
    bucket.download_fileobj(key_name, buf)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/boto3-1.7.84-py3.7.egg/boto3/s3/inject.py", line 720, in bucket_download_fileobj
    Callback=Callback, Config=Config)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/boto3-1.7.84-py3.7.egg/boto3/s3/inject.py", line 678, in download_fileobj
    return future.result()
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/s3transfer-0.1.13-py3.7.egg/s3transfer/futures.py", line 73, in result
    return self._coordinator.result()
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/s3transfer-0.1.13-py3.7.egg/s3transfer/futures.py", line 233, in result
    raise self._exception
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/s3transfer-0.1.13-py3.7.egg/s3transfer/tasks.py", line 255, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/s3transfer-0.1.13-py3.7.egg/s3transfer/download.py", line 353, in _submit
    **transfer_future.meta.call_args.extra_args
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/client.py", line 599, in _make_api_call
    operation_model, request_dict)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/endpoint.py", line 148, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/endpoint.py", line 177, in _send_request
    success_response, exception):
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/endpoint.py", line 273, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/hooks.py", line 227, in emit
    return self._emit(event_name, kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/hooks.py", line 360, in _emit
    aliased_event_name, kwargs, stop_on_response
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/hooks.py", line 210, in _emit
    response = handler(**kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/endpoint.py", line 222, in _get_response
    proxies=self.proxies, timeout=self.timeout)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/vendored/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/responses.py", line 626, in unbound_on_send
    return self._on_request(adapter, request, *a, **kwargs)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/responses.py", line 604, in _on_request
    response = adapter.build_response(request, match.get_response(request))
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/moto/core/models.py", line 137, in get_response
    result = self.callback(request)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/moto/core/utils.py", line 172, in __call__
    result = self.callback(request, request.url, request.headers)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/moto/instance_metadata/responses.py", line 48, in metadata_response
    "The {0} metadata path has not been implemented".format(path))
NotImplementedError: The /mykey14 metadata path has not been implemented

@piskvorky (Owner) commented Oct 30, 2019

The comment in our setup.py leads to this moto issue:
getmoto/moto#1793

There's a long discussion I just skimmed, but the problem seems related to botocore version and module import order.

@mpenkov (Collaborator) commented Oct 31, 2019

I agree that this seems to be a problem with moto rather than with @derpferd's change. We can't merge while the tests are failing, though. I suggest the following:

  • Disable that test for Py3.7 while the moto guys get this worked out
  • Write an integration test that hits a real S3 bucket using the two code paths introduced in this PR. This will cover the ground we lost by disabling the unit test above.

@piskvorky Does the above seem acceptable?

@derpferd (Contributor, Author)

@mpenkov This wouldn't work. The tests randomly fail on all versions of Python, not just 3.7 😢
However, I can go forward with an integration test.

We could switch to using the moto server instead of the mock, or upgrade moto to a more recent version (both options would probably take a lot of work).

@mpenkov (Collaborator) commented Oct 31, 2019

OK then, how about this: disable the test for all versions of Python, and replace it with an integration test. Another alternative is to re-run the test multiple times and count it a success if it passes at least once (see the sketch below).
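
That retry alternative could look roughly like this (a sketch only; the decorator name and the default of three attempts are illustrative):

```python
import functools


def retry_flaky(attempts=3):
    """Re-run a flaky test; count it a success if any attempt passes."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return test_func(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of retries: report the real failure
        return wrapper
    return decorator
```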

@piskvorky WDYT?

@piskvorky (Owner) commented Oct 31, 2019

We cannot give the public write access to our private (paid) S3 buckets. How would this approach protect our access credentials?

Upgrading moto makes sense to me; according to that thread, the issue may already be fixed (though I may be wrong, I only skimmed it). @derpferd, why do you think upgrading moto to a more recent version would take a lot of work?

Since the issue is unrelated to changes in this PR, I assume it happens in master too? We could merge here, and fix it in a separate PR.

@mpenkov (Collaborator) commented Oct 31, 2019

@piskvorky We need not give access to our buckets. The functionality in this PR is read-only, so we can apply it to any publicly-readable bucket out there.

Perhaps you're thinking of #372?

@piskvorky (Owner) commented Oct 31, 2019

I don't think we have any publicly-accessible buckets, do we?

I thought we'd have to create an IAM role with limited (read-only) access to our private S3 bucket, and use that.

Or do you mean using a completely public-access bucket (not ours) in the tests? Which one?

@mpenkov (Collaborator) commented Oct 31, 2019

Any publicly-accessible bucket will do.

                           PRE segments/
2019-10-29 10:19:02        846 cc-index.paths.gz
2019-10-29 12:48:06       2473 index.html
2019-10-29 10:19:02     163394 non200responses.paths.gz
2019-10-29 10:19:03     162602 robotstxt.paths.gz
2019-10-29 10:19:03        683 segment.paths.gz
2019-10-29 10:19:04     161412 warc.paths.gz
2019-10-29 10:19:05     162599 wat.paths.gz
2019-10-29 10:19:05     162599 wet.paths.gz
$ 
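
For concreteness, the integration test under discussion might look roughly like this (a sketch; the bucket name, prefix, and `key_limit` argument are illustrative assumptions, not confirmed project code):

```python
import smart_open.s3


def test_iter_bucket_against_public_bucket():
    # Read-only access to a publicly readable bucket, so no write
    # credentials are ever exposed to CI.
    results = list(smart_open.s3.iter_bucket(
        'commoncrawl',       # illustrative: any publicly readable bucket
        prefix='cc-index',   # keep the download small
        key_limit=3,
    ))
    assert results, 'expected at least one key'
    for key, content in results:
        assert isinstance(key, str)
        assert isinstance(content, bytes)
```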

@piskvorky (Owner)

Yes, I guess we can do that (as long as the 3rd party bucket doesn't change too often).

@mpenkov (Collaborator) commented Jan 8, 2020

@derpferd Ping. Are you able to finish this PR?

@mpenkov added the "stale" (no recent activity from author) label on Jan 8, 2020
@mpenkov removed the "stale" (no recent activity from author) label on Mar 11, 2020
@mpenkov merged commit b0418fd into piskvorky:master on Mar 11, 2020
@mpenkov (Collaborator) commented Mar 11, 2020

Merged. @derpferd Thank you once again for your efforts!
