
Updated iter_bucket to use concurrent futures. #368

Merged: 8 commits into piskvorky:master on Mar 11, 2020

Conversation

derpferd (Contributor)

This PR addresses issue #340.
AWS Lambda environments do not support multiprocessing.Queue or
multiprocessing.Pool, which are used by iter_bucket to optimize the
pulling of files from s3.

Solution: Switch to using concurrent.futures.ThreadPoolExecutor instead.
This still optimizes the pulling of files from s3 without using new
processes.
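
For illustration, a minimal sketch of the thread-based approach (simplified, not the PR's exact code; the worker count and wiring are illustrative, though the `_download_key` helper and `bucket.download_fileobj` call match the traceback further down this thread):

```python
import concurrent.futures
import io

import boto3


def _download_key(bucket, key_name):
    # Download a single key into memory and return (name, contents).
    buf = io.BytesIO()
    bucket.download_fileobj(key_name, buf)
    return key_name, buf.getvalue()


def iter_bucket(bucket_name, workers=16):
    # Threads, unlike multiprocessing workers, are available on AWS Lambda.
    bucket = boto3.resource('s3').Bucket(bucket_name)
    key_names = [obj.key for obj in bucket.objects.all()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(_download_key, bucket, name)
                   for name in key_names]
        # Yield (key, content) pairs as downloads complete,
        # not in submission order.
        for future in concurrent.futures.as_completed(futures):
            yield future.result()
```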

@piskvorky (Owner) commented Oct 29, 2019

Interesting! How do we (unit-)test this?

@mpenkov (Collaborator) commented Oct 29, 2019

Thank you for your contribution. This looks like the right thing to do.

The existing tests are failing because of a missing dependency: `flask`. Why do we depend on this package?

@derpferd Could you please:

  • Get the existing tests to pass. You may need to tweak setup.py, unless there is a better way.
  • Add new unit tests. Check both the concurrent and non-concurrent code paths and make sure they behave identically on machines that support `concurrent.futures` (see the sketch below).
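
A hedged sketch of such a test, assuming `iter_bucket` exposes a `workers` argument where `workers=1` forces the sequential path, and using `moto` to fake the bucket:

```python
import boto3
import moto

import smart_open.s3


@moto.mock_s3
def test_threaded_matches_sequential():
    # Populate a fake in-memory bucket with a few known keys.
    bucket = boto3.resource('s3').create_bucket(Bucket='mybucket')
    expected = {'key_%d' % i: ('contents of key_%d' % i).encode('utf-8')
                for i in range(10)}
    for name, body in expected.items():
        bucket.put_object(Key=name, Body=body)

    # Both code paths should yield exactly the same (key, content) pairs.
    sequential = dict(smart_open.s3.iter_bucket('mybucket', workers=1))
    threaded = dict(smart_open.s3.iter_bucket('mybucket', workers=8))
    assert sequential == threaded == expected
```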

@derpferd (Contributor, Author)

@mpenkov Will do. I saw the flask import error in Travis CI, but I am not getting it locally. I will try a couple of things to get it working.

As for the unit tests, I will push those. For AWS Lambda testing, I successfully ran the updated version of the function without problems.

@derpferd force-pushed the master branch 2 times, most recently from 9d52d14 to a69c7fb on October 29, 2019
This commit addresses issue piskvorky#340.
AWS Lambda environments do not support multiprocessing.Queue or
multiprocessing.Pool, which are used by iter_bucket to optimize the
pulling of files from s3.

Solution: Switch to using concurrent.futures.ThreadPoolExecutor instead.
This still optimizes the pulling of files from s3 without using new
processes.
@derpferd (Contributor, Author)

So, it appears the moto framework is not reliable: the tests randomly fail with the following error. This looks like an issue with moto, not with the change in this PR.

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/moto/core/models.py", line 71, in wrapper
    result = func(*args, **kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/tests/test_s3.py", line 497, in test_old
    for k, c in smart_open.s3.iter_bucket(mybucket):
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/s3.py", line 692, in iter_bucket
    for key_no, (key, content) in enumerate(result_iterator):
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/s3.py", line 786, in <lambda>
    return map(lambda future: future.result(), concurrent.futures.as_completed(futures))
  File "/opt/python/3.7.1/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/opt/python/3.7.1/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/python/3.7.1/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/s3.py", line 749, in _download_key
    content_bytes = _download_fileobj(bucket, key_name)
  File "/home/travis/build/RaRe-Technologies/smart_open/smart_open/s3.py", line 766, in _download_fileobj
    bucket.download_fileobj(key_name, buf)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/boto3-1.7.84-py3.7.egg/boto3/s3/inject.py", line 720, in bucket_download_fileobj
    Callback=Callback, Config=Config)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/boto3-1.7.84-py3.7.egg/boto3/s3/inject.py", line 678, in download_fileobj
    return future.result()
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/s3transfer-0.1.13-py3.7.egg/s3transfer/futures.py", line 73, in result
    return self._coordinator.result()
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/s3transfer-0.1.13-py3.7.egg/s3transfer/futures.py", line 233, in result
    raise self._exception
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/s3transfer-0.1.13-py3.7.egg/s3transfer/tasks.py", line 255, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/s3transfer-0.1.13-py3.7.egg/s3transfer/download.py", line 353, in _submit
    **transfer_future.meta.call_args.extra_args
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/client.py", line 599, in _make_api_call
    operation_model, request_dict)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/endpoint.py", line 148, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/endpoint.py", line 177, in _send_request
    success_response, exception):
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/endpoint.py", line 273, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/hooks.py", line 227, in emit
    return self._emit(event_name, kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/hooks.py", line 360, in _emit
    aliased_event_name, kwargs, stop_on_response
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/hooks.py", line 210, in _emit
    response = handler(**kwargs)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/endpoint.py", line 222, in _get_response
    proxies=self.proxies, timeout=self.timeout)
  File "/home/travis/build/RaRe-Technologies/smart_open/.eggs/botocore-1.10.84-py3.7.egg/botocore/vendored/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/responses.py", line 626, in unbound_on_send
    return self._on_request(adapter, request, *a, **kwargs)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/responses.py", line 604, in _on_request
    response = adapter.build_response(request, match.get_response(request))
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/moto/core/models.py", line 137, in get_response
    result = self.callback(request)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/moto/core/utils.py", line 172, in __call__
    result = self.callback(request, request.url, request.headers)
  File "/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/moto/instance_metadata/responses.py", line 48, in metadata_response
    "The {0} metadata path has not been implemented".format(path))
NotImplementedError: The /mykey14 metadata path has not been implemented

@piskvorky (Owner) commented Oct 30, 2019

The comment in our setup.py leads to this moto issue:
getmoto/moto#1793

There's a long discussion I just skimmed, but the problem seems related to botocore version and module import order.

@mpenkov (Collaborator) commented Oct 31, 2019

I agree that this seems to be a problem with moto rather than with @derpferd's change. We can't merge while the tests are failing, though. I suggest the following:

  • Disable that test for Py3.7 while the moto guys get this worked out
  • Write an integration test that hits a real S3 bucket using the two code paths introduced in this PR. This will cover the ground we lost by disabling the unit test above.

@piskvorky Does the above seem acceptable?

@derpferd (Contributor, Author)

@mpenkov This wouldn't work. The tests randomly fail on all versions of Python, not just 3.7 😢
However, I can go forward with an integration test.

We could switch to using the moto server instead of the mock, or upgrade moto to a more recent version (both options would probably take a lot of work).

@mpenkov (Collaborator) commented Oct 31, 2019

OK then, how about this: disable the test for all versions of Python, and replace it with an integration test. Another alternative is to re-run the test multiple times and count it a success if it passes at least once (see the sketch below).
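
That retry alternative could look roughly like this (a sketch only; the decorator name and the default of three attempts are illustrative):

```python
import functools


def retry_flaky(attempts=3):
    """Re-run a flaky test; count it a success if any attempt passes."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return test_func(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of retries: report the real failure
        return wrapper
    return decorator
```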

@piskvorky WDYT?

@piskvorky (Owner) commented Oct 31, 2019

We cannot give the public write access to our private (paid) S3 buckets. How would this approach protect our access credentials?

Upgrading moto makes sense to me; according to that thread, the issue may already be fixed (though I may be wrong, I only skimmed it). @derpferd, why do you think upgrading moto to a more recent version would take a lot of work?

Since the issue is unrelated to changes in this PR, I assume it happens in master too? We could merge here, and fix it in a separate PR.

@mpenkov (Collaborator) commented Oct 31, 2019

@piskvorky We need not give access to our buckets. The functionality in this PR is read-only, so we can apply it to any publicly-readable bucket out there.

Perhaps you're thinking of #372?

@piskvorky (Owner) commented Oct 31, 2019

I don't think we have any publicly-accessible buckets, do we?

I thought we'd have to create an IAM role with limited (read-only) access to our private S3 bucket, and use that.

Or do you mean using a completely public-access bucket (not ours) in the tests? Which one?

@mpenkov (Collaborator) commented Oct 31, 2019

Any publicly-accessible bucket will do.

                           PRE segments/
2019-10-29 10:19:02        846 cc-index.paths.gz
2019-10-29 12:48:06       2473 index.html
2019-10-29 10:19:02     163394 non200responses.paths.gz
2019-10-29 10:19:03     162602 robotstxt.paths.gz
2019-10-29 10:19:03        683 segment.paths.gz
2019-10-29 10:19:04     161412 warc.paths.gz
2019-10-29 10:19:05     162599 wat.paths.gz
2019-10-29 10:19:05     162599 wet.paths.gz
$ 
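
For concreteness, the integration test under discussion might look roughly like this (a sketch; the bucket name, prefix, and `key_limit` argument are illustrative assumptions, not confirmed project code):

```python
import smart_open.s3


def test_iter_bucket_against_public_bucket():
    # Read-only access to a publicly readable bucket, so no write
    # credentials are ever exposed to CI.
    results = list(smart_open.s3.iter_bucket(
        'commoncrawl',       # illustrative: any publicly readable bucket
        prefix='cc-index',   # keep the download small
        key_limit=3,
    ))
    assert results, 'expected at least one key'
    for key, content in results:
        assert isinstance(key, str)
        assert isinstance(content, bytes)
```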

@piskvorky (Owner)

Yes, I guess we can do that (as long as the 3rd party bucket doesn't change too often).

@mpenkov (Collaborator) commented Jan 8, 2020

@derpferd Ping. Are you able to finish this PR?

@mpenkov added the "stale" (no recent activity from author) label on Jan 8, 2020
@mpenkov removed the "stale" (no recent activity from author) label on Mar 11, 2020
@mpenkov merged commit b0418fd into piskvorky:master on Mar 11, 2020
@mpenkov (Collaborator) commented Mar 11, 2020

Merged. @derpferd Thank you once again for your efforts!
