Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 datapipes #165

Closed
wants to merge 153 commits into from
Closed

S3 datapipes #165

wants to merge 153 commits into from

Conversation

ydaiming
Copy link
Contributor

@ydaiming ydaiming commented Jan 12, 2022

Changes

  • Added S3FileLister and S3FileLoader IterDataPipes.
  • Added pybind11 build for s3 io cpp files and python scripts.

TODO

  • clean up setup files and link pybind11 in CMAKE_PREFIX automatically.
  • remove aws-cpp-sdk dependency at build with BUILD_S3 env var & pop exceptions when missing dependencies at usage.
  • new api changes for list_files.
  • clean up cpp files (naming, new structure, new logic etc.)
  • expose timeouts, regions.
  • thorough tests
    • different correct usage: bucket (with or without / at last), folder (with or without / at last), prefix, item.
    • different incorrect usage: non-existing files, wrong s3 urls, etc.
    • region changes
    • choice of public datasets
  • benchmarks
    • performance test
  • README.md
    • user guide & recommendations
    • dependencies

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 12, 2022
@ydaiming ydaiming marked this pull request as ready for review February 2, 2022 17:30
@josephevans
Copy link

Can we get this into the torch 1.11.0 release?

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add ninja cmake into pip install on our CI here

- name: Install dependencies
run: |
pip3 install -r requirements.txt
pip3 install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
?

@ejguan
Copy link
Contributor

ejguan commented Feb 4, 2022

Can we get this into the torch 1.11.0 release?

TorchData is going to do a release after torch 1.11 (March maybe). If this PR is landed before that, we can squeeze it into TorchData release

@ydaiming
Copy link
Contributor Author

ydaiming commented Feb 4, 2022

Note that, to build torchdata with S3 IO, we'll need pybind11 and "aws-sdk-cpp:s3;transfer" installed. We also need the BUILD_S3 env var set to "ON" to trigger build.

@ejguan
Copy link
Contributor

ejguan commented Feb 4, 2022

Note that, to build torchdata with S3 IO, we'll need pybind11 and "aws-sdk-cpp:s3;transfer" installed. We also need the BUILD_S3 env var set to "ON" to trigger build.

If so, we need to add pip install pybind11 as well to run CI correctly. But, I am not sure how we are gonna install "aws-sdk-cpp:s3;transfer"?

To turn on BUILD_S3, you should add an env: BUILD_S3=1 in

- name: Build TorchData
run: python setup.py develop

@ydaiming
Copy link
Contributor Author

ydaiming commented Feb 4, 2022

how we are gonna install "aws-sdk-cpp:s3;transfer"

I've written a README on how to install dependencies and build and use S3 IO datapipes. torchdata/datapipes/iter/load/README.md

For aws-sdk-cpp:

git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
cd aws-sdk-cpp/
mkdir sdk-build
cd sdk-build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3;transfer"
make
make install # may need sudo

How should we install this dependency?

@ejguan
Copy link
Contributor

ejguan commented Feb 4, 2022

git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
cd aws-sdk-cpp/
mkdir sdk-build
cd sdk-build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3;transfer"
make
make install # may need sudo

Add a separate step before building torchdata should be fine.

- name: Install test requirements
run: pip3 install expecttest fsspec iopath==0.1.9 numpy pytest rarfile

@ydaiming
Copy link
Contributor Author

ydaiming commented Feb 4, 2022

git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
cd aws-sdk-cpp/
mkdir sdk-build
cd sdk-build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3;transfer"
make
make install # may need sudo

Add a separate step before building torchdata should be fine.

- name: Install test requirements
run: pip3 install expecttest fsspec iopath==0.1.9 numpy pytest rarfile

Added another step to install aws-sdk-cpp.

I wonder why install test requirements before building instead of after building and before tests?

@ejguan
Copy link
Contributor

ejguan commented Feb 4, 2022

Added another step to install aws-sdk-cpp.

I wonder why install test requirements before building instead of after building and before tests?

You need to compile dependency before building TorchData, am I right? Except, you want to do a more fancy feature to compile aws and provide S3 IO whenever such DataPipe is used. (I think it's doable but needs to do some research, we should land this PR first then figure it out).

@ydaiming
Copy link
Contributor Author

ydaiming commented Feb 5, 2022

Added another step to install aws-sdk-cpp.
I wonder why install test requirements before building instead of after building and before tests?

You need to compile dependency before building TorchData, am I right?

I meant that I noticed the following step (already existing in ci) is before building torchdata:

      - name: Install test requirements
        run: pip3 install expecttest fsspec iopath==0.1.9 numpy pytest rarfile

Should we move it to after building torchdata and before testing?

@ydaiming
Copy link
Contributor Author

ydaiming commented Feb 7, 2022

CMake Error: CMake was unable to find a build program corresponding to "Ninja". CMAKE_MAKE_PROGRAM is not set. You probably need to select a different build tool.
CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage

Looks like the CI still doesn't install ninja before installing torchdata. Any idea? @ejguan?

@VitalyFedyunin
Copy link
Contributor

Without SDK installed any usage of S3 pipes fails with

#17 [13/13] RUN python test.py
#17 sha256:a0ceb2208da477f7ae261e437253fd7fe4763562b03c7bcce76b1d46b57163b5
#17 1.951 /usr/local/lib/python3.8/dist-packages/torchdata-0.3.0a0+2627793-py3.8-linux-x86_64.egg/torchdata/_extension.py:30: UserWarning: torchdata C++ extension is not available.
#17 1.951   warnings.warn("torchdata C++ extension is not available.")
#17 1.952 Traceback (most recent call last):
#17 1.952   File "test.py", line 8, in <module>
#17 1.952     s3_lister_dp = S3FileLister(IterableWrapper(input), region="us-west-2")
#17 1.952   File "/usr/local/lib/python3.8/dist-packages/torchdata-0.3.0a0+2627793-py3.8-linux-x86_64.egg/torchdata/datapipes/iter/load/s3io.py", line 29, in __init__
#17 1.952     self.handler = torchdata._torchdata.S3Handler(request_timeout_ms, region)
#17 1.952 AttributeError: module 'torchdata' has no attribute '_torchdata'
#17 ERROR: executor failed running [/bin/sh -c python test.py]: exit code: 1

@VitalyFedyunin
Copy link
Contributor

Should be nice error instead

@josephevans
Copy link

josephevans commented Feb 10, 2022

Should be nice error instead

Hi @VitalyFedyunin, I'm helping with this PR while @ydaiming is on vacation, would you recommend we raise a ModuleNotFoundError exception, similar to what is done here?

Something like this?

--- a/torchdata/datapipes/iter/load/s3io.py
+++ b/torchdata/datapipes/iter/load/s3io.py
@@ -24,6 +24,11 @@ class S3FileListerIterDataPipe(IterDataPipe[str]):
     """

     def __init__(self, source_datapipe: IterDataPipe[str], length: int = -1, request_timeout_ms=-1, region="") -> None:
+        if not hasattr(torchdata, '_torchdata'):
+            raise ModuleNotFoundError(
+                "Torchdata must be built with BUILD_S3=ON to use this datapipe."
+            )
+
         self.source_datapipe: IterDataPipe[str] = source_datapipe
         self.length: int = length
         self.handler = torchdata._torchdata.S3Handler(request_timeout_ms, region)

@ejguan
Copy link
Contributor

ejguan commented Feb 14, 2022

Yes. And, you may want to check if S3Handler exists in torchdata._torchdata as well.

if not (hasattr(torchdata, '_torchdata') and hasattr(torchdata._torchdata, 'S3Handler')):

Copy link
Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, we would need to check final binary size and what happens if we install python egg on the system without AWS library.

@ydaiming
Copy link
Contributor Author

@ejguan , let's make the mergins process faster and catch PyTorch 1.11.0 release.

The current issues are:

  1. Windows requires a different set of commands to install aws-cpp-sdk (the final step also requires administrative privileges). How should we do that?
  2. The style lint is failing for weird reasons un-related to torchdata, and mypy lint is complaining about the inevitable torchdata._torchdata usage. Shall we set to ignore the mypy lint with some mypy.ini settings?
  3. Building fails for MacOS py37/38, Ubuntu py37, which I can't reproduce. I tried to build in Ubuntu py37 env, and my build and test both succeeded. Could this be a resource issue?
  4. The test failures for the remaining MacOS py39, Ubuntu py38/39 are also weird, because I tested locally and the ValueError should be raised to pass the tests. I changed it to BaseException to see if it catches the error.

Please help us speed up the process, as we all want to include S3 IO datapipes in the coming torchdata release. Thank you!

@ejguan
Copy link
Contributor

ejguan commented Feb 23, 2022

let's make the mergins process faster and catch PyTorch 1.11.0 release.

PyTorch is doing final branch cut tomorrow. We can try to test this PR ASAP but I don't think it's possible.

  • Windows requires a different set of commands to install aws-cpp-sdk (the final step also requires administrative privileges). How should we do that?

What do you mean different set of commands? Can you list the command you want to use?
IIRC, you can specify shell: bash in the workflow to run the command

  • The style lint is failing for weird reasons un-related to torchdata, and mypy lint is complaining about the inevitable torchdata._torchdata usage. Shall we set to ignore the mypy lint with some mypy.ini settings?

You can add torchdata._torchdata into mypy.ini

  • Building fails for MacOS py37/38, Ubuntu py37, which I can't reproduce. I tried to build in Ubuntu py37 env, and my build and test both succeeded. Could this be a resource issue?

Nope. The Error is pretty clear in the traceback, the Python used to compile _torchdata is not correct.
For example of macos-py38. (https://github.com/pytorch/data/runs/5304828107?check_suite_focus=true)

-- Found Python3: /usr/local/Frameworks/Python.framework/Versions/3.9/bin/python3.9 (found version "3.9.10") found components: Interpreter Development Development.Module Development.Embed 
  • The test failures for the remaining MacOS py39, Ubuntu py38/39 are also weird, because I tested locally and the ValueError should be raised to pass the tests. I changed it to BaseException to see if it catches the error.

Can I have a code pointer which Error you are expecting?

@ydaiming
Copy link
Contributor Author

@ejguan Could you please trigger the CI again to check if the mypy error is cleared, and the correct python versions are found at building? Thanks.

@ydaiming
Copy link
Contributor Author

The test failures for the remaining MacOS py39, Ubuntu py38/39 are also weird, because I tested locally and the ValueError should be raised to pass the tests. I changed it to BaseException to see if it catches the error.

Can I have a code pointer which Error you are expecting?

The logic for error is:

  1. passing in the input url (incorrect in this test) to list the files:
    https://github.com/ydaiming/data/blob/s3-datapipes/torchdata/csrc/pybind/S3Handler/pybind.cpp#L22
    https://github.com/ydaiming/data/blob/s3-datapipes/torchdata/csrc/pybind/S3Handler/S3Handler.cpp#L431
  2. when parsing the incorrect input, invalid_argument will be thrown.
    https://github.com/ydaiming/data/blob/s3-datapipes/torchdata/csrc/pybind/S3Handler/S3Handler.cpp#L435
    https://github.com/ydaiming/data/blob/s3-datapipes/torchdata/csrc/pybind/S3Handler/S3Handler.cpp#L139
  3. Then pybind11 will throw ValueError on the PyThon side.
    https://pybind11.readthedocs.io/en/stable/advanced/exceptions.html

The tests passed in my environment, I'm not sure what's the different in the CI system.

@ydaiming
Copy link
Contributor Author

  • Windows requires a different set of commands to install aws-cpp-sdk (the final step also requires administrative privileges). How should we do that?

What do you mean different set of commands? Can you list the command you want to use? IIRC, you can specify shell: bash in the workflow to run the command

For Ubuntu and MacOS, the commands are as following, and already added:

          git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
          cd aws-sdk-cpp/
          mkdir sdk-build
          cd sdk-build
          cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3;transfer"
          make
          make install

while on Windows, the following commands are used:

git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
mkdir sdk_build
cd sdk_build
cmake ..\aws-sdk-cpp -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3:transfer"
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Current\Bin\MSBuild.exe" ALL_BUILD.vcxproj  -p:Configuration=Release # depending on where MSBuild is
msbuild INSTALL.vcxproj -p:Configuration=Release # with administrative privilege 

reference: https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/setup-windows.html

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have looked through this PR again. I don't see the problem. @VitalyFedyunin Do you mind taking an extra look at this PR

@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ejguan pushed a commit to ejguan/data that referenced this pull request Apr 4, 2022
Summary:
### Changes
- Added S3FileLister and S3FileLoader IterDataPipes.
- Added pybind11 build for s3 io cpp files and python scripts.
### TODO
- [x] clean up setup files and link pybind11 in CMAKE_PREFIX automatically.
- [x] remove `aws-cpp-sdk` dependency at build with `BUILD_S3` env var & pop exceptions when missing dependencies at usage.
- [x] new api changes for list_files.
- [x] clean up cpp files (naming, new structure, new logic etc.)
- [x] expose timeouts, regions.
- [x] thorough tests
  - [x] different correct usage: bucket (with or without / at last), folder (with or without / at last), prefix, item.
  - [x] different incorrect usage: non-existing files, wrong s3 urls, etc.
  - [x] region changes
  - [x] choice of public datasets
- [x] benchmarks
  - [x] performance test
- [x] README.md
  - [x] user guide & recommendations
  - [x] dependencies

Pull Request resolved: pytorch#165

Reviewed By: NivekT

Differential Revision: D35095053

Pulled By: ejguan

fbshipit-source-id: 36cd3f850cd5a8e79d7d28291ffb2b73e5cea3e0
@ejguan ejguan mentioned this pull request Apr 4, 2022
15 tasks
@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants