S3 datapipes #165

Closed
wants to merge 153 commits into from
153 commits
38b691a
Initial code for s3 io datapipes with successful pybind11 build
ydaiming Dec 21, 2021
da92961
change from S3Init to S3Handler
ydaiming Jan 12, 2022
8375759
include pybind11 at compilation automatically
ydaiming Jan 12, 2022
fa258e4
new torchdata._torchdata to avoid circular import
ydaiming Jan 12, 2022
7016deb
new build flag, new extension import, new clean command
ydaiming Jan 12, 2022
2be7b87
structured CMakeLists following torchaudio style
ydaiming Jan 12, 2022
d7e0b38
clean up cpp naming style
ydaiming Jan 12, 2022
f1354a2
separate get and initialize cpp functions
ydaiming Jan 12, 2022
058154c
change default region to the default Region configured in the applica…
ydaiming Jan 12, 2022
baea3b6
clean up get and initialize cpp functions
ydaiming Jan 12, 2022
aff8ed0
remove unused constants
ydaiming Jan 12, 2022
619d86c
create a class-level static configuration file
ydaiming Jan 12, 2022
9f853a5
update indentation of the test
ydaiming Jan 12, 2022
4bae528
clean up linting and wrong modifications
ydaiming Jan 13, 2022
6599ce9
clean up cpp api's, styling, include necessary headers
ydaiming Jan 13, 2022
1c479e2
expose request timeout and region python api's for temporary override
ydaiming Jan 13, 2022
3e17da9
enable file loading while listing files, clean up code
ydaiming Jan 14, 2022
eea6c78
rename s3_io to S3Handler
ydaiming Jan 14, 2022
c4fddf5
clean up std::cerr lines
ydaiming Jan 14, 2022
54e49cc
deal with extreme cases when all results are folders
ydaiming Jan 15, 2022
25ba1d7
change AWS init flow and clean up code
ydaiming Jan 16, 2022
c09be1e
rename S3FS api's
ydaiming Jan 17, 2022
9865675
thorough S3FileLister tests
ydaiming Jan 17, 2022
c72a3fb
expose buffer size and multi part download api's
ydaiming Jan 17, 2022
1d645ac
fix s3 read, add BytesIO to pass file, improve read speed
ydaiming Jan 17, 2022
ba58e68
improve reading efficiency, set default buffer size to 128MB
ydaiming Jan 17, 2022
7d1b5f8
rename multi_part_download to use_multi_part_download
ydaiming Jan 18, 2022
db5600d
Make S3Handler Pickleable
ydaiming Jan 18, 2022
1a915b0
Remove max_keys api in S3FileLister and clean up code
ydaiming Feb 2, 2022
8cb8849
add S3 IO readme
ydaiming Feb 2, 2022
d7414bc
add S3FileLoader and integration tests
ydaiming Feb 2, 2022
2af94ec
prettier README.md
ydaiming Feb 2, 2022
47ea591
add pip3 install cmake ninja to ci
ydaiming Feb 4, 2022
aed2af1
remove scipy requirement and update clean command
ydaiming Feb 4, 2022
4436bd2
add pip3 install pybind11 and BUILD_S3=1 to ci
ydaiming Feb 4, 2022
6d78f99
add step to install aws-sdk-cpp for S3 IO datapipes
ydaiming Feb 4, 2022
739c4ed
Update .github/workflows/ci.yml
ydaiming Feb 5, 2022
9711d89
Update .github/workflows/ci.yml
ydaiming Feb 9, 2022
9e27382
Add python installed packages path to GITHUB_PATH.
josephevans Feb 10, 2022
6367a04
Apply requested changes from lint.
josephevans Feb 3, 2022
bed7090
Combine pip package install lines, remove extra env line.
josephevans Feb 10, 2022
746ca2f
Change minimum cmake version to 3.13.
josephevans Feb 10, 2022
193a0ea
Install curl as dependency for aws sdk for ubuntu builds.
josephevans Feb 14, 2022
372c0e0
Raise ModuleNotFoundError when attempting to use S3 datapipe and not …
josephevans Feb 14, 2022
17749d1
Fix lint.
josephevans Feb 14, 2022
48738d0
Install header dependencies for aws-sdk-cpp.
josephevans Feb 14, 2022
d7982f0
Only add the _torchdata extension if BUILD_S3 is enabled (can be exte…
josephevans Feb 14, 2022
71e1cd0
Set PYTHON_EXECUTABLE so we use the same version as what is running t…
josephevans Feb 14, 2022
f1a27bb
Fix merge conflict.
josephevans Feb 15, 2022
8f16b7f
Add warning (but don't error out) when failing to import c++ extensions.
josephevans Feb 15, 2022
89e6fe8
add sudo for admin rights when make install aws-sdk-cpp
ydaiming Feb 22, 2022
0cd94f6
Catch Exception in incorrect use case
ydaiming Feb 23, 2022
82f0d4e
ignore s3 io tests if _torchdata or aws-sdk-cpp doesn't exist
ydaiming Feb 23, 2022
5ba0b0a
ignore mypy error for torchdata._torchdata
ydaiming Feb 23, 2022
757ace5
find the last python version in PATH
ydaiming Feb 23, 2022
6bfadd5
remove unnecessary sudo in ci
ydaiming Feb 23, 2022
ddd2048
Change PyThon_FIND_FRAMEWORK to CMAKE_FIND_FRAMEWORK
ydaiming Feb 23, 2022
19e30d5
separate Windows installation for AWS-SDK-CPP
ydaiming Feb 23, 2022
65bc893
still need administrative privilege to install aws-sdk-cpp
ydaiming Feb 23, 2022
140c0da
remove redundant parentheses
ydaiming Feb 23, 2022
e2fac1c
edit test for incorrect filenames
ydaiming Feb 23, 2022
98e8e36
update test cases for incorrect user input
ydaiming Feb 23, 2022
0e2d1f9
add a step to setup msbuild in ci
ydaiming Feb 23, 2022
0ce24b9
fix a typo in ci step to install AWS-SDK-CPP on Windows
ydaiming Feb 23, 2022
ab3954f
enable long names for Windows to install AWS-SDK-CPP
ydaiming Feb 23, 2022
f262366
Display runtime Python version after setup-python@v2 in ci
ydaiming Feb 23, 2022
8e3eee4
try pin specific version at build
ydaiming Feb 23, 2022
03d7576
use cmake flags to bypass find_package(Python3)
ydaiming Feb 23, 2022
f2b4a48
add a separate step for reg update in Windows, and find python3 in CMake
ydaiming Feb 23, 2022
66d138d
Fix a typo in Windows registry step
ydaiming Feb 23, 2022
84e5858
turn DPYBIND11_FINDPYTHON
ydaiming Feb 23, 2022
b6d7ae6
ignore unit tests for aws-sdk-cpp
ydaiming Feb 23, 2022
e876562
turn off force new findpython
ydaiming Feb 23, 2022
201d0d1
add -DPython3_ROOT_DIR for cmake
ydaiming Feb 23, 2022
57f9547
pass in environment var for PYTHON_ROOT_DIR
ydaiming Feb 23, 2022
6864609
find exact python version at building
ydaiming Feb 23, 2022
d6ea5ae
rewrite cmake-generator logic for Windows
ydaiming Feb 24, 2022
10b10a1
set visual studio shell before building for Windows
ydaiming Feb 24, 2022
0da9684
removed unnecessary arch type argument in cmake
ydaiming Feb 24, 2022
d8d3c09
remove redundant import sys
ydaiming Feb 24, 2022
aef5e49
try add aws sdk to a sub cmake folder for cmake reference
ydaiming Feb 24, 2022
7643752
build shared lib for windows, and append cmake_prefix_path
ydaiming Feb 24, 2022
33a3b95
add another flag for windows aws sdk cpp build
ydaiming Feb 24, 2022
d48f3ab
pin 1.8.x branch of aws-sdk-cpp for Windows
ydaiming Feb 24, 2022
e8fd663
update aws-sdk-cpp build in Windows
ydaiming Feb 24, 2022
fc6fb22
point to aws-cpp-sdk location
ydaiming Feb 24, 2022
5a20b37
append CMAKE_INSTALL_PREFIX with windows aws sdk cpp path
ydaiming Feb 24, 2022
ae9a2e1
remove unnecessary CMAKE PATH in CMakeLists
ydaiming Feb 24, 2022
beebaa2
change CMAKE_INSTALL_PREFIX in CMakeLists
ydaiming Feb 24, 2022
ef21b4b
add necessary ; for CMAKE_INSTALL_PREFIX
ydaiming Feb 24, 2022
9713c96
change CMAKE_INSTALL_PREFIX to only sdk_dir in extension
ydaiming Feb 24, 2022
0308bf6
use ninja for aws sdk cpp
ydaiming Feb 24, 2022
aeeea10
move all includes to a precompile.h
ydaiming Feb 24, 2022
a2bf507
add precompile.h
ydaiming Feb 24, 2022
97f9774
remove undef functions in cpp
ydaiming Feb 24, 2022
17617e7
remove msbuild from ci, because using ninja for aws-sdk-cpp in Windows
ydaiming Feb 24, 2022
d667aef
remove msbuild from ci, because using ninja for aws-sdk-cpp in Windows
ydaiming Feb 24, 2022
70554d7
remove unnecessary CMakeList lines
ydaiming Feb 24, 2022
9f6c847
set up msbuild for Windows in CI
ydaiming Feb 24, 2022
c997c66
Fix mypy
ejguan Feb 24, 2022
2ae4f2d
update docstring with functional name and update examples
ydaiming Mar 3, 2022
1374187
fix lint style
ydaiming Mar 3, 2022
ba5b085
fix lint style
ydaiming Mar 3, 2022
94d8e95
update github ci to include build-s3 = 0
ydaiming Mar 3, 2022
c1c445a
remove redundant setup.cfg
ydaiming Mar 23, 2022
d254f83
fix lint and update comment
ydaiming Mar 23, 2022
030f8d5
disable test building for aws-sdk-cpp
ydaiming Mar 23, 2022
a5c3a86
fix lint
ydaiming Mar 23, 2022
3c59f18
remove warning when _torchdata fails to load
ydaiming Mar 24, 2022
12003ea
load _torchdata instead of libtorchdata
ydaiming Mar 24, 2022
98709cb
print search path and existence of _torchdata
ydaiming Mar 24, 2022
ee972e6
print torch and torchdata paths
ydaiming Mar 25, 2022
97ccbe1
print torchdata path
ydaiming Mar 25, 2022
90062e3
print torchdata path
ydaiming Mar 25, 2022
4233a56
find path for extension _torchdata
ydaiming Mar 25, 2022
b60fc7e
Update _torchdata find path
ydaiming Mar 25, 2022
54a993b
add shell bash to find _torchdata path
ydaiming Mar 25, 2022
af9d7c1
Update lib find path in _extension.py
ydaiming Mar 25, 2022
543294e
use the method from torchtext to find _torchdata .so
ydaiming Mar 25, 2022
bcec5fc
remove _torchdata and try not to confuse importlib
ydaiming Mar 25, 2022
7f69dda
remove _init_extension() all time
ydaiming Mar 25, 2022
0d494a4
comment out gen_pyi in setup.py
ydaiming Mar 25, 2022
0f0d158
should _init_extension() at all time
ydaiming Mar 25, 2022
bff38d9
add extension loading logic for windows
ydaiming Mar 25, 2022
c137e03
Optimize pybind build
ejguan Mar 28, 2022
f9b789a
Ignore compiled c code
ejguan Mar 28, 2022
5de5f4c
Revamp cmake
ejguan Mar 28, 2022
f374fc8
Remove libtorch
ejguan Mar 28, 2022
b616a2c
Temporary disable a few tests
ejguan Mar 28, 2022
32b42b2
Fix import _torchdata
ejguan Mar 28, 2022
3e4ca67
Fix pybind11 path
ejguan Mar 28, 2022
62b0ea0
Remove binary
ejguan Mar 28, 2022
3931ff3
Revamp cmake again
ejguan Mar 28, 2022
26f0dbc
Remove test for BUILD_S3=0
ejguan Mar 28, 2022
ee1b679
Add cache
ejguan Mar 28, 2022
4c93e18
Create cache
ejguan Mar 28, 2022
6c2a5ea
Start to use cache
ejguan Mar 28, 2022
c1db344
Disable cache
ejguan Mar 28, 2022
ebbe407
Add CMAKE_PREFIX_PATH and fix pybind
ejguan Mar 29, 2022
161b2e4
Fix lint
ejguan Mar 30, 2022
41d8e66
Try to fix windows
ejguan Mar 30, 2022
13d8454
Remove BUILD_PYTHON_VERSION
ejguan Mar 30, 2022
1c70aad
Remove redundant _internal
ejguan Mar 30, 2022
020c6b8
Export aws to path for windows
ejguan Mar 30, 2022
926cd01
Static link on windows
ejguan Mar 31, 2022
d7a8577
skip cache
ejguan Mar 31, 2022
a836303
Remove debugging lines and enable tests
ejguan Mar 31, 2022
27d2370
Fix mypy
ejguan Apr 1, 2022
c04c973
Remove extension for all platforms when clean
ejguan Apr 1, 2022
6d3289b
Fix gen_pyi
ejguan Apr 1, 2022
34eacd6
update S3 datapipes in-line descriptions
ydaiming Apr 1, 2022
43d7231
add comment on static linking on Windows for aws-sdk-cpp
ydaiming Apr 1, 2022
ed7e4ff
Fix lint
ydaiming Apr 4, 2022
1 change: 1 addition & 0 deletions .flake8
@@ -4,6 +4,7 @@ ignore = E203,E402,E501,F821,W503,W504,
per-file-ignores =
__init__.py: F401, F403, F405
test/*: F401
_extension.py: F401
exclude =
./.git,
./third_party,
56 changes: 53 additions & 3 deletions .github/workflows/ci.yml
@@ -29,27 +29,77 @@ jobs:
- 3.7
- 3.8
- 3.9
with-s3:
- 1
- 0
steps:
- name: Setup additional system libraries
if: startsWith( matrix.os, 'ubuntu' )
run: |
sudo add-apt-repository multiverse
sudo apt update
sudo apt install rar unrar
sudo apt install rar unrar libssl-dev libcurl4-openssl-dev zlib1g-dev
- name: Setup Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Setup msbuild on Windows
if: matrix.with-s3 == 1 && matrix.os == 'windows-latest'
uses: microsoft/[email protected]
- name: Set up Visual Studio shell
if: matrix.with-s3 == 1 && matrix.os == 'windows-latest'
uses: egor-tensin/vs-shell@v2
with:
arch: x64
- name: Check out source repository
uses: actions/checkout@v2
- name: Install dependencies
run: |
pip3 install -r requirements.txt
pip3 install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
pip3 install cmake ninja pybind11
echo "/home/runner/.local/bin" >> $GITHUB_PATH
- name: Export AWS-SDK-CPP & PYBIND11
if: matrix.with-s3 == 1
shell: bash
run: |
if [[ ${{ matrix.os }} == 'windows-latest' ]]; then
AWSSDK_PATH="$GITHUB_WORKSPACE\\aws-sdk-cpp\\sdk-lib"
else
AWSSDK_PATH="$GITHUB_WORKSPACE/aws-sdk-cpp/sdk-lib"
fi
PYBIND11_PATH=`pybind11-config --cmakedir`
echo "::set-output name=awssdk::$AWSSDK_PATH"
echo "::set-output name=pybind11::$PYBIND11_PATH"
id: export_path
- name: Install AWS-SDK-CPP on Windows for S3 IO datapipes
if: matrix.with-s3 == 1 && matrix.os == 'windows-latest'
run: |
git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
cd aws-sdk-cpp
mkdir sdk-lib
cmake -S . -B build -GNinja -DBUILD_ONLY="s3;transfer" -DBUILD_SHARED_LIBS=OFF -DENABLE_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=sdk-lib
cmake --build build --config Release
cmake --install build --config Release
- name: Install AWS-SDK-CPP on Non-Windows for S3 IO datapipes
if: matrix.with-s3 == 1 && matrix.os != 'windows-latest'
run: |
git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
cd aws-sdk-cpp/
mkdir sdk-build sdk-lib
cd sdk-build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3;transfer" -DENABLE_TESTING=OFF -DCMAKE_INSTALL_PREFIX=../sdk-lib
make
sudo make install
- name: Build TorchData
run: |
python setup.py develop
env:
BUILD_S3: ${{ matrix.with-s3 }}
pybind11_DIR: ${{ steps.export_path.outputs.pybind11 }}
AWSSDK_DIR: ${{ steps.export_path.outputs.awssdk }}
- name: Install test requirements
run: pip3 install expecttest fsspec iopath==0.1.9 numpy pytest rarfile
- name: Build TorchData
run: python setup.py develop
- name: Run DataPipes tests with pytest
if: ${{ ! contains(github.event.pull_request.labels.*.name, 'ciflow/slow') }}
run:
20 changes: 19 additions & 1 deletion .gitignore
@@ -1,6 +1,6 @@
build/*
dist/*
torchdata.egg-info/*
*.egg-info/*

torchdata/version.py
torchdata/datapipes/iter/__init__.pyi
@@ -17,6 +17,24 @@ torchdata/datapipes/iter/__init__.pyi
# macOS dir files
.DS_Store

## General

*/*.so*
*/**/*.so*
torchdata/*.so*

# Compiled Object files
*.slo
*.lo
*.o
*.cuo
*.obj

# Compiled Dynamic libraries
*.so
*.dylib
*.dll

# Compiled python
*.pyc
*.pyd
59 changes: 59 additions & 0 deletions CMakeLists.txt
@@ -0,0 +1,59 @@
cmake_minimum_required(VERSION 3.13 FATAL_ERROR)

# Most of the configurations are taken from PyTorch
# https://github.com/pytorch/pytorch/blob/0c9fb4aff0d60eaadb04e4d5d099fb1e1d5701a9/CMakeLists.txt

# Use compiler ID "AppleClang" instead of "Clang" for XCode.
# Not setting this sometimes makes the XCode C compiler get detected as "Clang",
# even when the C++ one is detected as "AppleClang".
cmake_policy(SET CMP0010 NEW)
cmake_policy(SET CMP0025 NEW)

# Suppress warning flags in default MSVC configuration. It's not
# mandatory that we do this (and we don't if cmake is old), but it's
# nice when it's possible, and it's possible on our Windows configs.
if(NOT CMAKE_VERSION VERSION_LESS 3.15.0)
cmake_policy(SET CMP0092 NEW)
endif()

project(torchdata)

# check and set CMAKE_CXX_STANDARD
string(FIND "${CMAKE_CXX_FLAGS}" "-std=c++" env_cxx_standard)
if(env_cxx_standard GREATER -1)
message(
WARNING "C++ standard version definition detected in environment variable. "
"PyTorch requires -std=c++14. Please remove -std=c++ settings in your environment.")
endif()

set(CMAKE_CXX_STANDARD 14)
set(CMAKE_C_STANDARD 11)

# https://developercommunity.visualstudio.com/t/VS-16100-isnt-compatible-with-CUDA-11/1433342
if(MSVC)
if(USE_CUDA)
set(CMAKE_CXX_STANDARD 17)
endif()
endif()


set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)

# Apple specific
if(APPLE)
# Get clang version on macOS
execute_process( COMMAND ${CMAKE_CXX_COMPILER} --version OUTPUT_VARIABLE clang_full_version_string )
string(REGEX REPLACE "Apple LLVM version ([0-9]+\\.[0-9]+).*" "\\1" CLANG_VERSION_STRING ${clang_full_version_string})
message( STATUS "CLANG_VERSION_STRING: " ${CLANG_VERSION_STRING} )

# RPATH stuff
set(CMAKE_MACOSX_RPATH ON)

set(CMAKE_SHARED_LIBRARY_SUFFIX ".so")
endif()

# Options
option(BUILD_S3 "Build s3 io functionality" OFF)

add_subdirectory(torchdata/csrc)
49 changes: 46 additions & 3 deletions setup.py
@@ -1,13 +1,17 @@
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
import distutils.command.clean
import os
import shutil
import subprocess
import sys

from pathlib import Path

from setuptools import find_packages, setup
from torchdata.datapipes.gen_pyi import gen_pyi

from tools import setup_helpers
from tools.gen_pyi import gen_pyi

ROOT_DIR = Path(__file__).parent.resolve()

@@ -52,12 +56,41 @@ def _export_version(version, sha):
]


class clean(distutils.command.clean.clean):
def run(self):
# Run default behavior first
distutils.command.clean.clean.run(self)

# Remove torchdata extension
def remove_extension(pattern):
for path in (ROOT_DIR / "torchdata").glob(pattern):
print(f"removing extension '{path}'")
path.unlink()

for ext in ["so", "dylib", "pyd"]:
remove_extension("**/*." + ext)

# Remove build directory
build_dirs = [
ROOT_DIR / "build",
]
for path in build_dirs:
if path.exists():
print(f"removing '{path}' (and everything under it)")
shutil.rmtree(str(path), ignore_errors=True)


if __name__ == "__main__":
VERSION, SHA = _get_version()
_export_version(VERSION, SHA)

print("-- Building version " + VERSION)

if sys.argv[1] != "clean":
gen_pyi()
# TODO: Fix #343
os.chdir(ROOT_DIR)

setup(
# Metadata
name="torchdata",
@@ -82,8 +115,18 @@ def _export_version(version, sha):
"Programming Language :: Python :: Implementation :: CPython",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
],
package_data={
"torchdata": [
"datapipes/iter/*.pyi",
],
},
# Package Info
packages=find_packages(exclude=["test*", "examples*"]),
packages=find_packages(exclude=["test*", "examples*", "tools*", "torchdata.csrc*", "build*"]),
zip_safe=False,
# C++ Extension Modules
ext_modules=setup_helpers.get_ext_modules(),
cmdclass={
"build_ext": setup_helpers.CMakeBuild,
"clean": clean,
},
)
gen_pyi()
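
The `Build TorchData` step in ci.yml passes `BUILD_S3`, `pybind11_DIR` and `AWSSDK_DIR` as environment variables, and `setup.py` hands the build to `setup_helpers.CMakeBuild`, whose source is not part of the diff shown here. The following is only a rough sketch of how such a helper might turn those variables into CMake cache flags. The function name `cmake_args_from_env`, the set of accepted truthy values, and the exact flag mapping are assumptions for illustration, not the code in `tools/setup_helpers`.

```python
import os
from typing import List, Mapping, Optional


def _is_enabled(value: Optional[str]) -> bool:
    """Treat values such as "1", "ON" or "true" as enabled (assumed convention)."""
    return value is not None and value.strip().upper() in {"1", "ON", "TRUE", "YES"}


def cmake_args_from_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build a list of -D flags for CMake from build-related environment variables.

    Hypothetical helper: the variable names come from the CI workflow,
    but the mapping to CMake flags is an assumption.
    """
    env = os.environ if env is None else env
    build_s3 = _is_enabled(env.get("BUILD_S3"))
    args = [
        "-DCMAKE_BUILD_TYPE=Release",
        f"-DBUILD_S3:BOOL={'ON' if build_s3 else 'OFF'}",
    ]
    # Forward dependency locations as CMake cache entries when they are set.
    for name in ("pybind11_DIR", "AWSSDK_DIR"):
        value = env.get(name)
        if value:
            args.append(f"-D{name}={value}")
    return args


if __name__ == "__main__":
    # Print the flags that the current environment would produce.
    print(cmake_args_from_env())
```

Taking the environment as a parameter keeps the function easy to exercise without touching `os.environ`; a real `build_ext` command would append these flags to its `cmake` invocation.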
116 changes: 115 additions & 1 deletion test/test_remote_io.py
@@ -6,9 +6,19 @@

import expecttest

import torchdata

from _utils._common_utils_for_test import check_hash_fn, create_temp_dir

from torchdata.datapipes.iter import EndOnDiskCacheHolder, FileOpener, HttpReader, IterableWrapper, OnDiskCacheHolder
from torchdata.datapipes.iter import (
EndOnDiskCacheHolder,
FileOpener,
HttpReader,
IterableWrapper,
OnDiskCacheHolder,
S3FileLister,
S3FileLoader,
)


class TestDataPipeRemoteIO(expecttest.TestCase):
@@ -161,6 +171,110 @@ def _read_and_decode(x):
self.assertTrue(os.path.exists(expected_csv_path))
self.assertEqual(expected_csv_path, csv_path)

def test_s3_io_iterdatapipe(self):
# sanity test
file_urls = ["s3://ai2-public-datasets"]
try:
s3_lister_dp = S3FileLister(IterableWrapper(file_urls))
s3_loader_dp = S3FileLoader(IterableWrapper(file_urls))
except ModuleNotFoundError:
warnings.warn(
"S3 IO datapipes or C++ extension '_torchdata' isn't built in the current 'torchdata' package"
)
return

# S3FileLister: different inputs
input_list = [
[["s3://ai2-public-datasets"], 71], # bucket without '/'
[["s3://ai2-public-datasets/"], 71], # bucket with '/'
[["s3://ai2-public-datasets/charades"], 18], # folder without '/'
[["s3://ai2-public-datasets/charades/"], 18], # folder with '/'
[["s3://ai2-public-datasets/charad"], 18], # prefix
[
[
"s3://ai2-public-datasets/charades/Charades_v1",
"s3://ai2-public-datasets/charades/Charades_vu17",
],
12,
], # prefixes
[["s3://ai2-public-datasets/charades/Charades_v1.zip"], 1], # single file
[
[
"s3://ai2-public-datasets/charades/Charades_v1.zip",
"s3://ai2-public-datasets/charades/Charades_v1_flow.tar",
"s3://ai2-public-datasets/charades/Charades_v1_rgb.tar",
"s3://ai2-public-datasets/charades/Charades_v1_480.zip",
],
4,
], # multiple files
[
[
"s3://ai2-public-datasets/charades/Charades_v1.zip",
"s3://ai2-public-datasets/charades/Charades_v1_flow.tar",
"s3://ai2-public-datasets/charades/Charades_v1_rgb.tar",
"s3://ai2-public-datasets/charades/Charades_v1_480.zip",
"s3://ai2-public-datasets/charades/Charades_vu17",
],
10,
], # files + prefixes
]
for input in input_list:
s3_lister_dp = S3FileLister(IterableWrapper(input[0]), region="us-west-2")
self.assertEqual(sum(1 for _ in s3_lister_dp), input[1], f"{input[0]} failed")

# S3FileLister: prefixes + different region
file_urls = [
"s3://aft-vbi-pds/bin-images/111",
"s3://aft-vbi-pds/bin-images/222",
]
s3_lister_dp = S3FileLister(IterableWrapper(file_urls), region="us-east-1")
self.assertEqual(sum(1 for _ in s3_lister_dp), 2212, f"{file_urls} failed")

# S3FileLister: incorrect inputs
input_list = [
[""],
["ai2-public-datasets"],
["s3://"],
["s3:///bin-images"],
]
for input in input_list:
with self.assertRaises(ValueError, msg=f"{input} should raise ValueError."):
s3_lister_dp = S3FileLister(IterableWrapper(input), region="us-east-1")
for _ in s3_lister_dp:
pass

# S3FileLoader: loader
input = [
"s3://charades-tar-shards/charades-video-0.tar",
"s3://charades-tar-shards/charades-video-1.tar",
] # multiple files
s3_loader_dp = S3FileLoader(input, region="us-west-2")
self.assertEqual(sum(1 for _ in s3_loader_dp), 2, f"{input} failed")

input = [["s3://aft-vbi-pds/bin-images/100730.jpg"], 1]
s3_loader_dp = S3FileLoader(input[0], region="us-east-1")
self.assertEqual(sum(1 for _ in s3_loader_dp), input[1], f"{input[0]} failed")

# S3FileLoader: incorrect inputs
input_list = [
[""],
["ai2-public-datasets"],
["s3://"],
["s3:///bin-images"],
["s3://ai2-public-datasets/bin-image"],
]
for input in input_list:
with self.assertRaises(ValueError, msg=f"{input} should raise ValueError."):
s3_loader_dp = S3FileLoader(input, region="us-east-1")
for _ in s3_loader_dp:
pass

# integration test
input = [["s3://charades-tar-shards/"], 10]
s3_lister_dp = S3FileLister(IterableWrapper(input[0]), region="us-west-2")
s3_loader_dp = S3FileLoader(s3_lister_dp, region="us-west-2")
self.assertEqual(sum(1 for _ in s3_loader_dp), input[1], f"{input[0]} failed")


if __name__ == "__main__":
unittest.main()
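
The incorrect-input cases in the test above expect a `ValueError` for an empty string, a URL without the `s3://` scheme, `s3://` with nothing after it, and `s3:///bin-images` with an empty bucket name. The real validation happens in the C++ extension, which is not shown in this part of the diff. The following is a minimal Python sketch of equivalent checks plus prefix-style matching over an in-memory key list. `parse_s3_url`, `match_prefix` and the sample keys are hypothetical names and data introduced only for illustration.

```python
from typing import Iterable, List, Tuple

S3_SCHEME = "s3://"


def parse_s3_url(url: str) -> Tuple[str, str]:
    """Split "s3://bucket/key-or-prefix" into (bucket, prefix).

    Raises ValueError for a missing scheme or an empty bucket name.
    """
    if not url.startswith(S3_SCHEME):
        raise ValueError(f"{url!r} does not start with {S3_SCHEME!r}")
    bucket, _, prefix = url[len(S3_SCHEME):].partition("/")
    if not bucket:
        raise ValueError(f"{url!r} has an empty bucket name")
    return bucket, prefix


def match_prefix(keys: Iterable[str], prefix: str) -> List[str]:
    """Return the keys that start with the given prefix, similar to a prefix listing."""
    return [key for key in keys if key.startswith(prefix)]


if __name__ == "__main__":
    # Made-up keys, not the contents of any real bucket.
    sample_keys = ["charades/a.zip", "charades/b.tar", "other/c.txt"]
    bucket, prefix = parse_s3_url("s3://example-bucket/charad")
    print(bucket, match_prefix(sample_keys, prefix))
```

This sketch does not cover the last `S3FileLoader` case, `s3://ai2-public-datasets/bin-image`, which is well-formed and presumably fails only because no such object exists.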
Empty file added tools/__init__.py
Empty file.