TorchData 0.4.0 Beta Release
TorchData 0.4.0 Release Notes
- Highlights
- Backwards Incompatible Change
- Deprecations
- New Features
- Improvements
- Performance
- Documentation
- Future Plans
- Beta Usage Note
Highlights
We are excited to announce the release of TorchData 0.4.0. This release is composed of about 120 commits since 0.3.0, made by 23 contributors. We want to sincerely thank our community for continuously improving TorchData.
TorchData 0.4.0 updates are focused on consolidating the DataPipe APIs and supporting more remote file systems. Highlights include:
- The DataPipe graph is now backward compatible with DataLoader regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. Please check the tutorial here.
- AWSSDK is integrated to support listing/loading files from AWS S3.
- Added support to read from TFRecord and Hugging Face Hub.
- DataLoader2 became available in prototype mode. For more details, please check our future plans.
Backwards Incompatible Change
DataPipe
Updated `Multiplexer` (functional API `mux`) to stop merging multiple `DataPipes` whenever the shortest one is exhausted (pytorch/pytorch#77145)
Please use `MultiplexerLongest` (functional API `mux_longest`) to achieve the previous functionality.
**0.3.0**
```python
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]
>>> len(output_dp)
13
```

**0.4.0**
```python
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> len(output_dp)
9
```
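The semantic change can be illustrated without TorchData itself. Below is a pure-Python sketch of the two round-robin strategies; the helper names `mux_shortest` and `mux_longest` are illustrative, not the library's implementation:

```python
def mux_shortest(*iterables):
    """Round-robin until the shortest input is exhausted (0.4.0 `mux` semantics)."""
    iterators = [iter(it) for it in iterables]
    while True:
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                return  # stop as soon as any input runs dry

def mux_longest(*iterables):
    """Round-robin until every input is exhausted (`mux_longest` semantics)."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        for it in list(iterators):  # copy, since we remove during iteration
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)
```

With the inputs from the table above, `mux_shortest` yields the 9 elements of the 0.4.0 behavior, while `mux_longest` yields all 13 of the old behavior.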
Enforcing single valid iterator for `IterDataPipes`, with or without multiple outputs (pytorch/pytorch#70479, pytorch/pytorch#75995)
If you need to reference the same `IterDataPipe` multiple times, please apply `.fork()` on the `IterDataPipe` instance.
IterDataPipe with a single output

**0.3.0**
```python
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)
0
>>> next(it1)
1

# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
[(0, 0), ..., (9, 9)]
```

**0.4.0**
```python
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)  # This doesn't raise any warning or error
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)  # Invalidates `it1`
0
>>> next(it1)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.

# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
```
IterDataPipe with multiple outputs

**0.3.0**
```python
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)
# Basically shares the same reference as `it1`
# doesn't reset because `cdp1` hasn't been read since reset
>>> next(it1)
0
>>> next(it2)
0
>>> next(it3)
1
# The next line resets all ChildDataPipes
# because `cdp2` has started reading
>>> it4 = iter(cdp2)
>>> next(it3)
0
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

**0.4.0**
```python
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)  # This invalidates `it1` and `it2`
>>> next(it1)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it2)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it3)
0
# The next line should not invalidate anything, as there was no new iterator created
# for `cdp2` after `it2` was invalidated
>>> it4 = iter(cdp2)
>>> next(it3)
1
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
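The invalidation rule itself can be sketched in plain Python with a generation counter. `SingleIteratorPipe` below is a hypothetical illustration, not TorchData's implementation; note that this sketch only notices the competing iterator on the next advance, whereas TorchData invalidates eagerly:

```python
class SingleIteratorPipe:
    """Each new iterator bumps a generation counter; any older
    iterator raises RuntimeError the next time it is advanced."""

    def __init__(self, source):
        self._source = source
        self._generation = 0

    def __iter__(self):
        # The bump happens when the generator body first runs,
        # i.e. on the first next() call on the new iterator.
        self._generation += 1
        my_generation = self._generation
        for item in self._source:
            if self._generation != my_generation:
                raise RuntimeError("This iterator has been invalidated")
            yield item
```

Once a second iterator starts consuming, the first one raises `RuntimeError`, mirroring the 0.4.0 behavior shown above.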
Deprecations
DataPipe
Deprecated functional APIs of `open_file_by_fsspec` and `open_file_by_iopath` for `IterDataPipe` (pytorch/pytorch#78970, pytorch/pytorch#79302)
Please use `open_files_by_fsspec` and `open_files_by_iopath` instead.
**0.3.0**
```python
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()  # No Warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()  # No Warning
```

**0.4.0**
```python
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()
FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_fsspec()` instead.
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()
FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_iopath()` instead.
```
The argument `drop_empty_batches` of `Filter` (functional API `filter`) is deprecated and will be removed in a future release (pytorch/pytorch#76060)
**0.3.0**
```python
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
```

**0.4.0**
```python
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
FutureWarning: The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.
```
New Features
DataPipe
- Added utility to visualize `DataPipe` graphs (#330)

IterDataPipe
- Added `Bz2FileLoader` with functional API of `load_from_bz2` (#312)
- Added `BatchMapper` (functional API: `map_batches`) and `FlatMapper` (functional API: `flat_map`) (#359)
- Added support for WebDataset-style archives (#367)
- Added `MultiplexerLongest` with functional API of `mux_longest` (#372)
- Added `ZipperLongest` with functional API of `zip_longest` (#373)
- Added `MaxTokenBucketizer` with functional API of `max_token_bucketize` (#283)
- Added `S3FileLister` (functional API: `list_files_by_s3`) and `S3FileLoader` (functional API: `load_files_by_s3`) integrated with the native AWSSDK (#165)
- Added `HuggingFaceHubReader` (#490)
- Added `TFRecordLoader` with functional API of `load_from_tfrecord` (#308)
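The idea behind `MaxTokenBucketizer` — greedily packing samples into batches whose total token count stays under a cap — can be sketched in plain Python. The function below is an illustrative stand-in, not the library's implementation (which also buffers and sorts samples to pack similar lengths together):

```python
def max_token_bucketize(samples, max_token_count, len_fn=len):
    """Yield batches whose combined length stays <= max_token_count."""
    batch, batch_tokens = [], 0
    for sample in samples:
        n = len_fn(sample)
        if n > max_token_count:
            raise ValueError("single sample exceeds max_token_count")
        if batch_tokens + n > max_token_count:  # flush the current batch first
            yield batch
            batch, batch_tokens = [], 0
        batch.append(sample)
        batch_tokens += n
    if batch:
        yield batch  # flush the final partial batch
```

For example, with a cap of 5 tokens, `["ab", "cde", "f", "ghij"]` is packed into two batches of total length 5 each.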
MapDataPipe
- Added `UnZipper` with functional API of `unzip` (#325)
- Added `MapToIterConverter` with functional API of `to_iter_datapipe` (#327)
- Added `InMemoryCacheHolder` with functional API of `in_memory_cache` (#328)
Releng
- Added nightly releases for TorchData. Users should be able to install nightly TorchData via
  - `pip install --pre torchdata -f https://download.pytorch.org/whl/nightly/cpu`
  - `conda install -c pytorch-nightly torchdata`
- Added support for AWSSDK-enabled `DataPipes`. See: README
  - AWSSDK was pre-compiled and assembled in TorchData for both nightly and 0.4.0 releases
Improvements
DataPipe
- Added optional `encoding` argument to `FileOpener` (pytorch/pytorch#72715)
- Renamed `BucketBatcher` argument to avoid name collision (#304)
- Removed default parameter of `ShufflerIterDataPipe` (pytorch/pytorch#74370)
- Made profiler wrapper delegate function calls to the `DataPipe` iterator (pytorch/pytorch#75275)
- Added `input_col` argument to `flatmap` for applying `fn` to the specific column(s) (#363)
- Improved debug message when exceptions are raised within `IterDataPipe` (pytorch/pytorch#75618)
- Improved debug message when argument is a tuple/list of `DataPipes` (pytorch/pytorch#76134)
- Added functional API to `FileOpener` (functional API: `open_files`) and `StreamReader` (functional API: `read_from_stream`) (pytorch/pytorch#76233)
- Enabled graph traversal for `MapDataPipe` (pytorch/pytorch#74851)
- Added `input_col` argument to `filter` for applying `filter_fn` to the specific column(s) (pytorch/pytorch#76060)
- Added functional APIs for `OnlineReaders` (#369)
  - `HTTPReaderIterDataPipe`: `read_from_http`
  - `GDriveReaderDataPipe`: `read_from_gdrive`
  - `OnlineReaderIterDataPipe`: `read_from_remote`
- Cleared buffer for `DataPipe` during `__del__` (pytorch/pytorch#76345)
- Overrode wrong Python HTTPS proxy on Windows (#371)
- Exposed functional API of `to_map_datapipe` from `IterDataPipe`'s pyi interface (#326)
- Moved buffer for `IterDataPipe` from iterator to instance (self) (#388)
- Improved `DataPipe` serialization:
  - Enabled serialization of `ForkerIterDataPipe` (pytorch/pytorch#73118)
  - Fixed issue with `DataPipe` serialization with dill (pytorch/pytorch#72896)
  - Applied special serialization when dill is installed (pytorch/pytorch#74958)
  - Applied dill serialization for `demux` and added cache to graph traverse (pytorch/pytorch#75034)
  - Revamped serialization logic of `DataPipes` (pytorch/pytorch#74984)
  - Prevented automatic reset after state is restored (pytorch/pytorch#77774)
- Moved `IterDataPipe` buffers from `__iter__` to instance (self) (#76999)
- Refactored buffer of `Multiplexer` from `__iter__` to instance (self) (pytorch/pytorch#77775)
- Made `GDriveReader` handle Virus Scan Warning (#442)
- Added `**kwargs` arguments to `HttpReader` to specify extra parameters for HTTP requests (#392)
- Updated `FSSpecFileLister` and `IoPathFileLister` to support multiple root paths and updated `FSSpecFileLister` to support S3 URLs (#383)
- Fixed race condition issue with writing files in multiprocessing
- Added an 's' to the functional names of open/list `DataPipes` (#479)
- Added `list_file` functional API to `FSSpecFileLister` and `IoPathFileLister` (#463)
- Added `list_files` functional API to `FileLister` (pytorch/pytorch#78419)
- Improved FSSpec `DataPipes` to accept extra keyword arguments (#495)
- Passed through `kwargs` to the `json.loads` call in `JsonParser` (#518)
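The `kwargs` passthrough in the last item can be pictured with a small sketch; `parse_json` below is a hypothetical stand-in for the JSON-parsing DataPipe, not the library code:

```python
import json

def parse_json(lines, **kwargs):
    """Parse each line as JSON, forwarding extra keyword
    arguments (e.g. parse_float, object_hook) to json.loads."""
    for line in lines:
        yield json.loads(line, **kwargs)
```

For instance, passing `parse_float=str` keeps floating-point values as their literal strings instead of converting them to `float`.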
DataLoader
- Added ability to use `dill` to pass `DataPipes` in multiprocessing (pytorch/pytorch#77288)
- Made `DataLoader` automatically apply sharding to the `DataPipe` graph in single-process, multi-process, and distributed environments (pytorch/pytorch#78762, pytorch/pytorch#78950, pytorch/pytorch#79041, pytorch/pytorch#79124, pytorch/pytorch#79524)
- Made `ShufflerDataPipe` deterministic with `DataLoader` in single-process, multi-process, and distributed environments (pytorch/pytorch#77741, pytorch/pytorch#77855, pytorch/pytorch#78765, pytorch/pytorch#79829)
- Prevented overriding shuffle settings in `DataLoader` for `DataPipe` (pytorch/pytorch#75505)
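The sharding applied to the graph follows the `sharding_filter` idea: each of `n` workers keeps every `n`-th element, so together they cover the dataset exactly once with no duplicates. A minimal pure-Python sketch (the `shard` helper is hypothetical, not the TorchData API):

```python
def shard(iterable, num_shards, shard_index):
    """Keep only the elements whose position falls on this shard,
    round-robin style: element i goes to shard i % num_shards."""
    for i, item in enumerate(iterable):
        if i % num_shards == shard_index:
            yield item
```

With two workers, worker 0 sees the even-indexed elements and worker 1 the odd-indexed ones; their union is the original dataset.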
Releng
- Made `requirements.txt` the single source of truth for the TorchData version (#414)
- Prohibited Release GHA workflows from running on forked branches (#361)
Performance
DataPipe
- Lazily generated exception message for performance (pytorch/pytorch#78673)
  - Fixed a regression introduced by the single-iterator-constraint PRs
- Disabled profiler for `IterDataPipe` by default (pytorch/pytorch#78674)
  - By skipping over the record function when the profiler is not enabled, the speedup is up to 5-6x for `DataPipes` whose internal operations are very simple (e.g. `IterableWrapper`)
Documentation
DataPipe
- Fixed typo in TorchVision example (#311)
- Updated `DataPipe` naming guidelines (#428)
- Updated documents from `DataSet` to PyTorch `Dataset` (#292)
- Added examples for graphs, meshes, and point clouds using `DataPipe` (#337)
- Added examples for semantic segmentation and time series using `DataPipe` (#340)
- Expanded the contribution guide, especially including instructions to add a new `DataPipe` (#354)
- Updated tutorial about placing `sharding_filter` (#487)
- Improved graph visualization documentation (#504)
- Added instructions about ImportError for portalocker (#506)
- Updated examples to avoid lambdas (#524)
- Updated documentation for S3 DataPipes (#534)
- Updated links for tutorial (#543)
IterDataPipe
- Fixed documentation for `IterToMapConverter`, `S3FileLister`, and `S3FileLoader` (#381)
- Updated documentation for S3 DataPipes (#534)
MapDataPipe
- Updated contributing guide and added guidance for `MapDataPipe` (#379)
  - Rather than re-implementing the same functionalities twice for both `IterDataPipe` and `MapDataPipe`, we encourage users to use the built-in functionalities of `IterDataPipe` and use the converter to `MapDataPipe` as needed.
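The converter route can be pictured as materializing an iterable of key/value pairs into a dict; `to_map_datapipe` below is an illustrative stand-in for the `IterDataPipe.to_map_datapipe` converter, not the library code:

```python
def to_map_datapipe(pairs):
    """Build a dict-backed, map-style view from (key, value) pairs,
    rejecting duplicate keys since each key must map to one item."""
    mapping = {}
    for key, value in pairs:
        if key in mapping:
            raise ValueError(f"duplicate key: {key!r}")
        mapping[key] = value
    return mapping
```

Once materialized, items are addressed by key rather than by iteration order, which is what map-style access requires.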
DataLoader/DataLoader2
- Fixed tutorial about `DataPipe` working with `DataLoader` (#458)
- Updated examples and tutorial after automatic sharding has landed (#505)
- Added README for DataLoader2 (#526, #541)
Releng
- Added nightly documentation for TorchData in https://pytorch.org/data/main/
- Fixed instruction to install TorchData (#455)
Future Plans
For DataLoader2, we are introducing new ways for DataPipes, the data loading API, and backends (aka ReadingServices) to interact. The feature is stable in terms of API, but not yet functionally complete. We welcome early adopters and feedback, as well as potential contributors.
Beta Usage Note
This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue. We'd love to hear thoughts and feedback.