update streaming tutorial #1526

Merged: 17 commits, Aug 11, 2022
196 changes: 108 additions & 88 deletions docs/gallery/advanced_io/streaming.py
@@ -1,98 +1,118 @@
'''
.. _streaming:

Streaming from an S3 Bucket
===========================
Streaming NWB files
===================

It is possible to read data directly from an S3 bucket, such as data from the `DANDI Archive
<https://dandiarchive.org/>`_. This is especially useful for reading small pieces of data
from a large NWB file stored remotely. In fact, PyNWB supports two different ways to do this.
You can read specific sections within individual data files directly from remote stores such as the
`DANDI Archive <https://dandiarchive.org/>`_. This is especially useful for reading small pieces of data
from a large NWB file stored
remotely. First, you will need to get the location of the file. The code below illustrates how to do this on DANDI
using the dandi API library.

Method 1: ROS3
~~~~~~~~~~~~~~
ROS3 stands for "read only S3" and is a driver created by the HDF5 group that allows HDF5 to read HDF5 files
stored on s3. Using this method requires that your HDF5 library is installed with the ROS3 driver enabled. This
is not the default configuration, so you will need to make sure you install the right version of h5py that has this
advanced configuration enabled. You can install HDF5 with the ROS3 driver from `conda-forge
<https://conda-forge.org/>`_ using ``conda``. You may first need to uninstall a currently installed version of h5py.
Getting the location of the file on DANDI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

'''
The :py:class:`~dandi.dandiapi.DandiAPIClient` can be used to get the S3 URL of any NWB file stored in the DANDI
Archive. If you have not already, install the latest release of the ``dandi`` package.


.. code-block:: bash

    pip install dandi

Now you can get the url of a particular NWB file using the dandiset ID and the path of that file within the dandiset.

.. code-block:: python

    from dandi.dandiapi import DandiAPIClient

    dandiset_id = '000006'  # ephys dataset from the Svoboda Lab
    filepath = 'sub-anm372795/sub-anm372795_ses-20170718.nwb'  # 450 kB file
    with DandiAPIClient() as client:
        asset = client.get_dandiset(dandiset_id, 'draft').get_asset_by_path(filepath)
        s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
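
If you need the URL for more than one file, the same two calls can be wrapped in a small helper. This is only a
convenience sketch built from the calls shown above; the ``get_s3_url`` name is hypothetical and not part of the
``dandi`` API.

.. code-block:: python

    from dandi.dandiapi import DandiAPIClient

    def get_s3_url(dandiset_id, filepath, version='draft'):
        # resolve the asset within the dandiset and return its direct S3 URL
        with DandiAPIClient() as client:
            asset = client.get_dandiset(dandiset_id, version).get_asset_by_path(filepath)
            return asset.get_content_url(follow_redirects=1, strip_query=True)

    s3_url = get_s3_url('000006', 'sub-anm372795/sub-anm372795_ses-20170718.nwb')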


Streaming Method 1: ROS3
~~~~~~~~~~~~~~~~~~~~~~~~
ROS3 is one of the supported methods for reading data from a remote store. ROS3 stands for "read only S3" and is a
driver created by the HDF5 Group that allows HDF5 to read HDF5 files stored remotely in s3 buckets. Using this method
requires that your HDF5 library is installed with the ROS3 driver enabled. This is not the default configuration,
so you will need to make sure you install the right version of ``h5py`` that has this advanced configuration enabled.
You can install HDF5 with the ROS3 driver from `conda-forge <https://conda-forge.org/>`_ using ``conda``. You may
first need to uninstall a currently installed version of ``h5py``.

.. code-block:: bash

    pip uninstall h5py
    conda install -c conda-forge "h5py>=3.2"
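
Before moving on, you can check that the driver is actually available in your installation. This is just a quick
sanity check, not part of the original tutorial text; it uses ``h5py.registered_drivers()``, which lists the HDF5
file drivers your build supports.

.. code-block:: python

    import h5py

    # 'ros3' appears in this set only if HDF5 was built with the ROS3 driver
    print('ros3' in h5py.registered_drivers())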

Now instantiate a :py:class:`~pynwb.NWBHDF5IO` object with the S3 URL and specify the driver as "ros3". This
will download metadata about the file from the S3 bucket to memory. The values of datasets are accessed lazily,
just like when reading an NWB file stored locally. So, slicing into a dataset will require additional time to
download the sliced data (and only the sliced data) to memory.

.. code-block:: python

    from pynwb import NWBHDF5IO

####################
# .. code-block:: bash
#
#    pip uninstall h5py
#    conda install -c conda-forge "h5py>=3.2"
#

####################
# The ``DandiAPIClient`` can be used to get the S3 URL to an NWB file of interest stored in the DANDI Archive.
# If you have not already, install the latest release of the ``dandi`` package.
#
# .. code-block:: bash
#
#    pip install dandi
#
# .. code-block:: python
#
#    from dandi.dandiapi import DandiAPIClient
#
#    dandiset_id = '000006'  # ephys dataset from the Svoboda Lab
#    filepath = 'sub-anm372795/sub-anm372795_ses-20170718.nwb'  # 450 kB file
#    with DandiAPIClient() as client:
#        asset = client.get_dandiset(dandiset_id, 'draft').get_asset_by_path(filepath)
#        s3_path = asset.get_content_url(follow_redirects=1, strip_query=True)

####################
# Finally, instantiate a :py:class:`~pynwb.NWBHDF5IO` object with the S3 URL and specify the driver as "ros3". This
# will download metadata about the file from the S3 bucket to memory. The values of datasets are accessed lazily,
# just like when reading an NWB file stored locally. So, slicing into a dataset will require additional time to
# download the sliced data (and only the sliced data) to memory.
#
# .. code-block:: python
#
#    from pynwb import NWBHDF5IO
#
#    with NWBHDF5IO(s3_path, mode='r', load_namespaces=True, driver='ros3') as io:
#        nwbfile = io.read()
#        print(nwbfile)
#        print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])

####################
# Method 2: s3fs
# ~~~~~~~~~~~~~~
# s3fs is a library that creates a virtual filesystem for an S3 store. With this approach, a virtual file is created
# for the file and the virtual filesystem layer takes care of requesting data from the s3 bucket whenever data is
# read from the virtual file.
#
# First install s3fs:
#
# .. code-block:: bash
#
#    pip install s3fs
#
# Then in Python:
#
# .. code-block:: python
#
#    import s3fs
#    import pynwb
#    import h5py
#
#    fs = s3fs.S3FileSystem(anon=True)
#
#    f = fs.open("s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35", 'rb')
#    file = h5py.File(f)
#    io = pynwb.NWBHDF5IO(file=file, load_namespaces=True)
#
#    io.read()
Comment on lines -76 to -88
Contributor:
Since s3fs builds on fsspec and the s3fs code is simpler than using the full fsspec, I think it would be nice to add a subsection to the Streaming Method 2: fsspec section that shows this code example for s3fs as well. In this way you can still keep the tutorial to the two main options but still mention the s3fs option, which may be sufficient and simpler for many users.

bendichter (Contributor, Author), Aug 4, 2022:
So s3fs is a specific implementation of the fsspec protocol. I changed over to using the more generic fsspec library with http for three reasons:

  1. This way we can use the http path as input, which is the same path that the ros3 driver takes, so it is clear how the two methods relate to each other.
  2. Showing it this way makes it pretty clear how you would extend this to other backends.
  3. This method also shows how to use caching, which I think is an important feature that a lot of users will want to leverage.

Do you really think this workflow is more complex than the s3fs syntax?

Contributor:
Showing the fsspec example as the main example makes sense. If you think that showing the full s3fs example in addition is too confusing, then maybe we can just add a note along the lines of "The s3fs package also builds on fsspec and can be used to access S3 in a similar manner. s3fs provides additional convenience methods for interacting with S3 more specifically while fsspec provides greater flexibility with regard to supported remote file systems. Note that using s3fs requires the use of s3 URIs instead of http URLs."

Contributor:
I think the best would be if someone checked what s3fs might provide "on top of" or in addition to plain fsspec. E.g., it could make better choices of block size, etc. I would have at least compared performance.

bendichter (Contributor, Author):
is it true that s3fs provides additional convenience methods for interacting with S3 more specifically?

s3fs implements S3FileSystem, which inherits from fsspec's AsyncFileSystem abstract class. My impression is that it just implements the methods of an fsspec filesystem without much else besides maybe credential controls that we don't use. You can switch to an s3 backend by simply changing fs=fsspec.filesystem("http") to fs=fsspec.filesystem("s3"), but then you'll have to supply the s3 path, not the http path.

oruebel (Contributor), Aug 4, 2022:
It's hard to tell from the documentation what the specific differences are, but my impression is that s3fs adds support for async, improved logging, and S3-specific implementations of the file system functions. Purely for reading an NWB file, this may not be a big difference. Pending further investigation, maybe what we could do is add a note to indicate that there are a growing number of libraries built on fsspec that provide functionality specific to different file systems (e.g., S3, Google Cloud, or Azure) and link to https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=S3#other-known-implementations. We could then post alternate examples, such as the code using S3FS, as GitHub Gists online and link to them from the same note. This would keep the docs clean and at the same time provide examples for folks that may want to use the more specific libraries built on fsspec. Given that we don't know how all the various options perform, we could also add a "disclaimer" along those lines.

#
# The above snippet opens an arbitrary file on DANDI. You can use the ``DandiAPIClient`` to find the s3 path,
# but you will need to adjust this url to give it a prefix of "s3://dandiarchive/" as shown above.
#
# The s3fs approach has the advantage of being more robust than ROS3. Sometimes s3 requests are interrupted,
# and s3fs has internal mechanisms to retry these requests automatically, whereas ROS3 does not. However, it may not
# be available on all platforms. s3fs does not currently work for Windows.
    with NWBHDF5IO(s3_url, mode='r', load_namespaces=True, driver='ros3') as io:
        nwbfile = io.read()
        print(nwbfile)
        print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])
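
Because reads are lazy, you can also request just a slice of a dataset instead of the full array. The following is a
sketch of that pattern; it reuses the ``lick_left_times`` series from the example above, and the slice bounds are
arbitrary.

.. code-block:: python

    with NWBHDF5IO(s3_url, mode='r', load_namespaces=True, driver='ros3') as io:
        nwbfile = io.read()
        lick_left_times = nwbfile.acquisition['lick_times'].time_series['lick_left_times']
        # only the requested elements are downloaded, not the entire dataset
        print(lick_left_times.data[:10])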

Streaming Method 2: fsspec
~~~~~~~~~~~~~~~~~~~~~~~~~~~
fsspec is another data streaming approach that is quite flexible and has several performance advantages. This library
creates a virtual filesystem for remote stores. With this approach, a virtual file is created for the remote file and
the virtual filesystem layer takes care of requesting data from the S3 bucket whenever data is
read from the virtual file. Note that this implementation is completely unaware of the internals of the HDF5 format
and thus can work for **any** file, not only for use with h5py and PyNWB.

First install ``fsspec`` and the dependencies of the :py:class:`~fsspec.implementations.http.HTTPFileSystem`:

.. code-block:: bash

    pip install fsspec requests aiohttp

Then in Python:

.. code-block:: python

    import fsspec
    import pynwb
    import h5py
    from fsspec.implementations.cached import CachingFileSystem

    # first, create a virtual filesystem based on the http protocol and use
    # caching to save accessed data to a local cache folder
    fs = CachingFileSystem(
        fs=fsspec.filesystem("http"),
        cache_storage="nwb-cache",  # Local folder for the cache
    )

    # next, open the file
    with fs.open(s3_url, "rb") as f:
        with h5py.File(f) as file:
            with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
                nwbfile = io.read()
                print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])


fsspec is a library that can be used to access a variety of different store formats, including (at the time of
writing):

.. code-block:: python

    from fsspec.registry import known_implementations
    known_implementations.keys()

file, memory, dropbox, http, https, zip, tar, gcs, gs, gdrive, sftp, ssh, ftp, hdfs, arrow_hdfs, webhdfs, s3, s3a, wandb, oci, adl, abfs, az, cached, blockcache, filecache, simplecache, dask, dbfs, github, git, smb, jupyter, jlab, libarchive, reference

The S3 backend, in particular, may provide additional functionality for accessing data on DANDI. See the
`fsspec documentation on known implementations <https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=S3#other-known-implementations>`_
for a full updated list of supported store formats.
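
The S3 backend, for example, can be reached by changing the protocol and passing an S3 URI. The snippet below is a
minimal sketch, assuming the ``s3fs`` package is installed; it reuses the DANDI blob URI from the earlier s3fs
example, and anonymous access to that specific blob is an assumption.

.. code-block:: python

    import fsspec
    import h5py
    import pynwb

    # the "s3" protocol is provided by s3fs and expects s3:// URIs, not http URLs
    fs = fsspec.filesystem("s3", anon=True)

    with fs.open("s3://dandiarchive/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35", "rb") as f:
        with h5py.File(f) as file:
            with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
                nwbfile = io.read()
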
'''

# sphinx_gallery_thumbnail_path = 'figures/gallery_thumbnails_streaming.png'
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -143,6 +143,7 @@ def __call__(self, filename):
    'hdmf': ('https://hdmf.readthedocs.io/en/latest/', None),
    'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None),
    'dandi': ('https://dandi.readthedocs.io/en/stable/', None),
    'fsspec': ("https://filesystem-spec.readthedocs.io/en/latest/", None),
}

extlinks = {'incf_lesson': ('https://training.incf.org/lesson/%s', ''),