
MongoDB and Redis stores #372

Merged
merged 64 commits into from
Feb 8, 2019

Conversation


@jhamman jhamman commented Dec 27, 2018

This adds two new stores to zarr, MongoDB and Redis. This is basically the same implementation that @nbren12 proposed in this gist. The PR is far from complete but I wanted to put it up for others to see, should there be interest.
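For reference, the gist's approach amounts to wrapping a client object in a `MutableMapping`. The sketch below illustrates the general shape of such a store; it is not the PR's implementation. The client is injected so the example does not need a live server, and the method names `get`/`set`/`delete`/`scan_iter` follow the redis-py API.

```python
from collections.abc import MutableMapping


class RedisStoreSketch(MutableMapping):
    """Sketch of a Redis-style key-value store for zarr.

    ``client`` is any object exposing redis-py style ``get``/``set``/
    ``delete``/``scan_iter`` methods. Note a real redis-py client returns
    bytes keys from ``scan_iter``, which would need decoding.
    """

    def __init__(self, client, prefix='zarr'):
        self.client = client
        self.prefix = prefix

    def _key(self, key):
        # namespace all keys under a common prefix
        return '%s:%s' % (self.prefix, key)

    def __getitem__(self, key):
        value = self.client.get(self._key(key))
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        self.client.set(self._key(key), value)

    def __delitem__(self, key):
        # redis-py's delete returns the number of keys removed
        if not self.client.delete(self._key(key)):
            raise KeyError(key)

    def __iter__(self):
        for k in self.client.scan_iter('%s:*' % self.prefix):
            yield k.split(':', 1)[1]

    def __len__(self):
        return sum(1 for _ in self)
```

A MongoDB-backed store would follow the same pattern, with the dictionary operations mapped onto a collection's find/replace/delete calls.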

This is also building on #368 so that should be merged first.

Finally, I should say that I've been meaning to learn a bit about how zarr stores are implemented so I'm using this as a learning experience.

closes #299

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

jakirkham and others added 30 commits December 21, 2018 00:50
Implements a key-value store using SQLite. As this is a builtin module in Python and a common database across many languages, this should have high utility and be very portable. Moreover, many databases provide an SQL interface on top regardless of their internal representation, so this can serve as a great template for users wishing to work with Zarr in their preferred database.
Try using the `SQLiteStore` everywhere one would use another store and
make sure that it behaves correctly. This includes simple key-value
store usage, creating hierarchies, and storing arrays.
Provide a few examples of how one might use `SQLiteStore` to store
arrays or groups. These examples are taken with minor modifications from
the `LMDBStore` examples.
Includes a simple example borrowed from `LMDBStore`'s tutorial example,
which shows how to create and use an `SQLiteStore`.
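As a rough stdlib-only illustration of the idea behind such a store, keys and values map onto a single two-column table. The table and column names below are illustrative, not the PR's actual schema.

```python
import sqlite3

# one table of (key, value) pairs backs the whole store;
# 'zarr', 'k' and 'v' are assumed names for illustration
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB)')

# __setitem__ is roughly an upsert, __getitem__ a SELECT
cursor.execute('REPLACE INTO zarr (k, v) VALUES (?, ?)',
               ('foo/bar/0.0', b'\x00\x01\x02'))
row = cursor.execute('SELECT v FROM zarr WHERE k = ?',
                     ('foo/bar/0.0',)).fetchone()
print(row[0])  # b'\x00\x01\x02'
```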
Otherwise we may end up opening a different database's files and trying to use them with SQLite, only to run into errors. This caused the doctests to fail previously. Changing the extension as we have done should avoid these conflicts.
Instead of opening, committing, and closing the SQLite database for
every operation, limit these to user requested operations. Namely commit
only when the user calls `flush`. Also close only when the user calls
`close`. This should make operations with SQLite much more performant
than when we automatically committed and closed after every user
operation.
As users need to explicitly close the `SQLiteStore` to commit changes
and serialize them to the SQLite database, make sure to point this out
in the docs.
It appears some of these commands work without capitalization. However, as the docs show the commands capitalized, ensure we do the same. That way we won't run into issues with different SQL implementations or older versions of SQLite that are less forgiving. Plus this matches more closely what users familiar with SQL expect.
Make use of `in` instead of repeating the same logic in `__delitem__`.
As we now keep the database open between operations, this is much
simpler than duplicating the key check logic. Also makes it a bit easier
to understand what is going on.
This was needed when the `import` of `sqlite3` was only here to ensure
that it existed (even though it wasn't used). Now we make use of
`sqlite3` where it is being imported. So there is no need to tell flake8
to not worry about the unused import as there isn't one.
Make sure that everything intended to be added to the `SQLiteStore`
database has been written to disk before attempting to pickle it. That
way we can be sure other processes constructing their own `SQLiteStore`
have access to the same data and not some earlier representation.
No need to normalize the path when there isn't one (e.g. `:memory:`).
Fix a typo.

Co-Authored-By: jakirkham <[email protected]>
Include author and original issue in changelog entry.

Co-Authored-By: jakirkham <[email protected]>
The default value for `check_same_thread` was previously set to `False`
when in reality we want this check enabled. So set `check_same_thread`
to `True`.
As users could change the setting of things like `check_same_thread` or
they may try to access the same database from multiple threads or
processes, make sure to flush any changes that would mutate the
database.
As we now always commit after an operation that mutates the data, there is no need to commit before pickling the `SQLiteStore` object. After all, the data should already be up-to-date in the database.
As everything should already be flushed to the database whenever the
state is mutated, there is no need to perform this before closing.
While there is a default implementation of `update` for `MutableMapping`s, it performs a separate `__setitem__` operation for each key. It would be better to write and commit all key-value pairs in a single operation. Hence we implement `update` for this purpose.
Simplifies `__setitem__` to an `update` operation with a dictionary that
contains only one item. This works the same as before, but cuts out some
redundancy, which simplifies the code a bit.
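A minimal sketch of that `update` idea (table name and schema assumed, not the PR's exact code): batch every pair into one `executemany` call and commit once, instead of one `execute` plus commit per key.

```python
import sqlite3

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB)')

def update(*args, **kwargs):
    # accept the same arguments as dict.update, then write all
    # key-value pairs in a single executemany call and commit once
    kv = dict(*args, **kwargs)
    cursor.executemany('REPLACE INTO zarr (k, v) VALUES (?, ?)',
                       kv.items())
    db.commit()

# __setitem__ then reduces to an update with a one-item dict
update({'a/0': b'\x00', 'a/1': b'\x01'})
update({'a/0': b'\xff'})  # REPLACE overwrites existing keys
```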
As we now make sure to commit after every mutating change to the
database, disable `check_same_thread` again as it should be safe.
As some of these parameters no longer make sense to be user
customizable, go ahead and just set their values as part of the
`sqlite3.connect` call. This ensures that they are set the way we
expect. Also it ensures that if users try to mess with them, an error
will be raised due to duplicate keyword arguments.

To elaborate on why these parameters are no longer user configurable: `detect_types` only makes sense if one is building their own table with specific types. Instead we build the table for users with very generic types (i.e. text and blobs), which are not worth checking. As we commit after every modification to the database to make it friendlier to other threads and processes, the `isolation_level` might as well be set to auto-commit; setting it to anything else really has no effect. Finally, there is no need for `check_same_thread` to be anything other than `False`, as guaranteeing everything is committed after each mutation keeps the database thread and process safe.
As we already enable auto-committing, any mutation is automatically written as soon as it is performed. So there is no need for us to commit afterwards. Besides, `commit` becomes a no-op when auto-committing is enabled.
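The pinned connection might look roughly like this (a sketch; the actual `SQLiteStore` call may differ). The last part shows why user-supplied duplicates fail loudly.

```python
import sqlite3

# pin the parameters discussed above rather than exposing them:
# - detect_types=0: only generic TEXT/BLOB columns, nothing to detect
# - isolation_level=None: auto-commit every mutating statement
# - check_same_thread=False: safe because every mutation is committed
db = sqlite3.connect(':memory:',
                     detect_types=0,
                     isolation_level=None,
                     check_same_thread=False)

def connect(path, **kwargs):
    # hypothetical wrapper: forwarding user kwargs like this means
    # passing e.g. isolation_level again raises a TypeError for a
    # duplicate keyword argument
    return sqlite3.connect(path, isolation_level=None, **kwargs)

try:
    connect(':memory:', isolation_level='DEFERRED')
except TypeError as exc:
    print('rejected:', exc)
```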
As we auto-commit all changes, there is no need for a `flush` operation
for the `SQLiteStore`. So go ahead and drop the `flush` function and its
documentation.
As the default implementation of `clear` deletes each key-value pair individually, it is considerably slower than an operation that removes all of them at once. Here we do exactly that using SQL's `DROP TABLE`. SQLite has no `TRUNCATE TABLE` command, but a drop followed by a create has the same effect. We combine these two operations using `executescript`, so auto-commit won't run until after both have run, committing the table with all key-value pairs removed.
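That `clear` strategy can be sketched like so (illustrative schema, not the PR's exact statements):

```python
import sqlite3

db = sqlite3.connect(':memory:', isolation_level=None)  # auto-commit
cursor = db.cursor()
cursor.execute('CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB)')
cursor.executemany('REPLACE INTO zarr (k, v) VALUES (?, ?)',
                   [('a', b'\x00'), ('b', b'\x01')])

def clear():
    # SQLite has no TRUNCATE TABLE, so emulate it: drop and recreate
    # in one script, far cheaper than deleting keys one at a time
    cursor.executescript(
        'DROP TABLE zarr; CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB);')

clear()
count = cursor.execute('SELECT COUNT(*) FROM zarr').fetchone()[0]
print(count)  # 0
```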

jhamman commented Jan 28, 2019

@jakirkham and @alimanfoo - I think this is basically done. I don't know why the coveralls report is failing. My reading of the coveralls report shows that we have 100% coverage on the new bits in this PR, but please let me know if I'm missing something.

@alimanfoo

@jhamman I think coveralls is sometimes broken; I have seen this before, where coverage is 100% on the new/modified files but coveralls somehow reports total coverage as down. One way to check is to make a trivial commit and see if the problem persists. I'll also do some digging.


@alimanfoo alimanfoo left a comment


This is looking good to me. A couple of very small queries, but up to you if you want to do anything about them.

Only other question is whether you want to mark these two new classes as "experimental features" or not. This is just a convention we've been using to highlight features that are new, and so might require a bit of further work before the API can be considered fully stable. I'm OK either way.

If you do want to mark as experimental then add this after the first paragraph of the docstring:

.. note:: This is an experimental feature.
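For example, a store's docstring might then start like this (a hypothetical docstring, not the PR's final text):

```python
class RedisStore:  # hypothetical sketch of docstring placement only
    """Storage class using Redis.

    .. note:: This is an experimental feature.

    Remaining parameters would be documented here as usual.
    """
```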

zarr/storage.py (review thread resolved)

def create_store(self):
# TODO: this is the default host for MongoDB on Travis,
# we probably want to generalize this though
@alimanfoo:

Is this TODO something you want to address before merge? Would we ever need to do something different?

@jhamman replied:
Right. I think the more important question is how you want to handle these tests when a MongoDB or Redis server is not available. How should we decide when to skip these tests?

class TestRedisStore(StoreTests, unittest.TestCase):

def create_store(self):
# TODO: this is the default host for Redis on Travis,
@alimanfoo:

As above, is this TODO something that needs to be resolved before merge, or can we live with as-is?

@alimanfoo

I think the coveralls report is broken; all files have 100% coverage.


@alimanfoo alimanfoo left a comment


Looks good. I think the only outstanding question here is the one you raised, which is how to deal with skipping tests if redis or mongo servers are not running locally.
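One common way to decide (a sketch, not what the PR ultimately did) is to probe the default host/port with a short-timeout TCP connection and skip the test class when nothing answers, e.g. via `unittest.skipUnless`.

```python
import socket

def server_available(host, port, timeout=0.2):
    # return True if something is listening at host:port; used to decide
    # whether the MongoDB (27017) or Redis (6379) store tests can run
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# usage sketch:
# @unittest.skipUnless(server_available('localhost', 6379),
#                      'no local Redis server')
# class TestRedisStore(StoreTests, unittest.TestCase):
#     ...
```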

Resolved review threads: docs/release.rst (two, now outdated), zarr/tests/test_storage.py (two)
@alimanfoo

Travis fails because of flake8:

py37 runtests: commands[4] | flake8 zarr
zarr/tests/test_storage.py:908:37: E128 continuation line under-indented for visual indent

(Aside: I'm going to enable the pep8speaks bot so we don't have to go digging through the travis logs.)
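For reference, E128 looks like this (a hypothetical snippet, not the actual line in `test_storage.py`):

```python
def some_function(a, b):  # placeholder function for illustration
    return a + b

# E128: continuation line under-indented for visual indent
bad = some_function(1,
    2)

# fixed: align the continuation with the opening parenthesis
good = some_function(1,
                     2)
```

Both forms are valid Python; flake8 only objects to the indentation of the first continuation line.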

@alimanfoo

I've fixed the pep8 problem. I think this is good to go; will merge if CI passes.

@alimanfoo

OK, this is going in, thanks all.

Successfully merging this pull request may close these issues.

Add MongoDB storage backend?
4 participants