
MongoDB and Redis stores #372

Merged
merged 64 commits into from
Feb 8, 2019

Conversation


@jhamman jhamman commented Dec 27, 2018

This adds two new stores to zarr, MongoDB and Redis. This is basically the same implementation that @nbren12 proposed in this gist. The PR is far from complete but I wanted to put it up for others to see, should there be interest.
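For reference, the gist's approach amounts to wrapping a client object in a `MutableMapping`. The sketch below illustrates the general shape of such a store; it is not the PR's implementation. The client is injected so the example does not need a live server, and the method names `get`/`set`/`delete`/`scan_iter` follow the redis-py API.

```python
from collections.abc import MutableMapping


class RedisStoreSketch(MutableMapping):
    """Sketch of a Redis-style key-value store for zarr.

    ``client`` is any object exposing redis-py style ``get``/``set``/
    ``delete``/``scan_iter`` methods. Note a real redis-py client returns
    bytes keys from ``scan_iter``, which would need decoding.
    """

    def __init__(self, client, prefix='zarr'):
        self.client = client
        self.prefix = prefix

    def _key(self, key):
        # namespace all keys under a common prefix
        return '%s:%s' % (self.prefix, key)

    def __getitem__(self, key):
        value = self.client.get(self._key(key))
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        self.client.set(self._key(key), value)

    def __delitem__(self, key):
        # redis-py's delete returns the number of keys removed
        if not self.client.delete(self._key(key)):
            raise KeyError(key)

    def __iter__(self):
        for k in self.client.scan_iter('%s:*' % self.prefix):
            yield k.split(':', 1)[1]

    def __len__(self):
        return sum(1 for _ in self)
```

A MongoDB-backed store would follow the same pattern, with the dictionary operations mapped onto a collection's find/replace/delete calls.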

This is also building on #368 so that should be merged first.

Finally, I should say that I've been meaning to learn a bit about how zarr stores are implemented so I'm using this as a learning experience.

closes #299

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

jakirkham and others added 30 commits December 21, 2018 00:50
Implements a key-value store using SQLite. As this is a builtin module in Python and a common database across many languages, this should have high utility and be very portable. Moreover, many databases provide an SQL interface on top regardless of their internal representation, so this can serve as a great template for users wishing to work with Zarr in their preferred database.
Try using the `SQLiteStore` everywhere one would use another store and
make sure that it behaves correctly. This includes simple key-value
store usage, creating hierarchies, and storing arrays.
Provide a few examples of how one might use `SQLiteStore` to store
arrays or groups. These examples are taken with minor modifications from
the `LMDBStore` examples.
Includes a simple example borrowed from `LMDBStore`'s tutorial example,
which shows how to create and use an `SQLiteStore`.
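As a rough stdlib-only illustration of the idea behind such a store, keys and values map onto a single two-column table. The table and column names below are illustrative, not the PR's actual schema.

```python
import sqlite3

# one table of (key, value) pairs backs the whole store;
# 'zarr', 'k' and 'v' are assumed names for illustration
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB)')

# __setitem__ is roughly an upsert, __getitem__ a SELECT
cursor.execute('REPLACE INTO zarr (k, v) VALUES (?, ?)',
               ('foo/bar/0.0', b'\x00\x01\x02'))
row = cursor.execute('SELECT v FROM zarr WHERE k = ?',
                     ('foo/bar/0.0',)).fetchone()
print(row[0])  # b'\x00\x01\x02'
```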
Otherwise we may end up opening a different database's files and trying to use them with SQLite, only to run into errors. This caused the doctests to fail previously. Changing the extension as we have done should avoid these conflicts.
Instead of opening, committing, and closing the SQLite database for
every operation, limit these to user requested operations. Namely commit
only when the user calls `flush`. Also close only when the user calls
`close`. This should make operations with SQLite much more performant
than when we automatically committed and closed after every user
operation.
As users need to explicitly close the `SQLiteStore` to commit changes
and serialize them to the SQLite database, make sure to point this out
in the docs.
It appears some of these commands work without capitalization. However, as the docs show the commands capitalized, ensure we do the same. That way we won't run into issues with different SQL implementations or older versions of SQLite that are less forgiving. Plus this matches more closely what users familiar with SQL expect.
Make use of `in` instead of repeating the same logic in `__delitem__`.
As we now keep the database open between operations, this is much
simpler than duplicating the key check logic. Also makes it a bit easier
to understand what is going on.
This was needed when the `import` of `sqlite3` was only here to ensure
that it existed (even though it wasn't used). Now we make use of
`sqlite3` where it is being imported. So there is no need to tell flake8
to not worry about the unused import as there isn't one.
Make sure that everything intended to be added to the `SQLiteStore`
database has been written to disk before attempting to pickle it. That
way we can be sure other processes constructing their own `SQLiteStore`
have access to the same data and not some earlier representation.
No need to normalize the path when there isn't one (e.g. `:memory:`).
Fix a typo.

Co-Authored-By: jakirkham <[email protected]>
Include author and original issue in changelog entry.

Co-Authored-By: jakirkham <[email protected]>
The default value for `check_same_thread` was previously set to `False`
when in reality we want this check enabled. So set `check_same_thread`
to `True`.
As users could change the setting of things like `check_same_thread` or
they may try to access the same database from multiple threads or
processes, make sure to flush any changes that would mutate the
database.
As we now always commit after an operation that mutates the data, there is no need to commit before pickling the `SQLiteStore` object. After all, the data should already be up-to-date in the database.
As everything should already be flushed to the database whenever the
state is mutated, there is no need to perform this before closing.
While there is a default implementation of `update` for `MutableMapping`s, it performs a separate `__setitem__` operation for each key. It would be better to write and commit all key-value pairs in a single operation. Hence we implement `update` for this purpose.
Simplifies `__setitem__` to an `update` operation with a dictionary that
contains only one item. This works the same as before, but cuts out some
redundancy, which simplifies the code a bit.
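A minimal sketch of that `update` idea (table name and schema assumed, not the PR's exact code): batch every pair into one `executemany` call and commit once, instead of one `execute` plus commit per key.

```python
import sqlite3

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB)')

def update(*args, **kwargs):
    # accept the same arguments as dict.update, then write all
    # key-value pairs in a single executemany call and commit once
    kv = dict(*args, **kwargs)
    cursor.executemany('REPLACE INTO zarr (k, v) VALUES (?, ?)',
                       kv.items())
    db.commit()

# __setitem__ then reduces to an update with a one-item dict
update({'a/0': b'\x00', 'a/1': b'\x01'})
update({'a/0': b'\xff'})  # REPLACE overwrites existing keys
```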
As we now make sure to commit after every mutating change to the
database, disable `check_same_thread` again as it should be safe.
As some of these parameters no longer make sense to be user
customizable, go ahead and just set their values as part of the
`sqlite3.connect` call. This ensures that they are set the way we
expect. Also it ensures that if users try to mess with them, an error
will be raised due to duplicate keyword arguments.

To elaborate on why these parameters are no longer user configurable: `detect_types` only makes sense if one is building their own table with specific types. Instead we build the table for users with very generic types (i.e. text and blobs), which are not worth checking. As we commit after every modification to the database to make it friendlier to other threads and processes, the `isolation_level` might as well be set to auto-commit; setting it to anything else really has no effect. Finally, there is no need for `check_same_thread` to be anything other than `False`, as guaranteeing everything is committed after each mutation keeps the database thread and process safe.
As we already enable auto-committing, any mutation is automatically written as soon as it is performed. So there is no need for us to commit afterwards. Besides, `commit` becomes a no-op when auto-committing is enabled.
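The pinned connection might look roughly like this (a sketch; the actual `SQLiteStore` call may differ). The last part shows why user-supplied duplicates fail loudly.

```python
import sqlite3

# pin the parameters discussed above rather than exposing them:
# - detect_types=0: only generic TEXT/BLOB columns, nothing to detect
# - isolation_level=None: auto-commit every mutating statement
# - check_same_thread=False: safe because every mutation is committed
db = sqlite3.connect(':memory:',
                     detect_types=0,
                     isolation_level=None,
                     check_same_thread=False)

def connect(path, **kwargs):
    # hypothetical wrapper: forwarding user kwargs like this means
    # passing e.g. isolation_level again raises a TypeError for a
    # duplicate keyword argument
    return sqlite3.connect(path, isolation_level=None, **kwargs)

try:
    connect(':memory:', isolation_level='DEFERRED')
except TypeError as exc:
    print('rejected:', exc)
```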
As we auto-commit all changes, there is no need for a `flush` operation
for the `SQLiteStore`. So go ahead and drop the `flush` function and its
documentation.
As the default implementation of `clear` deletes each key-value pair individually, it is considerably slower than an operation that removes all of them at once. Here we do exactly that using SQL's `DROP TABLE`. SQLite has no `TRUNCATE TABLE` command, but a drop followed by a create has the same effect. We combine these two operations using `executescript`, so auto-commit won't run until after both have run, committing the table with all key-value pairs removed.
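That `clear` strategy can be sketched like so (illustrative schema, not the PR's exact statements):

```python
import sqlite3

db = sqlite3.connect(':memory:', isolation_level=None)  # auto-commit
cursor = db.cursor()
cursor.execute('CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB)')
cursor.executemany('REPLACE INTO zarr (k, v) VALUES (?, ?)',
                   [('a', b'\x00'), ('b', b'\x01')])

def clear():
    # SQLite has no TRUNCATE TABLE, so emulate it: drop and recreate
    # in one script, far cheaper than deleting keys one at a time
    cursor.executescript(
        'DROP TABLE zarr; CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB);')

clear()
count = cursor.execute('SELECT COUNT(*) FROM zarr').fetchone()[0]
print(count)  # 0
```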

jhamman commented Jan 28, 2019

@jakirkham and @alimanfoo - I think this is basically done. I don't know why the coveralls report is failing. My reading of the coveralls report shows that we have 100% coverage on the new bits in this PR, but please let me know if I'm missing something.

@alimanfoo

@jhamman I think coveralls is sometimes broken; I have seen this before, where coverage is 100% on the new/modified files but coveralls somehow reports total coverage as down. One way to check is to make a trivial commit and see if the problem persists. I'll also do some digging.


@alimanfoo alimanfoo left a comment


This is looking good to me. A couple of very small queries, but up to you if you want to do anything about them.

Only other question is whether you want to mark these two new classes as "experimental features" or not. This is just a convention we've been using to highlight features that are new, and so might require a bit of further work before the API can be considered fully stable. I'm OK either way.

If you do want to mark as experimental then add this after the first paragraph of the docstring:

.. note:: This is an experimental feature.
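For example, a store's docstring might then start like this (a hypothetical docstring, not the PR's final text):

```python
class RedisStore:  # hypothetical sketch of docstring placement only
    """Storage class using Redis.

    .. note:: This is an experimental feature.

    Remaining parameters would be documented here as usual.
    """
```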

zarr/storage.py (review thread resolved)

def create_store(self):
# TODO: this is the default host for MongoDB on Travis,
# we probably want to generalize this though
@alimanfoo:

Is this TODO something you want to address before merge? Would we ever need to do something different?

@jhamman replied:
Right. I think the more important question is how you want to handle these tests when a MongoDB or Redis server is not available. How should we decide when to skip these tests?

class TestRedisStore(StoreTests, unittest.TestCase):

def create_store(self):
# TODO: this is the default host for Redis on Travis,
@alimanfoo:

As above, is this TODO something that needs to be resolved before merge, or can we live with as-is?

@alimanfoo

I think the coveralls report is broken; all files have 100% coverage.


@alimanfoo alimanfoo left a comment


Looks good. I think the only outstanding question here is the one you raised, which is how to deal with skipping tests if redis or mongo servers are not running locally.
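One common way to decide (a sketch, not what the PR ultimately did) is to probe the default host/port with a short-timeout TCP connection and skip the test class when nothing answers, e.g. via `unittest.skipUnless`.

```python
import socket

def server_available(host, port, timeout=0.2):
    # return True if something is listening at host:port; used to decide
    # whether the MongoDB (27017) or Redis (6379) store tests can run
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# usage sketch:
# @unittest.skipUnless(server_available('localhost', 6379),
#                      'no local Redis server')
# class TestRedisStore(StoreTests, unittest.TestCase):
#     ...
```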

Resolved review threads: docs/release.rst (two, now outdated), zarr/tests/test_storage.py (two)
@alimanfoo

Travis fails because of flake8:

py37 runtests: commands[4] | flake8 zarr
zarr/tests/test_storage.py:908:37: E128 continuation line under-indented for visual indent

(Aside: I'm going to enable the pep8speaks bot so we don't have to go digging through the travis logs.)
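For reference, E128 looks like this (a hypothetical snippet, not the actual line in `test_storage.py`):

```python
def some_function(a, b):  # placeholder function for illustration
    return a + b

# E128: continuation line under-indented for visual indent
bad = some_function(1,
    2)

# fixed: align the continuation with the opening parenthesis
good = some_function(1,
                     2)
```

Both forms are valid Python; flake8 only objects to the indentation of the first continuation line.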

@alimanfoo

I've fixed the pep8 problem. I think this is good to go; will merge if CI passes.

@alimanfoo

OK, this is going in, thanks all.

Successfully merging this pull request may close these issues.

Add MongoDB storage backend?
4 participants