-
-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use GCS blob interface #729
Conversation
Swap to using GCS native blob open under the hood. Reduces code maintenence overhead.
First off, @cadnce this is great and thank you for your efforts. Something I am wondering is if we need these Reader and Writer classes at all now. Do you think there is a solution that just uses the Blob.open function and nothing more? It seems like almost all of our Reader and Writer methods just defer to the the raw reader returned by Blob.open. If this change breaks the smart-open API in some small way like removing the infrequently used I agree most of these tests are probably useless now, especially if we are able to accomplish what I stated above. |
Could we do something like this? We would need to do a bit of refactoring for handling kwargs but this should eliminate a lot more of our code def open(
bucket_id,
blob_id,
mode,
client=None, # type: google.cloud.storage.Client
blob_properties=None,
blob_open_kwargs=None,
):
if client is None:
client = google.cloud.storage.Client()
bucket = client.bucket(bucket_id)
if mode in (constants.READ_BINARY, 'r', 'rt'):
blob = client.bucket.get_blob(blob_id)
if blob is None:
raise google.cloud.exceptions.NotFound('blob %s not found in %s' % (blob_id, bucket_id))
elif mode in (constants.WRITE_BINARY, 'w', 'wt'):
blob = client.bucket.blob(blob_id)
if blob_properties:
for k, v in blob_properties.items():
setattr(blob, k, v)
else:
raise NotImplementedError('GCS support for mode %r not implemented' % mode)
return blob.open(mode, **blob_open_kwargs) |
Cool! If you’re not opposed to the idea of losing compatibility I’ll swap it to something much simpler like you suggest. Ironically that’s where I started out before I spent ages ensuring it all matched - I guess I should have just asked |
Considering those changes would reduce the maintenance of the GCS module drastically I think it more than warrants a breaking change / major version bump. We could just redirect most limitations to the underlying python-storage repo. I don't think any meaningful breaking changes will really occur regardless though since the blob.open is fairly similar to our implementation here |
👍 I’ll fix this up/replace it next week. |
Breaking changes: * Removed gcs.Reader/gcs.Writer classes * No Reader/Writer.terminate() * The buffer size can no-longer be controlled independently of chunk_size * calling close twice on a gcs file object will now throw an exception
Hmm, so this has caused me a little more thinking than I was expecting and would be keen to get input on a few things
|
If you need to maintain the functionality in 2, we could do something super disgusting and monkeypatch the google BlobWriter object so that it allows close to be called multiple times. But that seems like a recipe for disaster in the future! |
|
@mpenkov @piskvorky what is smart-open's stance on breaking API changes within the library? The opinion I have been expressing so far is that they are ok if we perform a major version bump and the benefits are worth it. In this case, we defer almost all maintenance of the gcs integration to the google-cloud-storage library so the benefits seem worthwhile. I think so far the breaking changes we have identified are:
|
On the breaking changes front also
If we’re reasonably happy with the concept of it being a breaking change, I’ll have a look at ensuring those test cases are covered off (and probably strip out a few that are no-longer testing the right things) |
Btw thanks for your input @petedannemann always hard knowing the right way to contribute |
Ok, so of the changes that were breakign originally in this PR:
Oh, and I updated the test coverage to cover the things I thought we were missing. Which means its just a case of waiting for the next release of the google-cloud-storage library and the interface will remain identical. |
Its definitely not scientific, but I ran the integration/performance tests use-gcs-open
develop
The results pretty much being too similar to compare on my connection |
Do we need to author any documentation that helps people adjust to the breaking changes? |
If we wait for the google-python-storage library version 2.6.0 googleapis/python-storage#838 I think there will be no breaking changes 😁 Though, I’ll add an update to the documentation about the new parameter passthrough. the slightly unfortunate thing is it will break all the rest of the open PR’s which modify the gcs module, but they will all be easier to add in the future. |
fyi another PR waiting for gcs 2.6.0: #727 would be awesome if that PR can be closed in favour of this one (just that acl cant be specified using edit: yes this PR would cover it (source): blob_open_kwargs={'predefined_acl': 'publicRead'} if it's not too much to ask: could you copy-paste the corresponding test from my PR? its tiny :) |
@ddelange Yup, it should be covered by the blob_open_kwargs (why I made this pr instead of adding all the small gcs functionality additions) I’ll add that as a use case in the docs update to explain how to use it? I was trying to trim functionality out of the fakes (and removed the test you updated) because I was deferring a lot of that to the google library. But could add that in if you felt it was adding a lot? |
awesome, docs sounds good! 💥 yeah, all the fakes. still a mystery to me why there's no testing framework for gcs like moto for boto, or a local dev server for testing like azurite for azure blob... |
nvm, spoke too soon https://github.com/fsouza/fake-gcs-server |
here an example of switching to a dockerized azurite server for tests, was a quite satisfying refactor as you just point to fake server and run real assertions on the server's responses |
ah, that felt good again ♻️ ddelange/pypicloud#4 thanks a lot for the efforts here, looking very forward to smart_open 7.0.0! 🚀 |
is this PR ready for review? |
Sorry about the ambiguity. We’re waiting on the release of the google cloud storage python module. That said, the code for this will remain the same except for the dependency version will be bumped. So in theory you can review now and just assume the double close issue is not an issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome stuff!
try: | ||
setattr(g_blob, k, v) | ||
except AttributeError: | ||
logger.warn(f'Unable to set property {k} on blob') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from my side there's no need for a try-except: if the user passes unsupported properties, it's OK to fail hard
if hasattr(_blob, 'terminate'): | ||
raise RuntimeWarning( | ||
'Unexpected incompatibility between dependency and google-cloud-storage dependency.' | ||
'Things may not work as expected' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'may not' implies the user can try anyway (warning vs error)
could this error instead state what the user did wrong, and how to fix it?
is it related to minimum gcs lib versions? would it be an option to limit the gcs dependency in setup.py to only allow the versions that work with this new implementation? cc @mpenkov
except AttributeError: | ||
logger.warn(f'Unable to set property {k} on blob') | ||
|
||
_blob = g_blob.open('wb', **blob_open_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was mode 'w' supported before? if yes, maybe a release note? if no, plz ignore :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, I think it's all good. from the PR diff: elif mode in (constants.WRITE_BINARY, 'w', 'wt')
feel free to resolve :)
gcs 2.6.0 was released 💥 |
Awesome! I’ll make the changes requested and bump the version and test this evening. |
@cadnce could we squeeze this into the next release, v6.3.0? We're planning to release soon (~in a week or so). |
@cadnce Do you think you could finish this in the next couple of days? We're thinking of releasing 6.3.0 soon. If not, then no rush, we can include this in the next release. Please let me know. |
Closing in favor of #744 due to inactivity. |
Fixes #599 - Swap to using GCS native blob open under the hood.
This should reduce the amount of custom code to maintain, I have tried to keep the interfaces identical so there is no API breaking changes. Though this does mean there is still lots of code that can be trimmed down.
I think it might be worth re-thinking the test coverage and if the test suites like
FakeAuthorizedSessionTest
are still valid/useful.What do you think? @petedannemann