-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MongoDB: Improve support for reading JSON/BSON files #261
Conversation
414e128
to
e8001c0
Compare
e8001c0
to
be1ca83
Compare
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
cratedb_toolkit/api/main.py
Outdated
progress=True, | ||
) | ||
|
||
elif source_url_obj.scheme.startswith("http"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe move this up to if source_url_obj.scheme.startswith("file") or source_url_obj.scheme.startswith("http")
as the body of both branches looks the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Significantly improved with fb70a36. Thanks!
cratedb_toolkit/io/mongodb/api.py
Outdated
tasks.append( | ||
MongoDBFullLoad( | ||
mongodb_url=str(mongodb_uri_effective), | ||
cratedb_url=str(cratedb_uri_effective), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the URL->String conversion here needed as the MongoDBFullLoad will convert the String back to an URL anyhow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved with 6bbda4d. Thanks again!
Either from server database, or from filesystem directory.
6bbda4d
to
db45c69
Compare
db45c69
to
e129e26
Compare
cratedb_toolkit/io/mongodb/api.py
Outdated
mongodb_uri = source_url | ||
cratedb_uri = target_url | ||
# What the hack? | ||
if ( | ||
mongodb_uri.scheme.startswith("mongodb") | ||
and Path(mongodb_uri.path).is_absolute() | ||
and mongodb_uri.path[-1] != "/" | ||
): | ||
mongodb_uri.path += "/" | ||
if cratedb_uri.path[-1] != "/": | ||
cratedb_uri.path += "/" | ||
mongodb_query_parameters = mongodb_uri.query_params | ||
mongodb_adapter = mongodb_adapter_factory(mongodb_uri) | ||
address_pair_root = AddressPair(source_url=source_url, target_url=target_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hackery has now also been refactored and wrapped away into an AddressPair
object, in order to separate concerns...
cratedb_toolkit/io/mongodb/api.py
Outdated
for collection_path in collections: | ||
mongodb_uri_effective = mongodb_uri.navigate(Path(collection_path).name) | ||
mongodb_uri_effective.query_params = mongodb_query_parameters | ||
cratedb_uri_effective = cratedb_uri.navigate(Path(collection_path).stem) | ||
address_pair = address_pair_root.navigate( | ||
source_path=Path(collection_path).name, | ||
target_path=Path(collection_path).stem, | ||
) | ||
tasks.append( | ||
MongoDBFullLoad( | ||
mongodb_url=mongodb_uri_effective, | ||
cratedb_url=cratedb_uri_effective, | ||
mongodb_url=address_pair.source_url, | ||
cratedb_url=address_pair.target_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and to minimize its API surface. Now, you are just invoking .navigate()
on that composite object instance and it will adjust its managed URL instances correspondingly.
cratedb_toolkit/model.py
Outdated
source_url_query_parameters = self.source_url.query_params | ||
target_url_query_parameters = self.target_url.query_params |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the query params are not changed in this method, why do we need to store (and copy) them separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fundamental .navigate()
method on the URL object implicitly gets rid of them, but we want to propagate them, so we need to store and forward them explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most probably, it makes more sense to not use .navigate()
at all, because it apparently provides so many obstacles that need workarounds, and just manipulate the .path
attribute directly instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
94b52a8 stops using the .navigate()
method of the designated URL library, and uses standard urljoin()
instead to compute and directly set the .path
attribute.
The result looks more streamlined and compact than before, as it does not need to work around the obstacles of .navigate()
any longer, so we save the need to explicitly store+forward the URL query parameters. Thanks!
cratedb_toolkit/model.py
Outdated
source_url_query_parameters = self.source_url.query_params | ||
target_url_query_parameters = self.target_url.query_params | ||
|
||
source_url = URL(str(self.source_url)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there is no better API at the URL class to avoid string generation + parsing only to copy the instance, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like standard deepcopy
is just fine, as already applied on other spots in this file. Improved with 2a5df04. Thanks!
- Do not use the fundamental `.navigate()` method, as it needs too many workarounds. - Do not store and copy query parameters, because the implementation does not use `.navigate()` any longer. - Manipulate the `.path` property directly instead, computing it using the canonical `urljoin` function. - Adjustments about missing trailing slashes still need to take place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thx
About
A few features that have not been ready (lacked software tests) to be included into GH-255.
Details
https+bson://
Documentation
https://cratedb-toolkit--261.org.readthedocs.build/io/mongodb/loader.html
Install