MongoDB: Improve support for reading JSON/BSON files #261

amotl · 2024-09-11T11:44:43Z

About

A few features that were not ready (they lacked software tests) for inclusion in GH-255.

Details

Unlock processing multiple collections, either from server database, or from filesystem directory
Unlock processing JSON files from HTTP resource, using https+bson://
Optionally filter server collection using MongoDB query expression

Documentation

https://cratedb-toolkit--261.org.readthedocs.build/io/mongodb/loader.html

Install

pip install --upgrade 'cratedb-toolkit[mongodb] @ git+https://github.com/crate/cratedb-toolkit.git@amo/mongodb-more-2'

seut · 2024-09-12T14:20:33Z

cratedb_toolkit/api/main.py

+                    progress=True,
+                )
+
+        elif source_url_obj.scheme.startswith("http"):


Maybe move this up to if source_url_obj.scheme.startswith("file") or source_url_obj.scheme.startswith("http"), as the bodies of both branches look the same?

Significantly improved with fb70a36. Thanks!
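A minimal sketch of the suggested consolidation, assuming both branches really share one body. `str.startswith` accepts a tuple of prefixes, so the two checks collapse into a single call. The helper name `is_file_like` and the sample schemes are illustrative only and are not taken from fb70a36.

```python
def is_file_like(scheme: str) -> bool:
    """Return True for URL schemes served by the shared file/HTTP code path.

    Hypothetical helper, for illustration only.
    """
    # str.startswith accepts a tuple of prefixes, which avoids an `or` chain.
    return scheme.startswith(("file", "http"))


for scheme in ("file+bson", "https+bson", "mongodb+srv"):
    print(scheme, is_file_like(scheme))
```

With this shape, adding another file-like scheme prefix later is a one-word change to the tuple.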

seut · 2024-09-12T14:32:24Z

cratedb_toolkit/io/mongodb/api.py

+            tasks.append(
+                MongoDBFullLoad(
+                    mongodb_url=str(mongodb_uri_effective),
+                    cratedb_url=str(cratedb_uri_effective),


Is the URL-to-string conversion needed here, given that MongoDBFullLoad will convert the string back to a URL anyhow?

Improved with 6bbda4d. Thanks again!

amotl · 2024-09-13T08:29:34Z

cratedb_toolkit/io/mongodb/api.py

-        mongodb_uri = source_url
-        cratedb_uri = target_url
-        # What the hack?
-        if (
-            mongodb_uri.scheme.startswith("mongodb")
-            and Path(mongodb_uri.path).is_absolute()
-            and mongodb_uri.path[-1] != "/"
-        ):
-            mongodb_uri.path += "/"
-        if cratedb_uri.path[-1] != "/":
-            cratedb_uri.path += "/"
-        mongodb_query_parameters = mongodb_uri.query_params
-        mongodb_adapter = mongodb_adapter_factory(mongodb_uri)
+        address_pair_root = AddressPair(source_url=source_url, target_url=target_url)


This hackery has now also been refactored and wrapped away into an AddressPair object, in order to separate concerns...

amotl · 2024-09-13T08:30:33Z

cratedb_toolkit/io/mongodb/api.py

        for collection_path in collections:
-            mongodb_uri_effective = mongodb_uri.navigate(Path(collection_path).name)
-            mongodb_uri_effective.query_params = mongodb_query_parameters
-            cratedb_uri_effective = cratedb_uri.navigate(Path(collection_path).stem)
+            address_pair = address_pair_root.navigate(
+                source_path=Path(collection_path).name,
+                target_path=Path(collection_path).stem,
+            )
            tasks.append(
                MongoDBFullLoad(
-                    mongodb_url=mongodb_uri_effective,
-                    cratedb_url=cratedb_uri_effective,
+                    mongodb_url=address_pair.source_url,
+                    cratedb_url=address_pair.target_url,


... and to minimize its API surface. Now, you are just invoking .navigate() on that composite object instance and it will adjust its managed URL instances correspondingly.
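The hunk above derives the source path from `.name` and the target path from `.stem`. A quick illustration with hypothetical file paths shows the difference: the source address keeps the file extension, while the target name drops it. Note that `.stem` strips only the last suffix.

```python
from pathlib import Path

# Hypothetical collection file, as it might appear in a dump directory.
collection_path = Path("/data/dump/demo.json")

source_path = collection_path.name  # file name including extension
target_path = collection_path.stem  # file name without the last suffix

print(source_path)  # demo.json
print(target_path)  # demo

# Only the final suffix is removed, so multi-suffix names keep the rest.
print(Path("/data/dump/books.bson.gz").stem)  # books.bson
```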

seut · 2024-09-13T09:25:31Z

cratedb_toolkit/model.py

+        source_url_query_parameters = self.source_url.query_params
+        target_url_query_parameters = self.target_url.query_params


As the query params are not changed in this method, why do we need to store (and copy) them separately?

The fundamental .navigate() method on the URL object implicitly gets rid of them, but we want to propagate them, so we need to store and forward them explicitly.

It most probably makes more sense not to use .navigate() at all and to manipulate the .path attribute directly instead, because .navigate() apparently poses so many obstacles that need workarounds.

94b52a8 stops using the .navigate() method of the designated URL library, and uses standard urljoin() instead to compute and directly set the .path attribute.

The result looks more streamlined and compact than before, as it no longer needs to work around the obstacles of .navigate(), so there is no need to explicitly store and forward the URL query parameters. Thanks!
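To illustrate why computing only the path with `urljoin` leaves query parameters untouched, here is a minimal sketch using the standard library's `urllib.parse`. It is not the code from 94b52a8, which operates on URL objects rather than strings; the helper name `join_path` and the example URL with its query parameter are assumptions made for this example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_path(url: str, segment: str) -> str:
    """Append `segment` to the path of `url`, keeping the query string.

    Hypothetical helper, for illustration only.
    """
    parts = urlsplit(url)
    base_path = parts.path
    # Without a trailing slash, urljoin would replace the last path segment
    # instead of appending to it, so normalize the base path first.
    if not base_path.endswith("/"):
        base_path += "/"
    new_path = urljoin(base_path, segment)
    # Only the path component changes; scheme, netloc, query, and fragment
    # are carried over from the original URL.
    return urlunsplit(parts._replace(path=new_path))


result = join_path("mongodb://localhost:27017/testdrive?batch-size=100", "demo")
print(result)  # mongodb://localhost:27017/testdrive/demo?batch-size=100
```

The trailing-slash check mirrors the point that adjustments for missing trailing slashes still need to take place even when `.navigate()` is no longer used.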

seut · 2024-09-13T09:26:16Z

cratedb_toolkit/model.py

+        source_url_query_parameters = self.source_url.query_params
+        target_url_query_parameters = self.target_url.query_params
+
+        source_url = URL(str(self.source_url))


I guess there is no better API on the URL class to avoid string generation + parsing only to copy the instance, right?

It looks like standard deepcopy is just fine, as already applied in other places in this file. Improved with 2a5df04. Thanks!
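A small sketch of cloning with `deepcopy` instead of a serialize-then-parse round trip. `SimpleURL` is a hypothetical stand-in for a mutable URL object, used so the example runs with the standard library only; it does not reproduce the API of the URL class used in the project.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class SimpleURL:
    """Hypothetical stand-in for a mutable URL object."""

    scheme: str
    host: str
    path: str
    query_params: dict = field(default_factory=dict)


original = SimpleURL("mongodb", "localhost", "/testdrive/", {"batch-size": "100"})

# deepcopy clones the object together with its nested mutable state,
# without converting it to a string and parsing it again.
clone = deepcopy(original)
clone.path += "demo"
clone.query_params["limit"] = "10"

# The original is unaffected by changes made to the clone.
print(original.path)          # /testdrive/
print(original.query_params)  # {'batch-size': '100'}
print(clone.path)             # /testdrive/demo
```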

- Do not use the fundamental `.navigate()` method, as it needs too many workarounds.
- Do not store and copy query parameters, because the implementation does not use `.navigate()` any longer.
- Manipulate the `.path` property directly instead, computing it using the canonical `urljoin` function.
- Adjustments about missing trailing slashes still need to take place.

seut

👍 thx

amotl requested review from zolbatar and wierdvanderhaar September 11, 2024 11:44

amotl force-pushed the amo/mongodb-more-2 branch 2 times, most recently from 414e128 to e8001c0 Compare September 11, 2024 11:57

amotl marked this pull request as ready for review September 11, 2024 12:00

amotl mentioned this pull request Sep 11, 2024

MongoDB: General backlog #260

Open

8 tasks

amotl force-pushed the amo/mongodb-more-2 branch from e8001c0 to be1ca83 Compare September 11, 2024 12:35

cla-bot bot added the cla-signed label Sep 11, 2024

amotl requested a review from hammerhead September 11, 2024 14:37

seut reviewed Sep 12, 2024

amotl requested a review from seut September 12, 2024 15:08

amotl added 5 commits September 12, 2024 17:11

MongoDB: Unlock processing multiple collections

3322ce2

Either from server database, or from filesystem directory.

MongoDB: Process JSON files from HTTP resource, using https+bson://

d825f2f

MongoDB: Filter server collection using MongoDB query expression

42f22ee

MongoDB: Decrease default batch size to 100

6030d83

MongoDB: Cleanups. Tests. Hacks. This and that.

86b8024

amotl force-pushed the amo/mongodb-more-2 branch from 6bbda4d to db45c69 Compare September 12, 2024 15:11

amotl added 2 commits September 12, 2024 17:15

MongoDB: Improve dispatching of server- vs. file-based processing

05fa397

MongoDB: Avoid URL object <-> string conversions on a few spots

e129e26

amotl force-pushed the amo/mongodb-more-2 branch from db45c69 to e129e26 Compare September 12, 2024 15:17

MongoDB: Improve URL computation when transferring whole databases

9e52464

amotl mentioned this pull request Sep 13, 2024

MongoDB: Improve error handling wrt. bulk operations vs. usability #262

Merged

amotl commented Sep 13, 2024

seut reviewed Sep 13, 2024

Model: Use standard deepcopy method to clone boltons.urlutils.URL

2a5df04

amotl requested a review from seut September 13, 2024 09:47

seut approved these changes Sep 13, 2024

amotl merged commit a25127e into main Sep 13, 2024
26 checks passed

amotl deleted the amo/mongodb-more-2 branch September 13, 2024 13:17

amotl mentioned this pull request Sep 18, 2024

MongoDB: Code refactoring and generalization crate/commons-codec#49

Merged
