feat(datasets): Added the Experimental PolarsDatabaseDataset #990

MinuraPunchihewa · 2025-01-14T18:08:47Z

Description

This PR adds the PolarsDatabaseDataset to support interactions with databases using Polars.

Fixes #853

Development notes

Quite a bit of code has been copied over from SQLQueryDataset in this implementation.

These changes have been tested:

1. Manually, by running the code locally to load and save Polars DataFrames from and to SQLite files.
2. ~~Via the existing and newly added unit tests.~~

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes
Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

Signed-off-by: Minura Punchihewa <[email protected]>

MinuraPunchihewa · 2025-01-14T18:11:32Z

Hey @noklam, @deepyaman,
I was able to come up with this implementation for the PolarsDatabaseDataset by extending SQLQueryDataset and it seems to work quite well (at least load() does).

Should we implement save() as well? This would require a table name to be provided as parameter.

Or do you have different thoughts on how this dataset ought to be implemented?

noklam

It'd be great to have at least one example of how to use this, maybe with a SQLite database to avoid the setup.

kedro-datasets/kedro_datasets_experimental/polars/polars_database_dataset.py

noklam · 2025-01-27T15:28:17Z


+    def load(self) -> pl.DataFrame:
+        load_args = copy.deepcopy(self._load_args)
+
+        if self._filepath:
+            load_path = get_filepath_str(PurePosixPath(self._filepath), self._protocol)
+            with self._fs.open(load_path, mode="r") as fs_file:
+                query = fs_file.read()
+        else:
+            query = load_args.pop("sql")
+
+        return pl.read_database(
+            query=query,
+            connection=self._connection_str,
+            **load_args


Can you add some docs or checks to reflect this logic?

If filepath exists, use it.

Otherwise, use sql.

If both are defined, maybe error out or at least log which one is being used?

Ideally I would re-order the arguments as well, since the dataset takes sql as the first argument but filepath actually has higher priority, which feels counter-intuitive.
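
For illustration, the precedence check requested above could look roughly like the following. This is a minimal sketch, not the PR's code: `resolve_query_source` is a hypothetical helper name, and it takes the stricter of the two suggested options (raising an error when both `sql` and `filepath` are given, rather than logging).

```python
from __future__ import annotations


def resolve_query_source(
    sql: str | None = None, filepath: str | None = None
) -> tuple[str, str]:
    """Decide where the query comes from and reject ambiguous input.

    Hypothetical helper for illustration; not part of kedro-datasets.
    Returns a (source_kind, value) pair.
    """
    if sql and filepath:
        # Both given: fail loudly instead of silently preferring filepath.
        raise ValueError(
            "'sql' and 'filepath' are mutually exclusive; provide only one."
        )
    if not (sql or filepath):
        raise ValueError("Provide either 'sql' or 'filepath'.")
    if filepath:
        return ("filepath", filepath)
    return ("sql", sql)
```

Running a check like this in the constructor would surface a misconfigured catalog entry when the dataset is created rather than at load time.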

deepyaman · 2025-01-27T17:08:04Z

> Hey @noklam, @deepyaman, I was able to come up with this implementation for the PolarsDatabaseDataset by extending SQLQueryDataset and it seems to work quite well (at least load() does).

Sorry for the late response; didn't see this.

While there may be opportunities to reduce code copying more broadly, most datasets just inherit from AbstractDataset or AbstractVersionedDataset. Here, inheriting from pandas.SQLQueryDataset adds a pandas dependency, so we shouldn't do that.

> Should we implement save() as well? This would require a table name to be provided as parameter.

Probably, because Polars supports it. You can make the table name optional but require it for save.
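
As a rough sketch of the "optional, but required for save" idea: the class below is a hypothetical stand-in that uses only the standard-library `sqlite3` module and plain tuples in place of Polars DataFrames. The class name, constructor parameters, and fixed two-column schema are assumptions made for illustration; they are not the PR's API.

```python
from __future__ import annotations

import sqlite3


class SketchTableDataset:
    """Hypothetical stand-in for a database dataset; not the PR's class."""

    def __init__(
        self,
        connection: sqlite3.Connection,
        sql: str,
        table_name: str | None = None,
    ) -> None:
        self._conn = connection
        self._sql = sql
        # Optional: loading only needs the query, saving needs a target table.
        self._table_name = table_name

    def load(self) -> list[tuple]:
        return self._conn.execute(self._sql).fetchall()

    def save(self, rows: list[tuple[int, str]]) -> None:
        if self._table_name is None:
            raise ValueError("'table_name' is required to save data.")
        # Quote the identifier so unusual table names stay valid SQL.
        table = '"' + self._table_name.replace('"', '""') + '"'
        self._conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER, name TEXT)"
        )
        self._conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
        self._conn.commit()
```

With this shape, a read-only catalog entry can omit the table name, and the error only appears if someone tries to save through it.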


MinuraPunchihewa · 2025-02-01T17:16:21Z

Hey @noklam and @deepyaman,
I've refactored this by removing the dependency on pandas.SQLQueryDataset as @deepyaman suggested. I copied over quite a bit of code from there though since these work quite similarly. I think the duplication can definitely be removed to a large extent by moving some of the common functions to _utils. Should I handle this in a different PR since pandas.SQLQueryDataset is already a fixed (non-experimental) dataset?
Let me know what you think and I will add the finishing touches to this PR (tests and so forth).

MinuraPunchihewa added 2 commits January 14, 2025 23:35

added the initial skeleton for the polars database dataset

e5a704d

Signed-off-by: Minura Punchihewa <[email protected]>

updated the implementation by extending SQLQueryDataset

6fc01a8

Signed-off-by: Minura Punchihewa <[email protected]>

noklam reviewed Jan 27, 2025


MinuraPunchihewa and others added 11 commits February 1, 2025 22:01

removed dependency on SQLQueryDataset

87766d5

Signed-off-by: Minura Punchihewa <[email protected]>

added the missing func to adapt mssql params

0520de6

Signed-off-by: Minura Punchihewa <[email protected]>

implemented save()

4a57c15

Signed-off-by: Minura Punchihewa <[email protected]>

added missing import

5ec37cd

Signed-off-by: Minura Punchihewa <[email protected]>

introduced the save_args param

c410cf3

Signed-off-by: Minura Punchihewa <[email protected]>

implemented _describe()

b1df871

Signed-off-by: Minura Punchihewa <[email protected]>

updated the docstring for the database

54afac3

Signed-off-by: Minura Punchihewa <[email protected]>

fixed save()

e6910fe

Signed-off-by: Minura Punchihewa <[email protected]>

updated the required params

b477fce

Signed-off-by: Minura Punchihewa <[email protected]>

fixed lint issues

9423fb8

Signed-off-by: Minura Punchihewa <[email protected]>

Merge branch 'main' into feature/polars_database_dataset

5083d81

MinuraPunchihewa marked this pull request as ready for review February 1, 2025 17:11
