feat(datasets): Added the Experimental PolarsDatabaseDataset #990

Open
wants to merge 13 commits into
base: main
from

Conversation

MinuraPunchihewa
Contributor

@MinuraPunchihewa MinuraPunchihewa commented Jan 14, 2025

Description

This PR adds the PolarsDatabaseDataset to support interactions with databases using Polars.

Fixes #853

Development notes

Quite a bit of code has been copied over from SQLQueryDataset in this implementation.

These changes have been tested:

  1. Manually, by running the code locally to load and save Polars DataFrames from and to SQLite files.
  2. Via the existing and newly added unit tests.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@MinuraPunchihewa
Contributor Author

Hey @noklam, @deepyaman,
I was able to come up with this implementation for the PolarsDatabaseDataset by extending SQLQueryDataset and it seems to work quite well (at least load() does).

Should we implement save() as well? This would require a table name to be provided as a parameter.

Or do you have different thoughts on how this dataset ought to be implemented?

Contributor

@noklam noklam left a comment

It'd be great if there is at least one example on how to use this, maybe with a sqlite database to avoid the setup.

Comment on lines +31 to +44
def load(self) -> pl.DataFrame:
    load_args = copy.deepcopy(self._load_args)

    if self._filepath:
        load_path = get_filepath_str(PurePosixPath(self._filepath), self._protocol)
        with self._fs.open(load_path, mode="r") as fs_file:
            query = fs_file.read()
    else:
        query = load_args.pop("sql")

    return pl.read_database(
        query=query,
        connection=self._connection_str,
        **load_args
    )
Contributor

Can you add some docs or checks to reflect this logic?

  1. If filepath exists, use it.
  2. Otherwise, use sql.
  3. If both are defined, maybe error out, or at least log which one is being used?

Ideally I would reorder the arguments as well: the dataset takes sql as the first argument, but filepath actually has higher priority, which feels counter-intuitive.
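
One way the check requested above could look, sketched as a standalone function. `resolve_query` and its `read_file` parameter are hypothetical names, not part of this PR. The sketch takes the stricter of the two options and raises when both `sql` and `filepath` are given, rather than logging which one wins.

```python
from __future__ import annotations

from typing import Callable


def resolve_query(
    sql: str | None,
    filepath: str | None,
    read_file: Callable[[str], str],
) -> str:
    """Return the SQL text to run, requiring exactly one query source.

    Hypothetical helper: `read_file` stands in for however the dataset
    reads the query file (the PR's load() uses `self._fs.open`).
    """
    if sql and filepath:
        raise ValueError("'sql' and 'filepath' cannot both be provided.")
    if not (sql or filepath):
        raise ValueError("Provide either 'sql' or 'filepath'.")
    if filepath:
        return read_file(filepath)
    return sql


# Usage: a dict lookup plays the role of the filesystem here.
query_files = {"queries/shuttles.sql": "SELECT * FROM shuttles"}
print(resolve_query(None, "queries/shuttles.sql", query_files.__getitem__))
```

Running the check in the constructor rather than in `load()` would surface a bad catalog entry as soon as the dataset is created.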

@deepyaman
Member

Hey @noklam, @deepyaman, I was able to come up with this implementation for the PolarsDatabaseDataset by extending SQLQueryDataset and it seems to work quite well (at least load() does).

Sorry for the late response; didn't see this.

While there may be opportunities to reduce code copying more broadly, most datasets just inherit from AbstractDataset or AbstractVersionedDataset. Here, inheriting from pandas.SQLQueryDataset adds a pandas dependency, so we shouldn't do that.

Should we implement save() as well? This would require a table name to be provided as a parameter.

Probably, because Polars supports it. You can make the table name optional but require it for save.
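
A rough sketch of that suggestion: `table_name` stays optional at construction time, and `save()` fails fast when it is missing. `DatabaseDatasetSketch` and its `writer` argument are hypothetical stand-ins so the example runs without a database; in the real dataset the write would presumably go through Polars' `DataFrame.write_database`.

```python
from __future__ import annotations

from typing import Any, Callable


class DatabaseDatasetSketch:
    """Hypothetical stand-in for the dataset; not the class from this PR."""

    def __init__(
        self,
        connection_str: str,
        writer: Callable[[Any, str, str], None],
        table_name: str | None = None,
    ) -> None:
        self._connection_str = connection_str
        self._table_name = table_name
        # Injected for illustration: called as writer(data, table_name, connection_str).
        self._writer = writer

    def save(self, data: Any) -> None:
        # Loading only needs a query, so table_name is validated lazily here.
        if self._table_name is None:
            raise ValueError("'table_name' is required to save data.")
        self._writer(data, self._table_name, self._connection_str)


# Usage: record the write instead of touching a real database.
written: list[tuple[str, str]] = []
dataset = DatabaseDatasetSketch(
    "sqlite:///example.db",
    writer=lambda data, table, conn: written.append((table, conn)),
    table_name="shuttles",
)
dataset.save([{"id": 1}])
```

Validating in `save()` instead of `__init__` keeps read-only catalog entries valid without a table name.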

MinuraPunchihewa and others added 11 commits February 1, 2025 22:01
Signed-off-by: Minura Punchihewa <[email protected]>
@MinuraPunchihewa MinuraPunchihewa marked this pull request as ready for review February 1, 2025 17:11
@MinuraPunchihewa
Contributor Author

Hey @noklam and @deepyaman,
I've refactored this by removing the dependency on pandas.SQLQueryDataset as @deepyaman suggested. I copied over quite a bit of code from there though since these work quite similarly. I think the duplication can definitely be removed to a large extent by moving some of the common functions to _utils. Should I handle this in a different PR since pandas.SQLQueryDataset is already a fixed (non-experimental) dataset?
Let me know what you think and I will add the finishing touches to this PR (tests and so forth).

Successfully merging this pull request may close these issues.

Polars SQL datasets