Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor dataset diff and compute metric #381

Merged
merged 4 commits into from
Dec 18, 2022
Merged

Conversation

changhiskhan
Copy link
Contributor

@changhiskhan changhiskhan commented Dec 18, 2022

  1. Minor bug fixes
  2. Add ds.compute_metric(func) and ds.diff(v1, v2) to dataset interface

@changhiskhan changhiskhan requested a review from eddyxu December 18, 2022 21:31
@changhiskhan changhiskhan force-pushed the changhiskhan/refactor branch 2 times, most recently from f445bf6 to 9165a2c Compare December 18, 2022 21:54
@changhiskhan changhiskhan changed the title Refactor diff/compute_metric to be part of the dataset interface Refactor dataset diff and compute metric Dec 18, 2022

Parameters
----------
metric_func: FileSystemDataset -> pd.DataFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Callalble[[FileSystmeDataset], pd.DataFrame]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

def compute_metric(self, metric_func: Callable[[Dataset], pd.DataFrame],
versions: list[int] = None) -> pd.DataFrame:
"""
Compute a metric across versions for this dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main concerns are the stability / maturity of these two APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, moved to lance/__init__.py and not as a Dataset method

metric_func: FileSystemDataset -> pd.DataFrame
Function to compute some arbitrary metric for a given version
versions: list of int, default None
Compute for specified versions. Compute for all versions if None.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default behavior to just compute metrics for latest version ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you'd really need this method then. If you just wanted to compute on latest version, metric_func(ds) is simpler than compute_metrics(ds, metric_func). wdyt?

Copy link
Contributor

@eddyxu eddyxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@changhiskhan changhiskhan merged commit 0efbd0a into main Dec 18, 2022
@changhiskhan changhiskhan deleted the changhiskhan/refactor branch December 18, 2022 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants