Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Histogram distribution #382

Merged
merged 14 commits into from
Jun 22, 2024
Merged

[ENH] Histogram distribution #382

merged 14 commits into from
Jun 22, 2024

Conversation

ShreeshaM07
Copy link
Contributor

@ShreeshaM07 ShreeshaM07 commented Jun 11, 2024

Reference Issues/PRs

fixes #323
PR is a new one with updated main merged with #335 as there were some merge conflicts in __init__.py in distributions and also distributions.rst in api_reference.

What does this implement/fix? Explain your changes.

Implements the histogram distribution using bins and bin_mass as the parameters.

Does your contribution introduce a new dependency? If yes, which one?

No

Did you add any tests for the change?

Yes

Any comments

All the discussion regarding the Histogram discussion has been discussed in the PR #335.

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@ShreeshaM07 ShreeshaM07 mentioned this pull request Jun 11, 2024
5 tasks
@fkiraly
Copy link
Collaborator

fkiraly commented Jun 12, 2024

merged #381 into main, and then from main into this.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Really nice work, I like the design thinking in the new base class.

Before we merge, the current "template" pattern does not allow override of the public methods pdf etc, or we diverge from the template pattern, see https://github.com/kamranahmedse/design-patterns-for-humans for the pattern.

Could you think of the best way to avoid the override? Or, let me know if you think this is not possible with the current design.

We do not need to merge the two base classes together btw, just avoid the override.

@ShreeshaM07
Copy link
Contributor Author

I hope this will suffice as I am not overriding any public methods anymore.

@ShreeshaM07
Copy link
Contributor Author

@fkiraly , All the checks have passed after the overriding of public methods were avoided as well. Are we waiting on something for this merge?

@ShreeshaM07
Copy link
Contributor Author

ShreeshaM07 commented Jun 18, 2024

@fkiraly , I modified the mean and variance to be calculated after vectorizing the bins and bin_mass and found the efficiency to be improved by only around 5 to 6 times. This is excluding the time taken to vectorize the ragged inputs.

These are the outputs

  • current mean time
    curr_mean

  • vectorized mean time
    vectorized_mean

  • current variance time
    curr_var

  • vectorized variance time
    vectorized_var

This involved changing most of the code just for the mean and variance itself which doesn't have all the if conditions like in pdf,cdf,ppf etc which will make slightly more complex. I am not very sure if the effort is worth the reward. If we really would like to have it then maybe we can do it in a separate PR as it will involve a lot of changes to the code.

Thoughts?

@fkiraly
Copy link
Collaborator

fkiraly commented Jun 20, 2024

a factor of 5 or 6 is considered substantial, so I would advise we check how much time the conversion of arrays requires.

I would then proceed based on the outcoe:

  • if overall runtimes are comparable, I would advise we merge whatever functions we have currently vectorized, and open an issue. This is because that means, a different parameter representation might eventually make the computations more performant.
  • if overall runtime with conversion is worse, I would advise we pin the matter in the form of an issue and merge the current version.
  • If it still is a substantial speed-up, I would advise vectorizing a few key methods in addition.

@ShreeshaM07
Copy link
Contributor Author

This is including the vectorization time. With 2 cases of vectorization time inclusion case 1 when it is not vectorized in init but in separate functions and case 4 when it is vectorized in init and not in separate functions.

  • mean

    • vectorization time included in each function
      vec_mean_with_vec_time
    • vectorization time not included but vectorized
      vectorized_mean
    • No vectorization at all
      curr_mean
    • vectorized in init itself not in separate methods(i.e sum of time to vectorize + sum of time to run the mean operations after vectorizing)
      vec_init_mean
  • variance

    • vectorization time included in each function
      vec_var_with_vec_time
    • vectorization time not included but vectorized
      vectorized_var
    • no vectorization at all
      curr_var
    • vectorized in init itself not in separate methods(i.e sum of time to vectorize + sum of time to run the mean operations after vectorizing)
      vec_init_var

As you can see it actually gets worse when it is vectorized in __init__. The time for vectorizing these ragged inputs itself is about 0.0002 secs after which it takes time to calculate mean, var etc. So even though the time to calculate variance etc gets optimised by a factor of 5 or 6. The initializing itself is taking a lot of time.
This should come under the category 2

if overall runtime with conversion is worse, I would advise we pin the matter in the form of an issue and merge the current version.

Please do give your take based on your inference of the data.

@fkiraly
Copy link
Collaborator

fkiraly commented Jun 21, 2024

Hm, I would then agree that the payoff is perhaps not worth the effort at the moment.

I would suggest to summarize the findings and outstanding questions in an issue, and perhaps at a later time point review together:

  • the design of the intermediate base class from this PR
  • the designs and parametrizations of the histogram distribution and two similar distributions: Empirical, and Mixture

@ShreeshaM07
Copy link
Contributor Author

Hm, I would then agree that the payoff is perhaps not worth the effort at the moment.

I would suggest to summarize the findings and outstanding questions in an issue, and perhaps at a later time point review together:

  • the design of the intermediate base class from this PR
  • the designs and parametrizations of the histogram distribution and two similar distributions: Empirical, and Mixture

Yeah I will summarize the findings of this PR in an issue.

@fkiraly fkiraly merged commit 5afdcec into sktime:main Jun 22, 2024
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[ENH] histogram distribution
2 participants