Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Submission: Kmeans (Python) #11

Open
10 of 23 tasks
jamesh4 opened this issue Mar 16, 2020 · 3 comments
Open
10 of 23 tasks

Submission: Kmeans (Python) #11

jamesh4 opened this issue Mar 16, 2020 · 3 comments
Assignees

Comments

@jamesh4
Copy link

jamesh4 commented Mar 16, 2020


name: Kmeans
about: K-means package for python
title: K-means implementation from scratch
labels: 1/editor-checks, New Submission!
assignees: Bronwyn Baillie (@bbaillie), Shangjing Hu (@mirohu)


Submitting Author: Rob Blumberg (@RobBlumberg ), Sreejith Munthikodu (@sreejithmunthikodu ), Saurav Chowdhury (@saurav193 ), James Huang (@jamesh4 )
Package Name: Kmeans
One-Line Description of Package: K-means implementation from scratch
Repository Link: https://github.com/UBC-MDS/Kmeans_python
Version submitted: https://github.com/UBC-MDS/Kmeans_python/tree/v1.0.0
Editor: TBD
Reviewer 1: Bronwyn Baillie (@bbaillie)
Reviewer 2: Shangjing Hu (@mirohu)
Archive: TBD
Version accepted: TBD


Description

  • Include a brief paragraph describing what your package does:
    This package includes python functions that implement k-means clustering from scratch. This will work on any dataset with valid numerical features, and includes fit, predict, and cluster_summary functions, as well as as elbow and silhouette methods for hyperparameter “k” optimization. A high level overview of each function is given below. See each function’s documentation for more details.

Scope

  • Please indicate which category or categories this package falls under:
    • Data exploration
    • Data retrieval
    • Data extraction
    • Data munging
    • Data deposition
    • Reproducibility
    • Geospatial
    • Education
    • Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
* The data exploration category was added after consultation with @kvarada since the other categories don't accurately describe this package.

  • Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

This package implements the k-means algorithm, a data mining and clustering technique used to uncover relationships in unlabelled data.

  • Who is the target audience and what are scientific applications of this package?

This package was created to provide a deeper understanding of the k-means clustering algorithm. Thus, this package is targeted to anyone interested in diving into the implementation of this clustering technique.

  • Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There is a python package sklearn.cluster.KMeans that has similar functions. This package is not meant to add to the existing ecosystem; it is rather intended to deepen fundamental understanding of these algorithms.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

We spoke to @kvarada and she approved the addition of the data exploration category.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has an OSI approved license
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: Do not submit your package separately to JOSS

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

@bbaillie
Copy link

bbaillie commented Mar 21, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all user-facing functions
  • Examples for all user-facing functions
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
  • Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
  • Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
  • Installation instructions
  • Any additional setup required (authentication tokens, etc)
  • Brief demonstration usage
  • Direction to more detailed documentation (e.g. your documentation files or website).
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages
  • Citation information

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
  • Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

JOSS Checks

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2


Review Comments

Great work on this package folks! At first glance all the requirements are met except for the requirement to have all the functionality in your vignettes run successfully locally. Your badges are displaying correctly, and your code coverage is quite impressive, so well done.

In your usage section of your readme, which I am considering to be your vignettes, there are a couple small issues.

  • In the usage of ft.py, if you include centers, labels = fit(X, 2) instead of just fit(X, 2) then the output of
    (array([[ 1, 2], [10, 2]]), array([0, 0, 0, 1, 1, 1]))
    would not be printed to screen, because it would have been saved as a variable.

  • In the usage of elbow.py your output is incomplete. It should be
    (alt.Chart(...), [2.8284271247461903, 1.4142135623730951, 0.0, 1.4142135623730951])
    instead of just
    alt.chart.

  • In the usage of silhouette.py you don't include an import statement so the code fails to run, and the output is again incorrect. It should be
    (0 0.713348 1 0.436301 2 0.166667 3 0.083333 Name: Score, dtype: float64, alt.Chart(...))
    instead of
    alt.chart.

  • In the usage of cluster_summary.py there is a typo in the imports line causing the code to throw an error. It's missing the 's' in 'Means' so it should be
    from Kmeans_python.cluster_summary import cluster_summary
    instead of
    from Kmean_python.cluster_summary import cluster_summary.

  • This code also gives an error because cluster_summary() is missing the centroids parameter

Upon further inspection of the package I found a few other things that you may want to change.

For starters, there is a misplaced capitol 's' in the very first line of your README file: "This package includeS python packages...". Probably not what you want in the first part of your package that people will see.

Continuing with the readme, I really liked that you put your names at the top (I can't believe my group didn't do this). That being said, I think including your last names would make your README appear more polished and professional, and would better reflect the level of quality of your work.

Also in your readme was your installation section, which mentioned that you had yet to deploy the package to pypi, but provides the installation instructions for once you have. Since writing that section you must have deployed your package to pypi, because I was able to install it successfully. So that section could use an update to reflect that.

Moving on to the code itself, I ran the examples in the documentation of your functions to get a feel for the package and make sure all the functions worked, and it turns out that the examples in the documentation of the cluster_summary() and elbow() functions throw errors. They're quick fixes, but I think it's important to have examples in your documentation that you can just copy and paste and have them run. The cluster_summary() example is missing the required centroids parameter, and the elbow() example includes the output of the example as a line of code, which obviously causes errors if you try to run it, so it should either be commented out or removed from the example.

As far as I could tell there were no issues in the rest of the package, so great job! You should be really proud of yourselves; this project wasn't easy.

@mirohu
Copy link

mirohu commented Mar 21, 2020

Review Template

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all user-facing functions
  • Examples for all user-facing functions
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
  • Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
  • Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
  • Installation instructions
  • Any additional setup required (authentication tokens, etc)
  • Brief demonstration usage
  • Direction to more detailed documentation (e.g. your documentation files or website).
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages
  • Citation information

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
  • Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1


Review Comments

Well done guys! I can tell you have put a lot of effort into this. The whole structure of the github page looks good, the README.md file states everything we need and the functions run smoothly!!

Documentation

The documentation looks good. Only a small thing: it would be nice if you can include the author's emails in the CONTRIBUTORS.md file.

Readme

I especially liked all the examples you provided in the usage section! It's very helpful when I tried to re-run the function. I noticed a few things that can be further improved:

  • under the sillouette.py section doesn't have the import line. It would be good to make it consistent with other functions. I may prefer to have all the non-Kmeans_python import on the top to avoid repetition;
  • under the cluster_summary.py section, the import line "from Kmean_python.cluster_summary import cluster_summary" had a typo and running the example threw an error because of a missing input (the error says it's missing cluster_assignments input);
Functionality

For some reason that I couldn't install the package according to the guidelines. I have included the error messages as follow:

ERROR: No matching distribution found for altair<5.0.0,>=4.0.1 (from kmeans-python)

I am assuming this is telling me that my altair is not 4.0.1 so it couldn't be installed? I would look into that as dependencies listed altair 3.5 in the README.md file.

This is what I have so far. Other than these, everything looks great. The CI worked amazing and the badges look beautiful. I know how hard it can be to make it work. Good job guys!

@sreejithmunthikodu
Copy link

sreejithmunthikodu commented Mar 26, 2020

Issues Addressed

Our team really appreciates the valuable feedbacks from both the reviewers. They really helped us to fix many bugs and improved the package documentation. We are pleased to announce that we could address all the issues in our latest release. We thank both the reviewers for their time and effort.
Latest Release

Reviewer 1 - All issues addressed

Following issues are addressed in the latest release

  • In the usage of ft.py, if you include centers, labels = fit(X, 2) instead of just fit(X, 2) then the output of (array([[ 1, 2], [10, 2]]), array([0, 0, 0, 1, 1, 1])) would not be printed to screen, because it would have been saved as a variable. Captured outputs in variables

  • In the usage of elbow.py your output is incomplete. It should be
    (alt.Chart(...), [2.8284271247461903, 1.4142135623730951, 0.0, 1.4142135623730951])
    instead of just alt.chart. Corrected

  • In the usage of silhouette.py you don't include an import statement so the code fails to run, and the output is again incorrect. It should be
    (0 0.713348 1 0.436301 2 0.166667 3 0.083333 Name: Score, dtype: float64, alt.Chart(...))
    instead of alt.chart. Code has been corrected and it works now

  • In the usage of cluster_summary.py there is a typo in the imports line causing the code to throw an error. It's missing the 's' in 'Means' so it should be
    from Kmeans_python.cluster_summary import cluster_summary
    instead of
    from Kmean_python.cluster_summary import cluster_summary.
    Typo has been corrected

  • Usage code on cluster_summary.py also gives an error because cluster_summary() is missing the centroids parameter Corrected

  • Updated README.md. Corrected typos, updated author names with last names and updated the package installation guide.

  • Ensured all code snippets in the usage on README.md actually works

Reviewer 2 - All issues addressed

Following issues are addressed in the latest release

  • under the sillouette.py section doesn't have the import line. It would be good to make it consistent with other functions. Added import line

  • under the cluster_summary.py section, the import line "from Kmean_python.cluster_summary import cluster_summary" had a typo and running the example threw an error because of a missing input (the error says it's missing cluster_assignments input); Corrected the code

  • Updated installation instructions to enable installing dependencies from PyPI, outside TestPyPi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants