Hamilton #74
Comments
Hi @skrawcz welcome to pyOpenSci! Thank you for your presubmission inquiry. I'll help you figure out if the package is in scope, and if we should move to a full submission, at which time I would find an editor. The categories you have indicated look right to me. A couple of questions:
Could you please clarify how Hamilton relates to those other tools?
Can you talk about overlap with snakemake? If you have examples where researchers are already using Hamilton, that would help. And please just say a little generally about what the Hamilton authors are hoping to achieve by a pyOpenSci review. To be clear, I fully agree with you that Hamilton is a tool that scientists could potentially use for munging and reproducibility. Thank you! edit: removed comment that could give the wrong impression about our scope
Sure. Hamilton does not replace airflow, metaflow, dagster, luigi, etc. They focus on the "macro" scheduling problem, and are systems that require state to be managed. Hamilton focuses on the "micro", i.e. what people do within the step of an airflow, metaflow, dagster, luigi, etc. task. Hamilton replaces lines of logic with functions and tries to make that part of a code base testable, documentation friendly, and maintainable, which is not the goal of those other systems. For example, here's a blog post showing Hamilton + Metaflow - Hamilton helps with the feature engineering task, and metaflow does the macro orchestration. Other differences:
From snakemake:
It sounds like Snakemake is basically an orchestration system. Hamilton is much lighter-weight and focused only on pure Python. Given the feature set snakemake has, and if that's the bar for reproducibility, then I don't think Hamilton meets it. I'll retract that category.
(1) I saw pandera on the list, and I think it's a good tool for a scientist to know about (Hamilton supports integration with it), and thus thought to myself that Hamilton would be a fit here. Thanks for the questions -- let me know what I can clarify or what you want to dive deeper on.
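To make the "micro" point above concrete, here is a minimal sketch in plain Python of replacing a line of script logic with a named function. It does not use the Hamilton library; the `sales` data, the `sales_change` function, and the test are hypothetical names invented for illustration.

```python
# Script style: the derived column is one line inside a longer script,
# so it can only be exercised by running the whole script.
raw = {"sales": [100.0, 120.0, 90.0, 150.0]}
raw["sales_change"] = [0.0] + [b - a for a, b in zip(raw["sales"], raw["sales"][1:])]


# Function style: the same logic as a named, documented function.
# (Hypothetical example; not taken from the Hamilton code base.)
def sales_change(sales: list) -> list:
    """Period-over-period change in sales; the first period is 0.0."""
    return [0.0] + [b - a for a, b in zip(sales, sales[1:])]


def test_sales_change():
    # The function can be unit tested on a tiny input, without the script.
    assert sales_change([1.0, 3.0, 2.0]) == [0.0, 2.0, -1.0]


test_sales_change()
```

The docstring and the test attach to the function itself, which is the testability and documentation benefit described above. An orchestrator such as airflow or metaflow would still decide when the surrounding task runs.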
Thank you @skrawcz, that's very helpful. I think I understand that Hamilton is a pure Python way to do feature engineering (in brief). I'm glad you mentioned pandera, it did also occur to me they might work well together. If there are any public repositories from nat'l labs like PNNL that would provide examples of using Hamilton in the wild, that could also help us see its application to open science. I hear you that one of your goals is to better understand how to reach this community of users. So your reasons for seeking review make sense to me and make me feel like this could be in scope. I am discussing with the executive director and other editors. Please let me get back to you with any further questions or a decision by Monday at the latest.
Sure -- here's what I've found in my notes:
Hi again @skrawcz, thank you for providing that example. It's very helpful to see. We have decided that yes, we will review the package, as it is in scope. Some context on the decision, for you, for our future reference, and for transparency: as I noted above, we see that Hamilton has already had support for its development, and there is a proceedings paper, although a publication review is not the same as our software review. One of our goals is to provide resources to packages that have not yet enjoyed this kind of support. But it is also within our scope to help build consistency across the whole scientific Python ecosystem. You have clearly shown that (1) you are interested in participating in this process as an author and (2) there are researchers using the code now who are part of a community we want to build connections with. For those reasons we will proceed with a review. I expect that we will find an editor by early next week. @skrawcz could you please go ahead and make a full submission issue? I will close this one once you have done so.
Awesome, thanks @NickleDave. I'll get started on the full issue. Will finish it in the next 24-72 hours or so :)
Okay I did #80; still working on JOSS section, otherwise I think I filled it out appropriately.
Closing this since full submission is in #80
Submitting Author: Stefan Krawczyk (@skrawcz)
Package Name: Hamilton (sf-hamilton on pypi)
One-Line Description of Package: A general purpose micro-framework for defining dataflows.
Repository Link (if existing): https://github.com/stitchfix/hamilton
Description
Hamilton is a general purpose micro-framework for creating dataflows from Python functions! Specifically, Hamilton defines a novel paradigm that allows you to specify a flow of (delayed) execution that forms a Directed Acyclic Graph (DAG). It was originally built to solve the challenges of wrangling and maintaining production code that creates wide (1000+ column) dataframes, but has been extended to model the generation of any Python object. Core to the design of Hamilton is a clear mapping of function name to dataflow output. That is, Hamilton forces a declarative paradigm expressed through writing Python functions, and aims for DAG clarity, low code upkeep costs, and ease of modification, with code that is always unit testable and naturally documentable.
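The mapping of function name to dataflow output can be illustrated with a small sketch. This is not Hamilton's implementation or API; it is a toy resolver written in plain Python under the assumption that each function name is an output and each parameter name is a dependency. The helper `resolve` and the example functions and inputs are hypothetical.

```python
import inspect


# Each function name is an output; each parameter name is a dependency.
def spend_per_signup(spend: list, signups: list) -> list:
    """Marketing spend divided by signups, element-wise."""
    return [s / n for s, n in zip(spend, signups)]


def avg_spend_per_signup(spend_per_signup: list) -> float:
    """Mean of spend_per_signup."""
    return sum(spend_per_signup) / len(spend_per_signup)


def resolve(name, functions, inputs, cache=None):
    """Hypothetical helper: compute `name` by recursively resolving its parameters.

    Nothing runs until an output is requested (delayed execution).
    This toy version does no cycle detection.
    """
    cache = {} if cache is None else cache
    if name in inputs:
        return inputs[name]
    if name not in cache:
        fn = functions[name]
        kwargs = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


functions = {f.__name__: f for f in (spend_per_signup, avg_spend_per_signup)}
inputs = {"spend": [10.0, 20.0, 30.0], "signups": [1, 2, 5]}
result = resolve("avg_spend_per_signup", functions, inputs)
```

Because dependencies are declared only through parameter names, the DAG is implied by the function signatures, and each function can be unit tested in isolation.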
Scope
Please indicate which category or categories this package falls under:
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:
data munging
Hamilton was built for a team to manage their time-series forecasting feature engineering. So its design goal was to help data science teams maintain data munging code well.
reproducibility
Core to reproducibility is sharing code. Most researchers only share data, not their code. We believe that with Hamilton, one could more easily share their implementation, in a standardized way that is approachable to a broad audience.
data extraction
Kind of unsure here. But Hamilton helps you structure and "orchestrate" the code that does extraction.
data retrieval
Kind of unsure here. But Hamilton helps you structure and "orchestrate" the code that does retrieval.
Anyone doing any data transformations in python.
Scientific applications: time-series forecasting, any machine learning, any work that involves executing a dataflow.
None that the author is aware of.
N/A
P.S. *Have feedback/comments about our review process? Leave a comment here