Notes And Ideas
We need a complete "stdlib".
- Joins
- Splits
- Extract/Load from usual services
- Dictionary maps
- Outputs (console ...)
Graph elements (or data transformations) should be able to define what kind of dependencies they have, and let the execution context inject the real implementation.
- Databases
- Filesystems
- API-based services (and especially OAuth-orized services)
- HTTP client (with or without caching, and/or proxies)
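One way to picture this kind of dependency declaration is sketched below. It is only an illustration, not bonobo's API: `run_with_services`, `FakeHttp` and `fetch_title` are hypothetical names, and the convention that keyword-only parameters name the required services is an assumption made for the example.

```python
import inspect


def run_with_services(func, services, *args):
    """Call func, filling its keyword-only parameters from the services mapping.

    Hypothetical helper: the "execution context" here is just a dict.
    """
    injected = {}
    for name, param in inspect.signature(func).parameters.items():
        if param.kind is inspect.Parameter.KEYWORD_ONLY:
            if name not in services:
                raise KeyError("missing service: " + name)
            injected[name] = services[name]
    return func(*args, **injected)


class FakeHttp:
    """Stand-in HTTP client; a real context could inject a caching or proxied one."""

    def get(self, url):
        return "<fake response for {}>".format(url)


def fetch_title(row, *, http):
    # The transformation only declares that it needs something called "http".
    return row, http.get(row)


result = run_with_services(fetch_title, {"http": FakeHttp()}, "https://example.com/")
```

The transformation never imports a concrete client, so the same node could run against a real service in production and a fake one in tests.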
Printing to stdout/stderr has a few problems right now, especially when using a threaded strategy. Can we replace stdout/stderr with a buffer collection that outputs everything through a plugin?
We need a way to create a standard ETL project and update it, plus directives for structuring projects coherently. There is already partial work through the optional edgy.project dependency, which enables the "bonobo init" command line task, but it lacks explicit directives about how to structure things once you're there.
A lot of things are "about correct" for 1.0, but the class tree is not complete. We should define a 1.0 (as in semver major version 1) class tree and API.
- Graph
- Node
- Execution contexts (GraphExecutionContext, NodeExecutionContext, PluginExecutionContext, ...)
- Execution strategies (Naive, Pool, ThreadPool, ProcessPool, ...)
- Bags (Bag, ErrorBag, InheritingBag, ...)
Sometimes, input-to-output order does not really matter. Using an event loop and asynchronous functions would be a nice way to express that, as it allows one row to take longer to transform than the next without blocking it. HTTP (or other network) related transformations are great use cases.
Context processors are currently a way to inject variables into a transformation, tied to an execution context. This keeps the transformations stateless, which is great, but maybe it's not easy enough to understand and use? Let's think more about that.
Can we provide better type hints for editors like PyCharm?
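A minimal sketch of the unordered idea using only the standard asyncio module. `transform` and `run_unordered` are hypothetical names, and the sleep stands in for a network call; nothing here reflects an existing bonobo strategy.

```python
import asyncio


async def transform(row, delay):
    # Simulate a network-bound transformation whose duration varies per row.
    await asyncio.sleep(delay)
    return row.upper()


async def run_unordered(rows_with_delays):
    # Start every transformation, then collect results as each one finishes,
    # regardless of the order in which the rows came in.
    tasks = [asyncio.ensure_future(transform(row, delay)) for row, delay in rows_with_delays]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results


# The slow row is submitted first but does not hold back the fast one.
results = asyncio.run(run_unordered([("slow", 0.2), ("fast", 0.01)]))
```

Note that `asyncio.run` requires Python 3.7 or later; on 3.5 the same coroutine could be driven with an explicit event loop.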
Still a lot of things to do here for a 1.0 candidate.
As we require Python 3.5+, we can assume annotations are always available. It would be great to use them to define input (and output) data validation, while also improving the readability of transformations.
For now, any exception stops the flow for the current row but lets the transformation graph continue its execution with the next row.
Sometimes, the error means we won't be able to finish the transformation at all, and there should be a way to pass this message to the executor.
Can we return a token to say "I will ignore whatever you send me next, no need to compute more"? What side effects does that have (for example, how do we tell reads and writes apart, or check that nobody else uses the read data before stopping the reader)?
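As an illustration of annotation-driven validation, here is a small sketch. The `validated` decorator is hypothetical (not part of bonobo) and, as a simplifying assumption, only checks annotations that are plain classes such as `str` or `int`.

```python
import functools
import inspect


def validated(func):
    """Check arguments and return value against plain-class annotations."""
    sig = inspect.signature(func)

    def check(label, value, expected):
        # Skip missing annotations and anything that is not a plain class.
        if expected is inspect.Parameter.empty or not isinstance(expected, type):
            return
        if not isinstance(value, expected):
            raise TypeError(
                "{} should be {}, got {}".format(label, expected.__name__, type(value).__name__)
            )

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            check(name, value, sig.parameters[name].annotation)
        result = func(*args, **kwargs)
        check("return value", result, sig.return_annotation)
        return result

    return wrapper


@validated
def full_name(first: str, last: str) -> str:
    return first + " " + last


name = full_name("Ada", "Lovelace")
```

The signature doubles as documentation: a reader sees what the transformation expects and produces without opening its body.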
Can we interrupt, serialize, resume, step-debug a transformation graph?
Configurables should support positional arguments, and stdlib classes that have required, obvious kwargs should use them for those.
Cases that CW Andrews pointed out
How to "run(checkifnew,shouldexit,...actual processing...)"
For example: compare the modification datetimes of two files and return a boolean indicating whether the second is newer than the first. Then you could have a "should exit" block that gracefully exits if it receives False, followed by one that takes a str and returns a str with certain characters replaced or removed.
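The three blocks described above could look roughly like this in plain Python. This is a sketch outside of any graph: `is_newer`, `strip_characters` and `process` are hypothetical names, and "gracefully exit" is modelled simply as returning an empty result.

```python
import os
import tempfile


def is_newer(first_path, second_path):
    """Return True when second_path was modified more recently than first_path."""
    return os.path.getmtime(second_path) > os.path.getmtime(first_path)


def strip_characters(text, unwanted="-_"):
    """Return text with every character listed in unwanted removed."""
    return text.translate(str.maketrans("", "", unwanted))


def process(first_path, second_path, rows):
    # "Should exit" step: stop quietly when there is nothing new to process.
    if not is_newer(first_path, second_path):
        return []
    return [strip_characters(row) for row in rows]


# Small demonstration with two temporary files and explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "old.txt")
    new = os.path.join(tmp, "new.txt")
    for path in (old, new):
        with open(path, "w") as handle:
            handle.write("x")
    os.utime(old, (1000, 1000))
    os.utime(new, (2000, 2000))
    processed = process(old, new, ["a-b_c"])
    skipped = process(new, old, ["a-b_c"])
```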
The behaviour is that “bonobo run” looks for a graph instance in your file and sets __name__ to something other than __main__, so it takes this graph and runs it. If you use inspect, it's the same thing: it looks for the graph and generates the .dot output. The __main__ section is not required, but it's useful if you want to be able to run the file with the “raw” Python interpreter.
Think of it like this: “bonobo run” in a shell does the same thing as bonobo.run() in a .py file. I try to avoid magic as much as possible, but that sounded coherent to me. I should probably look for a more explicit way to do it.
I’d say “drop the __main__ block and use the bonobo CLI” as the guideline, unless you want to integrate your ETL jobs into another system.
I note that it’s confusing, so I’ll probably drop all the __main__ blocks from the documentation and add a page about integration with other tools, which will be the only reference to bonobo.run(). Thanks for that, I never noticed it was confusing.
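The loading behaviour discussed above can be imitated with the standard runpy module. This is only a sketch of the general mechanism, not bonobo's actual implementation: the `Graph` class below is a placeholder (not bonobo.Graph), and `find_graph` and the `__not_main__` run name are hypothetical.

```python
import os
import runpy
import tempfile


class Graph:
    """Placeholder standing in for a real graph class."""

    def __init__(self, *nodes):
        self.nodes = nodes


def find_graph(path, graph_type):
    # Execute the file with __name__ set to something other than "__main__",
    # so any `if __name__ == "__main__":` block in it is skipped.
    namespace = runpy.run_path(
        path, init_globals={"Graph": graph_type}, run_name="__not_main__"
    )
    graphs = [value for value in namespace.values() if isinstance(value, graph_type)]
    if len(graphs) != 1:
        raise ValueError("expected exactly one graph, found {}".format(len(graphs)))
    return graphs[0]


SCRIPT = '''
graph = Graph("extract", "load")

if __name__ == "__main__":
    raise SystemExit("only reached when run with the raw interpreter")
'''

with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "job.py")
    with open(script_path, "w") as handle:
        handle.write(SCRIPT)
    found = find_graph(script_path, Graph)
```

Because the guarded block never runs under the loader, the same file stays usable both from a command line tool and from the plain interpreter.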
- ContextCurrifier -> ContextStack? Is there any way to use something like contextlib.ExitStack?
- Default HTTP service using requests? Ability to override, to use custom session, throttler, etc...
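For the ExitStack question, here is a minimal sketch of what stacking context processors with contextlib.ExitStack might look like. `processor` and `call_with_context` are hypothetical names, and passing the injected values as leading positional arguments is an assumption made for the example.

```python
from contextlib import ExitStack, contextmanager

events = []


@contextmanager
def processor(name):
    # Setup runs on entry, teardown runs on exit, even if the call fails.
    events.append("setup " + name)
    try:
        yield name + "-value"
    finally:
        events.append("teardown " + name)


def call_with_context(func, row, processors):
    with ExitStack() as stack:
        # Enter each processor in order; ExitStack unwinds them in reverse.
        injected = [stack.enter_context(p) for p in processors]
        return func(*injected, row)


result = call_with_context(
    lambda db, fs, row: (db, fs, row),
    "row-1",
    [processor("db"), processor("fs")],
)
```

ExitStack takes care of the reverse-order teardown and of cleaning up already-entered processors when a later one fails during setup.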
- Update contribution guide.
First entry point should be "tools overview", so that everyone knows what communication channels exist, and why each channel/tool is there.
- GitHub repository: the codebase (the master branch is for the current stable release and maintenance (bugfixes only), while develop can bring new features while trying as much as humanly possible to keep backward compatibility).
- GitHub wiki: all development-related discussions, including RFCs for structured discussions about a topic. Everybody is welcome to modify an RFC, as long as the change is respectful of others' opinions and does not impose one's view. In case of disputes over a decision, the project lead will arbitrate.
- GitHub issues (may be replaced in the future): all bugs and feature requests.
- Documentation lives both in the source code and on readthedocs. Contributors should update the documentation in the branch that matches their change.
- Slack is used for real-time communication. Help requests, discussions, etc. should go in the #general channel. If a discussion grows too big and generates too much output about a given topic, it's OK to create a specific channel for the topic.
A list of "starting points" should be given here (for users, for first time contributors, etc).
- Re-create a clean roadmap in this wiki.
- Update the sprints pages in the wiki.
- Update the wiki index.
- Move execution default and recommendation to Python __main__, which is the standard for Python users.
- Automate the release process and add -dev version in the process (post release, probably).
TODO this is copied from the repo and needs cleanup.
Things that should be thought about and/or implemented, but that I don't know where to store.
- Enhancers or node-level plugins
- Graph level plugins
- Documentation
- How do we manage the environment? .env?
- How do we configure plugins?
- ContextProcessors not clean (a bit better, but still not in love with the api)
- Release process specialised for bonobo. With changelog production, etc.
- Document how to upgrade the version (for example, a minor release needs badge changes, etc.).
- Windows console looks crappy.
- logger, verbosity level
- bonobo create xxx.py
- bonobo create package foo
- deprecate bonobo init
- dask
- initialize / finalize better than start / stop?
- Should we include datasets in the repo or not? As they may change, grow, and even eventually have licenses we can't use, it's probably best if we don't.