Oleg Zhurakousky edited this page Apr 23, 2014 · 15 revisions

You can check Issues for the most up-to-date road map, but here are some of the core items:

==

YARN Assembly Scala DSL

While YAYA already provides a powerful Java DSL, Scala offers richer language features for building DSLs. Because Scala is a JVM-based language that integrates well with Java, a Scala-based DSL should be attractive not only to Scala developers but to Java developers as well.

==

Spring Bindings

Spring is a popular general-purpose application framework with one of the largest communities out there. It would be natural to attract that community by providing a variety of Spring bindings in the form of namespace support as well as annotations.

==

Configurable Load Balancers

For long-lived Application Containers, the YarnApplication.launch() call returns an array of addressable ContainerDelegates. While developers can already choose (based on the address or other logic) which container they want to interact with, the task would be greatly simplified by a load-balancing strategy that internally maintains the knowledge of which Application Containers to interact with. For example, you may have a HostBasedFirstAvailableLoadBalancingStrategy which, as the name suggests, has an internally defined host filter (for instance, only use host 192.168.2.3), and 5 out of 10 Application Containers may be running on this host. Some of them may be busy while others are available. Such a strategy would handle both the filtering and the distribution of messages between Application Containers.
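The host-based, first-available example above could look roughly like the following. This is only a sketch under stated assumptions: `ContainerHandle`, `LoadBalancingStrategy`, and `SimpleContainer` are hypothetical stand-ins invented for illustration and are not part of the YAYA API, and the real ContainerDelegate may not expose a host or an availability flag in this form.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for an addressable container delegate; not the YAYA API. */
interface ContainerHandle {
    String host();
    boolean isAvailable();
}

/** Hypothetical strategy contract: choose the container that receives the next message. */
interface LoadBalancingStrategy {
    Optional<ContainerHandle> select(List<ContainerHandle> containers);
}

/** Keeps only containers on one host and returns the first one that is not busy. */
final class HostBasedFirstAvailableLoadBalancingStrategy implements LoadBalancingStrategy {
    private final String allowedHost;

    HostBasedFirstAvailableLoadBalancingStrategy(String allowedHost) {
        this.allowedHost = allowedHost;
    }

    @Override
    public Optional<ContainerHandle> select(List<ContainerHandle> containers) {
        return containers.stream()
                .filter(c -> allowedHost.equals(c.host())) // host filter
                .filter(ContainerHandle::isAvailable)      // skip busy containers
                .findFirst();                              // empty if none qualifies
    }
}

/** Minimal immutable implementation used only to exercise the sketch. */
final class SimpleContainer implements ContainerHandle {
    private final String host;
    private final boolean available;

    SimpleContainer(String host, boolean available) {
        this.host = host;
        this.available = available;
    }

    @Override public String host() { return host; }
    @Override public boolean isAvailable() { return available; }
}
```

Returning an `Optional` leaves the caller to decide what to do when every container on the allowed host is busy (wait, retry, or fall back to another strategy); the road-map item does not specify that behavior.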

==

Cross-cluster YARN Applications

One of the core requirements for any distributed system is its ability to share the load. This essentially implies delegation of work between available workers. In YARN such workers are represented by Application Containers. However, while performing work an Application Container may decide (based on a variety of things) that the load is too high for it to handle on its own, and it may choose to delegate part of its load to another YARN Application. Such an application may or may not be running in the same YARN cluster. What further complicates things is that in certain cases it's hard to determine in advance how many Application Containers one would need to adequately process the load. Take a reverse Map/Reduce paradigm (e.g., Monte Carlo simulation), where the input data is rather small but the computation performed on it produces larger amounts of data. That data may need to be analyzed in real time, and the analysis may in turn produce more data to be analyzed. The result is essentially a non-deterministic work tree whose size and growth are controlled by each branch spinning off (or not) another branch based on some condition.

Given such uncertainty, it would be very difficult, if not impossible, to maintain a consistent system load within a single cluster while expecting timely responses to such computations. In other words, we may need to borrow additional computation and IO resources on demand from one or more other clusters (stand-by clusters).

This would be even more powerful if combined with an element of machine learning, where an Application Container could choose on its own when to delegate the work and how to split it, by persisting its previous experiences (e.g., how long it takes to perform a given type of task on a predetermined compute unit).

While this is already possible by simply creating and launching a new YARN Application from an already running Application Container, the goal of this road-map item is to simplify such distribution through a higher-level strategy so that it could be exposed (preferably) as a simple method call.
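To make the "simple method call" idea concrete, here is a minimal sketch that combines it with the history-based decision described above. Everything in it is an assumption made for illustration: `RemoteLauncher` and `DelegatingExecutor` are hypothetical names, the in-memory running average stands in for whatever persisted experience a real implementation would use, and the fixed threshold is a deliberately naive replacement for any actual learning.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical hook that hands work to another YARN Application; not part of YAYA. */
interface RemoteLauncher {
    void launch(String taskType, Runnable work);
}

/** Hypothetical facade: one call that either runs work locally or delegates it. */
final class DelegatingExecutor {
    private final Map<String, Long> averageMillis = new HashMap<>();
    private final Map<String, Integer> samples = new HashMap<>();
    private final long thresholdMillis;
    private final RemoteLauncher launcher;

    DelegatingExecutor(long thresholdMillis, RemoteLauncher launcher) {
        this.thresholdMillis = thresholdMillis;
        this.launcher = launcher;
    }

    /** Records one observed duration for a task type, keeping a running average. */
    void record(String taskType, long millis) {
        int n = samples.getOrDefault(taskType, 0);
        long avg = averageMillis.getOrDefault(taskType, 0L);
        averageMillis.put(taskType, (avg * n + millis) / (n + 1));
        samples.put(taskType, n + 1);
    }

    /**
     * Delegates when past runs of this task type averaged above the threshold;
     * otherwise runs locally and records the duration. Returns true if delegated.
     */
    boolean execute(String taskType, Runnable work) {
        Long avg = averageMillis.get(taskType);
        if (avg != null && avg > thresholdMillis) {
            launcher.launch(taskType, work);
            return true;
        }
        long start = System.nanoTime();
        work.run();
        record(taskType, (System.nanoTime() - start) / 1_000_000);
        return false;
    }
}
```

A task type with no recorded history always runs locally first, so the executor gathers its own measurements before it ever delegates. Whether the remote application lives in the same cluster or a stand-by cluster is hidden behind `RemoteLauncher`, which is where the existing "launch a new YARN Application" logic would plug in.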

==

This road map is a living document and will be updated as needed. New items will be added and implemented items will be removed, so keep watching.