[Feature] Generic support for overrides during an execution #475
Pulling in the example here:

```python
@task(cpu=1, memory="4GB")
def fn1(f: flyte.File) -> flyte.File:
    ...

@task(spark=Spark())
def fn2(f: flyte.File) -> flyte.File:
    ...

# Executing the tasks with overrides
fn1.with_overrides(Override(cpu=2, memory="8GB")).execute(f="test.txt")
fn2.with_overrides(Override(custom={"spark_conf": {}})).execute(f="test.txt")

@workflow
def WF1():
    fn1(...)
    fn2(...)

# Executing the workflow with per-node overrides
WF1.with_overrides(overrides={
    "fn1": Override(cpu=2, caching=False),
    "fn2": Override(spark_conf={}),
}, caching=False).execute(....)
```
@anandswaminathan / @EngHabu / @kanterov / @jeevb, ideally I would like that at execution time, the Launch form has a JSON editor with all the default config filled in, so that the person who is launching can modify any of the values.
+1 to this, we'd find it very useful as we have a relatively small number of distinct workflows (for training several different types of models), but need to run them on a wide range of datasets at various scales (so we can't fully configure everything we need at registration time). A similar concept in Kubeflow Pipelines which might be relevant here is the `PipelineConf` op transformer:

```python
import kfp
import kubernetes

def op_transformer(op: kfp.dsl.ContainerOp) -> kfp.dsl.ContainerOp:
    # Set resource requests
    op.add_resource_request("cpu", "100m")
    op.add_resource_request("memory", "500Mi")
    # Schedule on a particular node group
    op.add_toleration(
        kubernetes.client.V1Toleration(
            key="capability",
            value="model-training",
            effect="NoSchedule",
        ),
    )
    return op
```
Here's another real-world use case where changing the pod spec during execution would be really useful: changing the GPU type or the instance type. Right now, if we want to change the number of A100s or switch to a different GPU, we have to register a new task version.

```python
def generate_a100_pod_spec(ngpu: int):
    cpu = ngpu * 12
    mem = f"{ngpu * 85 - 5}G"
    shared_volume_mount = V1VolumeMount(
        name="dshm",
        mount_path="/dev/shm",
    )
    pod_spec = V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                resources=V1ResourceRequirements(
                    requests={"cpu": str(cpu - 1), "memory": mem, "gpu": str(ngpu)},
                    limits={"cpu": str(cpu), "memory": mem, "gpu": str(ngpu)},
                ),
                volume_mounts=[shared_volume_mount],
            )
        ],
        volumes=[V1Volume(name="dshm", empty_dir=V1EmptyDirVolumeSource(medium="Memory"))],
        node_selector={
            "cloud.google.com/gke-accelerator": "nvidia-tesla-a100",
            "node.kubernetes.io/instance-type": f"a2-highgpu-{ngpu}g",
        },
        tolerations=[
            V1Toleration(effect="NoSchedule", key="cloud.google.com/gke-accelerator",
                         value="nvidia-tesla-a100", operator="Equal"),
            V1Toleration(effect="NoSchedule", key="beta.kubernetes.io/instance-type",
                         value=f"a2-highgpu-{ngpu}g", operator="Equal"),
        ],
    )
    return pod_spec

@task(
    task_config=Pod(pod_spec=generate_a100_pod_spec(4), primary_container_name="primary")
)
def my_task...
```
Folks, this is something very useful and we are thinking of starting to first design this and then implement it.

This would be very useful for us too. Let me know if we can help you here.

I am down to help with the implementation 🙌

cc @rahul-theorem too

I borrowed @rahul-theorem's idea and formulated an RFC (LINK). The UX will look like the following in my case:
Feel free to leave any comments!! Thanks!!
Had a chat with @kumare3 yesterday about this. Below is a gist that he shared with us. I think it more accurately reflects some of the ideas here. This is the simpler case, and I think we should start here first.

```python
@task(task_config=Spark(...))
def t1():
    ...

@task(resources=Resources(...), cache=True, cache_version="1.0")
def t2():
    ...

@workflow
def sub(x: int, y: int):
    return t1()

@workflow
def wf():
    x, y = t1().with_overrides(name="n1")
    t2(x, y).with_overrides(...)
    m, n = t1().with_overrides(...)
    return sub()
```

This is the first step, I think. It will allow users to call the same task with different resources, and it'll allow a workflow to call a task with different resources than the task defined for itself. The next step is a bit harder: the execution-time overrides. The UX might look something like:

```python
override_lp = Launchplan.create(wf, overrides=wf.with_override(...))

flyteremote.execute(wf, WFOverrides(nodes={
    "n1": t1.with_overrides(config=Spark(...)),
    "n2": t2.with_overrides(...),
    "n3": WFOverrides(nodes=[
        t1.with_overrides(config=Spark(...)),
    ]),
}))

wf = flyteremote.fetch_workflow("wf")
flyteremote.execute(wf.with_overrides(
```

Implementation: Create new messages as follows

Update executionSpec to have

In flytekit
I think it's desirable to keep the details of overrides abstracted away from the business logic of the workflow (since these might often be set by different tenants, i.e. a research scientist writing business logic vs. an ops team setting resources to right-size workloads). Applying overrides at the node level seems like it'd be a bit verbose (and it's also unclear how this would work for workflows with fan-out, etc.). One option here is supporting applying a
Thanks @rahul-theorem! I will continue to polish the design and propose it later.
An idea built on top of the annotation idea @rahul-theorem mentioned: maybe we can apply overrides to tasks whose execution/run-time signature matches (or partially matches) a specified pattern. For example,
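A rough sketch of what such pattern matching could look like (plain dicts and glob patterns as stand-ins; nothing here is an actual Flyte API): every override whose pattern matches a task's node name is merged in, with later patterns winning on conflicts.

```python
import fnmatch

def resolve_overrides(node_name: str, patterned_overrides: dict) -> dict:
    """Merge all overrides whose glob pattern matches the node name."""
    merged = {}
    for pattern, override in patterned_overrides.items():
        if fnmatch.fnmatch(node_name, pattern):
            merged.update(override)  # later patterns win on conflicts
    return merged

overrides = {
    "*": {"caching": False},   # applies to every node
    "train_*": {"gpu": 1},     # applies only to training tasks
}
print(resolve_overrides("train_model", overrides))  # {'caching': False, 'gpu': 1}
print(resolve_overrides("preprocess", overrides))   # {'caching': False}
```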
I've just read through the thread and it seems like everyone is aligned on how this will be implemented on the backend, but the UX of how an override is assigned to a node is still TBD. The three competing options at the moment are:

The above list is just what I could think of; feel free to add to the pros and cons! What if we go with @bnsblue's suggestion (option 3 above), but expand it such that the
@bstadlbauer Thanks for listing out all the options clearly! Just wondering: if we go with option 1 + 3, what will the UI look like?

Discussion has moved to #3553

Is there an update on this? I see that the RFC was merged.

This feature is not under development right now due to the complexity of the problem and the bandwidth of the contributors.

Hi all,

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Motivation: Why do you think this is important?
As a user of Flyte, configuration like resources (cpu/mem/gpu), catalog (enable/disable/version), retries, spark config, and hive cluster is set only at registration time. The only way to update it is to re-register. This is not desirable, as there are times when the user may want to update these parameters for specific launch plans or executions.
Goal: What should the final outcome look like, ideally?
A user should be able to optionally override any configuration information for an execution at the launchtime or launchplan creation time.
Describe alternatives you've considered
Re-register with the updated configuration
Flyte component
High level original proposal
Goal
The purpose of this proposal is to allow users to override task metadata at launch time, deciding caching configuration and resource allocation without needing to push a new commit in their PR.
Motivation
Users have requested cache toggling and cache-version overrides at launch time. Beyond that, users have also wanted to be able to change task resources quickly. This is a highly requested feature, as it allows users to change runtime configuration quickly without any code change, meaning they don't need to go through the entire CI pipeline.
Brief description of proposal
We can achieve this by associating each execution with three objects, `StaticMetadata`, `MetadataOverride`, and `MetadataFinal`, using some simple, serializable format like JSON.

UX/UI

At registration time we extract all the metadata of every task and create a `StaticMetadata` object.

At launch time we let users specify a `MetadataOverride` in the following ways:

- If using the UI: in the "Launch Workflow" panel, we have an "Input" tab and an "Execution Parameter Override" tab. The latter tab shows the `StaticMetadata` by default, but its fields are editable, which allows users to specify overrides of the metadata.
- If using flyte-cli: we add an option, e.g. `--execution-parameter-override='{"task1": {"cache": false}}'`, when launching an execution.

Every `MetadataOverride` object is also stored in Flyteadmin and is versioned with the execution id.

The `StaticMetadata` and `MetadataOverride` objects are merged by Flyteadmin to produce `MetadataFinal`, which is passed to the propeller.

Propeller or a plugin parses that `MetadataFinal` object and creates pods, talks to the data catalog, etc., accordingly.

In the console, we add an extra field/link in the header of the execution page, showing the `MetadataOverride` and/or `MetadataFinal`.
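As a toy illustration of the merge step described above (plain dicts standing in for the metadata objects; not the actual Flyteadmin implementation), the override can be applied recursively on top of the static metadata to produce the final metadata:

```python
import json

def merge_metadata(static: dict, override: dict) -> dict:
    """Recursively overlay override values onto static metadata; overrides win."""
    final = dict(static)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(final.get(key), dict):
            final[key] = merge_metadata(final[key], value)
        else:
            final[key] = value
    return final

static = {"task1": {"cache": True, "cpu": 1}, "task2": {"cache": True}}
# e.g. parsed from a CLI flag like --execution-parameter-override=...
override = json.loads('{"task1": {"cache": false}}')
final = merge_metadata(static, override)
print(final)  # {'task1': {'cache': False, 'cpu': 1}, 'task2': {'cache': True}}
```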
Reproducibility

As far as reproducibility is concerned, we note from the above description that both the `StaticMetadata` and the `MetadataOverride` are versioned.

- `StaticMetadata` becomes available to Flyteadmin at workflow registration time because it comes with the workflow definition. Extra storage of this object is not required; it is inherently versioned and can be extracted from the registered workflow when needed.
- `MetadataOverride` is versioned differently than `StaticMetadata`. To enable the targeted high flexibility, we want to allow users to associate each execution with one `MetadataOverride` object. The natural way to version `MetadataOverride` objects, therefore, is with the execution's id. For this reason, we store the `MetadataOverride` object in Flyteadmin.

Flyteadmin APIs
https://gist.github.com/bnsblue/8499584ec7e1aacf4b43e81d9c50e0fb
FlyteIdl Protos
Additional context
This is a highly requested feature from multiple teams within Lyft.
Is this a blocker for you to adopt Flyte
NA