
Add support in AWS Batch Operator for multinode jobs #29522

Merged · 27 commits · Apr 12, 2023

Conversation

@vandonr-amz (Contributor)

Picking up #28321 after it was somewhat abandoned by the original author.
I addressed my own comment about the empty array, and it should be good to go, I think.

Initial description from @camilleanne:

  • Adds support for AWS Batch multinode jobs by allowing a node_overrides JSON object to be passed through to the boto3 submit_job method (see the sketch below).

  • Adds support for multinode jobs by properly parsing the output of describe_jobs (whose shape differs between container and multinode jobs) to extract the log stream name.

closes: #25522
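For context, a minimal sketch of the boto3 call that node_overrides ultimately feeds into; the names and values here are illustrative, not taken from the PR:

import boto3

client = boto3.client("batch")

# nodeOverrides is the SubmitJob parameter that the new node_overrides
# operator argument is passed through to; all names below are hypothetical.
response = client.submit_job(
    jobName="my-multinode-job",
    jobQueue="my-job-queue",
    jobDefinition="my-multinode-job-definition",
    nodeOverrides={
        "numNodes": 3,
        "nodePropertyOverrides": [
            {
                "targetNodes": "0:",  # apply to all nodes
                "containerOverrides": {"command": ["echo", "hello from a node"]},
            }
        ],
    },
)
print(response["jobId"])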

@boring-cyborg (bot) added the area:providers and provider:amazon (AWS/Amazon - related issues) labels on Feb 14, 2023
@Taragolis (Contributor)

@vandonr-amz just an idea, not a strong opinion. What if we create a separate operator for multinode jobs?
node_overrides conflicts with container_overrides (the current overrides), and the operator has different logic for obtaining logs. In addition, BatchOperator uses its own waiter implementation (a separate additional hook 🙄), and I'm not sure whether it would work with these changes.

@vandonr-amz (Contributor, Author)

> @vandonr-amz just an idea, not a strong opinion. What if we create a separate operator for multinode jobs? node_overrides conflicts with container_overrides (the current overrides), and the operator has different logic for obtaining logs. In addition, BatchOperator uses its own waiter implementation (a separate additional hook 🙄), and I'm not sure whether it would work with these changes.

I'm not a super fan of it; it's the same boto operation behind it, just with different behavior... And as a user, I think I'd be surprised if I had to use a different operator for this.
I think an operator should "do a thing", and the parameters should be about the specifics of how that thing is done. I don't think launching a batch job and launching a multinode batch job are really different "things", but it's up for debate, I guess.

@Taragolis (Contributor) commented Feb 14, 2023

There are 3 different sets of parameters for SubmitJob:

  1. containerOverrides, which runs the batch job on either EC2 or Fargate
  2. nodeOverrides, which runs the batch job on EC2. This property includes its own containerOverrides
  3. eksPropertiesOverride, which runs the batch job on an EKS cluster

I also guess that arrayProperties is only applicable with containerOverrides.

IMHO, SubmitJob is pretty complicated. One potential benefit of keeping everything in one operator is the ability to have an upstream task create kwargs for BatchOperator.partial(...).expand_kwargs(...) (see the sketch below). But right now BatchOperator can't work with Dynamic Tasks.
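For reference, the mapping pattern being referred to would look roughly like this; a sketch with hypothetical DAG, queue, and definition names, and (as noted above) not something BatchOperator supported at the time:

import pendulum
from airflow import DAG
from airflow.decorators import task
from airflow.providers.amazon.aws.operators.batch import BatchOperator

with DAG(
    dag_id="example_batch_mapped",  # hypothetical
    schedule_interval=None,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
):
    @task
    def build_job_kwargs():
        # Upstream task producing per-job kwargs (hypothetical payloads)
        return [
            {"job_name": f"job-{i}", "container_overrides": {"command": ["echo", str(i)]}}
            for i in range(3)
        ]

    BatchOperator.partial(
        task_id="submit_batch_job",
        job_queue="my-job-queue",       # hypothetical
        job_definition="my-job-definition",  # hypothetical
        aws_conn_id=None,
    ).expand_kwargs(build_job_kwargs())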

@vandonr-amz (Contributor, Author)

Ok, maybe you're right after all. I'll give it a better look. I'm actually not that familiar with batch jobs 😅

@vandonr-amz (Contributor, Author)

After taking a closer look at it, I think having 2 (or more) operators would duplicate a lot of code without removing much complexity.
I see your arguments about how submitting a batch job can mean very different things, but it's also an operation that takes very similar parameters, and for which the actions to take on our side of the API are very similar.

Also, maybe the user isn't always right, but the initial form of this PR comes from an actual user of the operator, so I'd tend to follow their way of thinking (not being a user myself).

@Taragolis (Contributor)

As I mentioned before, I don't have a strict opinion about whether it should be a single operator or 3 operators (Regular, Node, EKS). I have used the combination of Airflow + Batch since Sept 2019, and they cover a lot of each other's limitations. For example, the same capability as Dynamic Task Mapping has been available in Batch for years through arrayProperties; on the other hand, dependencies between Batch jobs are not as good as in Airflow.

Not many changes have happened in the Batch operator since then, and the design of the hooks and BatchOperator still dates from the pre-provider era; it looks ugly now, even if it does have an exclusive backoff API caller.

My main concern is that most of the parameters potentially apply exclusively to the containerOverrides option, and we don't check that right now. I'm not a user of nodeOverrides, because that kind of architecture usually suits a Hadoop cluster better, and for that purpose EMR is the better fit. Different users, different points of view.

I'll try to check those options and get back after the weekend.

@vandonr-amz (Contributor, Author)

did you have time to check the options?
If we want to do it, I think rewriting the whole Batch hook and operator(s) should probably be separated from this PR, which is just about resolving a user's issue.

@Taragolis (Contributor)

Sorry, not yet. Hectic days. I will try tomorrow morning.

I also have a question about log links, but let me check it first.

Comment on lines 421 to 422
"""
job_container_desc = self.get_job_description(job_id=job_id).get("container", {})
log_configuration = job_container_desc.get("logConfiguration", {})
job_desc = self.get_job_description(job_id=job_id)
Contributor

Let me add a bit more context on what's going on here. Because everything executes outside of Airflow, users don't get any information about logs in AWS Batch.

For a regular batch job we have 0 or 1 dicts of CloudWatch information: log group, region name, log stream. This information is mainly used to generate the operator extra link that is visible in the UI:

[screenshots: the extra link shown in Grid View and in Graph View]

Right now, 0 can happen in different situations:

  1. The user doesn't use CloudWatch.
  2. This is an array job.
  3. For some reason the AWS API doesn't return the CloudWatch link. I personally haven't hit this situation, but it could potentially happen if the job finishes very quickly. That's also the reason why we check this at the end of operator execution.

If the user uses nodeProperties, then the job runs in multiple places and there are 0..many log streams. In that case we can't utilise the Operator Extra Link, so the best we can do here is print all the CloudWatch links in the Airflow log; with the current implementation, only one would be returned.
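For illustration, a rough sketch of how the log stream name(s) could be pulled out of a DescribeJobs job description, following the response shapes described above; the helper name and exact field access are assumptions, not this PR's actual code:

def get_log_stream_names(job_desc: dict) -> list:
    """Collect CloudWatch log stream name(s) from one DescribeJobs job description."""
    container = job_desc.get("container")
    if container is not None:
        # Regular (container) job: 0 or 1 log stream
        stream = container.get("logStreamName")
        return [stream] if stream else []
    # Multinode job: each attempt carries its own container/log stream
    streams = []
    for attempt in job_desc.get("attempts", []):
        stream = attempt.get("container", {}).get("logStreamName")
        if stream:
            streams.append(stream)
    return streams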

Contributor

@vandonr-amz can you please add this context to the PR description ^^^ It would be great for future users to be able to immediately understand what's going on here.

@vandonr-amz (Contributor, Author) commented Mar 14, 2023

ok I see your point, but should there really be more than one log link?
I'm looking at it, and it seems that in the case of a multinode job there are multiple log_configurations (one per node), but from each log config we get

  • the log group
  • the region

I'd imagine that multinode batch jobs would not be multi-region? So that would be a constant across all nodes.
And also, I suppose that in the overwhelming majority of cases the log group would be the same for all nodes (it would be very weird if it wasn't).

Then we get the stream name from the attempts, but this does not depend on the number of nodes. I imagine in most cases there would be one attempt. If there are more, we make the choice of returning the stream name for the last attempt, which makes sense.

The job runs on many nodes, but the logs all end up in the same log stream.

What we can do is iterate over the log configs to make sure they are all sending logs

  • to AWS
  • in the same region
  • in the same group

and log a warning if that's not the case (a sketch of such a check follows below).
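Something along these lines, assuming the ECS awslogs driver option names; the helper name is hypothetical, and this is a sketch of the proposed warning, not the code that was merged:

import logging

log = logging.getLogger(__name__)

def warn_on_inconsistent_log_configs(log_configs):
    """Warn if the per-node log configurations don't all point at one CloudWatch destination."""
    if not log_configs:
        return
    drivers = {cfg.get("logDriver") for cfg in log_configs}
    groups = {cfg.get("options", {}).get("awslogs-group") for cfg in log_configs}
    regions = {cfg.get("options", {}).get("awslogs-region") for cfg in log_configs}
    if drivers != {"awslogs"}:
        log.warning("Not all nodes are sending their logs to CloudWatch: %s", drivers)
    elif len(groups) > 1 or len(regions) > 1:
        log.warning("Nodes log to different groups/regions: %s / %s", groups, regions)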

Contributor

> I'd imagine that multinode batch jobs would not be multi-region?

AFAIK, Batch resources are region-specific for any type of job:

  • Compute Environment (ECS or EKS clusters)
  • Job Definition
  • Job Queues

You could configure logging to another region (CloudWatch) or to other supported log drivers, but that is configured when the Batch Job Definition is created (registered) and can't be changed by submit_job, so it should all be stored in one destination.
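For illustration, a minimal sketch of where that logging configuration lives, using hypothetical names and values; the structure follows the boto3 register_job_definition API for multinode jobs:

import boto3

client = boto3.client("batch")
client.register_job_definition(
    jobDefinitionName="batch-nodes",  # hypothetical
    type="multinode",
    nodeProperties={
        "numNodes": 5,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",  # applies to every node
                "container": {
                    "image": "public.ecr.aws/amazonlinux/amazonlinux:latest",
                    "vcpus": 2,
                    "memory": 2048,
                    # The log destination is fixed here; submit_job cannot change it.
                    "logConfiguration": {
                        "logDriver": "awslogs",
                        "options": {"awslogs-region": "us-east-1"},
                    },
                },
            }
        ],
    },
)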

> The job runs on many nodes, but the logs all end up in the same log stream.

Nope, each node has its own logs in a unique log stream:

[video attachment: batch-multinode-jobs.mp4]

Contributor Author

wow ok that's hmm... surprising.

Contributor Author

added a commit that logs links to all the log streams


@Taragolis (Contributor)

I've tested this simple DAG on my own AWS account:

import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.batch import BatchOperator

JOB_NAME = "multi-node-sample"
JOB_DEFINITION = "batch-nodes"
JOB_QUEUE = "multinode-job-queue"
CONTAINER_OVERRIDES = None
ARRAY_PROPERTIES = None
NODE_OVERRIDES = {
    "numNodes": 5
}

with DAG(
    dag_id="example_batch_submit_job_multi_node",
    schedule_interval=None,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    tags=["example", "amazon-provider", "batch", "multi-node"],
    catchup=False,
):
    submit_batch_job = BatchOperator(
        task_id="submit_batch_job",
        job_name=JOB_NAME,
        job_queue=JOB_QUEUE,
        job_definition=JOB_DEFINITION,
        container_overrides=CONTAINER_OVERRIDES,
        array_properties=ARRAY_PROPERTIES,
        node_overrides=NODE_OVERRIDES,
        aws_conn_id=None,
    )

If I only set NODE_OVERRIDES, it runs mostly successfully; from time to time one or more nodes fail during the run for no apparent reason. But that points either to a misconfiguration of the multinode environment or to some AWS internals.

If I set CONTAINER_OVERRIDES to any value other than None (even {}), I get:

botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the SubmitJob operation: Container overrides and node overrides are mutually exclusive, only one can be set.

If I set ARRAY_PROPERTIES, I get:

botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the SubmitJob operation: Multinode Array Job not supported.

@potiuk requested a review from dimberman on March 20, 2023
@vandonr-amz (Contributor, Author)

> If I set CONTAINER_OVERRIDES to any value other than None (even {}), I get:
>
> botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the SubmitJob operation: Container overrides and node overrides are mutually exclusive, only one can be set.

do you mean while still keeping NODE_OVERRIDES set? That'd be normal, and the error message explains it. If you want to use container overrides, you need to unset the node overrides.

> If I set ARRAY_PROPERTIES, I get:
>
> botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the SubmitJob operation: Multinode Array Job not supported.

Same story here: boto is telling you that you cannot set NODE_OVERRIDES (which implies a multi-node job) and ARRAY_PROPERTIES at the same time, though in a less clear way.

I'm not super familiar with batch jobs, but I think the valid combinations are:

  • container_overrides
  • container_overrides + array_properties
  • node_overrides

I haven't tested the array properties, but container and node overrides both work well when not mixed. (A sketch of a client-side check for these combinations follows below.)
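A minimal sketch of what such a client-side guard could look like, mirroring the SubmitJob errors above; the function name is hypothetical, and whether the operator should validate this at all was still being discussed:

from airflow.exceptions import AirflowException

def validate_overrides(container_overrides, array_properties, node_overrides):
    """Reject parameter combinations that the SubmitJob API refuses anyway."""
    if node_overrides and container_overrides:
        raise AirflowException(
            "container_overrides and node_overrides are mutually exclusive."
        )
    if node_overrides and array_properties:
        raise AirflowException(
            "array_properties cannot be combined with a multinode job (node_overrides)."
        )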

@Taragolis (Contributor)

> do you mean while still keeping NODE_OVERRIDES set? That'd be normal, and the error message explains it. If you want to use container overrides, you need to unset the node overrides.

Yep, array_properties is only allowed with container_overrides, not with node overrides.

> I'm not super familiar with batch jobs, but I think the valid combinations are

And I also think all of them would conflict with Batch jobs on EKS 🤣 But let's keep that as a surprise for future implementations.
Personally I don't use multi-node batch jobs (due to all the limitations), but I guess some users might find them useful in specific cases.

@vandonr-amz (Contributor, Author)

> Personally I don't use multi-node batch jobs (due to all the limitations), but I guess some users might find them useful in specific cases

well yes, this comes from #25522, which was opened by a user!

@vandonr-amz (Contributor, Author)

@dimberman @Taragolis do you think you could take a look at the latest changes and see if they look OK to you?

@dimberman (Contributor) left a comment

Thank you for addressing the changes @vandonr-amz ! LGTM 👍
