Feature: Disable scheduling on group of nodes #540
Conversation
aztk/models/models.py
Outdated
@@ -6,7 +6,7 @@
 from aztk.models.plugins import PluginConfiguration
 from aztk.internal import ConfigurationBase
 from .toolkit import Toolkit
+from .scheduling_target import SchedulingTarget

 class FileShare:
insert second new line here
        log.info("Task scheduling is already enabled for this node")


def setup_node_scheduling(
Is this backwards compatible? Or is this breaking for existing clusters?
It shouldn't be breaking, I think; the actual scheduling behavior shouldn't change.
-        id,
-        applications,
-        vm_size,
+        id = None,
The reason these values had no default is that they are required. Why are we defaulting them to `None`? To facilitate merging?
Yeah, this means we can't specify the default object, though I guess it could be set to None.
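A minimal sketch of the merging pattern being discussed, assuming hypothetical class and field names (not the actual aztk models): defaulting required constructor arguments to `None` lets "unset" fields be filled in from another configuration, at the cost of moving the required-field check to use time.

```python
# Hypothetical sketch: why `None` defaults help when merging a user config
# with defaults. Class and field names are illustrative, not the aztk API.
class ClusterConfig:
    def __init__(self, id=None, vm_size=None):
        self.id = id
        self.vm_size = vm_size

    def merge(self, other):
        # None means "not set here", so the other config fills the gaps.
        return ClusterConfig(
            id=self.id if self.id is not None else other.id,
            vm_size=self.vm_size if self.vm_size is not None else other.vm_size,
        )

    def validate(self):
        # Required fields are enforced at use time rather than construction.
        if self.id is None:
            raise ValueError("id is required")


defaults = ClusterConfig(vm_size="standard_a2")
user = ClusterConfig(id="my-cluster")
final = user.merge(defaults)
```

The trade-off is exactly the one raised above: the constructor no longer rejects a missing `id`, so a separate validation step has to catch it.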
aztk/node_scripts/core/config.py
Outdated
@@ -28,7 +28,7 @@
 cluster_id = os.environ.get("AZTK_CLUSTER_ID")
 pool_id = os.environ["AZ_BATCH_POOL_ID"]
 node_id = os.environ["AZ_BATCH_NODE_ID"]
-is_dedicated = os.environ["AZ_BATCH_NODE_IS_DEDICATED"]
+is_dedicated = os.environ.get("AZ_BATCH_NODE_IS_DEDICATED") == "true"
Why is this `os.environ.get()` while the ones above are not?
We could switch them all, though those are Batch variables, so they should always be set.
Yeah, so should we switch them all to `.get`, or leave them with `[]`?
I feel like they should all be `[]`, since that will throw an error immediately if the environment variable is not present. `.get` will not fail until the variable is used, and that is more confusing. Setting those environment variables is part of the service contract, so we should be fine failing if the service doesn't set them.
@@ -18,4 +20,5 @@ def run():


 if __name__ == "__main__":
     logger.setup_logging()
This isn't used anywhere it seems. Is this just for use later?
There was some logging in the node script that was basically not being printed to stdout. I think this should fix it; I'll double check.
Yep, it works. I just updated the one still using `logging`, which didn't work.
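For context, a hedged sketch of what a `setup_logging()` like the one in the diff might do to make node-script log records reach stdout; the real aztk logger module may differ, and the handler format here is an assumption:

```python
import logging
import sys

# Assumed sketch of a setup_logging() that routes log records to stdout
# so they show up in the node's captured output.
def setup_logging(level=logging.INFO):
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)


setup_logging()
logging.getLogger(__name__).info("scheduling target configured")
```

Without a handler attached to the root logger, `logging` calls below `WARNING` are effectively dropped, which would match the "not being printed to stdout" symptom described above.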
aztk/spark/client.py
Outdated
-    def create_cluster(self, cluster_conf: models.ClusterConfiguration, wait: bool = False):
+    def create_cluster(self, configuration: models.ClusterConfiguration, wait: bool = False):
I think we should not do this, since it is a breaking change. In the SDK rewrite we will make sure the parameter names are better, but for now we should preserve backwards compatibility and then deprecate these methods.
aztk/spark/client.py
Outdated
@@ -200,8 +199,9 @@ def cluster_download(self, cluster_id: str, source_path: str, destination_path:
     '''
     job submission
     '''
-    def submit_job(self, job_configuration):
+    def submit_job(self, configuration: models.JobConfiguration):
same as above
fix #527 (Don't schedule on low pri nodes)
fix #375
Nodes that should not run the driver will have scheduling disabled.
Todo: