This repository has been archived by the owner on Feb 3, 2021. It is now read-only.

Feature: Disable scheduling on group of nodes #540

Merged
merged 25 commits into master from feature/scheduling on May 30, 2018

Conversation

@timotheeguerin (Member) commented May 4, 2018

fix #527 (don't schedule the driver on low-priority nodes)
fix #375 (make the driver always execute on the Spark master)

Nodes that should not run the driver will have task scheduling disabled.

Todo:

  • Handle the case where no nodes match the criteria
  • Docs for the new setting
  • Check that job submission works
  • Tests
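
For reference, a minimal sketch of the per-node decision this feature implies. The SchedulingTarget values and the helper below are illustrative assumptions based on the new scheduling_target module, not the final implementation:

```python
from enum import Enum


class SchedulingTarget(Enum):
    Any = "any"              # every node may run scheduled tasks (current behaviour)
    Dedicated = "dedicated"  # only dedicated nodes (keeps the driver off low-pri nodes, #527)
    Master = "master"        # only the Spark master node (#375)


def should_schedule_on_node(target: SchedulingTarget, is_dedicated: bool, is_master: bool) -> bool:
    """Decide whether the Batch service may place tasks on this node."""
    if target == SchedulingTarget.Any:
        return True
    if target == SchedulingTarget.Dedicated:
        return is_dedicated
    return is_master
```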

@@ -6,7 +6,7 @@
from aztk.models.plugins import PluginConfiguration
from aztk.internal import ConfigurationBase
from .toolkit import Toolkit

from .scheduling_target import SchedulingTarget

class FileShare:
Member

Insert a second newline here.

log.info("Task scheduling is already enabled for this node")


def setup_node_scheduling(
Member

Is this backwards compatible? Or is this breaking for existing clusters?

Member Author

It shouldn't be breaking, I guess; the actual scheduling shouldn't change.
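
For anyone reading along, a minimal sketch of what setup_node_scheduling might boil down to, assuming the azure-batch SDK's compute_node.enable_scheduling / disable_scheduling operations; the surrounding wiring is an assumption, not the PR's actual code:

```python
import os

from azure.batch import BatchServiceClient


def setup_node_scheduling(batch_client: BatchServiceClient, enable: bool) -> None:
    """Toggle Batch task scheduling for the node this script runs on."""
    pool_id = os.environ["AZ_BATCH_POOL_ID"]
    node_id = os.environ["AZ_BATCH_NODE_ID"]
    if enable:
        batch_client.compute_node.enable_scheduling(pool_id, node_id)
    else:
        batch_client.compute_node.disable_scheduling(pool_id, node_id)
```

If the default target keeps scheduling enabled on every node, existing clusters would behave as before, which matches the reply above.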

id,
applications,
vm_size,
id = None,
Member

These values had no default because they are required. Why are we defaulting to None? To facilitate merging?

Member Author

Yeah, this means we can't specify the default object, though I guess it could be set to None.
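
For context, a rough sketch of the merge pattern this enables, where None marks "not provided" so a partially specified configuration (e.g. from CLI flags) can be filled in from another source; the names here are illustrative only:

```python
class ClusterConfiguration:
    def __init__(self, cluster_id=None, vm_size=None, size=None):
        self.cluster_id = cluster_id
        self.vm_size = vm_size
        self.size = size

    def merge(self, other: "ClusterConfiguration") -> "ClusterConfiguration":
        """Fill any unset (None) field from another configuration."""
        return ClusterConfiguration(
            cluster_id=self.cluster_id if self.cluster_id is not None else other.cluster_id,
            vm_size=self.vm_size if self.vm_size is not None else other.vm_size,
            size=self.size if self.size is not None else other.size,
        )


# CLI arguments only override what the user actually passed:
from_file = ClusterConfiguration(cluster_id="spark", vm_size="standard_f2", size=4)
from_cli = ClusterConfiguration(size=8)
merged = from_cli.merge(from_file)  # cluster_id="spark", vm_size="standard_f2", size=8
```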

@timotheeguerin changed the title from "Feature: Disable scheduling on nodes not having" to "Feature: Disable scheduling on group of nodes" on May 30, 2018
@@ -28,7 +28,7 @@
cluster_id = os.environ.get("AZTK_CLUSTER_ID")
pool_id = os.environ["AZ_BATCH_POOL_ID"]
node_id = os.environ["AZ_BATCH_NODE_ID"]
is_dedicated = os.environ["AZ_BATCH_NODE_IS_DEDICATED"]
is_dedicated = os.environ.get("AZ_BATCH_NODE_IS_DEDICATED") == "true"
Member

Why is this os.environ.get() while the ones above are not?

Member Author

We could switch them all, though those are Batch variables, so they should always be set.


Member Author

Yeah, so should we switch them all to .get() or leave them with []?

Member

I feel like they should all be [] since that will throw an error if the environment variable is not present. .get() will not fail until the variable is actually used, and that is more confusing. Setting those environment variables is part of the service contract, so we should be fine failing if the service doesn't set them.
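
To spell out the trade-off being discussed (plain Python, nothing AZTK-specific):

```python
import os

# os.environ[...] fails immediately with a KeyError if the variable is missing,
# which surfaces a broken service contract right at startup:
pool_id = os.environ["AZ_BATCH_POOL_ID"]        # KeyError if unset

# os.environ.get(...) silently returns None, so the failure only shows up
# later, wherever the value is actually used:
cluster_id = os.environ.get("AZTK_CLUSTER_ID")  # None if unset

# .get() is still the right tool when a comparison should simply be False
# for a missing value, as in the diff above:
is_dedicated = os.environ.get("AZ_BATCH_NODE_IS_DEDICATED") == "true"
```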

@@ -18,4 +20,5 @@ def run():


if __name__ == "__main__":
logger.setup_logging()
Member

This doesn't seem to be used anywhere. Is this just for use later?

@timotheeguerin (Member Author) May 30, 2018

There was some logging in the node script that was basically not being printed to stdout. I think this should fix it; I'll double check.

Member Author

Yep, it works. I just updated the one still using logging, which doesn't work.
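
For reference, a minimal sketch of what a setup_logging helper along these lines might do (route the node script's log records to stdout so Batch captures them in stdout.txt); the handler details are assumptions:

```python
import logging
import sys


def setup_logging(level: int = logging.INFO) -> None:
    """Send log records to stdout so they show up in the Batch node's stdout.txt."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)
```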


def create_cluster(self, cluster_conf: models.ClusterConfiguration, wait: bool = False):
def create_cluster(self, configuration: models.ClusterConfiguration, wait: bool = False):
Member

I think we should not do this since it is a breaking change. In the SDK rewrite, we will make sure the parameter names are better. But for now, we should preserve backwards compatibility and then deprecate these methods.
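
One possible way to keep the old keyword working while steering callers to the new one (illustrative only; not what this PR does):

```python
import warnings


def create_cluster(self, configuration=None, wait: bool = False, cluster_conf=None):
    """Accept the deprecated 'cluster_conf' keyword until a later release removes it."""
    if cluster_conf is not None:
        warnings.warn(
            "'cluster_conf' is deprecated; use 'configuration' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        configuration = configuration if configuration is not None else cluster_conf
    # ... continue with the resolved configuration
```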

@@ -200,8 +199,9 @@ def cluster_download(self, cluster_id: str, source_path: str, destination_path:
'''
job submission
'''
def submit_job(self, job_configuration):
def submit_job(self, configuration: models.JobConfiguration):
Member

Same as above.

@timotheeguerin merged commit 8fea9ce into master on May 30, 2018
@timotheeguerin deleted the feature/scheduling branch on May 30, 2018 at 20:02
Development

Successfully merging this pull request may close these issues.

Don't schedule driver to run on low priority instances (#527)
Make driver always execute on the Spark master (#375)
2 participants