Feature: Disable scheduling on group of nodes#540
Conversation
aztk/models/models.py
Outdated
|
|
||
| from .scheduling_target import SchedulingTarget | ||
|
|
||
| class FileShare: |
| log.info("Task scheduling is already enabled for this node") | ||
|
|
||
|
|
||
| def setup_node_scheduling( |
There was a problem hiding this comment.
Is this backwards compatible? Or is this breaking for existing clusters?
There was a problem hiding this comment.
This shouldn't i guess, the actual scheduling shouldn't change
| id, | ||
| applications, | ||
| vm_size, | ||
| id = None, |
There was a problem hiding this comment.
The reason these values had no default is because they are required. Why are we defaulting to None? To facilitate merging?
There was a problem hiding this comment.
Yeah this means we cant specify the default object, though I guess it could set it to none
aztk/node_scripts/core/config.py
Outdated
| pool_id = os.environ["AZ_BATCH_POOL_ID"] | ||
| node_id = os.environ["AZ_BATCH_NODE_ID"] | ||
| is_dedicated = os.environ["AZ_BATCH_NODE_IS_DEDICATED"] | ||
| is_dedicated = os.environ.get("AZ_BATCH_NODE_IS_DEDICATED") == "true" |
There was a problem hiding this comment.
why is this os.environ.get() and the above are not?
There was a problem hiding this comment.
we could switch them all, though those are batch variables so they should always be set
There was a problem hiding this comment.
There was a problem hiding this comment.
yeah so should we switch them all to .get or leave them with []
There was a problem hiding this comment.
I feel like they should all be [] since that will throw an error if the environment variable is not present. .get will not until the variable is used -- and that is more confusing. Setting those environment variables is part of the service contract, so we should be fine failing if the service doesn't set them.
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| logger.setup_logging() |
There was a problem hiding this comment.
This isn't used anywhere it seems. Is this just for use later?
There was a problem hiding this comment.
There was some logging in the node script which were basically not being printed to the stdout. This I think should fix it, I'll double check
There was a problem hiding this comment.
Yep it works, I just updated the one still using logging which don't work
aztk/spark/client.py
Outdated
| super().__init__(secrets_config) | ||
|
|
||
| def create_cluster(self, cluster_conf: models.ClusterConfiguration, wait: bool = False): | ||
| def create_cluster(self, configuration: models.ClusterConfiguration, wait: bool = False): |
There was a problem hiding this comment.
I think we should not do this since this is a breaking change. In the sdk rewrite, we will make sure the parameter names are better. But for now, we should preserve backwards compatibility and then deprecate these methods.
aztk/spark/client.py
Outdated
| job submission | ||
| ''' | ||
| def submit_job(self, job_configuration): | ||
| def submit_job(self, configuration: models.JobConfiguration): |
fix #527 (Don't schedule on low pri nodes)
fix #375
Nodes that should not run the driver will have scheduling disabled
Todo: