Conversation
…tk into feature/spark-diagnostic-tool
aztk/client.py
Outdated
cluster_nodes = [(node, self.__get_remote_login_settings(pool.id, node.id)) for node in nodes]
try:
    ssh_key = self.__create_user_on_pool('aztk', pool.id, nodes)
    asyncio.get_event_loop().run_until_complete(ssh_lib.clus_copy(container_name=container_name,
Should this capture and return the result? Something like:
output = asyncio ...
return output
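For reference, a rough sketch of that suggestion wrapped in a hypothetical helper (everything except container_name, cluster_nodes, and clus_copy is a placeholder, including the import path and the extra arguments):

```python
import asyncio

from aztk.utils import ssh as ssh_lib  # assumed import path for the ssh helpers


def copy_to_cluster_sketch(container_name, cluster_nodes, source_path, destination_path):
    # Sketch only: run the copy coroutine, capture its result, and return it to
    # the caller instead of discarding it. The arguments beyond container_name
    # and cluster_nodes are placeholders, not the real clus_copy signature.
    output = asyncio.get_event_loop().run_until_complete(
        ssh_lib.clus_copy(container_name=container_name,
                          cluster_nodes=cluster_nodes,
                          source_path=source_path,
                          destination_path=destination_path))
    return output
```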
remote_path = "/tmp/debug.zip"
output = spark_client.cluster_copy(cluster_id, remote_path, local_path, host=True, get=True)
# write run output to debug/ directory
with open(os.path.join(os.path.dirname(local_path), "debug-output.txt"), 'w', encoding="UTF-8") as f:
Should this be 'w' or 'w+' (for overwrite)? Not sure what the right thing here would be unless the logs have timestamps on them.
Not sure what you mean by this. debug-output.txt is the output from the cluster_run command. It is mainly just there to see if the tool crashed or not.
The file is also only written once so I'm not sure why I would overwrite.
I was under the impression you could run the tool multiple times. If that happens, do we want to append or overwrite?
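For reference, a minimal sketch of the two behaviours being discussed; local_path and output below stand in for the values from the snippet above:

```python
import os

# Placeholders for the values produced earlier in the debug tool.
local_path = "debug.zip"
output = "..."

debug_log = os.path.join(os.path.dirname(local_path), "debug-output.txt")
with open(debug_log, 'w', encoding="UTF-8") as f:    # 'w': truncate/overwrite on each run
    f.write(str(output))
# with open(debug_log, 'a', encoding="UTF-8") as f:  # 'a': append across repeated runs
#     f.write(str(output))
```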
Diagnostic program that runs on each node in the cluster
This program must be run with sudo
"""
import io
These are PEP8 sorted, and alpha sorted within the groups.
I have no idea why I felt these were not sorted... You're right.
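For anyone reading along, the grouping in question is the PEP8 convention (the third-party and local module names below are illustrative, not necessarily what debug.py imports):

```python
# Standard library imports first, alphabetically sorted.
import io
import os
import subprocess

# Third-party imports next (illustrative).
import requests

# Local application imports last (illustrative).
from aztk.utils import helpers
```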
aztk/utils/ssh.py
Outdated
else:
    cmd = '/bin/bash 2>&1 -c \'set -e; set -o pipefail; {0}; wait\''.format(command)
    stdin, stdout, stderr = client.exec_command(cmd, get_pty=True)
    # [print(line.decode('utf-8')) for line in stdout.read().splitlines()]
- result = spark_client.cluster_run(args.cluster_id, args.command)
+ results = spark_client.cluster_run(args.cluster_id, args.command)
+ for result in results:
+     print("---------------------------") #TODO: replace with nodename
aztk/spark/client.py
Outdated
    raise error.AztkError(helpers.format_batch_exception(e))

- def cluster_copy(self, cluster_id: str, source_path: str, destination_path: str):
+ def cluster_copy(self, cluster_id: str, source_path: str, destination_path: str, host=False, get=False):
get means retrieve files from the nodes. Previously, cluster_copy was limited to copying a local file to all nodes in the cluster. Now it can work both ways.
Hmm, I find it a bit weird to have it this way.
What would you suggest? Is it just that the name of the parameter is confusing?
Not sure if having two methods would be clearer?
def copy_to_cluster(...):
    ...
def copy_from_cluster(...):
    ...
Thoughts?
A change like that would be breaking for any script using cluster_copy() today, unless we have cluster_copy() and copy_from_cluster(). I feel like that is not the best naming.
In general, I think we should consider an entire SDK rewrite to align cluster and job function names (didn't do a particularly good job naming them). It would also be nice to split the client so it has a cluster and a job submodule. So you would do client.cluster.get_log() or client.job.get_log().
Yeah, I'm down to rename stuff. We can either keep the old functions so they call the new ones and mark them as deprecated, or just remove them, since we are technically only releasing this next version.
I think deprecating is probably better, with something like this: https://stackoverflow.com/questions/2536307/how-do-i-deprecate-python-functions
Deprecating is fine, I think, so long as we have a set time frame (maybe 1 or 2 versions) after which we actually remove the code.
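If we go that route, a rough sketch of the deprecate-then-remove approach from the linked answer; the copy_to_cluster/copy_from_cluster names are only the ones floated above, not a final decision:

```python
import functools
import warnings


def deprecated(message):
    """Sketch of the decorator from the linked answer: warn whenever the old name is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn("{0} is deprecated: {1}".format(func.__name__, message),
                          DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


class Client:
    # Hypothetical new names, one per copy direction.
    def copy_to_cluster(self, cluster_id, source_path, destination_path, host=False):
        ...  # local -> cluster

    def copy_from_cluster(self, cluster_id, source_path, destination_path, host=False):
        ...  # cluster -> local

    @deprecated("use copy_to_cluster or copy_from_cluster instead")
    def cluster_copy(self, cluster_id, source_path, destination_path, host=False, get=False):
        # Keep the old entry point for a release or two, forwarding to the new names.
        if get:
            return self.copy_from_cluster(cluster_id, source_path, destination_path, host=host)
        return self.copy_to_cluster(cluster_id, source_path, destination_path, host=host)
```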
parser.add_argument('--id', dest='cluster_id', required=True,
                    help='The unique id of your spark cluster')

parser.add_argument('--output', '-o', required=True,
Couldn't we make that optional, defaulting to something like aztk_debug/[cluster_id]?
Yeah, that's a good idea. By default, it can be debug-{cluster-id}/ in the working directory.
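Something like the following could implement that default; the help text and exact default string are just a proposal:

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--id', dest='cluster_id', required=True,
                    help='The unique id of your spark cluster')
parser.add_argument('--output', '-o', required=False,
                    help='The directory for the debug output (defaults to debug-{cluster_id}/)')
args = parser.parse_args()

# Fall back to debug-{cluster_id}/ in the current working directory when --output is omitted.
if args.output is None:
    args.output = os.path.join(os.getcwd(), "debug-{0}".format(args.cluster_id))
```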
pylintrc
Outdated
# Add files or directories to the blacklist. They should be base names, not
# paths.
- ignore=CVS
+ ignore=CVS,debug.py
Why did you remove this one?
No idea but it does not seem necessary at all.
Maybe we should add this as a dev dependency then, or have Travis install it separately.
Switched to a single # pylint: disable=import-error on the import line and it seems to work fine.
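That is, the inline form on whichever import pylint cannot resolve in CI; the module name below is only an example:

```python
# Example only: silence the unresolved-import check for a module that is
# available on the cluster nodes but not in the CI environment.
import docker  # pylint: disable=import-error
```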
…tk into feature/spark-diagnostic-tool

Fix #347