Conversation
…tk into feature/spark-diagnostic-tool
aztk/client.py
Outdated
cluster_nodes = [(node, self.__get_remote_login_settings(pool.id, node.id)) for node in nodes]
try:
    ssh_key = self.__create_user_on_pool('aztk', pool.id, nodes)
    asyncio.get_event_loop().run_until_complete(ssh_lib.clus_copy(container_name=container_name,
Should this capture and return the result? Something like:
output = asyncio ...
return output
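For reference, a rough sketch of that suggestion wrapped in a hypothetical helper (everything except container_name, cluster_nodes, and clus_copy is a placeholder, including the import path and the extra arguments):

```python
import asyncio

from aztk.utils import ssh as ssh_lib  # assumed import path for the ssh helpers


def copy_to_cluster_sketch(container_name, cluster_nodes, source_path, destination_path):
    # Sketch only: run the copy coroutine, capture its result, and return it to
    # the caller instead of discarding it. The arguments beyond container_name
    # and cluster_nodes are placeholders, not the real clus_copy signature.
    output = asyncio.get_event_loop().run_until_complete(
        ssh_lib.clus_copy(container_name=container_name,
                          cluster_nodes=cluster_nodes,
                          source_path=source_path,
                          destination_path=destination_path))
    return output
```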
remote_path = "/tmp/debug.zip"
output = spark_client.cluster_copy(cluster_id, remote_path, local_path, host=True, get=True)
# write run output to debug/ directory
with open(os.path.join(os.path.dirname(local_path), "debug-output.txt"), 'w', encoding="UTF-8") as f:
Should this be 'w' or 'w+' (for overwrite)? Not sure what the right thing here would be unless the logs have timestamps on them.
Not sure what you mean by this. debug-output.txt is the output from the cluster_run command. It is mainly just there to see if the tool crashed or not.
The file is also only written once so I'm not sure why I would overwrite.
I was under the impression you could run the tool multiple times. If that happens, do we want to append or overwrite?
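For reference, a minimal sketch of the two behaviours being discussed; local_path and output below stand in for the values from the snippet above:

```python
import os

# Placeholders for the values produced earlier in the debug tool.
local_path = "debug.zip"
output = "..."

debug_log = os.path.join(os.path.dirname(local_path), "debug-output.txt")
with open(debug_log, 'w', encoding="UTF-8") as f:    # 'w': truncate/overwrite on each run
    f.write(str(output))
# with open(debug_log, 'a', encoding="UTF-8") as f:  # 'a': append across repeated runs
#     f.write(str(output))
```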
Diagnostic program that runs on each node in the cluster
This program must be run with sudo
"""
import io
These are PEP8 sorted, and alpha sorted within the groups.
I have no idea why I felt these were not sorted... You're right.
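For anyone reading along, the grouping in question is the PEP8 convention (the third-party and local module names below are illustrative, not necessarily what debug.py imports):

```python
# Standard library imports first, alphabetically sorted.
import io
import os
import subprocess

# Third-party imports next (illustrative).
import requests

# Local application imports last (illustrative).
from aztk.utils import helpers
```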
aztk/utils/ssh.py
Outdated
else:
    cmd = '/bin/bash 2>&1 -c \'set -e; set -o pipefail; {0}; wait\''.format(command)
    stdin, stdout, stderr = client.exec_command(cmd, get_pty=True)
    # [print(line.decode('utf-8')) for line in stdout.read().splitlines()]
- result = spark_client.cluster_run(args.cluster_id, args.command)
+ results = spark_client.cluster_run(args.cluster_id, args.command)
+ for result in results:
+     print("---------------------------") #TODO: replace with nodename
aztk/spark/client.py
Outdated
    raise error.AztkError(helpers.format_batch_exception(e))

- def cluster_copy(self, cluster_id: str, source_path: str, destination_path: str):
+ def cluster_copy(self, cluster_id: str, source_path: str, destination_path: str, host=False, get=False):
get means retrieve files from the nodes. Previously, cluster_copy was limited to copying a local file to all nodes in the cluster. Now it can work both ways.
Hmm, I find it a bit weird to have it this way.
What would you suggest? Is it just that the name of the parameter is confusing?
Not sure if having two methods would be clearer?
def copy_to_cluster(...):
    ...
def copy_from_cluster(...):
    ...
Thoughts?
A change like that would be breaking for any script using cluster_copy() today, unless we have cluster_copy() and copy_from_cluster(). I feel like that is not the best naming.
In general, I think we should consider an entire SDK rewrite to align cluster and job function names (didn't do a particularly good job naming them). It would also be nice to split the client so it has a cluster and a job submodule. So you would do client.cluster.get_log() or client.job.get_log().
Yeah, I'm down to rename stuff. We can either keep the old functions so they call the new ones and mark them as deprecated, or just remove them, since we are technically only releasing this next version.
I think deprecating is probably better, with something like this: https://stackoverflow.com/questions/2536307/how-do-i-deprecate-python-functions
Deprecating is fine, I think, so long as we have a set time frame (maybe 1 or 2 versions) after which we actually remove the code.
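If we go that route, a rough sketch of the deprecate-then-remove approach from the linked answer; the copy_to_cluster/copy_from_cluster names are only the ones floated above, not a final decision:

```python
import functools
import warnings


def deprecated(message):
    """Sketch of the decorator from the linked answer: warn whenever the old name is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn("{0} is deprecated: {1}".format(func.__name__, message),
                          DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


class Client:
    # Hypothetical new names, one per copy direction.
    def copy_to_cluster(self, cluster_id, source_path, destination_path, host=False):
        ...  # local -> cluster

    def copy_from_cluster(self, cluster_id, source_path, destination_path, host=False):
        ...  # cluster -> local

    @deprecated("use copy_to_cluster or copy_from_cluster instead")
    def cluster_copy(self, cluster_id, source_path, destination_path, host=False, get=False):
        # Keep the old entry point for a release or two, forwarding to the new names.
        if get:
            return self.copy_from_cluster(cluster_id, source_path, destination_path, host=host)
        return self.copy_to_cluster(cluster_id, source_path, destination_path, host=host)
```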
parser.add_argument('--id', dest='cluster_id', required=True,
                    help='The unique id of your spark cluster')

parser.add_argument('--output', '-o', required=True,
Couldn't we make that optional, defaulting to something like aztk_debug/[cluster_id]?
Yeah, that's a good idea. By default, it can be debug-{cluster-id}/ in the working directory.
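Something like the following could implement that default; the help text and exact default string are just a proposal:

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--id', dest='cluster_id', required=True,
                    help='The unique id of your spark cluster')
parser.add_argument('--output', '-o', required=False,
                    help='The directory for the debug output (defaults to debug-{cluster_id}/)')
args = parser.parse_args()

# Fall back to debug-{cluster_id}/ in the current working directory when --output is omitted.
if args.output is None:
    args.output = os.path.join(os.getcwd(), "debug-{0}".format(args.cluster_id))
```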
pylintrc
Outdated
# Add files or directories to the blacklist. They should be base names, not
# paths.
- ignore=CVS
+ ignore=CVS,debug.py
Why did you remove this one?
No idea but it does not seem necessary at all.
Maybe we should add this as a dev dependency then, or have Travis install it separately.
Switched to a single # pylint: disable=import-error on the import line and it seems to work fine.
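That is, the inline form on whichever import pylint cannot resolve in CI; the module name below is only an example:

```python
# Example only: silence the unresolved-import check for a module that is
# available on the cluster nodes but not in the CI environment.
import docker  # pylint: disable=import-error
```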
…tk into feature/spark-diagnostic-tool

Fix #347