This repository was archived by the owner on Feb 3, 2021. It is now read-only.

Creating a cluster fails when using SchedulingTarget.Master and a cluster ID previously used by a now-deleted, older cluster #689

@jamesclarke

Background

In certain cases, such as automated daily-run data processing jobs, we want to:

  • use the AZTK SDK to create a cluster with the SchedulingTarget.Master option so that our driver always runs on the master;
  • use a pre-determined, known cluster ID rather than generating one at runtime so that if the data processing job errors, we know what cluster ID to check for to see if the cluster already exists when we re-run the job.

AZTK version

v0.10.1 release, installed from PyPI using pip

Issue

It seems that:

  • when a cluster is created with the option scheduling_target=SchedulingTarget.Master, a 'task table' is created in the storage account's table service, using a hashed version of the cluster ID as its ID;
  • when the cluster is deleted, the task table is not deleted;
    • I traced the code through to a call to aztk.client.cluster.helpers.delete.delete_pool_and_job_and_table(). For some reason this does not seem to delete the task table, yet it does not fail or raise an error either.
  • when a later cluster is created with the same cluster ID, the cluster-creation code attempts to create a new task table with the hashed cluster ID but fails when it finds a table with that ID already exists, raising the following error:
    • AztkError: Conflict
      {"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US",
      "value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-
      0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
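
The sequence above can be modelled with a small in-memory toy. This is not AZTK code: `FakeTableService`, `create_cluster`, `delete_cluster` and `TableAlreadyExists` are hypothetical stand-ins, the cluster ID is used directly as the table name (no hashing), and the ordering (pool and job first, table last) is only what the traceback and step 8 of the reproduction suggest.

```python
class TableAlreadyExists(Exception):
    """Stands in for the storage service's TableAlreadyExists conflict."""


class FakeTableService:
    """Minimal in-memory model of a table service."""

    def __init__(self):
        self.tables = set()

    def create_table(self, table_name, fail_on_exist=False):
        if table_name in self.tables:
            if fail_on_exist:
                raise TableAlreadyExists(table_name)
            return False
        self.tables.add(table_name)
        return True


def create_cluster(table_service, clusters, cluster_id, use_task_table):
    # Pool and job are modelled as a single entry in `clusters`.
    clusters.add(cluster_id)
    # The task table is only created when a scheduling target other than
    # "Any" is requested, and creation insists that the table is new.
    if use_task_table:
        table_service.create_table(cluster_id, fail_on_exist=True)


def delete_cluster(clusters, cluster_id):
    # Models the observed behaviour: pool and job go away, the table stays.
    clusters.discard(cluster_id)


service, clusters = FakeTableService(), set()
create_cluster(service, clusters, "test-aztk-cluster", use_task_table=True)
delete_cluster(clusters, "test-aztk-cluster")

try:
    create_cluster(service, clusters, "test-aztk-cluster", use_task_table=True)
    second_attempt = "created"
except TableAlreadyExists:
    second_attempt = "conflict"

# Without a task table the leftover table is never touched, so this works.
delete_cluster(clusters, "test-aztk-cluster")
create_cluster(service, clusters, "test-aztk-cluster", use_task_table=False)
```

In this toy the second attempt ends in a conflict while the pool entry has already been added, which matches having to delete the cluster again after the failed creation.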
      

Steps to reproduce (using the AZTK SDK)

  1. Create a cluster with:
    • a set name (e.g. test-aztk-cluster)
    • the option scheduling_target=SchedulingTarget.Master
  2. Use the cluster as needed.
  3. Delete the cluster.
  4. Wait a while, until the underlying Azure Batch pool and job have been deleted.
  5. Check the Azure Storage table used to track tasks and see that it has not been deleted.
  6. Create another cluster with the same name (test-aztk-cluster) and scheduling_target=SchedulingTarget.Master
  7. See cluster creation fail with an error like:
    • AztkError: Conflict
      {"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US",
      "value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-
      0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
      
  8. Delete the cluster (i.e. Azure Batch pool and job).
  9. Go to the Azure dashboard and manually delete the Azure Storage task table.
  10. Once everything has finished deleting, create the same cluster again.
  11. See that this time cluster creation works without error.
  12. Delete the cluster and wait for the pool and job to be deleted but this time do not manually delete the task table.
  13. Now create a cluster with the same name (test-aztk-cluster) but, this time, use the option scheduling_target=SchedulingTarget.Any.
  14. See that this time, although the task table is still there, the cluster is created without error (as a task table is not used with the option SchedulingTarget.Any)
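
As a possible stop-gap, the manual clean-up in step 9 could be scripted before each cluster creation. The sketch below is only that, a sketch: `drop_stale_task_table` and `_StubTableService` are hypothetical names, and it assumes the table service object offers `exists(table_name)` and `delete_table(table_name, fail_not_exist=False)`, as I believe the TableService class in the traceback does. The caller has to supply the table name, i.e. whatever helpers.convert_id_to_table_id() produces for the cluster ID; the example does not try to reproduce that conversion.

```python
def drop_stale_task_table(table_service, table_name):
    """Delete `table_name` if it exists; return True if a table was removed."""
    if table_service.exists(table_name):
        table_service.delete_table(table_name, fail_not_exist=False)
        return True
    return False


class _StubTableService:
    """In-memory stand-in exposing only the two calls the helper uses."""

    def __init__(self, tables):
        self.tables = set(tables)

    def exists(self, table_name):
        return table_name in self.tables

    def delete_table(self, table_name, fail_not_exist=False):
        if table_name not in self.tables:
            if fail_not_exist:
                raise KeyError(table_name)
            return False
        self.tables.remove(table_name)
        return True


stub = _StubTableService({"leftovertable"})
removed_first = drop_stale_task_table(stub, "leftovertable")   # table was there
removed_second = drop_stale_task_table(stub, "leftovertable")  # already gone
```

Against the real service the deletion may not take effect immediately, so creating the cluster straight afterwards might still need a wait or a retry, as in step 10.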

Error logs

---------------------------------------------------------------------------
AzureConflictHttpError                    Traceback (most recent call last)
.../lib/python3.6/site-packages/aztk/utils/try_func.py in wrapper(*args, **kwargs)
      7             try:
----> 8                 return function(*args, **kwargs)
      9             except catch_exceptions as e:

.../lib/python3.6/site-packages/aztk/utils/retry.py in wrapper(*args, **kwargs)
     16                 try:
---> 17                     return function(*args, **kwargs)
     18                 except exceptions:

.../lib/python3.6/site-packages/aztk/client/base/helpers/task_table.py in create_task_table(table_service, id)
     66     """
---> 67     return table_service.create_table(helpers.convert_id_to_table_id(id), fail_on_exist=True)
     68 

.../lib/python3.6/site-packages/azure/cosmosdb/table/tableservice.py in create_table(self, table_name, fail_on_exist, timeout)
    541         else:
--> 542             self._perform_request(request)
    543             return True

.../lib/python3.6/site-packages/azure/cosmosdb/table/tableservice.py in _perform_request(self, request, parser, parser_args, operation_context)
   1105         _update_storage_table_header(request)
-> 1106         return super(TableService, self)._perform_request(request, parser, parser_args, operation_context)

.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    429                                  exception_str_in_one_line)
--> 430                     raise ex
    431             finally:

.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    357                     retry_context.exception = ex
--> 358                     raise ex
    359                 except Exception as ex:

.../lib/python3.6/site-packages/azure/storage/common/storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
    343                         _http_error_handler(
--> 344                             HTTPError(response.status, response.message, response.headers, response.body))
    345 

.../lib/python3.6/site-packages/azure/storage/common/_error.py in _http_error_handler(http_error)
    114 
--> 115     raise ex
    116 

AzureConflictHttpError: Conflict
{"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}

During handling of the above exception, another exception occurred:

AztkError                                 Traceback (most recent call last)
<ipython-input-2-d2c3d2edc779> in <module>

...

.../lib/python3.6/site-packages/aztk/spark/client/cluster/operations.py in create(self, cluster_configuration, wait)
     30             :obj:`aztk.spark.models.Cluster`: An Cluster object representing the state and configuration of the cluster.
     31         """
---> 32         return create.create_cluster(self._core_cluster_operations, self, cluster_configuration, wait)
     33 
     34     def delete(self, id: str, keep_logs: bool = False):

.../lib/python3.6/site-packages/aztk/spark/client/cluster/helpers/create.py in create_cluster(core_cluster_operations, spark_cluster_operations, cluster_conf, wait)
     64 
     65         cluster = core_cluster_operations.create(cluster_conf, software_metadata_key, start_task,
---> 66                                                  constants.SPARK_VM_IMAGE)
     67 
     68         # Wait for the master to be ready

.../lib/python3.6/site-packages/aztk/client/cluster/operations.py in create(self, cluster_configuration, software_metadata_key, start_task, vm_image_model)
     21         """
     22         return create.create_pool_and_job_and_table(self, cluster_configuration, software_metadata_key, start_task,
---> 23                                                     vm_image_model)
     24 
     25     def get(self, id: str):

.../lib/python3.6/site-packages/aztk/client/cluster/helpers/create.py in create_pool_and_job_and_table(core_cluster_operations, cluster_conf, software_metadata_key, start_task, VmImageModel)
     71     # create storage task table
     72     if cluster_conf.scheduling_target != models.SchedulingTarget.Any:
---> 73         core_cluster_operations.create_task_table(cluster_conf.cluster_id)
     74 
     75     return helpers.get_cluster(cluster_conf.cluster_id, core_cluster_operations.batch_client)

.../lib/python3.6/site-packages/aztk/client/base/base_operations.py in create_task_table(self, id)
    233             id (:obj:`str`): the id of the cluster
    234         """
--> 235         return task_table.create_task_table(self.table_service, id)
    236 
    237     def list_task_table_entries(self, id):

.../lib/python3.6/site-packages/aztk/utils/try_func.py in wrapper(*args, **kwargs)
     11                     raise raise_exception(exception_formatter(e))
     12                 else:
---> 13                     raise raise_exception(str(e))
     14 
     15         return wrapper

AztkError: Conflict
{"odata.error":{"code":"TableAlreadyExists","message":{"lang":"en-US","value":"The table specified already exists.\nRequestId:a1b5e2f8-9002-0110-1327-8e86f6000000\nTime:2018-12-07T12:20:47.0946074Z"}}}
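
For anyone reading the two tracebacks: the second one appears because the wrapper in try_func.py raises a new exception from inside its except block, carrying only str(e) of the original. A minimal sketch of that pattern is below. It is my own reconstruction for illustration, not the AZTK source: `translate_exceptions`, `ConflictError`, `WrappedError` and `is_table_conflict` are hypothetical names, and the toy `create_task_table` here just works on a set.

```python
import functools


class ConflictError(Exception):
    """Hypothetical stand-in for AzureConflictHttpError."""


class WrappedError(Exception):
    """Hypothetical stand-in for AztkError."""


def translate_exceptions(catch, raise_as):
    """Re-raise any `catch` exception as `raise_as`, keeping only its message."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            try:
                return function(*args, **kwargs)
            except catch as e:
                # Raising inside the except block chains the original
                # exception implicitly (it is stored in __context__).
                raise raise_as(str(e))

        return wrapper

    return decorator


@translate_exceptions(catch=ConflictError, raise_as=WrappedError)
def create_task_table(existing_tables, table_name):
    if table_name in existing_tables:
        raise ConflictError('Conflict\n{"odata.error":{"code":"TableAlreadyExists"}}')
    existing_tables.add(table_name)


def is_table_conflict(error):
    """True if the wrapped message carries the TableAlreadyExists code."""
    return "TableAlreadyExists" in str(error)


tables = {"staletable"}
caught = None
try:
    create_task_table(tables, "staletable")
except WrappedError as err:
    caught = err
```

Because the wrapped error keeps the original message, a caller that only sees the outer exception can still recognise this specific failure by looking for the TableAlreadyExists code in its text.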
