batch_add_cluster_nodes

SageMaker.Client.batch_add_cluster_nodes(**kwargs)

Adds nodes to a HyperPod cluster by incrementing the target count for one or more instance groups. This operation returns a unique NodeLogicalId for each node being added, which can be used to track the provisioning status of the node. This API provides a safer alternative to UpdateCluster for scaling operations by avoiding unintended configuration changes.

Note

This API is only supported for clusters using Continuous as the NodeProvisioningMode.

See also: AWS API Documentation

Request Syntax

response = client.batch_add_cluster_nodes(
    ClusterName='string',
    ClientToken='string',
    NodesToAdd=[
        {
            'InstanceGroupName': 'string',
            'IncrementTargetCountBy': 123
        },
    ]
)

Parameters:
  • ClusterName (string) –

    [REQUIRED]

    The name of the HyperPod cluster to which you want to add nodes.

  • ClientToken (string) –

    A unique, case-sensitive identifier that you provide to ensure the idempotency of the request. This token is valid for 8 hours. If you retry the request with the same client token and the same parameters within this timeframe, the API returns the same set of NodeLogicalIds with their latest status.

    This field is autopopulated if not provided.

  • NodesToAdd (list) –

    [REQUIRED]

    A list of instance groups and the number of nodes to add to each. You can specify up to 5 instance groups in a single request, with a maximum of 50 nodes total across all instance groups.

    • (dict) –

      Specifies an instance group and the number of nodes to add to it.

      • InstanceGroupName (string) – [REQUIRED]

        The name of the instance group to which you want to add nodes.

      • IncrementTargetCountBy (integer) – [REQUIRED]

        The number of nodes to add to the specified instance group. The total number of nodes across all instance groups in a single request cannot exceed 50.
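
As an illustration of the parameters above, the following sketch builds the request arguments and checks the documented limits (up to 5 instance groups and at most 50 nodes per request) before any call is made. The helper name build_add_nodes_request and its increments mapping are hypothetical and not part of boto3. The check for a positive increment is an assumption, since no minimum is stated here.

```python
import uuid

MAX_GROUPS_PER_REQUEST = 5   # documented limit on NodesToAdd entries
MAX_NODES_PER_REQUEST = 50   # documented total across all instance groups


def build_add_nodes_request(cluster_name, increments, client_token=None):
    """Build keyword arguments for batch_add_cluster_nodes.

    `increments` maps an instance group name to the number of nodes to add.
    This helper is illustrative only; it is not a boto3 function.
    """
    if not increments:
        raise ValueError("NodesToAdd needs at least one instance group")
    if len(increments) > MAX_GROUPS_PER_REQUEST:
        raise ValueError("at most 5 instance groups per request")
    # Assumption: increments are expected to be positive integers.
    if any(count < 1 for count in increments.values()):
        raise ValueError("IncrementTargetCountBy must be at least 1")
    if sum(increments.values()) > MAX_NODES_PER_REQUEST:
        raise ValueError("at most 50 nodes in total per request")
    return {
        "ClusterName": cluster_name,
        # Reuse the same token when retrying so the retry stays idempotent.
        "ClientToken": client_token or str(uuid.uuid4()),
        "NodesToAdd": [
            {"InstanceGroupName": name, "IncrementTargetCountBy": count}
            for name, count in increments.items()
        ],
    }


request = build_add_nodes_request("my-cluster", {"workers": 3, "gpu": 2})
# With a configured client, the call would then be:
#   client = boto3.client("sagemaker")
#   response = client.batch_add_cluster_nodes(**request)
```

Building the arguments once and passing the same dictionary on a retry keeps the ClientToken and parameters identical, which is what the idempotency guarantee relies on.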

Return type:

dict

Returns:

Response Syntax

{
    'Successful': [
        {
            'NodeLogicalId': 'string',
            'InstanceGroupName': 'string',
            'Status': 'Running'|'Failure'|'Pending'|'ShuttingDown'|'SystemUpdating'|'DeepHealthCheckInProgress'|'NotFound'
        },
    ],
    'Failed': [
        {
            'InstanceGroupName': 'string',
            'ErrorCode': 'InstanceGroupNotFound'|'InvalidInstanceGroupStatus',
            'FailedCount': 123,
            'Message': 'string'
        },
    ]
}

Response Structure

  • (dict) –

    • Successful (list) –

      A list of nodes that were successfully added to the cluster. Each entry includes a NodeLogicalId that can be used to track the node’s provisioning status (with DescribeClusterNode), the instance group name, and the current status of the node. The NodeLogicalId is unique per cluster and does not change between instance replacements.

      • (dict) –

        Information about a node that was successfully added to the cluster.

        • NodeLogicalId (string) –

          A unique identifier assigned to the node that can be used to track its provisioning status through the DescribeClusterNode operation.

        • InstanceGroupName (string) –

          The name of the instance group to which the node was added.

        • Status (string) –

          The current status of the node. Possible values include Pending, Running, Failure, ShuttingDown, SystemUpdating, DeepHealthCheckInProgress, and NotFound.

    • Failed (list) –

      A list of errors that occurred during the node addition operation. Each entry includes the instance group name, error code, number of failed additions, and an error message.

      • (dict) –

        Information about an error that occurred during the node addition operation.

        • InstanceGroupName (string) –

          The name of the instance group for which the error occurred.

        • ErrorCode (string) –

          The error code associated with the failure. Possible values include InstanceGroupNotFound and InvalidInstanceGroupStatus.

        • FailedCount (integer) –

          The number of nodes that failed to be added to the specified instance group.

        • Message (string) –

          A descriptive message providing additional details about the error.
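
Because a single response can contain both Successful and Failed entries, callers usually need to inspect both lists. The sketch below groups the returned NodeLogicalIds by instance group and collects the failures. The function name summarize_add_nodes_response and the sample response values are made up for illustration; only the keys come from the response structure above.

```python
from collections import defaultdict


def summarize_add_nodes_response(response):
    """Split a batch_add_cluster_nodes response into added nodes and failures.

    Returns (added, failures) where `added` maps instance group name to a
    list of NodeLogicalIds and `failures` is a list of tuples.
    """
    added = defaultdict(list)
    for node in response.get("Successful", []):
        added[node["InstanceGroupName"]].append(node["NodeLogicalId"])
    failures = [
        (f["InstanceGroupName"], f["ErrorCode"], f["FailedCount"], f.get("Message", ""))
        for f in response.get("Failed", [])
    ]
    return dict(added), failures


# Hand-written sample shaped like the response syntax; not real API output.
sample_response = {
    "Successful": [
        {"NodeLogicalId": "node-a", "InstanceGroupName": "workers", "Status": "Pending"},
        {"NodeLogicalId": "node-b", "InstanceGroupName": "workers", "Status": "Pending"},
    ],
    "Failed": [
        {
            "InstanceGroupName": "gpu",
            "ErrorCode": "InstanceGroupNotFound",
            "FailedCount": 2,
            "Message": "example message",
        }
    ],
}

added, failures = summarize_add_nodes_response(sample_response)
for group, code, count, message in failures:
    print(f"{group}: {count} node(s) not added ({code}): {message}")
```

The NodeLogicalIds collected in added are the values to pass on when checking provisioning status with DescribeClusterNode.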

Exceptions

  • SageMaker.Client.exceptions.ResourceNotFound

  • SageMaker.Client.exceptions.ResourceLimitExceeded
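
The two exceptions listed above are exposed on the client object, so they can be caught without importing botocore error classes directly. The following is a minimal sketch; the wrapper name try_add_nodes and the wording of its error strings are hypothetical, and any other error is left to propagate.

```python
def try_add_nodes(client, **request):
    """Call batch_add_cluster_nodes and map the documented exceptions.

    Returns (response, None) on success, or (None, description) when one of
    the two documented exceptions is raised.
    """
    try:
        return client.batch_add_cluster_nodes(**request), None
    except client.exceptions.ResourceNotFound as err:
        return None, f"cluster not found: {err}"
    except client.exceptions.ResourceLimitExceeded as err:
        return None, f"resource limit exceeded: {err}"


# With a configured client, usage might look like:
#   client = boto3.client("sagemaker")
#   response, error = try_add_nodes(
#       client,
#       ClusterName="my-cluster",
#       NodesToAdd=[{"InstanceGroupName": "workers", "IncrementTargetCountBy": 1}],
#   )
```

Note that per-group problems such as InstanceGroupNotFound are reported in the Failed list of a normal response rather than raised, so a successful return still needs its Failed entries checked.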