DisruptionBlocked : Node isn't initialized #2187

Open
RangaSamudrala opened this issue May 2, 2025 · 3 comments
Labels: kind/support, needs-priority, triage/needs-information

Comments

RangaSamudrala commented May 2, 2025

Hello,
I am trying to provision GPU nodes in AWS using a Karpenter node pool. We are able to provision non-GPU nodes, but GPU-based nodes encounter an issue where the NodeClaim says the node isn't initialized. I have tried not configuring any node disruption rules. I have created a startup toleration hoping PODs do not get scheduled until the node is ready. Kubernetes reports the node as Ready, but the NodeClaim says the node isn't initialized.

What can I do to provision GPU-based nodes?

nodes with taints:
ip-192-16-0-1.ec2.internal   [map[effect:NoSchedule key:hub.jupyter.org/node-purpose value:user] map[effect:NoSchedule key:nvidia.com/gpu value:true]]
------------------------------------------
kubectl get node:
ip-192-16-0-1.ec2.internal   Ready    <none>   3m32s   v1.31.5-eks-5d632ec
-------------------------------------------
describe nodeclaim jupyterhub-nodes-qdcbq 
Name:         jupyterhub-nodes-qdcbq
Namespace:    
Labels:       dedicated=gpu
              hub.jupyter.org/node-purpose=user
              karpenter.k8s.aws/ec2nodeclass=jupyterhub-nodeclass
              karpenter.k8s.aws/instance-category=g
              karpenter.k8s.aws/instance-cpu=4
              karpenter.k8s.aws/instance-cpu-manufacturer=amd
              karpenter.k8s.aws/instance-cpu-sustained-clock-speed-mhz=3300
              karpenter.k8s.aws/instance-ebs-bandwidth=3500
              karpenter.k8s.aws/instance-encryption-in-transit-supported=true
              karpenter.k8s.aws/instance-family=g5
              karpenter.k8s.aws/instance-generation=5
              karpenter.k8s.aws/instance-gpu-count=1
              karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
              karpenter.k8s.aws/instance-gpu-memory=22888
              karpenter.k8s.aws/instance-gpu-name=a10g
              karpenter.k8s.aws/instance-hypervisor=nitro
              karpenter.k8s.aws/instance-local-nvme=250
              karpenter.k8s.aws/instance-memory=16384
              karpenter.k8s.aws/instance-network-bandwidth=2500
              karpenter.k8s.aws/instance-size=xlarge
              karpenter.sh/capacity-type=spot
              karpenter.sh/nodepool=jupyterhub-nodes
              kubernetes.io/arch=amd64
              kubernetes.io/os=linux
              node.kubernetes.io/instance-type=g5.xlarge
              topology.k8s.aws/zone-id=use1-az2
              topology.kubernetes.io/region=us-east-1
              topology.kubernetes.io/zone=us-east-1a
Annotations:  compatibility.karpenter.k8s.aws/cluster-name-tagged: true
              karpenter.k8s.aws/ec2nodeclass-hash: 4420794736441035326
              karpenter.k8s.aws/ec2nodeclass-hash-version: v4
              karpenter.k8s.aws/tagged: true
              karpenter.sh/nodepool-hash: 12937045070593867698
              karpenter.sh/nodepool-hash-version: v3
API Version:  karpenter.sh/v1
Kind:         NodeClaim
Metadata:
  Creation Timestamp:  2025-05-02T13:14:25Z
  Finalizers:
    karpenter.sh/termination
  Generate Name:  jupyterhub-nodes-
  Generation:     1
  Owner References:
    API Version:           karpenter.sh/v1
    Block Owner Deletion:  true
    Kind:                  NodePool
    Name:                  jupyterhub-nodes
    UID:                   a03b9af5-5f91-46fe-ba5b-bf727565428e
  Resource Version:        821475755
  UID:                     8542c73b-2fd6-4d53-ab65-bc2bc2d5c95b
Spec:
  Expire After:  720h0m0s
  Node Class Ref:
    Group:  karpenter.k8s.aws
    Kind:   EC2NodeClass
    Name:   jupyterhub-nodeclass
  Requirements:
    Key:       topology.kubernetes.io/zone
    Operator:  In
    Values:
      us-east-1a
    Key:       dedicated
    Operator:  In
    Values:
      gpu
    Key:       karpenter.sh/capacity-type
    Operator:  In
    Values:
      on-demand
      spot
    Key:       kubernetes.io/os
    Operator:  In
    Values:
      linux
    Key:       node.kubernetes.io/instance-type
    Operator:  In
    Values:
      g5.12xlarge
      g5.16xlarge
      g5.24xlarge
      g5.2xlarge
      g5.4xlarge
      g5.8xlarge
      g5.xlarge
      g5g.16xlarge
      g5g.2xlarge
      g5g.4xlarge
      g5g.8xlarge
      g5g.metal
      g5g.xlarge
      g6.12xlarge
      g6.16xlarge
      g6.24xlarge
      g6.2xlarge
      g6.4xlarge
      g6.8xlarge
      g6.xlarge
      g6e.12xlarge
      g6e.16xlarge
      g6e.2xlarge
      g6e.4xlarge
      g6e.8xlarge
      g6e.xlarge
    Key:       karpenter.k8s.aws/instance-category
    Operator:  In
    Values:
      g
      p
      t
    Key:       karpenter.sh/nodepool
    Operator:  In
    Values:
      jupyterhub-nodes
    Key:       karpenter.k8s.aws/instance-generation
    Operator:  Gt
    Values:
      4
    Key:       hub.jupyter.org/node-purpose
    Operator:  In
    Values:
      user
    Key:       karpenter.k8s.aws/ec2nodeclass
    Operator:  In
    Values:
      jupyterhub-nodeclass
  Resources:
    Requests:
      Cpu:             460m
      Memory:          1204Mi
      nvidia.com/gpu:  1
      Pods:            8
  Taints:
    Effect:                  NoSchedule
    Key:                     hub.jupyter.org/node-purpose
    Value:                   user
    Effect:                  NoSchedule
    Key:                     nvidia.com/gpu
    Value:                   true
  Termination Grace Period:  30m0s
Status:
  Allocatable:
    Cpu:                        3920m
    Ephemeral - Storage:        449Gi
    Memory:                     15140996Ki
    nvidia.com/gpu:             1
    Pods:                       58
    vpc.amazonaws.com/pod-eni:  4
  Capacity:
    Cpu:                        4
    Ephemeral - Storage:        500Gi
    Memory:                     16157828Ki
    nvidia.com/gpu:             1
    Pods:                       58
    vpc.amazonaws.com/pod-eni:  4
  Conditions:
    Last Transition Time:  2025-05-02T13:14:25Z
    Message:               Resource "nvidia.com/gpu" was requested but not registered
    Observed Generation:   1
    Reason:                ResourceNotRegistered
    Status:                Unknown
    Type:                  Initialized
    Last Transition Time:  2025-05-02T13:14:27Z
    Message:               
    Observed Generation:   1
    Reason:                Launched
    Status:                True
    Type:                  Launched
    Last Transition Time:  2025-05-02T13:14:51Z
    Message:               
    Observed Generation:   1
    Reason:                Registered
    Status:                True
    Type:                  Registered
    Last Transition Time:  2025-05-02T13:14:25Z
    Message:               Initialized=Unknown
    Observed Generation:   1
    Reason:                ReconcilingDependents
    Status:                Unknown
    Type:                  Ready
  Image ID:                ami-056952fbcf6dfacab
  Node Name:               ip-192-168-0-1.ec2.internal
  Provider ID:             aws:///us-east-1a/i-09def779419432d37
Events:
  Type    Reason             Age    From       Message
  ----    ------             ----   ----       -------
  Normal  Launched           2m42s  karpenter  Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched
  Normal  DisruptionBlocked  2m33s  karpenter  Nodeclaim does not have an associated node
  Normal  Registered         2m18s  karpenter  Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered
  Normal  DisruptionBlocked  30s    karpenter  Node isn't initialized


-------------------------------------------

kubectl get pod -w
jupyter-x-user1---51c059ee   0/1     Pending   0          0s
jupyter-x-user1---51c059ee   0/1     Pending   0          0s
jupyter-x-user1---51c059ee   0/1     Pending   0          29s
jupyter-x-user1---51c059ee   0/1     Pending   0          31s
jupyter-x-user1---51c059ee   0/1     Pending   0          45s
jupyter-x-user1---51c059ee   0/1     Terminating   0          5m
jupyter-x-user1---51c059ee   0/1     Terminating   0          5m
-------------------------------------------

Below is the values file:

ec2NodeClassesAndPools:
  # jupyterhub Nodepool Configurations
  - name: jupyterhub-nodeclass
    ec2NodeName: "jupyterhub-nodes"
    belongingCluster: "apps"
    amiFamily: "AL2023"
    customAmiSpecification:
      - id: ami-056952fbcf6dfacab
        name: amazon-eks-node-al2023-x86_64-nvidia-1.31-v20250403
    blockDeviceMappings:
      - deviceName: /dev/xvda
        ebs:
          volumeSize: 500Gi
          volumeType: gp3
    nodepool:
      - name: jupyterhub-nodes
        metadataLabels:
          - key: hub.jupyter.org/node-purpose
            value: user
          - key: dedicated
            value: gpu
        startupTaints:
          - key: wait-for-node-ready
            effect: NoSchedule
        taints:
          - key: "hub.jupyter.org/node-purpose"
            value: "user"
            effect: "NoSchedule"
          - key: nvidia.com/gpu
            value: "true"
            effect: NoSchedule
        requirements:
          zone: ["us-east-1a", "us-east-1b", "us-east-1c"]
          cpu: []
          capacityType: ["spot", "on-demand"]
          minInstanceGpuSize: ""
          osType:
            - linux
          instanceCategory: ["p", "g", "t"]
          instanceGeneration: ["4"]
          excludeInstanceSize: []
        computeLimitation:
          cpu: 1000
          memory: 500Gi
        consolidateAfter: 15m
        nodeDisruptionRules: []
        #   # On Weekdays during business hours, don't do any deprovisioning regarding drift or underutilized nodes.
        #   - nodes: "0"
        #     schedule: "0 9 * * mon-fri"
        #     duration: 10h
        #     reasons:
        #       - Drifted
        #       - Underutilized
          # - nodes: "100%" # Disrupt if node is empty
          #   reasons:
          #     - Empty
          # During non-business hours do drift or underutilized for up to 1 node at a time
          # - nodes: "1"
          #   schedule: "0 21 * * mon-fri"
          #   duration: 9h
          #   reasons:
          #   - Drifted
          #   - Underutilized

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels May 2, 2025
jmdeal (Member) commented May 2, 2025

There are a couple of possibilities. The two most common reasons a NodeClaim fails to become initialized are:

  • There are startup taints on the node which haven't been removed
  • There are extended resources that haven't registered

> I have created a startup toleration hoping PODs do not get scheduled until the node is ready

I assume you mean a startup taint, and that you're referring to the wait-for-node-ready taint? Based on what you said above, I think you may misunderstand how startup taints work: Karpenter doesn't remove a startup taint once the node is ready. External processes remove the taint, and that removal is the signal for Karpenter to mark the NodeClaim as initialized. For example, the EBS CSI driver removes the ebs.csi.aws.com/agent-not-ready:NoExecute taint once the driver is ready.
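To make the startup-taint point concrete, here is a minimal Python sketch of the kind of comparison involved. It is an illustration only, not Karpenter's actual code: the function name `remaining_startup_taints` is hypothetical, and matching taints by key and effect is a simplifying assumption. The node taints and the startup taint are copied from the output and values file posted in this issue.

```python
def remaining_startup_taints(node_taints, startup_taints):
    """Return the startup taints that are still present on the node.

    Taints are matched on (key, effect); this is a simplification
    for illustration, not Karpenter's real matching logic.
    """
    present = {(t["key"], t["effect"]) for t in node_taints}
    return [t for t in startup_taints if (t["key"], t["effect"]) in present]


# Taints reported on the node in this issue.
node_taints = [
    {"key": "hub.jupyter.org/node-purpose", "value": "user", "effect": "NoSchedule"},
    {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"},
]
# Startup taint from the values file in this issue.
startup_taints = [{"key": "wait-for-node-ready", "effect": "NoSchedule"}]

blocking = remaining_startup_taints(node_taints, startup_taints)
print(blocking)  # prints []
```

With the taints as posted, the list comes back empty. Assuming that output reflects the live Node, this would suggest the startup taint is not what is holding up initialization here.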

If you do have something removing that taint, the next most likely candidate is the GPU extended resources failing to register. Can you verify that the expected GPU resources are present on the Node object?
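One way to run that check is to compare the NodeClaim's resource requests against the Node's `status.allocatable` (for example from `kubectl get node <name> -o json`). The Python sketch below approximates the idea and is not Karpenter's implementation: the function name `unregistered_resources` is hypothetical, quantities are compared as plain strings, and the Node allocatable values are an assumed example in which the GPU has not been advertised. The requests are copied from the NodeClaim spec posted above.

```python
def unregistered_resources(requests, node_allocatable):
    """Return requested resource names that are missing or zero on the Node.

    Quantities are compared as strings against "0", which is a
    simplification of real Kubernetes quantity parsing.
    """
    missing = []
    for name, quantity in requests.items():
        if str(quantity) == "0":
            continue  # nothing was requested for this resource
        if str(node_allocatable.get(name, "0")) == "0":
            missing.append(name)
    return missing


# Requests copied from the NodeClaim spec in this issue.
requests = {"cpu": "460m", "memory": "1204Mi", "nvidia.com/gpu": "1", "pods": "8"}
# Hypothetical Node status.allocatable where no GPU has been advertised yet.
node_allocatable = {"cpu": "3920m", "memory": "15140996Ki", "pods": "58"}

print(unregistered_resources(requests, node_allocatable))  # prints ['nvidia.com/gpu']
```

A non-empty result for `nvidia.com/gpu` would line up with the `ResourceNotRegistered` message on the Initialized condition. Note that the Allocatable block in the NodeClaim status is not the same object as the Node's own allocatable, so it is the Node that needs checking.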

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 2, 2025
jmdeal (Member) commented May 2, 2025

/kind support

@k8s-ci-robot k8s-ci-robot added the kind/support Categorizes issue or PR as a support question. label May 2, 2025
@ricardorqr

Anyone? Having the same problem.
