
Karpenter Autoscaling in Practice

Jacob
Learning with humility

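On EKS, I replaced Cluster Autoscaler with Karpenter for node autoscaling. This post records the minimal working configuration and the pitfalls I hit along the way.

Why switch to Karpenter

Cluster Autoscaler needs Auto Scaling groups (ASGs) defined up front, any change to the instance types means rebuilding them, and scale-up is slow. Karpenter talks to EC2 directly: once Pods go Pending it can bring up a node within tens of seconds, and it picks instance types based on what the Pods actually request, so there is no pile of ASGs to maintain.

Overall approach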


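  • IAM: use the official CloudFormation template to create KarpenterNodeRole and KarpenterControllerPolicy
  • Interruption events: set up SQS + EventBridge to receive Spot interruptions, health events, and instance state changes
  • NodeClass: one place to define the AMI, EBS volumes, and how subnets and security groups are discovered
  • NodePool: two pools; the primary pool prefers AMD (cheaper), the secondary pool is the fallback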

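IAM & interruption queue

Deploy straight from the official CFN template. Key resources:

  • KarpenterNodeRole-<cluster>: the EC2 role for nodes, with AmazonEKS_CNI_Policy / AmazonEKSWorkerNodePolicy / AmazonEC2ContainerRegistryPullOnly / AmazonSSMManagedInstanceCore attached
  • KarpenterControllerPolicy-<cluster>: the policy used by the controller; it requires actions such as RunInstances / CreateFleet to carry the kubernetes.io/cluster/<cluster>=owned tag, which keeps the controller from acting outside its own cluster
  • KarpenterInterruptionQueue: an SQS queue fed by 4 EventBridge rules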
    • aws.health - AWS Health Event
    • aws.ec2 - EC2 Spot Instance Interruption Warning
    • aws.ec2 - EC2 Instance Rebalance Recommendation
    • aws.ec2 - EC2 Instance State-change Notification

Don't forget to tag the subnets and security groups with karpenter.sh/discovery: <cluster-name>; that tag is how Karpenter discovers network resources.
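
The tagging step can be sketched with the AWS CLI. This is a minimal sketch, not part of the original setup: the function names, the my-cluster default, and the resource IDs in the usage line are hypothetical placeholders.

```shell
# Hypothetical placeholder; set CLUSTER_NAME to the real EKS cluster name.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"

# Tag one or more subnets / security groups so Karpenter can discover them.
tag_for_discovery() {
  if [ "$#" -eq 0 ]; then
    echo "usage: tag_for_discovery <resource-id>..." >&2
    return 1
  fi
  aws ec2 create-tags \
    --resources "$@" \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}"
}

# List the subnets that currently carry the discovery tag.
list_discovery_subnets() {
  aws ec2 describe-subnets \
    --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query 'Subnets[].SubnetId' \
    --output text
}
```

For example, tag_for_discovery subnet-0123456789abcdef0 sg-0123456789abcdef0 tags both resources in one call, and list_discovery_subnets then confirms which subnets match.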

karpenter-cfn.yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Resources used by https://github.com/aws/karpenter
Parameters:
  ClusterName:
    Type: String
    Description: "EKS cluster name"
Resources:
  KarpenterNodeRole:
    Type: "AWS::IAM::Role"
    Properties:
      RoleName: !Sub "KarpenterNodeRole-${ClusterName}"
      Path: /
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: !Sub "ec2.${AWS::URLSuffix}"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEKS_CNI_Policy"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEKSWorkerNodePolicy"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonEC2ContainerRegistryPullOnly"
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore"
  KarpenterControllerPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub "KarpenterControllerPolicy-${ClusterName}"
      PolicyDocument: !Sub |
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Sid": "AllowScopedEC2InstanceAccessActions",
              "Effect": "Allow",
              "Resource": [
                "arn:${AWS::Partition}:ec2:${AWS::Region}::image/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}::snapshot/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:security-group/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:subnet/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:capacity-reservation/*"
              ],
              "Action": ["ec2:RunInstances", "ec2:CreateFleet"]
            },
            {
              "Sid": "AllowScopedEC2LaunchTemplateAccessActions",
              "Effect": "Allow",
              "Resource": "arn:${AWS::Partition}:ec2:${AWS::Region}:*:launch-template/*",
              "Action": ["ec2:RunInstances", "ec2:CreateFleet"],
              "Condition": {
                "StringEquals": {
                  "aws:ResourceTag/kubernetes.io/cluster/${ClusterName}": "owned"
                },
                "StringLike": {
                  "aws:ResourceTag/karpenter.sh/nodepool": "*"
                }
              }
            },
            {
              "Sid": "AllowScopedEC2InstanceActionsWithTags",
              "Effect": "Allow",
              "Resource": [
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:fleet/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:instance/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:volume/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:network-interface/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:launch-template/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:spot-instances-request/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:capacity-reservation/*"
              ],
              "Action": ["ec2:RunInstances", "ec2:CreateFleet", "ec2:CreateLaunchTemplate"],
              "Condition": {
                "StringEquals": {
                  "aws:RequestTag/kubernetes.io/cluster/${ClusterName}": "owned",
                  "aws:RequestTag/eks:eks-cluster-name": "${ClusterName}"
                },
                "StringLike": {
                  "aws:RequestTag/karpenter.sh/nodepool": "*"
                }
              }
            },
            {
              "Sid": "AllowScopedResourceCreationTagging",
              "Effect": "Allow",
              "Resource": [
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:fleet/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:instance/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:volume/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:network-interface/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:launch-template/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:spot-instances-request/*"
              ],
              "Action": "ec2:CreateTags",
              "Condition": {
                "StringEquals": {
                  "aws:RequestTag/kubernetes.io/cluster/${ClusterName}": "owned",
                  "aws:RequestTag/eks:eks-cluster-name": "${ClusterName}",
                  "ec2:CreateAction": ["RunInstances", "CreateFleet", "CreateLaunchTemplate"]
                },
                "StringLike": {
                  "aws:RequestTag/karpenter.sh/nodepool": "*"
                }
              }
            },
            {
              "Sid": "AllowScopedResourceTagging",
              "Effect": "Allow",
              "Resource": "arn:${AWS::Partition}:ec2:${AWS::Region}:*:instance/*",
              "Action": "ec2:CreateTags",
              "Condition": {
                "StringEquals": {
                  "aws:ResourceTag/kubernetes.io/cluster/${ClusterName}": "owned"
                },
                "StringLike": {
                  "aws:ResourceTag/karpenter.sh/nodepool": "*"
                },
                "StringEqualsIfExists": {
                  "aws:RequestTag/eks:eks-cluster-name": "${ClusterName}"
                },
                "ForAllValues:StringEquals": {
                  "aws:TagKeys": ["eks:eks-cluster-name", "karpenter.sh/nodeclaim", "Name"]
                }
              }
            },
            {
              "Sid": "AllowScopedDeletion",
              "Effect": "Allow",
              "Resource": [
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:instance/*",
                "arn:${AWS::Partition}:ec2:${AWS::Region}:*:launch-template/*"
              ],
              "Action": ["ec2:TerminateInstances", "ec2:DeleteLaunchTemplate"],
              "Condition": {
                "StringEquals": {
                  "aws:ResourceTag/kubernetes.io/cluster/${ClusterName}": "owned"
                },
                "StringLike": {
                  "aws:ResourceTag/karpenter.sh/nodepool": "*"
                }
              }
            },
            {
              "Sid": "AllowRegionalReadActions",
              "Effect": "Allow",
              "Resource": "*",
              "Action": [
                "ec2:DescribeCapacityReservations",
                "ec2:DescribeImages",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSpotPriceHistory",
                "ec2:DescribeSubnets"
              ],
              "Condition": {
                "StringEquals": {
                  "aws:RequestedRegion": "${AWS::Region}"
                }
              }
            },
            {
              "Sid": "AllowSSMReadActions",
              "Effect": "Allow",
              "Resource": "arn:${AWS::Partition}:ssm:${AWS::Region}::parameter/aws/service/*",
              "Action": "ssm:GetParameter"
            },
            {
              "Sid": "AllowPricingReadActions",
              "Effect": "Allow",
              "Resource": "*",
              "Action": "pricing:GetProducts"
            },
            {
              "Sid": "AllowInterruptionQueueActions",
              "Effect": "Allow",
              "Resource": "${KarpenterInterruptionQueue.Arn}",
              "Action": ["sqs:DeleteMessage", "sqs:GetQueueUrl", "sqs:ReceiveMessage"]
            },
            {
              "Sid": "AllowPassingInstanceRole",
              "Effect": "Allow",
              "Resource": "${KarpenterNodeRole.Arn}",
              "Action": "iam:PassRole",
              "Condition": {
                "StringEquals": {
                  "iam:PassedToService": ["ec2.amazonaws.com", "ec2.amazonaws.com.cn"]
                }
              }
            },
            {
              "Sid": "AllowScopedInstanceProfileCreationActions",
              "Effect": "Allow",
              "Resource": "arn:${AWS::Partition}:iam::${AWS::AccountId}:instance-profile/*",
              "Action": ["iam:CreateInstanceProfile"],
              "Condition": {
                "StringEquals": {
                  "aws:RequestTag/kubernetes.io/cluster/${ClusterName}": "owned",
                  "aws:RequestTag/eks:eks-cluster-name": "${ClusterName}",
                  "aws:RequestTag/topology.kubernetes.io/region": "${AWS::Region}"
                },
                "StringLike": {
                  "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
              }
            },
            {
              "Sid": "AllowScopedInstanceProfileTagActions",
              "Effect": "Allow",
              "Resource": "arn:${AWS::Partition}:iam::${AWS::AccountId}:instance-profile/*",
              "Action": ["iam:TagInstanceProfile"],
              "Condition": {
                "StringEquals": {
                  "aws:ResourceTag/kubernetes.io/cluster/${ClusterName}": "owned",
                  "aws:ResourceTag/topology.kubernetes.io/region": "${AWS::Region}",
                  "aws:RequestTag/kubernetes.io/cluster/${ClusterName}": "owned",
                  "aws:RequestTag/eks:eks-cluster-name": "${ClusterName}",
                  "aws:RequestTag/topology.kubernetes.io/region": "${AWS::Region}"
                },
                "StringLike": {
                  "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*",
                  "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
              }
            },
            {
              "Sid": "AllowScopedInstanceProfileActions",
              "Effect": "Allow",
              "Resource": "arn:${AWS::Partition}:iam::${AWS::AccountId}:instance-profile/*",
              "Action": [
                "iam:AddRoleToInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:DeleteInstanceProfile"
              ],
              "Condition": {
                "StringEquals": {
                  "aws:ResourceTag/kubernetes.io/cluster/${ClusterName}": "owned",
                  "aws:ResourceTag/topology.kubernetes.io/region": "${AWS::Region}"
                },
                "StringLike": {
                  "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
              }
            },
            {
              "Sid": "AllowInstanceProfileReadActions",
              "Effect": "Allow",
              "Resource": "arn:${AWS::Partition}:iam::${AWS::AccountId}:instance-profile/*",
              "Action": "iam:GetInstanceProfile"
            },
            {
              "Sid": "AllowUnscopedInstanceProfileListAction",
              "Effect": "Allow",
              "Resource": "*",
              "Action": "iam:ListInstanceProfiles"
            },
            {
              "Sid": "AllowAPIServerEndpointDiscovery",
              "Effect": "Allow",
              "Resource": "arn:${AWS::Partition}:eks:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}",
              "Action": "eks:DescribeCluster"
            }
          ]
        }
  KarpenterInterruptionQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Sub "${ClusterName}"
      MessageRetentionPeriod: 300
      SqsManagedSseEnabled: true
  KarpenterInterruptionQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref KarpenterInterruptionQueue
      PolicyDocument:
        Id: EC2InterruptionPolicy
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - events.amazonaws.com
                - sqs.amazonaws.com
            Action: sqs:SendMessage
            Resource: !GetAtt KarpenterInterruptionQueue.Arn
          - Sid: DenyHTTP
            Effect: Deny
            Action: sqs:*
            Resource: !GetAtt KarpenterInterruptionQueue.Arn
            Condition:
              Bool:
                aws:SecureTransport: false
            Principal: "*"
  ScheduledChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source: [aws.health]
        detail-type: [AWS Health Event]
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  SpotInterruptionRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source: [aws.ec2]
        detail-type: [EC2 Spot Instance Interruption Warning]
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  RebalanceRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source: [aws.ec2]
        detail-type: [EC2 Instance Rebalance Recommendation]
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
  InstanceStateChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      EventPattern:
        source: [aws.ec2]
        detail-type: [EC2 Instance State-change Notification]
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
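
Deploying the template might look like the sketch below. It assumes the template is saved locally as karpenter-cfn.yaml; the function name, the my-cluster default, and the Karpenter-<cluster> stack name are illustrative choices, not something the template itself requires.

```shell
# Hypothetical placeholder; set CLUSTER_NAME to the real EKS cluster name.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"

# Create or update the stack from the local template file.
# CAPABILITY_NAMED_IAM is needed because the template creates IAM
# resources with explicit names (RoleName / ManagedPolicyName).
deploy_karpenter_stack() {
  aws cloudformation deploy \
    --stack-name "Karpenter-${CLUSTER_NAME}" \
    --template-file karpenter-cfn.yaml \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameter-overrides "ClusterName=${CLUSTER_NAME}"
}
```

Calling deploy_karpenter_stack is idempotent in the usual CloudFormation sense: a second run with an unchanged template reports that there are no changes to deploy.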

EC2NodeClass

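Key points:

  • AMI pinned: al2023@v20260304 fixes the version explicitly, so a newly released AMI cannot trigger an unexpected automatic rolling upgrade
  • EBS encrypted, gp3, 50 GiB; adjust as needed
  • Subnets / security groups are discovered by tag, which naturally covers multiple AZs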
ec2-node-class-default.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-<cluster-name>"
  amiSelectorTerms:
    - alias: "al2023@v20260304"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<cluster-name>"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<cluster-name>"

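NodePool: primary pool + fallback pool

Two pools, with weight controlling priority. The primary pool has weight=100 and prefers AMD; the fallback pool has weight=10 and does not pin the CPU vendor.

Why split it this way?

  • AMD first: AMD instances in the c/m families (c6a, m6a, etc.) are about 10% cheaper than their Intel counterparts, with a negligible performance gap
  • Fallback pool: AMD capacity can be unavailable in some regions, so the fallback pool drops the CPU vendor restriction
  • CPU range pinned to 8-64 vCPUs: nodes that are too small cause fragmentation, nodes that are too large have a big failure blast radius
  • c / m families only: no t series (unstable burst performance), no r / x series (too much memory goes to waste)
  • Generations 6 and 7 only: older generations have worse price-performance
  • expireAfter: 720h: nodes are forcibly rebuilt after 30 days, so they get rolled regularly
  • consolidateAfter: 10m: underutilized nodes are consolidated after 10 minutes, which avoids frequent churn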
node-pool-default-amd.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default-amd
spec:
  weight: 100
  template:
    metadata:
      labels:
        nodepool: default-amd
        capacity-profile: elastic
        cpu-preference: amd
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["8", "16", "32", "64"]
        - key: karpenter.k8s.aws/instance-cpu-manufacturer
          operator: In
          values: ["amd"]
        - key: karpenter.k8s.aws/instance-generation
          operator: In
          values: ["6", "7"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  limits:
    cpu: 800
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m
node-pool-default-fallback.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default-fallback
spec:
  weight: 10
  template:
    metadata:
      labels:
        nodepool: default-fallback
        capacity-profile: elastic
        cpu-preference: generic
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["8", "16", "32", "64"]
        - key: karpenter.k8s.aws/instance-generation
          operator: In
          values: ["6", "7"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  limits:
    cpu: 200
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m

Organizing with Kustomize

Apply everything in one go with kubectl apply -k . :

kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- node-pool-default-amd.yaml
- node-pool-default-fallback.yaml
- ec2-node-class-default.yaml

Verifying autoscaling

Verify with a Deployment large enough to force new nodes. The pause container runs no logic; it only requests resources:

test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-karpenter
  namespace: default
spec:
  replicas: 10
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
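
As a rough sizing check, the requests in test.yaml can be added up to see how much capacity the Deployment should force Karpenter to provision. The sketch below is plain shell arithmetic; the min_nodes helper is a hypothetical name, and the result is only a CPU-based lower bound because real nodes reserve part of their capacity for the kubelet, system daemons, and DaemonSets.

```shell
# Values copied from test.yaml
replicas=10
cpu_per_pod=2        # vCPU requested per Pod
mem_per_pod_gib=4    # GiB requested per Pod

total_cpu=$((replicas * cpu_per_pod))
total_mem_gib=$((replicas * mem_per_pod_gib))
echo "requested: ${total_cpu} vCPU, ${total_mem_gib} GiB"

# Lower bound on the node count for a given node size (ceiling division on CPU only).
min_nodes() {
  local node_cpu="$1"
  echo $(( (total_cpu + node_cpu - 1) / node_cpu ))
}

echo "8-vCPU nodes needed (at least): $(min_nodes 8)"
echo "16-vCPU nodes needed (at least): $(min_nodes 16)"
```

This prints 20 vCPU and 40 GiB requested, with a lower bound of 3 nodes at 8 vCPUs or 2 nodes at 16 vCPUs, so seeing Karpenter launch a handful of mid-sized c/m instances is the expected outcome.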

Watch the logs during scale-up:

kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f

You should see Created nodeclaim → Registered node → Pod scheduled; the whole sequence usually takes 40 to 60 seconds.

Then scale back down:

kubectl scale deployment test-karpenter --replicas=0

After roughly 10 minutes (consolidateAfter) the nodes are consolidated and released.
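
Besides the controller logs, the scale-up and scale-down can be followed from the node side. A small sketch, assuming kubectl points at the cluster; the two function names are hypothetical, while the labels are the ones Karpenter and Kubernetes put on nodes.

```shell
# Show Karpenter-launched nodes with their pool, instance type and capacity type.
show_karpenter_nodes() {
  kubectl get nodes \
    -l karpenter.sh/nodepool \
    -L karpenter.sh/nodepool \
    -L node.kubernetes.io/instance-type \
    -L karpenter.sh/capacity-type
}

# Follow NodeClaims as they appear during scale-up and disappear after consolidation.
watch_nodeclaims() {
  kubectl get nodeclaims --watch
}
```

Running show_karpenter_nodes after the scale-up should list the new nodes under default-amd (or default-fallback if AMD capacity was unavailable).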

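Pitfalls

1. Subnets / security groups not tagged

Pods sat in Pending for a long time while Karpenter did nothing, and the logs reported that no subnets could be found. kubectl describe ec2nodeclass default showed an empty subnet list under status; once the karpenter.sh/discovery tag was added, provisioning recovered immediately.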
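
A quicker way to run the same check is to print only what the EC2NodeClass resolved. This is a sketch that assumes the v1 EC2NodeClass status exposes subnets and securityGroups lists with id and zone fields; the function name is hypothetical.

```shell
# Print the subnets and security groups an EC2NodeClass resolved.
# Empty sections mean nothing matched the selector tags.
show_nodeclass_discovery() {
  local name="${1:-default}"
  echo "subnets:"
  kubectl get ec2nodeclass "$name" \
    -o jsonpath='{range .status.subnets[*]}{.id}{"\t"}{.zone}{"\n"}{end}'
  echo "security groups:"
  kubectl get ec2nodeclass "$name" \
    -o jsonpath='{range .status.securityGroups[*]}{.id}{"\n"}{end}'
}
```

With no argument, show_nodeclass_discovery inspects the default EC2NodeClass used in this post.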

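2. AMI alias without a pinned version

At first I wrote alias: al2023@latest. When nodes reached expireAfter they were rebuilt on the newest AMI automatically, the new AMI turned out to conflict with one of the DaemonSets, and a batch of nodes went down. After pinning the version, upgrades only happen manually in a controlled window.

3. NodePool CPU limits set too low

When traffic ramps up suddenly, a NodePool stops scaling as soon as it reaches limits.cpu. Once the primary pool hits that wall, the fallback pool has to absorb the rest, so the limits of both pools need to be sized against the business peak.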
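
The sizing check can be reduced to simple arithmetic on the two limits.cpu values from the manifests in this post. In the sketch below the fits_limits helper and the 20% default headroom are assumptions for illustration, not recommendations from Karpenter.

```shell
# limits.cpu values copied from the two NodePool manifests
amd_limit=800
fallback_limit=200
total_limit=$((amd_limit + fallback_limit))

# Succeed when the expected peak (in vCPU) plus a headroom percentage
# still fits under the combined limit of both pools.
fits_limits() {
  local peak_cpu="$1" headroom_pct="${2:-20}"
  local needed=$(( peak_cpu * (100 + headroom_pct) / 100 ))
  echo "peak=${peak_cpu} needed=${needed} limit=${total_limit}"
  [ "$needed" -le "$total_limit" ]
}
```

For example, fits_limits 800 succeeds (960 vCPU needed against a combined 1000), while fits_limits 900 fails (1080 vCPU needed), which is the signal to raise the limits before the peak arrives.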

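4. Spot interruptions not wired to SQS

Early on there was no interruption queue, so when a Spot instance was reclaimed the Pods on that node were killed outright with no graceful eviction. With SQS + EventBridge in place, Karpenter receives the 2-minute interruption warning and starts provisioning a replacement node, cordons the old node, and migrates the Pods.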
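
Creating the queue is only half of the wiring; the controller also has to be told which queue to read. A minimal sketch, assuming Karpenter is installed from the public OCI Helm chart into kube-system and that the queue is named after the cluster (as QueueName in the template above is). The function name and my-cluster default are hypothetical, and any other values an installation needs (service account annotations, resource requests, and so on) are omitted here.

```shell
# Hypothetical placeholder; set CLUSTER_NAME to the real EKS cluster name.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"

# Point the Karpenter controller at the interruption queue.
# KARPENTER_VERSION must be set to the chart version already in use.
configure_interruption_queue() {
  : "${KARPENTER_VERSION:?set KARPENTER_VERSION to the chart version in use}"
  helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
    --version "${KARPENTER_VERSION}" \
    --namespace kube-system \
    --set "settings.clusterName=${CLUSTER_NAME}" \
    --set "settings.interruptionQueue=${CLUSTER_NAME}" \
    --wait
}
```

After the rollout, the controller logs should show interruption messages being handled whenever one of the four EventBridge rules fires.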

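5. Avoid burstable t-series instances

I initially added the t series to requirements to save money; once CI jobs burned through the CPU credits, the nodes effectively froze. The c/m series has been far more stable.

Rollback

If Karpenter misbehaves and you need to go back to CA: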

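# 1. Delete the NodePools first so that no new nodes get created
kubectl delete nodepool --all

# 2. Wait for the Pods on Karpenter-managed nodes to be migrated or evicted
#    (a manual cordon + drain speeds this up)

# 3. Uninstall Karpenter
helm uninstall karpenter -n kube-system

# 4. Restore the CA configuration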

The CFN stack and the IAM roles can stay in place; they do not affect anything else.
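
The manual cordon + drain mentioned in step 2 can be sketched as a loop over the nodes that carry the karpenter.sh/nodepool label. The function name and the 10-minute timeout are assumptions; the sketch also assumes replacement capacity (for example the CA-managed node groups) is already available, and that Karpenter may be draining some of these nodes on its own at the same time.

```shell
# Cordon and drain every node that carries the karpenter.sh/nodepool label.
drain_karpenter_nodes() {
  local node
  for node in $(kubectl get nodes -l karpenter.sh/nodepool \
      -o jsonpath='{.items[*].metadata.name}'); do
    # Stop new Pods from landing on the node, then evict what is running.
    kubectl cordon "$node"
    kubectl drain "$node" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --timeout=10m
  done
}
```

Pods protected by a PodDisruptionBudget can make kubectl drain wait until the timeout, so it is worth checking the budgets before calling drain_karpenter_nodes.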

References