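In the previous article, "Trying out Alibaba Cloud Container Service Kubernetes 1.16: a shared TensorFlow lab" (《嚐鮮阿里雲容器服務Kubernetes 1.16,共享TensorFlow實驗室》), we described how to use the CGPU solution to share and isolate GPU resources.
This article covers auto scaling based on CGPU resources.
Note: the steps below build on the environment from the previous article; see that article for how to set it up.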
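Configure the auto scaling group
- In the "Cluster list", open the "More" drop-down menu for the target cluster and select "Auto Scaling"
- After configuring the basic "scale-in rules", click "Create Scaling Group" and select "Shared GPU instance"
- Then pick the instance type you need; this example uses "ecs.gn6i-c4g1.xlarge". Nodes provisioned by this group get the labels "cgpu: true, workload_type: gpushare" by default
- Click OK to finish configuring the scaling group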
Trigger scale-out
Save the following content as mem_deployment.yaml, then run kubectl apply -f mem_deployment.yaml to initialize the environment
---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
        - name: PASSWORD
          value: mypassw0rd
# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook
  type: LoadBalancer
Run kubectl scale --replicas 7 deploy/tf-notebook to raise the replica count to 7, which triggers the scaling group to scale out
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl scale --replicas 7 deploy/tf-notebook
deployment.extensions/tf-notebook scaled
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tf-notebook-7cf4575d78-dc2fr 0/1 Pending 0 19s <none> <none> <none> <none>
tf-notebook-7cf4575d78-jm2cb 0/1 Pending 0 19s <none> <none> <none> <none>
tf-notebook-7cf4575d78-lmn5w 0/1 Pending 0 19s <none> <none> <none> <none>
tf-notebook-7cf4575d78-n9ldb 1/1 Running 0 19s 172.20.64.39 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-rzgtl 1/1 Running 0 19s 172.20.64.40 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-vzxvb 1/1 Running 0 58m 172.20.64.36 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-w6spt 0/1 Pending 0 19s <none> <none> <none> <none>
# Provisioning the new nodes takes some time...
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tf-notebook-7cf4575d78-dc2fr 1/1 Running 0 2m10s 172.20.67.21 cn-zhangjiakou.192.168.3.198 <none> <none>
tf-notebook-7cf4575d78-jm2cb 1/1 Running 0 2m10s 172.20.67.20 cn-zhangjiakou.192.168.3.198 <none> <none>
tf-notebook-7cf4575d78-lmn5w 1/1 Running 0 2m10s 172.20.67.79 cn-zhangjiakou.192.168.3.199 <none> <none>
tf-notebook-7cf4575d78-n9ldb 1/1 Running 0 2m10s 172.20.64.39 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-rzgtl 1/1 Running 0 2m10s 172.20.64.40 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-vzxvb 1/1 Running 0 60m 172.20.64.36 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-w6spt 1/1 Running 0 2m10s 172.20.67.22 cn-zhangjiakou.192.168.3.198 <none> <none>
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get node -L cgpu,workload_type
NAME STATUS ROLES AGE VERSION CGPU WORKLOAD_TYPE
cn-zhangjiakou.192.168.0.138 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113 Ready <none> 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184 Ready <none> 8d v1.16.6-aliyun.1 true
cn-zhangjiakou.192.168.3.189 Ready <none> 7d9h v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198 Ready <none> 134m v1.16.6-aliyun.1 true gpushare
cn-zhangjiakou.192.168.3.199 Ready <none> 129m v1.16.6-aliyun.1 true gpushare
jumper(⎈ |zjk-gpu:default)➜ ~ arena top node -s -d
NAME: cn-zhangjiakou.192.168.3.184
IPADDRESS: 192.168.3.184
NAME NAMESPACE GPU0(Allocated)
tf-notebook-7cf4575d78-n9ldb default 4
tf-notebook-7cf4575d78-rzgtl default 4
tf-notebook-7cf4575d78-vzxvb default 4
Allocated : 12 (85%)
Total : 14
----------------------------------------------------------------------------------------------------------------------------------
NAME: cn-zhangjiakou.192.168.3.198
IPADDRESS: 192.168.3.198
NAME NAMESPACE GPU0(Allocated)
tf-notebook-7cf4575d78-dc2fr default 4
tf-notebook-7cf4575d78-jm2cb default 4
tf-notebook-7cf4575d78-w6spt default 4
Allocated : 12 (85%)
Total : 14
----------------------------------------------------------------------------------------------------------------------------------
NAME: cn-zhangjiakou.192.168.3.199
IPADDRESS: 192.168.3.199
NAME NAMESPACE GPU0(Allocated)
tf-notebook-7cf4575d78-lmn5w default 4
Allocated : 4 (28%)
Total : 14
----------------------------------------------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node:
28/42 (GiB) (66%)
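As shown above, raising the replica count to 7 provisioned two additional GPU nodes labeled "cgpu: true, workload_type: gpushare".
The arena output shows that 28 of the 42 GiB of GPU memory on the shared GPU nodes is allocated.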
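The numbers above can be checked with a little integer arithmetic. The following is a minimal sketch, not part of the original setup: the function names are hypothetical, and it assumes that each node exposes a single 14 GiB GPU (as in the arena output), that every pod requests 4 GiB, and that GPU memory is the only constraint on scheduling.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; the inputs (7 replicas, 4 GiB per pod,
# 14 GiB per node) are taken from the manifest and the arena output above.

# pods_per_node NODE_GIB POD_GIB: whole pods that fit into one node's GPU memory.
pods_per_node() {
  echo $(( $1 / $2 ))
}

# nodes_needed REPLICAS NODE_GIB POD_GIB: ceiling of replicas / pods-per-node.
nodes_needed() {
  echo $(( ($1 + ($2 / $3) - 1) / ($2 / $3) ))
}

# percent_used USED TOTAL: integer percentage, rounded down.
percent_used() {
  echo $(( $1 * 100 / $2 ))
}

replicas=7; node_gib=14; pod_gib=4
total_nodes=$(nodes_needed "$replicas" "$node_gib" "$pod_gib")
used=$(( replicas * pod_gib ))
total=$(( total_nodes * node_gib ))

echo "pods per node: $(pods_per_node "$node_gib" "$pod_gib")"
echo "GPU nodes needed: $total_nodes (new nodes: $(( total_nodes - 1 )))"
echo "allocated: $used/$total GiB ($(percent_used "$used" "$total")%)"
```

Under these assumptions the script reports 3 pods per node, 3 GPU nodes in total (2 of them new), and 28/42 GiB (66%) allocated, which matches what the cluster did.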
Trigger scale-in
As shown above, shared GPU nodes scale out as expected. Next, we release the resources to verify how shared GPU nodes scale in
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl scale --replicas 1 deploy/tf-notebook
deployment.extensions/tf-notebook scaled
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tf-notebook-7cf4575d78-dc2fr 1/1 Terminating 0 4m7s 172.20.67.21 cn-zhangjiakou.192.168.3.198 <none> <none>
tf-notebook-7cf4575d78-jm2cb 1/1 Terminating 0 4m7s 172.20.67.20 cn-zhangjiakou.192.168.3.198 <none> <none>
tf-notebook-7cf4575d78-lmn5w 1/1 Terminating 0 4m7s 172.20.67.79 cn-zhangjiakou.192.168.3.199 <none> <none>
tf-notebook-7cf4575d78-n9ldb 1/1 Terminating 0 4m7s 172.20.64.39 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-rzgtl 1/1 Terminating 0 4m7s 172.20.64.40 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-vzxvb 1/1 Running 0 62m 172.20.64.36 cn-zhangjiakou.192.168.3.184 <none> <none>
tf-notebook-7cf4575d78-w6spt 1/1 Terminating 0 4m7s 172.20.67.22 cn-zhangjiakou.192.168.3.198 <none> <none>
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get node
NAME STATUS ROLES AGE VERSION
cn-zhangjiakou.192.168.0.138 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113 Ready <none> 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184 Ready <none> 8d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189 Ready <none> 7d8h v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198 Ready <none> 78m v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.199 Ready <none> 73m v1.16.6-aliyun.1
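# At this point the newly provisioned machines are still Ready. They are removed in the next scale-in cycle; how long that takes depends on the scale-in settings of the scaling group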
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get node
NAME STATUS ROLES AGE VERSION
cn-zhangjiakou.192.168.0.138 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113 Ready <none> 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115 Ready master 19d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.184 Ready <none> 8d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189 Ready <none> 7d9h v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198 NotReady <none> 142m v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.199 NotReady <none> 137m v1.16.6-aliyun.1
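As shown above, some time after the replica count is reduced, the newly provisioned machines are released. This example uses the ECS swift mode (極速模式), so the nodes show as NotReady instead of disappearing outright. Swift mode makes the next startup faster, at the cost of a small storage fee.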
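To see whether the scale-in has taken effect without reading the node table by eye, the STATUS column can be counted. This is a minimal sketch: count_nodes_by_status is a hypothetical helper, it matches the STATUS column exactly (so a value such as "Ready,SchedulingDisabled" is not counted as Ready), and the sample rows are copied from the output above rather than read from a live cluster.

```shell
#!/usr/bin/env bash
# count_nodes_by_status STATUS: reads rows in the format printed by
# `kubectl get node --no-headers` on stdin and prints how many rows have
# exactly STATUS in the second column.
count_nodes_by_status() {
  awk -v want="$1" '$2 == want { n++ } END { print n + 0 }'
}

# Against a live cluster (not run here):
#   kubectl get node --no-headers | count_nodes_by_status NotReady

# Demo with rows copied from the output above.
sample='cn-zhangjiakou.192.168.3.184 Ready <none> 8d v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.198 NotReady <none> 142m v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.199 NotReady <none> 137m v1.16.6-aliyun.1'

printf '%s\n' "$sample" | count_nodes_by_status NotReady
```

For the sample rows this prints 2, one for each of the nodes that were scaled in.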