雲計算

Kubeflow 1.0 上線: 體驗生產級的機器學習平臺

背景

從2017年12月Kubeflow在Kubecon USA宣佈開源至今,已經經過2年多的時間。在過去的兩年裡Kubeflow已經成長為一個擁有數百名貢獻者的優秀開源項目。Kubeflow的目標是讓機器學習工程師或者數據科學家可以利用本地或者共有的雲資源構建屬於自己的ML的工作負載。2020年3月,Kubeflow正式發佈1.0版本。在Kubeflow 1.0的版本中, 有多項重要的核心應用畢業,這些應用幫助用戶在Kubernetes的平臺上高效的開發、構建、訓練和部署模型。畢業的應用有:

  • Kubeflow's UI, the central dashboard
  • Jupyter notebook controller and web app
  • Tensorflow Operator (TFJob) and PyTorch Operator for distributed training
  • kfctl for deployment and upgrades
  • Profile controller and UI for multiuser management

通過使用Kubeflow1.0, 用戶可以通過Jupyter開發模型,然後利用Kubeflow提供的工具fairing(Kubeflow’s python SDK)來構建鏡像,最後通過Kubernetes調度資源來訓練模型。一旦他們有了訓練好的模型,他們可以使用KFServing來創建部署推理服務。


376257940305ae2238f663078b7ab029.png

在阿里雲上部署Kubeflow1.0

環境檢查

1.需要Kubernetes集群1.14及以上
2.有些用戶在安裝時,因為集群預先安裝過istio,所以導致安裝失敗,建議之前先清理掉istio的相關內容

Kubeflow1.0 服務部署

1.下載kfctl工具用於kubeflow部署

wget http://kubeflow.oss-cn-beijing.aliyuncs.com/kfctl_v1.0-0-g94c35cf_linux.tar.gz
tar zxvf kfctl_v1.0-0-g94c35cf_linux.tar.gz
mv kfctl /usr/bin/

2.初始化部署環境和下載部署所用manifests (此處為了適配阿里雲ACK的一鍵部署,類似國內鏡像拉取超時、服務訪問不通等問題已經處理,請使用阿里適配後的CONFIG文件)

export KF_NAME=my-kubeflow
export BASE_DIR=/root/
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="http://kubeflow.oss-cn-beijing.aliyuncs.com/kfctl_k8s_istio.v1.0.1.yaml"
mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl build -V -f ${CONFIG_URI}
export CONFIG=${KF_DIR}/kfctl_k8s_istio.v1.0.1.yaml

3.創建服務運行時需要的本地PV,用戶可以根據自己情況創建不同類型的PV,

cat << EOF > local_pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-mysql-pv
  namespace: kubeflow
  labels:
    type: local
    app: pipeline-mysql-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/pipeline-mysql
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pipeline-minio-pv
  namespace: kubeflow
  labels:
    type: local
    app: pipeline-minio-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/pipeline-minio
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  namespace: kubeflow
  labels:
    type: local
    app: katib-mysql
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib-mysql
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: metadata-mysql-pv
  namespace: kubeflow
  labels:
    type: local
    app: metadata-mysql-pv
    key: kubeflow-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/metadata-mysql
    type: DirectoryOrCreate
EOF
kubectl create -f local_pv.yaml

4.服務部署,時間稍長,請耐心等待....(預計10min以內)

kfctl apply -V -f ${CONFIG}

部署後環境檢查

1.istio-system下的pod是否啟動正常,執行kubectl get pods -n istio-system


0838e04374c8a0a32eb8e3b3747572d6.png

2.Kubeflow下的pod是否啟動正常(此處由於服務部署的先後順序,有些Pod會運行失敗,可以等待服務重啟),執行kubectl get pods -n kubeflow


a485b9f935918fe02bacb7f7615a3f28.png

3.是否能夠正常訪問Kubeflow頁面
3.1執行 kubectl get svc -n istio-system istio-ingressgateway,查詢istio對外暴露的IP
8d0e4d6a83a7b6b68f990be301de3925.png

3.2訪問對應的IP:80顯示Kubeflow的主頁面標識部署成功


87a420596f9001059c433a1a40022859.png

體驗Jupyter Hub

1.在主界面上選擇Notebook Servers


9af2e9e8f6c92202666bc960016b8e88.png

2.新建New Server


1b627c9aed7de2732e5158c1022f3654.png

如果有提示No default Storage Class is set. 用戶可以自行設置default Storage Class


8a8036b94778f34a0a9b57f784182048.png

例如在阿里雲上設定disk-ssd為default stroage class

kubectl patch storageclass alicloud-disk-ssd -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

3.編輯Server信息
用戶可以選擇Custom Image解決gcr鏡像拉取不下來的問題


db624cee3ec6349a19fdbe44181d828c.png

阿里雲提供如下的docker鏡像:

registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0

如果配置了Workspace Volume並且將alicloud-disk-ssd設為default stroage class,由於最低限制的原因,請將Size設置為20G


ecb74744d3586e809540a9f483ac678a.png

選擇設定需要的GPU資源


ecb74744d3586e809540a9f483ac678a.png

4.使用Jupyter
服務啟動成功後,狀態顯示為綠色


3af3372e2b1a3570df1ee71ba0d127a8.png

登錄Terminal,檢查GPU環境


dd37a16b3890810e1a1cbb1fc978e206.png


8799161b2208bd7ac227adae66a22377.png

使用Jupyter的開發環境


f819c5d958d3c00e41a39ff8f9284fa9.png

將示例代碼複製到文本框中

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

y = tf.nn.softmax(tf.matmul(x, W) + b)

y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print("Accuracy: ", sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

運行代碼得出訓練結果


a144aa1a62ed71aef25d9fcc3f176e86.png

展望

未來我們會帶來更多的Kubeflow的實踐文章,敬請期待~

參考文獻: 1 https://medium.com/kubeflow/kubeflow-1-0-cloud-native-ml-for-everyone-a3950202751

Leave a Reply

Your email address will not be published. Required fields are marked *