Run a TensorFlow Job
This section outlines how to set up the training-operator and how to run a TensorFlow job with the YuniKorn scheduler. The training-operator is an all-in-one integrated training operator maintained by Kubeflow. It supports not only TensorFlow but also PyTorch, XGBoost, and more.
Install the training-operator
You can use the following command to install the training operator into the kubeflow namespace by default. If you have trouble with the installation, please refer to this document for details.
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
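As a quick sanity check (assuming the default kubeflow namespace from the command above), you can confirm that the operator pod is up and the TFJob CRD has been registered:

```shell
# Check that the training-operator pod is running in the kubeflow namespace
kubectl get pods -n kubeflow

# Confirm the TFJob custom resource definition has been registered
kubectl get crd tfjobs.kubeflow.org
```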
Prepare the docker image
Before you start running a TensorFlow job on Kubernetes, you need to build the docker image.
- Download the files from deployment/examples/tfjob
- Build the docker image with the following command
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
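If your cluster pulls images from a registry rather than from a local docker daemon, you may also need to tag and push the image first. The registry name below is a placeholder, not part of the example:

```shell
# <your-registry> is a placeholder; replace it with your own registry
docker tag kubeflow/tf-dist-mnist-test:1.0 <your-registry>/tf-dist-mnist-test:1.0
docker push <your-registry>/tf-dist-mnist-test:1.0
```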
Run a TensorFlow job
Here is a TFJob YAML using the MNIST example.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist-for-e2e-test
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: "tf_job_20200521_001"
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: "tf_job_20200521_001"
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
Create the TFJob
kubectl create -f deployments/examples/tfjob/tf-job-mnist.yaml
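To confirm that the job and its pods were created and picked up by YuniKorn, a sketch like the following can help:

```shell
# Check the TFJob status
kubectl get tfjob dist-mnist-for-e2e-test -n kubeflow

# The PS and Worker pods should be Pending or Running, and their spec
# should show schedulerName: yunikorn
kubectl get pods -n kubeflow | grep dist-mnist-for-e2e-test
```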
You can view the job information in the YuniKorn UI. If you do not know how to access the YuniKorn UI, please read this document.
Using GPU time-slicing
Prerequisites
To use time-slicing GPUs, you need to set up the cluster so that GPUs and time-slicing GPUs can be used.
- GPUs must be attached to the nodes
- Kubernetes version 1.24
- GPU drivers must be installed in the cluster
- Use the GPU Operator to automate the deployment and management of the NVIDIA software components on the nodes
- Configure Time-Slicing GPUs in Kubernetes
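As a sketch of the last step, time-slicing is typically enabled by giving the GPU Operator's device plugin a sharing configuration. The config name `time-slicing-config`, the key `any`, and the replica count of 16 below are illustrative values, not requirements:

```yaml
# Illustrative time-slicing config for the NVIDIA device plugin
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 16
```

After creating the ConfigMap, point the GPU Operator's ClusterPolicy at it; see the Time-Slicing GPUs in Kubernetes documentation for the exact steps.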
After the GPU Operator and time-slicing GPUs are installed, check the pod status to make sure all the containers are running or completed:
kubectl get pod -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-fd5x4 2/2 Running 0 5d2h
gpu-operator-569d9c8cb-kbn7s 1/1 Running 14 (39h ago) 5d2h
gpu-operator-node-feature-discovery-master-84c7c7c6cf-f4sxz 1/1 Running 0 5d2h
gpu-operator-node-feature-discovery-worker-p5plv 1/1 Running 8 (39h ago) 5d2h
nvidia-container-toolkit-daemonset-zq766 1/1 Running 0 5d2h
nvidia-cuda-validator-5tldf 0/1 Completed 0 5d2h
nvidia-dcgm-exporter-95vm8 1/1 Running 0 5d2h
nvidia-device-plugin-daemonset-7nzvf 2/2 Running 0 5d2h
nvidia-device-plugin-validator-gj7nn 0/1 Completed 0 5d2h
nvidia-operator-validator-nz84d 1/1 Running 0 5d2h
Verify that the time-slicing configuration is applied successfully:
kubectl describe node
Capacity:
nvidia.com/gpu: 16
...
Allocatable:
nvidia.com/gpu: 16
...
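If the node reports more allocatable GPUs than are physically attached (for example, 16 slices backed by fewer physical GPUs), time-slicing is in effect. A jsonpath query can pull out just that field:

```shell
# Print the allocatable nvidia.com/gpu count for every node
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```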
Testing a TensorFlow job with GPUs
This section tests and verifies that a TFJob runs with time-slicing GPU support.
Create a workload test file tf-gpu.yaml:
vim tf-gpu.yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            applicationId: "tf_job_20200521_001"
        spec:
          schedulerName: yunikorn
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
              image: docker.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
      restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            applicationId: "tf_job_20200521_001"
        spec:
          schedulerName: yunikorn
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=gpu
                - --data_format=NHWC
              image: docker.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  nvidia.com/gpu: 2
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
      restartPolicy: OnFailure
Create the TFJob
kubectl apply -f tf-gpu.yaml
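To watch the job come up, you can filter the pod list by the job name (the pattern below assumes the usual `<job-name>-<replica-type>-<index>` pod naming):

```shell
# Watch the PS and Worker pods for the tf-smoke-gpu job
kubectl get pods -n kubeflow -w | grep tf-smoke-gpu
```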
Verify that the TFJob is running in YuniKorn
Check the logs of the pod:
kubectl logs po/tf-smoke-gpu-worker-0 -n kubeflow
.......
..Found device 0 with properties:
..name: NVIDIA GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
.......
..Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
.......