Run TensorFlow Jobs
This guide gives an overview of how to set up training-operator and how to run a TensorFlow job with the YuniKorn scheduler. The training-operator is a unified training operator maintained by Kubeflow. It supports not only TensorFlow but also PyTorch, XGBoost, etc.
Install training-operator
The following command installs training-operator into the kubeflow namespace by default. If you run into problems during installation, please refer to this doc for details.
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
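Before moving on, it can help to confirm that the installation succeeded. The helper below is a minimal sketch, not part of the official instructions: the function name `check_training_operator` is hypothetical, and it assumes the operator was installed into the default kubeflow namespace and registers the `tfjobs.kubeflow.org` CRD.

```shell
# Hypothetical helper: check that the training operator is installed.
# Assumes the default "kubeflow" namespace used by the install command.
check_training_operator() {
  operator_ns="${1:-kubeflow}"
  # The TFJob CRD must be registered before any TFJob can be created.
  kubectl get crd tfjobs.kubeflow.org || return 1
  # List the pods in the namespace; the operator pod should be Running.
  kubectl get pods -n "$operator_ns"
}
```

Call it as `check_training_operator`, or pass a different namespace as the first argument if you changed the install target.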
Prepare the docker image
Before you start running a TensorFlow job on Kubernetes, you'll need to build the docker image.
- Download files from deployments/examples/tfjob
- Build the Docker image with the following command
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
Run a TensorFlow job
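Here is a TFJob YAML for the MNIST example. Both replica types use the same applicationId and queue labels and set schedulerName to yunikorn, so YuniKorn schedules all pods as one application.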
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist-for-e2e-test
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: "tf_job_20200521_001"
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: "tf_job_20200521_001"
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
Create the TFJob
kubectl create -f deployments/examples/tfjob/tf-job-mnist.yaml
You can view the job info in the YuniKorn UI. If you do not know how to access the YuniKorn UI, please read the document here.
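The job can also be checked from the command line. The sketch below relies on the `applicationId` label from the TFJob manifest above to select the job's pods and prints which scheduler each pod was assigned to. The function name `show_tfjob_pods` is hypothetical; the default application ID and namespace are taken from the example manifest.

```shell
# Hypothetical helper: list the pods of one YuniKorn application and
# show the scheduler and phase of each pod.
show_tfjob_pods() {
  app_id="${1:-tf_job_20200521_001}"   # applicationId label from the manifest
  job_ns="${2:-kubeflow}"              # namespace from the manifest
  kubectl get pods -n "$job_ns" -l "applicationId=$app_id" \
    -o custom-columns=NAME:.metadata.name,SCHEDULER:.spec.schedulerName,PHASE:.status.phase
}
```

With the example manifest, which defines 2 PS and 4 Worker replicas, `show_tfjob_pods` should list six pods, each with `yunikorn` in the SCHEDULER column.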