Version: 1.2.0

Gang Scheduling

What is Gang Scheduling

When Gang Scheduling is enabled, YuniKorn schedules the app only when the app’s minimal resource request can be satisfied. Otherwise, apps will be waiting in the queue. Apps are queued in hierarchy queues, with gang scheduling enabled, each resource queue is assigned with the maximum number of applications running concurrently with min resource guaranteed.

Gang Scheduling

Enable Gang Scheduling

There is no cluster-wide configuration needed to enable Gang Scheduling. The scheduler actively monitors the metadata of each app, if the app has included a valid taskGroups definition, it will be considered as gang scheduling desired.

Task Group

A task group is a “gang” of tasks in an app, these tasks are having the same resource profile and the same placement constraints. They are considered as homogeneous requests that can be treated as the same kind in the scheduler.

Prerequisite

For the queues which runs gang scheduling enabled applications, the queue sorting policy should be set to FIFO. To configure queue sorting policy, please refer to doc: app sorting policies.

Why the `FIFO` sorting policy

When Gang Scheduling is enabled, the scheduler proactively reserves resources for each application. If the queue sorting policy is not FIFO based (StateAware is FIFO based sorting policy), the scheduler might reserve partial resources for each app and causing resource segmentation issues.

Side effects of `StateAware` sorting policy

We do not recommend using StateAware, even-though it is a FIFO based policy. A failure of the first pod or a long initialisation period of that pod could slow down the processing. This is specifically an issue with Spark jobs when the driver performs a lot of pre-processing before requesting the executors. The StateAware timeout in those cases would slow down processing to just one application per timeout. This in effect will overrule the gang reservation and cause slowdowns and excessive resource usage.

App Configuration

On Kubernetes, YuniKorn discovers apps by loading metadata from individual pod, the first pod of the app is required to include a full copy of app metadata. If the app does not have any notion about the first or second pod, then all pods are required to carry the same taskGroups info. Gang scheduling requires taskGroups definition, which can be specified via pod annotations. The required fields are:

Annotation	Value
yunikorn.apache.org/task-group-name	Task group name, it must be unique within the application
yunikorn.apache.org/task-groups	A list of task groups, each item contains all the info defined for the certain task group
yunikorn.apache.org/schedulingPolicyParameters	Optional. A arbitrary key value pairs to define scheduling policy parameters. Please read schedulingPolicyParameters section

How many task groups needed?

This depends on how many different types of pods this app requests from K8s. A task group is a “gang” of tasks in an app, these tasks are having the same resource profile and the same placement constraints. They are considered as homogeneous requests that can be treated as the same kind in the scheduler. Use Spark as an example, each job will need to have 2 task groups, one for the driver pod and the other one for the executor pods.

How to define task groups?

The task group definition is a copy of the app’s real pod definition, values for fields like resources, node-selector, toleration and affinity should be the same as the real pods. This is to ensure the scheduler can reserve resources with the exact correct pod specification.

Scheduling Policy Parameters

Scheduling policy related configurable parameters. Apply the parameters in the following format in pod's annotation:

annotations:
   yunikorn.apache.org/schedulingPolicyParameters: "PARAM1=VALUE1 PARAM2=VALUE2 ..."

Currently, the following parameters are supported:

placeholderTimeoutInSeconds

Default value: 15 minutes. This parameter defines the reservation timeout for how long the scheduler should wait until giving up allocating all the placeholders. The timeout timer starts to tick when the scheduler allocates the first placeholder pod. This ensures if the scheduler could not schedule all the placeholder pods, it will eventually give up after a certain amount of time. So that the resources can be freed up and used by other apps. If non of the placeholders can be allocated, this timeout won't kick-in. To avoid the placeholder pods stuck forever, please refer to troubleshooting for solutions.

gangSchedulingStyle

Valid values: Soft, Hard

Default value: Soft. This parameter defines the fallback mechanism if the app encounters gang issues due to placeholder pod allocation. See more details in Gang Scheduling styles section

More scheduling parameters will added in order to provide more flexibility while scheduling apps.

Example

The following example is a yaml file for a job. This job launches 2 pods and each pod sleeps 30 seconds. The notable change in the pod spec is spec.template.metadata.annotations, where we defined yunikorn.apache.org/task-group-name and yunikorn.apache.org/task-groups.

apiVersion: batch/v1
kind: Job
metadata:
  name: gang-scheduling-job-example
spec:
  completions: 2
  parallelism: 2
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "gang-scheduling-job-example"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: task-group-example
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "task-group-example",
              "minMember": 2,
              "minResource": {
                "cpu": "100m",
                "memory": "50M"
              },
              "nodeSelector": {},
              "tolerations": [],
              "affinity": {}
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep30
          image: "alpine:latest"
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: "100m"
              memory: "50M"

When this job is submitted to Kubernetes, 2 pods will be created using the same template, and they all belong to one taskGroup: “task-group-example”. YuniKorn will create 2 placeholder pods, each uses the resources specified in the taskGroup definition. When all 2 placeholders are allocated, the scheduler will bind the the real 2 sleep pods using the spot reserved by the placeholders.

You can add more than one taskGroups if necessary, each taskGroup is identified by the taskGroup name, it is required to map each real pod with a pre-defined taskGroup by setting the taskGroup name. Note, the task group name is only required to be unique within an application.

Enable Gang scheduling for Spark jobs

Each Spark job runs 2 types of pods, driver and executor. Hence, we need to define 2 task groups for each job. The annotations for the driver pod looks like:

Annotations:
  yunikorn.apache.org/schedulingPolicyParameters: “placeholderTimeoutSeconds=30”
  yunikorn.apache.org/taskGroupName: “spark-driver”
  yunikorn.apache.org/taskGroup: “
    TaskGroups: [
     {
       Name: “spark-driver”,
       minMember: 1,
       minResource: {
         Cpu: 1,
         Memory: 2Gi
       },
       Node-selector: ...,
       Tolerations: ...,
       Affinity: ...
     },
      {
        Name: “spark-executor”,
        minMember: 10, 
        minResource: {
          Cpu: 1,
          Memory: 2Gi
        }
      }
  ]
  ”

note

The TaskGroup resources must account for the memory overhead for Spark drivers and executors. See the Spark documentation for details on how to calculate the values.

For all the executor pods,

Annotations:
  # the taskGroup name should match to the names
  # defined in the taskGroups field
  yunikorn.apache.org/taskGroupName: “spark-executor”

Once the job is submitted to the scheduler, the job won’t be scheduled immediately. Instead, the scheduler will ensure it gets its minimal resources before actually starting the driver/executors.

Gang scheduling Styles

There are 2 gang scheduling styles supported, Soft and Hard respectively. It can be configured per app-level to define how the app will behave in case the gang scheduling fails.

Hard style: when this style is used, we will have the initial behavior, more precisely if the application cannot be scheduled according to gang scheduling rules, and it times out, it will be marked as failed, without retrying to schedule it.
Soft style: when the app cannot be gang scheduled, it will fall back to the normal scheduling, and the non-gang scheduling strategy will be used to achieve the best-effort scheduling. When this happens, the app transits to the Resuming state and all the remaining placeholder pods will be cleaned up.

Default style used: Soft

Enable a specific style: the style can be changed by setting in the application definition the ‘gangSchedulingStyle’ parameter to Soft or Hard.

Example

apiVersion: batch/v1
kind: Job
metadata:
  name: gang-app-timeout
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        app: sleep
        applicationId: gang-app-timeout
        queue: fifo
      annotations:
        yunikorn.apache.org/task-group-name: sched-style
        yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "sched-style",
              "minMember": 4,
              "minResource": {
                "cpu": "1",
                "memory": "1000M"
              },
              "nodeSelector": {},
              "tolerations": [],
              "affinity": {}
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep30
          image: "alpine:latest"
          imagePullPolicy: "IfNotPresent"
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: "1"
              memory: "1000M"

Verify Configuration

To verify if the configuration has been done completely and correctly, check the following things:

When an app is submitted, verify the expected number of placeholders are created by the scheduler. If you define 2 task groups, 1 with minMember 1 and the other with minMember 5, that means we are expecting 6 placeholder gets created once the job is submitted.
Verify the placeholder spec is correct. Each placeholder needs to have the same info as the real pod in the same taskGroup. Check field including: namespace, pod resources, node-selector, toleration and affinity.
Verify the placeholders can be allocated on correct type of nodes, and verify the real pods are started by replacing the placeholder pods.

Troubleshooting

Please see the troubleshooting doc when gang scheduling is enabled here.

What is Gang Scheduling​

Enable Gang Scheduling​

Prerequisite​

Why the FIFO sorting policy​

Side effects of StateAware sorting policy​

App Configuration​

How many task groups needed?​

How to define task groups?​

Scheduling Policy Parameters​

Example​

Enable Gang scheduling for Spark jobs​

Gang scheduling Styles​

Example​

Verify Configuration​

Troubleshooting​

What is Gang Scheduling

Enable Gang Scheduling

Prerequisite

Why the `FIFO` sorting policy

Side effects of `StateAware` sorting policy

App Configuration

How many task groups needed?

How to define task groups?

Scheduling Policy Parameters

Example

Enable Gang scheduling for Spark jobs

Gang scheduling Styles

Example

Verify Configuration

Troubleshooting