Design of YuniKorn scheduler
Scheduler API Server:
Responsible for communication between the RM and the scheduler. It implements the scheduler-interface gRPC protocol, or plain API calls for intra-process communication without serialization/deserialization.
Caches all data related to scheduler state, such as the used resources of each queue, nodes, allocations, and the relationships between allocations and nodes. It should not include temporary data that only helps the scheduler make decisions, for example to-be-preempted allocation candidates or the fair-share resources of queues.
Scheduler Cache Event Handler:
Handles all events that need to update the scheduler's internal state, so that all write operations are carefully handled.
Handles requests from the admin. It can also load configurations from storage and update scheduler policies.
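One way to read "all write operations are carefully handled" is a single-writer event loop: every update to the cache is sent as an event and applied by one goroutine, so writes never race. The sketch below illustrates that idea only. The names `cacheEvent`, `schedulerCache`, `runEventLoop`, and `applyAll` are hypothetical and are not taken from the YuniKorn code base, and resources are reduced to a single integer for brevity.

```go
package main

import "fmt"

// cacheEvent is a hypothetical event that changes the used resources of a queue.
type cacheEvent struct {
	queue string // queue whose usage changes
	delta int64  // change in used resource units (positive means allocate)
}

// schedulerCache is a hypothetical, heavily simplified cache:
// it only tracks used resources per queue.
type schedulerCache struct {
	used map[string]int64
}

// runEventLoop is the only writer of the cache. It applies events one at a
// time and closes done once the events channel is closed and drained.
func runEventLoop(c *schedulerCache, events <-chan cacheEvent, done chan<- struct{}) {
	for ev := range events {
		c.used[ev.queue] += ev.delta
	}
	close(done)
}

// applyAll feeds the given events through the event loop and returns the
// cache after every event has been applied.
func applyAll(evs []cacheEvent) *schedulerCache {
	c := &schedulerCache{used: make(map[string]int64)}
	events := make(chan cacheEvent)
	done := make(chan struct{})
	go runEventLoop(c, events, done)
	for _, ev := range evs {
		events <- ev
	}
	close(events)
	<-done // wait until the loop has finished before reading the cache
	return c
}

func main() {
	c := applyAll([]cacheEvent{
		{queue: "root.a", delta: 4},
		{queue: "root.b", delta: 2},
		{queue: "root.a", delta: -1},
	})
	fmt.Println(c.used["root.a"], c.used["root.b"])
}
```

Funnelling writes through one channel trades some throughput for simplicity: readers still need their own synchronization if they run concurrently with the loop, which this sketch avoids by reading only after the loop has stopped.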
Scheduler and Preemptor
Handles the scheduler's internal state that does not belong to the scheduler cache, such as internal reservations. The scheduler and the preemptor work together to make scheduling or preemption decisions.
All allocate/preempt requests are handled by the event handler.
The scheduler has the following responsibilities:
- Sort queues and applications according to resource usage across queues, and determine the order in which applications receive allocations. (Preemption uses this order as well.)
- Some allocation requests may be impossible to satisfy; the scheduler needs to skip them and move on to the next request.
- Some allocation requests may be unsatisfiable because of resource fragmentation; the scheduler needs to reserve room for such requests.
- Different nodes may belong to different disjoint partitions, so independent scheduler runs can be made for each partition.
- Locality is still important for many scenarios, especially for on-prem cases.
- Ordering policies for apps and queues must be configurable and changeable.
- Applications can choose their own way to sort nodes.
- It is important to know who wants the resource, so that preemption can be based on allocation order.
- When doing preemption, it is also efficient to trigger an allocation operation. How to do this still needs to be worked out.
- Preemption needs to take queue resource balancing into account.
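The first two responsibilities above (ordering queues by usage, then skipping requests that cannot be satisfied) can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's actual policy: the ordering rule (lowest `used/guaranteed` ratio first), the single-integer resource model, and the names `queue`, `usageRatio`, `sortQueues`, and `pickNext` are all hypothetical. Since ordering policies are meant to be configurable, the comparison function is the part one would expect to swap out.

```go
package main

import (
	"fmt"
	"sort"
)

// queue is a hypothetical, simplified queue with a single resource dimension.
type queue struct {
	name       string
	used       int64   // resources currently in use
	guaranteed int64   // guaranteed share; assumed to be > 0 in this sketch
	pending    []int64 // sizes of pending requests, in submission order
}

// usageRatio returns used/guaranteed; a lower value means the queue is
// further below its share and should be served earlier.
func usageRatio(q queue) float64 {
	return float64(q.used) / float64(q.guaranteed)
}

// sortQueues orders queues in place so the lowest usage ratio comes first.
func sortQueues(qs []queue) {
	sort.SliceStable(qs, func(i, j int) bool {
		return usageRatio(qs[i]) < usageRatio(qs[j])
	})
}

// pickNext sorts the queues, then returns the first pending request that
// fits into the free resources. Requests that do not fit are skipped.
func pickNext(qs []queue, free int64) (string, int64, bool) {
	sortQueues(qs)
	for _, q := range qs {
		for _, req := range q.pending {
			if req <= free {
				return q.name, req, true
			}
			// Request is too large for the free resources: skip it.
		}
	}
	return "", 0, false
}

func main() {
	qs := []queue{
		{name: "a", used: 8, guaranteed: 10, pending: []int64{2}},
		{name: "b", used: 2, guaranteed: 10, pending: []int64{16, 3}},
	}
	name, size, ok := pickNext(qs, 4)
	fmt.Println(name, size, ok)
}
```

In this example queue `b` is served first because its usage ratio is lower; its 16-unit request is skipped because only 4 units are free, and its 3-unit request is picked instead. A fuller version would record the skipped request as a reservation candidate rather than dropping it silently.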
Configurations & Semantics
Example of configuration:
A partition is a namespace.
The same queues can exist under different partitions, but they are enforced to have the same hierarchy.
Good:

    partition=x    partition=y
         a              a
       /   \          /   \
      b     c        b     c

Good (c in partition y acl=""):

    partition=x    partition=y
         a              a
       /   \          /
      b     c        b

Bad (c in different hierarchy):

    partition=x    partition=y
         a              a
       /   \          /   \
      b     c        b     d
                          /
                         c

Bad (duplicated c):

    partition=x
         a
       /   \
      b     c
     /
    c
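The hierarchy rules illustrated above can be checked mechanically. The following is a minimal sketch, assuming each partition is given as a list of parent/child links between queue names: a queue may appear only once within a partition, and it must have the same parent in every partition where it appears. The `edge` type and `validateHierarchy` function are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// edge is a hypothetical parent/child link between two queue names.
type edge struct {
	parent, child string
}

// validateHierarchy returns an error if a queue is duplicated inside one
// partition, or if a queue has different parents in different partitions.
// If every queue has the same parent everywhere, its full path is the same too.
func validateHierarchy(partitions map[string][]edge) error {
	// Visit partitions in name order so the reported error is deterministic.
	names := make([]string, 0, len(partitions))
	for name := range partitions {
		names = append(names, name)
	}
	sort.Strings(names)

	parentOf := make(map[string]string) // queue -> parent, across all partitions
	for _, name := range names {
		seen := make(map[string]bool) // queues already defined in this partition
		for _, e := range partitions[name] {
			if seen[e.child] {
				return fmt.Errorf("partition %s: queue %q is duplicated", name, e.child)
			}
			seen[e.child] = true
			if prev, ok := parentOf[e.child]; ok && prev != e.parent {
				return fmt.Errorf("queue %q is under %q in partition %s but under %q elsewhere",
					e.child, e.parent, name, prev)
			}
			parentOf[e.child] = e.parent
		}
	}
	return nil
}

func main() {
	good := map[string][]edge{
		"x": {{"a", "b"}, {"a", "c"}},
		"y": {{"a", "b"}}, // c is simply absent from partition y
	}
	differentHierarchy := map[string][]edge{
		"x": {{"a", "b"}, {"a", "c"}},
		"y": {{"a", "b"}, {"a", "d"}, {"d", "c"}},
	}
	duplicated := map[string][]edge{
		"x": {{"a", "b"}, {"a", "c"}, {"b", "c"}},
	}
	fmt.Println(validateHierarchy(good))
	fmt.Println(validateHierarchy(differentHierarchy))
	fmt.Println(validateHierarchy(duplicated))
}
```

The three inputs in `main` correspond to the second "Good" case and the two "Bad" cases; only the first passes validation.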
Different hierarchies can be added:

    partitions:
      - name: default
        queues:
          root:
            configs:
              acls:
            childrens:
              - a
              - b
              - c
              - ...
          a:
            configs:
              acls:
              capacity: (capacity is not allowed to set for root)
              max-capacity: ...
        mapping-policies:
          ...
      - name: partition_a:
        queues:
          root: ...