Scheduler Metrics
YuniKorn leverages Prometheus to record metrics. The metrics system keeps tracking of scheduler's critical execution paths, to reveal potential performance bottlenecks. Currently, there are three categories for these metrics:
- scheduler: generic metrics of the scheduler, such as allocation latency, num of apps etc.
- queue: each queue has its own metrics sub-system, tracking queue status.
- event: record various changes of events in YuniKorn.
all metrics are declared in yunikorn
namespace.
Scheduler Metrics
Metrics Name | Metrics Type | Description |
---|---|---|
containerAllocation | Counter | Total number of attempts to allocate containers. State of the attempt includes allocated , rejected , error , released . Increase only. |
applicationSubmission | Counter | Total number of application submissions. State of the attempt includes accepted and rejected . Increase only. |
applicationStatus | Gauge | Total number of application status. State of the application includes running and completed . |
totalNodeActive | Gauge | Total number of active nodes. |
totalNodeFailed | Gauge | Total number of failed nodes. |
nodeResourceUsage | Gauge | Total resource usage of node, by resource name. |
schedulingLatency | Histogram | Latency of the main scheduling routine, in seconds. |
nodeSortingLatency | Histogram | Latency of all nodes sorting, in seconds. |
appSortingLatency | Histogram | Latency of all applications sorting, in seconds. |
queueSortingLatency | Histogram | Latency of all queues sorting, in seconds. |
tryNodeLatency | Histogram | Latency of node condition checks for container allocations, such as placement constraints, in seconds, in seconds. |
Queue Metrics
Metrics Name | Metrics Type | Description |
---|---|---|
appMetrics | Counter | Application Metrics, record the total number of applications. State of the application includes accepted ,rejected and Completed . |
usedResourceMetrics | Gauge | Queue used resource. |
pendingResourceMetrics | Gauge | Queue pending resource. |
availableResourceMetrics | Gauge | Used resource metrics related to queues etc. |
Event Metrics
Metrics Name | Metrics Type | Description |
---|---|---|
totalEventsCreated | Gauge | Total events created. |
totalEventsChanneled | Gauge | Total events channeled. |
totalEventsNotChanneled | Gauge | Total events not channeled. |
totalEventsProcessed | Gauge | Total events processed. |
totalEventsStored | Gauge | Total events stored. |
totalEventsNotStored | Gauge | Total events not stored. |
totalEventsCollected | Gauge | Total events collected. |
Access Metrics
YuniKorn metrics are collected through Prometheus client library, and exposed via scheduler restful service. Once started, they can be accessed via endpoint http://localhost:9080/ws/v1/metrics.
Aggregate Metrics to Prometheus
It's simple to setup a Prometheus server to grab YuniKorn metrics periodically. Follow these steps:
-
Setup Prometheus (read more from Prometheus docs)
-
Configure Prometheus rules: a sample configuration
global:
scrape_interval: 3s
evaluation_interval: 15s
scrape_configs:
- job_name: 'yunikorn'
scrape_interval: 1s
metrics_path: '/ws/v1/metrics'
static_configs:
- targets: ['docker.for.mac.host.internal:9080']
- start Prometheus
docker pull prom/prometheus:latest
docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
Use docker.for.mac.host.internal
instead of localhost
if you are running Prometheus in a local docker container
on Mac OS. Once started, open Prometheus web UI: http://localhost:9090/graph. You'll see all available metrics from
YuniKorn scheduler.