
Scheduler Metrics

YuniKorn uses Prometheus to record metrics. The metrics system tracks the scheduler's critical execution paths to reveal potential performance bottlenecks. Currently, these metrics fall into three categories:

  • scheduler: generic metrics of the scheduler, such as allocation latency and the number of applications.
  • queue: each queue has its own metrics subsystem, which tracks the queue's status.
  • event: records the various changes to events in YuniKorn.

All metrics are declared in the yunikorn namespace.
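As a rough illustration of what a namespace means in practice, Prometheus client libraries conventionally build a fully qualified metric name by joining the namespace, an optional subsystem, and the metric name with underscores. The sketch below mirrors that convention. Treating the category (scheduler, queue, event) as the subsystem, and the metric name used in the usage note, are assumptions made for illustration; the names actually exported by YuniKorn may differ from the names listed on this page.

```python
def build_fq_name(namespace: str, subsystem: str, name: str) -> str:
    """Join namespace, subsystem and name with underscores.

    Empty parts are skipped; an empty name yields an empty result,
    following the usual Prometheus client convention.
    """
    if not name:
        return ""
    return "_".join(part for part in (namespace, subsystem, name) if part)
```

For example, `build_fq_name("yunikorn", "scheduler", "example_latency_seconds")` yields `yunikorn_scheduler_example_latency_seconds`, where `example_latency_seconds` is a hypothetical metric name.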

Scheduler Metrics

| Metric Name | Metric Type | Description |
| --- | --- | --- |
| containerAllocation | Counter | Total number of attempts to allocate containers. The state of an attempt is one of allocated, rejected, error, or released. Only increases. |
| applicationSubmission | Counter | Total number of application submissions. The state of a submission is either accepted or rejected. Only increases. |
| applicationStatus | Gauge | Total number of applications by status. The status is either running or completed. |
| totalNodeActive | Gauge | Total number of active nodes. |
| totalNodeFailed | Gauge | Total number of failed nodes. |
| nodeResourceUsage | Gauge | Total resource usage of nodes, by resource name. |
| schedulingLatency | Histogram | Latency of the main scheduling routine, in seconds. |
| nodeSortingLatency | Histogram | Latency of sorting all nodes, in seconds. |
| appSortingLatency | Histogram | Latency of sorting all applications, in seconds. |
| queueSortingLatency | Histogram | Latency of sorting all queues, in seconds. |
| tryNodeLatency | Histogram | Latency of node condition checks for container allocations, such as placement constraints, in seconds. |
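The latency metrics above are histograms. Prometheus exposes a histogram as cumulative bucket counts keyed by an upper bound (the `le` label), together with a sum and a count. The following is a minimal sketch of how a quantile can be estimated from such buckets by linear interpolation, similar in spirit to the `histogram_quantile` function in PromQL. The bucket boundaries and counts in the usage note are invented for illustration and are not YuniKorn defaults.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: iterable of (upper_bound, cumulative_count) pairs, where the
    last bucket after sorting is expected to be (math.inf, total_count).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations recorded yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Cannot interpolate into +Inf or an empty bucket:
                # fall back to the previous finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With the hypothetical buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, calling `estimate_quantile(0.95, buckets)` interpolates inside the bucket between 0.5 and 1.0 seconds. The estimate is only as precise as the bucket boundaries allow.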

Queue Metrics

| Metric Name | Metric Type | Description |
| --- | --- | --- |
| appMetrics | Counter | Total number of applications. The state of an application is one of accepted, rejected, or completed. |
| usedResourceMetrics | Gauge | Queue used resource. |
| pendingResourceMetrics | Gauge | Queue pending resource. |
| availableResourceMetrics | Gauge | Queue available resource. |

Event Metrics

| Metric Name | Metric Type | Description |
| --- | --- | --- |
| totalEventsCreated | Gauge | Total events created. |
| totalEventsChanneled | Gauge | Total events channeled. |
| totalEventsNotChanneled | Gauge | Total events not channeled. |
| totalEventsProcessed | Gauge | Total events processed. |
| totalEventsStored | Gauge | Total events stored. |
| totalEventsNotStored | Gauge | Total events not stored. |
| totalEventsCollected | Gauge | Total events collected. |

Access Metrics

YuniKorn metrics are collected through the Prometheus client library and exposed via the scheduler's RESTful service. Once the scheduler has started, they can be accessed at the endpoint http://localhost:9080/ws/v1/metrics.
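To inspect the endpoint without a Prometheus server, the response can be fetched and read directly. The sketch below assumes the endpoint returns the standard Prometheus text exposition format, which is what Prometheus scrapes. The parser is deliberately simplified: it keeps the label set as a raw string and ignores optional timestamps. The function names are hypothetical helpers, not part of YuniKorn.

```python
from urllib.request import urlopen

# Endpoint taken from the text above; adjust host and port for your deployment.
METRICS_URL = "http://localhost:9080/ws/v1/metrics"


def parse_samples(text):
    """Parse Prometheus text format into (name, raw_labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            labels = line[line.index("{") + 1 : line.rindex("}")]
            rest = line[line.rindex("}") + 1 :].split()
        else:
            name, *rest = line.split()
            labels = ""
        # rest[0] is the sample value; an optional timestamp may follow it.
        samples.append((name, labels, float(rest[0])))
    return samples


def fetch_samples(url=METRICS_URL, timeout=5.0):
    """Fetch the metrics endpoint and return the parsed samples."""
    with urlopen(url, timeout=timeout) as response:
        return parse_samples(response.read().decode("utf-8"))
```

With a scheduler running locally, `fetch_samples()` returns one tuple per exposed sample, which can then be filtered by name prefix.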

Aggregate Metrics to Prometheus

It is simple to set up a Prometheus server to scrape YuniKorn metrics periodically. Follow these steps:

  • Set up Prometheus (see the Prometheus documentation for details).

  • Configure Prometheus. A sample configuration:

    global:
      scrape_interval: 3s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'yunikorn'
        scrape_interval: 1s
        metrics_path: '/ws/v1/metrics'
        static_configs:
          # The target is left empty here; set it to the host and port of the YuniKorn scheduler.
          - targets: ['']

  • Start Prometheus:

    docker pull prom/prometheus:latest
    docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

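If you are running Prometheus in a local Docker container on macOS, localhost inside the container does not refer to the host machine, so use an address that resolves to the host instead of localhost in the targets entry. Once Prometheus has started, open its web UI at http://localhost:9090/graph. You will see all available metrics from the YuniKorn scheduler.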
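Besides the web UI, the scraped metrics can be read programmatically through the Prometheus HTTP API, which serves instant queries at `/api/v1/query`. The following sketch builds such a query URL and unpacks an instant-vector response. It assumes Prometheus is reachable on port 9090 as in the docker command above; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumes the port mapping from the docker command above.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def parse_instant_vector(payload):
    """Turn a decoded API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each item carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def query(expr, base_url=PROMETHEUS_URL, timeout=5.0):
    """Run an instant query against Prometheus and parse the result."""
    with urlopen(build_query_url(expr, base_url), timeout=timeout) as response:
        return parse_instant_vector(json.load(response))
```

For instance, `query('{__name__=~"yunikorn_.+"}')` would list every series whose name starts with `yunikorn_`, assuming the exported names carry the namespace as a prefix.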