Prometheus Monitoring for TF operator

Available Metrics

Currently available metrics to monitor are listed below.

Metrics for Each Component Container for TF operator

Component Containers:

tf-operator
tf-chief
tf-ps
tf-worker

Each Container Reports on its:

Use the Prometheus graph view to run the following example queries and visualize the metrics.

Note: These metrics come from the kubelet's cAdvisor integration, which reports to Prometheus through our prometheus-operator installation. A complete list of available metrics is shown on the /metrics page of your Prometheus web UI, and you can use any of them to compose your own queries.
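The example queries in this section can also be sent to the Prometheus HTTP API instead of the graph view. The sketch below only builds the request URL for an instant query. The base URL http://localhost:9090 and the helper name build_query_url are assumptions for illustration, so adjust them to match your installation.

```python
from urllib.parse import urlencode

# Assumed address of the Prometheus server; replace with your own.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    params = urlencode({"query": promql})  # percent-encodes the expression
    return base_url.rstrip("/") + "/api/v1/query?" + params


print(build_query_url("up"))
```

Fetching that URL with any HTTP client returns a JSON document containing the query result.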

CPU usage

sum (rate (container_cpu_usage_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)

GPU Memory Usage

sum (container_accelerator_memory_used_bytes{pod_name=~"tfjob-name-.*"}) by (pod_name)

Memory Usage

sum (container_memory_usage_bytes{pod_name=~"tfjob-name-.*"}) by (pod_name)

Network Usage

sum (rate (container_network_transmit_bytes_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)

I/O Usage

sum (rate (container_fs_write_seconds_total{pod_name=~"tfjob-name-.*"}[1m])) by (pod_name)

Keep-Alive check

up

Prometheus maintains the up metric itself for every target it scrapes; see the Prometheus documentation for details.
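As a rough illustration of how this check can be consumed from a script, the sketch below lists the targets whose up value is 0 in a decoded instant-query response from the Prometheus HTTP API. The sample payload, including its job and instance label values, is hypothetical and only mirrors the shape of such a response.

```python
def down_targets(response):
    """Return the label sets of targets whose `up` sample is 0."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value is a string
        if float(value) == 0:
            down.append(series["metric"])
    return down


# Hypothetical response for the query `up`; all label values are made up.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "tf-operator", "instance": "10.0.0.1:8443"},
             "value": [1700000000, "1"]},
            {"metric": {"job": "tf-operator", "instance": "10.0.0.2:8443"},
             "value": [1700000000, "0"]},
        ],
    },
}

print(down_targets(sample))
```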

Is Leader check

tf_operator_is_leader

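Note: In the example queries above, replace tfjob-name with the name of the TFJob you want to monitor. On newer Kubernetes versions the cAdvisor metrics carry the label pod instead of pod_name, so if a query returns no data, try pod.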
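To avoid editing each query by hand, the substitution can be scripted. This is a minimal sketch: the helper name queries_for_job and the example job name mnist are hypothetical, and the templates simply mirror the CPU and network queries above. It also assumes the job name contains only lowercase letters, digits, and hyphens.

```python
import re

# Templates mirroring the example queries; {job} stands for the TFJob name.
# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATES = {
    "cpu": 'sum (rate (container_cpu_usage_seconds_total'
           '{{pod_name=~"{job}-.*"}}[1m])) by (pod_name)',
    "network": 'sum (rate (container_network_transmit_bytes_total'
               '{{pod_name=~"{job}-.*"}}[1m])) by (pod_name)',
}


def queries_for_job(job):
    """Fill the TFJob name into each query template."""
    # Reject unexpected characters rather than emit a malformed label matcher.
    if not re.fullmatch(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?", job):
        raise ValueError(f"unexpected TFJob name: {job!r}")
    return {name: tpl.format(job=job) for name, tpl in QUERY_TEMPLATES.items()}


for name, query in queries_for_job("mnist").items():
    print(f"{name}: {query}")
```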

Report TFJob metrics:

Note: If you are using the v1 release of tf-operator, these TFJob metrics do not have the _total suffix, so you have to use a metric name such as tf_operator_jobs_created instead. See PR to get more information.
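When the same scripts or dashboards need to work against both releases, the naming difference can be handled in a small helper. In this sketch the function name tfjob_metric_name and the is_v1 flag are hypothetical; only the metric names themselves come from this document.

```python
# Event names used by the TFJob counters listed in this section.
TFJOB_EVENTS = ("created", "deleted", "successful", "failed", "restarted")


def tfjob_metric_name(event, is_v1=False):
    """Return the TFJob counter name, without the _total suffix on v1."""
    if event not in TFJOB_EVENTS:
        raise ValueError(f"unknown TFJob event: {event!r}")
    name = f"tf_operator_jobs_{event}"
    return name if is_v1 else f"{name}_total"


print(tfjob_metric_name("created"))
print(tfjob_metric_name("created", is_v1=True))
```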

Job Creation

tf_operator_jobs_created_total

Job Creation Rate

sum (rate (tf_operator_jobs_created_total[60m]))
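The rate query above returns the average per-second increase of the counter over the last 60 minutes, so multiplying the result by 3600 approximates jobs created per hour. The sketch below illustrates that arithmetic in simplified form. It is not Prometheus's implementation, which also extrapolates to the window boundaries, and the function name and sample values are hypothetical.

```python
def approx_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example after an operator
        # restart), so the new value is counted as an increase from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical samples: 12 jobs created over one hour.
window = [(0, 10), (1800, 16), (3600, 22)]
print(approx_rate(window) * 3600)  # jobs per hour
```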

Job Deletion

tf_operator_jobs_deleted_total

Successful Job Completions

tf_operator_jobs_successful_total

Failed Jobs

tf_operator_jobs_failed_total

Restarted Jobs

tf_operator_jobs_restarted_total
