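Monitor basic Linux system metrics: CPU, load, threads, open file handles, processes, disk I/O, disk usage, memory, swap, and network.
- Minimal dependencies: only Python's standard library is required; all metrics are read from /proc.
- Minimal privileges: runs as an unprivileged user; root is not required.
- No sampling: key metrics such as CPU, disk I/O, and network I/O are computed from cumulative counters that the kernel has maintained since boot. On the first run the check copies the current counter values to the interim directory. On the next run, the difference between the two readings is reported. Because the result is an average over the whole interval between runs, it is not skewed by short-lived spikes.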
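The "no sampling" approach above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `rate_since_last_run`, the JSON state file, and the `state_dir` argument (standing in for the interim directory) are all assumptions made for the example.

```python
import json
import os
import tempfile
import time


def rate_since_last_run(name, counters, state_dir, now=None):
    """Return per-second rates since the previous call, or None on the first run.

    `counters` is a dict of cumulative kernel counters (hypothetical input).
    The snapshot is stored as JSON under `state_dir`, an assumed stand-in
    for the interim directory.
    """
    now = time.time() if now is None else now
    path = os.path.join(state_dir, name + ".json")

    previous = None
    if os.path.exists(path):
        with open(path) as fh:
            previous = json.load(fh)

    # Always save the current snapshot so the next run has a baseline.
    with open(path, "w") as fh:
        json.dump({"time": now, "counters": counters}, fh)

    if previous is None:
        return None  # first run: nothing to compare against yet
    elapsed = now - previous["time"]
    if elapsed <= 0:
        return None
    old = previous["counters"]
    return {key: (counters[key] - old[key]) / elapsed
            for key in counters if key in old}
```

Calling it twice with the same `name` and `state_dir` mirrors the two-step behavior shown in the transcripts below: the first call returns `None`, the second returns averaged rates over the elapsed interval.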
- CPU
<script> cpu [warn%] [critical%]
[user@localhost ~]$ ./check_linux_metrics.py cpu
This was the first run, run again to get values
# First run; run the check again to get values.
[user@localhost ~]$ ./check_linux_metrics.py cpu
CPU Usage: 7.57% [t:60.04] | cpu=7.57% user=1.00% system=0.54% iowait=5.97% nice=0.04% irq=0.00% softirq=0.01% steal=0.00%
[user@localhost ~]$ ./check_linux_metrics.py cpu 80 99
CPU Usage: 9.17% [t:60.13] (OK) | cpu=9.17%;80;99 user=1.01%;80;99 system=0.55%;80;99 iowait=7.54%;80;99 nice=0.05%;80;99 irq=0.00%;80;99 softirq=0.02%;80;99 steal=0.00%;80;99
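# CPU Usage / cpu: overall CPU utilization
# t: measurement interval (time elapsed since the previous run)
# user: share of CPU time spent in user-space processes that are not niced
# system: share of CPU time spent in kernel space
# iowait: share of CPU time spent idle while waiting for I/O to complete
# nice: share of CPU time spent in niced user-space processes
# irq: share of CPU time spent servicing hardware interrupts
# softirq: share of CPU time spent servicing software interrupts
# steal: share of CPU time taken by the hypervisor to run other virtual machines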
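The percentages above can be derived from two readings of the aggregate `cpu` line in /proc/stat. The sketch below assumes total usage is everything except idle time, which agrees with the sample output where the components add up to the total; the function names are illustrative and not taken from the script.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Extract the aggregate 'cpu' counters (in jiffies) from /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Very old kernels report fewer columns; treat missing ones as 0.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Percent of elapsed CPU time spent in each state between two snapshots."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between snapshots")
    result = {k: 100.0 * deltas[k] / total for k in FIELDS if k != "idle"}
    # Assumption: overall usage is all non-idle time, including iowait.
    result["cpu"] = 100.0 * (total - deltas["idle"]) / total
    return result
```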
- Load
<script> load [warn(load1,load5,load15)] [critical(load1,load5,load15)]
[user@localhost ~]$ ./check_linux_metrics.py load
Load1: 0.34 Load5: 0.36 Load15: 0.36 | load1=0.34 load5=0.36 load15=0.36
[user@localhost ~]$ ./check_linux_metrics.py load 7,6,5 20,15,10
Load1: 0.34 Load5: 0.36 Load15: 0.36 (OK) (OK) (OK) | load1=0.34;7;20 load5=0.36;6;15 load15=0.36;5;10
[user@localhost ~]$ ./check_linux_metrics.py load 7 20
Load1: 0.34 Load5: 0.36 Load15: 0.36 (OK) | load1=0.34;7;20 load5=0.36 load15=0.36
[user@localhost ~]$ ./check_linux_metrics.py load ,6, ,15,
Load1: 0.34 Load5: 0.36 Load15: 0.36 (OK) | load1=0.34;; load5=0.36;6;15 load15=0.36;;
[user@localhost ~]$ ./check_linux_metrics.py load ,,5 ,,10
Load1: 0.34 Load5: 0.36 Load15: 0.36 (OK) | load1=0.34;; load5=0.36;; load15=0.36;5;10
# load1: 1-minute load average
# load5: 5-minute load average
# load15: 15-minute load average
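The comma-separated threshold forms shown above (`7,6,5`, `7`, `,6,`, `,,5`) can be parsed positionally, with empty slots meaning "no threshold". The following sketch is illustrative only: the helper names, the `>=` comparison, and the "Warning" label are assumptions, since the transcripts show only the (OK) and (Critical) states.

```python
def parse_thresholds(spec, count=3):
    """Split '7,6,5', '7' or ',6,' into `count` floats, using None for gaps."""
    parts = spec.split(",") if spec else []
    parts += [""] * (count - len(parts))
    return [float(p) if p.strip() else None for p in parts[:count]]


def parse_loadavg(text):
    """Return (load1, load5, load15) from /proc/loadavg text."""
    load1, load5, load15 = (float(x) for x in text.split()[:3])
    return load1, load5, load15


def check(value, warn, crit):
    """Classify a value; the >= comparison and 'Warning' label are assumed."""
    if crit is not None and value >= crit:
        return "Critical"
    if warn is not None and value >= warn:
        return "Warning"
    return "OK"
```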
- Threads
<script> threads [warn#] [critical#]
[user@localhost ~]$ ./check_linux_metrics.py threads
Threads: 1/207 | running=1.00 total=207.00
[user@localhost ~]$ ./check_linux_metrics.py threads 10 50
Threads: 1/207 (OK) | running=1.00;10;50 total=207.00
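# threads: thread counts
# running: number of tasks on the run queue at the time of the check
# total: number of tasks that exist on the system at the time of the check (tasks that have already finished are not counted)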
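The `running/total` pair has the same shape as the fourth field of /proc/loadavg, so one plausible way to obtain it is the sketch below. That the script reads this particular field is an assumption; `parse_threads` is a name invented for the example.

```python
def parse_threads(loadavg_text):
    """Return (running, total) from the 'running/total' field of /proc/loadavg."""
    running, total = loadavg_text.split()[3].split("/")
    return int(running), int(total)
```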
- Open Files
<script> files [warn#] [critical#]
[user@localhost ~]$ ./check_linux_metrics.py files
Open Files: 1344 (free: 0) | open=1344.00 free=0.00
[user@localhost ~]$ ./check_linux_metrics.py files 5000 50000
Open Files: 1344 (free: 0) (OK) | open=1344.00;5000;50000;0;1202794 free=0.00
# Open Files: number of allocated file handles
# free: number of allocated but unused file handles
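These values correspond to the three numbers in /proc/sys/fs/file-nr: allocated handles, allocated but unused handles, and the system-wide maximum (which would explain the `1202794` upper bound in the sample perfdata). A small parsing sketch, with a hypothetical function name:

```python
def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr text: allocated, unused, and maximum handles."""
    allocated, unused, maximum = (int(x) for x in text.split()[:3])
    return {"open": allocated, "free": unused, "max": maximum}
```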
- Processes
<script> procs [warn#(total,running,waiting)] [critical#(total,running,waiting)]
[user@localhost ~]$ ./check_linux_metrics.py procs
This was the first run, run again to get values
# First run; run the check again to get values.
[user@localhost ~]$ ./check_linux_metrics.py procs
Total:149 Running:1 Sleeping:148 Waiting:0 Zombie:0 Others:0 New_Forks:4.55/s | total=149.00 forks=4.55 sleeping=148.00 running=1.00 waiting=0.00 zombie=0.00 others=0.00
[user@localhost ~]$ ./check_linux_metrics.py procs 500,16,8 1500,32,16
Total:149 Running:1 Sleeping:148 Waiting:0 Zombie:0 Others:0 New_Forks:4.78/s (OK) (OK) (OK) | total=149.00;500;1500 forks=4.78 sleeping=148.00 running=1.00;16;32 waiting=0.00;8;16 zombie=0.00 others=0.00
[user@localhost ~]$ ./check_linux_metrics.py procs 500 1500
Total:149 Running:1 Sleeping:148 Waiting:0 Zombie:0 Others:0 New_Forks:4.40/s (OK) | total=149.00;500;1500 forks=4.40 sleeping=148.00 running=1.00 waiting=0.00 zombie=0.00 others=0.00
[user@localhost ~]$ ./check_linux_metrics.py procs ,,8 ,,16
Total:149 Running:1 Sleeping:147 Waiting:1 Zombie:0 Others:0 New_Forks:4.73/s (OK) | total=149.00;; forks=4.73 sleeping=147.00 running=1.00;; waiting=1.00;8;16 zombie=0.00 others=0.00
[user@localhost ~]$ ./check_linux_metrics.py procs ,16, ,32,
Total:149 Running:1 Sleeping:148 Waiting:0 Zombie:0 Others:0 New_Forks:4.52/s (OK) | total=149.00;; forks=4.52 sleeping=148.00 running=1.00;16;32 waiting=0.00;; zombie=0.00 others=0.00
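# Total: all processes
# Running: running processes
# Sleeping: sleeping processes
# Waiting: waiting processes
# Zombie: zombie processes
# Others: processes in any other state
# New_Forks: rate at which new processes are created (per second)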
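One way to produce counts like these is to read the state letter from each /proc/<pid>/stat file and the cumulative `processes` counter from /proc/stat (the fork rate then needs two readings, which is why the first run prints no values). The state mapping below, in particular treating `D` (uninterruptible sleep) as "waiting", is an assumption and may differ from what the script does.

```python
from collections import Counter

# Assumed mapping from kernel state letters to the categories in the output.
STATE_NAMES = {"R": "running", "S": "sleeping", "D": "waiting", "Z": "zombie"}
CATEGORIES = ("running", "sleeping", "waiting", "zombie", "others")


def state_from_stat(stat_text):
    """State letter from /proc/<pid>/stat text.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_text[stat_text.rindex(")") + 1:].split()[0]


def summarize(states):
    """Count a list of state letters per category, plus the total."""
    counts = Counter(STATE_NAMES.get(s, "others") for s in states)
    summary = {name: counts.get(name, 0) for name in CATEGORIES}
    summary["total"] = len(states)
    return summary


def forks_total(proc_stat_text):
    """Cumulative number of forks since boot, from /proc/stat text."""
    for line in proc_stat_text.splitlines():
        if line.startswith("processes "):
            return int(line.split()[1])
    raise ValueError("no 'processes' line found")
```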
- Disk IO
<script> diskio block_device [warn(read,write)] [critical(read,write)]
note: unit is sectors/sec
[user@localhost ~]$ ./check_linux_metrics.py diskio /dev/cciss/c0d0
This was the first run, run again to get values: diskio(cciss/c0d0)
[user@localhost ~]$ ./check_linux_metrics.py diskio /dev/cciss/c0d0
/dev/cciss/c0d0(cciss/c0d0) Read: 0.00 sec/s (0.00 t/s) Write: 785.82 sec/s (63.47 t/s) [t:60.04] | read_operations=0.00 read_sectors=0.00 read_time=0.00 write_operations=63.47 write_sectors=785.82 write_time=18868.11
[user@localhost ~]$ ./check_linux_metrics.py diskio /dev/cciss/c0d0 50,100 200,250
/dev/cciss/c0d0(cciss/c0d0) Read: 0.00 sec/s (0.00 t/s) Write: 765.68 sec/s (55.47 t/s) [t:60.05] (Critical) | read_operations=0.00 read_sectors=0.00;50;200 read_time=0.00 write_operations=55.47 write_sectors=765.68;100;250 write_time=15716.77
[user@localhost ~]$ ./check_linux_metrics.py diskio /dev/mapper/VolGroup-lv_root
This was the first run, run again to get values: diskio(dm-0)
[user@localhost ~]$ ./check_linux_metrics.py diskio /dev/mapper/VolGroup-lv_root
/dev/mapper/VolGroup-lv_root(dm-0) Read: 0.00 sec/s (0.00 t/s) Write: 1016.04 sec/s (127.01 t/s) [t:60.04] | read_operations=0.00 read_sectors=0.00 read_time=0.00 write_operations=127.01 write_sectors=1016.04 write_time=31707.88
[user@localhost ~]$ ./check_linux_metrics.py diskio /dev/VolGroup/lv_root
/dev/VolGroup/lv_root(dm-0) Read: 0.00 sec/s (0.00 t/s) Write: 1074.80 sec/s (134.35 t/s) [t:60.04] | read_operations=0.00 read_sectors=0.00 read_time=0.00 write_operations=134.35 write_sectors=1074.80 write_time=34072.15
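# read_operations: read operations per second
# read_sectors: sectors read per second
# read_time: time spent reading
# write_operations: write operations per second
# write_sectors: sectors written per second
# write_time: time spent writing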
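The six values above map naturally onto columns of /proc/diskstats. The sketch below parses one device line and turns two snapshots into per-second rates; the function names are illustrative, and resolving a path such as /dev/mapper/VolGroup-lv_root to its kernel name (`dm-0`) is left out.

```python
def parse_diskstats(text, device):
    """Pick the cumulative counters for `device` out of /proc/diskstats text."""
    for line in text.splitlines():
        f = line.split()
        # Columns: major minor name reads merged sectors ms writes merged sectors ms ...
        if len(f) >= 11 and f[2] == device:
            return {
                "read_operations": int(f[3]),
                "read_sectors": int(f[5]),
                "read_time": int(f[6]),
                "write_operations": int(f[7]),
                "write_sectors": int(f[9]),
                "write_time": int(f[10]),
            }
    raise KeyError(device)


def per_second(before, after, elapsed):
    """Average rate of each counter over `elapsed` seconds."""
    return {k: (after[k] - before[k]) / float(elapsed) for k in after}
```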
- Disk Usage
<script> disku mount_point [warn%] [critical%]
[user@localhost ~]$ ./check_linux_metrics.py disku /
/ Used: 76.45 GB / 196.74 GB (38.86%) | used=38.86%
[user@localhost ~]$ ./check_linux_metrics.py disku / 75 90
/ Used: 76.45 GB / 196.74 GB (38.86%) (OK) | used=38.86%;75;90
[user@localhost ~]$ ./check_linux_metrics.py disku /boot 75 90
/boot Used: 0.10 GB / 0.47 GB (21.32%) (OK) | used=21.32%;75;90
[user@localhost ~]$ ./check_linux_metrics.py disku /var 75 90
Plugin Error: Mount point not valid: (/var)
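The "Mount point not valid" error suggests the argument is checked against the list of mounted file systems. A rough sketch of that idea follows; validating against /proc/mounts-style text and measuring with `os.statvfs` are both assumptions about how such a check could work, not a description of the script.

```python
import os


def mount_points(mounts_text):
    """Mount points (second column) listed in /proc/mounts-style text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def disk_usage(path, mounts_text):
    """Return (used_bytes, total_bytes, used_percent) for a mount point."""
    if path not in mount_points(mounts_text):
        raise ValueError("Mount point not valid: (%s)" % path)
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    percent = (100.0 * used / total) if total else 0.0
    return used, total, percent
```

On a Linux host this could be called as `disk_usage("/", open("/proc/mounts").read())`.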
- Memory
<script> memory [warn%] [critical%]
note: used memory is calculated as: total - free - cached
[user@localhost ~]$ ./check_linux_metrics.py memory
Memory Used: 786.41MB / 11845.97MB (6.64%) | used=786.41;;;0;11845 cached=10911.13 active=7144.62
[user@localhost ~]$ ./check_linux_metrics.py memory 75 90
Memory Used: 786.90MB / 11845.97MB (6.64%) (OK) | used=786.90;8884;10661;0;11845 cached=10911.13 active=7144.82
# Memory Used: memory in use
# cached: memory used as cache
# active: memory that has been used recently
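The note above (used = total - free - cached) can be applied to /proc/meminfo as in this sketch. The choice of the `MemTotal`, `MemFree`, and `Cached` keys is an assumption about which fields are meant; the function names are illustrative.

```python
def parse_meminfo(text):
    """Map /proc/meminfo keys to their numeric values (kB for sized fields)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def memory_used_mb(info):
    """Return (used_mb, total_mb) with used = total - free - cached."""
    used_kb = info["MemTotal"] - info["MemFree"] - info["Cached"]
    return used_kb / 1024.0, info["MemTotal"] / 1024.0
```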
- Swap
<script> swap [warn%] [critical%]
note: used swap is calculated as: total - free - cached
[user@localhost ~]$ ./check_linux_metrics.py swap
Swap Used: 0.11MB / 5992.00MB (0.00%) | used=0.11;;;0;5991 cached=0.18
[user@localhost ~]$ ./check_linux_metrics.py swap 75 90
Swap Used: 0.11MB / 5992.00MB (0.00%) (OK) | used=0.11;4493;5392;0;5991 cached=0.18
# Swap Used: swap in use
# cached: cached swap
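In the memory and swap samples, the percentage thresholds appear in the perfdata as absolute values in MB, followed by a minimum of 0 and the total as the maximum (`value;warn;crit;min;max`). The sketch below reproduces that shape; truncating to whole numbers is inferred from the sample output, and `perfdata` is a hypothetical helper name.

```python
def perfdata(label, value, warn_pct=None, crit_pct=None, total=None):
    """Format one 'label=value;warn;crit;min;max' perfdata field.

    Percentage thresholds are converted to absolute values of `total`
    and truncated to integers (an inference from the sample output).
    """
    def absolute(pct):
        if pct is None or total is None:
            return ""
        return str(int(total * pct / 100.0))

    fields = ["%.2f" % value, absolute(warn_pct), absolute(crit_pct)]
    if total is not None:
        fields += ["0", str(int(total))]
    return "%s=%s" % (label, ";".join(fields))
```

With the memory figures from the transcript, `perfdata("used", 786.90, 75, 90, 11845.97)` yields `used=786.90;8884;10661;0;11845`.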
- Network
<script> network device [warn(rx,tx)] [critical(rx,tx)]
note: unit is MB/s
[user@localhost ~]$ ./check_linux_metrics.py network eth0
This was the first run, run again to get values: net:eth0
[user@localhost ~]$ ./check_linux_metrics.py network eth0
eth0 Rx: 0.01 MB/s (16.74 p/s) Tx: 0.00 MB/s (11.16 p/s) [t:60.04] | RX_MBps=0.01 RX_PKps=16.74 TX_MBps=0.00 TX_PKps=11.16 PK_ERRORS=0.00
[user@localhost ~]$ ./check_linux_metrics.py network eth0 30,50 60,80
eth0 Rx: 0.01 MB/s (17.80 p/s) Tx: 0.00 MB/s (11.62 p/s) [t:60.05] (OK) | RX_MBps=0.01;30;60 RX_PKps=17.80 TX_MBps=0.00;50;80 TX_PKps=11.62 PK_ERRORS=0.00
# RX_MBps: receive throughput (MB/s)
# RX_PKps: packets received per second
# TX_MBps: transmit throughput (MB/s)
# TX_PKps: packets sent per second
# PK_ERRORS: packet errors
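Figures like these can be computed from two readings of /proc/net/dev. In the sketch below, the function names are illustrative, 1 MB is assumed to be 1024 * 1024 bytes, and PK_ERRORS is assumed to be the combined receive and transmit error rate; the script may define either differently.

```python
def parse_net_dev(text, device):
    """Pick the cumulative counters for `device` out of /proc/net/dev text."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == device:
            f = [int(x) for x in rest.split()]
            # Receive columns come first (bytes, packets, errs, ...),
            # transmit columns start at index 8.
            return {
                "rx_bytes": f[0], "rx_packets": f[1],
                "tx_bytes": f[8], "tx_packets": f[9],
                "errors": f[2] + f[10],
            }
    raise KeyError(device)


def network_rates(before, after, elapsed):
    """Per-second rates between two snapshots taken `elapsed` seconds apart."""
    mb = 1024.0 * 1024.0  # assumption: binary megabytes

    def delta(key):
        return after[key] - before[key]

    return {
        "RX_MBps": delta("rx_bytes") / mb / elapsed,
        "RX_PKps": delta("rx_packets") / float(elapsed),
        "TX_MBps": delta("tx_bytes") / mb / elapsed,
        "TX_PKps": delta("tx_packets") / float(elapsed),
        "PK_ERRORS": delta("errors") / float(elapsed),
    }
```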