Skip to content

A monitoring plugin for icinga/nagios/nsca, that reports basic system metrics of a linux host: cpu, load, threads, openfiles, procs, diskio, disku, memory, swap, network

Notifications You must be signed in to change notification settings

samgogo/check_linux_metrics

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 

Repository files navigation

check_linux_metrics

监控 linux 系统基本指标:cpu,负载,线程,句柄,procs,磁盘 io,磁盘使用率,内存,交换内存,网络。

Key features

主要特点

  • 最小依赖: 只需要系统默认的基础 python 库,所有指标来自 /proc。
  • 最小权限: 可运行在非特权用户下。不需要 root 权限
  • 没有采样: 重要指标如 CPU,磁盘IO,网络IO 等的收集方式是通过对基于内核提供的累计值进行计算。这些累计值由内核启动时就开始采集。当第一次运行 checks 程序,数值会被复制到 interim 目录下。下一次再运行时,两次的差值将被作为结果输出。这样确保不会收系统峰值的影响。

Usage Examples

  • CPU

<script> cpu [warn%] [critical%]

    [user@localhost ~]$ ./check_linux_metrics.py cpu
    This was the first run, run again to get values
    # 第一次运行,再运行一次将得到数值。
    
    [user@localhost ~]$ ./check_linux_metrics.py cpu
    CPU Usage: 7.57% [t:60.04] | cpu=7.57% user=1.00% system=0.54% iowait=5.97% nice=0.04% irq=0.00% softirq=0.01% steal=0.00%

    [user@localhost ~]$ ./check_linux_metrics.py cpu 80 99
    CPU Usage: 9.17% [t:60.13] (OK) | cpu=9.17%;80;99 user=1.01%;80;99 system=0.55%;80;99 iowait=7.54%;80;99 nice=0.05%;80;99 irq=0.00%;80;99 softirq=0.02%;80;99 steal=0.00%;80;99
    
    # CPU Usage / cpu:cpu 使用率
    # t:取样用时
    # user: 表示用户空间程序的cpu使用率(没有通过nice调度)
    # system: 表示系统空间的cpu使用率,主要是内核程序。
    # iowait: cpu运行时在等待io的时间
    # nice: 表示用户空间且通过nice调度过的程序的cpu使用率。
    # irq: cpu处理硬中断的数量
    # softirq: cpu处理软中断的数量
    # steal: 被虚拟机偷走的cpu
  • Load

<script> load [warn(load1,load5,load15)] [critical(load1,load5,load15)]

    [user@localhost ~]$ ./check_linux_metrics.py load
    Load1: 0.34 Load5: 0.36 Load15: 0.36 | load1=0.34 load5=0.36 load15=0.36

    [user@localhost ~]$ ./check_linux_metrics.py load 7,6,5 20,15,10
    Load1: 0.34 Load5: 0.36 Load15: 0.36 (OK) (OK) (OK) | load1=0.34;7;20 load5=0.36;6;15 load15=0.36;5;10

    [user@localhost ~]$ ./check_linux_metrics.py load 7 20
    Load1: 0.34 Load5: 0.36 Load15: 0.36 (OK) | load1=0.34;7;20 load5=0.36 load15=0.36

    [user@localhost ~]$ ./check_linux_metrics.py load ,6, ,15,
    Load1: 0.34 Load5: 0.36 Load15: 0.36 (OK) | load1=0.34;; load5=0.36;6;15 load15=0.36;;

    [user@localhost ~]$ ./check_linux_metrics.py load ,,5 ,,10
    Load1: 0.34 Load5: 0.36 Load15: 0.36 (OK) | load1=0.34;; load5=0.36;; load15=0.36;5;10
    
    # load1: 1分钟平均负载
    # load5: 5分钟平均负载
    # load15: 15分钟平均负载
  • Threads

<script> threads [warn#] [critical#]

    [user@localhost ~]$ ./check_linux_metrics.py threads
    Threads: 1/207  | running=1.00 total=207.00

    [user@localhost ~]$ ./check_linux_metrics.py threads 10 50
    Threads: 1/207  (OK) | running=1.00;10;50 total=207.00
    
    # threads:线程数
    # running:在采样时刻,运行队列的任务的数目
    # total:在采样时刻,系统中活跃的任务的个数(不包括运行已经结束的任务)
  • Open Files

<script> files [warn#] [critical#]

    [user@localhost ~]$ ./check_linux_metrics.py files
    Open Files: 1344 (free: 0) | open=1344.00 free=0.00

    [user@localhost ~]$ ./check_linux_metrics.py files 5000 50000
    Open Files: 1344 (free: 0) (OK) | open=1344.00;5000;50000;0;1202794 free=0.00
    
    # Open Files:已分配文件句柄的数目
    # free:已使用文件句柄的数目
  • Processes

<script> procs [warn#(total,running,waiting)] [critical#(total,running,waiting)]

    [user@localhost ~]$ ./check_linux_metrics.py procs
    This was the first run, run again to get values
    # 第一次运行,再运行一次将得到数值。

    [user@localhost ~]$ ./check_linux_metrics.py procs
    Total:149 Running:1 Sleeping:148 Waiting:0 Zombie:0 Others:0 New_Forks:4.55/s | total=149.00 forks=4.55 sleeping=148.00 running=1.00 waiting=0.00 zombie=0.00 others=0.00

    [user@localhost ~]$ ./check_linux_metrics.py procs 500,16,8 1500,32,16
    Total:149 Running:1 Sleeping:148 Waiting:0 Zombie:0 Others:0 New_Forks:4.78/s (OK) (OK) (OK) | total=149.00;500;1500 forks=4.78 sleeping=148.00 running=1.00;16;32 waiting=0.00;8;16 zombie=0.00 others=0.00

    [user@localhost ~]$ ./check_linux_metrics.py procs 500 1500
    Total:149 Running:1 Sleeping:148 Waiting:0 Zombie:0 Others:0 New_Forks:4.40/s (OK) | total=149.00;500;1500 forks=4.40 sleeping=148.00 running=1.00 waiting=0.00 zombie=0.00 others=0.00

    [user@localhost ~]$ ./check_linux_metrics.py procs ,,8 ,,16
    Total:149 Running:1 Sleeping:147 Waiting:1 Zombie:0 Others:0 New_Forks:4.73/s (OK) | total=149.00;; forks=4.73 sleeping=147.00 running=1.00;; waiting=1.00;8;16 zombie=0.00 others=0.00

    [user@localhost ~]$ ./check_linux_metrics.py procs ,16, ,32,
    Total:149 Running:1 Sleeping:148 Waiting:0 Zombie:0 Others:0 New_Forks:4.52/s (OK) | total=149.00;; forks=4.52 sleeping=148.00 running=1.00;16;32 waiting=0.00;; zombie=0.00 others=0.00
    
    # Total:全部进程
    # Running:运行的进程
    # Sleeping:休眠的进程
    # Waiting:等待的进程
    # Zombie:僵尸进程
    # Other:其他
    # New_Forks: 建立新进程的速度
  • Disk IO

<script> diskio block_device [warn(read,write)] [critical(read,write)]

note: unit is sectors/sec

    [user@localhost ~]$ ./check_linux_metrics.py diskio /dev/cciss/c0d0
    This was the first run, run again to get values: diskio(cciss/c0d0)

    [user@localhost ~]$ ./check_linux_metrics.py diskio /dev/cciss/c0d0
    /dev/cciss/c0d0(cciss/c0d0) Read: 0.00 sec/s (0.00 t/s) Write: 785.82 sec/s (63.47 t/s) [t:60.04] | read_operations=0.00 read_sectors=0.00 read_time=0.00 write_operations=63.47 write_sectors=785.82 write_time=18868.11

    [user@localhost ~]$ ./check_linux_metrics.py diskio /dev/cciss/c0d0 50,100 200,250
    /dev/cciss/c0d0(cciss/c0d0) Read: 0.00 sec/s (0.00 t/s) Write: 765.68 sec/s (55.47 t/s) [t:60.05] (Critical) | read_operations=0.00 read_sectors=0.00;50;200 read_time=0.00 write_operations=55.47 write_sectors=765.68;100;250 write_time=15716.77

    [user@localhost ~]$ ./check_linux_metrics.py diskio /dev/mapper/VolGroup-lv_root
    This was the first run, run again to get values: diskio(dm-0)

    [user@localhost ~]$ ./check_linux_metrics.py diskio /dev/mapper/VolGroup-lv_root
    /dev/mapper/VolGroup-lv_root(dm-0) Read: 0.00 sec/s (0.00 t/s) Write: 1016.04 sec/s (127.01 t/s) [t:60.04] | read_operations=0.00 read_sectors=0.00 read_time=0.00 write_operations=127.01 write_sectors=1016.04 write_time=31707.88

    [user@localhost ~]$ ./check_linux_metrics.py diskio /dev/VolGroup/lv_root
    /dev/VolGroup/lv_root(dm-0) Read: 0.00 sec/s (0.00 t/s) Write: 1074.80 sec/s (134.35 t/s) [t:60.04] | read_operations=0.00 read_sectors=0.00 read_time=0.00 write_operations=134.35 write_sectors=1074.80 write_time=34072.15
    
    # read_operations:读操作
    # read_sectors:读扇区数
    # read_time:读时间
    # write_operations:写操作
    # write_sectors:写扇区数
    # write_time:写时间
  • Disk Usage

<script> disku [warn%] [critical%]

    [user@localhost ~]$ ./check_linux_metrics.py disku /
    / Used: 76.45 GB / 196.74 GB (38.86%) | used=38.86%

    [user@localhost ~]$ ./check_linux_metrics.py disku / 75 90
    / Used: 76.45 GB / 196.74 GB (38.86%) (OK) | used=38.86%;75;90

    [user@localhost ~]$ ./check_linux_metrics.py disku /boot 75 90
    /boot Used: 0.10 GB / 0.47 GB (21.32%) (OK) | used=21.32%;75;90

    [user@localhost ~]$ ./check_linux_metrics.py disku /var 75 90
    Plugin Error: Mount point not valid: (/var)
  • Memory

<script> memory [warn%] [critical%]

note: used memory is calculated as: total - free - cached

    [user@localhost ~]$ ./check_linux_metrics.py memory
    Memory Used: 786.41MB / 11845.97MB (6.64%) | used=786.41;;;0;11845 cached=10911.13 active=7144.62

    [user@localhost ~]$ ./check_linux_metrics.py memory 75 90
    Memory Used: 786.90MB / 11845.97MB (6.64%) (OK) | used=786.90;8884;10661;0;11845 cached=10911.13 active=7144.82
    
    # Memory Used:已使用内存
    # cached:缓存
    # active:最近被使用的内存
  • Swap

<script> cpu [warn%] [critical%]

note: used cached is calculated as: total - free - cached

    [user@localhost ~]$ ./check_linux_metrics.py swap
    Swap Used: 0.11MB / 5992.00MB (0.00%) | used=0.11;;;0;5991 cached=0.18

    [user@localhost ~]$ ./check_linux_metrics.py swap 75 90
    Swap Used: 0.11MB / 5992.00MB (0.00%) (OK) | used=0.11;4493;5392;0;5991 cached=0.18
    
    # Swap Used:已使用Swap
    # cached:缓存
  • Network

<script> network device [warn(rx,tx)] [critical(rx,tx)]

note: unit is MB/s

    [user@localhost ~]$ ./check_linux_metrics.py network eth0
    This was the first run, run again to get values: net:eth0

    [user@localhost ~]$ ./check_linux_metrics.py network eth0
    eth0 Rx: 0.01 MB/s (16.74 p/s) Tx: 0.00 MB/s (11.16 p/s) [t:60.04] | RX_MBps=0.01 RX_PKps=16.74 TX_MBps=0.00 TX_PKps=11.16 PK_ERRORS=0.00

    [user@localhost ~]$ ./check_linux_metrics.py network eth0 30,50 60,80
    eth0 Rx: 0.01 MB/s (17.80 p/s) Tx: 0.00 MB/s (11.62 p/s) [t:60.05] (OK) | RX_MBps=0.01;30;60 RX_PKps=17.80 TX_MBps=0.00;50;80 TX_PKps=11.62 PK_ERRORS=0.00
    
    # RX_MBps:接收流量
    # RX_PKps:接收数据包
    # TX_MBps:发送流量
    # TX_PKps:发送数据包
    # PK_ERRORS:数据包错误

About

A monitoring plugin for icinga/nagios/nsca, that reports basic system metrics of a linux host: cpu, load, threads, openfiles, procs, diskio, disku, memory, swap, network

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%