title: 2.2. Monitoring Linux Servers
order: 8
icon: lightbulb
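I. Environment
Hostname | IP address | OS | Notes |
localhost | 192.168.11.61 | Ubuntu 20.04 | Prometheus installed via Docker. Docker version 23.0.1 |
test | 192.168.11.62 | Ubuntu 20.04 | Docker version 23.0.1; node_exporter is installed to monitor this server |
A real environment has many hosts to monitor. For testing, I created a linked clone of the host and changed its IP address to 192.168.11.62.
The rest of this course uses the Docker-based Prometheus installation.
1. Prepare the environment
Rename the host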
hostnamectl set-hostname test
Install Docker
Configure a registry mirror
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<-'EOF'
{
"registry-mirrors": ["http://hub-mirror.c.163.com"]
}
EOF
Install Docker
export DOWNLOAD_URL="http://mirrors.163.com/docker-ce"
curl -fsSL https://get.docker.com/ | sh
Check
docker -v
Or:
systemctl status docker
Install Docker Compose
Installation commands
# Option 1 (pick one of the two). The curl command must stay on a single line; it is wrong if it wraps.
curl -L https://get.daocloud.io/docker/compose/releases/download/1.29.2/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
# Option 2 (pick one of the two). The curl command must stay on a single line; it is wrong if it wraps.
curl -L https://github.com/docker/compose/releases/download/1.29.2/docker-compose-Linux-x86_64 > /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
Check
docker-compose -v
II. node_exporter
1. Binary installation (option 1 of 2)
Official download page: https://prometheus.io/download/
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
ls -l
mkdir -p /opt/prometheus
mv node_exporter-1.5.0.linux-amd64 /opt/prometheus/node_exporter
Create the user
useradd -M -s /usr/sbin/nologin prometheus
Change ownership of the node_exporter directory:
chown prometheus:prometheus -R /opt/prometheus/node_exporter
Create the systemd service
cat > /etc/systemd/system/node_exporter.service <<"EOF"
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Start node_exporter
systemctl daemon-reload
systemctl start node_exporter.service
Enable start on boot
systemctl enable node_exporter.service
Check
systemctl status node_exporter.service
Check the logs
journalctl -u node_exporter.service -f
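As an extra check, you could confirm that the exporter actually answers on its HTTP endpoint. The sketch below is only an illustration: `metric_value` is a hypothetical helper name, and it assumes node_exporter listens on its default port 9100 and that the metric you ask for carries no labels.

```shell
# Hypothetical helper: print the value of an unlabelled metric
# from Prometheus text-format output read on stdin.
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Example usage on the monitored host (requires a running node_exporter):
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```

If the command prints a number, the exporter is up and the load collector is working.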
Update the Prometheus configuration
Run this on the Prometheus server
nano /opt/prometheus/prometheus/prometheus.yml
# Add the following under the scrape_configs line:
  # node-exporter job
  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.11.62:9100']
        labels:
          instance: test-server
Reload Prometheus
curl -X POST http://localhost:9090/-/reload
2. Docker installation (option 2 of 2)
mkdir /data/node_exporter -p
cd /data/node_exporter
cat > docker-compose.yaml <<"EOF"
version: '3.3'
services:
  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'
    ports:
      - '9100:9100'
EOF
Check
cat docker-compose.yaml
Start
docker-compose up -d
Check
docker ps
Or:
docker logs -f node-exporter
Update the Prometheus configuration
Run this on the Prometheus server
nano /data/docker-prometheus/prometheus/prometheus.yml
# Add the following under the scrape_configs line:
  # node-exporter job
  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['192.168.11.62:9100']
        labels:
          instance: test-server
Reload the Prometheus configuration
curl -X POST http://localhost:9090/-/reload
Web access addresses
Application | URL | Notes |
node-exporter | | No username or password |
Check in the Prometheus web UI
http://192.168.11.61:9090/targets?search=
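The same check can be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports a `health` field for each active target. A rough sketch, assuming the compact JSON that the API returns; `target_health` is a made-up helper name:

```shell
# Hypothetical helper: list the "health" value of every target
# found in /api/v1/targets JSON read on stdin.
target_health() {
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4
}

# Example usage (needs network access to the Prometheus server):
#   curl -s http://192.168.11.61:9090/api/v1/targets | target_health
```

Each output line should read `up`; anything else points to a target that Prometheus cannot scrape.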
3. Common monitoring metrics
View them at: http://192.168.11.61:9090/graph
CPU collection
node_cpu_seconds_total
Name | Meaning |
node_load1 | 1-minute load average |
node_load5 | 5-minute load average |
node_load15 | 15-minute load average |
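Load averages are easier to judge relative to the number of CPU cores: a load of 4 roughly keeps a 4-core host fully busy but is light on a 16-core one. As a sketch, the hypothetical helper below divides the 1-minute load by the core count; the file and core-count arguments are assumptions added so it can be tried on sample data.

```shell
# Hypothetical helper: 1-minute load average divided by CPU core count.
# $1 = loadavg-style file (default /proc/loadavg), $2 = cores (default: nproc).
load1_per_core() {
  cores="${2:-$(nproc)}"
  awk -v c="$cores" '{ printf "%.2f\n", $1 / c }' "${1:-/proc/loadavg}"
}

# Example usage on a Linux host:
#   load1_per_core
```

A result that stays near or above 1.00 suggests the CPUs are saturated.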
Memory collection
Source: the /proc/meminfo file
node_memory_
Name | Meaning | Notes |
node_memory_MemTotal_bytes | Total memory | In bytes; /1024/1024 = MiB, /1024/1024/1024 = GiB |
node_memory_MemAvailable_bytes | Memory available for use without swapping (roughly free + reclaimable buffers/cache) | |
node_memory_MemFree_bytes | Free physical memory | |
node_memory_SwapFree_bytes | Free swap space | |
node_memory_SwapTotal_bytes | Total swap space | |
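These gauges mirror fields in /proc/meminfo, so the ratio used for memory alerting can be reproduced on the host itself. A minimal sketch (the helper name and the optional file argument are assumptions for illustration):

```shell
# Hypothetical helper: MemAvailable as a percentage of MemTotal.
# $1 = meminfo-style file (default /proc/meminfo).
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", avail / total * 100 }' \
    "${1:-/proc/meminfo}"
}

# Example usage on a Linux host:
#   mem_available_pct
```

This corresponds to the PromQL expression node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100.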
Disk collection
node_disk_
Filesystem collection
node_filesystem_
Name | Meaning | Notes |
node_filesystem_avail_bytes | Available disk space, in bytes | /1024/1024 = MiB, /1024/1024/1024 = GiB |
node_filesystem_size_bytes | Total filesystem size, in bytes | |
node_filesystem_files_free | Number of free inodes | |
node_filesystem_files | Total number of inodes | |
Network collection
node_network_
Name | Meaning | Notes |
node_network_transmit_bytes_total | Outbound traffic, cumulative bytes | rate(...) / 1024 / 1024 = MiB/s |
node_network_receive_bytes_total | Inbound traffic, cumulative bytes | |
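Both metrics are ever-increasing counters, so throughput comes from the difference between two samples divided by the time between them, which is the idea behind PromQL's rate(). A simplified sketch that ignores counter resets and the extrapolation rate() performs; the helper name is hypothetical:

```shell
# Hypothetical helper: average MiB/s between two counter samples.
# $1 = earlier byte count, $2 = later byte count, $3 = seconds elapsed.
counter_rate_mib() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.2f\n", (b - a) / s / 1024 / 1024 }'
}

# Example: 100 MiB transferred in 10 seconds.
counter_rate_mib 0 104857600 10   # prints 10.00
```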
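File descriptors
node_filefd_allocated: number of allocated file descriptors. View it with cat /proc/sys/fs/file-nr
node_filefd_maximum: maximum number of file descriptors the system supports. View it in /proc/sys/fs/file-max or /proc/sys/fs/file-nr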
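/proc/sys/fs/file-nr holds three numbers: allocated handles, allocated-but-unused handles, and the system-wide maximum. A sketch of computing the usage percentage from it (the helper name and the file argument are assumptions):

```shell
# Hypothetical helper: allocated file handles as a percentage of the maximum.
# $1 = file-nr style file (default /proc/sys/fs/file-nr).
filefd_usage_pct() {
  awk '$3 > 0 { printf "%.1f\n", $1 / $3 * 100 }' "${1:-/proc/sys/fs/file-nr}"
}

# Example usage on a Linux host:
#   filefd_usage_pct
```

This is the same ratio as node_filefd_allocated / node_filefd_maximum * 100.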
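Process file descriptors
process_max_fds: maximum number of file descriptors the process may open.
process_open_fds: number of file descriptors the process currently has open. Can be counted with lsof -p <PID> | wc -l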
4. Alert rule setup
cd /data/docker-prometheus/
cat >> prometheus/alert.yml <<"EOF"
- name: node-exporter
  rules:
  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of memory, instance: {{ $labels.instance }}"
      description: "Available memory < 10%, current value: {{ $value }}"
  - alert: HostMemoryUnderMemoryPressure
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host under memory pressure, instance: {{ $labels.instance }}"
      description: "The node is under heavy memory pressure with a high rate of major page faults, current value: {{ $value }}"
  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual inbound network throughput, instance: {{ $labels.instance }}"
      description: "Inbound network traffic > 100 MiB/s, current value: {{ $value }}"
  - alert: HostUnusualNetworkThroughputOut
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual outbound network throughput, instance: {{ $labels.instance }}"
      description: "Outbound network traffic > 100 MiB/s, current value: {{ $value }}"
  - alert: HostUnusualDiskReadRate
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read rate, instance: {{ $labels.instance }}"
      description: "Disk reads > 50 MiB/s, current value: {{ $value }}"
  - alert: HostUnusualDiskWriteRate
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write rate, instance: {{ $labels.instance }}"
      description: "Disk writes > 50 MiB/s, current value: {{ $value }}"
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space, instance: {{ $labels.instance }}"
      description: "Free disk space < 10%, current value: {{ $value }}"
  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Disk predicted to fill within 24 hours, instance: {{ $labels.instance }}"
      description: "At the current write rate, disk space is predicted to run out within 24 hours, current value: {{ $value }}"
  - alert: HostOutOfInodes
    expr: node_filesystem_files_free{mountpoint="/"} / node_filesystem_files{mountpoint="/"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Host out of inodes, instance: {{ $labels.instance }}"
      description: "Free inodes < 10%, current value: {{ $value }}"
  - alert: HostUnusualDiskReadLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read latency, instance: {{ $labels.instance }}"
      description: "Disk read latency > 100ms, current value: {{ $value }}"
  - alert: HostUnusualDiskWriteLatency
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write latency, instance: {{ $labels.instance }}"
      description: "Disk write latency > 100ms, current value: {{ $value }}"
  - alert: high_load
    expr: node_load1 > 4
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "1-minute load average too high, instance: {{ $labels.instance }}"
      description: "1-minute load average > 4 for 2 minutes, current value: {{ $value }}"
  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage, instance: {{ $labels.instance }}"
      description: "CPU usage > 80%, current value: {{ $value }}"
  - alert: HostCpuStealNoisyNeighbor
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Abnormal CPU steal, instance: {{ $labels.instance }}"
      description: "CPU steal > 10%. A noisy neighbor is hurting VM performance, or a spot instance may have run out of credit, current value: {{ $value }}"
  - alert: HostSwapIsFillingUp
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Abnormal swap usage, instance: {{ $labels.instance }}"
      description: "Swap usage > 80%"
  - alert: HostNetworkReceiveErrors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network receive errors, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} receive error ratio over the last 2 minutes: {{ $value }}"
  - alert: HostNetworkTransmitErrors
    expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network transmit errors, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} transmit error ratio over the last 2 minutes: {{ $value }}"
  - alert: HostNetworkInterfaceSaturated
    expr: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8 < 10000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Network interface saturated, instance: {{ $labels.instance }}"
      description: "Interface {{ $labels.device }} is overloaded, current value: {{ $value }}"
  - alert: HostConntrackLimit
    expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Conntrack table close to its limit, instance: {{ $labels.instance }}"
      description: "Too many tracked connections, current ratio of entries to limit: {{ $value }}"
  - alert: HostClockSkew
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Clock skew detected, instance: {{ $labels.instance }}"
      description: "Clock skew detected, the clock is out of sync. Value: {{ $value }}"
  - alert: HostClockNotSynchronising
    expr: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Clock not synchronising, instance: {{ $labels.instance }}"
      description: "The clock is not synchronising"
  - alert: NodeFileDescriptorLimit
    expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "The kernel is predicted to exhaust its file descriptor limit soon"
      description: "{{ $labels.instance }} has allocated more than 80% of the file descriptor limit, current value: {{ $value }}"
EOF
Check:
cat prometheus/alert.yml
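Because the heredoc appends to an existing file, it can help to count the rules afterwards and make sure nothing was appended twice. A small sketch; `count_alert_rules` is a hypothetical helper that simply counts `- alert:` lines:

```shell
# Hypothetical helper: count the alert rules defined in a rules file.
# $1 = path to the rules file.
count_alert_rules() {
  grep -c -E '^[[:space:]]*- alert:' "$1"
}

# Example usage from /data/docker-prometheus:
#   count_alert_rules prometheus/alert.yml
```

Running it before and after the append shows how many rules the node-exporter group added.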
Validate the configuration
docker exec -it prometheus promtool check config /etc/prometheus/prometheus.yml
Reload the configuration
curl -X POST http://localhost:9090/-/reload
Check
http://192.168.11.61:9090/rules
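5. Displaying node-exporter data in Grafana
When we installed Prometheus, we already added the Prometheus data source in Grafana and imported the dashboard template with ID 1860, so there is no need to import it again. Open it directly, as shown in the figure below:
Username: admin
Password: password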
https://grafana.com/grafana/dashboards/8919
This dashboard needs a small change: in its JSON model, replace every occurrence of $instance:.+ with $instance.