Prometheus
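Prometheus is an open-source toolkit that combines monitoring, alerting and a time-series database. It was originally developed at SoundCloud.
Prometheus works by periodically scraping the state of monitored components over HTTP. The benefit is that any component exposing an HTTP endpoint can be plugged into the monitoring system, with no SDK or other integration work required. This makes it a good fit for virtualized environments such as VMs or Docker.
Prometheus is one of the few monitoring systems that fit Docker, Mesos and Kubernetes environments well.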
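To make the "any component with an HTTP endpoint" idea concrete, here is a minimal sketch of a component that exposes a /metrics endpoint using only the Python standard library. The metric names (demo_uptime_seconds, demo_requests_total), the port 8000 and the helper names are assumptions made up for this illustration; a real service would more likely use an official client library.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def render_metrics(uptime_seconds: float, requests_total: int) -> str:
    """Render two sample metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_uptime_seconds Seconds since the demo process started.",
        "# TYPE demo_uptime_seconds gauge",
        f"demo_uptime_seconds {uptime_seconds:.3f}",
        "# HELP demo_requests_total Number of scrapes served.",
        "# TYPE demo_requests_total counter",
        f"demo_requests_total {requests_total}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the metrics text on GET /metrics and 404 on anything else."""

    requests_total = 0  # hypothetical counter, shared across requests

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        MetricsHandler.requests_total += 1
        body = render_metrics(
            time.time() - START_TIME, MetricsHandler.requests_total
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given port (assumed value)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and adding host:8000 as a scrape target should be enough for Prometheus to start collecting these two series.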
GitHub: https://github.com/prometheus/prometheus
Official site: https://prometheus.io/
Architecture diagram
Grafana
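Grafana is a cross-platform, open-source metrics analytics and visualization tool. It can pull data from many kinds of data sources (Prometheus among them) and display it visually.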
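Deploying the components
Prometheus is more than the Prometheus server itself: it comes with a whole series of components. Here docker-compose is used to deploy prometheus, node_exporter, grafana and alertmanager as one complete Prometheus monitoring stack.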
docker-compose.yml
version: '2'

networks:
  monitor:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    hostname: prometheus
    restart: always
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/node_down.yml:/etc/prometheus/node_down.yml
    ports:
      - "9090:9090"
    networks:
      - monitor

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitor

  node-exporter:
    image: quay.io/prometheus/node-exporter
    container_name: node-exporter
    hostname: node-exporter
    restart: always
    ports:
      - "9100:9100"
    networks:
      - monitor

  pushgateway:
    image: prom/pushgateway
    container_name: pushgateway
    restart: always
    ports:
      - "9091:9091"
    networks:
      - monitor

  grafana:
    image: grafana/grafana
    container_name: grafana
    hostname: grafana
    restart: always
    ports:
      - "3000:3000"
    # depends_on may only name services defined in this file
    depends_on:
      - prometheus
    volumes:
      - ./grafana:/var/lib/grafana
    networks:
      - monitor

  # MySQL exporter, optional
  mysql-exporter:
    image: prom/mysqld-exporter
    container_name: mysql-exporter
    hostname: mysql-exporter
    restart: always
    ports:
      - "9104:9104"
    networks:
      - monitor
    environment:
      # DSN format: user:password@(host:port)/
      # Inside the container, localhost is the exporter itself; point this at the real MySQL host.
      # Read by mysqld-exporter releases before 0.15; newer releases expect a config file instead.
      DATA_SOURCE_NAME: "username:password@(localhost:3306)/"

  # Redis exporter, optional
  redis-exporter:
    image: oliver006/redis_exporter
    container_name: redis_exporter
    hostname: redis_exporter
    restart: always
    ports:
      - "9121:9121"
    networks:
      - monitor
    command:
      # Inside the container, localhost is the exporter itself; point this at the real Redis host.
      - '--redis.addr=redis://localhost:6379'
      - '--redis.password=abcdefg'
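Configuration files
Prometheus configuration file: prometheus.yml
prometheus.yml mainly configures what Prometheus scrapes. Once the endpoint of an exporter is listed here, Prometheus can collect the metrics that exporter exposes.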
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        # Alertmanager address. Inside the compose network use the service name;
        # 127.0.0.1 would point at the Prometheus container itself.
        - targets: ['alertmanager:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# Relative paths are resolved against the directory of this config file.
rule_files:
  - "node_down.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configurations: Prometheus itself plus the exporters.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # Targets scraped by Prometheus
  - job_name: 'prometheus'
    scrape_interval: 8s
    static_configs:
      - targets: ['127.0.0.1:9090']
      - targets: ['node-exporter:9100', '192.168.0.1:9100']
        labels:
          group: 'client-node-exporter'
      - targets: ['pushgateway:9091']
        labels:
          group: 'client-pushgateway'
      - targets: ['mysql-exporter:9104']
        labels:
          group: 'client-mysql'

  # Redis metrics scraped by Prometheus
  - job_name: 'redis_exporter'
    static_configs:
      - targets:
          - redis-exporter:9121

  # GPU metrics scraped by Prometheus
  # - job_name: 'dcgm'
  #   static_configs:
  #     - targets: ['192.168.0.3:9400']
  # - job_name: 'nvidia_gpu_exporter'
  #   static_configs:
  #     - targets: ['192.168.0.4:9835']
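Once Prometheus is running with a configuration like this, one way to confirm that every target is actually being scraped is the /api/v1/targets endpoint of the Prometheus HTTP API. The sketch below only reads the job, instance and health fields of the response; the base URL is the example address used elsewhere in this post, and the function names are made up for illustration.

```python
import json
import urllib.request


def summarize_targets(payload: dict) -> list:
    """Return sorted (job, instance, health) tuples from an /api/v1/targets response."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        rows.append(
            (
                labels.get("job", ""),
                labels.get("instance", ""),
                target.get("health", "unknown"),
            )
        )
    return sorted(rows)


def fetch_targets(base_url: str = "http://192.168.0.1:9090") -> list:
    """Query a running Prometheus server (address is an assumption) and summarize it."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return summarize_targets(json.load(resp))
```

Any row whose health is not "up" usually points at an unreachable address or a wrong port in static_configs.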
node_down.yml: a custom alerting rules file. It defines alerts for when a host goes down or when a particular service is down. Adjust it to your own needs.
groups:
  - name: node_down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          user: test
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  - name: node-up
    rules:
      - alert: node-up
        expr: up{job="xxx"} == 0
        for: 15s
        labels:
          severity: "1"
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
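The `for` field is the part of these rules that tends to cause confusion: an alert whose expression is true first becomes pending, and only fires after the expression has stayed true for the whole `for` duration. The following is a simplified model of that behavior, not Prometheus code; the function name and the sample format are assumptions for illustration.

```python
def alert_state(samples, for_seconds):
    """Return 'inactive', 'pending' or 'firing' after the last rule evaluation.

    samples: list of (timestamp_seconds, condition_is_true) in evaluation order,
    e.g. one entry per evaluation_interval where the condition is `up == 0`.
    """
    active_since = None  # timestamp at which the condition first became true
    state = "inactive"
    for ts, condition in samples:
        if not condition:
            # The condition cleared: the pending/firing timer resets.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = ts
        state = "firing" if ts - active_since >= for_seconds else "pending"
    return state
```

With a 15s evaluation interval and `for: 1m`, a target that goes down at t=0 is pending at t=45 and firing at t=60; a single successful scrape in between resets the timer.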
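Alertmanager configuration file:
alertmanager.yml mainly configures how often alerts are sent, the mailbox they go to, and so on. Besides email, alerts can also be delivered by SMS, WeChat, DingTalk and other channels.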
global:
  resolve_timeout: 5m
  smtp_from: 'xxxx@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'xxxx@qq.com'
  smtp_auth_password: 'xxxxxxxxx'
  smtp_require_tls: false
  smtp_hello: 'xxx' # hostname to identify to the SMTP server (placeholder)

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: 'xxxxx@qq.com'
        send_resolved: true
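To check the mail route without waiting for a real outage, a test alert can be posted straight to Alertmanager's v2 API (POST /api/v2/alerts). This is a rough sketch: the base URL is the example address from this post, the instance name demo:9100 is invented, and the labels simply mirror the InstanceDown rule.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(instance: str, now: datetime, minutes: int = 5) -> list:
    """Build a one-element alert list in the shape accepted by POST /api/v2/alerts."""
    return [
        {
            "labels": {"alertname": "InstanceDown", "instance": instance, "user": "test"},
            "annotations": {"summary": f"Instance {instance} down"},
            "startsAt": now.isoformat(),
            # After endsAt the alert is treated as resolved.
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


def send_test_alert(base_url: str = "http://192.168.0.1:9093",
                    instance: str = "demo:9100") -> int:
    """Post the test alert to a running Alertmanager and return the HTTP status."""
    body = json.dumps(build_test_alert(instance, datetime.now(timezone.utc))).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

If the route and SMTP settings are right, an email should arrive shortly after group_wait elapses, followed by a resolved notification once endsAt passes (because send_resolved is true).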
With the compose file and the configuration files in place, the stack is ready to start.
Start
docker-compose up -d
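After the containers come up, a quick scripted check saves clicking through every port. The sketch below probes the health endpoints of Prometheus, Alertmanager and Pushgateway (/-/healthy), Grafana (/api/health) and the node exporter's /metrics page. The host 192.168.0.1 is the example address used in this post, and the function names are made up for illustration.

```python
import urllib.error
import urllib.request


def health_urls(host: str = "192.168.0.1") -> dict:
    """Map each service in the compose file to a URL that returns 200 when it is up."""
    return {
        "prometheus": f"http://{host}:9090/-/healthy",
        "alertmanager": f"http://{host}:9093/-/healthy",
        "node-exporter": f"http://{host}:9100/metrics",
        "pushgateway": f"http://{host}:9091/-/healthy",
        "grafana": f"http://{host}:3000/api/health",
    }


def check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def check_all(host: str = "192.168.0.1") -> dict:
    """Probe every service and return {service_name: is_healthy}."""
    return {name: check(url) for name, url in health_urls(host).items()}
```

Printing check_all() right after docker-compose up -d gives a one-line overview of which services are reachable.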
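Using Prometheus
Open each application in a browser by entering the host IP plus the corresponding port.
Prometheus: http://192.168.0.1:9090/
xxx-exporter: each exporter's output can be viewed on its own port.
What you see there is mostly raw metrics data, which is not easy to read, so Grafana is used to visualize it.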
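To get a feel for what an exporter page contains, the raw text can also be parsed with a few lines of Python. This is a deliberately simple sketch that handles the common `name{labels} value [timestamp]` lines and skips comments; it is not a full implementation of the exposition format, and parse_metrics is a name invented here.

```python
def parse_metrics(text: str) -> list:
    """Parse exposition text into (series, value) pairs, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, # HELP or # TYPE
        if "}" in line:
            # Series with labels: everything up to the closing brace is the series.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # malformed line, ignore it in this sketch
            series, rest = parts
        fields = rest.split()
        if not fields:
            continue
        # fields[0] is the value; an optional timestamp may follow and is ignored.
        samples.append((series, float(fields[0])))
    return samples
```

Feeding it the body of http://192.168.0.1:9100/metrics, for example, yields a flat list that is easy to filter or sort.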
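Grafana: http://192.168.0.1:3000/login
After logging in, configure a Prometheus data source. You can then build your own monitoring charts or download ready-made dashboards from the Grafana dashboard library. The rest of Grafana is left for you to explore.
Once the data source is added you can create dashboards by hand. Building one from scratch is a bit difficult, so the published templates are a good starting point (https://grafana.com/grafana/dashboards/).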
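The data source can also be created without the UI, through Grafana's HTTP API (POST /api/datasources). The sketch below assumes basic authentication with an admin account and that Grafana reaches Prometheus at http://prometheus:9090 over the compose network; the credentials and function names are placeholders for illustration.

```python
import base64
import json
import urllib.request


def datasource_payload(prometheus_url: str = "http://prometheus:9090") -> dict:
    """Build the JSON body describing a Prometheus data source."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,  # address as seen from the Grafana container (assumed)
        "access": "proxy",      # Grafana's backend queries Prometheus, not the browser
        "isDefault": True,
    }


def create_datasource(grafana_url: str, user: str, password: str, payload: dict) -> dict:
    """POST the payload to a running Grafana and return its JSON reply."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

A call such as create_datasource("http://192.168.0.1:3000", "admin", "your-password", datasource_payload()) should leave the data source ready for imported dashboards.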