Deploying Prometheus + Grafana with Docker

Prometheus

Prometheus is an open-source system that combines monitoring, alerting, and a time-series database. It was originally developed at SoundCloud.

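Prometheus works by periodically scraping the state of monitored components over HTTP. The advantage is that any component can be plugged into the monitoring system as long as it exposes an HTTP endpoint; no SDK or other integration work is required. This makes it a good fit for virtualized environments such as VMs or Docker.
Prometheus is probably one of the few monitoring systems well suited to Docker, Mesos, and Kubernetes environments.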
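
To make the pull model concrete, here is a minimal sketch of a component that exposes an HTTP endpoint Prometheus could scrape, using only the Python standard library. The metric name demo_uptime_seconds, the port 8000, and the helper names are illustrative assumptions, not part of the deployment in this post.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def render_metrics(uptime_seconds: float) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_uptime_seconds Seconds since the demo process started.",
        "# TYPE demo_uptime_seconds gauge",
        f"demo_uptime_seconds {uptime_seconds:.3f}",
    ]
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metrics on GET /metrics and 404 on anything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(time.time() - START_TIME).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics on the given (hypothetical) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; adding host:8000 as a target in a scrape job would then be enough for Prometheus to collect the gauge. Real services would normally use an official client library rather than hand-written output.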

GitHub: https://github.com/prometheus/prometheus
Website: https://prometheus.io/

Architecture diagram

Grafana

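Grafana is a cross-platform, open-source metrics analytics and visualization tool. It can pull data from many data sources (such as Prometheus) and display it visually.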

Components to deploy

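Prometheus is more than the Prometheus server itself; it comes with a family of components. This post uses docker-compose to deploy Prometheus, node_exporter, Grafana, and Alertmanager (plus Pushgateway and optional MySQL and Redis exporters) as a complete Prometheus monitoring stack.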

docker-compose.yml

version: '2'

networks:
  monitor:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    hostname: prometheus
    restart: always
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/node_down.yml:/etc/prometheus/node_down.yml
    ports:
      - "9090:9090"
    networks:
      - monitor

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitor

  node-exporter:
    image: quay.io/prometheus/node-exporter
    container_name: node-exporter
    hostname: node-exporter
    restart: always
    ports:
      - "9100:9100"
    networks:
      - monitor

  pushgateway:
    image: prom/pushgateway
    container_name: pushgateway
    restart: always
    ports:
      - "9091:9091"
    networks:
      - monitor

  grafana:
    image: grafana/grafana
    container_name: grafana
    hostname: grafana
    restart: always
    ports:
      - "3000:3000"
    depends_on:
      # loki and promtail are not defined in this file; depend on prometheus instead
      - prometheus
    volumes:
      - ./grafana:/var/lib/grafana
    networks:
      - monitor

  # MySQL exporter (optional)
  mysql-exporter:
    image: prom/mysqld-exporter
    container_name: mysql-exporter
    hostname: mysql-exporter
    restart: always
    ports:
      - "9104:9104"
    networks:
      - monitor
    environment:
      # DSN format: user:password@(host:port)/
      DATA_SOURCE_NAME: "username:password@(localhost:3306)/"

  # Redis exporter (optional)
  redis-exporter:
    image: oliver006/redis_exporter
    container_name: redis_exporter
    hostname: redis_exporter
    restart: always
    ports:
      - "9121:9121"
    networks:
      - monitor
    command:
      - '--redis.addr=redis://localhost:6379'
      - '--redis.password=abcdefg'

Configuration files

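Prometheus configuration file: prometheus.yml
prometheus.yml mainly defines what Prometheus scrapes. Any exporter whose address is listed as a target will have the metrics it exposes collected by Prometheus.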

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093'] # Alertmanager address (Compose service name; 127.0.0.1 would be the Prometheus container itself)

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "node_down.yml"
# - "first_rules.yml"
# - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
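  # Targets scraped under the 'prometheus' job.
  # Inside the Compose network, containers are reached by service name;
  # 127.0.0.1 only refers to the Prometheus container itself.
  - job_name: 'prometheus'
    scrape_interval: 8s
    static_configs:
      - targets: ['127.0.0.1:9090']
      - targets: ['node-exporter:9100','192.168.0.1:9100']
        labels:
          group: 'client-node-exporter'
      - targets: ['pushgateway:9091']
        labels:
          group: 'client-pushgateway'
      - targets: ['mysql-exporter:9104']
        labels:
          group: 'client-mysql'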


  # Scrape Redis metrics from redis_exporter (reached by its Compose service name)
  - job_name: 'redis_exporter'
    static_configs:
      - targets:
          - redis-exporter:9121

  # Scrape GPU metrics (optional)
#  - job_name: 'dcgm'
#    static_configs:
#      - targets: ['192.168.0.3:9400']
#  - job_name: 'nvidia_gpu_exporter'
#    static_configs:
#      - targets: ['192.168.0.4:9835']
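
The scrape configuration above lists a Pushgateway target, which only has data once something pushes to it. The following is a minimal sketch of such a push using the Python standard library. The base URL http://127.0.0.1:9091 (the port published by the Compose file), the job name, and the metric name are assumptions for illustration.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(base_url: str, job: str, metrics: dict) -> urllib.request.Request:
    """Build a PUT request that pushes simple untyped samples to Pushgateway.

    PUT replaces every metric previously pushed under the same job grouping key.
    """
    lines = [f"{name} {value}" for name, value in metrics.items()]
    # The text exposition format requires a trailing newline.
    body = ("\n".join(lines) + "\n").encode("utf-8")
    # Keep job names simple; names containing '/' need special encoding.
    url = f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(base_url: str, job: str, metrics: dict, timeout: float = 5.0) -> int:
    """Send the push and return the HTTP status code."""
    request = build_push_request(base_url, job, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, a batch job could call push("http://127.0.0.1:9091", "nightly_backup", {"backup_last_success_timestamp_seconds": 1700000000}) when it finishes, and Prometheus would pick the value up on its next scrape of the Pushgateway target.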

node_down.yml: a custom alerting rule file. It defines alerts for when a host goes down or a service is down. Adjust it to your needs.

groups:
  - name: node_down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          user: test
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  - name: node-up
    rules:
      - alert: node-up
        expr: up{job="xxx"} == 0
        for: 15s
        labels:
          severity: "1"
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"

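Alertmanager configuration file:
alertmanager.yml mainly configures how often alerts are sent, the mailbox that receives them, and so on. Besides email, Alertmanager can also be configured to notify via SMS, WeChat, DingTalk, and other channels.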

global:
  resolve_timeout: 5m
  smtp_from: 'xxxx@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'xxxx@qq.com'
  smtp_auth_password: 'xxxxxxxxx'
  smtp_require_tls: false
  smtp_hello: 'localhost' # hostname announced to the SMTP server; it is not a display name

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'xxxxx@qq.com'
        send_resolved: true
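
One way to check the routing and email receiver above without waiting for a real outage is to post a hand-made alert to Alertmanager's v2 HTTP API. The sketch below assumes Alertmanager is reachable at http://127.0.0.1:9093 (the port published by the Compose file); the alert name, instance, and helper names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(alertname: str, instance: str, summary: str) -> list:
    """Build the JSON payload (a list of alerts) expected by POST /api/v2/alerts."""
    return [
        {
            "labels": {"alertname": alertname, "instance": instance},
            "annotations": {"summary": summary},
            # RFC 3339 timestamp in UTC marking when the alert started firing.
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]


def send_alert(base_url: str, alerts: list, timeout: float = 5.0) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling send_alert("http://127.0.0.1:9093", build_test_alert("ManualTest", "demo:9100", "manual test alert")) should, if the SMTP settings are valid, produce a notification for the 'email' receiver after group_wait has elapsed.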

With the Compose file and the configuration files in place, the stack is ready to start.

Startup

docker-compose up -d
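
After the containers are up, a quick way to confirm that each service answers is to probe its health endpoint. This is a minimal sketch assuming the ports published in the Compose file and a host of 127.0.0.1; the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# service name -> (published port, health or metrics path)
ENDPOINTS = {
    "prometheus": (9090, "/-/healthy"),
    "alertmanager": (9093, "/-/healthy"),
    "pushgateway": (9091, "/-/healthy"),
    "node-exporter": (9100, "/metrics"),
    "grafana": (3000, "/api/health"),
}


def check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx HTTP errors.
        return False


def check_all(host: str = "127.0.0.1") -> dict:
    """Probe every endpoint and map service name -> reachable?"""
    return {
        name: check(f"http://{host}:{port}{path}")
        for name, (port, path) in ENDPOINTS.items()
    }
```

Printing check_all() gives a per-service True/False summary; a False entry is a hint to look at that container's logs with docker-compose logs.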

Using Prometheus

Open each application in a browser at the host IP plus the corresponding port.

Prometheus: http://192.168.0.1:9090/
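
Besides the web UI, the same data is available through the Prometheus HTTP API. The sketch below queries the built-in up metric to see which targets are currently reachable. It assumes the address shown above; the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_up_result(payload: dict) -> dict:
    """Map instance -> True/False from an instant-vector response for `up`."""
    status = {}
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "")
        # Each value is a [unix_timestamp, "string value"] pair.
        _timestamp, value = sample["value"]
        status[instance] = value == "1"
    return status


def query_up(base_url: str, timeout: float = 5.0) -> dict:
    """Ask Prometheus which scrape targets are up right now."""
    with urllib.request.urlopen(build_query_url(base_url, "up"), timeout=timeout) as response:
        return parse_up_result(json.load(response))
```

For example, query_up("http://192.168.0.1:9090") returns a dictionary keyed by target address, which mirrors what the Targets page of the web UI shows.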

Each xxx-exporter can be inspected on its own port.
What you see there is mostly raw metrics data, which is not easy to read, so Grafana is used for visualization.
Grafana: http://192.168.0.1:3000/login

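After logging in, add Prometheus as a data source. You can then build your own monitoring charts or download ready-made dashboards from the Grafana dashboard library. Refer to the Grafana documentation for detailed usage.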