Prometheus Monitoring System Tutorial

I. Introduction

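Prometheus is an open-source monitoring system. It was originally developed at SoundCloud and joined the Cloud Native Computing Foundation in 2016. Prometheus is designed to collect and process metrics from a variety of sources, including servers, applications, and services.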

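Prometheus collects metrics by scraping targets over HTTP and queries them with its own query language, PromQL. It also provides a built-in web UI, alerting, alert grouping, an HTTP API for integrations, and more. For short-lived jobs that cannot be scraped directly, Prometheus offers the Pushgateway, which accepts pushed metrics and exposes them for scraping.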

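Prometheus's data model is very practical. Each time series is identified by a metric name and a set of key-value labels, so every dimension is a meaningful label that you can filter or aggregate on. PromQL queries run against a single server's own storage and do not depend on multiple storage nodes, which gives a uniform view of the data. As a result, most tasks can be handled with the built-in tools and the HTTP API.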
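
To make the label-based data model concrete, here is a minimal Python sketch. It is an illustration only, not how Prometheus stores data: the helper names `series_key` and `format_series` are hypothetical, and the sketch assumes simply that a series is identified by its metric name plus its full label set.

```python
from typing import Dict, FrozenSet, Tuple

# A series identity: metric name plus an order-independent set of label pairs.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity for a series (hypothetical helper)."""
    return name, frozenset(labels.items())


def format_series(name: str, labels: Dict[str, str]) -> str:
    """Render the series in the usual name{label="value"} notation."""
    if not labels:
        return name
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return name + "{" + body + "}"


# Two label sets that differ in any label value are two distinct series.
get_series = series_key("http_requests_total", {"job": "api", "method": "GET"})
post_series = series_key("http_requests_total", {"job": "api", "method": "POST"})
print(get_series != post_series)
print(format_series("http_requests_total", {"method": "GET", "job": "api"}))
```

Changing any label value yields a different series, which is why labels work well as dimensions for filtering and aggregation.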

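II. Installation and Configuration

1. Installation

Prometheus is written in Go. You can download a precompiled tarball from the releases page of its GitHub repository or build it from source.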

wget https://github.com/prometheus/prometheus/releases/download/v2.15.2/prometheus-2.15.2.linux-amd64.tar.gz
tar xfz prometheus-*.tar.gz
cd prometheus-*/
./prometheus

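By default, Prometheus looks for prometheus.yml in the current directory and uses it as its configuration. If you keep the file somewhere else, specify its location with the command-line flag --config.file=/path/to/prometheus.yml.

2. Configuration

The Prometheus configuration file is written in YAML. Prometheus uses it to define how metrics are scraped and processed. The following is an example configuration file: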

global:
  scrape_interval:     15s      
  evaluation_interval: 15s      

scrape_configs:
  - job_name: 'prometheus'    
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'           
    static_configs:
      - targets: ['localhost:9100']

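In the example above, Prometheus scrapes the configured jobs every 15 seconds. To obtain metrics, run an exporter or instrument the application with a Prometheus client library on each application instance or node you want to monitor. Prometheus then scrapes each target at its host address and port, such as localhost:9100 for the node job.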
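
A scrape target answers with plain text in the Prometheus exposition format. The Python sketch below shows roughly what such a response body looks like for a counter. The function names `escape_label_value` and `render_counter` are hypothetical, and a real application would normally use an official client library instead of formatting the text by hand.

```python
from typing import Dict, List, Tuple


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str,
                   samples: List[Tuple[Dict[str, str], float]]) -> str:
    """Render one counter metric with its samples as exposition text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(name + "{" + body + "} " + str(value))
        else:
            lines.append(f"{name} {value}")
    # The exposition format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET"}, 3), ({"method": "POST"}, 1)],
))
```

Serving this text over HTTP, conventionally at the /metrics path, is what makes a process scrapeable as one of the targets listed in static_configs.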

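III. PromQL Query Language

1. Basic Queries

PromQL is short for Prometheus Query Language. It is a functional query language of its own, not a SQL dialect, and is used to select and aggregate metric data stored in Prometheus.

The following is an example query: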

http_requests_total

The query above returns the time series of the http_requests_total metric, which in this example represents the total number of requests sent to the web server.
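
The same query can also be sent to the Prometheus HTTP API. The Python sketch below builds a request URL for the instant query endpoint, /api/v1/query, and parses a response in the JSON shape that endpoint returns. The base URL http://localhost:9090 and the helper names `build_query_url` and `parse_instant_vector` are assumptions for illustration; no network call is made here.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(body: str) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector response body into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got " + data["resultType"])
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


print(build_query_url("http://localhost:9090", "http_requests_total"))
```

To run it against a live server, fetch the URL with any HTTP client (for example urllib.request.urlopen) and pass the decoded body to parse_instant_vector.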

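2. Aggregation

The most commonly used aggregation operators are count, sum, avg, min, and max. They can be applied to time series. The following is an example: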

sum(http_requests_total)

The query above returns http_requests_total summed across all series, whatever their labels. You can also group the result by one or more labels, for example:

sum(http_requests_total) by (job)

The query above returns the sum of http_requests_total for each job name.
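
The grouping behaviour of sum with and without by can be imitated in a few lines of Python. This is a simplified sketch over made-up sample values, not the PromQL engine; the function name `sum_by` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[Dict[str, str], float]


def sum_by(samples: Iterable[Sample], by: Tuple[str, ...] = ()) -> Dict[tuple, float]:
    """Sum sample values, keeping only the labels listed in `by`.

    With an empty `by`, every sample falls into one group, like sum(metric).
    A label missing from a sample is treated as the empty string.
    """
    totals: Dict[tuple, float] = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


# Made-up instant values for http_requests_total.
samples: List[Sample] = [
    ({"job": "api", "method": "GET"}, 3.0),
    ({"job": "api", "method": "POST"}, 4.0),
    ({"job": "web", "method": "GET"}, 5.0),
]

print(sum_by(samples))                # like sum(http_requests_total)
print(sum_by(samples, by=("job",)))   # like sum(http_requests_total) by (job)
```

The output keeps only the labels named in by, which mirrors how the aggregated series in PromQL drop every other label.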

IV. Alerting and Alert Grouping

1. Alerting Rules

Prometheus alerting rules are built from PromQL expressions.

The following is an example alerting rule:

alert: TargetDown
expr: up == 0
for: 5m
labels:
  severity: critical
annotations:
  summary: "Instance {{ $labels.instance }} down"

The rule above fires when a target has been unreachable (up == 0) for 5 minutes, and attaches the label severity: critical to the alert.
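
The effect of the for clause can be sketched as a small state machine: the alert is pending while the expression has been true for less than the hold duration, and firing once it has been true for at least that long. The Python below is a simplified model under that assumption; the function name `alert_states` and the evaluation timestamps are hypothetical.

```python
from typing import List, Optional, Tuple


def alert_states(evaluations: List[Tuple[int, bool]], for_seconds: int) -> List[str]:
    """Return the alert state after each rule evaluation.

    `evaluations` holds (timestamp_in_seconds, expression_is_true) pairs in
    chronological order, one per evaluation interval.
    """
    states: List[str] = []
    active_since: Optional[int] = None
    for timestamp, is_true in evaluations:
        if not is_true:
            # The condition cleared, so the hold timer resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# up == 0 is true for six evaluations one minute apart, clears, then recurs.
evaluations = [(0, True), (60, True), (120, True), (180, True),
               (240, True), (300, True), (360, False), (420, True)]
print(alert_states(evaluations, for_seconds=300))
```

A brief outage that recovers before the hold duration elapses never leaves the pending state, which is the point of using for: it filters out short blips.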

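2. Alert Grouping

Alert grouping combines related alerts into a single notification instead of generating a separate notification for each alert. Notifications are grouped by Alertmanager, while rule files organize the alerting rules themselves into named groups.

The following is an example rule file with two rule groups: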

groups:
  - name: Disk space
    rules:
      - alert: DiskUsageCritical
        expr: node_filesystem_utilisation{mountpoint="/"} > 0.8
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Instance {{ $labels.instance }} - high disk usage
  - name: Another group
    rules:
      - alert: AnotherAlert
        expr: another_expression
        labels:
          severity: high
        annotations:
          summary: Another alert
          description: Yet another alert for testing purposes

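The example above defines two rule groups. The first group checks the utilization of the root filesystem and alerts administrators when disk usage stays above 80% for 15 minutes.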
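
Grouping alerts into a single notification, as described at the start of this section, can be sketched as bucketing alerts by a chosen set of labels. Alertmanager exposes this idea through the group_by setting of a route; the Python below is only a simplified illustration that ignores timing and routing, and the function name `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert represented by its labels only


def group_alerts(alerts: List[Alert], group_by: Tuple[str, ...]) -> Dict[tuple, List[Alert]]:
    """Bucket alerts that share the same values for the group_by labels."""
    groups: Dict[tuple, List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "DiskUsageCritical", "instance": "node1:9100", "severity": "critical"},
    {"alertname": "DiskUsageCritical", "instance": "node2:9100", "severity": "critical"},
    {"alertname": "TargetDown", "instance": "node3:9100", "severity": "critical"},
]

# One notification per distinct alertname instead of one per alert.
for key, members in group_alerts(alerts, group_by=("alertname",)).items():
    print(key, [alert["instance"] for alert in members])
```

With these three sample alerts, grouping by alertname produces two notifications rather than three.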

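V. Visualization Tools

1. Grafana

Grafana is an open-source, web-based analytics and visualization solution that can build and display dashboards on top of a Prometheus data source.

The following is an example configuration file. The datasources section follows Grafana's data source provisioning format; the dashboards section is only a schematic outline, since Grafana provisions dashboards from separate provider files and dashboard JSON definitions.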

apiVersion: 1

datasources:
- name: prometheus
  type: prometheus
  url: http://localhost:9090
  access: direct
  isDefault: true

dashboards:
- name: Example dashboard
  dataSource: prometheus
  panels:
  # ...
  templating:
  # ...

2. Prometheus Web UI

Prometheus ships with a built-in web UI that you can use to explore the available metrics. You can reach it at the following URL:

http://localhost:9090/graph

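This UI lets you run PromQL queries, graph the resulting time series, and view alerting rules and their current state. The rules themselves are defined in rule files, not created in the UI.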

Original article by LZYSX. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/361610.html