Wiki major update
updated with recent documentation
This commit is contained in:
209
wiki/service-docs/alertmanager.md
Normal file
209
wiki/service-docs/alertmanager.md
Normal file
@@ -0,0 +1,209 @@
|
||||
# Alertmanager - Alert Routing
|
||||
|
||||
## Table of Contents
|
||||
- [Overview](#overview)
|
||||
- [What is Alertmanager?](#what-is-alertmanager)
|
||||
- [Why Use Alertmanager?](#why-use-alertmanager)
|
||||
- [Configuration in AI-Homelab](#configuration-in-ai-homelab)
|
||||
- [Official Resources](#official-resources)
|
||||
- [Docker Configuration](#docker-configuration)
|
||||
|
||||
## Overview
|
||||
|
||||
**Category:** Alert Management
|
||||
**Docker Image:** [prom/alertmanager](https://hub.docker.com/r/prom/alertmanager)
|
||||
**Default Stack:** `monitoring.yml`
|
||||
**Web UI:** `http://SERVER_IP:9093`
|
||||
**Purpose:** Handle Prometheus alerts
|
||||
**Ports:** 9093
|
||||
|
||||
## What is Alertmanager?
|
||||
|
||||
Alertmanager handles alerts from Prometheus. It deduplicates, groups, and routes alerts to notification channels (email, Slack, PagerDuty, etc.). It also manages silencing and inhibition of alerts. The alerting component of the Prometheus ecosystem.
|
||||
|
||||
### Key Features
|
||||
- **Alert Routing:** Send to right channels
|
||||
- **Grouping:** Combine similar alerts
|
||||
- **Deduplication:** No duplicate alerts
|
||||
- **Silencing:** Mute alerts temporarily
|
||||
- **Inhibition:** Suppress dependent alerts
|
||||
- **Notifications:** Email, Slack, webhooks, etc.
|
||||
- **Web UI:** Manage alerts visually
|
||||
- **Free & Open Source:** Prometheus project
|
||||
|
||||
## Why Use Alertmanager?
|
||||
|
||||
1. **Prometheus Native:** Designed for Prometheus
|
||||
2. **Smart Routing:** Alerts go where needed
|
||||
3. **Deduplication:** No spam
|
||||
4. **Grouping:** Related alerts together
|
||||
5. **Silencing:** Maintenance mode
|
||||
6. **Multi-Channel:** Email, Slack, etc.
|
||||
|
||||
## Configuration in AI-Homelab
|
||||
|
||||
```
|
||||
/opt/stacks/monitoring/alertmanager/
|
||||
alertmanager.yml # Configuration
|
||||
data/ # Alert state
|
||||
```
|
||||
|
||||
### alertmanager.yml
|
||||
|
||||
```yaml
|
||||
global:
|
||||
resolve_timeout: 5m
|
||||
|
||||
route:
|
||||
group_by: ['alertname']
|
||||
group_wait: 10s
|
||||
group_interval: 10s
|
||||
repeat_interval: 1h
|
||||
receiver: 'discord'
|
||||
|
||||
receivers:
|
||||
- name: 'discord'
|
||||
webhook_configs:
|
||||
- url: 'YOUR_DISCORD_WEBHOOK_URL'
|
||||
send_resolved: true
|
||||
|
||||
- name: 'email'
|
||||
email_configs:
|
||||
- to: 'alerts@yourdomain.com'
|
||||
from: 'alertmanager@yourdomain.com'
|
||||
smarthost: 'smtp.gmail.com:587'
|
||||
auth_username: 'your@gmail.com'
|
||||
auth_password: 'app_password'
|
||||
|
||||
inhibit_rules:
|
||||
- source_match:
|
||||
severity: 'critical'
|
||||
target_match:
|
||||
severity: 'warning'
|
||||
equal: ['alertname', 'instance']
|
||||
```
|
||||
|
||||
## Official Resources
|
||||
|
||||
- **Website:** https://prometheus.io/docs/alerting/latest/alertmanager
|
||||
- **Configuration:** https://prometheus.io/docs/alerting/latest/configuration
|
||||
|
||||
## Docker Configuration
|
||||
|
||||
```yaml
|
||||
alertmanager:
|
||||
image: prom/alertmanager:latest
|
||||
container_name: alertmanager
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- traefik-network
|
||||
ports:
|
||||
- "9093:9093"
|
||||
command:
|
||||
- '--config.file=/etc/alertmanager/alertmanager.yml'
|
||||
- '--storage.path=/alertmanager'
|
||||
volumes:
|
||||
- /opt/stacks/monitoring/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
|
||||
- /opt/stacks/monitoring/alertmanager/data:/alertmanager
|
||||
```
|
||||
|
||||
## Setup
|
||||
|
||||
1. **Configure Prometheus:**
|
||||
Add to prometheus.yml:
|
||||
```yaml
|
||||
alerting:
|
||||
alertmanagers:
|
||||
- static_configs:
|
||||
- targets: ['alertmanager:9093']
|
||||
|
||||
rule_files:
|
||||
- '/etc/prometheus/rules/*.yml'
|
||||
```
|
||||
|
||||
2. **Create Alert Rules:**
|
||||
`/opt/stacks/monitoring/prometheus/rules/alerts.yml`:
|
||||
```yaml
|
||||
groups:
|
||||
- name: example
|
||||
rules:
|
||||
- alert: InstanceDown
|
||||
expr: up == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Instance {{ $labels.instance }} down"
|
||||
description: "{{ $labels.instance }} has been down for more than 5 minutes."
|
||||
|
||||
- alert: HighCPU
|
||||
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "High CPU usage on {{ $labels.instance }}"
|
||||
description: "CPU usage is above 80% for more than 5 minutes."
|
||||
|
||||
- alert: HighMemory
|
||||
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "High memory usage on {{ $labels.instance }}"
|
||||
description: "Memory usage is above 90%."
|
||||
|
||||
- alert: DiskSpaceLow
|
||||
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Low disk space on {{ $labels.instance }}"
|
||||
description: "Disk space is below 10%."
|
||||
```
|
||||
|
||||
3. **Restart Prometheus:**
|
||||
```bash
|
||||
docker restart prometheus
|
||||
```
|
||||
|
||||
4. **Access Alertmanager UI:** `http://SERVER_IP:9093`
|
||||
|
||||
## Summary
|
||||
|
||||
Alertmanager routes alerts from Prometheus offering:
|
||||
- Alert deduplication
|
||||
- Grouping and routing
|
||||
- Multiple notification channels
|
||||
- Silencing and inhibition
|
||||
- Web UI management
|
||||
- Free and open-source
|
||||
|
||||
**Perfect for:**
|
||||
- Prometheus alert handling
|
||||
- Multi-channel notifications
|
||||
- Alert management
|
||||
- Maintenance silencing
|
||||
- Alert grouping
|
||||
|
||||
**Key Points:**
|
||||
- Receives alerts from Prometheus
|
||||
- Routes to notification channels
|
||||
- Deduplicates and groups
|
||||
- Supports silencing
|
||||
- Web UI for management
|
||||
- Configure in alertmanager.yml
|
||||
- Define rules in Prometheus
|
||||
|
||||
**Remember:**
|
||||
- Configure receivers (Discord, Email, etc.)
|
||||
- Create alert rules in Prometheus
|
||||
- Test alerts work
|
||||
- Use silencing for maintenance
|
||||
- Group related alerts
|
||||
- Set appropriate thresholds
|
||||
- Monitor alertmanager itself
|
||||
|
||||
Alertmanager manages your alerts intelligently!
|
||||
Reference in New Issue
Block a user