Prometheus metrics and Grafana dashboards for GPU health and state monitoring.
Prometheus Metrics#
The gpu-state-service exposes Prometheus metrics on port 9100:
# GPU state and configuration
gpu_config_info{config="big-chat",model="llama-3.3-70b"} 1
gpu_state_info{state="ready"} 1
gpu_ready 1
gpu_config_switches_total 5
gpu_config_uptime_seconds 3600
# Temperature monitoring (from separate AMD GPU exporter on port 9101)
amd_gpu_temperature_junction_celsius{gpu="0"} 72
amd_gpu_temperature_junction_celsius{gpu="1"} 75
amd_gpu_utilization_percent{gpu="0"} 85
amd_gpu_power_watts{gpu="0"} 180Prometheus Endpoints#
| Port | Service | Metrics |
|---|---|---|
| 9100 | gpu-state-service | Config state, switches, uptime |
| 9101 | AMD GPU exporter | Temperature, utilization, power |
| 9102 | NVIDIA GPU exporter | RTX 2080 metrics (if present) |
Temperature Alerts#
The service monitors junction temperatures and sends ntfy notifications:
| Threshold | Level | Action |
|---|---|---|
| 97°C | Warning | Notification sent |
| 100°C | High | Notification sent |
| 105°C | Critical | Notification sent |
| 110°C | Emergency | Notification sent |
Grafana Dashboard#
The “Feynman GPUs” dashboard displays:
- Current configuration and loaded model
- GPU state (ready/loading/switching)
- Junction temperatures with threshold lines
- GPU utilization (compute and memory)
- Power draw across all GPUs
- Configuration timeline showing when switches occurred
Prometheus Scrape Configuration#
Add these targets to your Prometheus configuration:
scrape_configs:
- job_name: 'feynman-gpu-state'
static_configs:
- targets: ['192.168.1.131:9100']
labels:
host: feynman
- job_name: 'feynman-amd-gpu'
static_configs:
- targets: ['192.168.1.131:9101']
labels:
host: feynman
- job_name: 'feynman-nvidia-gpu'
static_configs:
- targets: ['192.168.1.131:9102']
labels:
host: feynmanExample Queries#
Current Configuration#
gpu_config_infoGPU Temperature Over Time#
amd_gpu_temperature_junction_celsius{gpu="0"}
amd_gpu_temperature_junction_celsius{gpu="1"}Average GPU Utilization (last 5 minutes)#
avg_over_time(amd_gpu_utilization_percent[5m])Configuration Switch Rate#
rate(gpu_config_switches_total[1h])