Prometheus metrics and Grafana dashboards for GPU health and state monitoring.

Prometheus Metrics#

The gpu-state-service exposes Prometheus metrics on port 9100:

# GPU state and configuration
gpu_config_info{config="big-chat",model="llama-3.3-70b"} 1
gpu_state_info{state="ready"} 1
gpu_ready 1
gpu_config_switches_total 5
gpu_config_uptime_seconds 3600

# Temperature monitoring (from separate AMD GPU exporter on port 9101)
amd_gpu_temperature_junction_celsius{gpu="0"} 72
amd_gpu_temperature_junction_celsius{gpu="1"} 75
amd_gpu_utilization_percent{gpu="0"} 85
amd_gpu_power_watts{gpu="0"} 180

Prometheus Endpoints#

Port Service Metrics
9100 gpu-state-service Config state, switches, uptime
9101 AMD GPU exporter Temperature, utilization, power
9102 NVIDIA GPU exporter RTX 2080 metrics (if present)

Temperature Alerts#

The service monitors junction temperatures and sends ntfy notifications:

Threshold Level Action
97°C Warning Notification sent
100°C High Notification sent
105°C Critical Notification sent
110°C Emergency Notification sent

Grafana Dashboard#

The “Feynman GPUs” dashboard displays:

  • Current configuration and loaded model
  • GPU state (ready/loading/switching)
  • Junction temperatures with threshold lines
  • GPU utilization (compute and memory)
  • Power draw across all GPUs
  • Configuration timeline showing when switches occurred

Prometheus Scrape Configuration#

Add these targets to your Prometheus configuration:

scrape_configs:
  - job_name: 'feynman-gpu-state'
    static_configs:
      - targets: ['192.168.1.131:9100']
        labels:
          host: feynman

  - job_name: 'feynman-amd-gpu'
    static_configs:
      - targets: ['192.168.1.131:9101']
        labels:
          host: feynman

  - job_name: 'feynman-nvidia-gpu'
    static_configs:
      - targets: ['192.168.1.131:9102']
        labels:
          host: feynman

Example Queries#

Current Configuration#

gpu_config_info

GPU Temperature Over Time#

amd_gpu_temperature_junction_celsius{gpu="0"}
amd_gpu_temperature_junction_celsius{gpu="1"}

Average GPU Utilization (last 5 minutes)#

avg_over_time(amd_gpu_utilization_percent[5m])

Configuration Switch Rate#

rate(gpu_config_switches_total[1h])