Server Surveillance – Monitoring Proxmox and Raspberry Pi on a Dashboard

How a centralized server monitoring system for Proxmox VMs and Raspberry Pis came to be.

Summary

A customized monitoring dashboard simplifies the management of home server environments by providing a centralized view of various components. It combines data from Proxmox hosts, virtual machines, Raspberry Pis, and specific services onto a single interface. This allows for the efficient monitoring of hardware resources such as GPUs, as well as service availability and power consumption.


This summary was created with AI assistance.

The Starting Point

Setting up a home server isn't a one-time task where you build it, configure it, and you're done. Running it is an ongoing process: checking if everything is running smoothly, identifying bottlenecks early, noticing outages, and categorizing workloads. To do this effectively, you need clean monitoring as a foundation.

The problem: Many existing monitoring solutions are built for significantly larger environments. More servers, more roles, more operational overhead. They are powerful and flexible, but they also bring quite a bit with them: exporters, databases, dashboards, ongoing maintenance, and regular updates. For a private setup, this quickly results in more infrastructure for monitoring the actual infrastructure than the setup itself can reasonably justify.

My goal, therefore, was much simpler: a single webpage that immediately shows me what is happening when I open it. Proxmox host, all VMs, all LXCs, the GPU in the AI VM, the Raspberry Pi in the hallway, the relevant services – everything at a glance, without third-party services and without cloud dependency.


What Proxmox provides – and what it doesn't

Proxmox already comes with its own monitoring interface. For CPU, RAM, and storage per node or guest, this is sufficient in many cases. However, for my specific use case, a few things are missing:

  • GPU Monitoring: The RTX PRO 2000 is passed through to a VM via PCIe passthrough. From the hypervisor's perspective, Proxmox no longer sees this GPU as its own resource. Therefore, temperature, VRAM usage, and running processes cannot be obtained via Proxmox itself, but only directly within the VM – in my case via SSH and nvidia-smi.

  • Raspberry Pi: The Pi is not a Proxmox guest and therefore does not appear there.

  • Service Health: Proxmox also does not show whether Ollama, Faster-Whisper, Piper TTS, or Algroveon-News are currently reachable.

  • Electricity Costs: Connecting Intel RAPL, GPU power consumption, and electricity prices to provide a continuous cost estimate is not part of the standard feature set.

Server-Surveillance fills exactly these gaps – not as a replacement for Proxmox, but as a complementary dashboard that brings the relevant layers together in one place.


The Approach: SSR + HTMX instead of SPA

The fundamental technical decision was similar to Algroveon-News: server-side rendering with HTMX for reactive updates, but without a JavaScript framework.

On the first request, the dashboard renders a complete HTML page via Jinja2. After that, HTMX loads only individual partials at fixed intervals – small HTML fragments that each update only the content of a single card. Node, GPU, VMs, Pi, and services are each their own partials with their own endpoints.
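As a rough illustration of the partial idea, here is a minimal sketch of one card. It uses plain string formatting instead of Jinja2 so that it runs without dependencies; the endpoint path /partials/gpu, the 10-second interval, and both function names are assumptions for this example, not the values from the actual project.

```python
from html import escape


def gpu_partial(temp_c: float, vram_used_mib: int, vram_total_mib: int) -> str:
    """HTML fragment a partial endpoint would return; it only fills one card."""
    return (
        "<h3>GPU</h3>"
        f"<p>Temperature: {temp_c:.0f} °C</p>"
        f"<p>VRAM: {vram_used_mib} / {vram_total_mib} MiB</p>"
    )


def card_shell(card_id: str, endpoint: str, interval_s: int, inner: str = "") -> str:
    """Card container for the initial full page.

    hx-get names the partial endpoint, hx-trigger polls it on a timer,
    and hx-swap="innerHTML" replaces only the content of this card.
    """
    return (
        f'<div id="{escape(card_id)}" class="card" '
        f'hx-get="{escape(endpoint)}" '
        f'hx-trigger="every {int(interval_s)}s" '
        f'hx-swap="innerHTML">{inner}</div>'
    )


if __name__ == "__main__":
    # First request: full card including initial content.
    print(card_shell("gpu-card", "/partials/gpu", 10, gpu_partial(54.0, 2048, 16384)))
```

After the first render, the server only ever sends the output of gpu_partial again; the surrounding page stays untouched.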

The practical advantage is simple: it requires no JavaScript build system, no frontend overhead, and no separate SPA deployment. A Python environment and a running uvicorn process are sufficient.


Data Sources: Three Different Ways

The monitoring combines three fundamentally different methods of data access:

1. Proxmox REST API

The Proxmox API is queried via proxmoxer: node status, VM list, LXC list, as well as guest CPU and RAM usage. The connection uses a dedicated read-only API token (monitor@pve), so it is intentionally limited to read access and has no permission to execute or change anything.
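A minimal sketch of this data path, under a few assumptions: the token name and value are placeholders, fetch_guests and summarize_guest are hypothetical helper names, and the field names (vmid, name, status, cpu, mem, maxmem) are the ones the Proxmox guest list endpoints return. proxmoxer is imported lazily so the pure conversion function can be tried without it.

```python
def fetch_guests(host: str, node: str, token_name: str, token_value: str) -> list:
    """Fetch all VMs and LXCs of one node via proxmoxer (needs network access)."""
    from proxmoxer import ProxmoxAPI  # third-party dependency, imported lazily

    api = ProxmoxAPI(
        host,
        user="monitor@pve",
        token_name=token_name,    # placeholder, not the real token name
        token_value=token_value,  # placeholder secret
        verify_ssl=True,
    )
    return api.nodes(node).qemu.get() + api.nodes(node).lxc.get()


def summarize_guest(guest: dict) -> dict:
    """Reduce one raw guest entry to the values a dashboard card needs."""
    maxmem = guest.get("maxmem") or 0
    mem = guest.get("mem") or 0
    return {
        "vmid": int(guest["vmid"]),
        "name": guest.get("name", ""),
        "running": guest.get("status") == "running",
        # Proxmox reports CPU usage as a fraction, so scale it to percent.
        "cpu_percent": round(100 * (guest.get("cpu") or 0.0), 1),
        "ram_percent": round(100 * mem / maxmem, 1) if maxmem else 0.0,
    }


if __name__ == "__main__":
    sample = {"vmid": "101", "name": "ai-vm", "status": "running",
              "cpu": 0.25, "mem": 2 * 1024**3, "maxmem": 8 * 1024**3}
    print(summarize_guest(sample))
```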

2. SSH to the Proxmox Host and the GPU VM

Whatever the Proxmox API does not provide comes via SSH. This includes, for example, per-core CPU load from /proc/stat, clock frequencies from /proc/cpuinfo, temperatures via sensors – and on the GPU VM, nvidia-smi for the relevant GPU metrics.
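Per-core load cannot be read from a single /proc/stat snapshot, because the file only contains tick counters since boot. The load comes from the difference between two snapshots. The following is a sketch of that calculation, not the project's actual code; both function names are made up for this example, and the text of /proc/stat is assumed to arrive as a string (for example from an SSH command).

```python
def parse_proc_stat(text: str) -> dict:
    """Map 'cpuN' to (idle_ticks, total_ticks) from the content of /proc/stat."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep per-core lines ('cpu0', 'cpu1', ...) and skip the aggregate 'cpu' line.
        if len(parts) < 5 or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Columns: user nice system idle iowait irq softirq steal.
        # Guest time is already contained in user, so it is not added again.
        values = [int(v) for v in parts[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        cores[parts[0]] = (idle, sum(values))
    return cores


def per_core_load(before: str, after: str) -> dict:
    """Busy percentage per core between two /proc/stat snapshots."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    loads = {}
    for core, (idle_b, total_b) in second.items():
        if core not in first:
            continue
        idle_a, total_a = first[core]
        d_total = total_b - total_a
        d_idle = idle_b - idle_a
        loads[core] = round(100 * (d_total - d_idle) / d_total, 1) if d_total > 0 else 0.0
    return loads
```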

SSH access runs via a dedicated monitor user. This user does not have sudo, but only the rights necessary to read the required information, plus the ability to execute nvidia-smi. That was exactly the point: as few privileges as possible.
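For the GPU side, nvidia-smi can emit machine-readable CSV via --query-gpu and --format=csv,noheader,nounits. The sketch below shows how one such output line could be parsed. The selection of fields and the helper names are assumptions; which metrics the dashboard actually queries is not stated here.

```python
QUERY_FIELDS = (
    "temperature.gpu",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "power.draw",
)


def nvidia_smi_command() -> list:
    """Command line to run on the GPU VM, for example through the SSH session."""
    return [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]


def parse_gpu_line(line: str) -> dict:
    """Turn one CSV line such as '54, 37, 2048, 16384, 41.25' into a dict."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} columns, got {len(values)}")

    def to_float(raw: str):
        try:
            return float(raw)
        except ValueError:
            return None  # unsupported metrics are reported as text such as '[N/A]'

    return dict(zip(QUERY_FIELDS, (to_float(v) for v in values)))
```

Because only this one fixed command has to be executable, it fits well with a monitor user that has no sudo rights.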

The Intel RAPL energy counter (/sys/class/powercap/intel-rapl:0/energy_uj) is read twice with a one-second interval. The current power consumption can be derived from the difference. This happens directly via the hardware counter and without additional helper scripts.
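The derivation is simple: energy_uj counts microjoules, so the difference between two readings divided by the elapsed seconds gives watts. A minimal sketch follows, reading the files locally for simplicity. The wraparound handling via max_energy_range_uj and the function names are additions for this example and not necessarily part of the original collector.

```python
import time

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"
RAPL_MAX_RANGE = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"


def power_watts(e_start_uj: int, e_end_uj: int, seconds: float, max_range_uj=None) -> float:
    """Average power between two energy counter readings."""
    delta = e_end_uj - e_start_uj
    if delta < 0 and max_range_uj is not None:
        delta += max_range_uj  # the counter wrapped around between the readings
    if seconds <= 0 or delta < 0:
        raise ValueError("invalid readings")
    return delta / 1_000_000 / seconds  # microjoules -> joules -> watts


def read_int(path: str) -> int:
    with open(path) as handle:
        return int(handle.read().strip())


def sample_cpu_power(interval_s: float = 1.0) -> float:
    """Read the counter twice and return the package power in watts."""
    max_range = read_int(RAPL_MAX_RANGE)
    t0, e0 = time.monotonic(), read_int(RAPL_ENERGY)
    time.sleep(interval_s)
    t1, e1 = time.monotonic(), read_int(RAPL_ENERGY)
    # Use the measured elapsed time instead of trusting the sleep duration.
    return power_watts(e0, e1, t1 - t0, max_range)
```

For example, a rise from 1,000,000 µJ to 16,000,000 µJ within one second corresponds to 15 W.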

3. Pi Agent

The Raspberry Pi is not a Proxmox guest and therefore does not provide a mechanism through which Proxmox could collect its data. The solution was a deliberately small FastAPI service running directly on the Pi.

The agent consists of only about 60 lines of Python and uses psutil as its central dependency. It provides GET /metrics for CPU, RAM, disk, temperature, and uptime, as well as GET /health as a simple liveness check.
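To give an idea of the size of such an agent, here is a hedged sketch. It is not the original 60 lines: the payload field names, the create_app factory, and the cpu_thermal sensor key are assumptions (the sensor key varies by board and kernel). FastAPI and psutil are imported inside the factory so that the payload function can be tested on its own.

```python
import time


def build_metrics(cpu_percent, ram_percent, disk_percent, temp_c, boot_time, now=None) -> dict:
    """Assemble the JSON payload for GET /metrics from already-measured values."""
    now = time.time() if now is None else now
    return {
        "cpu_percent": cpu_percent,
        "ram_percent": ram_percent,
        "disk_percent": disk_percent,
        "temperature_c": temp_c,
        "uptime_s": int(now - boot_time),
    }


def create_app():
    """Create the FastAPI app; third-party imports happen only here."""
    import psutil
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/health")
    def health():
        return {"status": "ok"}  # simple liveness check

    @app.get("/metrics")
    def metrics():
        # 'cpu_thermal' is an assumed sensor name; fall back to None if absent.
        temps = psutil.sensors_temperatures().get("cpu_thermal", [])
        return build_metrics(
            cpu_percent=psutil.cpu_percent(interval=0.5),
            ram_percent=psutil.virtual_memory().percent,
            disk_percent=psutil.disk_usage("/").percent,
            temp_c=temps[0].current if temps else None,
            boot_time=psutil.boot_time(),
        )

    return app
```

Saved as agent.py (a hypothetical file name), this could be started with uvicorn agent:create_app --factory.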

The decision to use a small custom HTTP service instead of a classic Prometheus exporter or node_exporter was a deliberately pragmatic one. For a single Pi in a home network, a complete Prometheus approach would have been unnecessary overhead. At the same time, such a small custom service is easier to understand, easier to adapt, and easier to maintain in this case.


Electricity Costs: Approximation from Measured Values and Estimates

The electricity cost feature was one of the more interesting parts of the project. However, it is important to provide context: the system does not measure the total consumption of the server with physical precision. It combines individual measured values with fixed assumptions and approximations. The result is therefore intentionally a continuous cost estimate – not an exact power measurement at the hardware level.

The total estimate is composed of three sources:

CPU: Intel RAPL provides a continuous measured value for the processor. To this, a configurable fixed value is added for the remaining base consumption of the system, such as the motherboard, RAM, SSDs, fans, and other constant consumers. This part is already no longer a direct measurement, but an approximation based on averages and empirical values.

GPU: nvidia-smi provides the current power consumption directly via the NVIDIA driver. This is particularly interesting for the AI VM, as load spikes become very clearly visible there.

Pi: For the Raspberry Pi, there is no separate hardware energy meter in this setup. Therefore, I use a simple approximation: a base value at idle plus a load-dependent surcharge. This, too, is not an exact measurement, but a deliberately practical estimate for home use.
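Put together, the three sources could be combined roughly like this. All wattage defaults below (base consumption, Pi idle value, Pi load surcharge) are placeholder numbers chosen for the example, not the configured values of the real setup, and the linear Pi model is just one simple way to express "base value plus load-dependent surcharge".

```python
def pi_watts(cpu_percent: float, idle_w: float = 3.0, full_load_extra_w: float = 4.0) -> float:
    """Estimated Pi power: idle base value plus a surcharge that grows with CPU load."""
    load = min(max(cpu_percent, 0.0), 100.0) / 100.0  # clamp to 0..1
    return idle_w + load * full_load_extra_w


def total_watts(rapl_w: float, gpu_w: float, pi_cpu_percent: float, base_w: float = 25.0) -> float:
    """Sum of measured values (RAPL, GPU) and estimates (base load, Pi)."""
    return rapl_w + base_w + gpu_w + pi_watts(pi_cpu_percent)


if __name__ == "__main__":
    # 15 W CPU package, 40 W GPU, Pi at 50 % load, 25 W assumed base consumption
    print(total_watts(15.0, 40.0, 50.0))
```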

The collector calculates kilowatt-hours over 24 hours, 30 days, and since the last server start based on these values. Multiplied by the stored electricity price in €/kWh, this results in a continuous cost estimate. The 24h report also displays peak values for important metrics.
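Turning power samples into kilowatt-hours is a matter of integrating over time. The sketch below uses the trapezoidal rule over (timestamp, watts) pairs, which also copes with unevenly spaced samples. The function names and the example price are assumptions; how the collector actually integrates is not described here.

```python
def energy_kwh(samples) -> float:
    """Integrate (unix_time_s, watts) samples with the trapezoidal rule, result in kWh."""
    ordered = sorted(samples)
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(ordered, ordered[1:]):
        joules += (w0 + w1) / 2 * (t1 - t0)  # average power times elapsed seconds
    return joules / 3_600_000  # 1 kWh = 3.6 million joules


def cost_eur(samples, price_eur_per_kwh: float) -> float:
    """Estimated cost for the period covered by the samples."""
    return energy_kwh(samples) * price_eur_per_kwh


if __name__ == "__main__":
    # Constant 100 W for two hours, at an example price of 0.30 EUR/kWh
    two_hours = [(0, 100.0), (3600, 100.0), (7200, 100.0)]
    print(energy_kwh(two_hours), cost_eur(two_hours, 0.30))
```

Restricting the sample list to the last 24 hours, the last 30 days, or everything since the last server start yields the three reporting periods.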


Collector: Two-Phase Sampling

If the collector were to write all metrics to the database every 10 seconds, that would be 8,640 samples per day, and the data points would pile up quickly. For a private server, this resolution is generally not necessary and puts needless load on SQLite.

The solution is therefore a two-phase collector.

Idle Mode (every 180 seconds): As long as all monitored values stay below their configured trigger thresholds (normal load, no significant GPU activity, no unusual VM activity), a data point is written only every 3 minutes. This is sufficient for long-term trends and the 30-day report.

Active Mode (every 30 seconds): As soon as a defined threshold is exceeded, the collector switches to a tighter interval. This could be, for example, GPU load when an Ollama inference is currently running. After a certain period without a new trigger, the system returns to idle mode.
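The switching logic fits into a few lines. In this sketch, the two intervals come from the description above, while the class name, the metric key, and the 300-second cooldown are assumptions, since the exact "certain period" is not specified.

```python
IDLE_INTERVAL_S = 180
ACTIVE_INTERVAL_S = 30


class TwoPhaseSampler:
    """Decides how long the collector waits until the next database write."""

    def __init__(self, cooldown_s: float = 300.0):
        self.cooldown_s = cooldown_s  # assumed time without trigger before going idle
        self.last_trigger = None

    def interval(self, values: dict, thresholds: dict, now: float) -> int:
        """Return the sampling interval for the current tick."""
        triggered = any(values.get(key, 0.0) > limit for key, limit in thresholds.items())
        if triggered:
            self.last_trigger = now
        active = (
            self.last_trigger is not None
            and now - self.last_trigger < self.cooldown_s
        )
        return ACTIVE_INTERVAL_S if active else IDLE_INTERVAL_S


if __name__ == "__main__":
    sampler = TwoPhaseSampler()
    limits = {"gpu_util": 20.0}  # hypothetical trigger threshold
    print(sampler.interval({"gpu_util": 85.0}, limits, now=0.0))
```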

The result is a sensible compromise: enough data points for load spikes and daily evaluations, but no unnecessarily bloated SQLite database during normal operation.


Alerting

Configurable thresholds for the monitored metrics – such as CPU, RAM, disk, GPU temperature, VRAM, or Pi temperature – are checked with every collector tick. If a limit is exceeded, an alert is recorded in the alert_log table with a timestamp, source, and severity (warning / error). As soon as the value returns to the normal range, the alert is automatically closed.
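The open-and-close behavior can be sketched with sqlite3 from the standard library. Only the table name alert_log and the stored fields (timestamp, source, severity) come from the description above; the concrete column names, the closed_at column, and the helper functions are assumptions for this example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS alert_log (
    id        INTEGER PRIMARY KEY,
    source    TEXT NOT NULL,
    severity  TEXT NOT NULL,
    opened_at REAL NOT NULL,
    closed_at REAL
)
"""


def severity_for(value: float, warn: float, error: float):
    """Map a metric value to 'error', 'warning', or None (normal range)."""
    if value >= error:
        return "error"
    if value >= warn:
        return "warning"
    return None


def check_metric(db, source: str, value: float, warn: float, error: float, now: float):
    """Open, update, or close the alert for one metric on a collector tick."""
    severity = severity_for(value, warn, error)
    row = db.execute(
        "SELECT id, severity FROM alert_log WHERE source = ? AND closed_at IS NULL",
        (source,),
    ).fetchone()
    if severity is None:
        if row:  # value is back to normal: close the open alert
            db.execute("UPDATE alert_log SET closed_at = ? WHERE id = ?", (now, row[0]))
    elif row is None:  # limit exceeded and no open alert yet
        db.execute(
            "INSERT INTO alert_log (source, severity, opened_at) VALUES (?, ?, ?)",
            (source, severity, now),
        )
    elif row[1] != severity:  # already open, but severity changed
        db.execute("UPDATE alert_log SET severity = ? WHERE id = ?", (severity, row[0]))
    db.commit()
    return severity
```

With this shape, a metric that stays above its limit for many ticks still produces only one row, which keeps the log readable.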

This is intentionally not a large alerting system with escalation logic or external notifications. For home use, it was more important to me to make problems visible and traceable rather than immediately building a full-fledged notification system.


Why a Separate Instance Makes Sense

The key design decision was to deliberately keep Server-Surveillance independent. The system knows the Proxmox API, the SSH access, and the local sources for RAPL and GPU metrics.

The practical advantage: If a VM has problems or a service within the monitored infrastructure fails, the dashboard itself remains as unaffected as possible because it runs separately on the host. If the Algroveon-Agent fails, you see it in the service status. If the GPU VM has a problem, that is exactly what becomes visible.

A monitoring system that runs entirely within the infrastructure being monitored – for example, as its own stack in a VM – would not have this advantage in this form.

Sebastian Software Engineer & Wildlife Photographer