Algroveon-AI – Running your own local LLM: Hardware and Setup

Why a Local LLM Makes Sense

At the moment, everyone is talking about AI assistants. OpenAI, Anthropic, and many others. These systems are becoming increasingly powerful, the operation increasingly simple, and that is exactly why they are becoming interesting to more and more people. However, almost all of these offerings have one thing in common: they run on third-party infrastructure. When you use them, you hand over your data—prompts, documents, emails, notes, conversation histories. Sometimes more obvious, sometimes less visible, but never fully under your own control.

That was exactly my starting point. I didn't want an AI playground, but an assistant that is truly useful in my daily life. Not just answering general questions, but processing emails, reading calendar entries, analyzing files, maintaining context, and supporting me in real workflows. And that is precisely the point where it gets sensitive: the more useful such a system becomes, the deeper it looks into your own digital life.

Because of this, my fundamental question became clear very quickly. I wanted to use the advantages of modern generative AI, but not at the cost of letting the central content of my digital daily life run permanently through external platforms. An assistant that potentially knows a lot about me should run on infrastructure that I control myself. Of course, this does not guarantee absolute security. Even a private home server can become a risk if poorly secured. Therefore, "local" does not automatically mean "secure," but rather: more control over architecture, access, data flows, and protective measures.

The obvious answer to this is a local LLM. In my case, that has since become Gemma-4. So, not a model from the cloud, but a model running on my own hardware in my own network.

Server or Laptop/Workstation?

The first obvious approach was a local model on a laptop. Technically, this can work very well. Especially current MacBooks are remarkably well-suited for local LLMs in many respects because the CPU, GPU, and RAM work together via a shared memory architecture with high bandwidth. This is interesting for inference workloads because the model is not limited to a classic dedicated GPU VRAM, but can use the collectively available memory. For many users, such a laptop can therefore be completely sufficient.

I therefore decided on a home server. For my use case, it wasn't decisive whether a laptop could fundamentally run local LLMs well, but rather that the system should be permanently available, independent of my primary workstation, and designed as its own infrastructure for such tasks.

GPU with a maximum of 70 Watts

The most critical element was the GPU. In practice, LLM inference runs primarily via the GPU—model size, speed, and to some extent the usable quality depend directly on the available VRAM. At the same time, the hardware should run continuously, consume as little power as possible, and not be noisy during home operation.

I chose the NVIDIA RTX PRO 2000 Blackwell: 16 GB GDDR7 VRAM, 70 Watt TDP, maximum 0.6 Sone noise level. Based purely on price-performance, an RTX 3090 with 24 GB VRAM would have been the stronger choice for local LLMs on paper. However, for my specific purpose, it was not a sensible option. The deciding factors were the significantly higher power consumption under load and the noise level, which plays a much larger role in continuous home operation than pure maximum performance. The RTX PRO 2000 is thus not the cheapest or the fastest solution, but the more coherent decision for this setup.

With less than 16 GB of VRAM, larger models quickly become problematic. You then have to quantize more heavily, offload, or compromise in other areas more than you would like. This can noticeably degrade output quality depending on the model and use case. In the 16 GB range, local inference for larger models finally becomes seriously interesting, even though it remains clear: while local setups can certainly keep up with current cloud models in individual use cases, they cannot yet generally match them in breadth, robustness, and overall performance.

Why local LLMs hit limits

This is not surprising. Behind services like ChatGPT or Claude stand large, highly specialized data centers with massive GPU deployment. Training, optimization, and inference run there on a technical basis that is neither economically nor physically comparable to a single home server. A local LLM must work with significantly less computing power, less VRAM, and overall tighter resources. Its strength today, therefore, lies not in beating online LLMs in every discipline, but in providing sufficient quality for clearly defined use cases with significantly more control over one's own data.

Proxmox: Virtualization as a Core Principle

Theoretically, Ollama could run directly on the host. That would be easier. But there is a good reason against it: a system that is intended to serve simultaneously as an AI inference server, mail server, document archive, Git server, and home automation hub needs clean separation between services.

Proxmox as a hypervisor solves this: each service runs in its own VM or container. A storage problem in the document archive does not affect the AI service. A container restart for the mail server does not take down the ongoing inference. Snapshots reduce the risk during changes and updates, though they naturally neither make them consequence-free nor completely risk-free.

The decisive mechanism for the AI VM is PCIe passthrough: the GPU is detached from the Proxmox host context and passed directly to a specific VM. The VM then sees the GPU almost as if it were directly installed—without the typical virtualization overhead that one does not want at this stage.

This is technically more demanding than a direct installation, but the subsequent operation is more cleanly separated and significantly more controllable in practice.

From Multi-Model Setup to Single-GPU Strategy

Initially, the system was built so that a fast 4b model resided permanently in the VRAM and was immediately available for simple queries. Depending on usage, this model would be unloaded to instead load a 9b model as a standard or a 27b model for more demanding tasks into the VRAM. All three models did not fit into the available memory at the same time, and the 27b model in particular was only practical in quantized form. The profile system of Algroveon-Agent was based exactly on this logic: not multiple models in parallel in the VRAM, but targeted switching depending on the task.

With the switch to Gemma-4 as the local LLM, this has changed. Gemma-4 is a Mixture-of-Experts model: it has 26 billion parameters, but only activates a portion of them at each step. Compared to a dense model of a similar size, this provides more efficient inference without the output quality dropping noticeably in daily use.

The problem: In my configuration, the model initially did not fit completely into 16 GB of VRAM. Three measures solved this:

Measure 1 – Removing the Vision Projector. Gemma-4 is multimodal and can also process images. For this, an additional vision component is required, which occupies VRAM. However, in my setup, Algroveon-Agent does not process images directly; PDFs are extracted as text. This part was therefore superfluous. A customized modelfile without this component consistently saves the corresponding memory.

Measure 2 – Switching Whisper to quantization. In my configuration, alongside the actual LLM, Faster-Whisper was also running for speech processing. This part also occupied VRAM and therefore had to be included in the overall calculation. Originally, Faster-Whisper ran with float16 weights. Switching to int8_float16 significantly reduced the VRAM requirement without any noticeable loss of quality in practical use. Latency increases only slightly, which in my case hardly matters because the TTS output determines the larger portion of the time anyway.

Measure 3 – Reducing the Context Window. The KV cache, which holds the conversation context in VRAM, grows with the context length. At 8192 tokens, my configuration was missing exactly the small remainder that would have been necessary for full GPU allocation—meaning part of the model layers ran on the CPU. At 4096 tokens, all layers fit completely into the VRAM. In practice, this was acceptable for me because Algroveon-Agent uses its own memory system that can store conversation content beyond individual sessions. Therefore, the pure context window size is less decisive in such a setup than with a model without an external memory system.

What is running now – and what that means

The server is today the infrastructure core for everything running under algroveon for me:

Algroveon-Agent queries the local Ollama API for every chat response, every tool call, and every summary.
Embeddings for the memory system also run locally (nomic-embed-text, on the CPU side).
Speech recognition and speech output run on the same server for the dashboard.
Image generation is available via ComfyUI, directly on the same GPU.

Not a single one of these steps has to run via external AI services. That is exactly the real point.

The objection often heard against local LLMs was justified until recently: local models were usually significantly weaker than cloud models in daily use. As a generalization, that is no longer true today. A quantized Gemma-4 on current hardware can already deliver astonishingly usable results for many tasks in daily use or as a personal assistant.

But that doesn't mean that local LLMs are automatically a "set it and forget it" solution today. With current tools, you can get a local model running on a Mac or another suitable system relatively quickly, even without a deep technical background. However, as soon as you go beyond mere installation, the limits quickly become apparent. Without prior knowledge, it is hard to estimate which models are truly suitable for which tasks, where quantization makes sense, how to evaluate memory limits, or which compromises between speed, quality, and resource consumption are sensible.

It becomes even more demanding when you want more than just a started model, but a real assistant. Such a setup is conceptually, technically, and organizationally much more complex than it appears at first glance. In the end, you almost always pay a price for convenience. In the case of the cloud, that price is very often your own data.