Last updated: May 2026
Running large language models locally gives you full privacy, zero API costs, and no rate limits. The trade-off is that hardware matters enormously — the wrong GPU can leave a model swapping to disk, turning a 2-second response into a 2-minute wait. This guide walks through every component decision, from GPU VRAM tiers to PSU sizing, and closes with three complete build examples at different budgets.
Why Run LLMs Locally?
Before spending money on hardware, it helps to be clear on the benefits:
- Privacy — your prompts never leave your machine. Critical for code review, legal documents, or medical queries.
- No API costs — run unlimited tokens for the price of electricity.
- No rate limits — generate as fast as your hardware allows, any time of day.
- Offline access — useful when traveling, on a secured network, or when a provider has an outage.
- Fine-tuning control — load custom LoRA adapters, quantized variants, or uncensored community models.
The main trade-off versus cloud APIs is upfront hardware cost and model size limits driven by VRAM.
The Golden Rule: VRAM Is Everything
For GPU inference, the entire model (or as much of it as possible) must fit in GPU VRAM. When a model overflows VRAM it spills to system RAM, and performance drops by 10–50x depending on how much overflows. A 70B model at 4-bit quantization needs roughly 40 GB of VRAM to run fully on-GPU.
Use this table to understand which model sizes fit in which VRAM tier:
| VRAM | 4-bit quantized models that fit fully on-GPU |
|---|---|
| 6 GB | 7B models (tight) |
| 8 GB | 7B comfortably, 13B partially |
| 12 GB | 13B comfortably, 20B partially |
| 16 GB | 13B–20B fully, 30B partially |
| 24 GB | 30B–34B fully, 70B partially (roughly half of its 80 layers) |
| 48 GB | 70B fully, 120B partially |
| 80 GB | 70B comfortably, 120B fully, 180B partially |
| 2× 24 GB | 70B fully across two GPUs (NVLink or PCIe) |
Rule of thumb: Model parameters × 0.5 ≈ GB of VRAM needed at 4-bit quantization. A 70B model × 0.5 = ~35 GB minimum; add 5 GB headroom for KV cache → 40 GB total.
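That rule of thumb is easy to turn into a quick estimator. Here is a minimal sketch using the same assumptions as above (4 bits per weight plus a flat headroom allowance for KV cache and buffers); real Q4_K_M files run closer to 4.8 bits per weight, so treat the result as a floor:

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.0,
                     kv_headroom_gb: float = 5.0) -> float:
    """Rough VRAM needed to run a quantized model fully on-GPU.

    params_billion  -- model size in billions of parameters (70 for a 70B model)
    bits_per_weight -- effective quantization width (4.0 matches the 0.5 GB/B rule)
    kv_headroom_gb  -- allowance for KV cache, activations, and runtime buffers
    """
    weights_gb = params_billion * bits_per_weight / 8  # weights alone
    return weights_gb + kv_headroom_gb

print(f"70B: ~{estimate_vram_gb(70):.0f} GB")                      # ~40 GB -> needs a 48 GB-class setup
print(f"8B:  ~{estimate_vram_gb(8, kv_headroom_gb=2.0):.0f} GB")   # ~6 GB  -> fits an 8 GB card
```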
Component Guide
1. GPU (Most Important)
The GPU is the single most impactful component. Prioritize VRAM capacity over raw shader performance — a slower GPU with more VRAM will outperform a faster GPU that has to offload layers to RAM.
NVIDIA
NVIDIA is the dominant choice for LLM inference. The CUDA ecosystem is mature, and all major tools (llama.cpp, vLLM, Ollama, Transformers) support it first.
| GPU | VRAM | Bandwidth | Best For | Est. Price (USD) |
|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 70B 4-bit comfortably | ~$2,000 |
| RTX 5080 | 16 GB GDDR7 | 960 GB/s | 30B fully, 70B partial | ~$1,000 |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 34B fully, 70B partial | ~$1,400 (used) |
| RTX 4080 Super | 16 GB GDDR6X | 736 GB/s | 30B fully | ~$700 (used) |
| RTX 4070 Ti Super | 16 GB GDDR6X | 672 GB/s | 30B fully | ~$600 (used) |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | 34B fully, 70B partial | ~$400 (used) |
| RTX 3090 Ti | 24 GB GDDR6X | 1,008 GB/s | 34B fully | ~$500 (used) |
Recommendation: The RTX 4090 (used) offers the best performance-per-dollar for single-GPU builds. The RTX 5090 is the best single GPU money can buy, but costs 5× as much as a used 3090 for ~4× the tokens/second.
AMD
AMD support has improved significantly in 2025–2026 via ROCm 6.x and native llama.cpp support. AMD is a good option if you want more VRAM per dollar.
| GPU | VRAM | Bandwidth | Notes |
|---|---|---|---|
| RX 7900 XTX | 24 GB GDDR6 | 960 GB/s | Best AMD single-GPU option |
| RX 7900 XT | 20 GB GDDR6 | 800 GB/s | 20B models fully |
| RX 6800 XT | 16 GB GDDR6 | 512 GB/s | Budget AMD option |
| Radeon PRO W7900 | 48 GB GDDR6 | 864 GB/s | Workstation card; expensive but fits 70B |
AMD caveat: ROCm works best on a Linux host; Windows support is still limited and less mature. If you plan to run on Windows, stick with NVIDIA.
Apple Silicon (Mac Studio / Mac Pro)
If you already own or plan to buy Apple Silicon, it deserves mention. Apple's unified memory architecture means the GPU and CPU share a large, fast pool:
| Chip | Unified Memory | Bandwidth | Notes |
|---|---|---|---|
| M4 Max | 128 GB | 546 GB/s | Fits 70B comfortably |
| M3 Ultra | up to 512 GB | 819 GB/s | Fits 120B+ comfortably |
| M4 Ultra (2025) | 256 GB | 1,000+ GB/s | Near-datacenter performance |
The Mac Studio M4 Max (128 GB) costs ~$3,000 and runs 70B models at 15–20 tokens/second — competitive with a single RTX 4090 at a similar price. Not upgradeable, but a clean and power-efficient option.
2. CPU
For inference workloads, the CPU is secondary to the GPU. Focus on:
- PCIe lanes — if you plan two GPUs, ensure the platform has enough lanes (x16 + x16 or x16 + x8). AMD Threadripper and Intel HEDT platforms provide 64–128 PCIe lanes; consumer AM5/LGA1851 platforms offer 24–28.
- Cores — 8–16 cores is plenty. LLM inference itself happens on GPU; the CPU handles tokenization, batching, and RAM ↔ VRAM transfers.
- Memory channels — more channels = higher system RAM bandwidth when layers offload to RAM.
| CPU | Platform | PCIe Lanes | Good For |
|---|---|---|---|
| AMD Ryzen 9 9950X | AM5 | 28 | Single GPU, good value |
| Intel Core Ultra 9 285K | LGA1851 | 24 | Single GPU |
| AMD Threadripper PRO 7985WX | WRX90 | 128 | Multi-GPU, serious workloads |
| AMD Threadripper 7970X | TRX50 | 48 (Gen 5) | Multi-GPU, community builds |
For a single-GPU build, any modern 8+ core CPU works fine. Only go HEDT if you plan two or more GPUs.
3. System RAM
System RAM matters most when model layers spill out of VRAM. At full GPU fit, 32 GB is sufficient. If you plan to partially offload a large model, more system RAM helps.
| RAM | Use Case |
|---|---|
| 32 GB DDR5 | Single GPU, models fit fully in VRAM |
| 64 GB DDR5 | Single GPU, partial offload of 70B+ models |
| 128 GB DDR5 | Multi-GPU, large model offload, CPU-only fallback |
| 256+ GB DDR5 | CPU-only inference of 70B+ (slow but possible) |
Tip: High-bandwidth RAM matters for CPU-only inference (llama.cpp --n-gpu-layers 0). DDR5-6400 with 4 channels can hit ~200 GB/s, giving usable speeds for 13B models on CPU alone.
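To see why bandwidth dominates CPU-only inference, note that generating each token streams the entire active weight set through RAM once, so tokens/second is roughly bandwidth divided by model size. A rough sketch, with an assumed efficiency factor to cover compute and cache overhead:

```python
def cpu_tokens_per_second(model_size_gb: float,
                          ram_bandwidth_gbs: float,
                          efficiency: float = 0.6) -> float:
    """Upper-bound generation speed when every token streams all weights from RAM."""
    return ram_bandwidth_gbs * efficiency / model_size_gb

# 13B Q4_K_M (~8 GB) on 4-channel DDR5-6400 (~200 GB/s peak)
print(f"{cpu_tokens_per_second(8, 200):.0f} tok/s")    # ~15 tok/s -- usable
# 70B Q4_K_M (~40 GB) on dual-channel DDR5-6000 (~96 GB/s peak)
print(f"{cpu_tokens_per_second(40, 96):.1f} tok/s")    # ~1.4 tok/s -- painful
```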
4. Storage
Model files are large — a 70B model at 4-bit quantization is roughly 40 GB. You will accumulate models quickly.
| Storage | Purpose |
|---|---|
| 2 TB NVMe PCIe 4.0 (OS + active models) | Fast load times (~5–10 s for a 70B from NVMe vs well over a minute from SATA SSD or HDD) |
| 4–8 TB HDD or SATA SSD (model library) | Store inactive models cheaply; copy to NVMe before use |
Load time reality check: Even with fast NVMe, a 40 GB model takes ~8–12 seconds to load into VRAM. Once loaded, it stays in VRAM until you unload it, so load time is a one-time cost per session.
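The load-time math is just model size divided by sequential read speed. A quick sketch (the drive throughput figures are typical values, not guarantees):

```python
def load_time_seconds(model_gb: float, read_gbs: float) -> float:
    """Time to stream model weights from disk, ignoring setup overhead."""
    return model_gb / read_gbs

print(f"{load_time_seconds(40, 6.5):.0f} s")   # PCIe 4.0 NVMe (~6.5 GB/s): ~6 s plus a few seconds of setup
print(f"{load_time_seconds(40, 0.5):.0f} s")   # SATA SSD (~0.5 GB/s): ~80 s -- keep active models on NVMe
```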
5. Motherboard
Key things to check:
- PCIe x16 slot spacing — large GPUs need physical clearance, especially if you ever plan a second card. Measure slot pitch; cards like the RTX 4090 are 3+ slots wide.
- PCIe 5.0 — future-proofs GPU bandwidth, though PCIe 4.0 is not a bottleneck for current single-GPU inference.
- ECC memory support — optional but nice for long-running inference servers.
- M.2 slots — get at least two PCIe 4.0 M.2 slots for your NVMe drives.
6. Power Supply
GPU TDP for cards like the RTX 4090/5090 is 450–600 W. Add CPU, RAM, drives, and cooling fans:
| Build Tier | GPU TDP | Total System Draw | Recommended PSU |
|---|---|---|---|
| Budget (RTX 3090) | 350 W | ~500 W | 750 W 80+ Gold |
| Mid-range (RTX 4090) | 450 W | ~650 W | 850–1000 W 80+ Gold |
| Enthusiast (RTX 5090) | 575 W | ~800 W | 1000–1200 W 80+ Platinum |
| Dual GPU (2× RTX 3090) | 700 W | ~1000 W | 1200–1600 W 80+ Platinum |
Always buy PSU headroom — running a PSU at 90%+ load reduces efficiency and lifespan. An 80+ Gold or Platinum rated unit saves meaningful electricity over months of continuous inference.
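To sanity-check PSU sizing yourself, sum the component draws and size the unit so full load sits around 60–70% of its rating. A rough sketch; the 100 W allowance for motherboard, RAM, drives, and fans and the CPU TDP figures are assumptions, not measurements:

```python
def recommend_psu_watts(gpu_tdp_w: int, cpu_tdp_w: int,
                        platform_w: int = 100,     # motherboard, RAM, drives, fans (assumed)
                        target_load: float = 0.7   # aim for ~70% load at full tilt
                        ) -> int:
    """PSU rating such that full-load draw sits comfortably below capacity."""
    system_draw_w = gpu_tdp_w + cpu_tdp_w + platform_w
    return round(system_draw_w / target_load / 50) * 50  # round to nearest 50 W

print(recommend_psu_watts(450, 120))   # RTX 4090 build -> 950 W (matches the 850-1000 W row)
print(recommend_psu_watts(575, 170))   # RTX 5090 build -> 1200 W
```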
7. Cooling
LLM inference is a sustained, heavy workload — not a brief gaming burst. The GPU runs at near-100% for minutes or hours at a time.
- GPU cooler — the stock cooler on RTX 4090/5090 cards is adequate, but aftermarket coolers (Arctic Accelero, Raijintek Morpheus; check compatibility with your specific card) can reduce temperatures by 10–15°C and cut noise significantly.
- Case airflow — choose a case with strong front-to-back airflow. Fractal Design Torrent, Lian Li O11, and be quiet! Silent Base 802 are popular options.
- Ambient temperature — every 10°C increase in ambient temperature costs ~5% performance due to thermal throttling. Keep your room cool or add a case fan controller.
Three Complete Build Examples
Build 1: Budget Local LLM (~$950)
Target: Run 13B models fully on-GPU, 7B at fast speeds. Good for coding assistants (DeepSeek Coder 6.7B, Qwen 2.5 Coder 14B) and general chat (Llama 3.1 8B).
| Component | Choice | Price |
|---|---|---|
| GPU | RTX 3090 (used) | ~$380 |
| CPU | AMD Ryzen 7 7700 | ~$200 |
| Motherboard | B650 micro-ATX | ~$100 |
| RAM | 32 GB DDR5-5600 | ~$70 |
| Storage | 1 TB NVMe PCIe 4.0 | ~$60 |
| PSU | 750 W 80+ Gold | ~$80 |
| Case + fans | Mid-tower | ~$70 |
| Total | | ~$960 |
Performance: ~18–22 tokens/second on Llama 3.1 8B Q4_K_M. 13B models at ~10 tokens/second. 34B partially offloaded, ~4–5 tokens/second.
Build 2: Mid-Range (~$2,400)
Target: Run 34B models fully on-GPU, 70B partially. Suitable for serious coding, document analysis, and local RAG pipelines (Qwen 2.5 32B, DeepSeek R1 32B, Mistral Small 24B).
| Component | Choice | Price |
|---|---|---|
| GPU | RTX 4090 (used) | ~$1,400 |
| CPU | AMD Ryzen 9 9900X | ~$280 |
| Motherboard | X870 ATX | ~$180 |
| RAM | 64 GB DDR5-6000 | ~$140 |
| Storage | 2 TB NVMe PCIe 4.0 | ~$110 |
| PSU | 1000 W 80+ Gold | ~$120 |
| Case + fans | Full-tower with airflow | ~$100 |
| Total | | ~$2,330 |
Performance: ~8–12 tokens/second on Llama 3.3 70B Q4_K_M with partial GPU offload (see the FAQ below). 34B fully on-GPU at ~55 tokens/second. Sub-second first-token latency on 13B models.
Build 3: Enthusiast (~$3,850)
Target: Run 30B-class models at maximum speed and 70B models with only light offload (a 70B fits fully in 32 GB only at Q2/Q3 quantization). Fast enough for production API serving with multiple concurrent users on 30B-class models; 120B+ models run partially via system RAM offload.
| Component | Choice | Price |
|---|---|---|
| GPU | RTX 5090 (32 GB GDDR7) | ~$2,000 |
| CPU | AMD Ryzen 9 9950X | ~$550 |
| Motherboard | X870E ATX | ~$300 |
| RAM | 128 GB DDR5-6400 | ~$280 |
| Storage | 4 TB NVMe PCIe 5.0 | ~$350 |
| PSU | 1200 W 80+ Platinum | ~$180 |
| Case + fans | Fractal Design Torrent XL | ~$180 |
| Total | | ~$3,840 |
Alternative enthusiast build — 2× RTX 3090 (48 GB total VRAM, ~$800 used) instead of the RTX 5090: a 70B Q4_K_M fits entirely in VRAM, split across the two GPUs via llama.cpp multi-GPU support. Peak speed on smaller models is lower than the 5090's, but nothing has to offload to system RAM. Total build cost is roughly $2,650 with the same supporting components.
Performance (RTX 5090): ~120 tokens/second on Qwen 2.5 32B. Llama 3.3 70B Q4_K_M does not quite fit in 32 GB, so expect roughly 10–15 tokens/second with partial offload (or run a Q2/Q3 quant fully on-GPU). Can serve 3–4 concurrent users on 32B-class models with Ollama's parallel request handling.
Model Recommendations by VRAM Tier
| VRAM | Recommended Models | Notes |
|---|---|---|
| 8 GB | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, DeepSeek Coder 6.7B | Plenty for coding and chat |
| 16 GB | Qwen 2.5 14B, Gemma 3 12B, Phi-4 14B, DeepSeek R1 14B | Strong reasoning in the 12–14B class |
| 24 GB | Qwen 2.5 32B Q3, Mistral Small 24B, DeepSeek R1 32B Q3 | Near-frontier quality locally |
| 32 GB | Qwen 2.5 32B Q5, Llama 3.3 70B Q2, Gemma 3 27B | Excellent all-round quality |
| 48 GB | Llama 3.3 70B Q4_K_M, DeepSeek R1 70B | Top local model quality |
| 80 GB | Mistral Large 2 (123B) Q4, Llama 3.3 70B Q8 | Near-GPT-4 class locally |
Software Setup
Ollama (Easiest)
Ollama is the fastest way to get a model running. It handles model downloads, CUDA detection, and exposes a simple REST API compatible with the OpenAI SDK.
```bash
# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically on first run)
ollama run llama3.3:70b-instruct-q4_K_M

# Run as a server on port 11434
ollama serve
```
Ollama automatically uses all available GPUs and falls back to CPU offload when VRAM is insufficient. Query it from any OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain gradient descent in simple terms."}],
)
print(response.choices[0].message.content)
```
llama.cpp (Most Control)
llama.cpp gives you fine-grained control over GPU layer offloading, quantization format, context length, and batch size.
```bash
# Build with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a GGUF model (example: Llama 3.3 70B Q4_K_M)
huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
  --include "Llama-3.3-70B-Instruct-Q4_K_M.gguf" --local-dir ./models

# Run with 80 GPU layers (adjust based on your VRAM)
./build/bin/llama-server \
  -m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 80 \
  --ctx-size 8192 \
  --host 0.0.0.0 \
  --port 8080
```
Key flags:
| Flag | Purpose |
|---|---|
| --n-gpu-layers N | Number of transformer layers to put on GPU. Set to 999 to put all layers on GPU. |
| --ctx-size N | Context window in tokens. Larger contexts use more VRAM for KV cache. |
| --parallel N | Number of parallel request slots (for serving multiple users). |
| --flash-attn | Enable Flash Attention 2 — reduces VRAM usage and speeds up long contexts. |
| --mlock | Lock model weights in RAM to prevent swapping (requires sufficient RAM). |
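The VRAM cost of --ctx-size is worth seeing in numbers. For a model using grouped-query attention, the FP16 KV cache per token is 2 (keys and values) × layers × KV heads × head dimension × 2 bytes. A sketch using Llama 3.3 70B's architecture (80 layers, 8 KV heads, head dimension 128); adjust the constants for other models:

```python
def kv_cache_gb(ctx_tokens: int,
                n_layers: int = 80,       # Llama 3.3 70B
                n_kv_heads: int = 8,      # grouped-query attention
                head_dim: int = 128,
                bytes_per_elem: int = 2   # FP16 cache
                ) -> float:
    """KV-cache footprint for a given context length."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # keys + values
    return ctx_tokens * per_token_bytes / 1e9

print(f"{kv_cache_gb(8192):.1f} GB")    # ~2.7 GB at the 8k context used above
print(f"{kv_cache_gb(32768):.1f} GB")   # ~10.7 GB at 32k -- why long contexts eat VRAM
```

The scaling is linear in context length, which is why doubling --ctx-size on a card that is already near its VRAM limit can push layers back onto the CPU.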
LM Studio (Desktop GUI)
LM Studio is a graphical desktop app (Windows, macOS, Linux) that lets you browse, download, and run GGUF models without any terminal interaction. Useful for non-technical users or quick experimentation.
Download at lmstudio.ai. It includes a built-in model browser connected to Hugging Face, GPU offload sliders, and an OpenAI-compatible local server.
Power and Noise Considerations
LLM inference machines run GPUs at 100% load for extended periods. Plan accordingly:
- Electricity cost — an RTX 4090 at full load draws ~450 W. At $0.15/kWh, that is ~$1.62/day if running 24 hours. A 6-hour daily inference session costs ~$0.40/day, or ~$12/month (the arithmetic is sketched after this list).
- Noise — high-end GPUs under sustained load spin their fans aggressively. Consider aftermarket GPU coolers (Arctic Accelero, Morpheus) or a case in a separate room if noise is a concern.
- Heat — a 600 W GPU raises room temperature meaningfully in a small office. In warm climates, budget for better room cooling or schedule heavy workloads at night.
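The cost figures above are straight watts × hours × rate arithmetic. A minimal helper, assuming the same $0.15/kWh rate used above; plug in your own GPU draw and local rate:

```python
def daily_cost_usd(watts: float, hours_per_day: float, usd_per_kwh: float = 0.15) -> float:
    """Electricity cost of a component at a given sustained draw."""
    return watts / 1000 * hours_per_day * usd_per_kwh

print(f"${daily_cost_usd(450, 24):.2f}/day")          # RTX 4090 around the clock: ~$1.62/day
print(f"${daily_cost_usd(450, 6) * 30:.0f}/month")    # 6 h/day of inference: ~$12/month
```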
Frequently Asked Questions
Can I run a 70B model with 24 GB VRAM? Yes, partially. With 24 GB VRAM and 64 GB system RAM, llama.cpp will put roughly half of the 80 layers on the GPU and offload the rest to RAM. You will get around 8–12 tokens/second instead of 35–40 tokens/second at full GPU fit — still usable for interactive chat, but slow for batch workloads.
Is AMD a viable alternative to NVIDIA? Yes, on Linux with ROCm 6.x. The RX 7900 XTX gives you 24 GB VRAM for ~$700 and runs llama.cpp almost as fast as an RTX 4090 (slightly lower memory bandwidth hurts a little). Not recommended for Windows users — ROCm support on Windows is still limited, so plan on running Linux.
What is the minimum GPU for useful local LLMs? An RTX 3060 (12 GB VRAM) for ~$200 used is the minimum that makes sense. It runs 13B models at ~7 tokens/second and 7B models at ~15 tokens/second. Slower than you would like for long conversations but perfectly usable for coding assistance.
Do I need NVLink for a dual-GPU setup? No. llama.cpp distributes model layers across multiple GPUs over PCIe without NVLink. NVLink gives the two cards a much faster direct link, which helps tensor-parallel backends, but it does not literally merge two 24 GB cards into a single 48 GB pool — each GPU still holds its own subset of layers. Standard PCIe multi-GPU works fine for layer-split inference.
How does quantization affect quality? Q4_K_M is the sweet spot: it retains ~99% of the full-precision model quality while using roughly 30% of the VRAM of the FP16 original. Q2_K and Q3_K_S reduce size further but noticeably degrade reasoning on complex tasks. For most users: use Q4_K_M or Q5_K_M; only drop to Q3 if VRAM forces you to.
Summary
| Priority | Component | Key Metric |
|---|---|---|
| 1 | GPU | VRAM capacity first, bandwidth second |
| 2 | System RAM | 64 GB if partially offloading 70B+ models |
| 3 | Storage | Fast NVMe for model load times |
| 4 | CPU | Any modern 8-core; PCIe lanes matter for dual-GPU |
| 5 | PSU | Headroom is cheap; undersizing is not |
The most common mistake is buying a GPU with high shader performance but insufficient VRAM. A used RTX 3090 (24 GB, ~$380) beats a new RTX 4070 (12 GB, ~$500) for LLM inference simply because it fits larger models entirely in VRAM.
Start with a single RTX 3090 or 4090 depending on budget, install Ollama, and pull llama3.3:70b-instruct-q4_K_M — you will have a capable local AI assistant running within 30 minutes.