Last updated: May 2026
Running large language models locally gives you full privacy, zero API costs, and no rate limits. The trade-off is that hardware matters enormously — the wrong GPU can leave a model swapping to disk, turning a 2-second response into a 2-minute wait. This guide walks through every component decision, from GPU VRAM tiers to PSU sizing, and closes with three complete build examples at different budgets.
Why Run LLMs Locally?
Before spending money on hardware, it helps to be clear on the benefits:
- Privacy — your prompts never leave your machine. Critical for code review, legal documents, or medical queries.
- No API costs — run unlimited tokens for the price of electricity.
- No rate limits — generate as fast as your hardware allows, any time of day.
- Offline access — useful when traveling, on a secured network, or when a provider has an outage.
- Fine-tuning control — load custom LoRA adapters, quantized variants, or uncensored community models.
The main trade-off versus cloud APIs is upfront hardware cost and model size limits driven by VRAM.
The Golden Rule: VRAM Is Everything
For GPU inference, the entire model (or as much of it as possible) must fit in GPU VRAM. When a model overflows VRAM it spills to system RAM, and performance drops by 10–50x depending on how much overflows. A 70B model at 4-bit quantization needs roughly 40 GB of VRAM to run fully on-GPU.
Use this table to understand which model sizes fit in which VRAM tier:
| VRAM | 4-bit quantized models that fit fully on-GPU |
|---|---|
| 6 GB | 7B models (tight) |
| 8 GB | 7B comfortably, 13B partially |
| 12 GB | 13B comfortably, 20B partially |
| 16 GB | 13B–20B fully, 30B partially |
| 24 GB | 30B–34B fully, 70B partially (roughly half of its 80 layers) |
| 48 GB | 70B fully, 120B partially |
| 80 GB | 70B comfortably, 120B fully, 180B partially |
| 2× 24 GB | 70B fully across two GPUs (NVLink or PCIe) |
Rule of thumb: Model parameters × 0.5 ≈ GB of VRAM needed at 4-bit quantization. A 70B model × 0.5 = ~35 GB minimum; add 5 GB headroom for KV cache → 40 GB total.
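That rule of thumb is easy to turn into a quick estimator. Here is a minimal sketch using the same assumptions as above (4 bits per weight plus a flat headroom allowance for KV cache and buffers); real Q4_K_M files run closer to 4.8 bits per weight, so treat the result as a floor:

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.0,
                     kv_headroom_gb: float = 5.0) -> float:
    """Rough VRAM needed to run a quantized model fully on-GPU.

    params_billion  -- model size in billions of parameters (70 for a 70B model)
    bits_per_weight -- effective quantization width (4.0 matches the 0.5 GB/B rule)
    kv_headroom_gb  -- allowance for KV cache, activations, and runtime buffers
    """
    weights_gb = params_billion * bits_per_weight / 8  # weights alone
    return weights_gb + kv_headroom_gb

print(f"70B: ~{estimate_vram_gb(70):.0f} GB")                      # ~40 GB -> needs a 48 GB-class setup
print(f"8B:  ~{estimate_vram_gb(8, kv_headroom_gb=2.0):.0f} GB")   # ~6 GB  -> fits an 8 GB card
```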
Component Guide
1. GPU (Most Important)
The GPU is the single most impactful component. Prioritize VRAM capacity over raw shader performance — a slower GPU with more VRAM will outperform a faster GPU that has to offload layers to RAM.
NVIDIA
NVIDIA is the dominant choice for LLM inference. The CUDA ecosystem is mature, and all major tools (llama.cpp, vLLM, Ollama, Transformers) support it first.
| GPU | VRAM | Bandwidth | Best For | Est. Price (USD) |
|---|---|---|---|---|
| RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 70B 4-bit comfortably | ~$2,000 |
| RTX 5080 | 16 GB GDDR7 | 960 GB/s | 30B fully, 70B partial | ~$1,000 |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 34B fully, 70B partial | ~$1,400 (used) |
| RTX 4080 Super | 16 GB GDDR6X | 736 GB/s | 30B fully | ~$700 (used) |
| RTX 4070 Ti Super | 16 GB GDDR6X | 672 GB/s | 30B fully | ~$600 (used) |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | 34B fully, 70B partial | ~$400 (used) |
| RTX 3090 Ti | 24 GB GDDR6X | 1,008 GB/s | 34B fully | ~$500 (used) |
Recommendation: The RTX 4090 (used) offers the best performance-per-dollar for single-GPU builds. The RTX 5090 is the best single GPU money can buy, but costs 5× as much as a used 3090 for ~4× the tokens/second.
AMD
AMD support has improved significantly in 2025–2026 via ROCm 6.x and native llama.cpp support. AMD is a good option if you want more VRAM per dollar.
| GPU | VRAM | Bandwidth | Notes |
|---|---|---|---|
| RX 7900 XTX | 24 GB GDDR6 | 960 GB/s | Best AMD single-GPU option |
| RX 7900 XT | 20 GB GDDR6 | 800 GB/s | 20B models fully |
| RX 6800 XT | 16 GB GDDR6 | 512 GB/s | Budget AMD option |
| Radeon PRO W7900 | 48 GB GDDR6 | 864 GB/s | Workstation card; expensive but fits 70B |
AMD caveat: ROCm works best on a Linux host; Windows support is still limited and less mature. If you plan to run on Windows, stick with NVIDIA.
Apple Silicon (Mac Studio / Mac Pro)
If you already own or plan to buy Apple Silicon, it deserves mention. Apple's unified memory architecture means the GPU and CPU share a large, fast pool:
| Chip | Unified Memory | Bandwidth | Notes |
|---|---|---|---|
| M4 Max | 128 GB | 546 GB/s | Fits 70B comfortably |
| M3 Ultra | up to 512 GB | 819 GB/s | Fits 120B+ comfortably |
| M4 Ultra (2025) | 256 GB | 1,000+ GB/s | Near-datacenter performance |
The Mac Studio M4 Max (128 GB) costs ~$3,000 and runs 70B models at 15–20 tokens/second — competitive with a single RTX 4090 at a similar price. Not upgradeable, but a clean and power-efficient option.
2. CPU
For inference workloads, the CPU is secondary to the GPU. Focus on:
- PCIe lanes — if you plan two GPUs, ensure the platform has enough lanes (x16 + x16 or x16 + x8). AMD Threadripper and Intel HEDT platforms provide 64–128 PCIe lanes; consumer AM5/LGA1851 platforms offer 24–28.
- Cores — 8–16 cores is plenty. LLM inference itself happens on GPU; the CPU handles tokenization, batching, and RAM ↔ VRAM transfers.
- Memory channels — more channels = higher system RAM bandwidth when layers offload to RAM.
| CPU | Platform | PCIe Lanes | Good For |
|---|---|---|---|
| AMD Ryzen 9 9950X | AM5 | 28 | Single GPU, good value |
| Intel Core Ultra 9 285K | LGA1851 | 24 | Single GPU |
| AMD Threadripper PRO 7985WX | WRX90 | 128 | Multi-GPU, serious workloads |
| AMD Threadripper 7970X | TRX50 | 48 (Gen 5) | Multi-GPU, community builds |
For a single-GPU build, any modern 8+ core CPU works fine. Only go HEDT if you plan two or more GPUs.
3. System RAM
System RAM matters most when model layers spill out of VRAM. At full GPU fit, 32 GB is sufficient. If you plan to partially offload a large model, more system RAM helps.
| RAM | Use Case |
|---|---|
| 32 GB DDR5 | Single GPU, models fit fully in VRAM |
| 64 GB DDR5 | Single GPU, partial offload of 70B+ models |
| 128 GB DDR5 | Multi-GPU, large model offload, CPU-only fallback |
| 256+ GB DDR5 | CPU-only inference of 70B+ (slow but possible) |
Tip: High-bandwidth RAM matters for CPU-only inference (llama.cpp --n-gpu-layers 0). DDR5-6400 with 4 channels can hit ~200 GB/s, giving usable speeds for 13B models on CPU alone.
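To see why bandwidth dominates CPU-only inference, note that generating each token streams the entire active weight set through RAM once, so tokens/second is roughly bandwidth divided by model size. A rough sketch, with an assumed efficiency factor to cover compute and cache overhead:

```python
def cpu_tokens_per_second(model_size_gb: float,
                          ram_bandwidth_gbs: float,
                          efficiency: float = 0.6) -> float:
    """Upper-bound generation speed when every token streams all weights from RAM."""
    return ram_bandwidth_gbs * efficiency / model_size_gb

# 13B Q4_K_M (~8 GB) on 4-channel DDR5-6400 (~200 GB/s peak)
print(f"{cpu_tokens_per_second(8, 200):.0f} tok/s")    # ~15 tok/s -- usable
# 70B Q4_K_M (~40 GB) on dual-channel DDR5-6000 (~96 GB/s peak)
print(f"{cpu_tokens_per_second(40, 96):.1f} tok/s")    # ~1.4 tok/s -- painful
```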
4. Storage
Model files are large — a 70B model at 4-bit quantization is roughly 40 GB. You will accumulate models quickly.
| Storage | Purpose |
|---|---|
| 2 TB NVMe PCIe 4.0 (OS + active models) | Fast load times (~5–10 s for a 70B from NVMe vs well over a minute from SATA SSD or HDD) |
| 4–8 TB HDD or SATA SSD (model library) | Store inactive models cheaply; copy to NVMe before use |
Load time reality check: Even with fast NVMe, a 40 GB model takes ~8–12 seconds to load into VRAM. Once loaded, it stays in VRAM until you unload it, so load time is a one-time cost per session.
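The load-time math is just model size divided by sequential read speed. A quick sketch (the drive throughput figures are typical values, not guarantees):

```python
def load_time_seconds(model_gb: float, read_gbs: float) -> float:
    """Time to stream model weights from disk, ignoring setup overhead."""
    return model_gb / read_gbs

print(f"{load_time_seconds(40, 6.5):.0f} s")   # PCIe 4.0 NVMe (~6.5 GB/s): ~6 s plus a few seconds of setup
print(f"{load_time_seconds(40, 0.5):.0f} s")   # SATA SSD (~0.5 GB/s): ~80 s -- keep active models on NVMe
```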
5. Motherboard
Key things to check:
- PCIe x16 slot spacing — large GPUs need physical clearance, especially if you ever plan a second card. Measure slot pitch; cards like the RTX 4090 are 3+ slots wide.
- PCIe 5.0 — future-proofs GPU bandwidth, though PCIe 4.0 is not a bottleneck for current single-GPU inference.
- ECC memory support — optional but nice for long-running inference servers.
- M.2 slots — get at least two PCIe 4.0 M.2 slots for your NVMe drives.
6. Power Supply
GPU TDP for cards like the RTX 4090/5090 is 450–600 W. Add CPU, RAM, drives, and cooling fans:
| Build Tier | GPU TDP | Total System Draw | Recommended PSU |
|---|---|---|---|
| Budget (RTX 3090) | 350 W | ~500 W | 750 W 80+ Gold |
| Mid-range (RTX 4090) | 450 W | ~650 W | 850–1000 W 80+ Gold |
| Enthusiast (RTX 5090) | 575 W | ~800 W | 1000–1200 W 80+ Platinum |
| Dual GPU (2× RTX 3090) | 700 W | ~1000 W | 1200–1600 W 80+ Platinum |
Always buy PSU headroom — running a PSU at 90%+ load reduces efficiency and lifespan. An 80+ Gold or Platinum rated unit saves meaningful electricity over months of continuous inference.
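To sanity-check PSU sizing yourself, sum the component draws and size the unit so full load sits around 60–70% of its rating. A rough sketch; the 100 W allowance for motherboard, RAM, drives, and fans and the CPU TDP figures are assumptions, not measurements:

```python
def recommend_psu_watts(gpu_tdp_w: int, cpu_tdp_w: int,
                        platform_w: int = 100,     # motherboard, RAM, drives, fans (assumed)
                        target_load: float = 0.7   # aim for ~70% load at full tilt
                        ) -> int:
    """PSU rating such that full-load draw sits comfortably below capacity."""
    system_draw_w = gpu_tdp_w + cpu_tdp_w + platform_w
    return round(system_draw_w / target_load / 50) * 50  # round to nearest 50 W

print(recommend_psu_watts(450, 120))   # RTX 4090 build -> 950 W (matches the 850-1000 W row)
print(recommend_psu_watts(575, 170))   # RTX 5090 build -> 1200 W
```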
7. Cooling
LLM inference is a sustained, heavy workload — not a brief gaming burst. The GPU runs at near-100% for minutes or hours at a time.
- GPU cooler — the stock cooler on RTX 4090/5090 cards is adequate, but aftermarket coolers (Arctic Accelero, Raijintek Morpheus; check compatibility with your specific card) can reduce temperatures by 10–15°C and cut noise significantly.
- Case airflow — choose a case with strong front-to-back airflow. Fractal Design Torrent, Lian Li O11, and be quiet! Silent Base 802 are popular options.
- Ambient temperature — every 10°C increase in ambient temperature costs ~5% performance due to thermal throttling. Keep your room cool or add a case fan controller.
Three Complete Build Examples
Build 1: Budget Local LLM (~$950)
Target: Run 13B models fully on-GPU, 7B at fast speeds. Good for coding assistants (DeepSeek Coder 6.7B, Qwen 2.5 Coder 14B) and general chat (Llama 3.1 8B).
| Component | Choice | Price |
|---|---|---|
| GPU | RTX 3090 (used) | ~$380 |
| CPU | AMD Ryzen 7 7700 | ~$200 |
| Motherboard | B650 micro-ATX | ~$100 |
| RAM | 32 GB DDR5-5600 | ~$70 |
| Storage | 1 TB NVMe PCIe 4.0 | ~$60 |
| PSU | 750 W 80+ Gold | ~$80 |
| Case + fans | Mid-tower | ~$70 |
| Total | | ~$960 |
Performance: ~18–22 tokens/second on Llama 3.1 8B Q4_K_M. 13B models at ~10 tokens/second. 34B partially offloaded, ~4–5 tokens/second.
Build 2: Mid-Range (~$2,400)
Target: Run 34B models fully on-GPU, 70B partially. Suitable for serious coding, document analysis, and local RAG pipelines (Qwen 2.5 32B, DeepSeek R1 32B, Mistral Small 24B).
| Component | Choice | Price |
|---|---|---|
| GPU | RTX 4090 (used) | ~$1,400 |
| CPU | AMD Ryzen 9 9900X | ~$280 |
| Motherboard | X870 ATX | ~$180 |
| RAM | 64 GB DDR5-6000 | ~$140 |
| Storage | 2 TB NVMe PCIe 4.0 | ~$110 |
| PSU | 1000 W 80+ Gold | ~$120 |
| Case + fans | Full-tower with airflow | ~$100 |
| Total | | ~$2,330 |
Performance: ~8–12 tokens/second on Llama 3.3 70B Q4_K_M with partial GPU offload (see the FAQ below). 34B fully on-GPU at ~55 tokens/second. Sub-second first-token latency on 13B models.
Build 3: Enthusiast (~$3,850)
Target: Run 30B-class models at maximum speed and 70B models with only light offload (a 70B fits fully in 32 GB only at Q2/Q3 quantization). Fast enough for production API serving with multiple concurrent users on 30B-class models; 120B+ models run partially via system RAM offload.
| Component | Choice | Price |
|---|---|---|
| GPU | RTX 5090 (32 GB GDDR7) | ~$2,000 |
| CPU | AMD Ryzen 9 9950X | ~$550 |
| Motherboard | X870E ATX | ~$300 |
| RAM | 128 GB DDR5-6400 | ~$280 |
| Storage | 4 TB NVMe PCIe 5.0 | ~$350 |
| PSU | 1200 W 80+ Platinum | ~$180 |
| Case + fans | Fractal Design Torrent XL | ~$180 |
| Total | | ~$3,840 |
Alternative enthusiast build — 2× RTX 3090 (48 GB total VRAM, ~$800 used) instead of the RTX 5090: a 70B Q4_K_M fits entirely in VRAM, split across the two GPUs via llama.cpp multi-GPU support. Peak speed on smaller models is lower than the 5090's, but nothing has to offload to system RAM. Total build cost is roughly $2,650 with the same supporting components.
Performance (RTX 5090): ~120 tokens/second on Qwen 2.5 32B. Llama 3.3 70B Q4_K_M does not quite fit in 32 GB, so expect roughly 10–15 tokens/second with partial offload (or run a Q2/Q3 quant fully on-GPU). Can serve 3–4 concurrent users on 32B-class models with Ollama's parallel request handling.
Model Recommendations by VRAM Tier
| VRAM | Recommended Models | Notes |
|---|---|---|
| 8 GB | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, DeepSeek Coder 6.7B | Plenty for coding and chat |
| 16 GB | Qwen 2.5 14B, Gemma 3 12B, Phi-4 14B, DeepSeek R1 14B | Strong reasoning in the 12–14B class |
| 24 GB | Qwen 2.5 32B Q3, Mistral Small 24B, DeepSeek R1 32B Q3 | Near-frontier quality locally |
| 32 GB | Qwen 2.5 32B Q5, Llama 3.3 70B Q2, Gemma 3 27B | Excellent all-round quality |
| 48 GB | Llama 3.3 70B Q4_K_M, DeepSeek R1 70B | Top local model quality |
| 80 GB | Mistral Large 2 (123B) Q4, Llama 3.3 70B Q8 | Near-GPT-4 class locally |
Software Setup
Ollama (Easiest)
Ollama is the fastest way to get a model running. It handles model downloads, CUDA detection, and exposes a simple REST API compatible with the OpenAI SDK.
```bash
# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically on first run)
ollama run llama3.3:70b-instruct-q4_K_M

# Run as a server on port 11434
ollama serve
```
Ollama automatically uses all available GPUs and falls back to CPU offload when VRAM is insufficient. Query it from any OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain gradient descent in simple terms."}],
)
print(response.choices[0].message.content)
```
llama.cpp (Most Control)
llama.cpp gives you fine-grained control over GPU layer offloading, quantization format, context length, and batch size.
```bash
# Build with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a GGUF model (example: Llama 3.3 70B Q4_K_M)
huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
  --include "Llama-3.3-70B-Instruct-Q4_K_M.gguf" --local-dir ./models

# Run with 80 GPU layers (adjust based on your VRAM)
./build/bin/llama-server \
  -m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 80 \
  --ctx-size 8192 \
  --host 0.0.0.0 \
  --port 8080
```
Key flags:
| Flag | Purpose |
|---|---|
| --n-gpu-layers N | Number of transformer layers to put on GPU. Set to 999 to put all layers on GPU. |
| --ctx-size N | Context window in tokens. Larger contexts use more VRAM for KV cache. |
| --parallel N | Number of parallel request slots (for serving multiple users). |
| --flash-attn | Enable Flash Attention 2 — reduces VRAM usage and speeds up long contexts. |
| --mlock | Lock model weights in RAM to prevent swapping (requires sufficient RAM). |
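The VRAM cost of --ctx-size is worth seeing in numbers. For a model using grouped-query attention, the FP16 KV cache per token is 2 (keys and values) × layers × KV heads × head dimension × 2 bytes. A sketch using Llama 3.3 70B's architecture (80 layers, 8 KV heads, head dimension 128); adjust the constants for other models:

```python
def kv_cache_gb(ctx_tokens: int,
                n_layers: int = 80,       # Llama 3.3 70B
                n_kv_heads: int = 8,      # grouped-query attention
                head_dim: int = 128,
                bytes_per_elem: int = 2   # FP16 cache
                ) -> float:
    """KV-cache footprint for a given context length."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # keys + values
    return ctx_tokens * per_token_bytes / 1e9

print(f"{kv_cache_gb(8192):.1f} GB")    # ~2.7 GB at the 8k context used above
print(f"{kv_cache_gb(32768):.1f} GB")   # ~10.7 GB at 32k -- why long contexts eat VRAM
```

The scaling is linear in context length, which is why doubling --ctx-size on a card that is already near its VRAM limit can push layers back onto the CPU.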
LM Studio (Desktop GUI)
LM Studio is a graphical desktop app (Windows, macOS, Linux) that lets you browse, download, and run GGUF models without any terminal interaction. Useful for non-technical users or quick experimentation.
Download at lmstudio.ai. It includes a built-in model browser connected to Hugging Face, GPU offload sliders, and an OpenAI-compatible local server.
Power and Noise Considerations
LLM inference machines run GPUs at 100% load for extended periods. Plan accordingly:
- Electricity cost — an RTX 4090 at full load draws ~450 W. At $0.15/kWh, that is ~$1.62/day if running 24 hours. A 6-hour daily inference session costs ~$0.40/day, or ~$12/month (the arithmetic is sketched after this list).
- Noise — high-end GPUs under sustained load spin their fans aggressively. Consider aftermarket GPU coolers (Arctic Accelero, Morpheus) or a case in a separate room if noise is a concern.
- Heat — a 600 W GPU raises room temperature meaningfully in a small office. In warm climates, budget for better room cooling or schedule heavy workloads at night.
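The cost figures above are straight watts × hours × rate arithmetic. A minimal helper, assuming the same $0.15/kWh rate used above; plug in your own GPU draw and local rate:

```python
def daily_cost_usd(watts: float, hours_per_day: float, usd_per_kwh: float = 0.15) -> float:
    """Electricity cost of a component at a given sustained draw."""
    return watts / 1000 * hours_per_day * usd_per_kwh

print(f"${daily_cost_usd(450, 24):.2f}/day")          # RTX 4090 around the clock: ~$1.62/day
print(f"${daily_cost_usd(450, 6) * 30:.0f}/month")    # 6 h/day of inference: ~$12/month
```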
Frequently Asked Questions
Can I run a 70B model with 24 GB VRAM? Yes, partially. With 24 GB VRAM and 64 GB system RAM, llama.cpp will put roughly half of the 80 layers on the GPU and offload the rest to RAM. You will get around 8–12 tokens/second instead of 35–40 tokens/second at full GPU fit — still usable for interactive chat, but slow for batch workloads.
Is AMD a viable alternative to NVIDIA? Yes, on Linux with ROCm 6.x. The RX 7900 XTX gives you 24 GB VRAM for ~$700 and runs llama.cpp almost as fast as an RTX 4090 (slightly lower memory bandwidth hurts a little). Not recommended for Windows users — ROCm support on Windows is still limited, so plan on running Linux.
What is the minimum GPU for useful local LLMs? An RTX 3060 (12 GB VRAM) for ~$200 used is the minimum that makes sense. It runs 13B models at ~7 tokens/second and 7B models at ~15 tokens/second. Slower than you would like for long conversations but perfectly usable for coding assistance.
Do I need NVLink for a dual-GPU setup? No. llama.cpp distributes model layers across multiple GPUs over PCIe without NVLink. NVLink gives the two cards a much faster direct link, which helps tensor-parallel backends, but it does not literally merge two 24 GB cards into a single 48 GB pool — each GPU still holds its own subset of layers. Standard PCIe multi-GPU works fine for layer-split inference.
How does quantization affect quality? Q4_K_M is the sweet spot: it retains ~99% of the full-precision model quality while using roughly 30% of the VRAM of the FP16 original. Q2_K and Q3_K_S reduce size further but noticeably degrade reasoning on complex tasks. For most users: use Q4_K_M or Q5_K_M; only drop to Q3 if VRAM forces you to.
Summary
| Priority | Component | Key Metric |
|---|---|---|
| 1 | GPU | VRAM capacity first, bandwidth second |
| 2 | System RAM | 64 GB if partially offloading 70B+ models |
| 3 | Storage | Fast NVMe for model load times |
| 4 | CPU | Any modern 8-core; PCIe lanes matter for dual-GPU |
| 5 | PSU | Headroom is cheap; undersizing is not |
The most common mistake is buying a GPU with high shader performance but insufficient VRAM. A used RTX 3090 (24 GB, ~$380) beats a new RTX 4070 (12 GB, ~$500) for LLM inference simply because it fits larger models entirely in VRAM.
Start with a single RTX 3090 or 4090 depending on budget, install Ollama, and pull llama3.3:70b-instruct-q4_K_M — you will have a capable local AI assistant running within 30 minutes.