
Building a Local LLM Machine in 2026: Complete Hardware Guide

Last updated: May 2026

Running large language models locally gives you full privacy, zero API costs, and no rate limits. The trade-off is that hardware matters enormously — the wrong GPU can leave a model swapping to disk, turning a 2-second response into a 2-minute wait. This guide walks through every component decision, from GPU VRAM tiers to PSU sizing, and closes with three complete build examples at different budgets.


Why Run LLMs Locally?

Before spending money on hardware, it helps to be clear on the benefits:

  • Privacy — your prompts never leave your machine. Critical for code review, legal documents, or medical queries.
  • No API costs — run unlimited tokens for the price of electricity.
  • No rate limits — generate as fast as your hardware allows, any time of day.
  • Offline access — useful when traveling, on a secured network, or when a provider has an outage.
  • Fine-tuning control — load custom LoRA adapters, quantized variants, or uncensored community models.

The main trade-off versus cloud APIs is upfront hardware cost and model size limits driven by VRAM.


The Golden Rule: VRAM Is Everything

For GPU inference, the entire model (or as much of it as possible) must fit in GPU VRAM. When a model overflows VRAM it spills to system RAM, and performance drops by 10–50x depending on how much overflows. A 70B model at 4-bit quantization needs roughly 40 GB of VRAM to run fully on-GPU.

Use this table to understand which model sizes fit in which VRAM tier:

VRAM | 4-bit quantized models that fit fully on-GPU
-----|-----------------------------------------------
6 GB | 7B models (tight)
8 GB | 7B comfortably, 13B partially
12 GB | 13B comfortably, 20B partially
16 GB | 13B–20B fully, 30B partially
24 GB | 30B–34B fully, 70B partially (~40 of 80 layers)
48 GB | 70B fully, 120B partially
80 GB | 70B comfortably, 180B partially
2× 24 GB | 70B fully across two GPUs (NVLink or PCIe)

Rule of thumb: Model parameters × 0.5 ≈ GB of VRAM needed at 4-bit quantization. A 70B model × 0.5 = ~35 GB minimum; add 5 GB headroom for KV cache → 40 GB total.
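
As a quick sanity check, the rule translates into a few lines of Python. The 0.5 factor and the flat 5 GB headroom are the rough approximations from above, not exact per-model figures:

def estimate_vram_gb(params_billions: float, kv_headroom_gb: float = 5.0) -> float:
    # ~0.5 bytes per parameter at 4-bit quantization (rule of thumb above)
    weights_gb = params_billions * 0.5
    # flat headroom for KV cache and runtime buffers (approximation)
    return weights_gb + kv_headroom_gb

for size_b in (7, 13, 34, 70):
    print(f"{size_b}B -> ~{estimate_vram_gb(size_b):.0f} GB VRAM")
# 7B -> ~8 GB, 13B -> ~12 GB, 34B -> ~22 GB, 70B -> ~40 GB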


Component Guide

1. GPU (Most Important)

The GPU is the single most impactful component. Prioritize VRAM capacity over raw shader performance — a slower GPU with more VRAM will outperform a faster GPU that has to offload layers to RAM.
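
Bandwidth still matters once the model fits: at batch size 1, every generated token must read all active weights from VRAM, so memory bandwidth divided by model size gives a rough ceiling on tokens/second. A back-of-the-envelope sketch (an upper bound that ignores KV-cache reads and compute overhead):

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    # Each generated token reads all weights once: ceiling = bandwidth / size.
    return bandwidth_gb_s / model_size_gb

# RTX 3090 (936 GB/s): an 8B model at 4-bit (~5 GB) vs a 34B model (~20 GB)
print(max_tokens_per_sec(936, 5))   # ~187 t/s theoretical ceiling
print(max_tokens_per_sec(936, 20))  # ~47 t/s; real-world throughput is lower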

NVIDIA

NVIDIA is the dominant choice for LLM inference. The CUDA ecosystem is mature, and all major tools (llama.cpp, vLLM, Ollama, Transformers) support it first.

GPU | VRAM | Bandwidth | Best For | Est. Price (USD)
----|------|-----------|----------|-----------------
RTX 5090 | 32 GB GDDR7 | 1,792 GB/s | 32B fully, 70B mostly on-GPU | ~$2,000
RTX 5080 | 16 GB GDDR7 | 960 GB/s | 20B fully, 30B partial | ~$1,000
RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | 34B fully, 70B partial | ~$1,400 (used)
RTX 4080 Super | 16 GB GDDR6X | 736 GB/s | 20B fully, 30B partial | ~$700 (used)
RTX 4070 Ti Super | 16 GB GDDR6X | 672 GB/s | 20B fully, 30B partial | ~$600 (used)
RTX 3090 | 24 GB GDDR6X | 936 GB/s | 34B fully, 70B partial | ~$400 (used)
RTX 3090 Ti | 24 GB GDDR6X | 1,008 GB/s | 34B fully, 70B partial | ~$500 (used)

Recommendation: The RTX 4090 (used) is the strongest single-GPU pick if the budget allows; a used RTX 3090 delivers most of its generation speed for a fraction of the price. The RTX 5090 is the best single GPU money can buy, but it costs roughly 5× as much as a used 3090 for about twice the generation speed, since generation is memory-bandwidth-bound.

AMD

AMD support has improved significantly in 2025–2026 via ROCm 6.x and native llama.cpp support. AMD is a good option if you want more VRAM per dollar.

GPU | VRAM | Bandwidth | Notes
----|------|-----------|------
RX 7900 XTX | 24 GB GDDR6 | 960 GB/s | Best AMD single-GPU option
RX 7900 XT | 20 GB GDDR6 | 800 GB/s | 20B models fully
RX 6800 XT | 16 GB GDDR6 | 512 GB/s | Budget AMD option
Radeon PRO W7900 | 48 GB GDDR6 | 864 GB/s | Workstation card; expensive but fits 70B

AMD caveat: ROCm is Linux-first, and Windows support for the LLM tooling stack remains limited. If you plan to run on Windows, stick with NVIDIA.

Apple Silicon (Mac Studio / Mac Pro)

If you already own or plan to buy Apple Silicon, it deserves mention. Apple's unified memory architecture means the GPU and CPU share a large, fast pool:

Chip | Unified Memory | Bandwidth | Notes
-----|----------------|-----------|------
M4 Max | 128 GB | 546 GB/s | Fits 70B comfortably
M3 Ultra | 192 GB | 800 GB/s | Fits 120B+
M4 Ultra (2025) | 256 GB | 1,000+ GB/s | Near-datacenter performance

The Mac Studio M4 Max (128 GB) costs ~$3,000 and runs 70B models at 15–20 tokens/second — competitive with a single RTX 4090 at a similar price. Not upgradeable, but a clean and power-efficient option.


2. CPU

For inference workloads, the CPU is secondary to the GPU. Focus on:

  • PCIe lanes — if you plan two GPUs, ensure the platform has enough lanes (x16 + x16 or x16 + x8). AMD Threadripper and Intel HEDT platforms provide 64–128 PCIe lanes; consumer AM5/LGA1851 platforms offer 24–28.
  • Cores — 8–16 cores is plenty. LLM inference itself happens on GPU; the CPU handles tokenization, batching, and RAM ↔ VRAM transfers.
  • Memory channels — more channels = higher system RAM bandwidth when layers offload to RAM.

CPU | Platform | PCIe Lanes | Good For
----|----------|------------|----------
AMD Ryzen 9 9950X | AM5 | 28 | Single GPU, good value
Intel Core Ultra 9 285K | LGA1851 | 24 | Single GPU
AMD Threadripper PRO 7985WX | WRX90 | 128 | Multi-GPU, serious workloads
AMD Threadripper 7970X | TRX50 | 88 | Multi-GPU on a smaller budget

For a single-GPU build, any modern 8+ core CPU works fine. Only go HEDT if you plan two or more GPUs.


3. System RAM

System RAM matters most when model layers spill out of VRAM. At full GPU fit, 32 GB is sufficient. If you plan to partially offload a large model, more system RAM helps.

RAM | Use Case
----|---------
32 GB DDR5 | Single GPU, models fit fully in VRAM
64 GB DDR5 | Single GPU, partial offload of 70B+ models
128 GB DDR5 | Multi-GPU, large model offload, CPU-only fallback
256+ GB DDR5 | CPU-only inference of 70B+ (slow but possible)

Tip: High-bandwidth RAM matters for CPU-only inference (llama.cpp --n-gpu-layers 0). DDR5-6400 with 4 channels can hit ~200 GB/s, giving usable speeds for 13B models on CPU alone.
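
That ~200 GB/s figure is just channels × transfer rate × 8 bytes per 64-bit channel; a quick sketch:

def ram_bandwidth_gb_s(channels: int, mega_transfers_s: int) -> float:
    # Each DDR5 channel is 64 bits (8 bytes) wide per transfer.
    return channels * mega_transfers_s * 8 / 1000

print(ram_bandwidth_gb_s(2, 6400))  # dual-channel desktop: ~102 GB/s
print(ram_bandwidth_gb_s(4, 6400))  # quad-channel HEDT: ~205 GB/s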


4. Storage

Model files are large — a 70B model at 4-bit quantization is roughly 40 GB. You will accumulate models quickly.

Storage | Purpose
--------|--------
2 TB NVMe PCIe 4.0 (OS + active models) | Fast load times (~8–12 s for a 40 GB model vs several minutes from HDD)
4–8 TB HDD or SATA SSD (model library) | Store inactive models cheaply; copy to NVMe before use

Load time reality check: Even with fast NVMe, a 40 GB model takes ~8–12 seconds to load into VRAM. Once loaded, it stays in VRAM until you unload it, so load time is a one-time cost per session.
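
Those load times are simply file size divided by sequential read speed; a sketch with typical ballpark drive speeds (assumed, not measured):

def load_seconds(model_gb: float, read_speed_gb_s: float) -> float:
    # One-time cost per session: size / sequential read throughput.
    return model_gb / read_speed_gb_s

print(load_seconds(40, 5.0))   # PCIe 4.0 NVMe (~5 GB/s): ~8 s
print(load_seconds(40, 0.5))   # SATA SSD (~500 MB/s): ~80 s
print(load_seconds(40, 0.2))   # HDD (~200 MB/s): ~200 s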


5. Motherboard

Key things to check:

  • PCIe x16 slot spacing — modern GPUs need physical clearance. Measure slot pitch; cards like the RTX 4090 are 3–3.5 slots wide.
  • PCIe 5.0 — future-proofs GPU bandwidth, though PCIe 4.0 is not a bottleneck for current single-GPU inference.
  • ECC memory support — optional but nice for long-running inference servers.
  • M.2 slots — get at least two PCIe 4.0 M.2 slots for your NVMe drives.

6. Power Supply

GPU TDP for cards like the RTX 4090/5090 is 450–600 W. Add CPU, RAM, drives, and cooling fans:

Build Tier | GPU TDP | Total System Draw | Recommended PSU
-----------|---------|-------------------|----------------
Budget (RTX 3090) | 350 W | ~500 W | 750 W 80+ Gold
Mid-range (RTX 4090) | 450 W | ~650 W | 850–1000 W 80+ Gold
Enthusiast (RTX 5090) | 575 W | ~800 W | 1000–1200 W 80+ Platinum
Dual GPU (2× RTX 3090) | 700 W | ~1000 W | 1200–1600 W 80+ Platinum

Always buy PSU headroom — running a PSU at 90%+ load reduces efficiency and lifespan. An 80+ Gold or Platinum rated unit saves meaningful electricity over months of continuous inference.
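
A simple way to apply this: pick a target load fraction and divide. The 65% target below is an illustrative assumption, not an official sizing rule:

import math

def recommended_psu_watts(system_draw_w: float, target_load: float = 0.65) -> int:
    # Size the PSU so full system draw sits near the target load fraction,
    # rounded up to the next 50 W step.
    return math.ceil(system_draw_w / target_load / 50) * 50

print(recommended_psu_watts(650))   # RTX 4090 build (~650 W draw): 1000 W
print(recommended_psu_watts(800))   # RTX 5090 build (~800 W draw): 1250 W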


7. Cooling

LLM inference is a sustained, heavy workload — not a brief gaming burst. The GPU runs at near-100% for minutes or hours at a time.

  • GPU cooler — the stock cooler on RTX 4090/5090 cards is adequate, but aftermarket options like the Arctic Accelero reduce temperatures by 10–15°C and noise significantly.
  • Case airflow — choose a case with strong front-to-back airflow. Fractal Design Torrent, Lian Li O11, and be quiet! Silent Base 802 are popular options.
  • Ambient temperature — every 10°C increase in ambient temperature costs ~5% performance due to thermal throttling. Keep your room cool or add a case fan controller.

Three Complete Build Examples

Build 1: Budget Local LLM (~$950)

Target: Run 13B models fully on-GPU, 7B at fast speeds. Good for coding assistants (DeepSeek Coder 6.7B, Qwen 2.5 Coder 14B) and general chat (Llama 3.1 8B).

Component | Choice | Price
----------|--------|------
GPU | RTX 3090 (used) | ~$380
CPU | AMD Ryzen 7 7700 | ~$200
Motherboard | B650 micro-ATX | ~$100
RAM | 32 GB DDR5-5600 | ~$70
Storage | 1 TB NVMe PCIe 4.0 | ~$60
PSU | 750 W 80+ Gold | ~$80
Case + fans | Mid-tower | ~$70
Total | | ~$960

Performance: roughly 90–100 tokens/second on Llama 3.1 8B Q4_K_M, 13B models at ~50–60 tokens/second, and 34B partially offloaded at ~4–5 tokens/second.


Build 2: Mid-Range (~$2,400)

Target: Run 34B models fully on-GPU, 70B partially. Suitable for serious coding, document analysis, and local RAG pipelines (Qwen 2.5 32B, DeepSeek R1 32B).

Component | Choice | Price
----------|--------|------
GPU | RTX 4090 (used) | ~$1,400
CPU | AMD Ryzen 9 9900X | ~$280
Motherboard | X870 ATX | ~$180
RAM | 64 GB DDR5-6000 | ~$140
Storage | 2 TB NVMe PCIe 4.0 | ~$110
PSU | 1000 W 80+ Gold | ~$120
Case + fans | Full-tower with airflow | ~$100
Total | | ~$2,330

Performance: 32B–34B models at ~40 tokens/second fully on-GPU; Llama 3.3 70B Q4_K_M with partial GPU offload at ~8–12 tokens/second (see the FAQ below). Sub-second first-token latency on 13B models.


Build 3: Enthusiast (~$3,850)

Target: Run 32B models fully on-GPU at high speed and 70B with most layers on-GPU (a 40 GB 70B Q4 model does not fit entirely in 32 GB of VRAM). Larger models run via partial offload to the 128 GB of system RAM. Fast enough for production API serving with multiple concurrent users.

Component | Choice | Price
----------|--------|------
GPU | RTX 5090 (32 GB GDDR7) | ~$2,000
CPU | AMD Ryzen 9 9950X | ~$550
Motherboard | X870E ATX | ~$300
RAM | 128 GB DDR5-6400 | ~$280
Storage | 4 TB NVMe PCIe 5.0 | ~$350
PSU | 1200 W 80+ Platinum | ~$180
Case + fans | Fractal Design Torrent XL | ~$180
Total | | ~$3,840

Alternative enthusiast build — 2× RTX 3090 (48 GB total VRAM, ~$800 used) instead of the RTX 5090: 70B Q4_K_M fits entirely in VRAM across the two cards via llama.cpp multi-GPU. It is slower than the 5090 on smaller models, but because 70B avoids RAM offload it is the faster and far cheaper option for 70B work. Total build cost ~$2,200.

Performance (RTX 5090): ~60–70 tokens/second on Qwen 2.5 32B Q4_K_M; Llama 3.3 70B Q4_K_M with most layers on-GPU at roughly 10–15 tokens/second. Can serve 3–4 concurrent users with Ollama's parallel request handling.


Model Recommendations by VRAM Tier

VRAM | Recommended Models | Notes
-----|--------------------|------
8 GB | Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, DeepSeek Coder 6.7B | Plenty for coding and chat
16 GB | Qwen 2.5 14B, Gemma 3 12B, Phi-4 14B, DeepSeek R1 14B | Strong reasoning in the 12–14B class
24 GB | Qwen 2.5 32B Q3, Mistral Small 24B, DeepSeek R1 32B Q3 | Near-frontier quality locally
32 GB | Qwen 2.5 32B Q5, Llama 3.3 70B Q2, Gemma 3 27B | Excellent all-round quality
48 GB | Llama 3.3 70B Q4_K_M, DeepSeek R1 70B | Top local model quality
80 GB | Llama 3.3 70B Q8, Mistral Large 2 | Near-GPT-4 class locally

Software Setup

Ollama (Easiest)

Ollama is the fastest way to get a model running. It handles model downloads, CUDA detection, and exposes a simple REST API compatible with the OpenAI SDK.

# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically on first run)
ollama run llama3.3:70b-instruct-q4_K_M

# Run as a server on port 11434
ollama serve

Ollama automatically uses all available GPUs and falls back to CPU offload when VRAM is insufficient. Query it from any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain gradient descent in simple terms."}]
)
print(response.choices[0].message.content)
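
For chat UIs you usually want tokens as they arrive; the same endpoint supports the OpenAI SDK's streaming mode. Reusing the client from the snippet above:

stream = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain gradient descent in simple terms."}],
    stream=True,  # yields chunks as tokens are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)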

llama.cpp (Most Control)

llama.cpp gives you fine-grained control over GPU layer offloading, quantization format, context length, and batch size.

# Build with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Download a GGUF model (example: Llama 3.3 70B Q4_K_M)
huggingface-cli download bartowski/Llama-3.3-70B-Instruct-GGUF \
  --include "Llama-3.3-70B-Instruct-Q4_K_M.gguf" --local-dir ./models

# Run with 80 GPU layers (adjust based on your VRAM)
./build/bin/llama-server \
  -m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 80 \
  --ctx-size 8192 \
  --host 0.0.0.0 \
  --port 8080

Key flags:

Flag | Purpose
-----|--------
--n-gpu-layers N | Number of transformer layers to put on GPU. Set to 999 to put all layers on GPU.
--ctx-size N | Context window in tokens. Larger contexts use more VRAM for KV cache.
--parallel N | Number of parallel request slots (for serving multiple users).
--flash-attn | Enable Flash Attention — reduces VRAM usage and speeds up long contexts.
--mlock | Lock model weights in RAM to prevent swapping (requires sufficient RAM).
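
To see why --ctx-size costs VRAM: at FP16, the KV cache stores two vectors (K and V) per layer per token. A sketch using Llama 3.3 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128):

def kv_cache_gb(ctx_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    # 2 tensors (K and V) * layers * kv_heads * head_dim * bytes, per token
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * ctx_tokens / 1024**3

print(kv_cache_gb(8192))    # ~2.5 GB at the 8K context used above
print(kv_cache_gb(32768))   # ~10 GB at 32K; long contexts need real headroom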

LM Studio (Desktop GUI)

LM Studio is a graphical desktop app (Windows, macOS, Linux) that lets you browse, download, and run GGUF models without any terminal interaction. Useful for non-technical users or quick experimentation.

Download at lmstudio.ai. It includes a built-in model browser connected to Hugging Face, GPU offload sliders, and an OpenAI-compatible local server.


Power and Noise Considerations

LLM inference machines run GPUs at 100% load for extended periods. Plan accordingly:

  • Electricity cost — an RTX 4090 at full load draws ~450 W. At $0.15/kWh, that is ~$1.62/day if running 24 hours. A 6-hour daily inference session costs ~$0.40/day, or ~$12/month (see the sketch after this list).
  • Noise — high-end GPUs under sustained load spin their fans aggressively. Consider aftermarket GPU coolers (Arctic Accelero, Morpheus) or a case in a separate room if noise is a concern.
  • Heat — a 600 W GPU raises room temperature meaningfully in a small office. In warm climates, budget for better room cooling or schedule heavy workloads at night.
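
The electricity figures are a one-line calculation (watts × hours × price per kWh); a sketch you can adapt to your local rate:

def daily_cost_usd(gpu_watts: float, hours_per_day: float,
                   usd_per_kwh: float = 0.15) -> float:
    # kWh consumed per day times price per kWh
    return gpu_watts / 1000 * hours_per_day * usd_per_kwh

print(daily_cost_usd(450, 24))       # ~$1.62/day running around the clock
print(daily_cost_usd(450, 6) * 30)   # ~$12/month at 6 hours/day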

Frequently Asked Questions

Can I run a 70B model with 24 GB VRAM? Yes, partially. With 24 GB VRAM and 64 GB system RAM, llama.cpp will put roughly 40–45 of 80 layers on the GPU and offload the rest to RAM. You will get around 8–12 tokens/second, well below a full-VRAM fit: still usable for interactive chat, but slow for batch workloads.
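
That layer split can be estimated by dividing spare VRAM by the per-layer weight size; a rough sketch that ignores exact buffer sizes (the 3 GB reserve for KV cache and buffers is an assumption):

def layers_on_gpu(vram_gb: float, model_gb: float, n_layers: int,
                  reserve_gb: float = 3.0) -> int:
    # Reserve some VRAM for KV cache/buffers, then fit whole layers.
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb))

# 70B Q4_K_M: ~40 GB of weights across 80 layers, on a 24 GB card
print(layers_on_gpu(24, 40, 80))   # ~42 layers; pass via --n-gpu-layers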

Is AMD a viable alternative to NVIDIA? Yes, on Linux with ROCm 6.x. The RX 7900 XTX gives you 24 GB VRAM for ~$700 and runs llama.cpp nearly as fast as an RTX 4090 (its slightly lower memory bandwidth costs a little throughput). Not recommended for Windows users — ROCm support on Windows remains limited.

What is the minimum GPU for useful local LLMs? An RTX 3060 (12 GB VRAM) for ~$200 used is the minimum that makes sense. It runs 7B models at roughly 30–40 tokens/second and 13B models at ~15–20 tokens/second, which is perfectly usable for coding assistance and chat.

Do I need NVLink for a dual-GPU setup? No. llama.cpp distributes model layers across multiple GPUs over plain PCIe (the --tensor-split flag controls the ratio). NVLink provides a faster GPU-to-GPU link, which mainly benefits tensor-parallel backends and training; it does not merge two cards into a single VRAM pool. For layer-split inference, each GPU holds its own subset of layers, and the traffic between them fits comfortably within PCIe bandwidth.

How does quantization affect quality? Q4_K_M is the sweet spot: it retains ~99% of the full-precision model's quality while using roughly 30% of the VRAM of the FP16 original. Q2_K and Q3_K_S reduce size further but noticeably degrade reasoning on complex tasks. For most users: use Q4_K_M or Q5_K_M; only drop to Q3 if VRAM forces you to.
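
The size side of the trade-off follows from average bits per weight: roughly 16 for FP16 and ~4.8 for Q4_K_M (the bits-per-weight values below are approximate averages for GGUF quant types):

BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7,
                   "Q4_K_M": 4.8, "Q3_K_S": 3.5}

def file_size_gb(params_billions: float, quant: str) -> float:
    # bits -> bytes: parameters * bits per weight / 8
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in ("FP16", "Q4_K_M", "Q3_K_S"):
    print(f"70B at {quant}: ~{file_size_gb(70, quant):.0f} GB")
# FP16: ~140 GB, Q4_K_M: ~42 GB, Q3_K_S: ~31 GB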


Summary

Priority | Component | Key Metric
---------|-----------|-----------
1 | GPU | VRAM capacity first, bandwidth second
2 | System RAM | 64 GB if partially offloading 70B+ models
3 | Storage | Fast NVMe for model load times
4 | CPU | Any modern 8-core; PCIe lanes matter for dual-GPU
5 | PSU | Headroom is cheap; undersizing is not

The most common mistake is buying a GPU with high shader performance but insufficient VRAM. A used RTX 3090 (24 GB, ~$380) beats a new RTX 4070 (12 GB, ~$500) for LLM inference simply because it fits larger models entirely in VRAM.

Start with a single RTX 3090 or 4090 depending on budget, install Ollama, and pull llama3.3:70b-instruct-q4_K_M — you will have a capable local AI assistant running within 30 minutes.

Leonardo Lazzaro

Software engineer and technical writer. 10+ years experience in DevOps, Python, and Linux systems.
