Ollama + Open WebUI on Linux: Run a Local ChatGPT Server (2026)

Running a local AI chat server gives you privacy, zero API costs, and full control over which models you use. This guide walks through installing Ollama and Open WebUI on Linux, pulling models, enabling GPU acceleration, and exposing the interface securely over HTTPS with nginx.

What You'll Build

A fully self-hosted AI chat server on Linux with:

Ollama — runs LLMs locally (no cloud, no API costs)
Open WebUI — a polished ChatGPT-like web interface
GPU acceleration — optional NVIDIA/AMD support
nginx — expose it over HTTPS

Requirements

Ubuntu 22.04/24.04 or Debian 12
8GB+ RAM (16GB recommended)
NVIDIA GPU with 6GB+ VRAM (optional, CPU works too)

Step 1: Install Ollama

Ollama provides a one-liner installer that sets up the binary, creates a systemd service, and configures the API server on port 11434.

curl -fsSL https://ollama.com/install.sh | sh

Verify:

ollama --version

Start the service:

sudo systemctl enable ollama
sudo systemctl start ollama

The Ollama API is now running at http://localhost:11434. You can confirm with curl http://localhost:11434 — it should return Ollama is running.

Step 2: Pull Models

Ollama manages models like Docker manages images. Pull a model once and it stays cached on disk.

# LLaMA 3.2 (3B - fast, good for most tasks)
ollama pull llama3.2

# Mistral 7B (great balance of speed and quality)
ollama pull mistral

# Qwen 2.5 (excellent multilingual support)
ollama pull qwen2.5

# Code-focused model
ollama pull codellama

# List installed models
ollama list

Model files are stored in ~/.ollama/models by default. A 7B model takes roughly 4–5 GB on disk in its quantized form.

Step 3: Test Ollama from CLI

Before setting up the web interface, confirm Ollama works end-to-end from the terminal:

ollama run llama3.2 "Explain what a REST API is in one paragraph"

You should see a streamed response in your terminal. Press Ctrl+D to exit the interactive mode.

Step 4: Install Open WebUI

Open WebUI is a web application that provides a ChatGPT-like interface backed by Ollama. There are two installation paths.

Option A: Docker (recommended)

Docker keeps Open WebUI isolated from your system Python and makes upgrades straightforward.

# CPU only
docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# With NVIDIA GPU support
docker run -d \
  -p 3000:8080 \
  --gpus all \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda

Open http://localhost:3000 and create your admin account.

The -v open-webui:/app/backend/data flag persists chat history, settings, and uploaded files to a named Docker volume so they survive container restarts.

Option B: pip install

If you prefer not to use Docker:

pip install open-webui
open-webui serve

This starts Open WebUI on port 8080. Use a virtual environment to avoid conflicts with system packages.

Step 5: Connect Open WebUI to Ollama

By default Open WebUI connects to Ollama at http://localhost:11434. If you installed both on the same machine, models should appear automatically in the model selector.

To configure manually: go to Settings → Connections → Ollama API and set the URL. If Open WebUI is running inside Docker and Ollama is on the host, use http://host.docker.internal:11434 instead of localhost.

Step 6: GPU Acceleration

Running models on a GPU dramatically reduces inference time. A 7B model that takes 30 seconds per response on CPU may respond in 2–3 seconds on a mid-range NVIDIA GPU.

NVIDIA

First install the NVIDIA Container Toolkit so Docker can access the GPU:

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker

Then use --gpus all in your docker run command (as shown in Step 4, Option A).

Verify GPU is being used:

ollama run llama3.2 "hello"
nvidia-smi  # should show GPU memory usage

AMD ROCm

For AMD GPUs, use the ROCm-enabled image and pass the GPU device nodes:

docker run -d \
  -p 3000:8080 \
  --device /dev/kfd \
  --device /dev/dri \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:rocm

Ollama itself has built-in ROCm support when installed via the install script on systems with ROCm drivers.

Step 7: Run as systemd Service (without Docker)

If you installed Open WebUI via pip and want it to start automatically on boot:

# /etc/systemd/system/open-webui.service
[Unit]
Description=Open WebUI
After=network.target

[Service]
Type=simple
User=ubuntu
ExecStart=/home/ubuntu/.local/bin/open-webui serve
Restart=always
Environment=OLLAMA_BASE_URL=http://127.0.0.1:11434

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now open-webui

Check status with sudo systemctl status open-webui and logs with journalctl -u open-webui -f.

Step 8: nginx Reverse Proxy

Exposing Open WebUI over HTTPS requires a reverse proxy. nginx handles TLS termination and forwards requests to the local Open WebUI port. The proxy_read_timeout value must be high because LLM responses can take tens of seconds to stream.

# /etc/nginx/sites-available/openwebui
server {
    listen 80;
    server_name ai.yourdomain.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name ai.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 3600s;  # important for streaming
    }
}

sudo ln -s /etc/nginx/sites-available/openwebui /etc/nginx/sites-enabled/
sudo certbot --nginx -d ai.yourdomain.com
sudo nginx -t && sudo systemctl reload nginx

The Upgrade and Connection headers are required for WebSocket support, which Open WebUI uses for real-time streaming responses.

Open WebUI Features

Once running, Open WebUI offers more than just chat:

Multi-model: switch between llama3.2, mistral, codellama in one chat session
System prompts: save custom personas and instructions as presets
RAG: upload PDFs and query them — the content is indexed locally
Image generation: connect to Automatic1111 or ComfyUI for image synthesis
Web search: integrate with SearXNG for grounded responses
Multiple users: invite teammates, set per-user permissions and model access
API: full OpenAI-compatible API at /api — drop-in replacement for apps using the OpenAI SDK

Useful Ollama Commands

ollama list           # list installed models
ollama pull phi3      # download a model
ollama rm llama3.2    # delete a model
ollama show mistral   # show model info and parameters
ollama ps             # show currently running models

Performance Tips

Choose the right model size. A 3B model like llama3.2:3b is fast enough for most tasks and runs on 4GB VRAM. A 7B model is noticeably better at reasoning and still fits on 8GB VRAM.

Use quantized variants. Pull llama3.2:3b-instruct-q4_K_M instead of the default to cut VRAM usage by roughly 40% with minimal quality loss.

Tune parallelism. Set OLLAMA_NUM_PARALLEL=2 in /etc/systemd/system/ollama.service to allow two concurrent inference requests — useful if multiple users share the server.

Enable flash attention. Set OLLAMA_FLASH_ATTENTION=1 for faster inference on NVIDIA Ampere GPUs (RTX 30xx and newer).

Monitor resource usage. Run ollama ps to see which models are loaded in memory, and nvidia-smi to watch GPU utilization in real time.