Ollama on Linux: Run Local LLMs, Manage Models and Use the API (2026)

Ollama is the easiest way to run large language models locally on Linux. No cloud API keys, no data leaving your machine, no per-token costs. This guide covers everything from installation to API integration and GPU acceleration.

What is Ollama?

Ollama is an open-source tool that packages and runs LLMs locally. It handles:

  • Model downloading and management
  • GPU/CPU acceleration automatically
  • A local REST API, plus an OpenAI-compatible endpoint (/v1)
  • A CLI for interactive chat

Supported models include Llama 3, Mistral, Gemma, Qwen, Phi, DeepSeek, and dozens more.

Install Ollama on Linux

One-Line Install (Recommended)

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama as a systemd service. Verify:

ollama --version
systemctl status ollama

Manual Install

# Download and extract the binary
curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
sudo tar -C /usr/local -xzf ollama-linux-amd64.tgz

# Create a dedicated ollama user and add yourself to its group
sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
sudo usermod -a -G ollama $(whoami)

/etc/systemd/system/ollama.service:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

[Install]
WantedBy=default.target

Then reload systemd and start the service:

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
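
Once the service is enabled (with either install method), confirm the API is answering. A minimal check with Python's standard library, assuming the default listen address:

# check_ollama.py - quick health check against the local Ollama API
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/version", timeout=5) as resp:
    print("Ollama version:", json.load(resp)["version"])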

Running Your First Model

# Pull and run a model (starts interactive chat)
ollama run llama3.2

# Run a specific size
ollama run llama3.2:3b     # 3B parameters (~2GB RAM)
ollama run llama3.1:8b     # 8B parameters (~5GB RAM)
ollama run llama3.1:70b    # 70B parameters (~40GB RAM, needs strong GPU)

# Run with a prompt (non-interactive)
ollama run llama3.2 "Explain Docker in one paragraph"

# Run Mistral
ollama run mistral

# Run Gemma
ollama run gemma2

# Run Phi (small, fast)
ollama run phi4-mini

# Run DeepSeek Coder
ollama run deepseek-coder-v2

Model Management

# List downloaded models
ollama list

# Pull a model without running it
ollama pull llama3.2

# Show model information
ollama show llama3.2
ollama show llama3.2 --modelfile

# Remove a model
ollama rm llama3.2:8b

# Copy/rename a model
ollama cp llama3.2 my-llama

# Check where models are stored (~/.ollama/models when run as your user;
# the systemd service uses /usr/share/ollama/.ollama/models)
ls ~/.ollama/models/

Interactive Chat Commands

When inside ollama run:

>>> /help          # show commands
>>> /set verbose   # show token stats
>>> /set nohistory # disable history
>>> /bye           # exit

# Multiline input: wrap the text in triple quotes
>>> """
line 1
line 2
"""

REST API

Ollama exposes a REST API on http://localhost:11434. By default it only listens on localhost.

Generate (single response)

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'

Chat (multi-turn conversation)

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Hello, what is your name?"},
      {"role": "assistant", "content": "I am Llama."},
      {"role": "user", "content": "What can you help me with?"}
    ],
    "stream": false
  }'
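
With "stream": false the API returns a single JSON object; leave streaming on (the default) and it returns one JSON object per line, ending with an object whose "done" field is true. A small sketch that consumes the chat stream with only the Python standard library, assuming llama3.2 is already pulled:

# stream_chat.py - consume the streaming /api/chat endpoint (one JSON object per line)
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": True,  # the default, shown here for clarity
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:            # each non-empty line is a complete JSON object
        if not line.strip():
            continue
        chunk = json.loads(line)
        print(chunk["message"]["content"], end="", flush=True)
        if chunk.get("done"):
            break
print()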

OpenAI-Compatible API

Ollama is compatible with the OpenAI API format:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

List Available Models via API

curl http://localhost:11434/api/tags
curl http://localhost:11434/v1/models
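
The /api/tags response lists each installed model with its name, size, and modification time. A short script that prints models and sizes, assuming the default local endpoint:

# list_models.py - list installed models and their sizes via /api/tags
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)["models"]

for m in models:
    print(f"{m['name']:40s} {m['size'] / 1e9:.1f} GB")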

Python Integration

Using the ollama Python library

pip install ollama

# Then, in Python:
import ollama

# Simple generation
response = ollama.generate(model='llama3.2', prompt='Why is the sky blue?')
print(response['response'])

# Chat
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Write a Python function to reverse a string'}
    ]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)

Using OpenAI Python Client with Ollama

pip install openai

# Then, in Python:
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain containers in simple terms.'}
    ]
)
print(response.choices[0].message.content)

Embeddings

import ollama

# Pull an embedding model first
# ollama pull nomic-embed-text

response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='The quick brown fox jumps over the lazy dog'
)
embedding = response['embedding']
print(f"Embedding dimensions: {len(embedding)}")

Custom Models with Modelfiles

Create a custom model based on an existing one:

Modelfile:

FROM llama3.2

SYSTEM """You are a senior Linux sysadmin. You give concise, accurate answers
about Linux, bash scripting, and server administration. You prefer examples
over explanations. When asked about commands, show the command first."""

PARAMETER temperature 0.5
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

Build and run:

ollama create linux-expert -f ./Modelfile
ollama run linux-expert
ollama list  # see your custom model
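
Custom models behave like any other model: the CLI, REST API, and Python client all accept the new name. A quick sketch calling the linux-expert model built above:

import ollama

response = ollama.chat(
    model='linux-expert',
    messages=[{'role': 'user', 'content': 'How do I find the 10 largest files under /var?'}],
)
print(response['message']['content'])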

Import GGUF Model

Modelfile:

FROM ./mistral-7b-instruct-v0.3.Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

Build it:

ollama create my-mistral -f ./Modelfile

GPU Acceleration

NVIDIA GPU

Install NVIDIA drivers and CUDA:

# Check GPU
nvidia-smi

# Ollama auto-detects CUDA — no extra steps needed after driver install
ollama run llama3.2
# Confirm GPU detection in the service logs: journalctl -u ollama

Check GPU usage during inference:

watch -n 1 nvidia-smi

AMD GPU (ROCm)

# Install ROCm (package names vary by distro and ROCm version)
sudo apt install rocm-hip-sdk

# If your GPU is not officially supported, overriding the GFX version often helps
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve

CPU-Only (No GPU)

Ollama runs fine on CPU-only systems, just more slowly. Rough throughput on a modern desktop CPU (a measurement sketch follows below):

  • 8B model: ~5-10 tokens/second on modern CPU
  • 3B model: ~15-25 tokens/second

Choose quantized models for better CPU performance:

ollama pull llama3.2:3b-instruct-q4_K_M  # 4-bit quantized
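
Throughput varies widely with CPU generation and quantization, so it is worth measuring on your own hardware. The generate response already includes token counts and timings (eval_duration is reported in nanoseconds); a sketch using the quantized model pulled above:

# tokens_per_second.py - measure generation throughput locally
import ollama

response = ollama.generate(
    model='llama3.2:3b-instruct-q4_K_M',
    prompt='Explain systemd in two sentences.',
)
tokens = response['eval_count']            # tokens generated
seconds = response['eval_duration'] / 1e9  # generation time, converted from ns
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tokens/s")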

Environment Variables

# Change API listen address (expose to network)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Or in systemd service
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

# Change where models are stored
OLLAMA_MODELS=/data/ollama/models ollama serve

# Number of parallel requests served per model
OLLAMA_NUM_PARALLEL=4 ollama serve

# Maximum number of models kept loaded at once
OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# Keep models in memory after use (default is 5 minutes)
OLLAMA_KEEP_ALIVE=24h ollama serve

Expose Ollama to Your Network (with Nginx)

server {
    listen 80;
    server_name ollama.local;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}

Add basic auth:

sudo htpasswd -c /etc/nginx/.htpasswd ollama-user

Then in the server block:

location / {
    auth_basic "Ollama";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://127.0.0.1:11434;
}
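
From another machine, a client only needs the proxy hostname and the basic-auth credentials. A sketch with Python's standard library; ollama.local and YOUR_PASSWORD are placeholders from the setup above:

# remote_client.py - call Ollama through the nginx proxy with basic auth
import base64
import json
import urllib.request

auth = base64.b64encode(b"ollama-user:YOUR_PASSWORD").decode()
req = urllib.request.Request(
    "http://ollama.local/api/generate",
    data=json.dumps({"model": "llama3.2", "prompt": "ping", "stream": False}).encode(),
    headers={"Content-Type": "application/json", "Authorization": f"Basic {auth}"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])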

Model Recommendations by Use Case

Use Case            Model                 Size
General chat        llama3.2 / llama3.1   3B–8B
Code generation     deepseek-coder-v2     16B
Code completion     qwen2.5-coder         7B
Fast/lightweight    phi4-mini             3.8B
Long context        qwen2.5               7B–72B
Embeddings          nomic-embed-text      137M
Multilingual        gemma2                9B

Troubleshooting

Ollama service not starting:

journalctl -u ollama -n 50
systemctl status ollama

CUDA/GPU not detected:

nvidia-smi  # verify drivers
ls /dev/nvidia*  # check device files
# Add ollama user to video group
sudo usermod -aG video ollama

Out of memory error:

# Use smaller/more quantized model
ollama run llama3.2:3b-instruct-q4_K_M

# Reduce context length in Modelfile
PARAMETER num_ctx 2048

Slow first token:

Ollama loads the model into VRAM/RAM on first request. Subsequent requests are fast. Use OLLAMA_KEEP_ALIVE=24h to keep models loaded.
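
The same keep-alive behavior can also be requested per call. For example, with the Python client (keep_alive is likewise accepted as a field in raw /api/generate and /api/chat requests):

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'hello'}],
    keep_alive='24h',  # keep the model loaded for 24 hours after this request
)
print(response['message']['content'])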

Summary

Ollama makes running local LLMs trivially easy on Linux:

  1. Install with one curl command
  2. ollama run llama3.2 — you're chatting with a local LLM
  3. REST API at localhost:11434, with an OpenAI-compatible /v1 endpoint
  4. Python via ollama library or openai library with custom base URL
  5. Custom models via Modelfiles
  6. GPU acceleration works automatically with NVIDIA/AMD

The combination of zero cost, complete privacy, and OpenAI API compatibility makes Ollama the obvious choice for running LLMs locally in 2026.

Leonardo Lazzaro

Software engineer and technical writer. 10+ years experience in DevOps, Python, and Linux systems.
