
PyTorch for Beginners 2026: Train Your First Neural Network (No Math PhD Required)

Deep learning has never been more accessible, and PyTorch has become the go-to framework for anyone serious about the field. Whether you want to build image classifiers, language models, or anything in between, this tutorial walks you through everything from installing PyTorch to training a real convolutional neural network on the CIFAR-10 dataset — and then squeezing out even more accuracy with transfer learning.

No calculus degree required. Just Python, curiosity, and a free Google Colab account.


Why PyTorch in 2026?

A few years ago, TensorFlow and PyTorch were neck and neck. Today, PyTorch has pulled decisively ahead in the research and education communities, and that lead matters for beginners too.

Research adoption: The majority of papers published on Papers With Code use PyTorch as their reference implementation. When you want to reproduce a paper's results or learn from state-of-the-art code, you'll almost always be reading PyTorch.

Industry endorsement: fast.ai — the organization behind one of the most praised practical deep learning courses online — switched entirely to PyTorch and has never looked back. Their reasoning was simple: PyTorch's dynamic computation graph lets you write normal Python, which makes debugging intuitive and iteration fast.

Expert recommendation: Andrej Karpathy, former Director of AI at Tesla and a founding member of OpenAI, has consistently recommended PyTorch for people learning deep learning. His popular "Neural Networks: Zero to Hero" series is built entirely on PyTorch. His point is pragmatic: the framework gets out of your way so you can focus on the ideas.

Pythonic by design: TensorFlow 1.x forced you into a "define then run" mental model. PyTorch uses "define by run" — your network is built dynamically as Python executes, which means you can use standard Python control flow, print tensors mid-computation, and drop into pdb like any normal program.
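
To see "define by run" in action, here is a toy example (a sketch, not part of any real model) where ordinary Python control flow and a print statement sit right in the middle of the computation:

import torch

def forward(x):
    # A plain Python if decides the graph structure at run time
    if x.sum() > 0:
        y = x * 2
    else:
        y = -x
    print("intermediate:", y)   # inspect tensors mid-computation
    return y.mean()

out = forward(torch.randn(4))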

The official documentation at pytorch.org/tutorials is also exceptionally good and is actively maintained alongside each release.


Installation

Using pip (recommended for most users)

pip install torch torchvision torchaudio

This installs PyTorch with CPU support. It is the fastest way to get started and works on any machine.

CUDA version (for NVIDIA GPUs)

If you have an NVIDIA GPU, install the CUDA-enabled build. Visit pytorch.org and use the installation selector to pick your OS, package manager, and CUDA version. A typical command looks like:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Replace cu121 with cu118 or whichever CUDA version matches your driver. You can check your CUDA version with nvidia-smi.

Using conda

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Verify the installation

import torch
print(torch.__version__)
print(torch.cuda.is_available())  # True if GPU is available

Tensors: The Building Block of Everything

A tensor is PyTorch's equivalent of a NumPy array — a multi-dimensional container for numbers. Every piece of data in a neural network, from the input images to the learned weights, lives in a tensor.

Creating tensors

import torch

# From a Python list
t = torch.tensor([1.0, 2.0, 3.0])
print(t)        # tensor([1., 2., 3.])
print(t.dtype)  # torch.float32

# Common factory functions
zeros = torch.zeros(3, 4)     # 3x4 matrix of zeros
ones  = torch.ones(2, 2)      # 2x2 matrix of ones
rand  = torch.rand(5, 5)      # uniform random values in [0, 1)
randn = torch.randn(3, 3)     # standard normal distribution

# Check shape
print(rand.shape)   # torch.Size([5, 5])
print(rand.ndim)    # 2

Basic operations

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

# Element-wise
print(a + b)
print(a * b)

# Matrix multiplication
print(a @ b)          # or torch.matmul(a, b)

# Reshaping
flat = a.view(-1)     # flatten to 1D: tensor([1., 2., 3., 4.])
print(flat.shape)     # torch.Size([4])
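
One caveat worth knowing: .view() requires contiguous memory and fails on some tensors (such as transposes), while .reshape() copies when necessary. A quick illustration, continuing with a from above:

b = a.t()              # transpose: a non-contiguous view of a
# b.view(-1)           # would raise a RuntimeError here
print(b.reshape(-1))   # works, copying if needed: tensor([1., 3., 2., 4.])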

The NumPy bridge

PyTorch and NumPy play well together. CPU tensors and NumPy arrays share memory, so conversion is free (no copy).

import numpy as np

arr = np.array([1.0, 2.0, 3.0])
t   = torch.from_numpy(arr)   # tensor shares memory with arr

t2  = torch.tensor([4.0, 5.0, 6.0])
arr2 = t2.numpy()              # back to NumPy
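
Because the memory is shared, mutating the NumPy array changes the tensor too:

arr[0] = 99.0
print(t)   # tensor([99.,  2.,  3.], dtype=torch.float64)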

Moving to GPU

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.randn(3, 3)
t = t.to(device)     # move to GPU if available

print(t.device)      # device(type='cuda', index=0) or device(type='cpu')

The .to(device) pattern is the standard idiom throughout PyTorch code. You define your device once at the top of the script and use it everywhere.


Autograd: Automatic Differentiation Without the Pain

Training a neural network requires computing gradients — partial derivatives of the loss with respect to every parameter. Doing this by hand for a network with millions of parameters would be impossible. PyTorch's autograd engine does it automatically.

How it works conceptually

Every tensor can have requires_grad=True. When you perform operations on such tensors, PyTorch records those operations in a computational graph. When you call .backward() on the final scalar loss, PyTorch traverses that graph in reverse (back-propagation) and accumulates gradients into each tensor's .grad attribute.

A simple example

x = torch.tensor(3.0, requires_grad=True)

# Forward pass: compute y = x^2 + 2x + 1
y = x**2 + 2*x + 1

# Backward pass: compute dy/dx
y.backward()

# dy/dx at x=3 is 2x + 2 = 8
print(x.grad)  # tensor(8.)

Gradient accumulation and zero_grad

PyTorch accumulates (adds) gradients by default. This is useful for certain advanced techniques but will give you wrong results in a basic training loop if you forget to reset them. Always call optimizer.zero_grad() before computing a new batch's gradients.

# This is wrong — gradients accumulate across iterations
for data, target in dataloader:
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()   # gradients from ALL previous batches affect this step

# This is correct
for data, target in dataloader:
    optimizer.zero_grad()   # clear accumulated gradients
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
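
You can watch the accumulation happen on a single scalar:

x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()
print(x.grad)    # tensor(6.): dy/dx = 2x = 6

(x ** 2).backward()
print(x.grad)    # tensor(12.): the new gradient was ADDED to the old one

x.grad.zero_()   # manual reset; optimizer.zero_grad() does this for you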

Building a Neural Network with nn.Module

PyTorch provides torch.nn.Module as the base class for all neural networks. Subclassing it and implementing forward() is the standard pattern.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers as attributes
        self.fc1 = nn.Linear(784, 256)   # fully connected: 784 inputs -> 256 outputs
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)    # 10 output classes

    def forward(self, x):
        # Define the forward pass
        x = F.relu(self.fc1(x))   # apply fc1, then ReLU activation
        x = F.relu(self.fc2(x))
        x = self.fc3(x)           # no activation — CrossEntropyLoss expects raw logits
        return x

model = SimpleNet()
print(model)

# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")

nn.Module automatically tracks all sub-modules and parameters, so model.parameters() gives you everything the optimizer needs to update.
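
You can inspect exactly what is tracked. For the SimpleNet above:

for name, p in model.named_parameters():
    print(name, tuple(p.shape))
# fc1.weight (256, 784)
# fc1.bias   (256,)
# fc2.weight (128, 256)
# ...and so on for fc2.bias, fc3.weight, and fc3.bias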


The Training Loop

Every PyTorch training loop follows the same five-step pattern per batch:

  1. Zero the gradients: optimizer.zero_grad()
  2. Forward pass: pass inputs through the model
  3. Compute the loss: compare predictions to targets
  4. Backward pass: compute gradients via .backward()
  5. Update parameters: optimizer.step()

import torch.optim as optim

model = SimpleNet()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()          # Step 1: clear old gradients
        outputs = model(inputs)        # Step 2: forward pass
        loss = loss_fn(outputs, targets)  # Step 3: compute loss
        loss.backward()                # Step 4: backward pass
        optimizer.step()               # Step 5: update weights

This loop structure is identical whether you're training a tiny classifier or a billion-parameter language model.


DataLoader and Dataset: Efficient Data Pipelines

PyTorch provides two abstractions for feeding data into your model:

  • torch.utils.data.Dataset — represents your dataset and defines how to load a single sample
  • torch.utils.data.DataLoader — wraps a Dataset and handles batching, shuffling, and parallel loading

Custom Dataset example

from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data   = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

dataset = MyDataset(X_train, y_train)
loader  = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)
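
The X_train and y_train above stand in for whatever tensors you have. For a quick smoke test, random data works fine:

X_train = torch.randn(1000, 20)           # 1000 samples, 20 features each
y_train = torch.randint(0, 2, (1000,))    # 1000 binary labels

dataset = MyDataset(X_train, y_train)
loader  = DataLoader(dataset, batch_size=64, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)   # torch.Size([64, 20]) torch.Size([64])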

torchvision provides ready-made datasets for common benchmarks like CIFAR-10, MNIST, and ImageNet, which we'll use below.


Training a Real CIFAR-10 Image Classifier

CIFAR-10 is a dataset of 60,000 small (32×32) colour images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. It's a standard benchmark and small enough to train on a free Colab GPU in minutes.

Here is the complete, working code. Every significant line is explained below.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms

# ── 1. Device setup ──────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# ── 2. Data transforms ───────────────────────────────────────────────────────
# Training transforms include data augmentation to reduce overfitting
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),   # randomly flip image left-right
    transforms.RandomCrop(32, padding=4),# randomly crop with 4-pixel padding
    transforms.ToTensor(),               # convert PIL image to tensor [0, 1]
    transforms.Normalize(                # normalise using CIFAR-10 channel stats
        mean=(0.4914, 0.4822, 0.4465),
        std=(0.2470, 0.2435, 0.2616)
    ),
])

# Test/validation transforms — no augmentation, just normalisation
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=(0.4914, 0.4822, 0.4465),
        std=(0.2470, 0.2435, 0.2616)
    ),
])

# ── 3. Load datasets ─────────────────────────────────────────────────────────
train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=train_transform
)
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=test_transform
)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True,  num_workers=2)
test_loader  = DataLoader(test_dataset,  batch_size=256, shuffle=False, num_workers=2)

CLASSES = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')

# ── 4. Define the model ───────────────────────────────────────────────────────
class ConvNet(nn.Module):
    """
    A small convolutional network for CIFAR-10.
    Architecture: Conv → BN → ReLU → Pool (×3) → Fully Connected (×2)
    """
    def __init__(self):
        super().__init__()
        # Block 1: 3 input channels (RGB) → 32 feature maps
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(32)

        # Block 2: 32 → 64 feature maps
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(64)

        # Block 3: 64 → 128 feature maps
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3   = nn.BatchNorm2d(128)

        self.pool    = nn.MaxPool2d(2, 2)   # halves spatial dimensions
        self.dropout = nn.Dropout(0.5)

        # After 3 max-pool operations: 32 → 16 → 8 → 4, with 128 channels
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        # Block 1
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # → (B, 32, 16, 16)
        # Block 2
        x = self.pool(F.relu(self.bn2(self.conv2(x))))  # → (B, 64, 8, 8)
        # Block 3
        x = self.pool(F.relu(self.bn3(self.conv3(x))))  # → (B, 128, 4, 4)

        x = x.view(x.size(0), -1)   # flatten: (B, 128*4*4) = (B, 2048)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.fc2(x)              # raw logits — no softmax here
        return x

model = ConvNet().to(device)
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# ── 5. Loss and optimiser ────────────────────────────────────────────────────
loss_fn   = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Learning rate schedule: reduce LR at epochs 100 and 150
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

# ── 6. Training loop ─────────────────────────────────────────────────────────
NUM_EPOCHS = 20   # increase to 200 for near state-of-the-art accuracy

def train_one_epoch(model, loader, optimizer, loss_fn, device):
    model.train()   # enable dropout and batch-norm training behaviour
    total_loss = 0.0
    correct    = 0
    total      = 0

    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()            # clear accumulated gradients
        outputs = model(inputs)          # forward pass
        loss    = loss_fn(outputs, targets)  # compute loss
        loss.backward()                  # compute gradients
        optimizer.step()                 # update parameters

        total_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        correct      += predicted.eq(targets).sum().item()
        total        += targets.size(0)

    return total_loss / total, 100.0 * correct / total

# ── 7. Evaluation loop ───────────────────────────────────────────────────────
def evaluate(model, loader, loss_fn, device):
    model.eval()   # disable dropout; use running stats for batch norm
    total_loss = 0.0
    correct    = 0
    total      = 0

    with torch.no_grad():   # disable gradient tracking — saves memory and time
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss    = loss_fn(outputs, targets)

            total_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            correct      += predicted.eq(targets).sum().item()
            total        += targets.size(0)

    return total_loss / total, 100.0 * correct / total

# ── 8. Main training run ─────────────────────────────────────────────────────
for epoch in range(1, NUM_EPOCHS + 1):
    train_loss, train_acc = train_one_epoch(model, train_loader, optimizer, loss_fn, device)
    test_loss,  test_acc  = evaluate(model, test_loader, loss_fn, device)
    scheduler.step()

    print(f"Epoch {epoch:3d}/{NUM_EPOCHS} | "
          f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | "
          f"Test Loss: {test_loss:.4f}  | Test Acc: {test_acc:.2f}%")

What each section does

Device setup (1): A single device variable determines whether computation runs on CPU or GPU. Every tensor and the model itself will be moved to this device with .to(device).

Transforms (2): transforms.Compose chains image preprocessing steps. Augmentation (flip, crop) on the training set reduces overfitting. Normalisation with the dataset's per-channel mean and standard deviation stabilises training.

DataLoader (3): shuffle=True on the training loader is critical — it ensures that each epoch sees batches with different class mixtures, preventing the model from learning patterns tied to data order.

Model architecture (4): Three convolutional blocks, each with a convolution, batch normalisation, ReLU, and max-pooling layer. Batch norm stabilises activations and greatly accelerates training. Dropout before the final layers reduces overfitting.

Loss and optimiser (5): CrossEntropyLoss combines LogSoftmax and NLLLoss internally, so the model outputs raw logits (no activation on the last layer). SGD with momentum and weight decay is a classic choice; Adam also works well.
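
You can check the equivalence directly:

import torch
import torch.nn.functional as F

logits  = torch.randn(4, 10)             # batch of 4, 10 classes
targets = torch.randint(0, 10, (4,))

ce  = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))           # True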

Training loop (6): The five-step pattern: zero_grad → forward → loss → backward → step. model.train() is important — it tells layers like Dropout and BatchNorm to use their training behaviour.

Evaluation (7): model.eval() switches Dropout and BatchNorm to inference mode. torch.no_grad() disables gradient tracking entirely, which saves memory and speeds up the pass. Never forget these two for evaluation.

With 20 epochs on a GPU you should see around 75–80% test accuracy. With 200 epochs and the learning rate schedule, this architecture reaches around 90%.


Moving to GPU: Free Colab Training

Google Colab provides a free GPU (typically a Tesla T4) that can reduce epoch times from minutes to seconds. To use it:

  1. Open colab.research.google.com
  2. Click Runtime → Change runtime type → T4 GPU
  3. Paste the code above into a cell and run

The .to(device) pattern does all the work — no other code changes required. The device variable automatically resolves to "cuda" when a GPU is available.

# Check what you've got
print(torch.cuda.get_device_name(0))       # e.g. "Tesla T4"
print(torch.cuda.get_device_properties(0)) # VRAM, compute capability, etc.

Important: Move both the model AND the data to the same device. A common beginner error is moving the model to GPU but forgetting to move the input tensors, which produces a device mismatch error.

# Correct pattern inside the loop
inputs, targets = inputs.to(device), targets.to(device)

Saving and Loading Models

Saving

The recommended approach is to save only the model's state_dict (the learned parameters), not the entire model object. This is more portable and future-proof.

# Save after training
torch.save(model.state_dict(), 'cifar10_convnet.pth')

# Save a training checkpoint (to resume later)
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': test_loss,
}, 'checkpoint.pth')

Loading

# Load weights into a fresh model instance
model = ConvNet()
model.load_state_dict(torch.load('cifar10_convnet.pth', map_location=device))
model.to(device)
model.eval()

# Resume from checkpoint
checkpoint = torch.load('checkpoint.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1

map_location=device ensures the saved tensors are loaded onto the right device regardless of where they were originally saved (e.g., loading a GPU checkpoint onto a CPU machine).


Transfer Learning with ResNet18

Training from scratch requires a lot of data and time. Transfer learning lets you leverage a network pre-trained on ImageNet (1.2 million images, 1000 classes) and fine-tune it on your much smaller dataset. This typically achieves higher accuracy with far less training.

The approach:

  1. Load a pretrained ResNet18 from torchvision.models
  2. Freeze all existing layers (so their weights do not change)
  3. Replace the final fully-connected layer with a new one sized for your classes
  4. Train: only the new layer (and optionally a few earlier ones) updates

import torchvision.models as models

# ── Load pretrained ResNet18 ──────────────────────────────────────────────────
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# ── Freeze all parameters ─────────────────────────────────────────────────────
for param in resnet.parameters():
    param.requires_grad = False

# ── Replace the final fully-connected layer ───────────────────────────────────
# ResNet18's original fc is Linear(512, 1000) for 1000 ImageNet classes
# We replace it with Linear(512, 10) for our 10 CIFAR-10 classes
num_features = resnet.fc.in_features   # 512
resnet.fc    = nn.Linear(num_features, 10)
# The new layer has requires_grad=True by default

resnet = resnet.to(device)

# ── Only optimise the new final layer ────────────────────────────────────────
optimizer = optim.Adam(resnet.fc.parameters(), lr=1e-3)
loss_fn   = nn.CrossEntropyLoss()

# ── Train ─────────────────────────────────────────────────────────────────────
NUM_EPOCHS = 10

for epoch in range(1, NUM_EPOCHS + 1):
    train_loss, train_acc = train_one_epoch(resnet, train_loader, optimizer, loss_fn, device)
    test_loss,  test_acc  = evaluate(resnet, test_loader, loss_fn, device)
    print(f"Epoch {epoch:2d}/{NUM_EPOCHS} | "
          f"Train Acc: {train_acc:.2f}% | Test Acc: {test_acc:.2f}%")

Fine-tuning deeper layers

After the classifier head has converged, you can unfreeze earlier layers and train with a much smaller learning rate. This is called fine-tuning.

# Unfreeze all parameters for fine-tuning
for param in resnet.parameters():
    param.requires_grad = True

# Use a lower learning rate to avoid destroying pretrained features
optimizer = optim.SGD(resnet.parameters(), lr=1e-4, momentum=0.9, weight_decay=1e-4)

# Train for a few more epochs
for epoch in range(1, 6):
    train_loss, train_acc = train_one_epoch(resnet, train_loader, optimizer, loss_fn, device)
    test_loss,  test_acc  = evaluate(resnet, test_loader, loss_fn, device)
    print(f"Fine-tune epoch {epoch} | Test Acc: {test_acc:.2f}%")

With this approach on CIFAR-10, ResNet18 can reach 92–94% test accuracy in just a few minutes of training — compared to hours from scratch.

Note on image size: ResNet was designed for 224×224 images. CIFAR-10 images are 32×32, so the pretrained filters receive much smaller inputs than they were trained on. For the best transfer learning results, upscale CIFAR-10 images to 224×224 in your transform:

train_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                         std=(0.2470, 0.2435, 0.2616)),
])
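
Apply the same Resize(224) in the test transform so evaluation sees the same input size as training:

test_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                         std=(0.2470, 0.2435, 0.2616)),
])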

Putting It All Together: Recommended Learning Path

  1. Run the CIFAR-10 classifier on Colab with a GPU. Watch the accuracy climb epoch by epoch. Experiment with changing the learning rate, batch size, or number of filters.

  2. Try the transfer learning example. Compare how many epochs it takes to reach 80% vs training from scratch. This viscerally illustrates why pretrained models matter.

  3. Read the PyTorch tutorials at pytorch.org/tutorials. The "60 Minute Blitz" and "Learning PyTorch with Examples" are excellent next steps.

  4. Take fast.ai's Practical Deep Learning for Coders — it is free, builds on PyTorch, and takes you from beginner to state-of-the-art techniques in a practical, code-first way.

  5. Study CS231n (Stanford's Convolutional Neural Networks for Visual Recognition). The lecture notes and assignments are publicly available and provide the theoretical depth that complements this practical tutorial. See cs231n.github.io.

  6. Follow Papers With Code at paperswithcode.com to track the current best results on CIFAR-10 and other benchmarks. Reading the attached PyTorch implementations is a great way to learn advanced techniques.


Quick Reference

Concept            Code
Create tensor      torch.tensor([1, 2, 3])
Move to device     tensor.to(device)
Gradient off       with torch.no_grad():
Training mode      model.train()
Eval mode          model.eval()
Save weights       torch.save(model.state_dict(), 'path.pth')
Load weights       model.load_state_dict(torch.load('path.pth'))
Freeze params      param.requires_grad = False
Zero gradients     optimizer.zero_grad()
Backpropagate      loss.backward()
Update weights     optimizer.step()

Summary

You have now seen every major piece of a PyTorch deep learning workflow:

  • Tensors are the data container, with seamless NumPy interop and GPU support via .to(device).
  • Autograd handles all the calculus automatically — you just call .backward().
  • nn.Module provides the building block for networks of any size.
  • The training loop follows a five-step pattern that never changes.
  • DataLoader makes batching, shuffling, and parallel loading trivial.
  • A full ConvNet can classify CIFAR-10 images with ~90% accuracy.
  • Transfer learning with ResNet18 gets you there faster and with higher accuracy.
  • Colab gives you a free GPU to run all of this in the cloud.

The best way to continue is to take this code, break it, fix it, and build on it. Change the architecture. Add more layers. Try a different dataset. The PyTorch ecosystem (torchvision, torchaudio, HuggingFace Transformers, and more) is enormous, and this tutorial gives you the foundation to explore all of it.

Leonardo Lazzaro

Software engineer and technical writer. 10+ years experience in DevOps, Python, and Linux systems.
