
Bash Scripting for DevOps 2026: 5 Production Scripts with Error Handling and Tests

Bash is the glue language of infrastructure. Despite the proliferation of Python, Go, and purpose-built tools like Ansible, Bash remains the fastest path from idea to working automation for tasks that live close to the operating system — rotating logs, restarting services, checking endpoints, or orchestrating a Docker deployment. The catch is that poorly written shell scripts fail silently, corrupt data, and are impossible to test. This tutorial shows you how to write Bash the right way.

You will build five complete, production-grade scripts, each with:

  • A strict preamble (set -euo pipefail) that turns silent failures into loud errors
  • A structured logging function with timestamps
  • trap handlers that clean up on exit or error
  • A --dry-run flag where it makes sense
  • ShellCheck compliance
  • A bats test sketch

Prerequisites: Bash 4.4+, ShellCheck, curl, docker, postgresql-client, aws-cli (for scripts 3 and 5), bats-core.


Part 1: The Script Foundation Every DevOps Engineer Needs

Before writing a single line of business logic, every production script should start with the same preamble. This section explains exactly why.

The Strict Mode Preamble

#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

#!/usr/bin/env bash

Using env rather than a hard-coded /bin/bash makes the script portable. On macOS, /bin/bash is still Bash 3.2, the last GPLv2 release (Apple never shipped a GPLv3 Bash). env bash finds the first bash in $PATH, so a user who installed Bash 5 via Homebrew or Nix gets the right version. On Linux servers it typically resolves to /usr/bin/bash, which is usually 5.x.

set -e (errexit)

Causes the shell to exit immediately if any simple command returns a non-zero status. Without this, a failed mkdir, cp, or curl is silently ignored and the script keeps running on a corrupted state. Note that set -e does not apply inside if conditions, while conditions, or commands prefixed with ! — the shell is intentionally lenient there so you can still write if ! command; then. See the GNU Bash manual §4.3.1 for the full list of contexts.

set -u (nounset)

Treats references to unset variables as errors. Without this, a typo like ${DIRECOTRY} silently expands to an empty string, which can turn a rm -rf "${DIRECOTRY}/" into rm -rf "/". With -u, the script aborts immediately with a clear message.

set -o pipefail

Normally, a pipeline like broken_command | tee output.log returns the exit status of the last command (tee), which usually succeeds even if broken_command failed. pipefail makes the pipeline return the exit status of the rightmost command that failed. This is essential when you pipe into grep, awk, or tee.
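
You can observe the difference directly:

```bash
#!/usr/bin/env bash
false | tee /dev/null
echo "status without pipefail: $?"   # → status without pipefail: 0

set -o pipefail
false | tee /dev/null || echo "status with pipefail: $?"   # → status with pipefail: 1
```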

IFS=$'\n\t'

The Internal Field Separator controls how Bash splits words. The default (space, newline, tab) causes subtle bugs when file names or variable values contain spaces. Setting it to newline-and-tab only means loops like for f in $files split only on those characters, so accidental word-splitting on spaces is prevented. For most DevOps scripts you should prefer arrays over split strings anyway, but this is a safe default, widely recommended as part of Bash's unofficial strict mode.
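
A quick way to see the effect is to use set -- to count how many words an unquoted expansion produces:

```bash
#!/usr/bin/env bash
path="release notes.txt"       # a single value that contains a space

set -- $path                   # default IFS: splits on the space
echo "default IFS: $# words"   # → default IFS: 2 words

IFS=$'\n\t'
set -- $path                   # space is no longer a separator
echo "strict IFS: $# words"    # → strict IFS: 1 words
```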

The Standard Logging Function

All five scripts use this logging function. Copy it into a shared library or paste it at the top of each script.

# ── Logging ──────────────────────────────────────────────────────────────────
readonly LOG_LEVEL_INFO="INFO"
readonly LOG_LEVEL_WARN="WARN"
readonly LOG_LEVEL_ERROR="ERROR"

log() {
  local level="${1:-INFO}"
  shift
  printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${level}" "$*" >&2
}

log_info()  { log "${LOG_LEVEL_INFO}"  "$@"; }
log_warn()  { log "${LOG_LEVEL_WARN}"  "$@"; }
log_error() { log "${LOG_LEVEL_ERROR}" "$@"; }

Output goes to stderr so it does not pollute stdout, which is reserved for data that callers might pipe. Example output:

[2026-05-12T09:14:03Z] [INFO] Starting log rotation
[2026-05-12T09:14:03Z] [WARN] Disk usage at 87% — rotating immediately
[2026-05-12T09:14:04Z] [ERROR] Failed to compress /var/log/nginx/access.log.1

Variables, Quoting, and Arrays

Follow the Google Shell Style Guide conventions:

  • Always quote variable expansions: "${var}", not $var. Unquoted expansions undergo word-splitting and globbing.
  • Use readonly for constants: readonly MAX_RETRIES=5. This makes the script self-documenting and prevents accidental reassignment.
  • Use arrays for lists of things:
# Bad — spaces in paths will break this
files="/var/log/nginx/access.log /var/log/nginx/error.log"
for f in $files; do ...  # word-splits on space

# Good
declare -a LOG_FILES=(
  "/var/log/nginx/access.log"
  "/var/log/nginx/error.log"
)
for f in "${LOG_FILES[@]}"; do ...
  • Use local inside functions to avoid polluting the global scope:
rotate_file() {
  local file="${1:?rotate_file requires a file argument}"
  local dest="${2:?rotate_file requires a destination}"
  # file and dest are local — they disappear when the function returns
}

The :? syntax is a parameter expansion that aborts with an error message if the variable is unset or empty — a quick way to enforce required arguments.

Trap Handlers for Cleanup and Error Reporting

# ── Trap Handlers ─────────────────────────────────────────────────────────────
TMPDIR_WORK=""

cleanup() {
  local exit_code=$?
  if [[ -n "${TMPDIR_WORK}" && -d "${TMPDIR_WORK}" ]]; then
    rm -rf "${TMPDIR_WORK}"
    log_info "Cleaned up temporary directory ${TMPDIR_WORK}"
  fi
  exit "${exit_code}"
}

on_error() {
  log_error "Script failed at line ${BASH_LINENO[0]} (exit code $?)"
}

trap cleanup EXIT
trap on_error ERR
  • trap cleanup EXIT runs on any exit — success, error, or CTRL+C. It preserves the original exit code by capturing it in exit_code before doing anything else.
  • trap on_error ERR fires whenever a command exits non-zero in a context where errexit would apply. The ERR trap works with or without set -e and obeys the same exemptions for if/while conditions and && / || lists. ${BASH_LINENO[0]} gives you the line number of the failing command.
  • Always declare temp directories as empty strings before trap is set, so the cleanup handler is safe even if the directory was never created.

Script 1: Log Rotation and Disk Cleanup

This script rotates log files older than N days, compresses them, and purges files older than a retention threshold. It supports --dry-run so you can test it safely in production before the first real run.

#!/usr/bin/env bash
# log-rotate.sh — Rotate and compress logs, enforce disk retention
# Usage: log-rotate.sh [--dry-run] [--log-dir DIR] [--keep-days N] [--max-disk-pct N]
set -euo pipefail
IFS=$'\n\t'

# ── Defaults ──────────────────────────────────────────────────────────────────
SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"  # assign, then mark readonly (avoids SC2155)
readonly SCRIPT_NAME
DRY_RUN=false
LOG_DIR="/var/log/app"
KEEP_DAYS=14
MAX_DISK_PCT=85
COMPRESS_CMD="gzip"

# ── Logging ───────────────────────────────────────────────────────────────────
log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info()  { log "INFO"  "$*"; }
log_warn()  { log "WARN"  "$*"; }
log_error() { log "ERROR" "$*"; }

# ── Cleanup ───────────────────────────────────────────────────────────────────
cleanup() {
  local ec=$?
  log_info "${SCRIPT_NAME} exiting with code ${ec}"
  exit "${ec}"
}
on_error() { log_error "Failed at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR

# ── Argument Parsing ──────────────────────────────────────────────────────────
usage() {
  cat >&2 <<EOF
Usage: ${SCRIPT_NAME} [OPTIONS]
  --dry-run          Print actions without executing them
  --log-dir DIR      Directory to rotate (default: ${LOG_DIR})
  --keep-days N      Delete compressed logs older than N days (default: ${KEEP_DAYS})
  --max-disk-pct N   Trigger emergency rotation above N% disk use (default: ${MAX_DISK_PCT})
  -h, --help         Show this help
EOF
  exit 1
}

while [[ $# -gt 0 ]]; do
  case "${1}" in
    --dry-run)       DRY_RUN=true ;;
    --log-dir)       LOG_DIR="${2:?--log-dir requires a value}"; shift ;;
    --keep-days)     KEEP_DAYS="${2:?--keep-days requires a value}"; shift ;;
    --max-disk-pct)  MAX_DISK_PCT="${2:?--max-disk-pct requires a value}"; shift ;;
    -h|--help)       usage ;;
    *) log_error "Unknown option: ${1}"; usage ;;
  esac
  shift
done

# ── Helpers ───────────────────────────────────────────────────────────────────
run() {
  # Wrapper that respects --dry-run
  if "${DRY_RUN}"; then
    log_info "[DRY-RUN] $*"
  else
    "$@"
  fi
}

disk_usage_pct() {
  # Returns integer percentage of disk used for the filesystem containing LOG_DIR
  df --output=pcent "${LOG_DIR}" | tail -1 | tr -d ' %'
}

rotate_file() {
  local file="${1:?rotate_file: file argument required}"
  if [[ ! -f "${file}" ]]; then
    log_warn "Skipping non-existent file: ${file}"
    return 0
  fi
  log_info "Compressing ${file}"
  run "${COMPRESS_CMD}" --force "${file}"
}

purge_old_files() {
  local dir="${1:?}" days="${2:?}"
  log_info "Purging compressed logs older than ${days} days in ${dir}"
  # Use -print0 with read -d '' to handle spaces and newlines in filenames safely
  while IFS= read -r -d '' old_file; do
    log_info "Removing ${old_file}"
    run rm -f "${old_file}"
  done < <(find "${dir}" -maxdepth 1 -name "*.gz" -mtime "+${days}" -print0)
}

# ── Main ──────────────────────────────────────────────────────────────────────
main() {
  log_info "Starting log rotation (dry-run=${DRY_RUN})"

  [[ -d "${LOG_DIR}" ]] || { log_error "Log directory does not exist: ${LOG_DIR}"; exit 1; }

  local disk_pct
  disk_pct="$(disk_usage_pct)"
  log_info "Current disk usage: ${disk_pct}%"

  if (( disk_pct >= MAX_DISK_PCT )); then
    log_warn "Disk at ${disk_pct}% — triggering emergency rotation of ALL logs"
    while IFS= read -r -d '' f; do
      rotate_file "${f}"
    done < <(find "${LOG_DIR}" -maxdepth 1 -name "*.log" -print0)
  else
    # Normal rotation: only files not modified in the last 24 h
    while IFS= read -r -d '' f; do
      rotate_file "${f}"
    done < <(find "${LOG_DIR}" -maxdepth 1 -name "*.log" -mtime +0 -print0)
  fi

  purge_old_files "${LOG_DIR}" "${KEEP_DAYS}"
  log_info "Log rotation complete"
}

main "$@"

Key patterns illustrated:

  • The run() wrapper centralises the --dry-run check. Every destructive command goes through run, so adding --dry-run support to a new command is a one-word change.
  • find … -print0 combined with while IFS= read -r -d '' is the safe way to iterate over file names that may contain spaces. Never use for f in $(find ...).
  • (( disk_pct >= MAX_DISK_PCT )) uses arithmetic evaluation. Note that arithmetic (( 0 )) returns exit code 1, which would trigger set -e. Guard against this by using || true when the result might be zero and you intend to continue.
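
The gotcha in the last bullet is easy to reproduce in isolation:

```bash
#!/usr/bin/env bash
set -e
count=0
# A bare "(( count ))" here would kill the script: the expression
# evaluates to 0, so the command exits with status 1 and errexit fires.
(( count )) || true     # guarded: evaluate, but never abort
echo "still running"    # → still running
```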

bats test sketch:

# test/log-rotate.bats
setup() {
  export TEST_DIR="$(mktemp -d)"
  touch "${TEST_DIR}/app.log"
  touch -d "15 days ago" "${TEST_DIR}/old.log.gz"
}

teardown() { rm -rf "${TEST_DIR}"; }

@test "dry-run does not compress files" {
  run bash log-rotate.sh --dry-run --log-dir "${TEST_DIR}" --keep-days 14
  [ "$status" -eq 0 ]
  [ -f "${TEST_DIR}/app.log" ]       # original still present
  [ ! -f "${TEST_DIR}/app.log.gz" ]  # not compressed
}

@test "purges files older than keep-days" {
  run bash log-rotate.sh --log-dir "${TEST_DIR}" --keep-days 14
  [ "$status" -eq 0 ]
  [ ! -f "${TEST_DIR}/old.log.gz" ]
}

Script 2: HTTP Health Check with Slack Alert and Retry Loop

This script polls a list of HTTP endpoints, retries transient failures with exponential back-off, and sends a Slack webhook alert when a service is down.

#!/usr/bin/env bash
# health-check.sh — HTTP health check with Slack alert and retry
# Usage: health-check.sh [--config FILE] [--dry-run]
set -euo pipefail
IFS=$'\n\t'

SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"
readonly SCRIPT_NAME
CONFIG_FILE="${HEALTH_CHECK_CONFIG:-/etc/health-check.conf}"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL:-}"
MAX_RETRIES=3
RETRY_DELAY=5       # seconds, doubles on each attempt
TIMEOUT=10          # curl connect+max time
DRY_RUN=false

log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info()  { log "INFO"  "$*"; }
log_warn()  { log "WARN"  "$*"; }
log_error() { log "ERROR" "$*"; }

cleanup() { local ec=$?; exit "${ec}"; }
on_error() { log_error "Failed at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR

usage() {
  cat >&2 <<EOF
Usage: ${SCRIPT_NAME} [OPTIONS]
  --config FILE        Path to endpoint config (default: ${CONFIG_FILE})
  --slack-webhook URL  Slack incoming webhook URL (overrides SLACK_WEBHOOK_URL)
  --dry-run            Print actions, do not send alerts
  -h, --help
EOF
  exit 1
}

while [[ $# -gt 0 ]]; do
  case "${1}" in
    --config)        CONFIG_FILE="${2:?}"; shift ;;
    --slack-webhook) SLACK_WEBHOOK="${2:?}"; shift ;;
    --dry-run)       DRY_RUN=true ;;
    -h|--help)       usage ;;
    *) log_error "Unknown option: ${1}"; usage ;;
  esac
  shift
done

# ── Slack Alert ───────────────────────────────────────────────────────────────
send_slack_alert() {
  local endpoint="${1:?}" status_code="${2:?}" message="${3:?}"
  local payload
  payload="$(printf '{"text":":red_circle: *Health check FAILED*\\nEndpoint: %s\\nStatus: %s\\nMessage: %s"}' \
    "${endpoint}" "${status_code}" "${message}")"

  if "${DRY_RUN}"; then
    log_info "[DRY-RUN] Would send Slack alert: ${payload}"
    return 0
  fi

  if [[ -z "${SLACK_WEBHOOK}" ]]; then
    log_warn "SLACK_WEBHOOK_URL not set — skipping alert for ${endpoint}"
    return 0
  fi

  curl --silent --fail --max-time 10 \
    -H 'Content-Type: application/json' \
    -d "${payload}" \
    "${SLACK_WEBHOOK}" || log_warn "Failed to send Slack alert (non-fatal)"
}

# ── HTTP Check with Retry ─────────────────────────────────────────────────────
check_endpoint() {
  local url="${1:?check_endpoint: url required}"
  local expected_code="${2:-200}"
  local attempt=1
  local delay="${RETRY_DELAY}"
  local http_code

  while (( attempt <= MAX_RETRIES )); do
    log_info "Checking ${url} (attempt ${attempt}/${MAX_RETRIES})"

    http_code="$(curl \
      --silent \
      --output /dev/null \
      --write-out '%{http_code}' \
      --max-time "${TIMEOUT}" \
      --connect-timeout 5 \
      "${url}" || echo "000")"

    if [[ "${http_code}" == "${expected_code}" ]]; then
      log_info "${url} OK (HTTP ${http_code})"
      return 0
    fi

    log_warn "${url} returned HTTP ${http_code} (expected ${expected_code}), attempt ${attempt}"

    if (( attempt < MAX_RETRIES )); then
      log_info "Retrying in ${delay}s…"
      sleep "${delay}"
      delay=$(( delay * 2 ))  # exponential back-off
    fi

    (( attempt++ )) || true   # post-increment returns the old value; a 0 would trip set -e
  done

  log_error "${url} FAILED after ${MAX_RETRIES} attempts (last HTTP ${http_code})"
  send_slack_alert "${url}" "${http_code}" "Failed after ${MAX_RETRIES} retries"
  return 1
}

# ── Main ──────────────────────────────────────────────────────────────────────
main() {
  log_info "Starting health checks (dry-run=${DRY_RUN})"

  # Config file format: one "URL [expected_code]" per line; # = comment
  declare -a endpoints=()
  declare -a expected_codes=()

  if [[ -f "${CONFIG_FILE}" ]]; then
    while IFS= read -r line || [[ -n "${line}" ]]; do
      [[ "${line}" =~ ^#|^[[:space:]]*$ ]] && continue
      IFS=' ' read -r url code <<< "${line}"  # global IFS lacks a space, so restore it for this read
      endpoints+=("${url}")
      expected_codes+=("${code:-200}")
    done < "${CONFIG_FILE}"
  else
    log_warn "Config file not found: ${CONFIG_FILE} — using defaults"
    endpoints=("http://localhost:8080/health")
    expected_codes=("200")
  fi

  local failures=0
  for i in "${!endpoints[@]}"; do
    check_endpoint "${endpoints[$i]}" "${expected_codes[$i]}" || (( failures++ )) || true
  done

  if (( failures > 0 )); then
    log_error "${failures} endpoint(s) failed health check"
    exit 1
  fi

  log_info "All endpoints healthy"
}

main "$@"

Key patterns illustrated:

  • Secrets (SLACK_WEBHOOK_URL) are read from the environment, never hard-coded. The --slack-webhook flag allows overriding at the CLI without editing the script.
  • Exponential back-off (delay=$(( delay * 2 ))) avoids thundering-herd retries on a flapping service.
  • (( attempt++ )) || true: post-increment returns the old value of attempt. If that value is 0, the arithmetic command exits with status 1 and would trigger set -e. In these scripts attempt starts at 1, so the guard is purely defensive, but it is a cheap habit that prevents a classic gotcha documented in the BashFAQ.
  • echo "000" as the fallback after curl failure ensures http_code is always set even when set -u is active.

Script 3: PostgreSQL Backup to S3 with Retention Policy

#!/usr/bin/env bash
# pg-backup.sh — Dump PostgreSQL database to S3 with retention enforcement
# Required env vars: PGHOST, PGPORT, PGUSER, PGPASSWORD, PGDATABASE, S3_BUCKET
# Optional: S3_PREFIX (default: backups/postgres), RETAIN_DAYS (default: 30)
set -euo pipefail
IFS=$'\n\t'

SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"
TIMESTAMP="$(date -u '+%Y%m%dT%H%M%SZ')"
readonly SCRIPT_NAME TIMESTAMP
readonly DUMP_FILE="/tmp/${PGDATABASE:-db}_${TIMESTAMP}.pgdump"
readonly S3_PREFIX="${S3_PREFIX:-backups/postgres}"
readonly RETAIN_DAYS="${RETAIN_DAYS:-30}"
DRY_RUN=false

log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info()  { log "INFO"  "$*"; }
log_warn()  { log "WARN"  "$*"; }
log_error() { log "ERROR" "$*"; }

DUMP_FILE_CREATED=false

cleanup() {
  local ec=$?
  if "${DUMP_FILE_CREATED}" && [[ -f "${DUMP_FILE}" ]]; then
    rm -f "${DUMP_FILE}"
    log_info "Removed local dump file ${DUMP_FILE}"
  fi
  log_info "${SCRIPT_NAME} exiting with code ${ec}"
  exit "${ec}"
}
on_error() { log_error "Failed at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR

while [[ $# -gt 0 ]]; do
  case "${1}" in
    --dry-run) DRY_RUN=true ;;
    -h|--help) printf 'Usage: %s [--dry-run]\n' "${SCRIPT_NAME}" >&2; exit 0 ;;
    *) log_error "Unknown option: ${1}"; exit 1 ;;
  esac
  shift
done

# ── Validate Required Variables ───────────────────────────────────────────────
check_required_vars() {
  local missing=()
  for var in PGHOST PGPORT PGUSER PGPASSWORD PGDATABASE S3_BUCKET; do
    [[ -z "${!var:-}" ]] && missing+=("${var}")
  done
  if (( ${#missing[@]} > 0 )); then
    log_error "Missing required environment variables: ${missing[*]}"
    exit 1
  fi
}

# ── Dump ──────────────────────────────────────────────────────────────────────
run_dump() {
  log_info "Dumping ${PGDATABASE} from ${PGHOST}:${PGPORT} to ${DUMP_FILE}"
  if "${DRY_RUN}"; then
    log_info "[DRY-RUN] pg_dump --format=custom … > ${DUMP_FILE}"
    return 0
  fi
  pg_dump \
    --host="${PGHOST}" \
    --port="${PGPORT}" \
    --username="${PGUSER}" \
    --format=custom \
    --compress=9 \
    --file="${DUMP_FILE}" \
    "${PGDATABASE}"
  DUMP_FILE_CREATED=true
  log_info "Dump complete: $(du -sh "${DUMP_FILE}" | cut -f1)"
}

# ── Upload ────────────────────────────────────────────────────────────────────
upload_to_s3() {
  local s3_key="${S3_PREFIX}/${PGDATABASE}/${TIMESTAMP}.pgdump"
  log_info "Uploading to s3://${S3_BUCKET}/${s3_key}"
  if "${DRY_RUN}"; then
    log_info "[DRY-RUN] aws s3 cp ${DUMP_FILE} s3://${S3_BUCKET}/${s3_key}"
    return 0
  fi
  aws s3 cp "${DUMP_FILE}" "s3://${S3_BUCKET}/${s3_key}" \
    --storage-class STANDARD_IA \
    --sse aws:kms
  log_info "Upload complete"
}

# ── Retention ─────────────────────────────────────────────────────────────────
enforce_retention() {
  local prefix="${S3_PREFIX}/${PGDATABASE}/"
  local cutoff_date
  cutoff_date="$(date -u -d "${RETAIN_DAYS} days ago" '+%Y-%m-%dT%H:%M:%SZ')"
  log_info "Removing backups older than ${RETAIN_DAYS} days (before ${cutoff_date})"

  if "${DRY_RUN}"; then
    log_info "[DRY-RUN] Would delete objects older than ${cutoff_date} under s3://${S3_BUCKET}/${prefix}"
    return 0
  fi

  # List objects, filter by LastModified, delete old ones
  aws s3api list-objects-v2 \
    --bucket "${S3_BUCKET}" \
    --prefix "${prefix}" \
    --query "Contents[?LastModified<='${cutoff_date}'].Key" \
    --output text |
  tr '\t' '\n' |
  while IFS= read -r key; do
    [[ -z "${key}" || "${key}" == "None" ]] && continue  # "None" when the query matches nothing
    log_info "Deleting s3://${S3_BUCKET}/${key}"
    aws s3 rm "s3://${S3_BUCKET}/${key}"
  done
}

# ── Main ──────────────────────────────────────────────────────────────────────
main() {
  log_info "Starting PostgreSQL backup (dry-run=${DRY_RUN})"
  check_required_vars
  run_dump
  upload_to_s3
  enforce_retention
  log_info "Backup job complete"
}

main "$@"

Key patterns illustrated:

  • The cleanup handler only removes the dump file if it was actually created (DUMP_FILE_CREATED=true). This keeps the trap safe even when it fires before pg_dump ever ran, for example when argument validation fails.
  • PGPASSWORD is set as an environment variable, which is the documented way to pass passwords to pg_dump without exposing them on the command line (ps aux would reveal a --password flag).
  • --storage-class STANDARD_IA and --sse aws:kms are sensible production defaults: Infrequent Access is substantially cheaper per GB stored (with a per-GB retrieval fee), and KMS encryption satisfies most compliance requirements.
  • ${!var:-} is indirect expansion — it expands to the value of the variable named by var. The :- provides an empty default so -u does not abort when checking whether a variable is set.
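
Indirect expansion is easiest to understand by running it; PATH here is just a variable that is reliably set in any environment:

```bash
#!/usr/bin/env bash
set -u
# ${!var} looks up the variable whose NAME is stored in var.
for var in PATH NO_SUCH_VARIABLE; do
  if [[ -n "${!var:-}" ]]; then
    echo "${var} is set"        # → PATH is set
  else
    echo "${var} is missing"    # → NO_SUCH_VARIABLE is missing
  fi
done
```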

Cron entry:

0 2 * * * /opt/scripts/pg-backup.sh >> /var/log/pg-backup.log 2>&1

Script 4: Zero-Downtime Docker Container Swap with Rollback

This script deploys a new Docker image to a running service, verifies the new container is healthy, and automatically rolls back to the previous image if health checks fail.

#!/usr/bin/env bash
# docker-deploy.sh — Zero-downtime Docker deploy with automatic rollback
# Usage: docker-deploy.sh --image IMAGE[:TAG] --container NAME [--port PORT] [--dry-run]
set -euo pipefail
IFS=$'\n\t'

SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"
readonly SCRIPT_NAME
IMAGE=""
CONTAINER_NAME=""
HOST_PORT=8080
CONTAINER_PORT=8080
HEALTH_PATH="/health"
HEALTH_RETRIES=10
HEALTH_INTERVAL=3
NETWORK="bridge"
DRY_RUN=false

log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info()  { log "INFO"  "$*"; }
log_warn()  { log "WARN"  "$*"; }
log_error() { log "ERROR" "$*"; }

ROLLBACK_IMAGE=""
NEW_CONTAINER_ID=""

cleanup() {
  local ec=$?
  if (( ec != 0 )) && [[ -n "${ROLLBACK_IMAGE}" ]]; then
    log_warn "Deployment failed — initiating rollback to ${ROLLBACK_IMAGE}"
    rollback
  fi
  exit "${ec}"
}
on_error() { log_error "Deployment error at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR

usage() {
  cat >&2 <<EOF
Usage: ${SCRIPT_NAME} OPTIONS
  --image IMAGE[:TAG]   Docker image to deploy (required)
  --container NAME      Container name (required)
  --port PORT           Host port to expose (default: ${HOST_PORT})
  --health-path PATH    Health check URL path (default: ${HEALTH_PATH})
  --network NETWORK     Docker network (default: ${NETWORK})
  --dry-run             Print actions without executing
  -h, --help
EOF
  exit 1
}

while [[ $# -gt 0 ]]; do
  case "${1}" in
    --image)        IMAGE="${2:?}"; shift ;;
    --container)    CONTAINER_NAME="${2:?}"; shift ;;
    --port)         HOST_PORT="${2:?}"; shift ;;
    --health-path)  HEALTH_PATH="${2:?}"; shift ;;
    --network)      NETWORK="${2:?}"; shift ;;
    --dry-run)      DRY_RUN=true ;;
    -h|--help)      usage ;;
    *) log_error "Unknown option: ${1}"; usage ;;
  esac
  shift
done

[[ -z "${IMAGE}" ]]          && { log_error "--image is required"; usage; }
[[ -z "${CONTAINER_NAME}" ]] && { log_error "--container is required"; usage; }

run() {
  if "${DRY_RUN}"; then
    log_info "[DRY-RUN] $*"
  else
    "$@"
  fi
}

# ── Helpers ───────────────────────────────────────────────────────────────────
get_current_image() {
  docker inspect --format='{{.Config.Image}}' "${CONTAINER_NAME}" 2>/dev/null || echo ""
}

stop_and_remove() {
  local name="${1:?}"
  log_info "Stopping and removing container ${name}"
  run docker stop "${name}" || true
  run docker rm   "${name}" || true
}

start_container() {
  local name="${1:?}" image="${2:?}"
  log_info "Starting container ${name} from ${image}"
  local id
  id="$(run docker run \
    --detach \
    --name "${name}" \
    --network "${NETWORK}" \
    --publish "${HOST_PORT}:${CONTAINER_PORT}" \
    --restart unless-stopped \
    "${image}")"
  echo "${id}"
}

wait_for_healthy() {
  local port="${1:?}" path="${2:?}"
  local attempt=1
  log_info "Waiting for service at http://localhost:${port}${path}"
  while (( attempt <= HEALTH_RETRIES )); do
    local code
    code="$(curl --silent --output /dev/null --write-out '%{http_code}' \
      --max-time 5 "http://localhost:${port}${path}" || echo "000")"
    if [[ "${code}" == "200" ]]; then
      log_info "Service healthy after ${attempt} attempt(s)"
      return 0
    fi
    log_info "Attempt ${attempt}/${HEALTH_RETRIES}: HTTP ${code}, waiting ${HEALTH_INTERVAL}s"
    sleep "${HEALTH_INTERVAL}"
    (( attempt++ )) || true
  done
  log_error "Service did not become healthy after ${HEALTH_RETRIES} attempts"
  return 1
}

rollback() {
  if [[ -z "${ROLLBACK_IMAGE}" ]]; then
    log_warn "No rollback image recorded — cannot roll back"
    return
  fi
  log_warn "Rolling back to ${ROLLBACK_IMAGE}"
  stop_and_remove "${CONTAINER_NAME}" || true
  run docker run \
    --detach \
    --name "${CONTAINER_NAME}" \
    --network "${NETWORK}" \
    --publish "${HOST_PORT}:${CONTAINER_PORT}" \
    --restart unless-stopped \
    "${ROLLBACK_IMAGE}" || log_error "Rollback also failed — manual intervention required"
  log_info "Rollback complete — ${ROLLBACK_IMAGE} is running"
  # Reset rollback image so cleanup trap does not loop
  ROLLBACK_IMAGE=""
}

# ── Main ──────────────────────────────────────────────────────────────────────
main() {
  log_info "Deploying ${IMAGE} as ${CONTAINER_NAME} (dry-run=${DRY_RUN})"

  # Record current image for rollback
  ROLLBACK_IMAGE="$(get_current_image)"
  if [[ -n "${ROLLBACK_IMAGE}" ]]; then
    log_info "Current image (rollback target): ${ROLLBACK_IMAGE}"
  fi

  # Pull new image first — fail early before touching the running container
  log_info "Pulling image ${IMAGE}"
  run docker pull "${IMAGE}"

  # Stop old container
  if docker ps -q --filter "name=^${CONTAINER_NAME}$" | grep -q .; then
    stop_and_remove "${CONTAINER_NAME}"
  fi

  # Start new container
  NEW_CONTAINER_ID="$(start_container "${CONTAINER_NAME}" "${IMAGE}")"
  log_info "New container ID: ${NEW_CONTAINER_ID}"

  # Health check
  if ! "${DRY_RUN}"; then
    wait_for_healthy "${HOST_PORT}" "${HEALTH_PATH}"
  fi

  log_info "Deployment successful: ${IMAGE} is running as ${CONTAINER_NAME}"
  # Clear rollback target — deployment succeeded
  ROLLBACK_IMAGE=""
}

main "$@"

Key patterns illustrated:

  • The cleanup trap checks the exit code. If non-zero and a rollback image was recorded, it automatically rolls back. Setting ROLLBACK_IMAGE="" after a successful deploy (or after a rollback completes) prevents the trap from double-rolling-back.
  • Pulling the image before stopping the old container is critical: if the pull fails (e.g., registry outage), the running service is untouched.
  • docker ps -q --filter "name=^${CONTAINER_NAME}$" uses an anchored regex to avoid matching containers whose names merely contain the target name as a substring.

Script 5: Fetch Secrets and Restart Services

This script loads secrets from an environment file or AWS Secrets Manager and restarts the affected systemd services. It validates that all required secrets are present before restarting anything.

#!/usr/bin/env bash
# secret-rotation.sh — Load secrets from env-file or AWS Secrets Manager and restart services
# Usage: secret-rotation.sh --secret-id ARN_OR_NAME --services "svc1 svc2" [--env-file FILE] [--dry-run]
set -euo pipefail
IFS=$'\n\t'

SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"
readonly SCRIPT_NAME
SECRET_ID=""
ENV_FILE=""
DRY_RUN=false
declare -a SERVICES=()
declare -a REQUIRED_KEYS=(DB_PASSWORD API_KEY SMTP_PASSWORD)

log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info()  { log "INFO"  "$*"; }
log_warn()  { log "WARN"  "$*"; }
log_error() { log "ERROR" "$*"; }

TMPENV=""

cleanup() {
  local ec=$?
  if [[ -n "${TMPENV}" && -f "${TMPENV}" ]]; then
    rm -f "${TMPENV}"
    log_info "Removed temporary env file"
  fi
  exit "${ec}"
}
on_error() { log_error "Failed at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR

usage() {
  cat >&2 <<EOF
Usage: ${SCRIPT_NAME} OPTIONS
  --secret-id ID    AWS Secrets Manager secret ID or ARN
  --env-file FILE   Load secrets from a local .env file (alternative to --secret-id)
  --services LIST   Space-separated systemd service names to restart
  --required KEY    Required secret key (repeatable; default: ${REQUIRED_KEYS[*]})
  --dry-run         Validate and print, do not restart services
  -h, --help
EOF
  exit 1
}

while [[ $# -gt 0 ]]; do
  case "${1}" in
    --secret-id)  SECRET_ID="${2:?}"; shift ;;
    --env-file)   ENV_FILE="${2:?}"; shift ;;
    --services)   IFS=' ' read -ra SERVICES <<< "${2:?}"; shift ;;  # space-split despite strict IFS
    --required)   REQUIRED_KEYS+=("${2:?}"); shift ;;
    --dry-run)    DRY_RUN=true ;;
    -h|--help)    usage ;;
    *) log_error "Unknown option: ${1}"; usage ;;
  esac
  shift
done

# ── Load Secrets ──────────────────────────────────────────────────────────────
load_from_aws() {
  local secret_id="${1:?}"
  log_info "Fetching secret ${secret_id} from AWS Secrets Manager"
  local json
  json="$(aws secretsmanager get-secret-value \
    --secret-id "${secret_id}" \
    --query 'SecretString' \
    --output text)"

  # Convert JSON {"KEY":"value",...} to KEY=value env format
  TMPENV="$(mktemp)"
  # Requires jq; each line becomes KEY=value
  jq -r 'to_entries[] | "\(.key)=\(.value)"' <<< "${json}" > "${TMPENV}"
  log_info "Loaded $(wc -l < "${TMPENV}") keys from AWS Secrets Manager"
}

load_from_file() {
  local file="${1:?}"
  [[ -f "${file}" ]] || { log_error "Env file not found: ${file}"; exit 1; }
  # Validate permissions — warn if group/world readable
  local perms
  perms="$(stat -c '%a' "${file}")"
  if [[ "${perms}" != "600" && "${perms}" != "400" ]]; then
    log_warn "Env file ${file} has permissions ${perms} — recommend 600 or 400"
  fi
  TMPENV="${file}"
  log_info "Using env file: ${file}"
}

# ── Validate Secrets ──────────────────────────────────────────────────────────
validate_secrets() {
  local env_file="${1:?}"
  local missing=()
  for key in "${REQUIRED_KEYS[@]}"; do
    if ! grep -qE "^${key}=" "${env_file}"; then
      missing+=("${key}")
    fi
  done
  if (( ${#missing[@]} > 0 )); then
    log_error "Missing required secret keys: ${missing[*]}"
    exit 1
  fi
  log_info "All required keys present: ${REQUIRED_KEYS[*]}"
}

# ── Apply Secrets ─────────────────────────────────────────────────────────────
apply_secrets() {
  local env_file="${1:?}"
  local target_env="/etc/app/secrets.env"
  log_info "Installing secrets to ${target_env}"
  if "${DRY_RUN}"; then
    log_info "[DRY-RUN] Would copy ${env_file} to ${target_env} (mode 600, root:root)"
    return 0
  fi
  install --mode=600 --owner=root --group=root "${env_file}" "${target_env}"
}

# ── Restart Services ──────────────────────────────────────────────────────────
restart_services() {
  if (( ${#SERVICES[@]} == 0 )); then
    log_warn "No services specified — skipping restart"
    return 0
  fi
  for svc in "${SERVICES[@]}"; do
    log_info "Restarting ${svc}"
    if "${DRY_RUN}"; then
      log_info "[DRY-RUN] systemctl restart ${svc}"
      continue
    fi
    systemctl restart "${svc}"
    # Give the service 5 s to settle, then check it
    sleep 5
    if systemctl is-active --quiet "${svc}"; then
      log_info "${svc} is active"
    else
      log_error "${svc} failed to start after secret rotation"
      systemctl status "${svc}" --no-pager >&2 || true
      exit 1
    fi
  done
}

# ── Main ──────────────────────────────────────────────────────────────────────
main() {
  log_info "Starting secret rotation (dry-run=${DRY_RUN})"

  if [[ -n "${SECRET_ID}" ]]; then
    load_from_aws "${SECRET_ID}"
  elif [[ -n "${ENV_FILE}" ]]; then
    load_from_file "${ENV_FILE}"
  else
    log_error "Provide --secret-id or --env-file"
    usage
  fi

  validate_secrets "${TMPENV}"
  apply_secrets "${TMPENV}"
  restart_services
  log_info "Secret rotation complete"
}

main "$@"

Key patterns illustrated:

  • The temp file is declared as TMPENV="" before the trap is registered, so cleanup is always safe.
  • Secrets are never printed to the log. The script only logs key names, not values.
  • install --mode=600 is preferred over cp followed by chmod because install creates the destination with the requested mode from the start: there is never a window in which the file exists with looser permissions.
  • After restarting each service, the script checks systemctl is-active and logs the full status on failure, so operators have immediate context in the CI/CD log.
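The cp-then-chmod race is easy to demonstrate. A minimal sketch, assuming GNU coreutils (install, stat); --owner/--group are omitted so it runs without root, and all paths are temporary:

```shell
#!/usr/bin/env bash
set -euo pipefail

src="$(mktemp)"
dst_dir="$(mktemp -d)"
trap 'rm -rf "${src}" "${dst_dir}"' EXIT
printf 'API_KEY=example\n' > "${src}"
chmod 644 "${src}"   # simulate a source file with default, group-readable permissions

# Racy: between cp and chmod, racy.env exists with the source's 644 mode.
cp "${src}" "${dst_dir}/racy.env"
chmod 600 "${dst_dir}/racy.env"

# install(1) creates the destination with the requested mode from the start.
install --mode=600 "${src}" "${dst_dir}/safe.env"

stat -c '%a' "${dst_dir}/safe.env"   # prints 600
```

Both files end up at 600, but only the install variant was never readable by anyone else in between.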

ShellCheck: Your Linter and Safety Net

ShellCheck is a static analysis tool that catches dozens of common Bash mistakes. Install it and run it before every commit.

# Install
apt-get install shellcheck          # Debian/Ubuntu
brew install shellcheck             # macOS
nix-env -iA nixpkgs.shellcheck     # Nix

# Run against a script
shellcheck --severity=warning log-rotate.sh

# Run against all scripts in the repo
shellcheck --severity=warning scripts/*.sh

# Integrate with pre-commit
# .pre-commit-config.yaml:
repos:
  - repo: https://github.com/shellcheck-py/shellcheck-py
    rev: v0.10.0.1
    hooks:
      - id: shellcheck
        args: [--severity=warning]

All five scripts in this article are ShellCheck-clean at --severity=warning. When ShellCheck reports a finding, look up the SC code (e.g., SC2086 — double-quote to prevent globbing) to understand why it matters.
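As a concrete illustration of SC2086, here is a minimal sketch (the directory name is made up) showing why the unquoted form breaks on a path containing a space:

```shell
#!/usr/bin/env bash
set -euo pipefail

dir="$(mktemp -d)/release notes"   # a path containing a space
mkdir -p "${dir}"

# Unquoted, $dir word-splits into two arguments ("…/release" and "notes"),
# so the command below would fail. ShellCheck flags it as SC2086:
#   ls -d $dir

# Quoted, the expansion is passed as a single argument:
result="$(ls -d "${dir}")"
echo "${result}"
rm -rf "${dir%/*}"
```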


Quick Reference: Bash Idioms for DevOps

| Idiom | Code | Notes |
| --- | --- | --- |
| Strict mode | `set -euo pipefail` | Always first |
| Safe IFS | `IFS=$'\n\t'` | Prevents word-splitting on spaces |
| Required arg | `${1:?msg}` | Aborts if unset or empty |
| Default value | `${VAR:-default}` | Safe with `-u` |
| Indirect expansion | `${!varname}` | Expands the variable named by `varname` |
| Safe array loop | `for f in "${arr[@]}"` | Handles spaces in elements |
| Safe find loop | `while IFS= read -r -d '' f; do ... done < <(find … -print0)` | Handles spaces in filenames |
| Temp file | `tmp="$(mktemp)"; trap 'rm -f "$tmp"' EXIT` | Always cleaned up |
| Arithmetic guard | `(( n++ )) \|\| true` | `(( n++ ))` returns non-zero when `n` was 0, which would otherwise trip `set -e` |
| Function local var | `local x="${1:?}"` | Scope isolation |
| Dry-run wrapper | `run() { "${DRY_RUN}" && log "[DRY-RUN] $*" \|\| "$@"; }` | Single point of control |
| Timestamp | `date -u '+%Y-%m-%dT%H:%M:%SZ'` | ISO 8601 UTC |
| Heredoc | `cat <<'EOF' … EOF` | Single-quoted delimiter: no expansion inside |
| Check command exists | `command -v docker &>/dev/null \|\| { log_error "docker not found"; exit 1; }` | Portable (unlike `which`) |
| Absolute script dir | `DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"` | Works from any CWD (does not resolve symlinks) |
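The dry-run wrapper idiom deserves a worked example. A minimal sketch with stand-in commands; note it uses an explicit if/else rather than an && / || chain, so a failed log call can never fall through to the real command:

```shell
#!/usr/bin/env bash
set -euo pipefail

DRY_RUN=true   # flip to false to execute for real
log() { printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*"; }

# Single point of control for every side-effecting command.
run() {
  if "${DRY_RUN}"; then
    log "[DRY-RUN] $*"
  else
    "$@"
  fi
}

marker="$(mktemp -u)"   # generates a path only; the file is not created
run touch "${marker}"
run rm -f "${marker}"

if [[ ! -e "${marker}" ]]; then
  echo "dry-run left no side effects"
fi
```

Because every mutation funnels through run(), adding --dry-run support to a new script is a one-line change per command, not a rewrite.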

Running the bats Test Suite

bats-core (Bash Automated Testing System) lets you write unit tests for shell scripts with a familiar @test syntax.

# Install bats-core
git clone https://github.com/bats-core/bats-core.git /opt/bats
/opt/bats/install.sh /usr/local

# Run all tests
bats test/

# Run with TAP output for CI
bats --formatter tap test/

A minimal test file for the health-check script:

# test/health-check.bats
load 'test_helper/bats-support/load'
load 'test_helper/bats-assert/load'

@test "exits 0 when all endpoints return 200" {
  # Start a local HTTP server that always returns 200
  python3 -m http.server 19876 &>/dev/null &
  local server_pid=$!
  sleep 0.5
  run bash health-check.sh --config /dev/stdin <<< "http://localhost:19876/ 200"
  kill "${server_pid}" 2>/dev/null || true
  assert_success
}

@test "exits 1 when endpoint returns 503" {
  run bash health-check.sh --config /dev/stdin --dry-run <<< "http://localhost:19999/ 200"
  # With --dry-run and no server, curl returns 000, script should fail
  # (dry-run only skips alerts, not the actual check in this version)
  assert_failure
}

@test "--dry-run does not send Slack alert" {
  SLACK_WEBHOOK_URL="https://hooks.slack.com/fake" \
    run bash health-check.sh --dry-run --config /dev/stdin <<< "http://localhost:19999/ 200"
  refute_output --regexp "curl.*slack"
}
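One prerequisite for function-level bats tests is that the script must be safe to source: a test file should be able to load the functions without triggering the whole workflow. The standard guard, sketched here with a toy function (not code from the scripts above):

```shell
#!/usr/bin/env bash
set -euo pipefail

greet() { printf 'hello %s\n' "${1:?usage: greet NAME}"; }

main() { greet "${1:-world}"; }

# main runs only when the script is executed directly; when the file is
# sourced (as a bats test file would do), only the functions are defined.
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
  main "$@"
fi
```

With this guard in place, a bats setup() can source the script once and each @test can call a single function, such as validate_secrets, in isolation.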

Recommended References

  • GNU Bash Reference Manual (https://www.gnu.org/software/bash/manual/): the authoritative source for set, trap, and expansion semantics
  • ShellCheck wiki (https://www.shellcheck.net/wiki/): one page per SC code, with rationale and fixes
  • bats-core documentation (https://bats-core.readthedocs.io/): test syntax, setup/teardown, and CI integration
  • Google Shell Style Guide (https://google.github.io/styleguide/shellguide.html): widely adopted conventions for production shell

Summary

Production Bash scripting is not about memorising syntax — it is about disciplined defaults. Every script in this tutorial follows the same pattern:

  1. #!/usr/bin/env bash and set -euo pipefail turn silent failures into explicit errors.
  2. A structured logging function with UTC timestamps makes log aggregation trivial.
  3. trap EXIT and trap ERR ensure cleanup always runs and failures are always surfaced with a line number.
  4. A run() wrapper centralises --dry-run logic so you can test any script safely in production.
  5. ShellCheck and bats give you static analysis and automated tests before a line hits a server.

Applying these five patterns consistently means your Bash scripts behave predictably under failure, are safe to test before the first production run, and are readable by any engineer on the team — the three qualities that separate automation from liability.