Bash Scripting for DevOps 2026: 5 Production Scripts with Error Handling and Tests
Bash is the glue language of infrastructure. Despite the proliferation of Python, Go, and purpose-built tools like Ansible, Bash remains the fastest path from idea to working automation for tasks that live close to the operating system — rotating logs, restarting services, checking endpoints, or orchestrating a Docker deployment. The catch is that poorly written shell scripts fail silently, corrupt data, and are impossible to test. This tutorial shows you how to write Bash the right way.
You will build five complete, production-grade scripts, each with:
- A strict preamble (set -euo pipefail) that turns silent failures into loud errors
- A structured logging function with timestamps
- trap handlers that clean up on exit or error
- A --dry-run flag where it makes sense
- ShellCheck compliance
- A bats test sketch
Prerequisites: Bash 4.4+, ShellCheck, curl, docker, postgresql-client, aws-cli (for script 3), bats-core.
Part 1: The Script Foundation Every DevOps Engineer Needs
Before writing a single line of business logic, every production script should start with the same preamble. This section explains exactly why.
The Strict Mode Preamble
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
#!/usr/bin/env bash
Using env rather than /bin/bash makes the script portable. On macOS, /bin/bash is still Bash 3.2 — the last GPLv2 release, since Apple does not ship GPLv3 software. env bash finds the first bash in $PATH, so a user who installed Bash 5 via Homebrew or nix gets the right version. On Linux servers it typically resolves to /usr/bin/bash, which is 5.x.
set -e (errexit)
Causes the shell to exit immediately if any simple command returns a non-zero status. Without this, a failed mkdir, cp, or curl is silently ignored and the script keeps running on a corrupted state. Note that set -e does not apply inside if conditions, while conditions, or commands prefixed with ! — the shell is intentionally lenient there so you can still write if ! command; then. See the GNU Bash manual §4.3.1 for the full list of contexts.
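If you want to see errexit in action, here is a throwaway sketch (not part of the five scripts) that runs the same failing mkdir with and without set -e:

```shell
# Contrast the same failing command with and without errexit.
# mkdir fails because the parent directory does not exist.
without_e="$(bash -c 'mkdir /nonexistent/sub 2>/dev/null; echo reached')"
with_e="$(bash -c 'set -e; mkdir /nonexistent/sub 2>/dev/null; echo reached')" || true
echo "without set -e: '${without_e}'"   # the script kept going: 'reached'
echo "with set -e:    '${with_e}'"      # aborted before the echo: ''
```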
set -u (nounset)
Treats references to unset variables as errors. Without this, a typo like ${DIRECOTRY} silently expands to an empty string, which can turn a rm -rf "${DIRECOTRY}/" into rm -rf "/". With -u, the script aborts immediately with a clear message.
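A quick sketch of the difference, using a deliberately unset variable (TYPOED_VAR is hypothetical):

```shell
# Without -u, the typo expands to an empty string and the script carries on
# with a silently wrong path; with -u, the shell aborts immediately.
loose="$(bash -c 'echo "dir is: ${TYPOED_VAR}/logs"')"
strict_rc=0
bash -c 'set -u; echo "dir is: ${TYPOED_VAR}/logs"' 2>/dev/null || strict_rc=$?
echo "without -u: ${loose}"          # 'dir is: /logs' — silently wrong
echo "with -u: exit code ${strict_rc}"
```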
set -o pipefail
Normally, a pipeline like broken_command | tee output.log returns the exit status of the last command (tee), which usually succeeds even if broken_command failed. pipefail makes the pipeline return the exit status of the rightmost command that failed. This is essential when you pipe into grep, awk, or tee.
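A sketch you can paste into a terminal to compare the two behaviours:

```shell
# The same failing pipeline with and without pipefail. tee succeeds, so
# without pipefail the failure of false is invisible to the caller.
rc_plain=0; bash -c 'false | tee /dev/null >/dev/null' || rc_plain=$?
rc_pipefail=0; bash -c 'set -o pipefail; false | tee /dev/null >/dev/null' || rc_pipefail=$?
echo "without pipefail: ${rc_plain}"    # 0 — tee's status masks the failure
echo "with pipefail:    ${rc_pipefail}" # 1 — false's failure propagates
```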
IFS=$'\n\t'
The Internal Field Separator controls how Bash splits words. The default (space, newline, tab) causes subtle bugs when file names or variable values contain spaces. Setting it to newline-and-tab only means loops like for f in $files split only on those characters, and accidental word-splitting on spaces is prevented. For most DevOps scripts you should prefer arrays over split strings anyway, but this is a safe default. This pattern comes from the widely circulated "unofficial Bash strict mode".
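A small sketch of the effect, using a two-record string in which one record contains a space:

```shell
# How IFS changes word-splitting of an unquoted expansion.
records=$'one file.log\ntwo.log'
count_default=0
for f in $records; do count_default=$((count_default + 1)); done  # intentionally unquoted
old_ifs="$IFS"
IFS=$'\n\t'
count_strict=0
for f in $records; do count_strict=$((count_strict + 1)); done    # intentionally unquoted
IFS="$old_ifs"
echo "default IFS split into ${count_default} words"   # 3 — the space broke one record in two
echo "strict IFS split into ${count_strict} words"     # 2 — records stayed intact
```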
The Standard Logging Function
All five scripts use this logging function. Copy it into a shared library or paste it at the top of each script.
# ── Logging ──────────────────────────────────────────────────────────────────
readonly LOG_LEVEL_INFO="INFO"
readonly LOG_LEVEL_WARN="WARN"
readonly LOG_LEVEL_ERROR="ERROR"
log() {
local level="${1:-INFO}"
shift
printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${level}" "$*" >&2
}
log_info() { log "${LOG_LEVEL_INFO}" "$@"; }
log_warn() { log "${LOG_LEVEL_WARN}" "$@"; }
log_error() { log "${LOG_LEVEL_ERROR}" "$@"; }
Output goes to stderr so it does not pollute stdout, which is reserved for data that callers might pipe. Example output:
[2026-05-12T09:14:03Z] [INFO] Starting log rotation
[2026-05-12T09:14:03Z] [WARN] Disk usage at 87% — rotating immediately
[2026-05-12T09:14:04Z] [ERROR] Failed to compress /var/log/nginx/access.log.1
Variables, Quoting, and Arrays
Follow the Google Shell Style Guide conventions:
- Always quote variable expansions: "${var}", not $var. Unquoted expansions undergo word-splitting and globbing.
- Use readonly for constants: readonly MAX_RETRIES=5. This makes the script self-documenting and prevents accidental reassignment.
- Use arrays for lists of things:
# Bad — spaces in paths will break this
files="/var/log/nginx/access.log /var/log/nginx/error.log"
for f in $files; do ... # word-splits on space
# Good
declare -a LOG_FILES=(
"/var/log/nginx/access.log"
"/var/log/nginx/error.log"
)
for f in "${LOG_FILES[@]}"; do ...
- Use local inside functions to avoid polluting the global scope:
rotate_file() {
local file="${1:?rotate_file requires a file argument}"
local dest="${2:?rotate_file requires a destination}"
# file and dest are local — they disappear when the function returns
}
The :? syntax is a parameter expansion that aborts with an error message if the variable is unset or empty — a quick way to enforce required arguments.
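A standalone sketch of the :? guard (greet is a hypothetical function, not part of the five scripts):

```shell
# :? aborts the (sub)shell with a message when the argument is missing.
greet() {
  local name="${1:?greet requires a name}"
  echo "hello, ${name}"
}
greet world                                        # prints: hello, world
(greet) 2>/dev/null || echo "aborted: missing argument"
```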
Trap Handlers for Cleanup and Error Reporting
# ── Trap Handlers ─────────────────────────────────────────────────────────────
TMPDIR_WORK=""
cleanup() {
local exit_code=$?
if [[ -n "${TMPDIR_WORK}" && -d "${TMPDIR_WORK}" ]]; then
rm -rf "${TMPDIR_WORK}"
log_info "Cleaned up temporary directory ${TMPDIR_WORK}"
fi
exit "${exit_code}"
}
on_error() {
log_error "Script failed at line ${BASH_LINENO[0]} (exit code $?)"
}
trap cleanup EXIT
trap on_error ERR
- trap cleanup EXIT runs on any exit — success, error, or CTRL+C. It preserves the original exit code by capturing it in exit_code before doing anything else.
- trap on_error ERR fires whenever a command exits with a non-zero status (subject to the same exceptions as set -e). ${BASH_LINENO[0]} gives you the line number of the failing command. Note that functions, command substitutions, and subshells do not inherit the ERR trap unless you also enable set -E (errtrace).
- Always declare temp-directory variables as empty strings before trap is set, so the cleanup handler is safe even if the directory was never created.
Script 1: Log Rotation and Disk Cleanup
This script rotates log files older than N days, compresses them, and purges files older than a retention threshold. It supports --dry-run so you can test it safely in production before the first real run.
#!/usr/bin/env bash
# log-rotate.sh — Rotate and compress logs, enforce disk retention
# Usage: log-rotate.sh [--dry-run] [--log-dir DIR] [--keep-days N] [--max-disk-pct N]
set -euo pipefail
IFS=$'\n\t'
# ── Defaults ──────────────────────────────────────────────────────────────────
# Assign and mark readonly separately so a failing command substitution is not masked (ShellCheck SC2155)
SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"; readonly SCRIPT_NAME
DRY_RUN=false
LOG_DIR="/var/log/app"
KEEP_DAYS=14
MAX_DISK_PCT=85
COMPRESS_CMD="gzip"
# ── Logging ───────────────────────────────────────────────────────────────────
log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info() { log "INFO" "$*"; }
log_warn() { log "WARN" "$*"; }
log_error() { log "ERROR" "$*"; }
# ── Cleanup ───────────────────────────────────────────────────────────────────
cleanup() {
local ec=$?
log_info "${SCRIPT_NAME} exiting with code ${ec}"
exit "${ec}"
}
on_error() { log_error "Failed at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR
# ── Argument Parsing ──────────────────────────────────────────────────────────
usage() {
cat >&2 <<EOF
Usage: ${SCRIPT_NAME} [OPTIONS]
--dry-run Print actions without executing them
--log-dir DIR Directory to rotate (default: ${LOG_DIR})
--keep-days N Delete compressed logs older than N days (default: ${KEEP_DAYS})
--max-disk-pct N Trigger emergency rotation above N% disk use (default: ${MAX_DISK_PCT})
-h, --help Show this help
EOF
exit 1
}
while [[ $# -gt 0 ]]; do
case "${1}" in
--dry-run) DRY_RUN=true ;;
--log-dir) LOG_DIR="${2:?--log-dir requires a value}"; shift ;;
--keep-days) KEEP_DAYS="${2:?--keep-days requires a value}"; shift ;;
--max-disk-pct) MAX_DISK_PCT="${2:?--max-disk-pct requires a value}"; shift ;;
-h|--help) usage ;;
*) log_error "Unknown option: ${1}"; usage ;;
esac
shift
done
# ── Helpers ───────────────────────────────────────────────────────────────────
run() {
# Wrapper that respects --dry-run
if "${DRY_RUN}"; then
log_info "[DRY-RUN] $*"
else
"$@"
fi
}
disk_usage_pct() {
# Returns integer percentage of disk used for the filesystem containing LOG_DIR
df --output=pcent "${LOG_DIR}" | tail -1 | tr -d ' %'
}
rotate_file() {
local file="${1:?rotate_file: file argument required}"
if [[ ! -f "${file}" ]]; then
log_warn "Skipping non-existent file: ${file}"
return 0
fi
log_info "Compressing ${file}"
run "${COMPRESS_CMD}" --force "${file}"
}
purge_old_files() {
local dir="${1:?}" days="${2:?}"
log_info "Purging compressed logs older than ${days} days in ${dir}"
# Use -print0 / xargs -0 to handle spaces in filenames safely
while IFS= read -r -d '' old_file; do
log_info "Removing ${old_file}"
run rm -f "${old_file}"
done < <(find "${dir}" -maxdepth 1 -name "*.gz" -mtime "+${days}" -print0)
}
# ── Main ──────────────────────────────────────────────────────────────────────
main() {
log_info "Starting log rotation (dry-run=${DRY_RUN})"
[[ -d "${LOG_DIR}" ]] || { log_error "Log directory does not exist: ${LOG_DIR}"; exit 1; }
local disk_pct
disk_pct="$(disk_usage_pct)"
log_info "Current disk usage: ${disk_pct}%"
if (( disk_pct >= MAX_DISK_PCT )); then
log_warn "Disk at ${disk_pct}% — triggering emergency rotation of ALL logs"
while IFS= read -r -d '' f; do
rotate_file "${f}"
done < <(find "${LOG_DIR}" -maxdepth 1 -name "*.log" -print0)
else
# Normal rotation: only files not modified in the last 24 h
while IFS= read -r -d '' f; do
rotate_file "${f}"
done < <(find "${LOG_DIR}" -maxdepth 1 -name "*.log" -mtime +0 -print0)
fi
purge_old_files "${LOG_DIR}" "${KEEP_DAYS}"
log_info "Log rotation complete"
}
main "$@"
Key patterns illustrated:
- The run() wrapper centralises the --dry-run check. Every destructive command goes through run, so adding --dry-run support to a new command is a one-word change.
- find … -print0 combined with while IFS= read -r -d '' is the safe way to iterate over file names that may contain spaces. Never use for f in $(find ...).
- (( disk_pct >= MAX_DISK_PCT )) uses arithmetic evaluation. Note that arithmetic (( 0 )) returns exit code 1, which would trigger set -e. Guard against this by using || true when the result might be zero and you intend to continue.
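The arithmetic gotcha is easiest to see in isolation; this sketch is not part of log-rotate.sh:

```shell
# An arithmetic command exits with status 1 when it evaluates to 0.
n=0
if (( n )); then echo "truthy"; else echo "(( 0 )) exits non-zero"; fi
(( n++ )) || true   # post-increment evaluated to the old value 0 — || true absorbs the failure
(( n++ )) || true   # old value 1: would have succeeded anyway, but the guard is harmless
echo "n is now ${n}"   # 2
```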
bats test sketch:
# test/log-rotate.bats
setup() {
export TEST_DIR="$(mktemp -d)"
touch "${TEST_DIR}/app.log"
touch -d "15 days ago" "${TEST_DIR}/old.log.gz"
}
teardown() { rm -rf "${TEST_DIR}"; }
@test "dry-run does not compress files" {
run bash log-rotate.sh --dry-run --log-dir "${TEST_DIR}" --keep-days 14
[ "$status" -eq 0 ]
[ -f "${TEST_DIR}/app.log" ] # original still present
[ ! -f "${TEST_DIR}/app.log.gz" ] # not compressed
}
@test "purges files older than keep-days" {
run bash log-rotate.sh --log-dir "${TEST_DIR}" --keep-days 14
[ "$status" -eq 0 ]
[ ! -f "${TEST_DIR}/old.log.gz" ]
}
Script 2: HTTP Health Check with Slack Alert and Retry Loop
This script polls a list of HTTP endpoints, retries transient failures with exponential back-off, and sends a Slack webhook alert when a service is down.
#!/usr/bin/env bash
# health-check.sh — HTTP health check with Slack alert and retry
# Usage: health-check.sh [--config FILE] [--dry-run]
set -euo pipefail
IFS=$'\n\t'
SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"; readonly SCRIPT_NAME
CONFIG_FILE="${HEALTH_CHECK_CONFIG:-/etc/health-check.conf}"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL:-}"
MAX_RETRIES=3
RETRY_DELAY=5 # seconds, doubles on each attempt
TIMEOUT=10 # curl connect+max time
DRY_RUN=false
log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info() { log "INFO" "$*"; }
log_warn() { log "WARN" "$*"; }
log_error() { log "ERROR" "$*"; }
cleanup() { local ec=$?; exit "${ec}"; }
on_error() { log_error "Failed at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR
usage() {
cat >&2 <<EOF
Usage: ${SCRIPT_NAME} [OPTIONS]
--config FILE Path to endpoint config (default: ${CONFIG_FILE})
--slack-webhook URL Slack incoming webhook URL (overrides SLACK_WEBHOOK_URL)
--dry-run Print actions, do not send alerts
-h, --help
EOF
exit 1
}
while [[ $# -gt 0 ]]; do
case "${1}" in
--config) CONFIG_FILE="${2:?}"; shift ;;
--slack-webhook) SLACK_WEBHOOK="${2:?}"; shift ;;
--dry-run) DRY_RUN=true ;;
-h|--help) usage ;;
*) log_error "Unknown option: ${1}"; usage ;;
esac
shift
done
# ── Slack Alert ───────────────────────────────────────────────────────────────
send_slack_alert() {
local endpoint="${1:?}" status_code="${2:?}" message="${3:?}"
local payload
payload="$(printf '{"text":":red_circle: *Health check FAILED*\\nEndpoint: %s\\nStatus: %s\\nMessage: %s"}' \
"${endpoint}" "${status_code}" "${message}")"
if "${DRY_RUN}"; then
log_info "[DRY-RUN] Would send Slack alert: ${payload}"
return 0
fi
if [[ -z "${SLACK_WEBHOOK}" ]]; then
log_warn "SLACK_WEBHOOK_URL not set — skipping alert for ${endpoint}"
return 0
fi
curl --silent --fail --max-time 10 \
-H 'Content-Type: application/json' \
-d "${payload}" \
"${SLACK_WEBHOOK}" || log_warn "Failed to send Slack alert (non-fatal)"
}
# ── HTTP Check with Retry ─────────────────────────────────────────────────────
check_endpoint() {
local url="${1:?check_endpoint: url required}"
local expected_code="${2:-200}"
local attempt=1
local delay="${RETRY_DELAY}"
local http_code
while (( attempt <= MAX_RETRIES )); do
log_info "Checking ${url} (attempt ${attempt}/${MAX_RETRIES})"
http_code="$(curl \
--silent \
--output /dev/null \
--write-out '%{http_code}' \
--max-time "${TIMEOUT}" \
--connect-timeout 5 \
"${url}" || echo "000")"
if [[ "${http_code}" == "${expected_code}" ]]; then
log_info "${url} OK (HTTP ${http_code})"
return 0
fi
log_warn "${url} returned HTTP ${http_code} (expected ${expected_code}), attempt ${attempt}"
if (( attempt < MAX_RETRIES )); then
log_info "Retrying in ${delay}s…"
sleep "${delay}"
delay=$(( delay * 2 )) # exponential back-off
fi
(( attempt++ )) || true # post-increment evaluates to the old value; a 0 would make the command exit 1 and trip set -e
done
log_error "${url} FAILED after ${MAX_RETRIES} attempts (last HTTP ${http_code})"
send_slack_alert "${url}" "${http_code}" "Failed after ${MAX_RETRIES} retries"
return 1
}
# ── Main ──────────────────────────────────────────────────────────────────────
main() {
log_info "Starting health checks (dry-run=${DRY_RUN})"
# Config file format: one "URL [expected_code]" per line; # = comment
declare -a endpoints=()
declare -a expected_codes=()
if [[ -f "${CONFIG_FILE}" ]]; then
while IFS= read -r line || [[ -n "${line}" ]]; do
[[ "${line}" =~ ^#|^[[:space:]]*$ ]] && continue
IFS=' ' read -r url code <<< "${line}" # override the global IFS=$'\n\t' so the line splits on the space
endpoints+=("${url}")
expected_codes+=("${code:-200}")
done < "${CONFIG_FILE}"
else
log_warn "Config file not found: ${CONFIG_FILE} — using defaults"
endpoints=("http://localhost:8080/health")
expected_codes=("200")
fi
local failures=0
for i in "${!endpoints[@]}"; do
check_endpoint "${endpoints[$i]}" "${expected_codes[$i]}" || (( failures++ )) || true
done
if (( failures > 0 )); then
log_error "${failures} endpoint(s) failed health check"
exit 1
fi
log_info "All endpoints healthy"
}
main "$@"
Key patterns illustrated:
- Secrets (SLACK_WEBHOOK_URL) are read from the environment, never hard-coded. The --slack-webhook flag allows overriding at the CLI without editing the script.
- Exponential back-off (delay=$(( delay * 2 ))) avoids thundering-herd retries against a flapping service.
- (( attempt++ )) || true — post-increment evaluates to the old value, and an arithmetic command that evaluates to 0 exits with status 1, which would trigger set -e. The || true prevents that. This is a well-known Bash gotcha documented in the BashFAQ.
- echo "000" as the fallback after a curl failure keeps the command substitution from aborting the script under set -e and guarantees http_code always holds a sentinel value.
Script 3: PostgreSQL Backup to S3 with Retention Policy
#!/usr/bin/env bash
# pg-backup.sh — Dump PostgreSQL database to S3 with retention enforcement
# Required env vars: PGHOST, PGPORT, PGUSER, PGPASSWORD, PGDATABASE, S3_BUCKET
# Optional: S3_PREFIX (default: backups/postgres), RETAIN_DAYS (default: 30)
set -euo pipefail
IFS=$'\n\t'
SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"; readonly SCRIPT_NAME
TIMESTAMP="$(date -u '+%Y%m%dT%H%M%SZ')"; readonly TIMESTAMP
readonly DUMP_FILE="/tmp/${PGDATABASE:-db}_${TIMESTAMP}.pgdump"
readonly S3_PREFIX="${S3_PREFIX:-backups/postgres}"
readonly RETAIN_DAYS="${RETAIN_DAYS:-30}"
DRY_RUN=false
log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info() { log "INFO" "$*"; }
log_warn() { log "WARN" "$*"; }
log_error() { log "ERROR" "$*"; }
DUMP_FILE_CREATED=false
cleanup() {
local ec=$?
if "${DUMP_FILE_CREATED}" && [[ -f "${DUMP_FILE}" ]]; then
rm -f "${DUMP_FILE}"
log_info "Removed local dump file ${DUMP_FILE}"
fi
log_info "${SCRIPT_NAME} exiting with code ${ec}"
exit "${ec}"
}
on_error() { log_error "Failed at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR
while [[ $# -gt 0 ]]; do
case "${1}" in
--dry-run) DRY_RUN=true ;;
-h|--help) printf 'Usage: %s [--dry-run]\n' "${SCRIPT_NAME}" >&2; exit 0 ;;
*) log_error "Unknown option: ${1}"; exit 1 ;;
esac
shift
done
# ── Validate Required Variables ───────────────────────────────────────────────
check_required_vars() {
local missing=()
for var in PGHOST PGPORT PGUSER PGPASSWORD PGDATABASE S3_BUCKET; do
[[ -z "${!var:-}" ]] && missing+=("${var}")
done
if (( ${#missing[@]} > 0 )); then
log_error "Missing required environment variables: ${missing[*]}"
exit 1
fi
}
# ── Dump ──────────────────────────────────────────────────────────────────────
run_dump() {
log_info "Dumping ${PGDATABASE} from ${PGHOST}:${PGPORT} to ${DUMP_FILE}"
if "${DRY_RUN}"; then
log_info "[DRY-RUN] pg_dump --format=custom … > ${DUMP_FILE}"
return 0
fi
pg_dump \
--host="${PGHOST}" \
--port="${PGPORT}" \
--username="${PGUSER}" \
--format=custom \
--compress=9 \
--file="${DUMP_FILE}" \
"${PGDATABASE}"
DUMP_FILE_CREATED=true
log_info "Dump complete: $(du -sh "${DUMP_FILE}" | cut -f1)"
}
# ── Upload ────────────────────────────────────────────────────────────────────
upload_to_s3() {
local s3_key="${S3_PREFIX}/${PGDATABASE}/${TIMESTAMP}.pgdump"
log_info "Uploading to s3://${S3_BUCKET}/${s3_key}"
if "${DRY_RUN}"; then
log_info "[DRY-RUN] aws s3 cp ${DUMP_FILE} s3://${S3_BUCKET}/${s3_key}"
return 0
fi
aws s3 cp "${DUMP_FILE}" "s3://${S3_BUCKET}/${s3_key}" \
--storage-class STANDARD_IA \
--sse aws:kms
log_info "Upload complete"
}
# ── Retention ─────────────────────────────────────────────────────────────────
enforce_retention() {
local prefix="${S3_PREFIX}/${PGDATABASE}/"
local cutoff_date
cutoff_date="$(date -u -d "${RETAIN_DAYS} days ago" '+%Y-%m-%dT%H:%M:%SZ')"
log_info "Removing backups older than ${RETAIN_DAYS} days (before ${cutoff_date})"
if "${DRY_RUN}"; then
log_info "[DRY-RUN] Would delete objects older than ${cutoff_date} under s3://${S3_BUCKET}/${prefix}"
return 0
fi
# List objects, filter by LastModified, delete old ones
aws s3api list-objects-v2 \
--bucket "${S3_BUCKET}" \
--prefix "${prefix}" \
--query "Contents[?LastModified<='${cutoff_date}'].Key" \
--output text |
tr '\t' '\n' |
while IFS= read -r key; do
[[ -z "${key}" || "${key}" == "None" ]] && continue # aws prints the literal string "None" when nothing matches
log_info "Deleting s3://${S3_BUCKET}/${key}"
aws s3 rm "s3://${S3_BUCKET}/${key}"
done
}
# ── Main ──────────────────────────────────────────────────────────────────────
main() {
log_info "Starting PostgreSQL backup (dry-run=${DRY_RUN})"
check_required_vars
run_dump
upload_to_s3
enforce_retention
log_info "Backup job complete"
}
main "$@"
Key patterns illustrated:
- The cleanup handler only removes the dump file if it was actually created (DUMP_FILE_CREATED=true). This avoids a race condition where the trap fires before the dump starts.
- PGPASSWORD is passed as an environment variable, which is the documented way to give pg_dump a password without exposing it on the command line (an argument would be visible to anyone running ps aux).
- --storage-class STANDARD_IA and --sse aws:kms are sensible production defaults: Infrequent Access is markedly cheaper for backups, and KMS encryption satisfies most compliance requirements.
- ${!var:-} is indirect expansion — it expands to the value of the variable named by var. The :- provides an empty default so -u does not abort when checking whether a variable is set.
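A standalone sketch of the indirect-expansion pattern used by check_required_vars, with hypothetical DEMO_* variables standing in for the real PG* ones:

```shell
# ${!var} expands to the value of the variable whose name is stored in var.
DEMO_HOST="db.example.internal"   # hypothetical values
DEMO_PORT="5432"
unset DEMO_DB 2>/dev/null || true # ensure the third variable is genuinely unset
missing=()
for var in DEMO_HOST DEMO_PORT DEMO_DB; do
  if [[ -z "${!var:-}" ]]; then
    missing+=("${var}")
  fi
done
echo "missing: ${missing[*]}"     # prints: missing: DEMO_DB
```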
Cron entry:
0 2 * * * /opt/scripts/pg-backup.sh >> /var/log/pg-backup.log 2>&1
Script 4: Zero-Downtime Docker Container Swap with Rollback
This script deploys a new Docker image to a running service, verifies the new container is healthy, and automatically rolls back to the previous image if health checks fail. Strictly speaking, the swap has a brief gap equal to the new container's start-up time; for truly zero downtime you would run two containers behind a load balancer, but this pattern keeps the gap to a single docker run.
#!/usr/bin/env bash
# docker-deploy.sh — Zero-downtime Docker deploy with automatic rollback
# Usage: docker-deploy.sh --image IMAGE[:TAG] --container NAME [--port PORT] [--dry-run]
set -euo pipefail
IFS=$'\n\t'
SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"; readonly SCRIPT_NAME
IMAGE=""
CONTAINER_NAME=""
HOST_PORT=8080
CONTAINER_PORT=8080
HEALTH_PATH="/health"
HEALTH_RETRIES=10
HEALTH_INTERVAL=3
NETWORK="bridge"
DRY_RUN=false
log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info() { log "INFO" "$*"; }
log_warn() { log "WARN" "$*"; }
log_error() { log "ERROR" "$*"; }
ROLLBACK_IMAGE=""
NEW_CONTAINER_ID=""
cleanup() {
local ec=$?
if (( ec != 0 )) && [[ -n "${ROLLBACK_IMAGE}" ]]; then
log_warn "Deployment failed — initiating rollback to ${ROLLBACK_IMAGE}"
rollback
fi
exit "${ec}"
}
on_error() { log_error "Deployment error at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR
usage() {
cat >&2 <<EOF
Usage: ${SCRIPT_NAME} OPTIONS
--image IMAGE[:TAG] Docker image to deploy (required)
--container NAME Container name (required)
--port PORT Host port to expose (default: ${HOST_PORT})
--health-path PATH Health check URL path (default: ${HEALTH_PATH})
--network NETWORK Docker network (default: ${NETWORK})
--dry-run Print actions without executing
-h, --help
EOF
exit 1
}
while [[ $# -gt 0 ]]; do
case "${1}" in
--image) IMAGE="${2:?}"; shift ;;
--container) CONTAINER_NAME="${2:?}"; shift ;;
--port) HOST_PORT="${2:?}"; shift ;;
--health-path) HEALTH_PATH="${2:?}"; shift ;;
--network) NETWORK="${2:?}"; shift ;;
--dry-run) DRY_RUN=true ;;
-h|--help) usage ;;
*) log_error "Unknown option: ${1}"; usage ;;
esac
shift
done
[[ -z "${IMAGE}" ]] && { log_error "--image is required"; usage; }
[[ -z "${CONTAINER_NAME}" ]] && { log_error "--container is required"; usage; }
run() {
if "${DRY_RUN}"; then
log_info "[DRY-RUN] $*"
else
"$@"
fi
}
# ── Helpers ───────────────────────────────────────────────────────────────────
get_current_image() {
docker inspect --format='{{.Config.Image}}' "${CONTAINER_NAME}" 2>/dev/null || echo ""
}
stop_and_remove() {
local name="${1:?}"
log_info "Stopping and removing container ${name}"
run docker stop "${name}" || true
run docker rm "${name}" || true
}
start_container() {
local name="${1:?}" image="${2:?}"
log_info "Starting container ${name} from ${image}"
local id
id="$(run docker run \
--detach \
--name "${name}" \
--network "${NETWORK}" \
--publish "${HOST_PORT}:${CONTAINER_PORT}" \
--restart unless-stopped \
"${image}")"
echo "${id}"
}
wait_for_healthy() {
local port="${1:?}" path="${2:?}"
local attempt=1
log_info "Waiting for service at http://localhost:${port}${path}"
while (( attempt <= HEALTH_RETRIES )); do
local code
code="$(curl --silent --output /dev/null --write-out '%{http_code}' \
--max-time 5 "http://localhost:${port}${path}" || echo "000")"
if [[ "${code}" == "200" ]]; then
log_info "Service healthy after ${attempt} attempt(s)"
return 0
fi
log_info "Attempt ${attempt}/${HEALTH_RETRIES}: HTTP ${code}, waiting ${HEALTH_INTERVAL}s"
sleep "${HEALTH_INTERVAL}"
(( attempt++ )) || true
done
log_error "Service did not become healthy after ${HEALTH_RETRIES} attempts"
return 1
}
rollback() {
if [[ -z "${ROLLBACK_IMAGE}" ]]; then
log_warn "No rollback image recorded — cannot roll back"
return
fi
log_warn "Rolling back to ${ROLLBACK_IMAGE}"
stop_and_remove "${CONTAINER_NAME}" || true
run docker run \
--detach \
--name "${CONTAINER_NAME}" \
--network "${NETWORK}" \
--publish "${HOST_PORT}:${CONTAINER_PORT}" \
--restart unless-stopped \
"${ROLLBACK_IMAGE}" || log_error "Rollback also failed — manual intervention required"
log_info "Rollback complete — ${ROLLBACK_IMAGE} is running"
# Reset rollback image so cleanup trap does not loop
ROLLBACK_IMAGE=""
}
# ── Main ──────────────────────────────────────────────────────────────────────
main() {
log_info "Deploying ${IMAGE} as ${CONTAINER_NAME} (dry-run=${DRY_RUN})"
# Record current image for rollback
ROLLBACK_IMAGE="$(get_current_image)"
if [[ -n "${ROLLBACK_IMAGE}" ]]; then
log_info "Current image (rollback target): ${ROLLBACK_IMAGE}"
fi
# Pull new image first — fail early before touching the running container
log_info "Pulling image ${IMAGE}"
run docker pull "${IMAGE}"
# Stop old container
if docker ps -q --filter "name=^${CONTAINER_NAME}$" | grep -q .; then
stop_and_remove "${CONTAINER_NAME}"
fi
# Start new container
NEW_CONTAINER_ID="$(start_container "${CONTAINER_NAME}" "${IMAGE}")"
log_info "New container ID: ${NEW_CONTAINER_ID}"
# Health check
if ! "${DRY_RUN}"; then
wait_for_healthy "${HOST_PORT}" "${HEALTH_PATH}"
fi
log_info "Deployment successful: ${IMAGE} is running as ${CONTAINER_NAME}"
# Clear rollback target — deployment succeeded
ROLLBACK_IMAGE=""
}
main "$@"
Key patterns illustrated:
- The cleanup trap checks the exit code. If it is non-zero and a rollback image was recorded, it automatically rolls back. Setting ROLLBACK_IMAGE="" after a successful deploy (or after a completed rollback) prevents the trap from rolling back twice.
- Pulling the image before stopping the old container is critical: if the pull fails (e.g., during a registry outage), the running service is untouched.
- docker ps -q --filter "name=^${CONTAINER_NAME}$" uses an anchored pattern so the filter does not match containers whose names merely contain the target name as a substring.
Script 5: Fetch Secrets and Restart Services
This script loads secrets from an environment file or AWS Secrets Manager and restarts the affected systemd services. It validates that all required secrets are present before restarting anything.
#!/usr/bin/env bash
# secret-rotation.sh — Load secrets from env-file or AWS Secrets Manager and restart services
# Usage: secret-rotation.sh --secret-id ARN_OR_NAME --services "svc1 svc2" [--env-file FILE] [--dry-run]
set -euo pipefail
IFS=$'\n\t'
SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"; readonly SCRIPT_NAME
SECRET_ID=""
ENV_FILE=""
DRY_RUN=false
declare -a SERVICES=()
declare -a REQUIRED_KEYS=(DB_PASSWORD API_KEY SMTP_PASSWORD)
log() { printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "${1}" "${2}" >&2; }
log_info() { log "INFO" "$*"; }
log_warn() { log "WARN" "$*"; }
log_error() { log "ERROR" "$*"; }
TMPENV=""
cleanup() {
local ec=$?
if [[ -n "${TMPENV}" && -f "${TMPENV}" ]]; then
rm -f "${TMPENV}"
log_info "Removed temporary env file"
fi
exit "${ec}"
}
on_error() { log_error "Failed at line ${BASH_LINENO[0]}"; }
trap cleanup EXIT
trap on_error ERR
usage() {
cat >&2 <<EOF
Usage: ${SCRIPT_NAME} OPTIONS
--secret-id ID AWS Secrets Manager secret ID or ARN
--env-file FILE Load secrets from a local .env file (alternative to --secret-id)
--services LIST Space-separated systemd service names to restart
--required KEY Required secret key (repeatable; default: ${REQUIRED_KEYS[*]})
--dry-run Validate and print, do not restart services
-h, --help
EOF
exit 1
}
while [[ $# -gt 0 ]]; do
case "${1}" in
--secret-id) SECRET_ID="${2:?}"; shift ;;
--env-file) ENV_FILE="${2:?}"; shift ;;
--services) IFS=' ' read -ra SERVICES <<< "${2:?}"; shift ;; # override the global IFS so the list splits on spaces
--required) REQUIRED_KEYS+=("${2:?}"); shift ;;
--dry-run) DRY_RUN=true ;;
-h|--help) usage ;;
*) log_error "Unknown option: ${1}"; usage ;;
esac
shift
done
# ── Load Secrets ──────────────────────────────────────────────────────────────
load_from_aws() {
local secret_id="${1:?}"
log_info "Fetching secret ${secret_id} from AWS Secrets Manager"
local json
json="$(aws secretsmanager get-secret-value \
--secret-id "${secret_id}" \
--query 'SecretString' \
--output text)"
# Convert JSON {"KEY":"value",...} to KEY=value env format
TMPENV="$(mktemp)"
# Requires jq; each line becomes KEY=value
jq -r 'to_entries[] | "\(.key)=\(.value)"' <<< "${json}" > "${TMPENV}"
log_info "Loaded $(wc -l < "${TMPENV}") keys from AWS Secrets Manager"
}
load_from_file() {
local file="${1:?}"
[[ -f "${file}" ]] || { log_error "Env file not found: ${file}"; exit 1; }
# Validate permissions — warn if group/world readable
local perms
perms="$(stat -c '%a' "${file}")"
if [[ "${perms}" != "600" && "${perms}" != "400" ]]; then
log_warn "Env file ${file} has permissions ${perms} — recommend 600 or 400"
fi
# Copy into a private temp file so the EXIT trap never deletes the operator's original
TMPENV="$(mktemp)"
cp -- "${file}" "${TMPENV}"
chmod 600 "${TMPENV}"
log_info "Using env file: ${file}"
}
# ── Validate Secrets ──────────────────────────────────────────────────────────
validate_secrets() {
local env_file="${1:?}"
local missing=()
for key in "${REQUIRED_KEYS[@]}"; do
if ! grep -qE "^${key}=" "${env_file}"; then
missing+=("${key}")
fi
done
if (( ${#missing[@]} > 0 )); then
log_error "Missing required secret keys: ${missing[*]}"
exit 1
fi
log_info "All required keys present: ${REQUIRED_KEYS[*]}"
}
# ── Apply Secrets ─────────────────────────────────────────────────────────────
apply_secrets() {
local env_file="${1:?}"
local target_env="/etc/app/secrets.env"
log_info "Installing secrets to ${target_env}"
if "${DRY_RUN}"; then
log_info "[DRY-RUN] Would copy ${env_file} to ${target_env} (mode 600, root:root)"
return 0
fi
install --mode=600 --owner=root --group=root "${env_file}" "${target_env}"
}
# ── Restart Services ──────────────────────────────────────────────────────────
restart_services() {
if (( ${#SERVICES[@]} == 0 )); then
log_warn "No services specified — skipping restart"
return 0
fi
for svc in "${SERVICES[@]}"; do
log_info "Restarting ${svc}"
if "${DRY_RUN}"; then
log_info "[DRY-RUN] systemctl restart ${svc}"
continue
fi
systemctl restart "${svc}"
# Give the service 5 s to settle, then check it
sleep 5
if systemctl is-active --quiet "${svc}"; then
log_info "${svc} is active"
else
log_error "${svc} failed to start after secret rotation"
systemctl status "${svc}" --no-pager >&2 || true
exit 1
fi
done
}
# ── Main ──────────────────────────────────────────────────────────────────────
main() {
log_info "Starting secret rotation (dry-run=${DRY_RUN})"
if [[ -n "${SECRET_ID}" ]]; then
load_from_aws "${SECRET_ID}"
elif [[ -n "${ENV_FILE}" ]]; then
load_from_file "${ENV_FILE}"
else
log_error "Provide --secret-id or --env-file"
usage
fi
validate_secrets "${TMPENV}"
apply_secrets "${TMPENV}"
restart_services
log_info "Secret rotation complete"
}
main "$@"
Key patterns illustrated:
- The temp-file variable is declared as TMPENV="" before the trap is registered, so cleanup is always safe to run.
- Secrets are never printed to the log. The script logs only key names, not values.
- install --mode=600 is preferred over cp followed by chmod because the file is created with the correct mode from the start: there is no window in which it exists with looser permissions.
- After restarting each service, the script checks systemctl is-active and logs the full status on failure, so operators have immediate context in the CI/CD log.
ShellCheck: Your Linter and Safety Net
ShellCheck is a static analysis tool that catches dozens of common Bash mistakes. Install it and run it before every commit.
# Install
apt-get install shellcheck # Debian/Ubuntu
brew install shellcheck # macOS
nix-env -iA nixpkgs.shellcheck # Nix
# Run against a script
shellcheck --severity=warning log-rotate.sh
# Run against all scripts in the repo
shellcheck --severity=warning scripts/*.sh
# Integrate with pre-commit
# .pre-commit-config.yaml:
repos:
- repo: https://github.com/shellcheck-py/shellcheck-py
rev: v0.10.0.1
hooks:
- id: shellcheck
args: [--severity=warning]
All five scripts in this article are ShellCheck-clean at --severity=warning. When ShellCheck reports a finding, look up the SC code (e.g., SC2086 — double-quote to prevent globbing) to understand why it matters.
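A sketch of the word-splitting bug class that SC2086 guards against (the path is hypothetical):

```shell
# An unquoted expansion is split into separate arguments on whitespace.
path="/tmp/my report.txt"
unquoted_words=$(printf '%s\n' $path | wc -l)    # intentionally unquoted
quoted_words=$(printf '%s\n' "$path" | wc -l)
echo "unquoted: $((unquoted_words)) arguments"   # 2 — the space split the path
echo "quoted:   $((quoted_words)) argument"      # 1 — the path stayed whole
```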
Quick Reference: Bash Idioms for DevOps
| Idiom | Code | Notes |
|---|---|---|
| Strict mode | set -euo pipefail | Always first |
| Safe IFS | IFS=$'\n\t' | Prevents word-split on spaces |
| Required arg | ${1:?msg} | Aborts if unset/empty |
| Default value | ${VAR:-default} | Safe with -u |
| Indirect expansion | ${!varname} | Expands the variable named by varname |
| Safe array loop | for f in "${arr[@]}" | Handles spaces in elements |
| Safe find loop | while IFS= read -r -d '' f; do ... done < <(find … -print0) | Handles spaces in filenames |
| Temp file | tmp="$(mktemp)"; trap 'rm -f "$tmp"' EXIT | Always clean up |
| Arithmetic guard | (( n++ )) \|\| true | Prevents set -e firing on 0 |
| Function local var | local x="${1:?}" | Scope isolation |
| Dry-run wrapper | run() { "${DRY_RUN}" && log "[DRY-RUN] $*" \|\| "$@"; } | Single point of control |
| Timestamp | date -u '+%Y-%m-%dT%H:%M:%SZ' | ISO 8601 UTC |
| Heredoc | cat <<'EOF' … EOF | Single-quoted — no expansion |
| Check command exists | command -v docker &>/dev/null \|\| { log_error "docker not found"; exit 1; } | Portable (not which) |
| Absolute script dir | DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | Independent of the caller's cwd (does not resolve symlinks) |
Running the bats Test Suite
bats-core (Bash Automated Testing System) lets you write unit tests for shell scripts with a familiar @test syntax.
# Install bats-core
git clone https://github.com/bats-core/bats-core.git /opt/bats
/opt/bats/install.sh /usr/local
# Run all tests
bats test/
# Run with TAP output for CI
bats --formatter tap test/
A minimal test file for the health-check script:
# test/health-check.bats
load 'test_helper/bats-support/load'
load 'test_helper/bats-assert/load'
@test "exits 0 when all endpoints return 200" {
# Start a local HTTP server that always returns 200
python3 -m http.server 19876 &>/dev/null &
local server_pid=$!
sleep 0.5
run bash health-check.sh --config /dev/stdin <<< "http://localhost:19876/ 200"
kill "${server_pid}" 2>/dev/null || true
assert_success
}
@test "exits 1 when endpoint returns 503" {
run bash health-check.sh --config /dev/stdin --dry-run <<< "http://localhost:19999/ 200"
# No server listens on 19999, so curl returns 000 and the check fails.
# --dry-run only skips the Slack alert, not the check itself.
[ "$status" -ne 0 ]
}
@test "--dry-run does not send Slack alert" {
SLACK_WEBHOOK_URL="https://hooks.slack.com/fake" \
run bash health-check.sh --dry-run --config /dev/stdin <<< "http://localhost:19999/ 200"
assert_output --partial "[DRY-RUN] Would send Slack alert"
}
Recommended References
- GNU Bash Manual — the authoritative reference for every flag, expansion, and built-in
- BashFAQ (Greg's Wiki) — community-maintained answers to the hardest Bash questions
- Google Shell Style Guide — naming conventions, formatting, and idioms used at scale
- ShellCheck Wiki — explanation of every ShellCheck diagnostic
- bats-core — test framework documentation
Summary
Production Bash scripting is not about memorising syntax — it is about disciplined defaults. Every script in this tutorial follows the same pattern:
- #!/usr/bin/env bash and set -euo pipefail turn silent failures into explicit errors.
- A structured logging function with UTC timestamps makes log aggregation trivial.
- trap … EXIT and trap … ERR ensure cleanup always runs and failures are surfaced with a line number.
- A run() wrapper centralises --dry-run logic so you can test any script safely in production.
- ShellCheck and bats give you static analysis and automated tests before a line hits a server.
Applying these five patterns consistently means your Bash scripts behave predictably under failure, are safe to test before the first production run, and are readable by any engineer on the team — the three qualities that separate automation from liability.