
SigNoz + OpenTelemetry Python 2026: Full Observability Stack Tutorial


TL;DR This tutorial walks you through building a production-grade observability stack using OpenTelemetry and SigNoz for Python applications. You will instrument a FastAPI service with zero-code auto-instrumentation, add custom manual spans, propagate context across service boundaries, collect metrics and structured logs, configure the OpenTelemetry Collector, and visualize everything inside SigNoz — all without paying for Datadog or New Relic. By the end, you will have a running, correlated traces-metrics-logs pipeline that you can deploy to production.


Modern distributed systems fail in non-obvious ways. A single slow database query can cascade into degraded response times across a dozen microservices, and without the right tooling the incident will last hours instead of minutes. Observability — the ability to infer the internal state of a system purely from its external outputs — is no longer optional. This guide shows you exactly how to build that capability with two excellent open-source projects: OpenTelemetry for instrumentation and SigNoz for storage and visualization.


1. What Is Observability? Traces vs Metrics vs Logs

Observability is often described through three pillars. Understanding what each pillar is good at (and where it falls short) prevents you from over-relying on any single signal.

Traces

A trace is a directed acyclic graph of spans. Each span represents a single unit of work — an HTTP handler, a database call, a message publish — with a start time, end time, and a bag of key-value attributes. Spans within the same trace share a common trace_id, and parent-child relationships are recorded through span_id and parent_span_id fields.

Traces answer the question: why is this particular request slow? They let you pinpoint the exact operation that added latency and understand the call chain that led to an error.
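To make that structure concrete, here is a minimal, self-contained sketch that emits a parent and a child span to the console (no Collector required); the tracer and span names are illustrative:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("pillar-demo")

with tracer.start_as_current_span("http.handler"):
    with tracer.start_as_current_span("db.query"):
        # Both printed spans carry the same trace_id; the inner span's
        # parent_id is the outer span's span_id
        pass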

Metrics

A metric is a numeric measurement sampled over time. Counters, gauges, and histograms are the three primitive types. Metrics are cheap to store and aggregate, making them ideal for dashboards and alerting. They answer: is anything broken right now, and how bad is it?

Logs

A log is an immutable, timestamped record of a discrete event. Structured logs (JSON or key-value pairs) are far more useful than plain-text lines because they can be indexed and queried efficiently. Logs answer: what exactly happened at this moment in time?

The real power of modern observability platforms is correlation — jumping from an alert on a metric spike, to the trace that caused it, to the log line that explains the root cause — all in a single workflow. SigNoz provides this correlation natively.


2. OpenTelemetry Overview

OpenTelemetry (OTel) is a CNCF project that provides a vendor-neutral API, SDK, and wire protocol for collecting observability data. It was formed by merging OpenCensus and OpenTracing, and is now the de facto standard for instrumenting cloud-native applications.

Key components

  • API: Language-specific interfaces for creating spans, metrics, and log records. Stable and dependency-safe — libraries can depend on this.
  • SDK: Implements the API with sampling, batching, and exporting logic. Applications configure the SDK at startup.
  • Collector: A standalone agent/gateway that receives telemetry over OTLP, processes it (tail sampling, attribute enrichment, filtering), and exports it to one or more backends.
  • OTLP: OpenTelemetry Protocol — the canonical wire format. gRPC and HTTP/protobuf transports are both supported.
  • Instrumentation libraries: Auto-instrumentation packages for popular frameworks (FastAPI, Django, SQLAlchemy, httpx, etc.).

Why OpenTelemetry?

  • Vendor neutrality: instrument once, export to SigNoz today, Grafana Tempo tomorrow, or both simultaneously.
  • Community momentum: hundreds of contrib packages cover virtually every popular library.
  • Specification-backed: the OTel specification defines semantic conventions so attribute names like http.method mean the same thing across every language and backend.
  • No lock-in: switching backends requires only a Collector config change, not an application redeployment.

3. SigNoz Overview

SigNoz is an open-source observability platform written in Go and React. It is designed as a self-hosted alternative to Datadog and New Relic, with a ClickHouse columnar database at its core for cost-effective storage of high-cardinality trace and metric data.

Why SigNoz over hosted solutions?

  • Cost: ClickHouse compresses telemetry data aggressively. Many teams report 5-10x lower storage costs compared to SaaS platforms at the same data volume.
  • Data residency: all data stays inside your infrastructure — critical for healthcare, finance, and government workloads.
  • No per-seat pricing: SigNoz is MIT-licensed and free to run. You pay only for the infrastructure it runs on.
  • Native OTLP ingestion: no proprietary agent required. If your application sends OTLP, SigNoz accepts it.
  • Full correlation: traces, metrics, and logs share the same trace_id and span_id fields, enabling one-click pivots between signals.

SigNoz ships with a prebuilt Docker Compose stack that includes the SigNoz query service, ClickHouse, Zookeeper, and the OTel Collector. Getting it running locally takes under five minutes.


4. Install SigNoz with Docker Compose

Prerequisites

  • Docker Engine 24+ and Docker Compose v2
  • At least 4 GB RAM free for the SigNoz stack

Clone and start

git clone -b main https://github.com/SigNoz/signoz.git --depth=1
cd signoz/deploy
docker compose -f docker/clickhouse-setup/docker-compose.yaml up -d

Wait about 60 seconds for ClickHouse to initialize, then open http://localhost:3301 in your browser. You will be prompted to create an admin account on first launch.

Verify the Collector is listening

The OTel Collector bundled with SigNoz exposes these ports by default:

  • 4317 (gRPC): OTLP traces, metrics, logs
  • 4318 (HTTP): OTLP traces, metrics, logs
  • 8888 (HTTP): Collector self-metrics

curl -v http://localhost:4318/v1/traces
# Expect: 405 Method Not Allowed — the endpoint exists and is healthy

5. Install the OpenTelemetry Python SDK

Create a virtual environment and install the core packages:

python -m venv .venv
source .venv/bin/activate

pip install \
  opentelemetry-distro \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  opentelemetry-instrumentation-sqlalchemy \
  fastapi \
  "uvicorn[standard]"

The packages split into distinct concerns:

  • opentelemetry-distro — wires up the SDK from the OTEL_* environment variables when you launch with the opentelemetry-instrument wrapper (used in the next section)
  • opentelemetry-sdk — the core SDK with TracerProvider, MeterProvider, LoggerProvider
  • opentelemetry-exporter-otlp — exports data via OTLP (includes both gRPC and HTTP sub-packages)
  • opentelemetry-instrumentation-* — zero-code patches for specific libraries

6. Auto-Instrument FastAPI with Zero-Code Instrumentation

The fastest way to get traces flowing is the opentelemetry-instrument CLI wrapper. It patches supported libraries at startup without changing application code.

Application code (app/main.py)

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI(title="demo-service")


@app.get("/items/{item_id}")
async def get_item(item_id: int):
    return JSONResponse({"item_id": item_id, "status": "ok"})


@app.get("/health")
async def health():
    return {"status": "healthy"}

Launch with auto-instrumentation

OTEL_SERVICE_NAME=demo-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
OTEL_TRACES_EXPORTER=otlp \
OTEL_METRICS_EXPORTER=otlp \
OTEL_LOGS_EXPORTER=otlp \
opentelemetry-instrument uvicorn app.main:app --host 0.0.0.0 --port 8000

Every HTTP request to demo-service now produces a trace with spans for the ASGI middleware and the route handler. Hit the endpoint a few times:

for i in $(seq 1 10); do curl -s http://localhost:8000/items/$i > /dev/null; done

Open the SigNoz UI, navigate to Services, and you should see demo-service appear within 15–30 seconds.


7. Manual Instrumentation: Custom Spans with Attributes

Auto-instrumentation captures framework-level boundaries, but business logic often needs finer-grained spans. Use the OTel tracing API to add custom spans anywhere in your code.

Configuring the SDK programmatically (app/telemetry.py)

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME


def configure_tracing(service_name: str) -> trace.Tracer:
    resource = Resource(attributes={SERVICE_NAME: service_name})

    exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    return trace.get_tracer(service_name)

Using the tracer in business logic (app/inventory.py)

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


def fetch_inventory(item_id: int) -> dict:
    with tracer.start_as_current_span("inventory.fetch") as span:
        span.set_attribute("item.id", item_id)
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.name", "inventory_db")

        try:
            # Simulate a database call
            result = _query_database(item_id)
            span.set_attribute("item.found", True)
            span.set_status(Status(StatusCode.OK))
            return result
        except ItemNotFoundError as exc:
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            span.record_exception(exc)
            raise


def _query_database(item_id: int) -> dict:
    # Real implementation would use SQLAlchemy or asyncpg here
    if item_id > 1000:
        raise ItemNotFoundError(f"Item {item_id} does not exist")
    return {"item_id": item_id, "quantity": 42}


class ItemNotFoundError(Exception):
    pass

Key best practices for manual instrumentation:

  • Always set span.set_status(Status(StatusCode.ERROR, ...)) on exceptions — this is what SigNoz uses to highlight error spans in red.
  • Call span.record_exception(exc) to attach a full stack trace as a span event.
  • Use OTel semantic conventions for attribute names (e.g., db.system, http.method, net.peer.name) so SigNoz can apply prebuilt dashboards; a short sketch follows this list.
  • Keep span names short and static (no dynamic IDs in the name) — put dynamic data in attributes.
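
The semantic-conventions package installed alongside the SDK exposes these attribute names as constants, which avoids typos. A minimal sketch rewriting two attributes from fetch_inventory above (newer releases of opentelemetry-semantic-conventions are reorganizing these constants, so check your installed version):

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("inventory.fetch") as span:
    # The constants resolve to the canonical attribute strings,
    # e.g. SpanAttributes.DB_SYSTEM == "db.system"
    span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")
    span.set_attribute(SpanAttributes.DB_NAME, "inventory_db")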

8. Context Propagation: Tracing Across Service Boundaries

Distributed traces only work if the trace_id and span_id are propagated between services. OpenTelemetry uses propagators that inject and extract context into HTTP headers, message queue metadata, or any other carrier.

Outgoing HTTP calls (app/client.py)

import httpx
from opentelemetry.propagate import inject
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


async def call_downstream(url: str) -> dict:
    headers: dict = {}
    inject(headers)  # Injects traceparent, tracestate headers

    async with httpx.AsyncClient() as client:
        with tracer.start_as_current_span("http.client.get") as span:
            span.set_attribute("http.url", url)
            response = await client.get(url, headers=headers)
            span.set_attribute("http.status_code", response.status_code)
            return response.json()

Incoming context extraction (handled automatically)

When opentelemetry-instrumentation-fastapi is active, it automatically extracts traceparent and tracestate from incoming request headers and sets the remote span as the parent of the server-side root span. You get end-to-end traces across services with no additional code.
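
If a service is not covered by an auto-instrumentation package (a plain WSGI handler, a queue consumer, and so on), you can extract the context yourself. A sketch, assuming the incoming headers are available as a dict; the handler name is hypothetical:

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)


def handle_message(headers: dict) -> None:
    # Rebuild the remote context from the traceparent/tracestate entries
    ctx = extract(headers)
    # Spans started with context=ctx become children of the caller's span
    with tracer.start_as_current_span("worker.handle", context=ctx):
        ...  # business logic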

W3C TraceContext and Baggage

The default propagator supports the W3C TraceContext specification (traceparent / tracestate headers) and W3C Baggage. Baggage lets you pass key-value pairs that flow with every subsequent span in the trace — useful for propagating a tenant_id or user_id without re-reading it from a database at every service hop:

from opentelemetry.baggage import set_baggage, get_baggage
from opentelemetry import context

ctx = set_baggage("tenant.id", "acme-corp")
token = context.attach(ctx)
try:
    await call_downstream("http://order-service/orders")
finally:
    context.detach(token)
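
Downstream (or deeper in the same service), the value can be read back from the active context. Baggage is not copied onto spans automatically, so promote it to a span attribute if you want to query it in SigNoz. A sketch with a hypothetical handler:

from opentelemetry import trace
from opentelemetry.baggage import get_baggage

tracer = trace.get_tracer(__name__)


def handle_order() -> None:
    tenant_id = get_baggage("tenant.id")  # read from the current context
    with tracer.start_as_current_span("order.process") as span:
        if tenant_id is not None:
            # Promote the baggage entry to an indexed span attribute
            span.set_attribute("tenant.id", str(tenant_id))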

9. Metrics: Counters, Histograms, and Gauges

The OpenTelemetry Metrics API offers several instrument types; the three you will reach for most often in production are counters, histograms, and observable gauges.

Configure the MeterProvider (app/telemetry.py, extended)

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter


def configure_metrics(service_name: str) -> metrics.Meter:
    exporter = OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=15_000)
    provider = MeterProvider(
        resource=Resource(attributes={SERVICE_NAME: service_name}),
        metric_readers=[reader],
    )
    metrics.set_meter_provider(provider)
    return metrics.get_meter(service_name)

Creating and using instruments (app/metrics.py)

from opentelemetry import metrics

meter = metrics.get_meter(__name__)

# Counter: monotonically increasing (requests served, errors, cache hits)
request_counter = meter.create_counter(
    name="http.server.requests.total",
    description="Total number of HTTP requests handled",
    unit="1",
)

# Histogram: distribution of values (latency, payload size)
request_latency = meter.create_histogram(
    name="http.server.request.duration",
    description="HTTP request duration in milliseconds",
    unit="ms",
)

# Observable gauge: current value at observation time (queue depth, connection pool size)
active_connections = meter.create_observable_gauge(
    name="db.connections.active",
    description="Number of active database connections",
    unit="1",
    callbacks=[lambda options: [metrics.Observation(get_active_connection_count())]],
)


def get_active_connection_count() -> int:
    # Replace with your actual connection pool introspection
    return 5


# Record measurements in your request handler
def record_request(method: str, route: str, status: int, duration_ms: float):
    labels = {"http.method": method, "http.route": route, "http.status_code": str(status)}
    request_counter.add(1, labels)
    request_latency.record(duration_ms, labels)
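
A usage sketch: time the handler with time.perf_counter and feed the measurement into record_request (the handler itself is hypothetical and assumes record_request from the module above is in scope):

import time


def handle_get_item(item_id: int) -> dict:
    start = time.perf_counter()
    status = 200
    try:
        return {"item_id": item_id, "status": "ok"}
    except Exception:
        status = 500
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        record_request("GET", "/items/{item_id}", status, duration_ms)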

Histograms in SigNoz are visualized as heatmaps and percentile time-series (p50, p95, p99), giving you a far richer view of latency distribution than averages alone.


10. Structured Logs with the OTel Logs API

OpenTelemetry Logs bridges traditional logging frameworks to the OTel pipeline, attaching trace_id and span_id to every log record emitted while a span is active. This is the mechanism that enables trace-to-log correlation in SigNoz.

Configure the LoggerProvider and bridge Python logging (app/telemetry.py, extended)

import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.instrumentation.logging import LoggingInstrumentor


def configure_logging(service_name: str) -> None:
    exporter = OTLPLogExporter(endpoint="http://localhost:4317", insecure=True)
    provider = LoggerProvider(
        resource=Resource(attributes={SERVICE_NAME: service_name})
    )
    provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
    set_logger_provider(provider)

    # Bridge the standard Python logging module into the OTel pipeline:
    # the LoggingHandler converts every stdlib log record into an OTel log
    # record (carrying the active trace_id/span_id) and hands it to the
    # provider configured above.
    logging.basicConfig(level=logging.INFO)
    logging.getLogger().addHandler(
        LoggingHandler(level=logging.INFO, logger_provider=provider)
    )

    # Optionally inject trace context into the console log format as well
    # (requires opentelemetry-instrumentation-logging)
    LoggingInstrumentor().instrument(set_logging_format=True)

Using the logger

import logging

logger = logging.getLogger(__name__)


def process_order(order_id: str, user_id: str) -> None:
    logger.info(
        "Processing order",
        extra={
            "order.id": order_id,
            "user.id": user_id,
            "order.source": "web",
        },
    )
    # ... business logic ...
    logger.info("Order processed successfully", extra={"order.id": order_id})

When process_order is called within an active span, every log record will automatically include trace_id and span_id fields. In SigNoz, you can click any trace span and jump directly to the correlated log lines.
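
If you prefer programmatic setup over the opentelemetry-instrument wrapper, the three configure_* helpers can be wired together once at startup. A sketch, assuming they live in app/telemetry.py as shown above:

# app/main.py (excerpt)
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from app.telemetry import configure_logging, configure_metrics, configure_tracing

configure_tracing("demo-service")
configure_metrics("demo-service")
configure_logging("demo-service")

app = FastAPI(title="demo-service")
# Attach the FastAPI auto-instrumentation to this specific app instance
FastAPIInstrumentor.instrument_app(app)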


11. Configure the OpenTelemetry Collector

The OTel Collector acts as a central telemetry hub. Rather than having each application instance export directly to SigNoz, applications send to the local Collector, which handles batching, retry, tail-based sampling, and fan-out to multiple backends.

Collector configuration (otel-collector-config.yaml)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  # Drop health-check noise
  filter/drop_health:
    traces:
      span:
        - 'attributes["http.route"] == "/health"'

  # Add environment tag to every span, metric, and log
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

  # Redact sensitive attribute values
  attributes/redact:
    actions:
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete

  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  otlp/signoz:
    endpoint: signoz-otel-collector:4317
    tls:
      insecure: true

  # Fan-out: also send to a local Prometheus scrape endpoint for Grafana
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_health, resource, attributes/redact, batch]
      exporters: [otlp/signoz]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/signoz, prometheus]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, attributes/redact, batch]
      exporters: [otlp/signoz]

Key design decisions in this config:

  • memory_limiter first: prevents the Collector from OOM-killing itself under burst load.
  • filter/drop_health: avoids polluting your trace backend with synthetic health-check spans that inflate request counts and cost storage.
  • resource processor: stamps every signal with the deployment environment so you can segment dashboards by production vs staging.
  • attributes/redact: a lightweight PII scrubbing layer — strip authorization headers and email addresses before they leave your network perimeter.

12. SigNoz Dashboards: Custom Dashboards and Alerts

Exploring traces

Navigate to Traces in the SigNoz sidebar. The trace explorer lets you filter by service, operation name, status, duration, and any custom attribute. Use the Group By feature to aggregate p99 latency by http.route and spot which endpoints are slowest.

Click any trace row to open the Flame Graph view — a Gantt chart of all spans in the trace with color-coded status. Error spans appear in red. Clicking a span opens its attribute panel, where you can see every key-value pair your instrumentation recorded.

Creating a custom dashboard

  1. Go to DashboardsNew Dashboard.
  2. Add a Time Series panel. Set the query to signoz_calls_total{service_name="demo-service", http_status_code!~"2.."}. This counts requests that returned a non-2xx status code, i.e. the errors for your service.
  3. Add a Histogram panel for p99 latency: histogram_quantile(0.99, sum(rate(signoz_latency_bucket{service_name="demo-service"}[5m])) by (le, http_route))
  4. Save the dashboard as "demo-service Overview".

Configuring alerts

  1. Go to AlertsNew Alert.
  2. Set the metric query to error rate above 1% for 5 minutes.
  3. Add a notification channel (Slack webhook, PagerDuty, or email).
  4. Set the alert severity to critical and save.

SigNoz alerts are PromQL-compatible, so any query that works in Prometheus will work here.


13. Correlating Traces, Metrics, and Logs in SigNoz UI

Correlation is where SigNoz earns its place in the stack. Here is the typical incident workflow:

  1. Alert fires: error rate on demo-service exceeds 1%. You receive a Slack notification with a link to the SigNoz alert.
  2. Metrics dashboard: you open the dashboard and see a spike in http.server.requests.total{http_status_code="500"} starting at 14:32 UTC.
  3. Trace explorer: filter to demo-service, status = ERROR, time range 14:30–14:35. Sort by duration descending. The slowest traces all share the operation inventory.fetch.
  4. Span detail: click the failing span. Attributes show db.system=postgresql and db.name=inventory_db. The span event contains a full Python stack trace pointing to a connection timeout.
  5. Log correlation: click View Logs in the span detail panel. SigNoz filters logs by trace_id automatically. You see the structured log: "Processing order failed: connection pool exhausted".
  6. Resolution: scale the PostgreSQL connection pool and redeploy. The error rate drops within one scrape interval.

This entire workflow — from alert to root cause — takes under three minutes with full telemetry correlation, compared to potentially hours of log-grepping without it.


14. Production Deployment Considerations

Sampling strategy

Collecting 100% of traces at high traffic is expensive. Use tail-based sampling in the OTel Collector to keep all error traces and a percentage of successful ones:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

High availability Collector deployment

In production, run multiple Collector replicas behind a load balancer. Use the loadbalancing exporter to route traces with the same trace_id to the same Collector replica — this is required for tail sampling to work correctly:

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector-headless.monitoring.svc.cluster.local
        port: 4317

Resource attribution

Always set SERVICE_NAME, SERVICE_VERSION, and DEPLOYMENT_ENVIRONMENT in your Resource. These attributes drive SigNoz service maps and let you correlate a spike to a specific version deployment:

import os

from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_VERSION

resource = Resource(attributes={
    SERVICE_NAME: "demo-service",
    SERVICE_VERSION: "1.4.2",
    "deployment.environment": "production",
    "k8s.namespace.name": "apps",
    "k8s.pod.name": os.environ.get("HOSTNAME", "unknown"),
})

ClickHouse data retention

SigNoz stores data in ClickHouse tables partitioned by day. Configure retention to balance cost and compliance:

-- Set trace retention to 30 days
ALTER TABLE signoz_traces.signoz_index_v2
  MODIFY TTL toDateTime(timestamp) + INTERVAL 30 DAY;

For metrics, the default is 30 days. For logs, adjust based on your compliance requirements.

Security hardening

  • Enable TLS on all Collector endpoints in production. Use cert-manager on Kubernetes to automate certificate rotation.
  • Use Kubernetes Secrets or HashiCorp Vault to inject OTEL_EXPORTER_OTLP_HEADERS (for authenticated OTLP endpoints) at runtime — never bake credentials into Docker images.
  • Restrict inbound access to the SigNoz UI (port 3301) to your internal network or VPN. The query service has no built-in authentication in the community edition.

15. FAQ

Q: Can I use SigNoz with Django instead of FastAPI? Yes. Install opentelemetry-instrumentation-django and add DjangoInstrumentor().instrument() to your manage.py or WSGI entry point. The rest of the setup is identical.
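
A minimal sketch of the Django hookup, assuming a standard manage.py and a project called mysite (the project name is hypothetical):

# manage.py (excerpt)
import os
import sys

from opentelemetry.instrumentation.django import DjangoInstrumentor


def main():
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
    # Patch Django before the application starts handling requests
    DjangoInstrumentor().instrument()
    from django.core.management import execute_from_command_line
    execute_from_command_line(sys.argv)


if __name__ == "__main__":
    main()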

Q: Does OpenTelemetry add significant latency overhead? In practice, no. The BatchSpanProcessor exports spans asynchronously in a background thread. At p99, the overhead is under 1 ms per request for typical span volumes. The memory_limiter in the Collector also prevents the export path from blocking the application.
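
If you do need to tune the export path, the BatchSpanProcessor constructor exposes queue and batching parameters. A sketch with the usual defaults written out explicitly (treat the exact values as illustrative and check the SDK version you run):

from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True),
    max_queue_size=2048,         # spans buffered before new ones are dropped
    schedule_delay_millis=5000,  # how often the background thread flushes
    max_export_batch_size=512,   # spans per OTLP export request
)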

Q: What is the difference between opentelemetry-exporter-otlp-proto-grpc and opentelemetry-exporter-otlp-proto-http? Both are included in opentelemetry-exporter-otlp. The gRPC transport (proto-grpc) has slightly lower overhead and supports streaming. The HTTP transport (proto-http) is easier to use through corporate firewalls and proxies. Prefer gRPC for intra-cluster communication and HTTP when crossing network boundaries.
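
Switching transports is a one-line change in code. A sketch of the HTTP/protobuf exporter, which pairs with the Collector's 4318 port (when the endpoint is passed to the constructor directly, the /v1/traces path is typically included):

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# HTTP/protobuf transport instead of gRPC
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")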

Q: How do I instrument Celery workers? Install opentelemetry-instrumentation-celery. It patches Celery's task execution hooks to create spans for each task and propagates the trace context from the producer to the consumer automatically.
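
Because Celery forks worker processes, the instrumentor is normally activated in the worker_process_init signal rather than at import time. A sketch:

from celery.signals import worker_process_init
from opentelemetry.instrumentation.celery import CeleryInstrumentor


@worker_process_init.connect(weak=False)
def init_celery_tracing(*args, **kwargs):
    # Instrument each freshly forked worker process
    CeleryInstrumentor().instrument()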

Q: Can I send data to both SigNoz and Datadog simultaneously? Yes. Configure two exporters in the Collector pipeline — otlp/signoz and datadog. Both will receive identical telemetry. This is a common migration pattern when evaluating SigNoz as a Datadog replacement.

Q: SigNoz is not showing my service. What should I check first?

  1. Confirm the Collector is reachable: curl http://localhost:4318/v1/traces should return 405, not a connection refused error.
  2. Check that OTEL_SERVICE_NAME is set correctly in your environment.
  3. Run docker compose logs otel-collector and look for export errors.
  4. Verify that OTEL_EXPORTER_OTLP_ENDPOINT does not include /v1/traces — that path is added automatically by the SDK.

