“Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally.” — loggingsucks.com
The Problem with How We Log
Stripe’s 2019 blog post on canonical log lines is one of those rare engineering pieces that stays relevant for years. The idea is deceptively simple: instead of letting your request telemetry scatter across dozens of log lines that need to be joined at query time, you emit one fat, information-dense log line per request — a canonical record of everything that happened.
At Stripe, this single line contains HTTP method, path, status, authenticated user, rate-limiting decision, database query count, duration — all in one place:
[2019-03-18 22:48:32.999] canonical-log-line alloc_count=9123 auth_type=api_key
database_queries=34 duration=0.009 http_method=POST http_path=/v1/charges
http_status=200 key_id=mk_123 rate_allowed=true rate_quota=100 request_id=req_123
That one line lets them answer questions like “which users are being rate-limited the most?” with a query that takes ten seconds to write, not ten minutes. No joins. No correlating request IDs across five log lines. One record, one query.
This is genuinely good engineering. But it was written in 2019, referencing Splunk, fluentd, Kafka, and Redshift. The ecosystem has shifted. Today we have OpenTelemetry as a vendor-neutral telemetry standard and VictoriaMetrics/VictoriaLogs as a leaner, faster, radically cheaper backend. Let me show you how the idea lands in 2026.
What Stripe Got Right (And What the Ecosystem Missed)
Most logging in production falls into one of two failure modes:
Scatter logs. Five log lines per request, each with a piece of the story. Correlating them requires your log processing system to do expensive joins on a request ID — slow to run, painful to write under incident pressure.
Metrics-only thinking. Pre-defined Prometheus dashboards are fast but completely inflexible. If you didn’t define the metric before the incident started, you can’t answer the question. You’re always partially flying blind.
Canonical log lines give you the best of both: the flexibility of logs with query ergonomics that approach those of metrics. The modern observability community has rediscovered this under the name wide events — one comprehensive event per request, every attribute attached.
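In miniature, the contrast looks like this. A sketch in plain Python (the handler and field names are illustrative, not from Stripe's post): the same request either produces five scattered lines to be joined later, or one wide event that carries the whole story.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("api")

# Failure mode 1: five scattered lines, joined later on request_id
def handle_scattered(request_id):
    log.info("request started request_id=%s", request_id)
    log.info("authenticated user_id=usr_42 request_id=%s", request_id)
    log.info("rate limit ok remaining=97 request_id=%s", request_id)
    log.info("db queries done count=3 request_id=%s", request_id)
    log.info("request finished status=200 duration=0.009 request_id=%s", request_id)

# The canonical alternative: one wide event carrying the whole story
def handle_canonical(request_id):
    event = {
        "request_id": request_id,
        "user_id": "usr_42",
        "rate_allowed": True,
        "rate_remaining": 97,
        "db_query_count": 3,
        "http_status": 200,
        "duration": 0.009,
    }
    log.info(json.dumps(event))
    return event

handle_canonical("req_123")
```

Answering "which users were rate-limited?" against the first shape requires a join; against the second, it's a filter on one record.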
What the ecosystem has been slow to realize is that OpenTelemetry already provides the plumbing to emit these wide events in a standardized, vendor-neutral format — and VictoriaLogs provides the storage and query layer that makes exploiting them practical and cheap.
Bringing in OpenTelemetry
OpenTelemetry is a delivery mechanism and a data model. It doesn’t tell you what to log — that’s still your job. But it gives you a consistent way to structure, enrich, and ship telemetry to any backend.
The OTel log data model wraps your log record with these key fields:
| Field | What it carries |
|---|---|
| Body | The primary log message |
| Attributes | Arbitrary key-value pairs — this is where your canonical fields live |
| Resource | The emitting entity (service name, k8s pod, host) |
| TraceId / SpanId | W3C Trace Context — automatic log-trace correlation |
| SeverityNumber | Standardized severity (INFO, WARN, ERROR) |
The critical point: OTel Attributes are typed. duration is a float, not a string. db.query_count is an integer. This matters downstream — percentile calculations and range queries are significantly faster on typed fields than on parsed strings.
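As a local analogy (this is plain Python, not VictoriaLogs internals), typed fields make aggregation a one-liner, while string-typed durations force a parse of every record first:

```python
import statistics

# Typed attribute: duration is already a float, so aggregate directly
typed_durations = [0.009, 0.012, 0.150, 0.011, 0.013]
p95 = statistics.quantiles(typed_durations, n=20)[-1]  # 95th-percentile cut point

# String attribute: every record must be parsed before any math can happen
string_durations = ["0.009", "0.012", "0.150", "0.011", "0.013"]
p95_parsed = statistics.quantiles([float(d) for d in string_durations], n=20)[-1]

assert p95 == p95_parsed  # same answer; the typed path skips the per-record parse
```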
The Middleware Pattern
The implementation pattern is the same regardless of language. A single outer middleware initializes a context object, inner middleware layers (auth, rate limiter, etc.) annotate it with their piece of the story, and the outer middleware emits the final canonical line on the way out.
# Each layer contributes its piece — no log statements scattered around
def auth_middleware(request):
    user = authenticate(request)
    annotate(request, auth_type="api_key", user_id=user.id)

def rate_limit_middleware(request):
    decision = check_rate_limit(request)
    annotate(request, rate_allowed=decision.allowed, rate_remaining=decision.remaining)

# The outer middleware drains everything into ONE canonical log line
def canonical_log_middleware(request):
    start = time.monotonic()
    yield  # entire request lifecycle runs here
    emit_canonical_line(request, duration=time.monotonic() - start)
That’s it. Every concern stays encapsulated. No cross-cutting log statements. One line emitted per request, carrying context from all layers. The emit_canonical_line call goes out via the OTel Log SDK — as a structured log record with typed attributes, routed through the OTel Collector to VictoriaLogs.
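The annotate and emit_canonical_line helpers above are deliberately left abstract. One way to back them in plain Python (the Request class and all names here are illustrative; a real version would hand the record to the OTel Log SDK instead of printing it):

```python
import time

# Hypothetical helpers behind the middleware sketch: each request object
# carries one dict that every layer appends to.
class Request:
    def __init__(self, method, path):
        self.method = method
        self.path = path
        self.canonical = {}  # the one record everything annotates

def annotate(request, **fields):
    request.canonical.update(fields)

def emit_canonical_line(request, **fields):
    record = {"http_method": request.method, "http_path": request.path}
    record.update(request.canonical)
    record.update(fields)
    # A real implementation would emit this via the OTel Log SDK;
    # printing stands in for that here.
    print("canonical-log-line", record)
    return record

# Minimal end-to-end run of the pattern
req = Request("POST", "/v1/charges")
start = time.monotonic()
annotate(req, auth_type="api_key", user_id="usr_42")
annotate(req, rate_allowed=True, rate_remaining=97)
emit_canonical_line(req, duration=time.monotonic() - start, http_status=200)
```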
The Stack: OTel Collector → VictoriaMetrics
Stripe’s original pipeline was: servers → fluentd → Splunk + Kafka → S3 → Redshift. That works, but it’s expensive and operationally heavy.
The modern equivalent:
Application
    │ OTLP/HTTP
    ▼
OpenTelemetry Collector
    ├──► VictoriaMetrics :8428  (metrics)
    ├──► VictoriaLogs    :9428  (canonical log lines)
    └──► VictoriaTraces  :10428 (distributed traces)
              │
              └──► Grafana (unified dashboards + alerts)
The VictoriaMetrics stack supports OTLP natively across all three signals. The Collector pushes directly to each component — no intermediate Kafka, no consumer pipelines. The relevant Collector config is essentially two lines per destination:
exporters:
  otlphttp/victorialogs:
    logs_endpoint: http://victorialogs:9428/insert/opentelemetry/v1/logs
    headers:
      VL-Stream-Fields: service.name,environment  # defines the log stream partitioning key
VL-Stream-Fields tells VictoriaLogs which attributes to use as stream labels — the partitioning key for efficient storage and retrieval. Choose something stable and low-cardinality here (service name, environment) and let everything else be queryable log fields.
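For completeness, a minimal Collector pipeline wiring an OTLP receiver to that exporter might look roughly like this (a sketch, not a drop-in file; the logs-only pipeline and the hostname are assumptions):

```yaml
receivers:
  otlp:
    protocols:
      http:  # applications push canonical lines via OTLP/HTTP

exporters:
  otlphttp/victorialogs:
    logs_endpoint: http://victorialogs:9428/insert/opentelemetry/v1/logs
    headers:
      VL-Stream-Fields: service.name,environment

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlphttp/victorialogs]
```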
Querying: LogsQL vs Splunk SPL
VictoriaLogs uses LogsQL — a purpose-built query language for exactly the use case canonical log lines were built for: fast, ad-hoc queries over structured log fields.
Let me translate Stripe’s example queries directly:
Which users are being rate-limited the most?
# Stripe's Splunk version:
canonical-log-line rate_allowed=false | stats count by user_id
# LogsQL:
_time:1h "canonical-log-line" rate.allowed:false
| stats by (auth.user_id) count() as hits
| sort by (hits) desc
p50/p95/p99 latency on the charges endpoint, excluding 4xx:
_time:1h "canonical-log-line" http.path:="/v1/charges" -http.status:~"4.."
| stats
    quantile(0.50, duration_seconds) as p50,
    quantile(0.95, duration_seconds) as p95,
    quantile(0.99, duration_seconds) as p99
Error rate by endpoint over 24h:
_time:24h "canonical-log-line"
| stats by (http.path)
    count() as total,
    count() if (http.status:~"5..") as errors
| math errors / total * 100 as error_rate_pct
| sort by (error_rate_pct) desc
The queries are nearly identical to Stripe’s Splunk SPL. The ergonomics hold. What’s different: VictoriaLogs automatically indexes every field in your log entries without schema definition. When a canonical line arrives with 20 attributes, all 20 are immediately queryable — no index configuration, no mapping explosions.
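These queries are also scriptable: VictoriaLogs serves LogsQL over HTTP at /select/logsql/query, returning one JSON object per matching row. A standard-library sketch (the base URL is an assumption):

```python
import urllib.parse
import urllib.request

def build_query_request(base_url, query):
    """Build a POST request against VictoriaLogs' /select/logsql/query endpoint."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{base_url}/select/logsql/query",
        data=params.encode(),  # the LogsQL query goes in the POST form body
        method="POST",
    )

req = build_query_request(
    "http://victorialogs:9428",
    '_time:1h "canonical-log-line" rate.allowed:false '
    "| stats by (auth.user_id) count() as hits | sort by (hits) desc",
)
# Sending it streams back newline-delimited JSON, one result row per line:
# for line in urllib.request.urlopen(req):
#     print(line.decode())
```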
The Trace Correlation Superpower
Here’s something Stripe’s 2019 post couldn’t offer: every canonical log line is automatically correlated to a distributed trace.
Because the OTel SDK emits log records within an active trace span, trace_id and span_id are injected automatically — no extra code from you. When you find a suspicious canonical log line during an incident, you click directly to the full trace in VictoriaTraces to see exactly which downstream service caused the latency.
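The identity being propagated is the W3C traceparent header, formatted as version-traceid-spanid-flags. If you ever need to attach it by hand, say in a component without the SDK, a small sketch of splitting it into canonical-line fields:

```python
def trace_fields_from_traceparent(traceparent):
    """Split a W3C traceparent header (version-traceid-spanid-flags)
    into the trace_id/span_id fields of a canonical log line."""
    _version, trace_id, span_id, _flags = traceparent.split("-")
    return {"trace_id": trace_id, "span_id": span_id}

# Example header using the trace id from the W3C Trace Context spec
fields = trace_fields_from_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```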
In Grafana, you configure this once as a derived field on the VictoriaLogs datasource:
matcherRegex: trace_id=(\w+)
url: /explore?datasource=victoriatraces&traceId=${__value.raw}
Now every canonical log line in Grafana Explore has a clickable “View Trace” link. The jump from “this request was slow” to “here’s why” is a single click — something that used to require ten minutes of manual correlation.
Turning Log Signals into Metrics
One more trick the VictoriaMetrics stack enables: deriving Prometheus-compatible metrics from your canonical log lines via vmalert recording rules, persisted into VictoriaMetrics for long-term retention and fast alerting.
# Derive an error rate metric directly from canonical log lines
# (a sketch assuming vmalert's VictoriaLogs rule support: a rule group
#  evaluated against VictoriaLogs with a LogsQL expression)
- record: api:error_rate:5m
  expr: |
    _time:5m "canonical-log-line"
    | stats count() if (http.status:~"5..") as errors, count() as total
    | math errors / total * 100 as error_rate
This gives you the best of all three worlds — raw logs for incident investigation, derived metrics for dashboards and alerting, and traces for deep request-level debugging. The canonical log line becomes the single source of truth that powers all three signals without duplicating instrumentation.
Why VictoriaMetrics Over the Alternatives
vs Elasticsearch: VictoriaLogs claims up to 30x less RAM for equivalent workloads. Users have reported replacing 27-node Elasticsearch clusters with a single VictoriaLogs node. That’s not a marginal improvement — it’s a fundamental cost shift.
vs Grafana Loki: Loki treats high-cardinality labels — like user_id, trace_id, request_id — as a performance problem that degrades your stream index badly. VictoriaLogs was explicitly designed for high-cardinality fields. Canonical log lines are naturally high-cardinality. The fit is obvious.
vs Datadog/managed platforms: No egress pricing shock. No per-seat licensing. VictoriaLogs is open source, a single binary, zero-config by default. For a Central Observability Platform serving multiple teams, the cost difference at scale is an order of magnitude.
The trade-off worth being honest about: LogsQL is younger than Splunk SPL, and for long-term analytics with complex joins you’d still want a data warehouse downstream. But for operational observability — incident response, rate-limiting analysis, latency debugging — VictoriaLogs more than covers the ground.
The Mental Model Shift
The deepest lesson from Stripe’s post, and the one most engineers miss:
Stop logging what your code is doing. Start logging what happened to this request.
It’s not about more log lines. It’s about one record per request that captures the full business context — who made the request, what they wanted, what your system decided, how long it took, and how many resources it consumed. Everything in one place, ready to answer any question you didn’t know you’d need to ask.
With OpenTelemetry as the instrumentation layer and VictoriaMetrics as the backend, you get vendor lock-in elimination, automatic trace correlation, schema-free storage, and dramatically lower cost compared to managed platforms — all without sacrificing the simplicity that made Stripe’s original idea so compelling.
Stripe’s canonical log lines were ahead of their time in 2019. The tooling has finally caught up.

