Tracing LLM apps: what to log when nothing crashes

A traditional application crashes when something goes wrong. An LLM application returns a confident wrong answer and increments your success counter. Your standard observability stack — metrics, traces, exception tracking — will tell you the request finished in 1.2 seconds and report nothing else, while the user reads an answer that is structurally fine and factually incorrect.

The signals worth capturing

Capture the prompt and response — full text, not just hashes — for at least a sampled fraction of traffic, with PII handling appropriate to your domain. Log the model version, the temperature, the top-p, and any system prompt revision identifier. Capture token counts as separate fields, not just total cost. When a tool is called, log the arguments and the tool result, both as structured fields. Tracing libraries like OpenTelemetry handle the transport; the work is deciding what to put in the spans.
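
As a rough sketch of "deciding what to put in the spans", here is how those fields might map onto OpenTelemetry span attributes in Python, assuming an OpenAI-style client. The attribute names and the chat wrapper are illustrative choices, not official semantic conventions.

```python
import json
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def traced_chat(client, messages, model="gpt-4o", temperature=0.2, top_p=1.0,
                system_prompt_rev="rev-2024-06-01"):
    with tracer.start_as_current_span("llm.chat") as span:
        # Request parameters: enough to reproduce the call later.
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.temperature", temperature)
        span.set_attribute("llm.top_p", top_p)
        span.set_attribute("llm.system_prompt_rev", system_prompt_rev)
        # Full prompt text; apply your own sampling and PII policy here.
        span.set_attribute("llm.prompt", json.dumps(messages))

        response = client.chat.completions.create(
            model=model, messages=messages,
            temperature=temperature, top_p=top_p,
        )

        # Response text and token counts as separate fields, not just total cost.
        span.set_attribute("llm.response", response.choices[0].message.content)
        span.set_attribute("llm.tokens.prompt", response.usage.prompt_tokens)
        span.set_attribute("llm.tokens.completion", response.usage.completion_tokens)
        return response
```

Tool calls would get the same treatment: a child span per call, with the arguments and the tool result attached as structured attributes rather than folded into a log line.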

What to do with the signals

The capture is half the work. The other half is sampling traces into a review queue — a small percentage of production traffic, plus all traces flagged by user thumbs-down or downstream classifiers. Reviewing fifty traces a week catches drift that no metric will. The teams I’ve seen ship reliable LLM apps all do this; the teams that struggle treat trace review as something to start “once we have time.”
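
One way to wire that up, sketched below with hypothetical names: flagged traces go to the queue unconditionally, and a small random slice of everything else rides along. The Trace shape, the queue, and the 2% rate are assumptions for illustration.

```python
import random
from dataclasses import dataclass, field

REVIEW_SAMPLE_RATE = 0.02  # ~2% of unflagged production traffic

@dataclass
class Trace:
    trace_id: str
    user_thumbs_down: bool = False
    classifier_flags: list[str] = field(default_factory=list)

def should_review(trace: Trace) -> bool:
    # Always review traces flagged by user feedback or downstream classifiers.
    if trace.user_thumbs_down or trace.classifier_flags:
        return True
    # Plus a random sample of everything that looked fine.
    return random.random() < REVIEW_SAMPLE_RATE

def enqueue_for_review(trace: Trace, queue: list[Trace]) -> None:
    if should_review(trace):
        queue.append(trace)
```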

The cheapest way to detect silent failure is to read your own traces. The most expensive way is to wait for users to tell you.
