Tracing LLM apps: what to log when nothing crashes
- William Jacob
- Observability, Production
- 08 May, 2026
A traditional application crashes when something goes wrong. An LLM application returns a confident wrong answer and increments your success counter. Your standard observability stack — metrics, traces, exception tracking — will tell you the request finished in 1.2 seconds and report nothing else, while the user reads an answer that is structurally fine and factually incorrect.
The signals worth capturing
Capture the prompt and response — full text, not just hashes — for at least a sampled fraction of traffic, with PII handling appropriate to your domain. Log the model version, the temperature, the top-p, and any system prompt revision identifier. Capture input and output token counts as separate fields, not just a total cost figure. When a tool is called, log the arguments and the tool result, both as structured fields. Tracing libraries like OpenTelemetry handle the transport; the work is deciding what to put in the spans.
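As a minimal sketch, the fields above can be assembled into one flat attribute map per LLM call. The `gen_ai.*` keys are assumed to follow the OpenTelemetry GenAI semantic conventions, which are still evolving (check the current spec before adopting them); the `app.*` keys are hypothetical names for fields the conventions don't cover. Pass the result to your tracer's `span.set_attribute()` calls, or log it directly.

```python
import hashlib

def llm_span_attributes(prompt, response, model, temperature, top_p,
                        system_prompt_rev, input_tokens, output_tokens,
                        sampled=True):
    """Build the attribute map for one LLM-call span.

    gen_ai.* names assume the OpenTelemetry GenAI semantic conventions;
    app.* names are illustrative, not part of any spec.
    """
    attrs = {
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.request.top_p": top_p,
        "gen_ai.usage.input_tokens": input_tokens,    # separate fields,
        "gen_ai.usage.output_tokens": output_tokens,  # not just a cost total
        "app.system_prompt.revision": system_prompt_rev,  # hypothetical key
    }
    if sampled:
        # Full text for the sampled fraction of traffic.
        attrs["gen_ai.prompt"] = prompt
        attrs["gen_ai.completion"] = response
    else:
        # Unsampled traffic keeps a hash so traces still correlate
        # without storing the text itself.
        attrs["app.prompt.sha256"] = hashlib.sha256(prompt.encode()).hexdigest()
    return attrs
```

The flat map is a deliberate choice: most tracing backends index scalar attributes but not nested objects, so one level of dotted keys keeps every field queryable.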
What to do with the signals
The capture is half the work. The other half is sampling traces into a review queue — a small percentage of production traffic, plus all traces flagged by user thumbs-down or downstream classifiers. Reviewing fifty traces a week catches drift that no metric will. The teams I’ve seen ship reliable LLM apps all do this; the teams that struggle treat trace review as something to start “once we have time.”
The cheapest way to detect silent failure is to read your own traces. The most expensive way is to wait for users to tell you.