Chain-of-thought prompting that holds up under pressure

Chain-of-thought prompting is the easiest reasoning trick to ship and the hardest to keep working. The basic idea — let the model write its working before the answer — is robust in research, but in production the same prompt that worked at 10 RPS quietly degrades at 1000.

The failure modes nobody warned you about

Token bloat is the obvious cost: every reasoning trace you tolerate is two or three times the length of the response you actually wanted. The harder failures are silent. Reasoning that is too short collapses into the answer, which means the model is no longer thinking step by step; it is giving you the same answer with a longer preamble. Reasoning that is too long drifts: given room to second-guess, the model talks itself out of correct answers.
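One way to catch these silent failures in production is a length guard on the trace itself. A minimal sketch, assuming whitespace word counts as a stand-in for tokens; the thresholds below are illustrative, not recommendations, and should be tuned against traces you know converged:

```python
from dataclasses import dataclass

@dataclass
class TraceCheck:
    ok: bool
    reason: str

# Illustrative thresholds; calibrate on your own converged traces.
MIN_WORDS = 20    # below this, reasoning has likely collapsed into the answer
MAX_WORDS = 300   # above this, the model has room to second-guess itself

def check_trace(trace: str) -> TraceCheck:
    """Flag reasoning traces that are too short or too long to trust."""
    n = len(trace.split())  # crude whitespace word count, not real tokens
    if n < MIN_WORDS:
        return TraceCheck(False, "collapsed: trace too short to be real reasoning")
    if n > MAX_WORDS:
        return TraceCheck(False, "drift risk: trace long enough to self-sabotage")
    return TraceCheck(True, "within bounds")
```

Flagged traces can be retried, logged for review, or routed to the reasoning-free path rather than shipped as-is.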

Patterns that hold up

Constrain the shape of reasoning, not its content: force a numbered list with a fixed cap. Split the work across two model calls, a cheap model for the reasoning trace and the expensive model for the final answer; cheap CoT, expensive answer is a real production trade-off. Cache the reasoning-free version for prompts where you’ve already seen the chain converge. None of this is glamorous, and all of it works.
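The first two patterns can be sketched together. This is a minimal illustration, not a definitive implementation: the models are assumed to be plain `prompt -> text` callables (any LLM client can be wrapped this way), the prompt wording is hypothetical, and the shape check enforces only the capped numbered list:

```python
import re

MAX_STEPS = 5  # fixed cap on reasoning steps

REASONING_PROMPT = (
    "Think through the question in at most {cap} numbered steps "
    '("1.", "2.", ...), one line each. Do not state a final answer.\n\n'
    "Question: {question}"
)

ANSWER_PROMPT = (
    "Question: {question}\n\nReasoning:\n{reasoning}\n\n"
    "Give only the final answer."
)

def valid_shape(trace: str, cap: int = MAX_STEPS) -> bool:
    """Accept only a numbered list: steps 1..n in order, n <= cap."""
    lines = [ln.strip() for ln in trace.strip().splitlines() if ln.strip()]
    if not lines or len(lines) > cap:
        return False
    return all(re.match(rf"^{i}\.\s+\S", ln) for i, ln in enumerate(lines, 1))

def answer_with_cot(question: str, cheap_model, expensive_model) -> str:
    """Two-call CoT: cheap model writes the trace, expensive model answers."""
    trace = cheap_model(REASONING_PROMPT.format(cap=MAX_STEPS, question=question))
    if not valid_shape(trace):
        # Malformed trace: fall back to a direct answer rather than ship it.
        return expensive_model(f"Question: {question}\nGive only the final answer.")
    return expensive_model(ANSWER_PROMPT.format(question=question, reasoning=trace))
```

Passing the models as callables keeps the pattern client-agnostic, and the shape check gives you a clean place to count how often traces come back malformed.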

The version of CoT you ship is rarely the version you wrote in the playground. That’s expected.
