Temperature and top-p: tuning when the answer matters more than novelty

Temperature and top-p are the two sampling parameters every team adjusts and almost no team tunes systematically. The default of 0.7 is everyone’s first guess, the second guess is 0, and that’s where most projects stop. The real cost shows up later: classification tasks running at creative-writing temperatures, and creative-writing tasks suffocating at temperature zero.

The decision rule that actually scales

For tasks with a single correct answer — classification, extraction, structured output — temperature should be 0 and top-p doesn’t matter. For tasks with many acceptable answers — summarization, rewriting — 0.5 to 0.7 with top-p around 0.9 is a reasonable starting point. For genuinely creative work, 0.8 to 1.0 is the right band, but always with top-p capped to avoid the tail of low-probability tokens that cause incoherence.
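One way to make that rule stick is to encode it as named presets rather than a project-wide default. Here is a minimal sketch in Python; the task names, values, and lookup helper are illustrative assumptions, not a prescribed API, so adapt them to your own wrapper and task taxonomy.

```python
# A sketch of per-task sampling presets (values follow the bands above;
# the task names and helper are illustrative, not a fixed API).
SAMPLING_PRESETS = {
    # Single correct answer: deterministic-ish decoding, top_p effectively irrelevant.
    "classification": {"temperature": 0.0, "top_p": 1.0},
    "extraction":     {"temperature": 0.0, "top_p": 1.0},
    # Many acceptable answers: moderate temperature, mild nucleus cap.
    "summarization":  {"temperature": 0.6, "top_p": 0.9},
    "rewriting":      {"temperature": 0.6, "top_p": 0.9},
    # Genuinely creative work: higher temperature, but still cap the tail.
    "creative":       {"temperature": 0.9, "top_p": 0.95},
}

def sampling_params(task: str) -> dict:
    """Look up sampling parameters for a task; fail loudly on unknown tasks
    so every new endpoint gets an explicit decision instead of a silent default."""
    try:
        return SAMPLING_PRESETS[task]
    except KeyError:
        raise ValueError(f"No sampling preset for task {task!r}; choose one deliberately.")
```

Failing on unknown tasks is deliberate: it forces the sampling decision to happen when an endpoint is added, not when it misbehaves in production.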

What the defaults hide

Setting temperature to 0 doesn’t make models deterministic — there’s still floating-point noise in tied probabilities. Two identical calls can produce different outputs. If you need true reproducibility, you need to capture the seed too, and not all APIs expose it. Treat temperature 0 as low-variance, not zero-variance, and your tests will stop being flaky.
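In tests, that means asserting agreement across repeated runs rather than exact string equality. Below is a minimal sketch of that idea; `call_model` is a hypothetical callable wrapping whatever LLM API you use at temperature 0 (and a fixed seed, if your provider exposes one), and the run count and agreement threshold are assumptions to tune.

```python
# A sketch of treating temperature 0 as low-variance rather than zero-variance:
# run the call several times and require that a clear majority agree, instead of
# asserting an exact match on a single run. `call_model` is a hypothetical wrapper
# around your LLM API, configured with temperature 0 (and a seed, if available).
from collections import Counter
from typing import Callable

def stable_answer(
    call_model: Callable[[str], str],
    prompt: str,
    runs: int = 5,
    min_agreement: float = 0.8,
) -> str:
    answers = [call_model(prompt).strip() for _ in range(runs)]
    answer, count = Counter(answers).most_common(1)[0]
    if count / runs < min_agreement:
        raise AssertionError(f"Unstable output at temperature 0: {answers}")
    return answer
```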

The teams that ship reliable LLM features pick sampling parameters per task, not per project. The default config is the wrong config for half your endpoints.
