Self-Consistency

Also known as: Majority voting CoT

Sample the same Chain-of-Thought prompt N times and take the majority answer. In Wang et al. (2022) this outperforms single-sample (greedy) CoT on arithmetic and commonsense reasoning benchmarks.

When to use it

Numerical reasoning, logic, and multi-step math.
Anywhere a single CoT trace is sometimes confidently wrong.
When you can afford N× the API cost for a meaningful accuracy bump.
Production systems with quality SLAs on reasoning correctness.
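
A rough back-of-envelope model can help judge whether the N× cost is worth it. The sketch below rests on a strong assumption that is not from the source: each sample is independently correct with probability p, and every wrong sample lands on the same wrong answer (the worst case for voting). Real samples are correlated, so treat the numbers as an illustration, not a prediction.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Chance that a strict majority of n samples is correct.

    Idealized assumption: samples are independent, each correct with
    probability p, and all wrong samples share one wrong answer.
    """
    need = n // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 5, 15):
        print(n, round(majority_accuracy(0.6, n), 3))
```

Under these assumptions, a per-sample accuracy of 0.6 rises to about 0.68 with five samples. The same formula also shows the downside: when p is below 0.5, voting makes the result worse, not better.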

When not to use it

Open-ended creative tasks — there's no 'majority' answer.
Cost-sensitive flows where N samples is prohibitive.
Real-time chat with strict latency budgets.

How it works

1. Run the same Chain-of-Thought prompt N times with temperature > 0.
2. Each sample produces its own reasoning chain and final answer.
3. Tally the final answers and pick the most frequent one.
4. Optionally weight the votes by reasoning quality (e.g. shorter chains, fewer hedges).
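
The tally and the optional weighting can be sketched as below. This is a minimal illustration, not a reference implementation: the "Answer: ..." output convention, the helper names `extract_answer` and `vote`, and the length-based weight are all assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Callable, Iterable, Optional

# Assumed convention: each completion states its result as "Answer: <value>".
ANSWER_RE = re.compile(r"answer\s*[:=]\s*(.+)", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last 'Answer: ...' value in a completion, normalized."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].strip().rstrip(".").lower()


def vote(
    completions: Iterable[str],
    weight_fn: Optional[Callable[[str], float]] = None,
) -> Optional[str]:
    """Pick the answer with the highest (optionally weighted) vote total.

    Completions without a parseable answer are skipped. Ties go to the
    answer that was seen first.
    """
    scores = defaultdict(float)
    for text in completions:
        answer = extract_answer(text)
        if answer is None:
            continue
        scores[answer] += weight_fn(text) if weight_fn else 1.0
    if not scores:
        return None
    return max(scores, key=scores.get)


def shorter_is_better(text: str) -> float:
    """Hypothetical quality weight: shorter reasoning chains count more."""
    return 1.0 / (1 + len(text.split()))
```

Calling `vote(completions)` gives the plain majority answer; `vote(completions, weight_fn=shorter_is_better)` applies the weighting. Any weight function is a heuristic and can flip the result, so it is worth checking against the unweighted vote.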

Example

Lazy prompt

Let's think step by step about <hard problem>.

Using the technique

Sample this CoT prompt 5 times (temperature 0.7). For each, record the final answer. Return the answer that appears most often, and flag if no answer reached majority.
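
In code, that instruction might look like the sketch below. The `generate(prompt, temperature)` callable is a hypothetical stand-in for whatever model client you use, and the "answer is on the last non-empty line" convention is also an assumption; here a stub that replays canned completions keeps the example runnable offline.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List, NamedTuple, Optional


class VoteResult(NamedTuple):
    answer: Optional[str]
    votes: int
    has_majority: bool  # True only if the answer got more than half the votes


def final_answer(completion: str) -> str:
    """Assumed convention: the last non-empty line holds the final answer."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def self_consistency(
    prompt: str,
    generate: Callable[[str, float], str],
    n: int = 5,
    temperature: float = 0.7,
) -> VoteResult:
    """Sample n completions, tally final answers, flag a missing majority."""
    answers = [final_answer(generate(prompt, temperature)) for _ in range(n)]
    counts = Counter(a for a in answers if a)
    if not counts:
        return VoteResult(None, 0, False)
    answer, votes = counts.most_common(1)[0]
    return VoteResult(answer, votes, votes > n / 2)


def make_stub(completions: List[str]) -> Callable[[str, float], str]:
    """Hypothetical model stub: replays canned completions in order."""
    canned = cycle(completions)
    return lambda prompt, temperature: next(canned)


if __name__ == "__main__":
    stub = make_stub(["...\n18", "...\n20", "...\n18", "...\n18", "...\n22"])
    print(self_consistency("Let's think step by step about <hard problem>.", stub))
```

When `has_majority` is false, the caller can decide what to do: sample more, fall back to a single answer, or escalate to a human.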

Common pitfalls

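N× cost: only worth it when the accuracy gain justifies the spend.
Temperature too high produces noisy, scattered answers; too low makes every sample follow nearly the same chain, so the vote adds little and can simply repeat the same wrong answer.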
Majority isn't always right; on adversarial questions it can lock in the popular-but-wrong answer.
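
One way to spot the temperature pitfall is to log how much the samples agree. The helper below is a hypothetical diagnostic, not part of the original method: full agreement on every question suggests the extra samples add little, while nearly all-distinct answers suggest the vote is mostly noise. It cannot detect the third pitfall, a confident majority that is wrong.

```python
from collections import Counter
from typing import Dict, List, Union


def agreement_stats(answers: List[str]) -> Dict[str, Union[str, float, int]]:
    """Summarize how concentrated a set of sampled final answers is."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top_answer, top_votes = counts.most_common(1)[0]
    return {
        "top_answer": top_answer,
        "agreement": top_votes / len(answers),  # share of votes for the winner
        "distinct": len(counts),                # number of different answers
    }


if __name__ == "__main__":
    print(agreement_stats(["18", "18", "20", "18", "22"]))
```

Tracking these numbers across a batch of questions gives a cheap signal for tuning temperature and N; what counts as "too much" or "too little" agreement depends on the task.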

Where this came from

Wang et al., 2022 — "Self-Consistency Improves Chain of Thought Reasoning in Language Models".

Related techniques

Chain-of-Thought (CoT) Prompting

Force the model to think step-by-step before answering. Dramatically improves accuracy on multi-step problems.

Tree-of-Thoughts (ToT) Prompting

Generate multiple reasoning branches per step, evaluate each, and prune. Beats single-path Chain-of-Thought on hard decisions.

Self-Refine

Generate → critique own output → revise → repeat. Pushes a model's output much closer to its capability ceiling.