Glossary · Technique
Retrieval-Augmented Generation (RAG)
Also known as: RAG, Retrieval-augmented LLM
Don't train on it — retrieve it. Inject relevant documents into the prompt at runtime so the model answers from real source material.
When to use it
- Anywhere the source material changes faster than you can fine-tune.
- Internal-document Q&A (legal contracts, code repos, support tickets).
- Citation-required answers where you need to trace which source said what.
- Domains where hallucination cost is high (medical, legal, finance).
When not to use it
- Open-ended creative tasks with no specific source material.
- Real-time chat where retrieval latency matters more than precision.
- Tasks the base model already handles well from pre-training.
How it works
1. Chunk your documents (usually 200–1000 tokens per chunk), embed each chunk, and store the embeddings in a vector database.
2. At query time, embed the user's question and find the top-K most similar chunks.
3. Inject those chunks into the model's context, along with the original question.
4. Optionally use re-ranking, multi-query generation, or HyDE (hypothetical document embeddings) for better retrieval.
5. Always cite which chunks were used in the response, so every claim leaves an auditable trail.
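The chunk, embed, and retrieve steps can be sketched in a few lines of Python. This is a minimal illustration, not a production design. It assumes a toy bag-of-words "embedding" and an in-memory list in place of a learned embedding model and a vector database, and it counts words rather than tokens. The function names (`chunk_words`, `embed`, `build_index`, `top_k`) are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word windows (words stand in for tokens)."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text):
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs, size=50, overlap=10):
    """Step 1: chunk every document and store (doc_name, chunk, vector)."""
    index = []
    for name, text in docs.items():
        for chunk in chunk_words(text, size, overlap):
            index.append((name, chunk, embed(chunk)))
    return index


def top_k(index, question, k=2):
    """Step 2: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(name, chunk) for name, chunk, _ in ranked[:k]]


docs = {
    "returns.md": "Items can be returned within 30 days of delivery for a full refund.",
    "shipping.md": "Standard shipping takes five business days.",
}
index = build_index(docs)
hits = top_k(index, "Can items be returned for a refund?", k=1)
```

Each hit keeps its document name next to the chunk text, which is what makes the citation in step 5 possible later on.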
Example
Lazy prompt
What's our return policy?
Using the technique
Use only the documents in <retrieved> tags to answer. If the docs don't contain the answer, say 'not found in policy docs'. Cite the document name for every claim.
<retrieved>
...top-K vector-search results pasted here...
</retrieved>
Question: What's our return policy?
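A grounded prompt like the one above could be assembled programmatically as follows. This is a sketch under assumptions: `build_prompt` is a hypothetical helper, and wrapping each chunk in a `<doc name="...">` tag is one illustrative way to give the model a document name to cite, not a required format.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from (doc_name, text) pairs.

    The <doc name="..."> wrapper is an illustrative convention (assumption)
    so the model has a document name available for each citation.
    """
    docs = "\n".join(f'<doc name="{name}">{text}</doc>' for name, text in chunks)
    return (
        "Use only the documents in <retrieved> tags to answer. "
        "If the docs don't contain the answer, say 'not found in policy docs'. "
        "Cite the document name for every claim.\n"
        f"<retrieved>\n{docs}\n</retrieved>\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What's our return policy?",
    [("returns.md", "Items can be returned within 30 days of delivery.")],
)
```

Keeping the instructions, the retrieved material, and the question in fixed positions makes it easier to compare prompts across queries when debugging retrieval.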
Common pitfalls
- Bad retrieval = bad answer. Garbage chunks in, garbage answer out.
- Chunk size and overlap tuning often matters more than the choice of model.
- Models can still hallucinate even with retrieved context — instruct them to say 'I don't see this in the docs' when applicable.
- Don't retrieve from sources the user isn't authorized to see. Enforce auth boundaries at retrieval time, before anything reaches the prompt.
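The auth-boundary pitfall can be handled by filtering candidate chunks against the user's permissions before ranking, so restricted text never enters the prompt. The sketch below assumes a simple group-based access model. The `Chunk` fields and the `visible_chunks` helper are hypothetical, and a real deployment would more likely push this filter into the vector database query itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_name: str
    text: str
    allowed_groups: frozenset  # groups permitted to read the source document


def visible_chunks(chunks, user_groups):
    """Keep only chunks that share at least one group with the user."""
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


corpus = [
    Chunk("handbook.md", "Office hours are 9 to 5.", frozenset({"staff", "hr"})),
    Chunk("salaries.md", "Salary bands by level.", frozenset({"hr"})),
]

# A staff member only ever sees the handbook; ranking runs on this subset.
staff_view = visible_chunks(corpus, {"staff"})
```

Filtering before similarity search, rather than after generation, means a restricted document cannot leak through a paraphrase in the model's answer.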
Where this came from
Lewis et al., 2020: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". Widely adopted as a standard architecture for production LLM apps since about 2023.
Related techniques
Function Calling / Tool Use
Let the model decide when to invoke a real function or API instead of free-text answering. The foundation of every modern agent.
ReAct (Reason + Act)
Alternate reasoning and acting in a tight loop. The dominant pattern for tool-using agents — think, act, observe, repeat.
System Prompt Design
The hidden instructions that set the model's role, constraints, and ground rules for the entire conversation. Where 80% of product behavior actually lives.