RAG Through the Encoder–Decoder Lens

Effective RAG needs two specialized models working in harmony:

an Encoder (embedding model) that finds the right evidence, and
a Decoder (LLM) that answers faithfully from that evidence.

Most teams tune only one. Fine-tuning both is where real accuracy gains live.

1) Introduction — The Two-Model Problem Most Teams Ignore

Walk into any RAG discussion and you’ll hear about prompts, chunk sizes, and top-K. What you won’t hear enough about: who decides which documents get retrieved, and who decides how to use them.

RAG’s core is a two-model architecture:

Encoder (Embedding Model) — Converts queries and documents into vectors. Determines what gets retrieved.
Decoder (LLM) — Reads retrieved context and generates grounded answers. Determines what gets said.

Most teams fine-tune the LLM but leave the embedding model untouched. That’s like upgrading the engine while ignoring the steering—you’ll go faster, not necessarily in the right direction.

Key insight: If retrieval is off, even a great LLM can’t save you. If the LLM hallucinates on good context, your expensive retriever is wasted.
Solution: Fine-tune both.

2) Encoder vs. Decoder — Side-by-Side

Dimension	Encoder (Bi-encoder)	Decoder (Autoregressive LLM)
Primary function	Semantic matching & retrieval	Grounded text generation
Input → Output	Text → Dense vector	Context + Query → Answer
Representative models	BGE, E5, GTE, Stella, MPNet	LLaMA, Mistral, Qwen, GPT
RAG responsibility	Find relevant passages (recall)	Generate faithful answers (precision)
Training focus	Contrastive learning on (query, doc) pairs	Instruction-following + grounding
Key metrics	Recall@K, nDCG@K, MRR	Citation accuracy, factuality, calibrated refusals
Compute profile	Fast inference (milliseconds)	Slower inference (seconds)

Takeaway: Retrieval quality sets your ceiling. Generation quality determines whether you hit it.
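The retrieval metrics in the table are simple to compute once you have a ranked list and a set of labeled relevant documents. Here is a minimal sketch, assuming binary relevance labels; the function names and the toy document ids are illustrative, not from any particular library.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: two relevant documents, one ranked 2nd and one ranked 4th.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(ranked, relevant, k=3))   # 0.5
print(reciprocal_rank(ranked, relevant))    # 0.5
print(round(ndcg_at_k(ranked, relevant, k=3), 3))
```

MRR is the mean of `reciprocal_rank` over all evaluation queries. Run the same functions before and after encoder fine-tuning to see whether the retrieval ceiling actually moved.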

3) Four Training Phases — From Raw Model to Production RAG

Phase	Model	What	Goal	Typical Tools	Output
1. Pretraining	Decoder	Train on massive corpora	Language & world knowledge	Megatron-LM, DeepSpeed, FSDP	Base LLM
2. SFT	Decoder	Instruction-response pairs	Task formats, instruction following	HF TRL, Axolotl, LLaMA-Factory, PEFT/LoRA	Instruction LLM
3. Alignment	Decoder	RLHF/DPO/ORPO/GRPO	Human preferences, safety, fewer hallucinations	TRL, OpenRLHF	Aligned LLM
4. Embedding FT	Encoder	Contrastive (query, passage, hard-negatives)	Domain semantics & similarity	SentenceTransformers, FlagEmbedding, LlamaIndex FT	In-domain embedder

Critical distinction: Phases 1–3 teach how to write. Phase 4 teaches what to retrieve.

4) Encoder Fine-Tuning — Teach “Apple” to Mean the Right Thing

Why: General-purpose embedders struggle with ambiguity (“Apple” fruit vs. brand), domain jargon (medical codes, legal citations, SKUs), and task-specific similarity (support vs. marketing vs. docs).

How:

Train on (query, positive, hard-negative) triplets so the model pulls true pairs together and pushes confusers apart.
Mine hard negatives from BM25 false-positives, the old embedder’s near-misses, or curated distractors.
Use hybrid retrieval (dense + BM25/SPLADE) to handle rare tokens, SKUs, and proper nouns.
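To make the triplet idea concrete, here is a small sketch of mining a hard negative from the current embedder's near-misses and scoring the resulting triplet with a margin loss. Everything here is a toy assumption: the 2-d vectors stand in for real embeddings, the document ids are invented, and the margin of 0.2 is arbitrary. A real setup would use a training library rather than hand-written loss code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, n=1):
    """Return the n highest-scoring documents that are NOT labeled positive."""
    scored = sorted(
        (
            (cosine(query_vec, vec), doc_id)
            for doc_id, vec in doc_vecs.items()
            if doc_id not in positive_ids
        ),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:n]]

def triplet_margin_loss(query_vec, pos_vec, neg_vec, margin=0.2):
    """Zero once the positive beats the negative by at least `margin`."""
    return max(0.0, margin - cosine(query_vec, pos_vec) + cosine(query_vec, neg_vec))

# Toy embeddings from a hypothetical untuned embedder.
docs = {
    "produce_returns": [0.6, 0.8],
    "apple_inc_returns": [0.9, 0.44],
    "store_hours": [-0.2, 1.0],
}
query = [1.0, 0.3]  # "apple return policy?" asked in a grocery context
positives = {"produce_returns"}

hard_negs = mine_hard_negatives(query, docs, positives, n=1)
loss = triplet_margin_loss(query, docs["produce_returns"], docs[hard_negs[0]])
print(hard_negs, round(loss, 3))
```

The mined negative is the brand-returns document, which the untuned embedder scores above the correct produce document. That positive loss is the signal fine-tuning uses to pull the true pair together and push the confuser away.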

Takeaway: Hard negatives that mirror real confusions are the secret sauce for better recall and fewer wrong-sense hits. (Phil Schmid’s guide provides practical patterns; see References.)
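For the hybrid retrieval step, the post does not prescribe how to merge dense and lexical results. One common option is reciprocal rank fusion (RRF), which combines rankings without needing the two score scales to be comparable. The sketch below assumes you already have one ranked id list per retriever; the document ids are made up, and `k=60` is a widely used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for a query that mentions a SKU.
dense = ["faq_returns", "warranty", "sku_48213_manual"]
lexical = ["sku_48213_manual", "faq_returns", "shipping"]

fused = reciprocal_rank_fusion([dense, lexical])
print(fused)
```

Documents that both retrievers rank well rise to the top, while an exact SKU match that only the lexical side ranks highly still stays in contention.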

5) Decoder Fine-Tuning — Teach Evidence-First Answering

Grounded SFT: Input = question + top-K context; Output = answer with citations. Train the model to quote or tightly paraphrase from provided context, and to say “insufficient information” when evidence is missing.
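A grounded SFT record might look like the sketch below. The prompt wording, the `prompt`/`completion` field names, and the refusal string are all assumptions for illustration; adapt them to whatever format your training framework expects.

```python
import json

REFUSAL = "Insufficient information in the provided context."

def build_grounded_example(question, passages, answer=None):
    """Format one SFT record: numbered context in the prompt, cited answer as target.

    When no answer can be supported by the passages, the target is the refusal string.
    """
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    prompt = (
        "Answer using only the context below. Cite passages as [n]. "
        f'If the context does not contain the answer, reply "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {"prompt": prompt, "completion": answer if answer is not None else REFUSAL}

answered = build_grounded_example(
    "How long is the return window for opened electronics?",
    [
        "Opened electronics can be returned within 14 days.",
        "An opened-box fee may apply.",
    ],
    answer='Opened electronics can be returned "within 14 days" [1].',
)
unanswerable = build_grounded_example(
    "Do you price-match competitors?",
    ["Opened electronics can be returned within 14 days."],
)

print(json.dumps(answered, indent=2))
print(unanswerable["completion"])
```

Mixing in unanswerable examples like the second one is what teaches the model that refusing is an acceptable output when the evidence is missing.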

Preference alignment (DPO/ORPO/GRPO): Use good-vs-bad pairs under the same context to reward faithful, concise, cited answers over stylish speculation.
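For intuition, here is what a preference pair and the DPO objective look like in miniature. The pair's field names and text are illustrative, and the log-probabilities are invented numbers standing in for what the policy and the frozen reference model would assign to each answer. In practice a training library computes these for you; this sketch only shows the shape of the loss.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    return softplus(-margin)

# Same context for both answers; only faithfulness differs.
pair = {
    "prompt": "Context:\n[1] Opened electronics can be returned within 14 days.\n\n"
              "Question: Can I return an opened laptop after a month?",
    "chosen": 'No. The context allows returns only "within 14 days" [1].',
    "rejected": "Probably yes, most stores are flexible about this.",
}

# Policy identical to the reference: no preference learned yet.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # log(2)
# Policy has shifted probability toward the chosen answer: lower loss.
print(round(dpo_loss(-8.0, -14.0, -10.0, -12.0), 4))
```

The loss falls only when the policy raises the faithful answer relative to the reference more than it raises the speculative one, which is why both answers must share the same context.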

Takeaway: Make the decoder rely on context, cite it clearly, and avoid guessing.

6) Add a Reranker — Precise Ordering Beats Noisy Top-K

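Bi-encoders are fast but shallow: the query and each document are encoded independently and compared only by a dot product or cosine similarity, so the model never reads the two texts together.
A cross-encoder reranker jointly reads query × candidate to re-order top-K (e.g., top-100 → top-10), catching nuances the bi-encoder misses. When running a cross-encoder over every candidate is too slow, consider late-interaction approaches (e.g., ColBERT), which keep per-token embeddings and sit between the two in cost and precision.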
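The two-stage flow can be sketched independently of any model. In the example below, both scoring functions are toy stand-ins chosen only to make the behavior visible: word overlap plays the role of the order-insensitive first stage, and shared word pairs play the role of the reranker that reads query and candidate together. Neither is a real bi-encoder or cross-encoder.

```python
def tokens(text):
    return text.lower().split()

def overlap_score(query, doc):
    """Toy first stage: count of shared words, ignoring order (bi-encoder stand-in)."""
    return len(set(tokens(query)) & set(tokens(doc)))

def phrase_score(query, doc):
    """Toy second stage: count of shared adjacent word pairs (cross-encoder stand-in)."""
    def bigrams(words):
        return set(zip(words, words[1:]))
    return len(bigrams(tokens(query)) & bigrams(tokens(doc)))

def retrieve_then_rerank(query, candidates, first_stage, second_stage,
                         retrieve_k=100, final_k=10):
    """Cheap scorer narrows the pool; expensive scorer orders the shortlist."""
    shortlist = sorted(candidates, key=lambda d: first_stage(query, d), reverse=True)
    shortlist = shortlist[:retrieve_k]
    reranked = sorted(shortlist, key=lambda d: second_stage(query, d), reverse=True)
    return reranked[:final_k]

docs = [
    "laptops policy return opened for sale",
    "Our return policy for opened laptops allows refunds",
    "store hours and parking",
]
query = "return policy for opened laptops"

top = retrieve_then_rerank(query, docs, overlap_score, phrase_score,
                           retrieve_k=2, final_k=1)
print(top)
```

The first stage cannot tell the scrambled document from the real policy because both share the same words. The second stage separates them, and it only has to score the shortlist rather than the whole corpus.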

Takeaway: Bi-encoders are fast; cross-encoders are sharp. Use both.

7) System View — Text → Chunk → Embed → Store → Retrieve → Rerank → Generate

Data flow

Parse & chunk with HTML/Markdown awareness; preserve headings, lists, tables; attach metadata (store/site, department, brand).
Embed & index with dense + lexical signals.
Retrieve & rerank for recall and precision.
Generate with citations; prefer short quotes and exact clause references.
Evaluate & monitor with offline metrics and an online feedback loop.
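As an illustration of the first step, here is a minimal heading-aware chunker for Markdown. It is a sketch under simple assumptions: headings are lines starting with `#`, chunks never cross a heading, and the metadata keys are whatever you pass in. It does not handle fenced code blocks or tables, which a production parser would need to.

```python
def chunk_markdown(text, metadata=None, max_chars=500):
    """Split Markdown into chunks that stay within one section and carry its heading path."""
    chunks, heading_path, buffer = [], [], []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(
                {"text": body, "headings": list(heading_path), **(metadata or {})}
            )
        buffer.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section before switching headings
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            # Drop same-level and deeper headings, then record the new one.
            del heading_path[level - 1:]
            heading_path.append(title)
        else:
            buffered = sum(len(item) + 1 for item in buffer)
            if buffer and buffered + len(line) > max_chars:
                flush()  # size limit reached inside a section
            buffer.append(line)
    flush()
    return chunks

doc = """# Returns
## Produce
Freshness checks and receipt rules apply.
## Electronics
- 14-day window
- Opened-box fee applies
"""

for chunk in chunk_markdown(doc, metadata={"department": "customer-service"}):
    print(chunk)
```

Each chunk keeps its list intact and records the heading path, so the retriever and the generator can both tell a produce rule from an electronics rule.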

Key metrics

Stage	Metric	Why it matters
Retrieval	Recall@K / nDCG@K / MRR	Can we consistently surface the right passages?
Generation	Citation accuracy / factuality / calibrated refusal	Do answers stick to evidence and avoid guessing?
Business	First-contact resolution / time-to-answer / CSAT	Does it actually help users faster and better?
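Citation accuracy has no single standard definition. One simple version, sketched below, assumes answers cite as `"quoted text" [n]` and counts a citation as supported when the quote appears verbatim in passage n. The citation format and the function name are assumptions for this example; paraphrased citations would need a softer matching rule.

```python
import re

CITATION = re.compile(r'"([^"]+)"\s*\[(\d+)\]')

def citation_accuracy(answer, passages):
    """Share of quoted citations whose text appears in the passage they point to."""
    citations = CITATION.findall(answer)
    if not citations:
        return 0.0
    supported = 0
    for quote, number in citations:
        index = int(number) - 1  # citations are 1-based
        if 0 <= index < len(passages) and quote.lower() in passages[index].lower():
            supported += 1
    return supported / len(citations)

passages = [
    "Opened electronics can be returned within 14 days.",
    "Produce refunds require a receipt.",
]
answer = 'You have a "within 14 days" [1] window, and "no receipt is needed" [2].'

print(citation_accuracy(answer, passages))  # 0.5
```

The second citation points at a real passage but quotes text that is not in it, which is exactly the kind of confident misattribution this metric is meant to catch.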

Takeaway: Treat RAG as a loop—refresh encoder/reranker frequently; refresh the decoder less often but with stronger supervision.

8) Real-World Disambiguation — Why Domain Tuning Matters

Grocery
Query: “apple return policy?” → fruit (perishables window, freshness checks, receipt rules)
Failure mode: Retrieves Apple-the-brand returns
Electronics
Query: “Apple return policy?” → Apple Inc. (14-day window, opened-box fee, serial verification)
Failure mode: Falls into produce rules

Takeaway: Encoder tuning separates senses. Decoder tuning narrates those senses faithfully—with receipts.

Conclusion — Two Models, One Mission

Your RAG system is only as strong as its weaker half:

Retrieval line (Encoder + Reranker) — consistently surface the right evidence.
Generation line (Decoder) — consistently answer from that evidence with clear citations.

Most teams over-invest in the decoder and under-invest in the encoder. Flip that bias: tune both halves and accuracy—and trust—will climb fast.

An intelligent system isn’t measured by how eloquently it writes, but by how accurately it remembers, retrieves, and reasons from evidence.

References & Further Reading

Phil Schmid — Fine-tune Embedding Models for Retrieval-Augmented Generation
Practical patterns for contrastive training with SentenceTransformers, hard-negative mining, and retrieval-oriented evaluation.
Databricks — Improving Retrieval and RAG with Embedding Model Finetuning
Enterprise-focused discussion on domain adaptation, production deployment, and downstream RAG accuracy gains.
