RAG Through the Encoder–Decoder Lens
Why Fine-Tuning Both Sides Wins
Effective RAG needs two specialized models working in harmony:
an Encoder (embedding model) that finds the right evidence, and
a Decoder (LLM) that answers faithfully from that evidence.
Most teams tune only one. Fine-tuning both is where real accuracy gains live.
1) Introduction — The Two-Model Problem Most Teams Ignore
Walk into any RAG discussion and you’ll hear about prompts, chunk sizes, and top-K. What you won’t hear enough about: who decides which documents get retrieved, and who decides how to use them.
RAG’s core is a two-model architecture:
Encoder (Embedding Model) — Converts queries and documents into vectors. Determines what gets retrieved.
Decoder (LLM) — Reads retrieved context and generates grounded answers. Determines what gets said.
Most teams fine-tune the LLM but leave the embedding model untouched. That’s like upgrading the engine while ignoring the steering—you’ll go faster, not necessarily in the right direction.
Key insight: If retrieval is off, even a great LLM can’t save you. If the LLM hallucinates on good context, your expensive retriever is wasted.
Solution: Fine-tune both.

2) Encoder vs. Decoder — Side-by-Side
| Dimension | Encoder (Bi-encoder) | Decoder (Autoregressive LLM) |
|---|---|---|
| Primary function | Semantic matching & retrieval | Grounded text generation |
| Input → Output | Text → Dense vector | Context + Query → Answer |
| Representative models | BGE, E5, GTE, Stella, MPNet | LLaMA, Mistral, Qwen, GPT |
| RAG responsibility | Find relevant passages (recall) | Generate faithful answers (precision) |
| Training focus | Contrastive learning on (query, doc) pairs | Instruction-following + grounding |
| Key metrics | Recall@K, nDCG@K, MRR | Citation accuracy, factuality, calibrated refusals |
| Compute profile | Fast inference (milliseconds) | Slower inference (seconds) |
Takeaway: Retrieval quality sets your ceiling. Generation quality determines whether you hit it.
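The retrieval metrics named in the table can be made concrete in a few lines. The sketch below assumes binary relevance labels (a passage is either relevant or not); the function names and document IDs are hypothetical and chosen only for illustration. It follows the standard definitions and is not a substitute for an evaluation library.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (made-up IDs): two relevant passages, one ranked 2nd, one ranked 4th.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, relevant, k=4))
```

MRR is the mean of `reciprocal_rank` over all evaluation queries. Recall@K ignores ordering inside the top K, while nDCG@K and MRR reward placing relevant passages earlier.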
3) Four Training Phases — From Raw Model to Production RAG
| Phase | Model | What | Goal | Typical Tools | Output |
|---|---|---|---|---|---|
| 1. Pretraining | Decoder | Train on massive corpora | Language & world knowledge | Megatron-LM, DeepSpeed, FSDP | Base LLM |
| 2. SFT | Decoder | Instruction-response pairs | Task formats, instruction following | HF TRL, Axolotl, LLaMA-Factory, PEFT/LoRA | Instruction LLM |
| 3. Alignment | Decoder | RLHF/DPO/ORPO/GRPO | Human preferences, safety, fewer hallucinations | TRL, OpenRLHF | Aligned LLM |
| 4. Embedding FT | Encoder | Contrastive (query, passage, hard-negatives) | Domain semantics & similarity | SentenceTransformers, FlagEmbedding, LlamaIndex FT | In-domain embedder |
Critical distinction: Phases 1–3 teach how to write. Phase 4 teaches what to retrieve.
4) Encoder Fine-Tuning — Teach “Apple” to Mean the Right Thing
Why: General-purpose embedders struggle with ambiguity (“Apple” fruit vs. brand), domain jargon (medical codes, legal citations, SKUs), and task-specific similarity (support vs. marketing vs. docs).
How:
Train on (query, positive, hard-negative) triplets so the model pulls true pairs together and pushes confusers apart.
Mine hard negatives from BM25 false-positives, the old embedder’s near-misses, or curated distractors.
Use hybrid retrieval (dense + BM25/SPLADE) to handle rare tokens, SKUs, and proper nouns.
Takeaway: Hard negatives mined from real confusions force the encoder to separate look-alike passages, which is what improves recall and cuts wrong-sense hits. (Phil Schmid’s guide provides practical patterns; see References.)
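Hard-negative mining can be sketched as a small filtering step over a baseline ranking. The example below assumes you already have a ranked list of passage IDs from some baseline retriever (BM25 or the old embedder) plus the labeled positives for the query; `mine_hard_negatives`, `build_triplets`, and all IDs are hypothetical names for illustration.

```python
from typing import List, Sequence, Set, Tuple


def mine_hard_negatives(
    ranked_ids: Sequence[str],
    positive_ids: Set[str],
    num_negatives: int = 2,
) -> List[str]:
    """Take the highest-ranked passages that are NOT labeled positive."""
    negatives = [doc_id for doc_id in ranked_ids if doc_id not in positive_ids]
    return negatives[:num_negatives]


def build_triplets(
    query: str,
    ranked_ids: Sequence[str],
    positive_ids: Set[str],
    num_negatives: int = 2,
) -> List[Tuple[str, str, str]]:
    """Pair every positive with every mined hard negative."""
    negatives = mine_hard_negatives(ranked_ids, positive_ids, num_negatives)
    return [
        (query, positive, negative)
        for positive in sorted(positive_ids)
        for negative in negatives
    ]


# Toy grocery example (made-up IDs): the baseline ranks brand pages above produce rules.
baseline_ranking = [
    "apple_inc_returns",
    "produce_returns",
    "apple_inc_warranty",
    "dairy_returns",
]
triplets = build_triplets(
    "apple return policy", baseline_ranking, positive_ids={"produce_returns"}
)
for triplet in triplets:
    print(triplet)
```

One caveat with this approach: a highly ranked passage that is relevant but unlabeled will be mined as a false negative, so a manual spot check or a filtering step on the mined set is usually worthwhile.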
5) Decoder Fine-Tuning — Teach Evidence-First Answering
Grounded SFT: Input = question + top-K context; Output = answer with citations. Train the model to quote or tightly paraphrase from provided context, and to say “insufficient information” when evidence is missing.
Preference alignment (DPO/ORPO/GRPO): Use good-vs-bad pairs under the same context to reward faithful, concise, cited answers over stylish speculation.
Takeaway: Make the decoder rely on context, cite it clearly, and avoid guessing.
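The two kinds of training data described in this section can be sketched as plain records. The prompt wording, the refusal string, and the field names (`prompt`, `completion`, `chosen`, `rejected`) are assumptions for illustration; match them to whatever your training framework expects. The passage and answers are invented toy data.

```python
from typing import Dict, List, Optional

REFUSAL = "Insufficient information in the provided context."  # assumed wording


def format_context(passages: List[Dict[str, str]]) -> str:
    """Number the passages so the answer can cite them as [1], [2], ..."""
    return "\n".join(f"[{i}] {p['text']}" for i, p in enumerate(passages, start=1))


def build_sft_example(
    question: str,
    passages: List[Dict[str, str]],
    answer: Optional[str] = None,
) -> Dict[str, str]:
    """Grounded SFT record; with no supported answer the target is a refusal."""
    prompt = (
        "Answer using only the numbered context. Cite sources like [1].\n"
        f"Context:\n{format_context(passages)}\n"
        f"Question: {question}\n"
    )
    return {"prompt": prompt, "completion": answer if answer is not None else REFUSAL}


def build_preference_pair(
    question: str,
    passages: List[Dict[str, str]],
    chosen: str,
    rejected: str,
) -> Dict[str, str]:
    """Preference record: both answers are judged under the same context."""
    prompt = build_sft_example(question, passages, chosen)["prompt"]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy data (invented for illustration).
passages = [{"text": "Unopened electronics can be returned within 14 days with a receipt."}]
question = "How long do I have to return unopened electronics?"

sft = build_sft_example(
    question, passages, "Unopened electronics can be returned within 14 days [1]."
)
refusal = build_sft_example("Do you price-match competitors?", passages)
pair = build_preference_pair(
    question,
    passages,
    chosen="Unopened electronics can be returned within 14 days [1].",
    rejected="You can usually return most items within 30 days.",
)
print(sft["prompt"])
print(refusal["completion"])
```

Including refusal examples alongside answerable ones is what teaches the "insufficient information" behavior; without them the model only ever sees questions that the context can answer.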
6) Add a Reranker — Precise Ordering Beats Noisy Top-K
Bi-encoders are fast but shallow (independent encodings, dot-product scoring).
A cross-encoder reranker jointly reads query × candidate to re-order top-K (e.g., top-100 → top-10), catching nuances the bi-encoder misses. For very large corpora, consider late-interaction approaches (e.g., ColBERT).
Takeaway: Bi-encoders are fast; cross-encoders are sharp. Use both.
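The two-stage pattern can be sketched independently of any model. In the example below, `fast_score` stands in for bi-encoder similarity and `precise_score` for a cross-encoder; both toy scorers are simple string heuristics invented for this illustration, not real models, and the corpus is made up.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    fast_score: Scorer,
    precise_score: Scorer,
    retrieve_k: int = 100,
    final_k: int = 10,
) -> List[str]:
    """Stage 1: cheap scoring over the whole corpus. Stage 2: costly scoring on survivors."""
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)
    candidates = candidates[:retrieve_k]
    reranked = sorted(candidates, key=lambda doc: precise_score(query, doc), reverse=True)
    return reranked[:final_k]


def token_overlap(query: str, doc: str) -> float:
    """Toy stand-in for a bi-encoder: counts shared tokens, ignores word order."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_aware(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: also rewards an exact phrase match."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return token_overlap(query, doc) + bonus


corpus = [
    "return policy for apple products and accessories",
    "apple return policy for fresh produce",
    "store opening hours",
]
top = retrieve_then_rerank(
    "apple return policy", corpus, token_overlap, phrase_aware, retrieve_k=2, final_k=1
)
print(top)
```

The first two passages tie under the cheap scorer, and only the second stage separates them. The expensive scorer runs on `retrieve_k` candidates rather than the whole corpus, which is the reason for splitting the work this way.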
7) System View — Text → Chunk → Embed → Store → Retrieve → Rerank → Generate
Data flow
Parse & chunk with HTML/Markdown awareness; preserve headings, lists, tables; attach metadata (store/site, department, brand).
Embed & index with dense + lexical signals.
Retrieve & rerank for recall and precision.
Generate with citations; prefer short quotes and exact clause references.
Evaluate & monitor with offline metrics and an online feedback loop.
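The first step of the data flow, structure-aware chunking with metadata, might look like the sketch below. It assumes Markdown input where sections start with `#` headings; the function names, the metadata keys, and the sample document are hypothetical. Lists and tables stay inside the section they belong to because the split happens only at headings.

```python
from typing import Dict, List, Optional


def _make_chunk(
    heading: str, body: List[str], metadata: Dict[str, str]
) -> Optional[Dict[str, object]]:
    """Build one chunk, or return None if the section has no content."""
    text = "\n".join(body).strip()
    if not text:
        return None
    return {"heading": heading, "text": text, "metadata": dict(metadata)}


def chunk_markdown(document: str, metadata: Dict[str, str]) -> List[Dict[str, object]]:
    """Split at heading lines and attach the heading and metadata to each chunk."""
    chunks: List[Dict[str, object]] = []
    heading: str = ""
    body: List[str] = []
    for line in document.splitlines():
        if line.startswith("#"):
            chunk = _make_chunk(heading, body, metadata)
            if chunk:
                chunks.append(chunk)
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    last = _make_chunk(heading, body, metadata)
    if last:
        chunks.append(last)
    return chunks


# Made-up sample document and metadata values.
sample = "# Returns\nProduce: 2 days.\n- keep receipt\n## Electronics\n14-day window.\n"
chunks = chunk_markdown(sample, {"department": "customer-service"})
for chunk in chunks:
    print(chunk["heading"], "|", chunk["metadata"])
```

This simple version treats any line starting with `#` as a heading, so a comment line inside a fenced code block would be misread; a production parser would need to track fences and would likely also cap chunk length.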
Key metrics
| Stage | Metric | Why it matters |
|---|---|---|
| Retrieval | Recall@K / nDCG@K / MRR | Can we consistently surface the right passages? |
| Generation | Citation accuracy / factuality / calibrated refusal | Do answers stick to evidence and avoid guessing? |
| Business | First-contact resolution / time-to-answer / CSAT | Does it actually help users faster and better? |
Takeaway: Treat RAG as a loop—refresh encoder/reranker frequently; refresh the decoder less often but with stronger supervision.
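The generation-stage metrics have no single agreed definition, so the sketch below shows one possible way to operationalize two of them. It assumes answers cite passages with bracketed numbers such as `[1]`, that each test question has a labeled set of supporting passage numbers, and that refusals contain a fixed marker phrase; all names and the marker string are assumptions for illustration.

```python
import re
from typing import List, Set

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def cited_indices(answer: str) -> List[int]:
    """Extract passage numbers cited as [n] in the answer text."""
    return [int(match) for match in CITATION_PATTERN.findall(answer)]


def citation_accuracy(answer: str, supporting: Set[int]) -> float:
    """Share of distinct cited passages that are labeled as supporting evidence."""
    cited = set(cited_indices(answer))
    if not cited:
        return 0.0
    return len(cited & supporting) / len(cited)


def refusal_is_calibrated(
    answer: str,
    answerable: bool,
    refusal_marker: str = "insufficient information",  # assumed phrase
) -> bool:
    """Correct when the model refuses exactly on the unanswerable questions."""
    refused = refusal_marker in answer.lower()
    return refused != answerable


# Toy answers (invented for illustration).
score = citation_accuracy("Return within 14 days [1][3].", supporting={1, 2})
ok_refusal = refusal_is_calibrated("Insufficient information.", answerable=False)
over_refusal = refusal_is_calibrated("Insufficient information.", answerable=True)
print(score, ok_refusal, over_refusal)
```

Checks like these only verify that citations point at the right passages; they do not confirm that the cited text actually supports the claim, which still needs human review or a separate entailment check.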
8) Real-World Disambiguation — Why Domain Tuning Matters
Grocery
Query: “apple return policy?” → fruit (perishables window, freshness checks, receipt rules)
Failure mode: Retrieves Apple-the-brand returns
Electronics
Query: “Apple return policy?” → Apple Inc. (14-day window, opened-box fee, serial verification)
Failure mode: Falls into produce rules
Takeaway: Encoder tuning separates senses. Decoder tuning narrates those senses faithfully—with receipts.
Conclusion — Two Models, One Mission
Your RAG system is only as strong as its weaker half:
Retrieval line (Encoder + Reranker) — consistently surface the right evidence.
Generation line (Decoder) — consistently answer from that evidence with clear citations.
Most teams over-invest in the decoder and under-invest in the encoder. Flip that bias: tune both halves and accuracy—and trust—will climb fast.
An intelligent system isn’t measured by how eloquently it writes, but by how accurately it remembers, retrieves, and reasons from evidence.
References & Further Reading
Phil Schmid: Fine-tune Embedding Models for Retrieval-Augmented Generation
Practical patterns for contrastive training with SentenceTransformers, hard-negative mining, and retrieval-oriented evaluation.
Databricks: Improving Retrieval and RAG with Embedding Model Finetuning
Enterprise-focused discussion on domain adaptation, production deployment, and downstream RAG accuracy gains.