RAG Through the Encoder–Decoder Lens

Why Fine-Tuning Both Sides Wins

Effective RAG needs two specialized models working in harmony:

  • an Encoder (embedding model) that finds the right evidence

  • a Decoder (LLM) that answers faithfully from that evidence

Most teams tune only one. Fine-tuning both is where real accuracy gains live.


1) Introduction — The Two-Model Problem Most Teams Ignore

Walk into any RAG discussion and you’ll hear about prompts, chunk sizes, and top-K. What you won’t hear enough about: who decides which documents get retrieved, and who decides how to use them.

RAG’s core is a two-model architecture:

  • Encoder (Embedding Model) — Converts queries and documents into vectors. Determines what gets retrieved.

  • Decoder (LLM) — Reads retrieved context and generates grounded answers. Determines what gets said.

Most teams fine-tune the LLM but leave the embedding model untouched. That’s like upgrading the engine while ignoring the steering—you’ll go faster, not necessarily in the right direction.

Key insight: If retrieval is off, even a great LLM can’t save you. If the LLM hallucinates on good context, your expensive retriever is wasted.
Solution: Fine-tune both.


2) Encoder vs. Decoder — Side-by-Side

| Dimension | Encoder (Bi-encoder) | Decoder (Autoregressive LLM) |
| --- | --- | --- |
| Primary function | Semantic matching & retrieval | Grounded text generation |
| Input → Output | Text → Dense vector | Context + Query → Answer |
| Representative models | BGE, E5, GTE, Stella, MPNet | LLaMA, Mistral, Qwen, GPT |
| RAG responsibility | Find relevant passages (recall) | Generate faithful answers (precision) |
| Training focus | Contrastive learning on (query, doc) pairs | Instruction-following + grounding |
| Key metrics | Recall@K, nDCG@K, MRR | Citation accuracy, factuality, calibrated refusals |
| Compute profile | Fast inference (milliseconds) | Slower inference (seconds) |

Takeaway: Retrieval quality sets your ceiling. Generation quality determines whether you hit it.
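
To make the encoder-side metrics concrete, here is a minimal sketch of Recall@K, reciprocal rank (averaged over queries it becomes MRR), and binary-relevance nDCG@K. The function names and the toy ids are illustrative assumptions, not part of any particular evaluation library.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: DCG of this ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical ids): two relevant passages, one found at rank 2.
ranked = ["d3", "d1", "d7", "d2"]
relevant = ["d1", "d2"]
recall_3 = recall_at_k(ranked, relevant, k=3)   # 0.5
first_hit = reciprocal_rank(ranked, relevant)   # 0.5
ndcg_4 = ndcg_at_k(ranked, relevant, k=4)
```

Recall@K only asks whether the right passages made the cut, while nDCG@K and MRR also reward placing them near the top, which matters once the decoder only reads the first few results.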


3) Four Training Phases — From Raw Model to Production RAG

| Phase | Model | What | Goal | Typical Tools | Output |
| --- | --- | --- | --- | --- | --- |
| 1. Pretraining | Decoder | Train on massive corpora | Language & world knowledge | Megatron-LM, DeepSpeed, FSDP | Base LLM |
| 2. SFT | Decoder | Instruction-response pairs | Task formats, instruction following | HF TRL, Axolotl, LLaMA-Factory, PEFT/LoRA | Instruction LLM |
| 3. Alignment | Decoder | RLHF/DPO/ORPO/GRPO | Human preferences, safety, fewer hallucinations | TRL, OpenRLHF | Aligned LLM |
| 4. Embedding FT | Encoder | Contrastive (query, passage, hard-negatives) | Domain semantics & similarity | SentenceTransformers, FlagEmbedding, LlamaIndex FT | In-domain embedder |

Critical distinction: Phases 1–3 teach how to write. Phase 4 teaches what to retrieve.
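
As a rough illustration of what the Phase 4 objective looks like, the sketch below computes an InfoNCE-style contrastive loss for one (query, positive, negatives) example in plain Python. InfoNCE is one common choice of contrastive loss, not necessarily the one a given toolkit uses, and the temperature of 0.05 and the toy vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy over similarities, with the positive as the correct class."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]

# Toy 2-D embeddings (hypothetical values).
query_vec = [1.0, 0.0]
positive_vec = [0.9, 0.1]
hard_negative_vec = [0.8, 0.6]   # close to the query, so it contributes loss
easy_negative_vec = [-1.0, 0.0]  # far from the query, contributes almost nothing

loss_hard = info_nce_loss(query_vec, positive_vec, [hard_negative_vec])
loss_easy = info_nce_loss(query_vec, positive_vec, [easy_negative_vec])
```

The hard negative produces the larger loss, which is why training signal comes mostly from confusable passages rather than obviously unrelated ones.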


4) Encoder Fine-Tuning — Teach “Apple” to Mean the Right Thing

Why: General-purpose embedders struggle with ambiguity (“Apple” fruit vs. brand), domain jargon (medical codes, legal citations, SKUs), and task-specific similarity (support vs. marketing vs. docs).

How:

  • Train on (query, positive, hard-negative) triplets so the model pulls true pairs together and pushes confusers apart.

  • Mine hard negatives from BM25 false-positives, the old embedder’s near-misses, or curated distractors.

  • Use hybrid retrieval (dense + BM25/SPLADE) to handle rare tokens, SKUs, and proper nouns.
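
The mining step in the list above can be sketched as follows. This is a minimal illustration, assuming a toy token-overlap scorer as a stand-in for BM25 (it is not BM25) and a hypothetical three-document corpus; in practice the scorer would be the lexical index or the previous embedder.

```python
def token_overlap_score(query, document):
    """Toy lexical scorer (a stand-in for BM25): count of shared lowercase tokens."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens)

def mine_hard_negatives(query, positive_id, corpus, scorer, num_negatives=2):
    """Return ids of the highest-scoring documents that are not the labeled positive."""
    ranked = sorted(corpus, key=lambda doc_id: scorer(query, corpus[doc_id]), reverse=True)
    return [doc_id for doc_id in ranked if doc_id != positive_id][:num_negatives]

# Hypothetical corpus for a grocery retailer that also sells electronics.
corpus = {
    "produce_returns": "apple return policy for fresh produce and perishables",
    "electronics_returns": "apple return policy for iphone and macbook purchases",
    "store_hours": "store opening hours and holiday schedule",
}
query = "apple return policy for fruit"

negatives = mine_hard_negatives(
    query, "produce_returns", corpus, token_overlap_score, num_negatives=1
)
# One (query, positive, hard-negative) triplet ready for contrastive training.
triplet = (query, corpus["produce_returns"], corpus[negatives[0]])
```

The lexically similar electronics policy is selected as the negative, while the unrelated store-hours page is ignored, so the triplet targets exactly the confusion the retriever makes today.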

Takeaway: Hard negatives mined from real confusions are the main lever for better recall and fewer wrong-sense hits. (Phil Schmid’s guide provides practical patterns; see References.)


5) Decoder Fine-Tuning — Teach Evidence-First Answering

Grounded SFT: Input = question + top-K context; Output = answer with citations. Train the model to quote or tightly paraphrase from provided context, and to say “insufficient information” when evidence is missing.
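
One way a grounded SFT record might be assembled is sketched below. The prompt wording, the `prompt`/`completion` field names, and the bracketed passage-id citation style are all assumptions for illustration; adapt them to whatever format your training toolkit expects.

```python
def build_grounded_example(question, passages, answer=None, cited_ids=None):
    """Format one supervised example; the target is a refusal when no answer is supported."""
    context = "\n".join(f"[{pid}] {text}" for pid, text in passages)
    prompt = (
        "Answer using only the context. Cite passage ids in brackets.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )
    if answer is None:
        # No supporting evidence in the retrieved context: teach the model to decline.
        completion = "Insufficient information."
    else:
        citations = " ".join(f"[{pid}]" for pid in (cited_ids or []))
        completion = f"{answer} {citations}".strip()
    return {"prompt": prompt, "completion": completion}

# Hypothetical passages and labels.
passages = [
    ("p1", "Fresh produce may be returned within 2 days with a receipt."),
    ("p2", "Stores open at 8am on weekdays."),
]
answerable = build_grounded_example(
    "Can I return apples?", passages,
    answer="Yes, within 2 days with a receipt.", cited_ids=["p1"],
)
unanswerable = build_grounded_example("Do you price-match competitors?", passages)
```

Including unanswerable examples alongside answerable ones is what gives the model a supervised signal for declining instead of guessing.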

Preference alignment (DPO/ORPO/GRPO): Use good-vs-bad pairs under the same context to reward faithful, concise, cited answers over stylish speculation.
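
For DPO specifically, the per-pair loss can be written in a few lines. The sketch below assumes you already have summed token log-probabilities of the chosen (faithful, cited) and rejected (speculative) answers under both the policy being trained and a frozen reference model; the `beta` value of 0.1 and the example numbers are assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair given sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)), written in a numerically stable form.
    if logit > 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))

# Policy identical to the reference: loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy now prefers the grounded answer more than the reference does: loss drops.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
# Policy prefers the speculative answer: loss rises.
regressed = dpo_loss(-15.0, -5.0, -10.0, -10.0)
```

Because both answers are scored under the same question and context, the only thing the loss can reward is the difference in how the answer uses that context.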

Takeaway: Make the decoder rely on context, cite it clearly, and avoid guessing.


6) Add a Reranker — Precise Ordering Beats Noisy Top-K

Bi-encoders are fast but shallow: the query and each document are encoded independently and compared only by a dot product. A cross-encoder reranker reads the query and each candidate jointly to re-order the top-K (e.g., top-100 → top-10), catching nuances the bi-encoder misses. For very large corpora, consider late-interaction approaches (e.g., ColBERT).
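
The two-stage pattern can be sketched independently of any model library. Here both scorers are hypothetical lookup tables standing in for a bi-encoder and a cross-encoder; only the control flow is the point.

```python
def retrieve_then_rerank(query, doc_ids, first_stage_score, rerank_score,
                         retrieve_k=100, final_k=10):
    """A cheap scorer selects retrieve_k candidates; a costly scorer orders the final_k."""
    candidates = sorted(
        doc_ids, key=lambda doc_id: first_stage_score(query, doc_id), reverse=True
    )[:retrieve_k]
    reranked = sorted(
        candidates, key=lambda doc_id: rerank_score(query, doc_id), reverse=True
    )
    return reranked[:final_k]

# Hypothetical scores standing in for real models.
bi_encoder_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
cross_encoder_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 1.0}

top_docs = retrieve_then_rerank(
    "apple return policy",
    ["a", "b", "c", "d"],
    first_stage_score=lambda query, doc_id: bi_encoder_scores[doc_id],
    rerank_score=lambda query, doc_id: cross_encoder_scores[doc_id],
    retrieve_k=3,
    final_k=2,
)
```

Note that document "d" never reaches the reranker even though the reranker would have scored it highest: the reranker can only fix ordering, so first-stage recall still sets the ceiling.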

Takeaway: Bi-encoders are fast; cross-encoders are sharp. Use both.


7) System View — Text → Chunk → Embed → Store → Retrieve → Rerank → Generate

Data flow

  1. Parse & chunk with HTML/Markdown awareness; preserve headings, lists, tables; attach metadata (store/site, department, brand).

  2. Embed & index with dense + lexical signals.

  3. Retrieve & rerank for recall and precision.

  4. Generate with citations; prefer short quotes and exact clause references.

  5. Evaluate & monitor with offline metrics and an online feedback loop.
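
For steps 2 and 3, the dense and lexical result lists have to be merged somehow. The article does not prescribe a method; one widely used option is reciprocal rank fusion (RRF), sketched below with the conventional constant k=60 and hypothetical document ids.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores the sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a dense retriever and a lexical (BM25-style) retriever.
dense_ranking = ["d1", "d2", "d3"]
lexical_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
```

RRF uses only ranks, so it sidesteps the problem that dense similarity scores and lexical scores live on different scales; documents found by both retrievers rise to the top.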

Key metrics

| Stage | Metric | Why it matters |
| --- | --- | --- |
| Retrieval | Recall@K / nDCG@K / MRR | Can we consistently surface the right passages? |
| Generation | Citation accuracy / factuality / calibrated refusal | Do answers stick to evidence and avoid guessing? |
| Business | First-contact resolution / time-to-answer / CSAT | Does it actually help users faster and better? |

Takeaway: Treat RAG as a loop—refresh encoder/reranker frequently; refresh the decoder less often but with stronger supervision.


8) Real-World Disambiguation — Why Domain Tuning Matters

  • Grocery
    Query: “apple return policy?” → fruit (perishables window, freshness checks, receipt rules)
    Failure mode: Retrieves Apple-the-brand returns

  • Electronics
    Query: “Apple return policy?” → Apple Inc. (14-day window, opened-box fee, serial verification)
    Failure mode: Falls into produce rules

Takeaway: Encoder tuning separates senses. Decoder tuning narrates those senses faithfully—with receipts.


Conclusion — Two Models, One Mission

Your RAG system is only as strong as its weaker half:

  1. Retrieval line (Encoder + Reranker) — consistently surface the right evidence.

  2. Generation line (Decoder) — consistently answer from that evidence with clear citations.

Most teams over-invest in the decoder and under-invest in the encoder. Correct that bias: tune both halves, and accuracy and trust will climb.

An intelligent system isn’t measured by how eloquently it writes, but by how accurately it remembers, retrieves, and reasons from evidence.


References & Further Reading

  1. Phil Schmid, “Fine-tune Embedding Models for Retrieval-Augmented Generation”
    Practical patterns for contrastive training with SentenceTransformers, hard-negative mining, and retrieval-oriented evaluation.

  2. Databricks, “Improving Retrieval and RAG with Embedding Model Finetuning”
    Enterprise-focused discussion on domain adaptation, production deployment, and downstream RAG accuracy gains.