Training an Embedding Encoder for Medical Data

1️⃣ Why Train a Medical Encoder?

In Retrieval-Augmented Generation (RAG) or semantic-search systems, an embedding encoder converts text into numerical vectors.
Texts with similar meaning should produce close vectors in this semantic space.
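
To make "close vectors" concrete, here is a minimal sketch of cosine similarity, the usual closeness measure for sentence embeddings. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; a real encoder such as MPNet produces 768-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not produced by any real model).
hypertension = [0.9, 0.1, 0.2]
high_blood_pressure = [0.8, 0.2, 0.25]
fracture = [0.1, 0.9, 0.3]

sim_synonyms = cosine_similarity(hypertension, high_blood_pressure)
sim_unrelated = cosine_similarity(hypertension, fracture)

# A well-trained encoder should score the synonym pair higher.
print(f"synonyms:  {sim_synonyms:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```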

In the medical domain, this alignment is especially important:

Medical terms have many synonyms or abbreviations (e.g., “Hypertension” ≈ “High blood pressure”)
General encoders often miss clinical context
High precision is crucial for QA or document retrieval

We will fine-tune `sentence-transformers/all-mpnet-base-v2` into a domain-aware encoder, MPNet-medical, using a small medical dataset.

2️⃣ Experiment Setup

Item	Description
Base model	`sentence-transformers/all-mpnet-base-v2`
Framework	Sentence-Transformers (PyTorch)
Dataset	Small medical dataset (`sentence-transformers/stsb`)
Hardware	Single GPU, 16 GB VRAM sufficient (https://www.inference.ai/)

Training Phases:

Sentence similarity learning → CosineSimilarityLoss
QA retrieval learning → MultipleNegativesRankingLoss
(Optional) Hard negative discrimination → TripletLoss

3️⃣ Data Formats

Format	Example	Loss	Use Case	Advantages	Drawbacks
(sent1, sent2, score)	“Patient has high blood pressure.” vs “The patient suffers from hypertension.”, score = 0.95	CosineSimilarityLoss	Sentence similarity	Continuous supervision	Requires labeled scores
(query, pos)	“What are the symptoms of diabetes?” → “Common symptoms include polyuria, thirst, and weight loss.”	MultipleNegativesRankingLoss	QA or retrieval	Automatic negatives (batch-wise)	No explicit hard negatives
(query, pos, neg)	“Signs of myocardial infarction?” → pos: “Chest pain and sweating.”, neg: “Angina pain is brief.”	TripletLoss	Hard-negative discrimination	Improves fine-grained accuracy	Requires curated negatives
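
The three formats in the table can be written down as plain Python records. This is only a sketch: the class names `ScoredPair`, `QueryPositive`, and `Triplet`, and the `LOSS_FOR_FORMAT` lookup, are hypothetical helpers introduced here, not part of Sentence-Transformers. The example texts are the ones from the table.

```python
from typing import NamedTuple

class ScoredPair(NamedTuple):
    sent1: str
    sent2: str
    score: float  # human similarity label

class QueryPositive(NamedTuple):
    query: str
    pos: str

class Triplet(NamedTuple):
    query: str
    pos: str
    neg: str

# Hypothetical lookup: which loss from the table fits which record type.
LOSS_FOR_FORMAT = {
    ScoredPair: "CosineSimilarityLoss",
    QueryPositive: "MultipleNegativesRankingLoss",
    Triplet: "TripletLoss",
}

def loss_name(record):
    """Return the name of the loss that matches a record's format."""
    return LOSS_FOR_FORMAT[type(record)]

examples = [
    ScoredPair("Patient has high blood pressure.",
               "The patient suffers from hypertension.", 0.95),
    QueryPositive("What are the symptoms of diabetes?",
                  "Common symptoms include polyuria, thirst, and weight loss."),
    Triplet("Signs of myocardial infarction?",
            "Chest pain and sweating.", "Angina pain is brief."),
]

for record in examples:
    print(type(record).__name__, "->", loss_name(record))
```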

4️⃣ Loss Function Overview

Loss	Objective	Intuition	When to Use
CosineSimilarityLoss	Minimize (pred_cos − label)²	Align predicted similarity with human labels	When sentence-level similarity scores exist
MultipleNegativesRankingLoss	Rank true pairs higher than in-batch negatives	Learn retrieval relationships	QA or semantic search tasks
TripletLoss	max(0, margin + d(q,pos) − d(q,neg))	Keep positives closer than negatives	For hard-negative fine-tuning
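
The objectives in the table can be sketched in plain Python on toy vectors. This is an illustrative reimplementation, not the Sentence-Transformers code. The `scale` and `margin` values are assumptions chosen for the example, and the triplet distance `d` is taken to be Euclidean.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity_loss(u, v, label):
    """Squared error between predicted cosine similarity and the label."""
    return (cosine(u, v) - label) ** 2

def multiple_negatives_ranking_loss(queries, positives, scale=20.0):
    """Mean cross-entropy over a batch: positives[i] is the target for
    queries[i]; every other positive in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)

def triplet_loss(q, pos, neg, margin=1.0):
    """max(0, margin + d(q, pos) - d(q, neg)) with Euclidean d."""
    return max(0.0, margin + euclidean(q, pos) - euclidean(q, neg))
```

With these definitions, a batch whose queries line up with their own positives gives a near-zero ranking loss, while shuffling the positives makes the loss large. That gap is the signal the in-batch negatives provide.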

5️⃣ Results and Visualization

After fine-tuning, we compare Base (blue) vs Fine-tuned (orange) encoders on two tasks:

Sentence-Pair Spearman: Correlation between predicted and labeled similarity scores.
Retrieval Metrics (Recall@5 / nDCG@10): Ability to rank relevant passages correctly.

[Base] Sentence-pair Spearman: 0.9038
[Fine-tuned] Sentence-pair Spearman: 0.9431
[FT+Triplet] Sentence-pair Spearman: 0.9431
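
For reference, the metrics named above can be computed as in the following plain-Python sketch. It assumes binary relevance for Recall@k and nDCG@k and uses average ranks for ties in Spearman. The function names and document IDs are hypothetical, and this is not necessarily the evaluation code that produced the numbers reported here.

```python
import math

def _ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(pred, gold):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    rp, rg = _ranks(pred), _ranks(gold)
    mp, mg = sum(rp) / len(rp), sum(rg) / len(rg)
    cov = sum((a - mp) * (b - mg) for a, b in zip(rp, rg))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_g = sum((b - mg) ** 2 for b in rg)
    return cov / math.sqrt(var_p * var_g)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG with binary gains: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Spearman only looks at rank order, so it rewards an encoder whose similarity scores sort sentence pairs the same way the labels do, even if the raw values differ.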
