Training an Embedding Encoder for Medical Data
Fine-tune an encoder model (MPNet) on a small medical dataset.
Code: GitHub
1️⃣ Why Train a Medical Encoder?
In Retrieval-Augmented Generation (RAG) or semantic-search systems, an embedding encoder converts text into numerical vectors.
Texts with similar meaning should produce close vectors in this semantic space.
In the medical domain, this alignment is especially important:
Medical terms have many synonyms or abbreviations (e.g., “Hypertension” ≈ “High blood pressure”)
General encoders often miss clinical context
High precision is crucial for QA or document retrieval
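To make "close vectors" concrete, here is a minimal sketch of cosine similarity, the usual closeness measure for sentence embeddings. The three-dimensional vectors are invented for illustration; real MPNet embeddings have 768 dimensions and would come from the encoder itself.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not produced by any real model).
hypertension = [0.9, 0.1, 0.3]
high_blood_pressure = [0.8, 0.2, 0.35]
fractured_wrist = [0.1, 0.9, 0.2]

# Synonyms should score higher than unrelated terms.
print(cosine_similarity(hypertension, high_blood_pressure))
print(cosine_similarity(hypertension, fractured_wrist))
```

A well-aligned medical encoder should give the synonym pair a clearly higher score than the unrelated pair.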
We will fine-tune sentence-transformers/all-mpnet-base-v2 into a domain-aware encoder, MPNet-medical, using a small medical dataset.
2️⃣ Experiment Setup
| Item | Description |
| --- | --- |
| Base model | sentence-transformers/all-mpnet-base-v2 |
| Framework | Sentence-Transformers (PyTorch) |
| Dataset | Small medical dataset (sentence-transformers/stsb) |
| Hardware | Single GPU (16 GB VRAM sufficient) (https://www.inference.ai/) |
Training Phases:
Sentence similarity learning → CosineSimilarityLoss
QA retrieval learning → MultipleNegativesRankingLoss
(Optional) Hard negative discrimination → TripletLoss
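The phases can be chained roughly as follows. This is a sketch, not the exact training script: it uses the older `model.fit` interface of Sentence-Transformers, and the batch size, epoch count, output path, and the `train_phases` helper itself are assumptions chosen for illustration.

```python
# Loss class names in the order the phases run; the third phase is optional.
PHASE_LOSSES = ["CosineSimilarityLoss", "MultipleNegativesRankingLoss", "TripletLoss"]

def train_phases(datasets,
                 base_model="sentence-transformers/all-mpnet-base-v2",
                 output_path="mpnet-medical",  # hypothetical output directory
                 batch_size=16,                # assumed, tune for your GPU
                 epochs=1):                    # assumed
    """Train one phase per entry in `datasets`.

    `datasets` holds one list of sentence_transformers.InputExample per phase,
    in PHASE_LOSSES order. Passing only two lists skips the TripletLoss phase.
    """
    # Imported lazily so this file loads even without the heavy dependencies.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, losses

    model = SentenceTransformer(base_model)
    for loss_name, examples in zip(PHASE_LOSSES, datasets):
        loss = getattr(losses, loss_name)(model)
        loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
        # Each phase continues from the weights left by the previous one.
        model.fit(train_objectives=[(loader, loss)], epochs=epochs,
                  show_progress_bar=False)
    model.save(output_path)
    return model
```

Because the same `model` object is reused, later phases refine the encoder produced by earlier ones rather than starting from the base checkpoint.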
3️⃣ Data Formats
| Format | Example | Loss | Use Case | Advantages | Drawbacks |
| --- | --- | --- | --- | --- | --- |
| (sent1, sent2, score) | “Patient has high blood pressure.” vs “The patient suffers from hypertension.”, score = 0.95 | CosineSimilarityLoss | Sentence similarity | Continuous supervision | Requires labeled scores |
| (query, pos) | “What are the symptoms of diabetes?” → “Common symptoms include polyuria, thirst, and weight loss.” | MultipleNegativesRankingLoss | QA or retrieval | Automatic negatives (batch-wise) | No explicit hard negatives |
| (query, pos, neg) | “Signs of myocardial infarction?” → pos: “Chest pain and sweating.”, neg: “Angina pain is brief.” | TripletLoss | Hard-negative discrimination | Improves fine-grained accuracy | Requires curated negatives |
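One way to represent the three formats in code is shown below. The class names and the `to_texts_and_label` helper are hypothetical, introduced only to illustrate the shapes; the example sentences are taken from the table.

```python
from typing import NamedTuple

class ScoredPair(NamedTuple):
    sent1: str
    sent2: str
    score: float  # human similarity label, here on a 0-1 scale

class QueryPositive(NamedTuple):
    query: str
    pos: str

class Triplet(NamedTuple):
    query: str
    pos: str
    neg: str

def to_texts_and_label(record):
    """Split a record into (texts, label); label is None for unscored formats."""
    if isinstance(record, ScoredPair):
        return [record.sent1, record.sent2], record.score
    return list(record), None

pair = ScoredPair("Patient has high blood pressure.",
                  "The patient suffers from hypertension.", 0.95)
qa = QueryPositive("What are the symptoms of diabetes?",
                   "Common symptoms include polyuria, thirst, and weight loss.")
triplet = Triplet("Signs of myocardial infarction?",
                  "Chest pain and sweating.", "Angina pain is brief.")

for record in (pair, qa, triplet):
    print(to_texts_and_label(record))
```

With Sentence-Transformers, the `texts` list and optional `label` would typically be passed to `InputExample(texts=..., label=...)`, with the label left out for the two unscored formats.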
4️⃣ Loss Function Overview
| Loss | Objective | Intuition | When to Use |
| --- | --- | --- | --- |
| CosineSimilarityLoss | Minimize (pred_cos − label)² | Align predicted similarity with human labels | When sentence-level similarity scores exist |
| MultipleNegativesRankingLoss | Rank true pairs higher than in-batch negatives | Learn retrieval relationships | QA or semantic search tasks |
| TripletLoss | max(0, margin + d(q,pos) − d(q,neg)) | Keep positives closer than negatives | For hard-negative fine-tuning |
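The three objectives can be written out on plain vectors to show what each one computes. This is a simplified sketch, not the library's implementation: it works on precomputed embeddings, and the `scale=20.0` and `margin=5.0` defaults as well as the Euclidean distance in the triplet loss are assumptions made for illustration.

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity_loss(u, v, label):
    """Squared error between predicted cosine similarity and the label."""
    return (cos_sim(u, v) - label) ** 2

def multiple_negatives_ranking_loss(queries, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for queries[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cos_sim(q, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)

def triplet_loss(q, pos, neg, margin=5.0, distance=euclidean):
    """max(0, margin + d(q, pos) - d(q, neg))"""
    return max(0.0, margin + distance(q, pos) - distance(q, neg))

queries = [[1.0, 0.0], [0.0, 1.0]]
print(multiple_negatives_ranking_loss(queries, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs
print(multiple_negatives_ranking_loss(queries, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

The matched batch yields a loss near zero while the swapped batch is penalized heavily, which is the ranking behavior the table describes.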
5️⃣ Results and Visualization
After fine-tuning, we compare Base (blue) vs Fine-tuned (orange) encoders on two tasks:
Sentence-Pair Spearman: Correlation between predicted and labeled similarity scores.
Retrieval Metrics (Recall@5 / nDCG@10): Ability to rank relevant passages correctly.
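For reference, the metrics can be computed along these lines. This is a from-scratch sketch assuming binary relevance (a passage is either relevant or not); the function names are ours, and an evaluation library may differ in tie handling or gain definitions.

```python
import math

def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(pred, gold):
    """Pearson correlation computed on the ranks of the two score lists."""
    rp, rg = _ranks(pred), _ranks(gold)
    mp, mg = sum(rp) / len(rp), sum(rg) / len(rg)
    cov = sum((a - mp) * (b - mg) for a, b in zip(rp, rg))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_g = sum((b - mg) ** 2 for b in rg)
    return cov / math.sqrt(var_p * var_g)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant passages that appear in the top k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Discounted gain of the top k, normalized by the best possible ordering."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal

print(spearman([0.2, 0.7, 0.9], [0.1, 0.5, 0.95]))
print(recall_at_k(["d3", "d1", "d7"], ["d1", "d9"], k=2))
print(ndcg_at_k(["d3", "d1", "d7"], ["d1", "d9"], k=10))
```

Spearman only looks at rank order, so it rewards an encoder that sorts pairs correctly even if its raw cosine values are on a different scale than the labels.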
[Base] Sentence-pair Spearman: 0.9038
[Fine-tuned] Sentence-pair Spearman: 0.9431
[FT+Triplet] Sentence-pair Spearman: 0.9431
