LLM Fine-Tuning Guide: Understanding Training Parameters
Created: 2025-10-14 22:57:34 | Last updated: 2025-10-14 22:57:34 | Status: Public
A practical guide to supervised fine-tuning (SFT) for large language models, explained in plain language.
Table of Contents
- Core Training Concepts
- The Pattern: Balancing Training Parameters
- Target Loss Goals
- Practical Decision Tree
- Real-World Example: TCCC Medical Model
- Quick Reference
- Best Practices
- Completion-Based SFT Specifics
- Additional Resources
- Summary
Core Training Concepts
1. Batch Size (per_device_train_batch_size)
How many examples the model processes at once in a single forward/backward pass.
- Batch size = 1: Model sees 1 example at a time
- Batch size = 2: Model sees 2 examples at a time
Why it matters:
- Larger batches = more GPU memory (VRAM) needed
- Set to what your GPU can handle without OOM (Out Of Memory)
Real-world analogy: Studying 1 vocabulary word at a time vs. 5 words at a time. More is faster but needs more mental capacity.
2. Gradient Accumulation (gradient_accumulation_steps)
A trick to simulate larger batches without using more memory.
How it works:
1. Look at example 1, calculate error (don’t update yet)
2. Look at example 2, add to error calculation (still don’t update)
3. …continue for N steps…
4. NOW update the model with accumulated error
Math:
Effective Batch Size = batch_size × gradient_accumulation_steps
Example:
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
Effective batch size = 8
Real-world analogy: Taking notes on 8 vocab words, THEN having one study session, instead of studying after each word.
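A minimal PyTorch sketch of that loop (toy model and data, purely illustrative; trainers like TRL's handle this for you when you set gradient_accumulation_steps):

import torch

# Toy setup (illustrative only): a tiny linear model on random data
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()
data = [(torch.randn(1, 4), torch.randn(1, 1)) for _ in range(16)]

accumulation_steps = 8  # effective batch = micro-batch (1) × 8

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    # Scale so the accumulated gradient matches one big-batch average
    (loss / accumulation_steps).backward()  # accumulate; no update yet
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # NOW update with the accumulated gradient
        optimizer.zero_grad()  # reset for the next accumulation window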
3. Learning Rate (learning_rate)
How much the model changes its weights on each update step.
The formula:
New Weight = Old Weight - (Learning Rate × Gradient)
Examples:
- Too high (0.001 = 1e-3): Changes too much, unstable, forgets old knowledge
- Too low (0.00001 = 1e-5): Changes too little, learns very slowly
- Just right (0.0001 = 1e-4): Steady, stable improvement
Real-world analogy: Adjusting your bike speed. Too fast = crash on turns. Too slow = takes forever.
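The update rule with toy numbers (values are illustrative, not from a real model):

# One weight update at three different learning rates
old_weight, gradient = 0.50, 2.0

for lr in (1e-3, 1e-5, 1e-4):
    new_weight = old_weight - lr * gradient
    print(f"lr={lr:.0e}: weight {old_weight} -> {new_weight:.5f}")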
4. Epochs (num_train_epochs)
How many times you go through ALL your training data.
- 1 epoch = See each example once
- 5 epochs = See each example 5 times
- 8 epochs = See each example 8 times
Why it matters:
- More epochs = more repetition = better memorization
- Too many epochs = overfitting (memorizes exact wording, can’t generalize)
Real-world analogy: Reading your study notes 1 time vs. 8 times.
5. Training Steps (calculated automatically)
Total number of weight updates during training.
Formula:
steps = (number of examples ÷ effective batch size) × epochs
Example:
625 examples ÷ 8 effective batch = 78.125 → 79 steps per epoch (the last partial batch still counts)
79 steps × 8 epochs = 632 total steps
Each step:
1. Look at a batch of examples
2. Calculate how wrong the model is
3. Update weights slightly
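The same arithmetic in code (most trainers round the last partial batch up, which is why logged step counts run slightly higher than plain division suggests):

import math

examples, effective_batch, epochs = 625, 8, 8

steps_per_epoch = math.ceil(examples / effective_batch)  # 78.125 -> 79
total_steps = steps_per_epoch * epochs                   # 632
print(total_steps)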
6. Loss (the number you watch)
How wrong the model is.
| Loss Value | What It Means |
|---|---|
| 3.0+ | Model has no idea (random guessing) |
| 1.0-2.0 | Model is learning |
| 0.5-0.8 | Model is getting good |
| 0.2-0.5 | Target range - well trained |
| 0.05-0.2 | Very good, watch for overfitting |
| 0.0-0.05 | Overfit - memorized too perfectly |
7. Gradient (aka Slope)
The direction and magnitude of error for each weight.
In math terms:
Gradient = ∂Error/∂Weight = "slope of error curve"
What it tells you:
- Direction: Should this weight go up or down?
- Magnitude: How much does it contribute to error?
Real-world analogy: You’re hiking down a mountain (minimizing loss). Gradient is your compass pointing downhill plus how steep the slope is.
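A tiny autograd check makes this concrete (toy numbers; any autodiff framework works the same way):

import torch

# Squared error of a one-weight "model": prediction = w * x
w = torch.tensor(3.0, requires_grad=True)
x, target = 2.0, 10.0

error = (w * x - target) ** 2  # prediction 6.0 -> error (6 - 10)^2 = 16
error.backward()               # compute dError/dw

print(w.grad)  # tensor(-16.): sign says "increase w", magnitude says "by a lot"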
The Pattern: Balancing Training Parameters
The Core Principle
More data = Need more training time
Think of it like studying:
- 100 flashcards × 10 reviews = 1,000 total views
- 200 flashcards × 5 reviews = 1,000 total views
But the first gives you 10 views per card vs. 5 views per card - better memorization.
The Formula (Simplified)
Learning Quality ∝ (Epochs × Learning Rate) / Data Size
When you add more data, you MUST either:
1. Increase epochs proportionally
2. Increase learning rate
3. Or both
Example Calculations
You went from 625 examples to 825 examples (+32% more data)
Option A: Increase Epochs
Old: 625 examples × 8 epochs = 5,000 example views
New: 825 examples × ? epochs = 5,000 example views
? = 5,000 / 825 = 6.06 → Round up to 7 epochs
Option B: Increase Learning Rate
Old: 5e-5 learning rate
New: 5e-5 × 1.32 (32% more data) = 6.6e-5 → round up to 8e-5
Option C: Both (Recommended)
num_train_epochs = 7, # Compensate for more data
learning_rate = 1e-4, # Higher to learn faster
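These rules of thumb fit in a throwaway helper (a hypothetical function, just encoding the arithmetic above):

import math

def compensate_for_more_data(old_examples, new_examples, old_epochs, old_lr):
    """Keep total example views constant when the dataset grows."""
    target_views = old_examples * old_epochs             # e.g. 625 × 8 = 5,000
    new_epochs = math.ceil(target_views / new_examples)  # Option A
    new_lr = old_lr * (new_examples / old_examples)      # Option B
    return new_epochs, new_lr

print(compensate_for_more_data(625, 825, 8, 5e-5))  # (7, ~6.6e-05) -> round LR up to 8e-5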
Target Loss Goals
| Loss Range | What It Means | What To Do |
|---|---|---|
| < 0.2 | Overfit - Memorized too well, can't generalize | Reduce epochs OR lower learning rate |
| 0.2-0.5 | ✓ Sweet Spot - Learned well, can generalize | Test the model - you're done! |
| 0.5-1.0 | Underfit - Didn't learn enough, making things up | More epochs OR higher learning rate |
| > 1.0 | Barely Learning - Need much more training | Double epochs OR triple learning rate |
Practical Decision Tree
After Training: Check Final Loss
Loss < 0.2?
→ OVERFIT
→ Next run: REDUCE epochs by 20% OR LOWER learning rate by 50%
Loss 0.2-0.5?
→ PERFECT
→ Test model thoroughly
→ If good = DONE ✓
Loss 0.5-1.0?
→ UNDERFIT
→ Next run: INCREASE epochs by 50% OR RAISE learning rate by 2x
Loss > 1.0?
→ BARELY LEARNED
→ Next run: DOUBLE epochs OR TRIPLE learning rate
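The same tree as a small function, if you want the check inside a training script (hypothetical helper; thresholds are the ones above):

def next_run_advice(final_loss: float) -> str:
    """Map a final training loss to the adjustment suggested above."""
    if final_loss < 0.2:
        return "Overfit: reduce epochs by 20% OR halve the learning rate"
    if final_loss <= 0.5:
        return "Perfect: test the model thoroughly"
    if final_loss <= 1.0:
        return "Underfit: increase epochs by 50% OR double the learning rate"
    return "Barely learned: double epochs OR triple the learning rate"

print(next_run_advice(0.7))  # Underfit: increase epochs by 50% OR double the learning rate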
Real-World Example: TCCC Medical Model
Three Training Runs Compared
| Run | Examples | Epochs | Learning Rate | Total Steps | Final Loss | Result |
|---|---|---|---|---|---|---|
| 1 | 625 | 5 | 1e-4 | 395 | 0.4-0.6 | ✓ Okay, some hallucination |
| 2 | 625 | 8 | 5e-5 | 632 | 0.02-0.1 | ✗ TOO perfect (overfit) |
| 3 | 825 | 5 | 8e-5 | 515 | 0.6-0.9 | ✗ Didn’t learn enough |
What Happened in Each Run
Run 1 → Run 2:
- Same data (625 examples)
- More epochs (5→8) + Lower learning rate (1e-4 → 5e-5)
- Result: Loss dropped to 0.02 - model overfit, memorized exact wording
Run 2 → Run 3:
- Added 32% more data (625→825)
- Reduced epochs by 37% (8→5)
- Result: Loss jumped to 0.7 - not enough training for the extra data
The Correct Settings for Run 3
# Starting point: 825 examples
# (model, tokenizer, and dataset are assumed to be loaded already)
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
args = SFTConfig(
per_device_train_batch_size = 1,
gradient_accumulation_steps = 8, # Effective batch = 8
warmup_steps = 10,
num_train_epochs = 7, # Increased from 5
learning_rate = 1e-4, # Increased from 8e-5
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "cosine",
seed = 3407,
output_dir = "outputs",
assistant_only_loss = True, # Train only on completions
max_seq_length = 2048,
),
)
Expected result: Final loss around 0.2-0.4 (sweet spot)
Quick Reference
Training Parameters Cheat Sheet
# For small datasets (500-1000 examples)
per_device_train_batch_size = 1    # Adjust for your GPU
gradient_accumulation_steps = 8    # Effective batch of 8
num_train_epochs = 7               # Typically 5-8; more for smaller datasets
learning_rate = 5e-5               # Start conservative; 5e-5 to 1e-4 is typical
lr_scheduler_type = "cosine"       # Smooth learning-rate decay
When to Adjust
If you get OOM (Out of Memory):
- Reduce per_device_train_batch_size to 1
- Increase gradient_accumulation_steps to compensate
If adding more training data:
- Increase num_train_epochs proportionally
- OR increase learning_rate
- OR both
If loss is too high (>0.5):
- Increase num_train_epochs by 30-50%
- OR increase learning_rate by 50-100%
If loss is too low (<0.2):
- Reduce num_train_epochs by 20-30%
- OR reduce learning_rate by 50%
Loss Interpretation
✓ Target: 0.2-0.5 for production models
- Below 0.2: Likely overfit - test carefully
- Above 0.5: Underfit - needs more training
- Between 0.2-0.5: Good balance - test and deploy
Best Practices
1. Dataset Quality Matters Most
- Clean, accurate data beats fancy hyperparameters
- Add question variations (2-3 per important topic)
- Include “I don’t know” examples for out-of-domain queries
2. Watch the Loss Curve
- Should steadily decrease
- Occasional spikes are normal
- Consistent plateau means adjust learning rate
3. Test Thoroughly
- Test exact matches (should answer correctly)
- Test variations (should answer if close enough)
- Test out-of-domain (should refuse)
4. Start Conservative
- Begin with lower learning rate
- Add more epochs rather than higher learning rate
- Easier to recover from underfit than overfit
5. Document Everything
- Save training configs
- Note final loss values
- Keep test results for comparison
Completion-Based SFT Specifics
Key Settings for Instruction-Following Models
completion_only_loss = False # Not using prompt-completion format
assistant_only_loss = True # Only compute loss on assistant responses
What this does:
- Trains the model ONLY on the assistant’s completions
- Ignores loss on user questions and system prompts
- More efficient - focuses learning on what the model should output
Dataset Format
Input format (instruction/completion):
{
"instruction": "What is MARCH in TCCC?",
"completion": "MARCH stands for Massive hemorrhage, Airway, Respirations, Circulation, Hypothermia/Head injury..."
}
Converted to messages format:
{
"messages": [
{"role": "user", "content": "What is MARCH in TCCC?"},
{"role": "assistant", "content": "MARCH stands for..."}
]
}
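A minimal conversion sketch (assuming the instruction/completion keys shown above; with Hugging Face Datasets you would pass this to dataset.map):

def to_messages(example: dict) -> dict:
    """Turn an instruction/completion pair into the messages format."""
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["completion"]},
        ]
    }

# e.g.: dataset = dataset.map(to_messages, remove_columns=["instruction", "completion"])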
Chat template applies formatting:
<|start_header_id|>user<|end_header_id|>
What is MARCH in TCCC?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
MARCH stands for...<|eot_id|>
With assistant_only_loss=True, loss is only computed on the assistant’s response tokens.
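To inspect exactly what your model's template produces, you can render a conversation without tokenizing (sketch; assumes a Llama-3-style checkpoint, which is where the header/eot framing above comes from):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "user", "content": "What is MARCH in TCCC?"},
    {"role": "assistant", "content": "MARCH stands for..."},
]

# Render as text so the special tokens are visible
print(tokenizer.apply_chat_template(messages, tokenize=False))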
Additional Resources
Tools
- Unsloth: Fast, memory-efficient fine-tuning
- TRL (Transformer Reinforcement Learning): SFT, DPO, PPO trainers
- Hugging Face Datasets: Easy dataset management
Summary
The Pattern (Simple Version):
- More data = need more training time
- Target final loss: 0.2-0.5
- Too low (< 0.2) = overfit = reduce training
- Too high (> 0.5) = underfit = increase training
Three knobs to balance:
- Data size
- Epochs (repetitions)
- Learning rate (step size)
Goal: Hit the 0.2-0.5 loss sweet spot where the model has learned well without overfitting.