LLM Fine-Tuning Guide: Understanding Training Parameters

Created: 2025-10-14 22:57:34 | Last updated: 2025-10-14 22:57:34 | Status: Public

A practical guide to supervised fine-tuning (SFT) for large language models, explained in plain language.


Table of Contents

  1. Core Training Concepts
  2. The Pattern: Balancing Training Parameters
  3. Target Loss Goals
  4. Practical Decision Tree
  5. Real-World Example: TCCC Medical Model
  6. Quick Reference

Core Training Concepts

1. Batch Size (per_device_train_batch_size)

How many examples the model looks at before updating.

  • Batch size = 1: Model sees 1 example at a time
  • Batch size = 2: Model sees 2 examples at a time

Why it matters:
- Larger batches = more GPU memory (VRAM) needed
- Set to what your GPU can handle without OOM (Out Of Memory)

Real-world analogy: Studying 1 vocabulary word at a time vs. 5 words at a time. More is faster but needs more mental capacity.


2. Gradient Accumulation (gradient_accumulation_steps)

A trick to simulate larger batches without using more memory.

How it works:
1. Look at example 1, calculate error (don’t update yet)
2. Look at example 2, add to error calculation (still don’t update)
3. …continue for N steps…
4. NOW update the model with accumulated error

Math:

Effective Batch Size = batch_size × gradient_accumulation_steps

Example:
  per_device_train_batch_size = 1
  gradient_accumulation_steps = 8
  Effective batch size = 8

Real-world analogy: Taking notes on 8 vocab words, THEN having one study session, instead of studying after each word.
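
For the curious, here is a minimal PyTorch sketch of the accumulate-then-update loop using a toy one-layer model (all names here are illustrative, not part of any specific trainer):

import torch

# Toy setup: 1-parameter model, batch size 1, accumulate over 8 examples
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()
data = [(torch.randn(1, 1), torch.randn(1, 1)) for _ in range(16)]

accum_steps = 8  # gradient_accumulation_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()   # gradients add up in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one weight update per 8 examples
        optimizer.zero_grad()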


3. Learning Rate (learning_rate)

How much the model changes its weights after seeing examples.

The formula:

New Weight = Old Weight - (Learning Rate × Gradient)

Examples:
- Too high (0.001): Changes too much, unstable, forgets old knowledge
- Too low (0.00001): Changes too little, learns very slowly
- Just right (0.0001): Steady, stable improvement

Real-world analogy: Adjusting your bike speed. Too fast = crash on turns. Too slow = takes forever.
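
Plugging illustrative numbers into the formula shows how small each nudge is:

learning_rate = 1e-4
old_weight, gradient = 0.50, 2.0

new_weight = old_weight - learning_rate * gradient
print(new_weight)  # 0.4998 -- a tiny, stable nudge in the right direction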


4. Epochs (num_train_epochs)

How many times you go through ALL your training data.

  • 1 epoch = See each example once
  • 5 epochs = See each example 5 times
  • 8 epochs = See each example 8 times

Why it matters:
- More epochs = more repetition = better memorization
- Too many epochs = overfitting (memorizes exact wording, can’t generalize)

Real-world analogy: Reading your study notes 1 time vs. 8 times.


5. Training Steps (calculated automatically)

Total number of weight updates during training.

Formula:

steps per epoch = number of examples ÷ effective batch size (rounded up)
total steps = steps per epoch × epochs

Example:
  625 examples ÷ 8 effective batch = 78.1 → 79 steps per epoch
  79 steps × 8 epochs = 632 steps

Each step:
1. Look at a batch of examples
2. Calculate how wrong the model is
3. Update weights slightly
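
A small helper makes the rounding explicit (partial batches still count as a step, which is why the math above gives 632 rather than 625):

import math

def total_steps(num_examples, effective_batch, epochs):
    # Each epoch rounds partial batches up to a full step
    return math.ceil(num_examples / effective_batch) * epochs

print(total_steps(625, 8, 8))  # 632 -- matches Run 2 in the TCCC example
print(total_steps(625, 8, 5))  # 395 -- matches Run 1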


6. Loss (the number you watch)

How wrong the model is.

Loss Value | What It Means
3.0+       | Model has no idea (random guessing)
1.0-2.0    | Model is learning
0.5-0.8    | Model is getting good
0.2-0.5    | Target range - well trained
0.05-0.2   | Very good, watch for overfitting
0.0-0.05   | Overfit - memorized too perfectly
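
If your loss is standard token-level cross-entropy (the usual case in SFT), exp(-loss) approximates the average probability the model assigns to the correct next token, which makes these ranges concrete:

import math

for loss in (3.0, 1.0, 0.4, 0.05):
    print(f"loss {loss:4.2f} -> ~{math.exp(-loss):.0%} per-token confidence")
# loss 3.00 -> ~5%   (guessing)
# loss 1.00 -> ~37%  (learning)
# loss 0.40 -> ~67%  (target range)
# loss 0.05 -> ~95%  (likely overfit)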

7. Gradient (aka Slope)

The direction and magnitude of error for each weight.

In math terms:

Gradient = ∂Error/∂Weight = "slope of error curve"

What it tells you:
- Direction: Should this weight go up or down?
- Magnitude: How much does it contribute to error?

Real-world analogy: You’re hiking down a mountain (minimizing loss). Gradient is your compass pointing downhill plus how steep the slope is.
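
A toy one-weight example makes the hiking analogy concrete: estimate the slope numerically, then repeatedly step downhill (illustrative code only; real trainers compute gradients via backpropagation, not finite differences):

def error(w):
    return (w - 3.0) ** 2        # toy error curve with its minimum at w = 3

def gradient(w, eps=1e-6):
    # Finite-difference estimate of the slope at w
    return (error(w + eps) - error(w - eps)) / (2 * eps)

w = 0.0
for _ in range(50):
    w -= 0.1 * gradient(w)       # New Weight = Old Weight - (LR x Gradient)
print(round(w, 3))               # ~3.0 -- we walked down to the minimum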


The Pattern: Balancing Training Parameters

The Core Principle

More data = Need more training time

Think of it like studying:
- 100 flashcards × 10 reviews = 1,000 total views
- 200 flashcards × 5 reviews = 1,000 total views

But the first gives you 10 views per card instead of 5, so each card is memorized better.


The Formula (Simplified)

Learning Quality ∝ (Epochs × Learning Rate) / Data Size

When you add more data, you must compensate in at least one of these ways:
1. Increase epochs proportionally
2. Increase the learning rate
3. Do both


Example Calculations

You went from 625 examples to 825 examples (+32% more data)

Option A: Increase Epochs

Old: 625 examples × 8 epochs = 5,000 example views
New: 825 examples × ? epochs = 5,000 example views
? = 5,000 / 825 = 6.06 → Round up to 7 epochs

Option B: Increase Learning Rate

Old: 5e-5 learning rate
New: 5e-5 × 1.32 (32% more data) = 6.6e-5 → round up to 8e-5

Option C: Both (Recommended)

num_train_epochs = 7,    # Compensate for more data
learning_rate = 1e-4,    # Higher to learn faster
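
The same arithmetic can be wrapped in a hypothetical helper (the name and rounding policy are my own, not from any library):

import math

def rescale_for_new_data(old_examples, new_examples, old_epochs, old_lr):
    # Keep total example views constant as the dataset grows
    target_views = old_examples * old_epochs             # 625 x 8 = 5,000
    new_epochs = math.ceil(target_views / new_examples)  # round up
    new_lr = old_lr * (new_examples / old_examples)      # scale LR with data
    return new_epochs, new_lr

print(rescale_for_new_data(625, 825, 8, 5e-5))
# (7, 6.6e-05) -- round the learning rate up to ~8e-5 in practice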

Target Loss Goals

Loss Range | What It Means                                    | What To Do
0.01-0.15  | Overfit - memorized too well, can’t generalize   | Reduce epochs OR lower learning rate
0.2-0.5    | ✓ Sweet Spot - learned well, can generalize      | Test the model - you’re done!
0.6-1.0    | Underfit - didn’t learn enough, making things up | More epochs OR higher learning rate
1.5+       | Barely learning - needs much more training       | Double epochs OR triple learning rate

Practical Decision Tree

After Training: Check Final Loss

Loss < 0.2?
  → OVERFIT
  → Next run: REDUCE epochs by 20% OR LOWER learning rate by 50%

Loss 0.2-0.5?
  → PERFECT
  → Test model thoroughly
  → If good = DONE ✓

Loss 0.5-1.0?
  → UNDERFIT  
  → Next run: INCREASE epochs by 50% OR RAISE learning rate by 2x

Loss > 1.0?
  → BARELY LEARNED
  → Next run: DOUBLE epochs OR TRIPLE learning rate
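
The whole tree fits in one small function if you want to codify it (a sketch of the rules above, not a library API):

def next_run_advice(final_loss):
    if final_loss < 0.2:
        return "OVERFIT: reduce epochs ~20% or halve the learning rate"
    if final_loss <= 0.5:
        return "PERFECT: test the model thoroughly"
    if final_loss <= 1.0:
        return "UNDERFIT: raise epochs ~50% or double the learning rate"
    return "BARELY LEARNED: double epochs or triple the learning rate"

print(next_run_advice(0.7))  # UNDERFIT: raise epochs ~50% or double ...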

Real-World Example: TCCC Medical Model

Three Training Runs Compared

Run | Examples | Epochs | Learning Rate | Total Steps | Final Loss | Result
1   | 625      | 5      | 1e-4          | 395         | 0.4-0.6    | ✓ Okay, some hallucination
2   | 625      | 8      | 5e-5          | 632         | 0.02-0.1   | ✗ Too perfect (overfit)
3   | 825      | 5      | 8e-5          | 515         | 0.6-0.9    | ✗ Didn’t learn enough

What Happened in Each Run

Run 1 → Run 2:
- Same data (625 examples)
- More epochs (5→8) + Lower learning rate (1e-4 → 5e-5)
- Result: Loss dropped to 0.02 - model overfit, memorized exact wording

Run 2 → Run 3:
- Added 32% more data (625→825)
- Reduced epochs by 37% (8→5)
- Result: Loss jumped to 0.7 - not enough training for the extra data

The Correct Settings for Run 3

from trl import SFTTrainer, SFTConfig

# Starting point: 825 examples (model, tokenizer, dataset prepared earlier)
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 8,  # Effective batch = 8
        warmup_steps = 10,
        num_train_epochs = 7,             # Increased from 5
        learning_rate = 1e-4,             # Increased from 8e-5
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        assistant_only_loss = True,       # Train only on completions
        max_seq_length = 2048,
    ),
)

Expected result: Final loss around 0.2-0.4 (sweet spot)


Quick Reference

Training Parameters Cheat Sheet

# For small datasets (500-1000 examples)
per_device_train_batch_size = 1    # adjust to what your GPU can handle
gradient_accumulation_steps = 8    # effective batch of 8
num_train_epochs = 6               # use 5-8: more for smaller datasets
learning_rate = 1e-4               # start conservative: 5e-5 to 1e-4
lr_scheduler_type = "cosine"       # smooth learning-rate decay

When to Adjust

If you get OOM (Out of Memory):
- Reduce per_device_train_batch_size to 1
- Increase gradient_accumulation_steps to compensate
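
For example, this swap halves per-step memory while keeping the effective batch (and thus the training dynamics) unchanged:

# Before (OOM example): batch 2 x accumulation 4 = effective batch 8
per_device_train_batch_size = 1   # halved: less VRAM per step
gradient_accumulation_steps = 8   # doubled: effective batch still 8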

If adding more training data:
- Increase num_train_epochs proportionally
- OR increase learning_rate
- OR both

If loss is too high (>0.5):
- Increase num_train_epochs by 30-50%
- OR increase learning_rate by 50-100%

If loss is too low (<0.2):
- Reduce num_train_epochs by 20-30%
- OR reduce learning_rate by 50%

Loss Interpretation

Target: 0.2-0.5 for production models

  • Below 0.2: Likely overfit - test carefully
  • Above 0.5: Underfit - needs more training
  • Between 0.2-0.5: Good balance - test and deploy

Best Practices

1. Dataset Quality Matters Most

  • Clean, accurate data beats fancy hyperparameters
  • Add question variations (2-3 per important topic)
  • Include “I don’t know” examples for out-of-domain queries

2. Watch the Loss Curve

  • Should steadily decrease
  • Occasional spikes are normal
  • A sustained plateau means the learning rate needs adjusting

3. Test Thoroughly

  • Test exact matches (should answer correctly)
  • Test variations (should answer if close enough)
  • Test out-of-domain (should refuse)

4. Start Conservative

  • Begin with lower learning rate
  • Add more epochs rather than higher learning rate
  • Easier to recover from underfit than overfit

5. Document Everything

  • Save training configs
  • Note final loss values
  • Keep test results for comparison

Completion-Based SFT Specifics

Key Settings for Instruction-Following Models

completion_only_loss = False        # Not using prompt-completion format
assistant_only_loss = True          # Only compute loss on assistant responses

What this does:
- Trains the model ONLY on the assistant’s completions
- Ignores loss on user questions and system prompts
- More efficient - focuses learning on what the model should output
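
Under the hood this is typically done with label masking. Here is a conceptual sketch (not TRL’s actual implementation): tokens outside the assistant turn get label -100, the index PyTorch’s cross-entropy loss ignores by default.

token_ids   = [101, 102, 103, 104, 105, 106]   # illustrative token IDs
token_roles = ["user", "user", "user",
               "assistant", "assistant", "assistant"]

labels = [tok if role == "assistant" else -100
          for tok, role in zip(token_ids, token_roles)]
print(labels)  # [-100, -100, -100, 104, 105, 106]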

Dataset Format

Input format (instruction/completion):

{
  "instruction": "What is MARCH in TCCC?",
  "completion": "MARCH stands for Massive hemorrhage, Airway, Respirations, Circulation, Hypothermia/Head injury..."
}

Converted to messages format:

{
  "messages": [
    {"role": "user", "content": "What is MARCH in TCCC?"},
    {"role": "assistant", "content": "MARCH stands for..."}
  ]
}
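
A minimal sketch of this conversion with Hugging Face datasets (assuming your columns are literally named instruction and completion):

from datasets import Dataset

raw = Dataset.from_list([
    {"instruction": "What is MARCH in TCCC?",
     "completion": "MARCH stands for Massive hemorrhage, Airway, ..."},
])

def to_messages(example):
    # Map one instruction/completion pair to the chat-messages format
    return {"messages": [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["completion"]},
    ]}

dataset = raw.map(to_messages, remove_columns=["instruction", "completion"])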

Chat template applies formatting:

<|start_header_id|>user<|end_header_id|>
What is MARCH in TCCC?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
MARCH stands for...<|eot_id|>

With assistant_only_loss=True, loss is only computed on the assistant’s response tokens.
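
To inspect what the template actually produces for one example (assuming tokenizer is a Hugging Face tokenizer with a Llama-3-style chat template already loaded):

text = tokenizer.apply_chat_template(
    example["messages"],
    tokenize=False,               # return the formatted string, not token IDs
    add_generation_prompt=False,  # training data: no trailing assistant header
)
print(text)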


Additional Resources

Tools

  • Unsloth: Fast, memory-efficient fine-tuning
  • TRL (Transformer Reinforcement Learning): SFT, DPO, PPO trainers
  • Hugging Face Datasets: Easy dataset management

Summary

The Pattern (Simple Version):

  1. More data = need more training time
  2. Target final loss: 0.2-0.5
  3. Too low (< 0.2) = overfit = reduce training
  4. Too high (> 0.5) = underfit = increase training

Three knobs to balance:
- Data size
- Epochs (repetitions)
- Learning rate (step size)

Goal: Hit the 0.2-0.5 loss sweet spot where the model has learned well without overfitting.


Last updated: October 2025