How to Prepare Training Data for AI Models
Created: 2025-10-13 19:16:39 | Last updated: 2025-10-13 19:16:39 | Status: Public
What This Guide Teaches You
This guide shows you how to turn books, manuals, or documents into training data for an AI model. By the end, you’ll know how to make your AI answer questions accurately without making stuff up.
Step 1: Get Your Raw Data
Start with your source material. This could be:
- PDF files
- Word documents
- Text files
- Scanned images of text (run OCR on these first to extract the text)
Example: A 500-page medical manual in PDF format.
Step 2: Clean Up the Data
Remove junk that doesn’t help:
- Page numbers
- Headers and footers
- Table of contents
- Copyright notices
- Weird formatting from scanning errors
Why this matters: You want pure information, not clutter. Page numbers don’t teach an AI anything useful.
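Here is a minimal sketch of this cleanup in Python, assuming you have already extracted the document as one string per page. The function names (find_repeated_lines, clean_page) and the 50% threshold are illustrative assumptions, not part of any standard tool. The idea: a line that repeats on most pages is probably a header or footer, and a line that is only a number is probably a page number.

```python
import re
from collections import Counter


def find_repeated_lines(pages: list[str], min_fraction: float = 0.5) -> set[str]:
    """Return lines that appear on at least `min_fraction` of pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    return {line for line, n in counts.items() if n >= threshold}


def clean_page(text: str, repeated_lines: set[str]) -> str:
    """Drop bare page numbers and known header/footer lines from one page."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Matches "12", "Page 12", or "- 12 -" when nothing else is on the line.
        if re.fullmatch(r"(?:page\s+)?-?\s*\d+\s*-?", stripped, flags=re.IGNORECASE):
            continue
        if stripped in repeated_lines:
            continue
        kept.append(line)
    return "\n".join(kept)
```

This is a heuristic. It will also delete a genuine line that happens to be just a number, so spot-check a few cleaned pages before processing the whole document.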
Step 3: Break It Into Chunks
Cut your document into smaller sections. Each section should:
- Be 1000-2000 words long (about 2-4 pages)
- Cover ONE complete topic
- Not cut off mid-sentence or mid-procedure
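The size rule can be sketched in Python as follows, assuming paragraphs in your cleaned text are separated by blank lines. The function name chunk_paragraphs and the max_words default are assumptions for illustration. The sketch only ever splits between paragraphs, so it never cuts a sentence in half, but it cannot tell whether a chunk covers one complete topic; that still needs a human check or splitting on section headings.

```python
def chunk_paragraphs(text: str, max_words: int = 2000) -> list[str]:
    """Group whole paragraphs into chunks of at most `max_words` words."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than max_words is kept whole rather than split, which favors complete text over strict size limits.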
The Overlap Trick
For procedures that follow each other (like medical steps), use this method:
Each chunk contains:
1. The FULL procedure before it (for context)
2. The MAIN procedure (this is your focus)
3. The FULL procedure after it (for context)
Example chunk layout:
[FULL: How to check if someone is breathing]
>>> MAIN: How to do a jaw thrust <<<
[FULL: How to insert a breathing tube]
Important: Only ask questions about the MAIN procedure. The surrounding procedures are just there so the AI understands how things connect.
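The overlap layout can be sketched like this, assuming you already have the procedures as a list of strings in document order. The Chunk class, the function names, and the bracketed labels are hypothetical choices for illustration; use whatever markers your question-generation prompt expects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    before: Optional[str]  # full previous procedure, context only
    main: str              # the procedure questions are asked about
    after: Optional[str]   # full next procedure, context only


def build_overlap_chunks(procedures: list[str]) -> list[Chunk]:
    """Pair every procedure with its full neighbors for context."""
    chunks = []
    for i, main in enumerate(procedures):
        before = procedures[i - 1] if i > 0 else None
        after = procedures[i + 1] if i + 1 < len(procedures) else None
        chunks.append(Chunk(before, main, after))
    return chunks


def render(chunk: Chunk) -> str:
    """Lay the chunk out as text, marking which part is the focus."""
    parts = []
    if chunk.before:
        parts.append(f"[CONTEXT BEFORE]\n{chunk.before}")
    parts.append(f">>> MAIN <<<\n{chunk.main}")
    if chunk.after:
        parts.append(f"[CONTEXT AFTER]\n{chunk.after}")
    return "\n\n".join(parts)
```

The first and last procedures simply have one neighbor instead of two.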
Step 4: Generate Questions and Answers
For each chunk, use a smart AI (like GPT-4 or Claude) to create 3-5 question-and-answer pairs.
Types of Questions to Create
1. Basic facts
- “What is the first step in performing a jaw thrust?”
2. How-to questions
- “How do you perform a jaw thrust on an unconscious patient?”
3. Why questions
- “Why do you use a jaw thrust instead of tilting the head back?”
4. When-not-to questions
- “When should you NOT perform a jaw thrust?”
The Golden Rule
The AI must ONLY answer from the text in front of it. No outside knowledge. No guessing.
Use this prompt:
“Create 5 questions and answers from this text. Only use information from this exact text. If the text doesn’t say something, don’t include it in your answer.”
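If you are scripting this step, a small helper can fill the prompt in for each chunk. This is a sketch: the wording is the prompt above, while the TEXT: label, the build_prompt name, and the n parameter are assumptions. Sending the prompt to a model is left out because that call depends on which provider you use.

```python
PROMPT_TEMPLATE = (
    "Create {n} questions and answers from this text. "
    "Only use information from this exact text. "
    "If the text doesn't say something, don't include it in your answer.\n\n"
    "TEXT:\n{chunk}"
)


def build_prompt(chunk: str, n: int = 5) -> str:
    """Return the grounding prompt for one chunk, asking for n Q&A pairs."""
    if not 3 <= n <= 5:
        raise ValueError("this guide suggests 3 to 5 pairs per chunk")
    return PROMPT_TEMPLATE.format(n=n, chunk=chunk)
```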
Step 5: Create “I Don’t Know” Examples
This step is critical for safety, especially in medical or technical fields.
What Are These?
Questions the AI cannot answer because the information isn’t in your training data.
How to Make Good Ones
The questions should sound like they belong to your topic, but you don’t actually have the answer.
Example for combat medical training:
✅ Good “I don’t know” questions:
- “How do I treat a nerve gas attack?” (if this isn’t in your manual)
- “What’s the exact dosage of morphine for a 180lb soldier?” (if you don’t include drug dosages)
- “How do I perform emergency brain surgery?” (way beyond your manual’s scope)
❌ Bad “I don’t know” questions:
- “How do I bake a cake?” (obviously not related)
- “What’s the capital of Spain?” (has nothing to do with medicine)
The AI’s answer should be:
- “I don’t know. This information isn’t in my training.”
- OR: “This is outside my scope. Consult a doctor/expert/higher authority.”
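Once you have a list of in-domain questions your source does not cover, turning them into training examples is mechanical. A minimal sketch, assuming you wrote the questions yourself; the question/answer record layout and the function name are hypothetical. It alternates between the two answers above so the AI does not learn only a single refusal sentence.

```python
from itertools import cycle

# The two refusal wordings suggested above.
REFUSALS = [
    "I don't know. This information isn't in my training.",
    "This is outside my scope. Consult a doctor/expert/higher authority.",
]


def make_refusal_examples(questions: list[str]) -> list[dict]:
    """Pair each out-of-scope question with a refusal, alternating the wording."""
    answers = cycle(REFUSALS)
    return [{"question": q, "answer": next(answers)} for q in questions]
```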
Why This Matters
Without these examples, your AI will make up answers when it doesn’t know something. In medicine, that could kill someone.
Step 6: Filter Out Bad Q&A Pairs
Go through all your generated questions and remove:
1. Hallucinations - Answers that include facts NOT in the source text
- Check: Does this fact appear in the chunk? If no, delete it.
2. Vague questions - “Tell me about airways” (too broad)
- Keep: “What are the three steps to assess airway patency?” (specific)
3. Too-short answers - “Yes” or “Use a tourniquet” (not helpful)
- Keep: Answers with at least 2-3 complete sentences
4. Questions needing other sections - “Compare jaw thrust to cricothyrotomy”
- Problem: If cricothyrotomy is in a different chunk, this question can’t be answered properly
Quick Quality Check
For each Q&A pair, ask yourself:
1. Is the answer completely from the source text?
2. Would this question actually help someone learn?
3. Is the answer detailed enough to be useful?
If any answer is “no,” delete that pair.
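Part of this filtering can be pre-screened with a script before you review by hand. The sketch below is a rough heuristic, not real hallucination detection: it checks that an answer has at least two sentences and that most of its words also appear in the source chunk. The function names and both thresholds are assumptions you should tune on your own data.

```python
import re


def sentence_count(text: str) -> int:
    """Count sentences by splitting on ., ! and ? (crude; decimals like 2.5 are miscounted)."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def grounded_fraction(answer: str, chunk: str) -> float:
    """Fraction of the answer's words that also occur somewhere in the chunk."""
    chunk_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    answer_words = re.findall(r"[a-z0-9]+", answer.lower())
    if not answer_words:
        return 0.0
    return sum(w in chunk_words for w in answer_words) / len(answer_words)


def keep_pair(answer: str, chunk: str,
              min_sentences: int = 2, min_grounded: float = 0.8) -> bool:
    """Return False for answers that are too short or mostly absent from the chunk."""
    if sentence_count(answer) < min_sentences:
        return False
    return grounded_fraction(answer, chunk) >= min_grounded
```

A pair that passes can still contain a wrong fact built from words that happen to be in the chunk, so treat this as a first pass that shrinks the pile you check manually.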
Step 7: Add Question Variations
Take your good questions and rewrite them in different ways.
Original: “How do you perform a jaw thrust?”
Variations:
- “What’s the procedure for doing a jaw thrust?”
- “Walk me through performing a jaw thrust.”
- “I need to do a jaw thrust. What are the steps?”
Why this helps: People ask the same question in many different ways. This teaches your AI to recognize them all.
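In data terms, every variation becomes its own training example with the same answer. A small sketch, where the record layout and the group field are assumptions: the group id marks which examples are rewordings of one question, which is handy later if you want to keep them together.

```python
def expand_variations(question: str, answer: str,
                      variations: list[str], group_id: int) -> list[dict]:
    """Create one record per phrasing, all sharing the same answer and group id."""
    return [
        {"group": group_id, "question": q, "answer": answer}
        for q in [question, *variations]
    ]
```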
Step 8: Create Correction Examples
Make questions where someone has a wrong idea, and the AI corrects them.
Example:
Question: “Should I tilt the head back if someone has a neck injury?”
Answer: “No, that’s incorrect. If you suspect a neck injury, use a jaw thrust instead of tilting the head back. Tilting the head could damage the spine further. Here’s how to do a jaw thrust safely…”
Why this helps: Real users make mistakes. Your AI needs to catch and correct dangerous errors.
Step 9: Format for Training
Convert everything into the format your training software needs. Common formats:
ShareGPT format:
{
  "conversations": [
    {"from": "human", "value": "How do you perform a jaw thrust?"},
    {"from": "assistant", "value": "To perform a jaw thrust..."}
  ]
}
Alpaca format:
{
  "instruction": "How do you perform a jaw thrust?",
  "input": "",
  "output": "To perform a jaw thrust..."
}
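Check your training tool’s documentation for the exact format it needs. Field and role names vary between tools; some ShareGPT loaders, for example, expect "gpt" rather than "assistant" as the model’s role.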
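A minimal conversion sketch in Python, using the two layouts shown above. The function names are assumptions, and writing one JSON object per line (JSONL) is a common choice rather than a requirement; some tools want a single JSON array instead.

```python
import json


def to_sharegpt(question: str, answer: str) -> dict:
    """Build one record in the ShareGPT layout shown above."""
    return {
        "conversations": [
            {"from": "human", "value": question},
            {"from": "assistant", "value": answer},
        ]
    }


def to_alpaca(question: str, answer: str) -> dict:
    """Build one record in the Alpaca layout shown above."""
    return {"instruction": question, "input": "", "output": answer}


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

To save the result, write the string returned by to_jsonl to a file opened with encoding="utf-8".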
Step 10: Remove Duplicates
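After all this work, you’ll likely have some Q&A pairs that are nearly word-for-word identical. Remove these near-duplicates to avoid wasting training time. Keep the deliberate rewordings from Step 7; those are different phrasings, not duplicates.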
Tools that help:
- Compare questions word-by-word
- Use embedding similarity (advanced)
- Manual review for small datasets
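The word-by-word comparison can be sketched with Python’s standard difflib module. The 0.9 threshold and the function names are assumptions; a high threshold removes only near-identical questions and leaves genuine rewordings alone. Each question is compared against every kept question, which is fine for a few thousand pairs but slow for very large sets.

```python
import re
from difflib import SequenceMatcher


def normalize(question: str) -> list[str]:
    """Lowercase the question and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", question.lower())


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Compare two questions word by word; a ratio of 1.0 means identical word sequences."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def drop_near_duplicates(pairs: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep the first occurrence of each question; drop later near-copies."""
    kept: list[dict] = []
    for pair in pairs:
        if not any(is_near_duplicate(pair["question"], k["question"], threshold) for k in kept):
            kept.append(pair)
    return kept
```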
Step 11: Split Into Training and Validation Sets
Divide your data:
- 90% for training (the AI learns from this)
- 10% for validation (checks if the AI actually learned)
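Important: Keep these sets completely separate. The validation data should be questions the AI has never seen during training. If you created question variations in Step 7, put every rewording of a question in the same set; otherwise the validation questions are not really unseen.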
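One way to keep the sets separate is to split by question group instead of by individual example, so all rewordings of a question land on the same side. This sketch assumes each record carries a hypothetical group field identifying the original question it came from; the function name and the fixed seed are also assumptions.

```python
import random


def split_by_group(records: list[dict], val_fraction: float = 0.1,
                   seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Split records into (train, validation) without dividing any group."""
    groups = sorted({r["group"] for r in records})
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(groups)
    n_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train = [r for r in records if r["group"] not in val_groups]
    val = [r for r in records if r["group"] in val_groups]
    return train, val
```

Because whole groups move together, the validation share is about 10% of examples only when groups are similar in size.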
Step 12: Mix With General Data (Optional)
If you’re fine-tuning an existing AI model (like Llama or Qwen3), you might not need this step.
But if your AI starts acting weird or forgetting how to hold normal conversations, add 20-30% general instruction-following data:
- Basic math problems
- Simple reasoning tasks
- General knowledge questions
Balance: about 70-80% your specialized data and 20-30% general data.
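The percentage refers to the share of the final mix, not of your specialized set. For a general share g, the number of general examples to add is specialized × g / (1 − g). With 700 specialized examples and a 30% share, that is 700 × 0.3 / 0.7 = 300, for a total of 1,000. A small helper, with a hypothetical function name:

```python
def general_examples_needed(n_specialized: int, general_share: float = 0.3) -> int:
    """How many general examples make up `general_share` of the final mixed dataset."""
    if not 0 <= general_share < 1:
        raise ValueError("general_share must be at least 0 and below 1")
    return round(n_specialized * general_share / (1 - general_share))
```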
Final Checklist
Before training, verify:
- [ ] All chunks are complete (no cut-off sentences)
- [ ] Every answer comes from the source text
- [ ] You have “I don’t know” examples for questions your source doesn’t cover
- [ ] You removed duplicates
- [ ] Questions are specific and useful
- [ ] Answers are detailed (2-3+ sentences)
- [ ] You have question variations
- [ ] Data is in the correct format
- [ ] Training/validation split is done
Common Mistakes to Avoid
1. Letting the AI add outside knowledge
- Wrong: AI answers from what it already knows
- Right: AI only answers from the text you gave it
2. Skipping “I don’t know” examples
- Result: Your AI makes up dangerous answers
3. Making chunks too small
- Problem: Loses context, AI can’t connect related ideas
4. Not removing bad Q&A pairs
- Result: AI learns incorrect information
5. Only using one way to ask questions
- Problem: AI only understands that exact phrasing
How Long Does This Take?
For a 500-page manual:
- Cleaning: 2-4 hours
- Chunking: 1-2 hours
- Generating Q&A: 4-8 hours (with AI help)
- Quality filtering: 4-6 hours
- Creating negatives/variations: 2-3 hours
Total: 13-23 hours of work
Most of this can be automated with good prompts and scripts.
You’re Ready!
You now know how to turn any document into high-quality training data. The key points:
- Clean your data first
- Chunk it smartly (with context overlap for procedures)
- Generate multiple types of questions
- Always include “I don’t know” examples
- Filter ruthlessly for quality
- Add variations so the AI recognizes different phrasings
The most important rule: Your AI should only say what it actually knows, and admit when it doesn’t know something.
Good luck with your training!