Training SLMs on Synthetic Data: The Distillation Pipeline
Citable Key Findings
- The Data Wall: We have run out of high-quality human text. Synthetic data generated by Frontier Models is the only way to scale training for Small Language Models (SLMs).
- Model Collapse: Training on raw, unfiltered synthetic data leads to model collapse. Agentic Filtering (using a Critic Agent to grade data) is essential.
- Task-Specific Excellence: A 3B-parameter model trained on synthetic CoT (Chain of Thought) data can outperform GPT-4 on specific domains like SQL generation or medical coding.
- Cost: Distilling a domain-expert SLM costs less than $5,000 in compute, democratizing "Sovereign AI."
The Distillation Architecture
Instead of training a model to "know everything," we train it to "copy the teacher."
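To make "copy the teacher" concrete, the sketch below shows how the teacher's outputs become the student's supervised fine-tuning set. The field names match the generator schema defined in Step 1 below; `to_training_pair`, `write_sft_file`, and the chat-message layout are illustrative conventions, not a fixed API.
Python: Formatting Teacher Outputs for the Student
# Convert raw teacher samples into supervised fine-tuning (SFT) pairs.
# The keys ("problem", "cot", "code") mirror the generator's JSON schema;
# the chat-role layout is one common SFT convention, not a required format.
import json

def to_training_pair(sample: dict) -> dict:
    # The student is trained to reproduce the teacher's reasoning and answer.
    return {
        "messages": [
            {"role": "user", "content": sample["problem"]},
            {"role": "assistant", "content": f"{sample['cot']}\n\n{sample['code']}"},
        ]
    }

def write_sft_file(samples: list[dict], path: str = "distill_train.jsonl") -> None:
    # One JSON object per line, the format most SFT trainers accept.
    with open(path, "w") as f:
        for sample in samples:
            f.write(json.dumps(to_training_pair(sample)) + "\n")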
The Synthetic Data Factory
Step 1: Generating "Textbook Quality" Data
The key is to ask the Teacher Model to explain its reasoning step-by-step.
Python: Data Generation Loop
# Synthetic Data Generator
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

SYSTEM_PROMPT = """
You are a professor of computer science.
Generate a complex Python coding problem, then solve it using a step-by-step Chain of Thought.
Format: {"problem": str, "cot": str, "code": str}
"""

def generate_sample():
    # Force JSON output mode
    response = model.generate_content(
        f"{SYSTEM_PROMPT}\n\nTask: Generate one sample.",
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)

# Generate 10,000 samples
dataset = [generate_sample() for _ in range(10000)]
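The one-line comprehension above is fine for a demo, but a real run of 10,000 calls will hit rate limits and the occasional malformed JSON response. A more defensive collection loop might look like the sketch below; the retry count and back-off values are arbitrary illustrations, not tuned settings.
Python: Defensive Collection Loop
import time

def collect_samples(n: int, max_retries: int = 3, pause_s: float = 1.0) -> list[dict]:
    # Defensive wrapper around generate_sample(): retries transient API errors
    # and drops samples that still fail after max_retries attempts.
    samples = []
    for _ in range(n):
        for attempt in range(max_retries):
            try:
                samples.append(generate_sample())
                break
            except Exception:
                # Back off briefly before retrying the teacher call.
                time.sleep(pause_s * (attempt + 1))
    return samples

dataset = collect_samples(10_000)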
Step 2: The Critic Loop
Not all synthetic data is good. A "Critic Agent" must verify the code actually runs and the reasoning is sound.
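Here is a minimal sketch of such a Critic Agent, assuming the `model` client and the {"problem", "cot", "code"} schema from Step 1. The execution check should run inside a sandbox in practice, and the 8/10 score threshold is an arbitrary illustration.
Python: Critic Agent Filter
# Critic Agent: keep only samples whose code executes cleanly and whose
# reasoning passes a teacher-graded quality bar.
import json
import subprocess
import sys

CRITIC_PROMPT = """
You are a strict code reviewer. Grade the following Chain of Thought for
correctness and clarity on a 1-10 scale. Respond as JSON: {"score": int}.
"""

def code_runs(code: str, timeout_s: int = 10) -> bool:
    # Execute the candidate solution in a separate interpreter process.
    # NOTE: generated code is untrusted; run this inside a sandbox in practice.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        timeout=timeout_s,
    )
    return result.returncode == 0

def reasoning_score(sample: dict) -> int:
    # Reuse the same Gemini client as the generator to grade the reasoning.
    response = model.generate_content(
        f"{CRITIC_PROMPT}\n\nProblem: {sample['problem']}\n\nCoT: {sample['cot']}",
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)["score"]

def filter_dataset(dataset: list[dict], min_score: int = 8) -> list[dict]:
    kept = []
    for sample in dataset:
        try:
            if code_runs(sample["code"]) and reasoning_score(sample) >= min_score:
                kept.append(sample)
        except Exception:
            # Discard anything that times out, crashes, or fails to parse.
            continue
    return kept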
Benchmark: General vs. Specialized
| Model | Parameters | General MMLU | Python Coding (HumanEval) | SQL Generation (Spider) |
|---|---|---|---|---|
| Llama 3 (Base) | 8B | 68% | 55% | 40% |
| GPT-4o (Teacher) | ~1.8T | 88% | 92% | 85% |
| Llama 3 (Distilled) | 8B | 65% | 89% | 82% |
Conclusion
Synthetic data distillation breaks the link between model size and task-specific capability. We are entering the era of "Micro-Expert" models that run on-device but think like giants within their domain.

