Dataset Generation¶
DeepFabric generates four types of synthetic training datasets, each designed for different model capabilities.
Dataset Types¶
| Type | Purpose | Use Case |
|---|---|---|
| Basic | Simple Q&A pairs | General instruction following |
| Reasoning | Chain-of-thought traces | Step-by-step problem solving |
| Agent (Single-Turn) | Tool calls in one response | Simple tool use |
| Agent (Multi-Turn) | Extended tool conversations | Complex multi-step tasks |
Generation Pipeline¶
All dataset types follow the same three-stage pipeline:
- Topic Generation - Creates a tree or graph of subtopics from your root prompt
- Sample Generation - Produces training examples for each topic
- Output - Saves to JSONL with optional HuggingFace upload
Quick Comparison¶
# Basic: Simple Q&A
conversation:
type: basic
# Reasoning: Chain-of-thought
conversation:
type: chain_of_thought
reasoning_style: freetext
# Agent Single-Turn: One-shot tool use
conversation:
type: chain_of_thought
reasoning_style: agent
agent_mode: single_turn
# Agent Multi-Turn: Extended tool conversations
conversation:
type: chain_of_thought
reasoning_style: agent
agent_mode: multi_turn
Choosing a Dataset Type¶
Basic datasets work for general instruction-following tasks. The model learns to answer questions directly without explicit reasoning.
Reasoning datasets teach models to think before answering. The output includes a reasoning field with the model's thought process, useful for training models that explain their logic.
Agent datasets train tool-calling capabilities. Single-turn generates complete tool workflows in one response. Multi-turn creates extended conversations with multiple tool calls and observations, following a ReAct-style pattern.
Next Steps¶
- Basic Datasets - Simple Q&A generation
- Reasoning Datasets - Chain-of-thought training data
- Agent Datasets - Tool-calling datasets
- Configuration Reference - Full YAML options