Dataset Generation
DeepFabric generates three types of synthetic training datasets, each designed for a different model capability.
Dataset Types
- Basic - Simple Q&A pairs for general instruction following
- Reasoning - Chain-of-thought traces for step-by-step problem solving
- Agent - Tool-calling datasets with ReAct-style reasoning
Generation Pipeline
All dataset types follow the same three-stage pipeline:
graph LR
A[Topic Generation] --> B[Sample Generation] --> C[Output]
A --> |"Tree or Graph"| A1[Subtopics]
B --> |"Per Topic"| B1[Training Examples]
C --> |"JSONL"| C1[HuggingFace Upload]
- Topic Generation - Creates a tree or graph of subtopics from your root prompt
- Sample Generation - Produces training examples for each topic
- Output - Saves to JSONL with optional HuggingFace upload
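The three stages above can be sketched in plain Python. This is a minimal illustration of the flow, not DeepFabric's API: the function names (`generate_topics`, `generate_samples`, `write_jsonl`, `run_pipeline`) and the record fields are hypothetical, the topic tree is only one level deep, and the model call is replaced by a placeholder string.

```python
import json
import tempfile
from pathlib import Path


def generate_topics(root_prompt: str, subtopics: list[str]) -> list[str]:
    """Stage 1 (hypothetical): expand a root prompt into a one-level topic tree."""
    return [f"{root_prompt} -> {sub}" for sub in subtopics]


def generate_samples(topics: list[str]) -> list[dict]:
    """Stage 2 (hypothetical): produce one training example per topic.

    A real generator would call an LLM here; this stub returns placeholder text.
    """
    return [
        {"topic": topic, "question": f"Explain: {topic}", "answer": "<model output>"}
        for topic in topics
    ]


def write_jsonl(samples: list[dict], path: Path) -> int:
    """Stage 3: write one JSON object per line and return the line count."""
    with path.open("w", encoding="utf-8") as handle:
        for sample in samples:
            handle.write(json.dumps(sample) + "\n")
    return len(samples)


def run_pipeline(root_prompt: str, subtopics: list[str], path: Path) -> list[dict]:
    """Chain the three stages: topics, then samples, then JSONL output."""
    topics = generate_topics(root_prompt, subtopics)
    samples = generate_samples(topics)
    write_jsonl(samples, path)
    return samples


if __name__ == "__main__":
    out_path = Path(tempfile.mkdtemp()) / "dataset.jsonl"
    run_pipeline("Python basics", ["lists", "dicts"], out_path)
    print(out_path.read_text(encoding="utf-8"))
```

The optional HuggingFace upload is left out of the sketch; it would operate on the JSONL file written in the last stage.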
Quick Comparison
Unlike Basic datasets, Reasoning datasets include step-by-step reasoning traces.
Choosing a Dataset Type
Quick Selection Guide
- Basic: General instruction following without explicit reasoning
- Reasoning: Models that need to explain their logic
- Agent: Models that need tool-calling capabilities
Basic datasets work for general instruction-following tasks. The model learns to answer questions directly without explicit reasoning.
Reasoning datasets teach models to think before answering. The output includes a reasoning field with the model's thought process, useful for training models that explain their logic.
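To make the difference concrete, here is a hedged sketch of what a Basic record and a Reasoning record might look like side by side. Only the presence of a `reasoning` field comes from the description above; the `question` and `answer` field names, the sample content, and the `has_reasoning` helper are assumptions for illustration and may not match the actual output schema.

```python
import json

# Hypothetical Basic record: a direct question/answer pair.
basic_sample = {
    "question": "What is 12 * 15?",
    "answer": "180",
}

# Hypothetical Reasoning record: same pair plus a `reasoning` field
# holding the thought process that leads to the answer.
reasoning_sample = {
    "question": "What is 12 * 15?",
    "reasoning": "12 * 15 = 12 * 10 + 12 * 5 = 120 + 60 = 180.",
    "answer": "180",
}


def has_reasoning(sample: dict) -> bool:
    """Return True when the record carries a non-empty `reasoning` field."""
    return bool(str(sample.get("reasoning", "")).strip())


if __name__ == "__main__":
    for sample in (basic_sample, reasoning_sample):
        print(json.dumps(sample), "-> has reasoning:", has_reasoning(sample))
```

A check like `has_reasoning` can serve as a quick sanity filter over a generated JSONL file before training.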
Agent datasets train tool-calling capabilities. When tools are configured, agent mode is automatically enabled and generates complete tool workflows with ReAct-style reasoning.
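The selection guidance above can be condensed into a small decision helper. This is only a sketch: `choose_dataset_type` and its parameters are hypothetical and not part of DeepFabric. It mirrors the rule that configuring tools implies agent mode, and otherwise falls back to reasoning or basic.

```python
from typing import Optional, Sequence


def choose_dataset_type(
    tools: Optional[Sequence[str]] = None,
    needs_reasoning: bool = False,
) -> str:
    """Pick a dataset type following the selection guide (hypothetical helper).

    - Any configured tool implies an agent dataset.
    - Otherwise, a model that must explain its logic gets a reasoning dataset.
    - Otherwise, a basic dataset is enough.
    """
    if tools:
        return "agent"
    if needs_reasoning:
        return "reasoning"
    return "basic"


if __name__ == "__main__":
    print(choose_dataset_type())
    print(choose_dataset_type(needs_reasoning=True))
    print(choose_dataset_type(tools=["search"], needs_reasoning=True))
```

Tools take precedence in this sketch because agent datasets already include ReAct-style reasoning alongside the tool calls.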
Next Steps
- Basic Datasets - Simple Q&A generation
- Reasoning Datasets - Chain-of-thought training data
- Agent Datasets - Tool-calling datasets
- Configuration Reference - Full YAML options