Dataset Generation

DeepFabric generates three types of synthetic training datasets, each designed to train a different model capability.

Dataset Types

Basic - Simple Q&A pairs for general instruction following
Reasoning - Chain-of-thought traces for step-by-step problem solving
Agent - Tool-calling datasets with ReAct-style reasoning

Generation Pipeline

All dataset types follow the same three-stage pipeline:

graph LR
    A[Topic Generation] --> B[Sample Generation] --> C[Output]
    A --> |"Tree or Graph"| A1[Subtopics]
    B --> |"Per Topic"| B1[Training Examples]
    C --> |"JSONL"| C1[HuggingFace Upload]

Topic Generation - Creates a tree or graph of subtopics from your root prompt
Sample Generation - Produces training examples for each topic
Output - Saves to JSONL with optional HuggingFace upload
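
The three stages above can be sketched in plain Python to show how data flows from topics to a JSONL file. This is an illustrative sketch only: `generate_topics`, `generate_samples`, `write_jsonl`, and `run_pipeline` are hypothetical names, not DeepFabric's API, and the placeholder functions return canned strings where a real run would call an LLM. The HuggingFace upload step is omitted.

```python
import json
import tempfile
from pathlib import Path


def generate_topics(root_prompt: str, branching: int = 2, depth: int = 2) -> list[str]:
    """Stage 1 (placeholder): expand a root prompt into the leaf paths of a topic tree."""
    paths = [root_prompt]
    for level in range(depth):
        # Every existing path gets `branching` children at each level.
        paths = [
            f"{path} > subtopic {level + 1}.{i + 1}"
            for path in paths
            for i in range(branching)
        ]
    return paths


def generate_samples(topic: str, per_topic: int = 1) -> list[dict]:
    """Stage 2 (placeholder): build records per topic; a real run would call an LLM here."""
    return [
        {"topic": topic, "question": f"Question {n + 1} about {topic}", "answer": "..."}
        for n in range(per_topic)
    ]


def write_jsonl(samples: list[dict], path: Path) -> int:
    """Stage 3: write one JSON object per line and return the number of lines written."""
    with path.open("w", encoding="utf-8") as fh:
        for sample in samples:
            fh.write(json.dumps(sample, ensure_ascii=False) + "\n")
    return len(samples)


def run_pipeline(root_prompt: str, out_path: Path) -> int:
    """Chain the three stages: topics, then samples per topic, then JSONL output."""
    topics = generate_topics(root_prompt)
    samples = [sample for topic in topics for sample in generate_samples(topic)]
    return write_jsonl(samples, out_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = run_pipeline("Python programming", Path(tmp) / "dataset.jsonl")
        print(f"wrote {written} samples")
```

The number of samples grows with the number of leaf topics, so the shape of the topic tree or graph largely determines the size of the final dataset.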

Quick Comparison

Basic

config.yaml

conversation:
  type: basic

Simple Q&A without explicit reasoning.

Reasoning

config.yaml

conversation:
  type: cot
  reasoning_style: freetext

Includes step-by-step reasoning traces.

Agent

config.yaml

# Agent mode is implicit when tools are configured
conversation:
  type: cot
  reasoning_style: agent

Tool-calling with ReAct-style reasoning.
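
When configs are generated from a script, the three `conversation` blocks above can be kept in one lookup table. The sketch below is a hypothetical helper, not part of DeepFabric: `CONVERSATION_BLOCKS`, `conversation_block`, and `to_yaml` are names invented for illustration, and only the key/value pairs are taken from the snippets above.

```python
# Conversation settings copied from the three config.yaml snippets above.
CONVERSATION_BLOCKS = {
    "basic": {"type": "basic"},
    "reasoning": {"type": "cot", "reasoning_style": "freetext"},
    "agent": {"type": "cot", "reasoning_style": "agent"},
}


def conversation_block(dataset_type: str) -> dict:
    """Return a copy of the conversation settings for a dataset type (hypothetical helper)."""
    key = dataset_type.lower()
    if key not in CONVERSATION_BLOCKS:
        valid = ", ".join(sorted(CONVERSATION_BLOCKS))
        raise ValueError(f"unknown dataset type {dataset_type!r}; expected one of: {valid}")
    return dict(CONVERSATION_BLOCKS[key])


def to_yaml(dataset_type: str) -> str:
    """Render the block as YAML text; enough for this flat mapping, no YAML library needed."""
    lines = ["conversation:"]
    lines += [f"  {key}: {value}" for key, value in conversation_block(dataset_type).items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_yaml("reasoning"))
```

Note that the Reasoning and Agent blocks share `type: cot` and differ only in `reasoning_style`.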

Choosing a Dataset Type

Quick Selection Guide

Basic: Models that follow general instructions without explicit reasoning
Reasoning: Models that need to explain their logic
Agent: Models that need tool-calling capabilities

Basic datasets work for general instruction-following tasks. The model learns to answer questions directly without explicit reasoning.

Reasoning datasets teach models to think before answering. Each output sample includes a reasoning field containing the model's thought process, which is useful for training models that explain their logic.

Agent datasets train tool-calling capabilities. When tools are configured, agent mode is enabled automatically, and DeepFabric generates complete tool workflows with ReAct-style reasoning.
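
A quick way to confirm which kind of samples a JSONL file contains is to check each record for the reasoning field. The sketch below assumes the field is a top-level key named `reasoning`; the exact output schema is not shown here, so treat both the key name and the sample records as assumptions. `split_by_reasoning` is a hypothetical helper.

```python
import json
from typing import Iterable


def split_by_reasoning(
    lines: Iterable[str], field: str = "reasoning"
) -> tuple[list[dict], list[dict]]:
    """Partition JSONL lines into records with and without a non-empty reasoning field."""
    with_field: list[dict] = []
    without_field: list[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # `field` is an assumed key name; adjust it to match the actual output.
        if record.get(field):
            with_field.append(record)
        else:
            without_field.append(record)
    return with_field, without_field


if __name__ == "__main__":
    sample_lines = [
        '{"question": "What is 2 + 2?", "answer": "4"}',
        '{"question": "What is 2 + 2?", "reasoning": "Add 2 and 2.", "answer": "4"}',
    ]
    reasoning_rows, basic_rows = split_by_reasoning(sample_lines)
    print(len(reasoning_rows), len(basic_rows))
```

Passing an open file handle works as well, since the function only iterates over lines.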

Next Steps

Basic Datasets - Simple Q&A generation
Reasoning Datasets - Chain-of-thought training data
Agent Datasets - Tool-calling datasets
Configuration Reference - Full YAML options