Dataset Generation

DeepFabric generates three types of synthetic training datasets, each targeting a different model capability.

Dataset Types

  • Basic


    Simple Q&A pairs for general instruction following


  • Reasoning


    Chain-of-thought traces for step-by-step problem solving


  • Agent


    Tool-calling datasets with ReAct-style reasoning


Generation Pipeline

All dataset types follow the same three-stage pipeline:

graph LR
    A[Topic Generation] --> B[Sample Generation] --> C[Output]
    A --> |"Tree or Graph"| A1[Subtopics]
    B --> |"Per Topic"| B1[Training Examples]
    C --> |"JSONL"| C1[HuggingFace Upload]
  1. Topic Generation - Creates a tree or graph of subtopics from your root prompt
  2. Sample Generation - Produces training examples for each topic
  3. Output - Saves to JSONL with optional HuggingFace upload
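
The three stages can be sketched in plain Python. This is a minimal illustration with stubbed generators, not DeepFabric's API: `expand_topics`, `generate_sample`, `write_jsonl`, and `run_pipeline` are hypothetical names, and a real run would call an LLM where the stubs return fixed strings.

```python
import json
from pathlib import Path


def expand_topics(root: str, depth: int, degree: int) -> list[list[str]]:
    """Stage 1: build a topic tree and return every root-to-leaf path.

    Stub: a real generator would ask an LLM for the subtopics of each node.
    """
    paths = [[root]]
    for _ in range(depth):
        paths = [
            path + [f"{path[-1]} / subtopic {i + 1}"]
            for path in paths
            for i in range(degree)
        ]
    return paths


def generate_sample(topic_path: list[str]) -> dict:
    """Stage 2: produce one training example for a topic path (stubbed)."""
    topic = topic_path[-1]
    return {
        "topic": topic,
        "question": f"Explain: {topic}",
        "answer": f"(model-generated answer about {topic})",
    }


def write_jsonl(samples: list[dict], path: Path) -> None:
    """Stage 3: save one JSON object per line."""
    with path.open("w", encoding="utf-8") as fh:
        for sample in samples:
            fh.write(json.dumps(sample, ensure_ascii=False) + "\n")


def run_pipeline(root: str, out: Path, depth: int = 2, degree: int = 2) -> int:
    """Run all three stages and return the number of samples written."""
    paths = expand_topics(root, depth, degree)
    samples = [generate_sample(p) for p in paths]
    write_jsonl(samples, out)
    return len(samples)
```

With these stubs, the number of samples equals the number of leaf paths in the tree, which is `degree ** depth`. The optional HuggingFace upload is left out of the sketch.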

Quick Comparison

Basic

config.yaml
conversation:
  type: basic

Simple Q&A without explicit reasoning.

Reasoning

config.yaml
conversation:
  type: cot
  reasoning_style: freetext

Includes step-by-step reasoning traces.

Agent

config.yaml
# Agent mode is implicit when tools are configured
conversation:
  type: cot
  reasoning_style: agent

Tool-calling with ReAct-style reasoning.

Choosing a Dataset Type

Quick Selection Guide

  • Basic: General instruction following without explicit reasoning
  • Reasoning: Step-by-step traces for models that need to explain their logic
  • Agent: Tool-calling workflows for models that use tools

Basic datasets work for general instruction-following tasks. The model learns to answer questions directly without explicit reasoning.

Reasoning datasets teach models to think before answering. Each sample includes a reasoning field containing the model's thought process, which is useful for training models that explain their logic.
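
As a rough illustration of what such a record might look like, the sketch below builds one sample and serializes it as a JSONL line. Only the `reasoning` field is named in the text above; `question`, `final_answer`, and the helper `make_reasoning_sample` are assumptions for illustration, not DeepFabric's actual schema.

```python
import json


def make_reasoning_sample(question: str, steps: list[str], answer: str) -> dict:
    """Bundle a question, its reasoning trace, and the final answer."""
    return {
        "question": question,
        # Free-text trace: the individual steps joined into one passage.
        "reasoning": " ".join(steps),
        "final_answer": answer,
    }


sample = make_reasoning_sample(
    question="What is 15% of 80?",
    steps=[
        "15% is 0.15 as a decimal.",
        "0.15 * 80 = 12.",
    ],
    answer="12",
)

# One JSON object per line is what makes the output file JSONL.
line = json.dumps(sample)
```

A basic sample would carry the question and answer without the trace.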

Agent datasets train tool-calling capabilities. When tools are configured, agent mode is enabled automatically, and the generated samples contain complete tool workflows with ReAct-style reasoning.
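
The selection rule can be sketched as a small function over a parsed config. This is an illustration under stated assumptions, not DeepFabric's implementation: the top-level `tools` key and the `resolve_dataset_type` helper are hypothetical, while the `conversation.type` and `reasoning_style` values come from the config snippets under Quick Comparison.

```python
def resolve_dataset_type(config: dict) -> str:
    """Return 'agent', 'reasoning', or 'basic' for a parsed config dict."""
    conversation = config.get("conversation", {})

    # Hypothetical `tools` key: any non-empty tool configuration implies
    # agent mode, as does an explicit agent reasoning style.
    if config.get("tools") or conversation.get("reasoning_style") == "agent":
        return "agent"

    # Chain-of-thought conversations without tools are reasoning datasets.
    if conversation.get("type") == "cot":
        return "reasoning"

    return "basic"
```

For example, `resolve_dataset_type({"conversation": {"type": "cot", "reasoning_style": "freetext"}})` falls through the agent check and returns `"reasoning"`.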

Next Steps