Configuration Reference

DeepFabric uses YAML configuration with four main sections: llm, topics, generation, and output.

Complete Example

config.yaml
# Shared LLM defaults (optional)
llm:
  provider: "openai"
  model: "gpt-4o"
  temperature: 0.7

# Topic generation
topics:
  prompt: "Python programming fundamentals"
  mode: graph             # tree | graph
  prompt_style: anchored  # default | isolated | anchored (graph mode only)
  depth: 2
  degree: 3
  save_as: "topics.json"
  llm:                    # Override shared LLM
    model: "gpt-4o-mini"

# Sample generation
generation:
  system_prompt: |
    Generate clear, educational examples.
  instructions: "Create diverse, practical scenarios."

  # Agent mode is implicit when tools are configured
  conversation:
    type: cot
    reasoning_style: agent

  tools:
    spin_endpoint: "http://localhost:3000"
    components:
      builtin:
        - read_file
        - write_file

  max_retries: 3
  llm:
    temperature: 0.5

# Output configuration
output:
  system_prompt: |
    You are a helpful assistant with tool access.
  include_system_message: true
  num_samples: 4
  batch_size: 2
  save_as: "dataset.jsonl"

  # Optional: Checkpoint for resumable generation
  checkpoint:
    interval: 500       # Save every 500 samples
    retry_failed: false

# Optional: Upload to HuggingFace
huggingface:
  repository: "org/dataset-name"
  tags: ["python", "agents"]

HuggingFace Upload

The huggingface section is optional; when present, the dataset is uploaded after generation completes. Uploading requires an access token exported as HF_TOKEN or prior authentication via huggingface-cli login.

Section Reference

llm (Optional)

Shared LLM defaults inherited by topics and generation.

| Field | Type | Description |
| --- | --- | --- |
| provider | string | LLM provider: openai, anthropic, gemini, ollama |
| model | string | Model name |
| temperature | float | Sampling temperature (0.0-2.0) |
| base_url | string | Custom API endpoint |
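
For local providers, base_url points DeepFabric at your own endpoint. A minimal sketch, assuming a default local Ollama install (the model name and port are illustrative, not prescribed by DeepFabric):

Local Ollama example
llm:
  provider: "ollama"
  model: "llama3.1"                    # illustrative; use any locally pulled model
  base_url: "http://localhost:11434"   # assumed default Ollama endpoint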

topics

Controls topic tree/graph generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| prompt | string | required | Root topic for generation |
| mode | string | "tree" | Generation mode: tree or graph |
| depth | int | 2 | Hierarchy depth (1-10) |
| degree | int | 3 | Subtopics per node (1-50) |
| max_concurrent | int | 4 | Max concurrent LLM calls (graph mode only, 1-20) |
| prompt_style | string | "default" | Graph expansion prompt style (graph mode only, see below) |
| system_prompt | string | "" | Custom instructions for topic LLM |
| save_as | string | - | Path to save topics JSONL |
| llm | object | - | Override shared LLM settings |
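
The complete example above uses graph mode; tree mode takes the same fields minus the graph-only ones. A minimal sketch (the prompt value is illustrative):

Tree mode example
topics:
  prompt: "Rust ownership and borrowing"   # illustrative root topic
  mode: tree
  depth: 3
  degree: 4
  save_as: "topics.jsonl"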

topics.prompt_style (Graph Mode Only)

Controls how subtopics are generated during graph expansion:

| Style | Cross-connections | Examples | Use Case |
| --- | --- | --- | --- |
| default | Yes | Generic | General-purpose topic graphs with cross-links |
| isolated | No | Generic | Independent subtopics without cross-connections |
| anchored | No | Domain-aware | Focused generation with domain-specific examples |

anchored is recommended for specialized domains (for example, security or other technical areas) where subtopics should stay tightly focused on the parent topic. It automatically detects the domain from your system_prompt and supplies domain-specific examples to guide generation.

Example: Security-focused topic generation
topics:
  prompt: "Credential access attack scenarios"
  mode: graph
  prompt_style: anchored   # Uses security-domain examples
  depth: 3
  degree: 8
  system_prompt: |
    Generate adversarial security test cases for AI assistant hardening.

generation

Controls sample generation.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| system_prompt | string | - | Instructions for generation LLM |
| instructions | string | - | Additional guidance |
| conversation | object | - | Conversation type settings |
| tools | object | - | Tool configuration |
| max_retries | int | 3 | Retries on API failures |
| sample_retries | int | 2 | Retries on validation failures |
| max_tokens | int | 2000 | Max tokens per generation |
| llm | object | - | Override shared LLM settings |
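
The retry and token limits are independent knobs. A sketch setting them explicitly to their documented defaults:

Retry and token limits
generation:
  max_retries: 3        # retries on API failures
  sample_retries: 2     # retries on validation failures
  max_tokens: 2000      # cap on tokens per generation call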

generation.conversation

| Field | Type | Options | Description |
| --- | --- | --- | --- |
| type | string | basic, cot | Conversation format |
| reasoning_style | string | freetext, agent | For cot only |
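
The complete example pairs cot with the agent reasoning style; for plain chain-of-thought prose, freetext is the other documented option:

Free-text chain of thought
generation:
  conversation:
    type: cot
    reasoning_style: freetext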

Agent Mode

Agent mode is automatically enabled when tools are configured. No explicit agent_mode setting is required.
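In other words, adding a tools block is what switches generation into agent mode; no flag accompanies it:

Implicit agent mode
generation:
  conversation:
    type: cot
    reasoning_style: agent
  tools:                  # configuring tools implicitly enables agent mode
    components:
      builtin:
        - read_file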

generation.tools

| Field | Type | Description |
| --- | --- | --- |
| spin_endpoint | string | Spin service URL |
| tools_endpoint | string | Endpoint to load tool definitions (for non-builtin components) |
| components | object | Map of component name to tool names (see below) |
| custom | list | Inline tool definitions |
| max_per_query | int | Max tools per sample |
| max_agent_steps | int | Max ReAct iterations |
| scenario_seed | object | Initial file state |

components

The components field maps component names to lists of tool names. Each component routes to /{component}/execute, with one exception: the builtin component routes to /vfs/execute:

Component routing
components:
  builtin:              # Routes to /vfs/execute (built-in tools)
    - read_file
    - write_file
  mock:                 # Routes to /mock/execute
    - get_weather
  github:               # Routes to /github/execute
    - list_issues

Component Types

  • builtin: Uses built-in VFS tools (read_file, write_file, list_files, delete_file)
  • Other components: Load tool definitions from tools_endpoint
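
For non-builtin components, tool definitions are fetched from tools_endpoint. A sketch, assuming the same local Spin service as the complete example (the /tools path is an assumption about that endpoint's URL shape):

Loading external tool definitions
generation:
  tools:
    spin_endpoint: "http://localhost:3000"
    tools_endpoint: "http://localhost:3000/tools"   # assumed URL serving tool definitions
    components:
      github:             # non-builtin: definitions come from tools_endpoint
        - list_issues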

output

Controls final dataset.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| system_prompt | string | - | System message in training data |
| include_system_message | bool | true | Include system message |
| num_samples | int \| string | required | Total samples: integer, "auto", or percentage like "50%" |
| batch_size | int | 1 | Parallel generation concurrency (number of simultaneous LLM calls) |
| save_as | string | required | Output file path |
| checkpoint | object | - | Checkpoint configuration (see below) |

Auto and Percentage Samples

num_samples supports special values:

  • Integer (e.g., 100): Generate exactly this many samples
  • "auto": Generate one sample per unique topic (100% coverage)
  • Percentage (e.g., "50%", "200%"): Generate samples relative to unique topic count
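
For instance, full topic coverage can be requested directly:

Auto sample count
output:
  num_samples: "auto"    # one sample per unique topic
  save_as: "dataset.jsonl"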

When num_samples exceeds the number of unique topics, DeepFabric iterates through multiple cycles. Each cycle processes all unique topics once. For example, with 50 unique topics and num_samples: 120:

  • Cycle 1: 50 samples (topics 1-50)
  • Cycle 2: 50 samples (topics 1-50)
  • Cycle 3: 20 samples (topics 1-20, partial)

output.checkpoint (Optional)

Configuration for checkpoint-based resume capability. Checkpoints allow pausing and resuming long-running dataset generation without losing progress.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| interval | int | required | Save checkpoint every N samples |
| path | string | XDG data dir | Directory to store checkpoint files (auto-managed) |
| retry_failed | bool | false | When resuming, retry previously failed samples |

Checkpoint configuration
output:
  save_as: "dataset.jsonl"
  num_samples: 5000
  batch_size: 5
  checkpoint:
    interval: 500     # Save every 500 samples
    retry_failed: false

Checkpointing creates three files in the checkpoint directory:

  • {name}.checkpoint.json - Metadata (progress, IDs processed)
  • {name}.checkpoint.jsonl - Samples saved so far
  • {name}.checkpoint.failures.jsonl - Failed samples for debugging

Choosing Checkpoint Interval

The interval field controls how many samples are generated between checkpoint saves. Choose a value that balances recovery granularity (smaller intervals mean less work lost on failure) against I/O overhead (larger intervals mean fewer disk writes). For example, interval: 100 saves progress every 100 samples.

Cycle-Based Generation Model

DeepFabric uses a cycle-based generation model:

  • Unique topics: Deduplicated topics from your tree/graph (by UUID)
  • Cycles: Number of times to iterate through all unique topics
  • Concurrency: Maximum parallel LLM calls (batch_size)

For example, with 100 unique topics and num_samples: 250:

  • Cycles needed: 3 (ceil(250/100))
  • Cycle 1: 100 samples, Cycle 2: 100 samples, Cycle 3: 50 samples (partial)

Checkpoints track progress as (topic_uuid, cycle) tuples, enabling precise resume from any point.

Memory Optimization

When checkpointing is enabled, samples are flushed to disk periodically, keeping memory usage constant regardless of dataset size.

huggingface (Optional)

| Field | Type | Description |
| --- | --- | --- |
| repository | string | HuggingFace repo (user/name) |
| tags | list | Dataset tags |
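
A complete sketch (the repository id is illustrative; HF_TOKEN must be available in the environment as described above):

HuggingFace upload
huggingface:
  repository: "your-org/your-dataset"   # illustrative user/name repo id
  tags: ["python", "agents"]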

CLI Overrides

Most config options can be overridden via CLI:

CLI overrides
deepfabric generate config.yaml \
  --provider anthropic \
  --model claude-3-5-sonnet-20241022 \
  --num-samples 100 \
  --batch-size 10 \
  --temperature 0.5

Full Options

Run deepfabric generate --help for all available options.