Configuration Reference¶
DeepFabric uses YAML configuration with four main sections: `llm`, `topics`, `generation`, and `output`.
Complete Example¶
```yaml
# Shared LLM defaults (optional)
llm:
  provider: "openai"
  model: "gpt-4o"
  temperature: 0.7

# Topic generation
topics:
  prompt: "Python programming fundamentals"
  mode: graph                # tree | graph
  prompt_style: anchored     # default | isolated | anchored (graph mode only)
  depth: 2
  degree: 3
  save_as: "topics.json"
  llm:                       # Override shared LLM
    model: "gpt-4o-mini"

# Sample generation
generation:
  system_prompt: |
    Generate clear, educational examples.
  instructions: "Create diverse, practical scenarios."
  # Agent mode is implicit when tools are configured
  conversation:
    type: cot
    reasoning_style: agent
  tools:
    spin_endpoint: "http://localhost:3000"
    components:
      builtin:
        - read_file
        - write_file
  max_retries: 3
  llm:
    temperature: 0.5

# Output configuration
output:
  system_prompt: |
    You are a helpful assistant with tool access.
  include_system_message: true
  num_samples: 4
  batch_size: 2
  save_as: "dataset.jsonl"

  # Optional: Checkpoint for resumable generation
  checkpoint:
    interval: 500        # Save every 500 samples
    retry_failed: false

# Optional: Upload to HuggingFace
huggingface:
  repository: "org/dataset-name"
  tags: ["python", "agents"]
```
HuggingFace Upload
The `huggingface` section is optional; when present, the dataset is uploaded after generation. It requires a token exported as `HF_TOKEN` or prior authentication via `huggingface-cli login`.
Section Reference¶
llm (Optional)¶
Shared LLM defaults inherited by topics and generation.
| Field | Type | Description |
|---|---|---|
| `provider` | string | LLM provider: `openai`, `anthropic`, `gemini`, `ollama` |
| `model` | string | Model name |
| `temperature` | float | Sampling temperature (0.0-2.0) |
| `base_url` | string | Custom API endpoint |
topics¶
Controls topic tree/graph generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Root topic for generation |
| `mode` | string | `"graph"` | Generation mode: `tree` or `graph` |
| `depth` | int | 2 | Hierarchy depth (1-10) |
| `degree` | int | 3 | Subtopics per node (1-50) |
| `max_concurrent` | int | 4 | Max concurrent LLM calls (graph mode only, 1-20) |
| `prompt_style` | string | `"default"` | Graph expansion prompt style (graph mode only, see below) |
| `system_prompt` | string | `""` | Custom instructions for the topic LLM |
| `max_tokens` | int | 1000 | Max tokens per topic-generation LLM call |
| `save_as` | string | - | Path to save topics JSONL |
| `llm` | object | - | Override shared LLM settings |
topics.prompt_style (Graph Mode Only)¶
Controls how subtopics are generated during graph expansion:
| Style | Cross-connections | Examples | Use Case |
|---|---|---|---|
| `default` | Yes | Generic | General-purpose topic graphs with cross-links |
| `isolated` | No | Generic | Independent subtopics without cross-connections |
| `anchored` | No | Domain-aware | Focused generation with domain-specific examples |
`anchored` is recommended for specialized domains (e.g. security, technical) where subtopics should stay tightly focused on the parent topic. It automatically detects the domain from your `system_prompt` and supplies relevant examples to guide generation.
```yaml
topics:
  prompt: "Credential access attack scenarios"
  mode: graph
  prompt_style: anchored   # Uses security-domain examples
  depth: 3
  degree: 8
  system_prompt: |
    Generate adversarial security test cases for AI assistant hardening.
```
Failed Generation Tracking¶
When topic generation encounters failures, sidecar files are created automatically:
- **Tree mode**: `{save_as}_failed.jsonl` alongside the JSONL output
- **Graph mode**: `{save_as}_failed.jsonl` alongside the JSON output
These files contain error details for each failed generation attempt, useful for debugging provider issues or prompt problems.
Truncation Detection
DeepFabric detects truncated LLM responses ("EOF while parsing" errors) and suggests increasing max_tokens in the topic configuration.
generation¶
Controls sample generation.
| Field | Type | Default | Description |
|---|---|---|---|
| `system_prompt` | string | - | Instructions for the generation LLM |
| `instructions` | string | - | Additional guidance |
| `conversation` | object | - | Conversation type settings |
| `tools` | object | - | Tool configuration |
| `max_retries` | int | 3 | Retries on API failures |
| `sample_retries` | int | 2 | Retries on validation failures |
| `max_tokens` | int | 2000 | Max tokens per generation |
| `llm` | object | - | Override shared LLM settings |
generation.conversation¶
| Field | Type | Options | Description |
|---|---|---|---|
| `type` | string | `basic`, `cot` | Conversation format |
| `reasoning_style` | string | `freetext`, `agent` | For `cot` only |
Agent Mode
Agent mode is automatically enabled when tools are configured. No explicit agent_mode setting is required.
generation.tools¶
| Field | Type | Description |
|---|---|---|
| `spin_endpoint` | string | Spin service URL |
| `tools_endpoint` | string | Endpoint to load tool definitions (for non-builtin components) |
| `components` | object | Map of component name to tool names (see below) |
| `custom` | list | Inline tool definitions |
| `max_per_query` | int | Max tools per sample |
| `max_agent_steps` | int | Max ReAct iterations |
| `scenario_seed` | object | Initial file state |
components¶
The `components` field maps component names to lists of tool names. Each component routes to `/{component}/execute`, with `builtin` special-cased to route to `/vfs/execute`:
```yaml
components:
  builtin:          # Routes to /vfs/execute (built-in tools)
    - read_file
    - write_file
  mock:             # Routes to /mock/execute
    - get_weather
  github:           # Routes to /github/execute
    - list_issues
```
Component Types
- `builtin`: Uses built-in VFS tools (`read_file`, `write_file`, `list_files`, `delete_file`)
- Other components: Load tool definitions from `tools_endpoint`
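The routing rule above can be sketched in a few lines. This is an illustrative helper, not DeepFabric source code; the only assumption is the special case stated above, that `builtin` maps to the `vfs` component:

```python
def execute_endpoint(component: str) -> str:
    """Return the execute path for a tool component.

    Illustrative sketch: 'builtin' is special-cased to /vfs/execute;
    every other component routes to /{component}/execute.
    """
    name = "vfs" if component == "builtin" else component
    return f"/{name}/execute"
```

For example, `execute_endpoint("github")` yields `/github/execute`, while `execute_endpoint("builtin")` yields `/vfs/execute`.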
output¶
Controls final dataset.
| Field | Type | Default | Description |
|---|---|---|---|
| `system_prompt` | string | - | System message in training data |
| `include_system_message` | bool | `true` | Include the system message |
| `num_samples` | int \| string | required | Total samples: integer, `"auto"`, or a percentage like `"50%"` |
| `batch_size` | int | 1 | Parallel generation concurrency (number of simultaneous LLM calls) |
| `save_as` | string | required | Output file path |
| `checkpoint` | object | - | Checkpoint configuration (see below) |
Auto and Percentage Samples
`num_samples` supports special values:
- Integer (e.g., `100`): generate exactly this many samples
- `"auto"`: generate one sample per unique topic (100% coverage)
- Percentage (e.g., `"50%"`, `"200%"`): generate samples relative to the unique topic count
When num_samples exceeds the number of unique topics, DeepFabric iterates through multiple cycles. Each cycle processes all unique topics once. For example, with 50 unique topics and num_samples: 120:
- Cycle 1: 50 samples (topics 1-50)
- Cycle 2: 50 samples (topics 1-50)
- Cycle 3: 20 samples (topics 1-20, partial)
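The resolution rules and the cycle arithmetic described above can be sketched as follows. This is an illustrative model of the documented behavior, not DeepFabric's implementation:

```python
import math

def resolve_num_samples(value, unique_topics: int) -> int:
    """Resolve a num_samples setting ("auto", "N%", or an int) to a count.

    Sketch of the rules documented above, not DeepFabric source.
    """
    if value == "auto":
        return unique_topics                      # one sample per unique topic
    if isinstance(value, str) and value.endswith("%"):
        return unique_topics * int(value[:-1]) // 100
    return int(value)

def cycle_breakdown(num_samples: int, unique_topics: int) -> list[int]:
    """Split a total sample count into per-cycle counts over all topics."""
    cycles = math.ceil(num_samples / unique_topics)
    return [min(unique_topics, num_samples - i * unique_topics)
            for i in range(cycles)]
```

With 50 unique topics, `cycle_breakdown(120, 50)` returns `[50, 50, 20]`, matching the three cycles in the example above.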
output.checkpoint (Optional)¶
Configuration for checkpoint-based resume capability. Checkpoints allow pausing and resuming long-running dataset generation without losing progress.
| Field | Type | Default | Description |
|---|---|---|---|
| `interval` | int | required | Save a checkpoint every N samples |
| `path` | string | XDG data dir | Directory to store checkpoint files (auto-managed) |
| `retry_failed` | bool | `false` | When resuming, retry previously failed samples |
```yaml
output:
  save_as: "dataset.jsonl"
  num_samples: 5000
  batch_size: 5
  checkpoint:
    interval: 500        # Save every 500 samples
    retry_failed: false
```
Checkpointing creates three files in the checkpoint directory:
- `{name}.checkpoint.json` - Metadata (progress, processed IDs)
- `{name}.checkpoint.jsonl` - Samples saved so far
- `{name}.checkpoint.failures.jsonl` - Failed samples for debugging
Choosing Checkpoint Interval
The checkpoint interval specifies how many samples to generate between saves.
Choose an interval that balances recovery granularity (smaller = less work lost) against I/O overhead (larger = fewer disk writes). For example, interval: 100 saves progress every 100 samples.
Cycle-Based Generation Model
DeepFabric uses a cycle-based generation model:
- **Unique topics**: deduplicated topics from your tree/graph (by UUID)
- **Cycles**: number of times to iterate through all unique topics
- **Concurrency**: maximum parallel LLM calls (`batch_size`)
For example, with 100 unique topics and num_samples: 250:
- Cycles needed: 3 (ceil(250/100))
- Cycle 1: 100 samples, Cycle 2: 100 samples, Cycle 3: 50 samples (partial)
Checkpoints track progress as (topic_uuid, cycle) tuples, enabling precise resume from any point.
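The resume logic implied by these tuples can be sketched as below. This is a simplified illustration under the stated assumptions (topics iterated in a stable order, cycle by cycle), not DeepFabric's actual resume code:

```python
def pending_work(topics, num_samples, completed):
    """Yield (topic_uuid, cycle) pairs not yet recorded in the checkpoint.

    `topics` is the stable-ordered list of unique topic UUIDs;
    `completed` is the set of (topic_uuid, cycle) tuples already done.
    Illustrative sketch only.
    """
    total = 0
    cycle = 0
    while total < num_samples:
        for uid in topics:
            if total >= num_samples:
                break
            total += 1                      # this slot counts toward the total
            if (uid, cycle) not in completed:
                yield (uid, cycle)          # still needs generating
        cycle += 1
```

For example, with topics `["a", "b", "c"]`, `num_samples=5`, and `("a", 0)` already checkpointed, the pending work is `[("b", 0), ("c", 0), ("a", 1), ("b", 1)]`.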
Memory Optimization
When checkpointing is enabled, samples are flushed to disk periodically, keeping memory usage constant regardless of dataset size.
huggingface (Optional)¶
| Field | Type | Description |
|---|---|---|
| `repository` | string | HuggingFace repo (`user/name`) |
| `tags` | list | Dataset tags |
CLI Overrides¶
Most config options can be overridden via CLI:
```bash
deepfabric generate config.yaml \
  --provider anthropic \
  --model claude-3-5-sonnet-20241022 \
  --num-samples 100 \
  --batch-size 10 \
  --temperature 0.5
```
Full Options
Run deepfabric generate --help for all available options.