# Configuration Reference

DeepFabric uses YAML configuration with four main sections: `llm`, `topics`, `generation`, and `output`.
## Complete Example
```yaml
# Shared LLM defaults (optional)
llm:
  provider: "openai"
  model: "gpt-4o"
  temperature: 0.7

# Topic generation
topics:
  prompt: "Python programming fundamentals"
  mode: graph              # tree | graph
  prompt_style: anchored   # default | isolated | anchored (graph mode only)
  depth: 2
  degree: 3
  save_as: "topics.json"
  llm:                     # Override shared LLM
    model: "gpt-4o-mini"

# Sample generation
generation:
  system_prompt: |
    Generate clear, educational examples.
  instructions: "Create diverse, practical scenarios."
  # Agent mode is implicit when tools are configured
  conversation:
    type: cot
    reasoning_style: agent
  tools:
    spin_endpoint: "http://localhost:3000"
    components:
      builtin:
        - read_file
        - write_file
  max_retries: 3
  llm:
    temperature: 0.5

# Output configuration
output:
  system_prompt: |
    You are a helpful assistant with tool access.
  include_system_message: true
  num_samples: 4
  batch_size: 2
  save_as: "dataset.jsonl"

  # Optional: Checkpoint for resumable generation
  checkpoint:
    interval: 500          # Save every 500 samples
    retry_failed: false

# Optional: Upload to HuggingFace
huggingface:
  repository: "org/dataset-name"
  tags: ["python", "agents"]
```
**HuggingFace Upload**

The `huggingface` section is optional and used to upload the dataset after generation. It requires a token exported as `HF_TOKEN` or pre-authentication via `huggingface-cli`.
## Section Reference

### llm (Optional)

Shared LLM defaults inherited by `topics` and `generation`.
| Field | Type | Description |
|---|---|---|
| `provider` | string | LLM provider: `openai`, `anthropic`, `gemini`, `ollama` |
| `model` | string | Model name |
| `temperature` | float | Sampling temperature (0.0-2.0) |
| `base_url` | string | Custom API endpoint |
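For example, a shared block for a local Ollama deployment might look like the sketch below; the model name and `base_url` are illustrative and depend on your environment:

```yaml
llm:
  provider: "ollama"
  model: "llama3"                      # any model your provider serves
  temperature: 0.7
  base_url: "http://localhost:11434"   # custom endpoint; adjust to your deployment
```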
### topics

Controls topic tree/graph generation.

| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Root topic for generation |
| `mode` | string | `"tree"` | Generation mode: `tree` or `graph` |
| `depth` | int | 2 | Hierarchy depth (1-10) |
| `degree` | int | 3 | Subtopics per node (1-50) |
| `max_concurrent` | int | 4 | Max concurrent LLM calls (graph mode only, 1-20) |
| `prompt_style` | string | `"default"` | Graph expansion prompt style (graph mode only, see below) |
| `system_prompt` | string | `""` | Custom instructions for topic LLM |
| `save_as` | string | - | Path to save topics JSONL |
| `llm` | object | - | Override shared LLM settings |
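As a sketch, a minimal tree-mode block using only the fields documented above might look like this (the values are illustrative):

```yaml
topics:
  prompt: "Python programming fundamentals"
  mode: tree            # the default mode
  depth: 3              # three levels of subtopics
  degree: 4             # four subtopics per node
  save_as: "topics.jsonl"
```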
#### topics.prompt_style (Graph Mode Only)

Controls how subtopics are generated during graph expansion:

| Style | Cross-connections | Examples | Use Case |
|---|---|---|---|
| `default` | Yes | Generic | General-purpose topic graphs with cross-links |
| `isolated` | No | Generic | Independent subtopics without cross-connections |
| `anchored` | No | Domain-aware | Focused generation with domain-specific examples |

`anchored` is recommended for specialized domains (security, technical) where you want subtopics to stay tightly focused on the parent topic. It automatically detects the domain from your `system_prompt` and provides relevant examples to guide generation.
```yaml
topics:
  prompt: "Credential access attack scenarios"
  mode: graph
  prompt_style: anchored   # Uses security-domain examples
  depth: 3
  degree: 8
  system_prompt: |
    Generate adversarial security test cases for AI assistant hardening.
```
### generation

Controls sample generation.

| Field | Type | Default | Description |
|---|---|---|---|
| `system_prompt` | string | - | Instructions for generation LLM |
| `instructions` | string | - | Additional guidance |
| `conversation` | object | - | Conversation type settings |
| `tools` | object | - | Tool configuration |
| `max_retries` | int | 3 | Retries on API failures |
| `sample_retries` | int | 2 | Retries on validation failures |
| `max_tokens` | int | 2000 | Max tokens per generation |
| `llm` | object | - | Override shared LLM settings |
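A minimal sketch combining the fields above (values are illustrative):

```yaml
generation:
  system_prompt: |
    Generate clear, educational examples.
  instructions: "Create diverse, practical scenarios."
  max_retries: 3        # retries on API failures
  sample_retries: 2     # retries on validation failures
  max_tokens: 2000
  llm:
    temperature: 0.5    # override the shared default
```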
#### generation.conversation

| Field | Type | Options | Description |
|---|---|---|---|
| `type` | string | `basic`, `cot` | Conversation format |
| `reasoning_style` | string | `freetext`, `agent` | For `cot` only |
**Agent Mode**

Agent mode is automatically enabled when tools are configured. No explicit `agent_mode` setting is required.
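For example, a Chain of Thought dataset without tools would use freetext reasoning; a sketch, to contrast with the agent-style conversation in the complete example above:

```yaml
generation:
  conversation:
    type: cot
    reasoning_style: freetext   # no tools configured, so agent mode stays off
```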
#### generation.tools

| Field | Type | Description |
|---|---|---|
| `spin_endpoint` | string | Spin service URL |
| `tools_endpoint` | string | Endpoint to load tool definitions (for non-builtin components) |
| `components` | object | Map of component name to tool names (see below) |
| `custom` | list | Inline tool definitions |
| `max_per_query` | int | Max tools per sample |
| `max_agent_steps` | int | Max ReAct iterations |
| `scenario_seed` | object | Initial file state |
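A sketch using the documented fields; the endpoint URLs and numeric limits are illustrative, and `custom` / `scenario_seed` are omitted because their internal structure is not covered here:

```yaml
tools:
  spin_endpoint: "http://localhost:3000"          # Spin service URL
  tools_endpoint: "http://localhost:3000/tools"   # illustrative; where non-builtin definitions are loaded from
  components:
    builtin:
      - read_file
      - write_file
  max_per_query: 3       # at most three tools offered per sample
  max_agent_steps: 5     # cap on ReAct iterations
```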
##### components

The `components` field maps component names to lists of tool names. Each component routes to `/{component}/execute`:
```yaml
components:
  builtin:      # Routes to /vfs/execute (built-in tools)
    - read_file
    - write_file
  mock:         # Routes to /mock/execute
    - get_weather
  github:       # Routes to /github/execute
    - list_issues
```
**Component Types**

- `builtin`: Uses built-in VFS tools (`read_file`, `write_file`, `list_files`, `delete_file`)
- Other components: Load tool definitions from `tools_endpoint`
### output

Controls final dataset.

| Field | Type | Default | Description |
|---|---|---|---|
| `system_prompt` | string | - | System message in training data |
| `include_system_message` | bool | true | Include system message |
| `num_samples` | int or string | required | Total samples: integer, `"auto"`, or percentage like `"50%"` |
| `batch_size` | int | 1 | Parallel generation concurrency (number of simultaneous LLM calls) |
| `save_as` | string | required | Output file path |
| `checkpoint` | object | - | Checkpoint configuration (see below) |
**Auto and Percentage Samples**

`num_samples` supports special values:

- Integer (e.g., `100`): Generate exactly this many samples
- `"auto"`: Generate one sample per unique topic (100% coverage)
- Percentage (e.g., `"50%"`, `"200%"`): Generate samples relative to unique topic count
When `num_samples` exceeds the number of unique topics, DeepFabric iterates through multiple cycles. Each cycle processes all unique topics once. For example, with 50 unique topics and `num_samples: 120`:
- Cycle 1: 50 samples (topics 1-50)
- Cycle 2: 50 samples (topics 1-50)
- Cycle 3: 20 samples (topics 1-20, partial)
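As an illustration, the three accepted forms of `num_samples` are shown together below; only one should be active, and the commented lines are alternatives:

```yaml
output:
  save_as: "dataset.jsonl"
  num_samples: "auto"     # one sample per unique topic
  # num_samples: "200%"   # two full cycles over the unique topics
  # num_samples: 120      # fixed count; see the cycle breakdown above
```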
### output.checkpoint (Optional)

Configuration for checkpoint-based resume capability. Checkpoints allow pausing and resuming long-running dataset generation without losing progress.

| Field | Type | Default | Description |
|---|---|---|---|
| `interval` | int | required | Save checkpoint every N samples |
| `path` | string | XDG data dir | Directory to store checkpoint files (auto-managed) |
| `retry_failed` | bool | false | When resuming, retry previously failed samples |
```yaml
output:
  save_as: "dataset.jsonl"
  num_samples: 5000
  batch_size: 5
  checkpoint:
    interval: 500        # Save every 500 samples
    retry_failed: false
```
Checkpointing creates three files in the checkpoint directory:

- `{name}.checkpoint.json` - Metadata (progress, IDs processed)
- `{name}.checkpoint.jsonl` - Samples saved so far
- `{name}.checkpoint.failures.jsonl` - Failed samples for debugging
**Choosing Checkpoint Interval**

The checkpoint interval specifies how many samples to generate between saves. Choose an interval that balances recovery granularity (smaller = less work lost) against I/O overhead (larger = fewer disk writes). For example, `interval: 100` saves progress every 100 samples.
**Cycle-Based Generation Model**

DeepFabric uses a cycle-based generation model:

- Unique topics: Deduplicated topics from your tree/graph (by UUID)
- Cycles: Number of times to iterate through all unique topics
- Concurrency: Maximum parallel LLM calls (`batch_size`)

For example, with 100 unique topics and `num_samples: 250`:

- Cycles needed: 3 (`ceil(250 / 100)`)
- Cycle 1: 100 samples, Cycle 2: 100 samples, Cycle 3: 50 samples (partial)

Checkpoints track progress as `(topic_uuid, cycle)` tuples, enabling precise resume from any point.
**Memory Optimization**

When checkpointing is enabled, samples are flushed to disk periodically, keeping memory usage constant regardless of dataset size.
### huggingface (Optional)

| Field | Type | Description |
|---|---|---|
| `repository` | string | HuggingFace repo (`user/name`) |
| `tags` | list | Dataset tags |
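A sketch of the upload block; the repository name and tags are placeholders, and `HF_TOKEN` must be available in the environment (or `huggingface-cli` already authenticated):

```yaml
huggingface:
  repository: "your-org/your-dataset"   # placeholder user/name
  tags: ["synthetic", "deepfabric"]     # illustrative tags
```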
## CLI Overrides
Most config options can be overridden via CLI:
```bash
deepfabric generate config.yaml \
  --provider anthropic \
  --model claude-3-5-sonnet-20241022 \
  --num-samples 100 \
  --batch-size 10 \
  --temperature 0.5
```
**Full Options**

Run `deepfabric generate --help` for all available options.