# Configuration Reference

DeepFabric is configured through a single YAML file with four main sections: `llm`, `topics`, `generation`, and `output`. An optional `huggingface` section publishes the finished dataset.
## Complete Example
```yaml
# Shared LLM defaults (optional)
llm:
  provider: "openai"
  model: "gpt-4o"
  temperature: 0.7

# Topic generation
topics:
  prompt: "Python programming fundamentals"
  mode: tree  # tree | graph
  depth: 3
  degree: 3
  save_as: "topics.jsonl"
  llm:  # Override shared LLM
    model: "gpt-4o-mini"

# Sample generation
generation:
  system_prompt: |
    Generate clear, educational examples.
  instructions: "Create diverse, practical scenarios."
  conversation:
    type: chain_of_thought
    reasoning_style: agent
    agent_mode: single_turn
  tools:
    spin_endpoint: "http://localhost:3000"
    available:
      - read_file
      - write_file
  max_retries: 3
  llm:
    temperature: 0.5

# Output configuration
output:
  system_prompt: |
    You are a helpful assistant with tool access.
  include_system_message: true
  num_samples: 50
  batch_size: 5
  save_as: "dataset.jsonl"

# Optional: Upload to HuggingFace
huggingface:
  repository: "username/dataset-name"
  tags: ["python", "agents"]
```
## Section Reference

### llm (Optional)

Shared LLM defaults inherited by `topics` and `generation`.

| Field | Type | Description |
|---|---|---|
| `provider` | string | LLM provider: `openai`, `anthropic`, `gemini`, `ollama` |
| `model` | string | Model name |
| `temperature` | float | Sampling temperature (0.0-2.0) |
| `base_url` | string | Custom API endpoint |
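For example, a shared block pointing at a self-hosted model might look like the sketch below. The model name and address are illustrative; `base_url` is only needed when the provider is not at its default endpoint:

```yaml
llm:
  provider: "ollama"
  model: "llama3.1"                    # illustrative model name
  temperature: 0.7
  base_url: "http://localhost:11434"   # Ollama's usual local address
```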
### topics

Controls topic tree/graph generation.

| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Root topic for generation |
| `mode` | string | `"tree"` | Generation mode: `tree` or `graph` |
| `depth` | int | 2 | Hierarchy depth (1-10) |
| `degree` | int | 3 | Subtopics per node (1-50) |
| `max_concurrent` | int | 4 | Max concurrent LLM calls (graph mode only, 1-20) |
| `system_prompt` | string | `""` | Custom instructions for the topic LLM |
| `save_as` | string | - | Path to save topics JSONL |
| `llm` | object | - | Override shared LLM settings |
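As a sketch, a graph-mode block that exercises the graph-only `max_concurrent` field (all values illustrative):

```yaml
topics:
  prompt: "Web security fundamentals"
  mode: graph
  depth: 3
  degree: 4
  max_concurrent: 8        # graph mode only; caps parallel LLM calls
  save_as: "topics.jsonl"
```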
### generation

Controls sample generation.

| Field | Type | Default | Description |
|---|---|---|---|
| `system_prompt` | string | - | Instructions for the generation LLM |
| `instructions` | string | - | Additional guidance |
| `conversation` | object | - | Conversation type settings |
| `tools` | object | - | Tool configuration |
| `max_retries` | int | 3 | Retries on API failures |
| `sample_retries` | int | 2 | Retries on validation failures |
| `max_tokens` | int | 2000 | Max tokens per generation |
| `llm` | object | - | Override shared LLM settings |
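For instance, the retry and token limits can be raised together when a provider is flaky or completions run long (values illustrative):

```yaml
generation:
  max_retries: 5        # retry transient API failures
  sample_retries: 3     # regenerate samples that fail validation
  max_tokens: 4000      # allow longer completions than the 2000 default
  llm:
    temperature: 0.5    # generation-only override of the shared default
```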
#### generation.conversation

| Field | Type | Options / Default | Description |
|---|---|---|---|
| `type` | string | `basic`, `chain_of_thought` | Conversation format |
| `reasoning_style` | string | `freetext`, `agent` | For `chain_of_thought` only |
| `agent_mode` | string | `single_turn`, `multi_turn` | For `agent` style only |
| `min_turns` | int | 1 | Minimum turns (`multi_turn` only) |
| `max_turns` | int | 5 | Maximum turns (`multi_turn` only) |
| `min_tool_calls` | int | 1 | Minimum tool calls (`multi_turn` only) |
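As a sketch, a multi-turn agent configuration combining these fields (values illustrative):

```yaml
generation:
  conversation:
    type: chain_of_thought
    reasoning_style: agent    # agent style enables agent_mode
    agent_mode: multi_turn
    min_turns: 2              # at least 2 turns per sample
    max_turns: 6
    min_tool_calls: 2         # at least 2 tool invocations per sample
```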
#### generation.tools

| Field | Type | Description |
|---|---|---|
| `spin_endpoint` | string | Spin service URL |
| `tools_endpoint` | string | MCP tools endpoint |
| `available` | list | Tool names to use (empty = all) |
| `custom` | list | Inline tool definitions |
| `max_per_query` | int | Max tools per sample |
| `max_agent_steps` | int | Max ReAct iterations |
| `scenario_seed` | object | Initial file state |
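A tools block restricting the agent to two tools and bounding its loop might look like this (the endpoint URL and limits are illustrative):

```yaml
generation:
  tools:
    tools_endpoint: "http://localhost:8080/tools"   # illustrative MCP endpoint
    available:
      - read_file
      - write_file
    max_per_query: 2      # offer at most 2 tools per sample
    max_agent_steps: 8    # cap ReAct iterations
```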
### output

Controls the final dataset.

| Field | Type | Default | Description |
|---|---|---|---|
| `system_prompt` | string | - | System message included in training data |
| `include_system_message` | bool | true | Whether to include the system message in each sample |
| `num_samples` | int | required | Total samples to generate |
| `batch_size` | int | 1 | Parallel generation batch size |
| `save_as` | string | required | Output file path |
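For example, to keep system messages out of the saved samples while generating in larger parallel batches (values illustrative):

```yaml
output:
  include_system_message: false   # omit system messages from saved samples
  num_samples: 200
  batch_size: 10                  # generate 10 samples in parallel
  save_as: "dataset.jsonl"
```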
### huggingface (Optional)

| Field | Type | Description |
|---|---|---|
| `repository` | string | HuggingFace repo (`user/name`) |
| `tags` | list | Dataset tags |
## CLI Overrides

Most config options can be overridden via CLI:
```bash
deepfabric generate config.yaml \
  --provider anthropic \
  --model claude-3-5-sonnet-20241022 \
  --num-samples 100 \
  --batch-size 10 \
  --temperature 0.5
```
Run `deepfabric generate --help` for all options.