- Diverse Data: Topic graph algorithms ensure coverage without redundancy, so there is no overfitting from repetitive samples.
- Real Execution: Tools run in sandboxed environments, not simulated. Training data reflects actual behavior.
- Validated Output: Constrained decoding and strict validation ensure correct syntax and structure every time.
Quick Start
For basic dataset generation, install DeepFabric with the default dependencies using the commands below.
If you plan to use the training or evaluation utilities described in the Training or Evaluation sections, install the training extra instead (e.g., pip install "deepfabric[training]").
Provider Setup
Set your API key for your chosen provider:
export OPENAI_API_KEY="your-key"
Verify Installation
Now generate your first dataset:
deepfabric generate \
--topic-prompt "DevOps and Platform Engineering" \
--generation-system-prompt "You are an expert in DevOps and Platform Engineering. Generate examples of issue resolution and best practices." \
--mode graph \
--depth 2 \
--degree 2 \
--provider openai \
--model gpt-4o \
--num-samples 2 \
--batch-size 1 \
--conversation-type cot \
--reasoning-style freetext \
--output-save-as dataset.jsonl
What Just Happened?
The key steps in this example were as follows:
- Topic Graph Generation: A topic hierarchy was created starting from "DevOps and Platform Engineering". Topic graphs take a root prompt and recursively expand subtopics to form a DAG (Directed Acyclic Graph) structure. Here, we used a depth of 2 and a degree of 2 to ensure coverage of subtopics.
- Dataset Generation: For each topic node in the graph, a synthetic dataset sample was generated using a chain-of-thought conversation style. Each example includes a reasoning trace to illustrate the thought process behind the answer. In the example above, 2 total samples were generated, as specified by `--num-samples 2`. You can also use `--num-samples auto` to generate one sample per topic path.
- Conversation and Reasoning Style: The `cot` conversation type with the `freetext` reasoning style encourages the model to provide detailed explanations along with answers, improving the quality of the training data.
Let's break this down visually:
graph TD
A[DevOps and Platform Engineering] --> B[CI/CD Pipelines]
A --> C[Infrastructure as Code]
B --> D[Best Practices for CI/CD]
B --> E[Common CI/CD Tools]
C --> F[IaC Benefits]
C --> G[Popular IaC Tools]
As you can see, the graph has a depth of 2 (root plus 2 levels) and a degree of 2 (2 subtopics per topic).
Each of these topics is then used to generate a corresponding dataset sample.
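The relationship between depth, degree, and topic count follows directly from the expansion rule: each level multiplies the previous one by the degree. In tree mode the count is exact; in graph mode subtopics can merge, so it is an upper bound. A quick sketch (plain Python, not part of the DeepFabric API):

```python
def topic_count(depth: int, degree: int) -> int:
    """Total nodes in a full topic tree: the root plus `degree` subtopics
    per node, expanded `depth` levels deep (1 + degree + degree**2 + ...)."""
    return sum(degree ** level for level in range(depth + 1))

print(topic_count(depth=2, degree=2))  # 7 topics: 1 root + 2 + 4, as in the diagram
print(topic_count(depth=2, degree=3))  # 13 topics
```

This is a useful back-of-the-envelope check before a run, since topic count drives both API cost and (with `--num-samples auto`) dataset size.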
Best Practices for CI/CD - Sample Output
{
"question": "What are some best practices for implementing CI/CD pipelines?",
"answer": "Some best practices include automating testing, using version control, and ensuring fast feedback loops.",
"reasoning_trace": [
"The user is asking about best practices for CI/CD pipelines.",
"I know that automation is key in CI/CD to ensure consistency and reliability.",
"Version control allows tracking changes and collaboration among team members.",
"Fast feedback loops help catch issues early in the development process."
]
}
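Once generation finishes, a quick sanity check confirms each JSONL record has the expected fields. A minimal sketch, assuming the chain-of-thought schema shown above (`question`, `answer`, `reasoning_trace`); the authoritative schema is defined by DeepFabric itself:

```python
import json

def validate_cot_record(line: str) -> dict:
    """Parse one JSONL line and check the chain-of-thought fields."""
    record = json.loads(line)
    for field in ("question", "answer", "reasoning_trace"):
        if field not in record:
            raise ValueError(f"missing field: {field}")
    if not isinstance(record["reasoning_trace"], list):
        raise ValueError("reasoning_trace must be a list of steps")
    return record

# Usage against the generated file:
#   records = [validate_cot_record(l) for l in open("dataset.jsonl") if l.strip()]
sample = '{"question": "q", "answer": "a", "reasoning_trace": ["step 1"]}'
print(validate_cot_record(sample)["answer"])  # a
```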
Using Config Files
For more control over dataset generation, create a configuration file:
topics:
prompt: "Machine learning fundamentals"
mode: tree
depth: 2
degree: 3
generation:
system_prompt: "Generate educational Q&A pairs."
conversation:
type: basic
llm:
provider: openai
model: gpt-4o
output:
system_prompt: "You are a helpful ML tutor."
num_samples: 5
batch_size: 1
save_as: "ml-dataset.jsonl"
Then run:
Config vs CLI
Use configuration files for reproducible dataset generation. CLI flags are great for quick experiments.
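Config files also make it easy to catch mistakes before launching a long generation run. A minimal pre-flight sketch over a plain dict mirroring the YAML above (the key names come from this example; the authoritative schema is DeepFabric's own):

```python
REQUIRED_SECTIONS = {
    "topics": {"prompt", "mode", "depth", "degree"},
    "generation": {"system_prompt"},
    "llm": {"provider", "model"},
    "output": {"num_samples", "save_as"},
}

def check_config(config: dict) -> list[str]:
    """Return a list of missing section.key paths (empty means OK)."""
    missing = []
    for section, keys in REQUIRED_SECTIONS.items():
        body = config.get(section, {})
        missing += [f"{section}.{key}" for key in keys if key not in body]
    return missing

config = {
    "topics": {"prompt": "Machine learning fundamentals", "mode": "tree",
               "depth": 2, "degree": 3},
    "generation": {"system_prompt": "Generate educational Q&A pairs."},
    "llm": {"provider": "openai", "model": "gpt-4o"},
    "output": {"num_samples": 5, "save_as": "ml-dataset.jsonl"},
}
print(check_config(config))  # [] -- all required keys present
```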
Dataset Types
DeepFabric supports multiple dataset types to suit different training needs:
- Basic Datasets: Simple Q&A pairs for instruction-following tasks
- Reasoning Datasets: Chain-of-thought traces for step-by-step problem solving
- Agent Datasets: Tool-calling with real execution for building agents