DeepFabric

Focused Training for more Grounded and Efficient Models


  • Diverse Data

    Topic graph algorithms ensure coverage without redundancy - no overfitting on repetitive samples.

  • Real Execution

    Tools run in sandboxed environments, not simulated. Training data reflects actual behavior.

  • Validated Output

    Constrained decoding and strict validation ensure correct syntax and structure every time.
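To make the last point concrete: strict validation means each generated sample is checked against a schema before it is written to disk, and malformed generations are rejected rather than saved. Here is a minimal sketch of the idea using Pydantic (illustrative only; this is not DeepFabric's internal API):

from pydantic import BaseModel, ValidationError

class QASample(BaseModel):
    # Hypothetical schema mirroring the CoT sample format shown later on this page
    question: str
    answer: str
    reasoning_trace: list[str]

raw = {"question": "What is CI?", "answer": "Continuous integration."}  # missing reasoning_trace
try:
    QASample.model_validate(raw)
except ValidationError as err:
    print(err)  # the malformed sample is dropped or retried instead of being written out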

Quick Start

For basic dataset generation, install DeepFabric with the default dependencies using the commands below.

If you plan to use the training or evaluation utilities described in the Training or Evaluation sections, install the training extra instead (e.g., pip install "deepfabric[training]").

# Install from PyPI with pip
pip install deepfabric

# Or with uv
uv add deepfabric

# Or from source
git clone https://github.com/always-further/deepfabric.git
cd deepfabric
uv sync --all-extras

Provider Setup

Set your API key for your chosen provider:

# OpenAI
export OPENAI_API_KEY="sk-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Gemini
export GEMINI_API_KEY="..."

# Ollama (runs locally)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral
ollama serve

No API Key Required

Ollama runs locally, so no API key is needed.
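For the hosted providers, a missing or misspelled key often only surfaces once generation starts. If you want to fail fast, a one-line check in Python works (swap in the variable for your provider):

import os

# Hypothetical pre-flight check; use ANTHROPIC_API_KEY or GEMINI_API_KEY as appropriate
assert os.environ.get("OPENAI_API_KEY"), "export OPENAI_API_KEY before running deepfabric generate"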

Verify Installation

deepfabric --help
deepfabric info

Now generate your first dataset:

Generate a dataset
export OPENAI_API_KEY="your-key"

deepfabric generate \
  --topic-prompt "DevOps and Platform Engineering" \
  --generation-system-prompt "You are an expert in DevOps and Platform Engineering. Generate examples of issue resolution and best practices." \
  --mode graph \
  --depth 2 \
  --degree 2 \
  --provider openai \
  --model gpt-4o \
  --num-samples 2 \
  --batch-size 1 \
  --conversation-type cot \
  --reasoning-style freetext \
  --output-save-as dataset.jsonl

What Just Happened?

The key steps in this example were as follows:

  1. Topic Graph Generation: A topic hierarchy was created starting from "DevOps and Platform Engineering". Topic graphs take a root prompt and recursively expand subtopics to form a DAG (Directed Acyclic Graph). Here, we used a depth of 2 and a degree of 2 to ensure coverage of subtopics.
  2. Dataset Generation: For each node topic in the graph, a synthetic dataset sample was generated using a chain-of-thought conversation style. Each example includes reasoning traces to illustrate the thought process behind the answers. With the above example, 2 total samples were generated as specified by --num-samples 2. You can also use --num-samples auto to generate one sample per topic path.
  3. Conversation and Reasoning Style: The cot conversation type with the freetext reasoning style encourages the model to provide detailed explanations along with its answers, enhancing the quality of the training data.

Let's break this down visually:

graph TD
    A[DevOps and Platform Engineering] --> B[CI/CD Pipelines]
    A --> C[Infrastructure as Code]
    B --> D[Best Practices for CI/CD]
    B --> E[Common CI/CD Tools]
    C --> F[IaC Benefits]
    C --> G[Popular IaC Tools]

As the diagram shows, we have a depth of 2 (the root plus 2 levels) and a degree of 2 (2 subtopics per topic).

Each of these topics is then used to generate corresponding dataset samples.
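As a rough sanity check on run size: a full expansion with a given depth and degree produces at most 1 + degree + degree² + ... topics (graph mode may merge overlapping subtopics, so the actual count can be lower). A few lines of Python make the bound explicit:

def max_topics(depth: int, degree: int) -> int:
    # Upper bound for a full expansion: degree**level topics at each level below the root
    return sum(degree ** level for level in range(depth + 1))

print(max_topics(depth=2, degree=2))  # 7, matching the 7 nodes in the diagram above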

Best Practices for CI/CD - Sample Output
dataset.jsonl
{
  "question": "What are some best practices for implementing CI/CD pipelines?",
  "answer": "Some best practices include automating testing, using version control, and ensuring fast feedback loops.",
  "reasoning_trace": [
    "The user is asking about best practices for CI/CD pipelines.",
    "I know that automation is key in CI/CD to ensure consistency and reliability.",
    "Version control allows tracking changes and collaboration among team members.",
    "Fast feedback loops help catch issues early in the development process."
  ]
}
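Once a run finishes, you can spot-check the output with a few lines of standard-library Python:

import json

# Load the JSONL output: one sample per line
with open("dataset.jsonl") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(f"loaded {len(samples)} samples")
for s in samples:
    # cot/freetext samples carry the three fields shown in the example above
    assert {"question", "answer", "reasoning_trace"} <= s.keys()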

Using Config Files

For more control over dataset generation, create a configuration file:

config.yaml
topics:
  prompt: "Machine learning fundamentals"
  mode: graph            # or "tree" for JSONL format
  depth: 2
  degree: 3

generation:
  system_prompt: "Generate educational Q&A pairs."
  conversation:
    type: basic
  llm:
    provider: openai
    model: gpt-4o

output:
  system_prompt: "You are a helpful ML tutor."
  num_samples: 5
  batch_size: 1
  save_as: "ml-dataset.jsonl"

Then run:

deepfabric generate config.yaml
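If generation fails immediately, the first thing to check is that the YAML parses and contains the fields you expect. A quick check with PyYAML (assuming pyyaml is installed):

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Spot-check the fields used in the example config above
print(cfg["topics"]["mode"], cfg["topics"]["depth"], cfg["topics"]["degree"])
print(cfg["generation"]["llm"]["provider"], cfg["generation"]["llm"]["model"])
print(cfg["output"]["save_as"])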

Config vs CLI

Use configuration files for reproducible dataset generation. CLI flags are great for quick experiments.

Dataset Types

DeepFabric supports multiple dataset types to suit different training needs:

  • Basic Datasets

    Simple Q&A pairs for instruction-following tasks.

  • Reasoning Datasets

    Chain-of-thought traces for step-by-step problem solving.

  • Agent Datasets

    Tool-calling with real execution for building agents.