generate¶

The generate command executes the complete synthetic data generation pipeline from YAML configuration to finished dataset. This command represents the primary interface for transforming domain concepts into structured training data through topic modeling and content generation.

graph LR
    A[YAML Config] --> B[Topic Generation]
    B --> C[Dataset Creation]
    C --> D[Output Files]

The generation process operates through multiple stages that can be monitored in real-time, providing visibility into topic expansion, content creation, and quality control measures.

Basic Usage¶

Generate a complete dataset from a configuration file:

Basic generation

deepfabric generate config.yaml

This command reads your configuration, generates the topic structure, creates training examples, and saves all outputs to the specified locations.

Configuration Override¶

Override specific configuration parameters without modifying the configuration file:

Override parameters

deepfabric generate config.yaml \
  --provider anthropic \
  --model claude-sonnet-4-5 \
  --temperature 0.8 \
  --num-samples 100 \
  --batch-size 5

Experimentation

Configuration overrides apply to all stages, enabling experimentation with different settings while maintaining the base configuration.

File Management Options¶

Control where intermediate and final outputs are saved:

Custom output paths

deepfabric generate config.yaml \
  --topics-save-as custom_topics.jsonl \
  --output-save-as custom_dataset.jsonl

Output Organization

These options override the file paths specified in your configuration, useful for organizing outputs by experiment or preventing accidental overwrites.

Loading Existing Topic Structures¶

Skip topic generation by loading previously generated topic trees or graphs:

Load existing topics

deepfabric generate config.yaml --topics-load existing_topics.jsonl

Faster Iteration

This approach accelerates iteration when experimenting with dataset generation parameters while keeping the topic structure constant.

Topic-Only Generation¶

Generate and save only the topic structure without proceeding to dataset creation:

Tree ModeGraph Mode

deepfabric generate config.yaml --topic-only

deepfabric generate config.yaml --mode graph --topic-only

The --topic-only flag stops the pipeline after topic generation and saves the topic structure to the configured location.

Topic Modeling Parameters¶

Fine-tune topic generation behavior through command-line parameters:

Topic parameters

deepfabric generate config.yaml \
  --degree 5 \
  --depth 4 \
  --temperature 0.7

Parameter	Effect
`--degree`	More subtopics per node (broader exploration)
`--depth`	More levels of detail (deeper exploration)
`--temperature`	Higher values create more diverse topics

Dataset Generation Controls¶

Adjust dataset creation parameters for different scales and quality requirements:

Generation controls

deepfabric generate config.yaml \
  --num-samples 500 \
  --batch-size 10 \
  --no-system-message

Parameter	Description
`--num-samples`	Total samples: integer, `auto`, or percentage (e.g., `50%`, `200%`)
`--batch-size`	Affects generation speed and resource usage
`--include-system-message`	Include system prompts in training examples
`--no-system-message`	Exclude system prompts from training examples

Automatic Sample Count

Use --num-samples auto to generate exactly one sample per topic path. Use percentages like --num-samples 50% for partial coverage or --num-samples 200% to cycle through topics multiple times.

Integer values larger than the number of topic paths work the same way—--num-samples 100 with 50 paths is equivalent to --num-samples 200%.

Conversation Type Options¶

Control the type of conversations generated:

Conversation options

deepfabric generate config.yaml \
  --conversation-type cot \
  --reasoning-style freetext

Option	Values	Description
`--conversation-type`	`basic`, `cot`	Base conversation type
`--reasoning-style`	`freetext`, `agent`	Reasoning style for cot

Agent Mode

Agent mode is automatically enabled when tools are configured. No explicit --agent-mode flag is required. The --agent-mode flag is deprecated.

TUI Options¶

Control the terminal user interface:

Rich TUISimple Output

deepfabric generate config.yaml --tui rich

Two-pane interface with real-time progress (default).

deepfabric generate config.yaml --tui simple

Headless-friendly plain text output.

Provider and Model Selection¶

Use different providers or models for different components:

Provider selection

deepfabric generate config.yaml \
  --provider openai \
  --model gpt-4 \
  --temperature 0.9

Provider Scope

Provider changes apply to all components unless overridden in the configuration file.

Complete Example¶

Comprehensive generation command

Full example

deepfabric generate research-dataset.yaml \
  --topics-save-as research_topics.jsonl \
  --output-save-as research_examples.jsonl \
  --provider anthropic \
  --model claude-sonnet-4-5 \
  --degree 4 \
  --depth 3 \
  --num-samples 200 \
  --batch-size 8 \
  --temperature 0.8 \
  --include-system-message

This creates a research dataset with comprehensive topic coverage and high-quality content generation.

Progress Monitoring¶

The generation process provides real-time feedback:

Topic tree construction progress with node counts
Dataset generation status with completion percentages
Error reporting with retry attempts and failure categorization
Final statistics including success rates and output file locations

Checkpoint and Resume¶

For long-running generation jobs, enable checkpointing to save progress and allow resuming after interruptions:

Enable checkpointing

deepfabric generate config.yaml --checkpoint-interval 500

If generation is interrupted, resume from the checkpoint:

Resume from checkpoint

deepfabric generate config.yaml --resume

To retry previously failed samples when resuming:

Resume with retry

deepfabric generate config.yaml --resume --retry-failed

Parameter	Description
`--checkpoint-interval N`	Save checkpoint every N samples
`--checkpoint-path PATH`	Override checkpoint directory (rarely needed)
`--resume`	Resume from existing checkpoint
`--retry-failed`	When resuming, retry previously failed samples

Checkpoint Location

Checkpoints are stored in an XDG-compliant location: ~/.local/share/deepfabric/checkpoints/ (or $XDG_DATA_HOME/deepfabric/checkpoints/). Each config file gets its own subdirectory based on a hash of its path, so different projects don't interfere with each other.

Checkpoint Status

Use deepfabric checkpoint-status config.yaml to inspect checkpoint progress without resuming.

How Checkpointing Works¶

Checkpointing saves progress at regular intervals:

Samples file - Incrementally appends successful samples (no write amplification)
Failures file - Records failed samples with error details
Metadata file - Tracks progress, processed paths, and generation config

On resume, the generator:

Loads previously processed topic paths
Skips already-completed work
Continues from where it left off
Optionally retries failed samples with --retry-failed

Memory Optimization

When checkpointing is enabled, samples are flushed to disk periodically, keeping memory usage constant regardless of dataset size. This allows generating datasets with 50,000+ samples without memory issues.

Graceful Stop¶

Press ++ctrl+c++ once during generation to request a graceful stop. The generator will:

Complete the current batch
Save a final checkpoint
Exit cleanly

The TUI status panel shows "Stopping: at next checkpoint" to confirm the stop is pending. Press ++ctrl+c++ a second time to force quit immediately (no checkpoint saved).

Checkpoint Conflicts¶

If you start generation without --resume but a checkpoint already exists for that configuration, you'll be prompted:

⚠ Existing checkpoint found for this configuration

  1) Resume from checkpoint
  2) Clear checkpoint and start fresh
  3) Abort

Choose an option [1/2/3] (1):

This prevents accidentally overwriting or corrupting existing checkpoint data.

Error Recovery¶

Partial Failures

When generation fails partway through, the system saves intermediate results where possible. Topic trees are saved incrementally, enabling recovery by loading partial results and continuing from the dataset generation stage. With checkpointing enabled, dataset generation progress is also preserved.

Optimizing Generation Performance

Balance batch-size with your API rate limits and system resources. Larger batches increase throughput but consume more memory and may trigger rate limiting. Start with smaller batches and increase based on your provider's capabilities.

Output Validation¶

After generation completes, verify your outputs:

Validation commands

# Check dataset format
head -n 5 your_dataset.jsonl

# Validate JSON structure
python -m json.tool your_dataset.jsonl > /dev/null

# Count generated examples
wc -l your_dataset.jsonl