generate¶
The generate command executes the complete synthetic data generation pipeline from YAML configuration to finished dataset. This command represents the primary interface for transforming domain concepts into structured training data through topic modeling and content generation.
graph LR
A[YAML Config] --> B[Topic Generation]
B --> C[Dataset Creation]
C --> D[Output Files]
The generation process operates through multiple stages that can be monitored in real-time, providing visibility into topic expansion, content creation, and quality control measures.
Basic Usage¶
Generate a complete dataset from a configuration file:
This command reads your configuration, generates the topic structure, creates training examples, and saves all outputs to the specified locations.
Configuration Override¶
Override specific configuration parameters without modifying the configuration file:
deepfabric generate config.yaml \
--provider anthropic \
--model claude-sonnet-4-5 \
--temperature 0.8 \
--num-samples 100 \
--batch-size 5
Experimentation
Configuration overrides apply to all stages, enabling experimentation with different settings while maintaining the base configuration.
File Management Options¶
Control where intermediate and final outputs are saved:
deepfabric generate config.yaml \
--topics-save-as custom_topics.jsonl \
--output-save-as custom_dataset.jsonl
Output Organization
These options override the file paths specified in your configuration, useful for organizing outputs by experiment or preventing accidental overwrites.
Loading Existing Topic Structures¶
Skip topic generation by loading previously generated topic trees or graphs:
Faster Iteration
This approach accelerates iteration when experimenting with dataset generation parameters while keeping the topic structure constant.
Topic-Only Generation¶
Generate and save only the topic structure without proceeding to dataset creation:
The --topic-only flag stops the pipeline after topic generation and saves the topic structure to the configured location.
Topic Modeling Parameters¶
Fine-tune topic generation behavior through command-line parameters:
| Parameter | Effect |
|---|---|
--degree |
More subtopics per node (broader exploration) |
--depth |
More levels of detail (deeper exploration) |
--temperature |
Higher values create more diverse topics |
Dataset Generation Controls¶
Adjust dataset creation parameters for different scales and quality requirements:
deepfabric generate config.yaml \
--num-samples 500 \
--batch-size 10 \
--no-system-message
| Parameter | Description |
|---|---|
--num-samples |
Total samples: integer, auto, or percentage (e.g., 50%, 200%) |
--batch-size |
Affects generation speed and resource usage |
--include-system-message |
Include system prompts in training examples |
--no-system-message |
Exclude system prompts from training examples |
Automatic Sample Count
Use --num-samples auto to generate exactly one sample per topic path. Use percentages like --num-samples 50% for partial coverage or --num-samples 200% to cycle through topics multiple times.
Integer values larger than the number of topic paths work the same way—--num-samples 100 with 50 paths is equivalent to --num-samples 200%.
Conversation Type Options¶
Control the type of conversations generated:
deepfabric generate config.yaml \
--conversation-type cot \
--reasoning-style freetext
| Option | Values | Description |
|---|---|---|
--conversation-type |
basic, cot |
Base conversation type |
--reasoning-style |
freetext, agent |
Reasoning style for cot |
Agent Mode
Agent mode is automatically enabled when tools are configured. No explicit --agent-mode flag is required. The --agent-mode flag is deprecated.
TUI Options¶
Control the terminal user interface:
Provider and Model Selection¶
Use different providers or models for different components:
deepfabric generate config.yaml \
--provider openai \
--model gpt-4 \
--temperature 0.9
Provider Scope
Provider changes apply to all components unless overridden in the configuration file.
Complete Example¶
Comprehensive generation command
deepfabric generate research-dataset.yaml \
--topics-save-as research_topics.jsonl \
--output-save-as research_examples.jsonl \
--provider anthropic \
--model claude-sonnet-4-5 \
--degree 4 \
--depth 3 \
--num-samples 200 \
--batch-size 8 \
--temperature 0.8 \
--include-system-message
This creates a research dataset with comprehensive topic coverage and high-quality content generation.
Progress Monitoring¶
The generation process provides real-time feedback:
- Topic tree construction progress with node counts
- Dataset generation status with completion percentages
- Error reporting with retry attempts and failure categorization
- Final statistics including success rates and output file locations
Checkpoint and Resume¶
For long-running generation jobs, enable checkpointing to save progress and allow resuming after interruptions:
If generation is interrupted, resume from the checkpoint:
To retry previously failed samples when resuming:
| Parameter | Description |
|---|---|
--checkpoint-interval N |
Save checkpoint every N samples |
--checkpoint-path PATH |
Override checkpoint directory (rarely needed) |
--resume |
Resume from existing checkpoint |
--retry-failed |
When resuming, retry previously failed samples |
Checkpoint Location
Checkpoints are stored in an XDG-compliant location: ~/.local/share/deepfabric/checkpoints/ (or $XDG_DATA_HOME/deepfabric/checkpoints/). Each config file gets its own subdirectory based on a hash of its path, so different projects don't interfere with each other.
Checkpoint Status
Use deepfabric checkpoint-status config.yaml to inspect checkpoint progress without resuming.
How Checkpointing Works¶
Checkpointing saves progress at regular intervals:
- Samples file - Incrementally appends successful samples (no write amplification)
- Failures file - Records failed samples with error details
- Metadata file - Tracks progress, processed paths, and generation config
On resume, the generator:
- Loads previously processed topic paths
- Skips already-completed work
- Continues from where it left off
- Optionally retries failed samples with
--retry-failed
Memory Optimization
When checkpointing is enabled, samples are flushed to disk periodically, keeping memory usage constant regardless of dataset size. This allows generating datasets with 50,000+ samples without memory issues.
Graceful Stop¶
Press ++ctrl+c++ once during generation to request a graceful stop. The generator will:
- Complete the current batch
- Save a final checkpoint
- Exit cleanly
The TUI status panel shows "Stopping: at next checkpoint" to confirm the stop is pending. Press ++ctrl+c++ a second time to force quit immediately (no checkpoint saved).
Checkpoint Conflicts¶
If you start generation without --resume but a checkpoint already exists for that configuration, you'll be prompted:
⚠ Existing checkpoint found for this configuration
1) Resume from checkpoint
2) Clear checkpoint and start fresh
3) Abort
Choose an option [1/2/3] (1):
This prevents accidentally overwriting or corrupting existing checkpoint data.
Error Recovery¶
Partial Failures
When generation fails partway through, the system saves intermediate results where possible. Topic trees are saved incrementally, enabling recovery by loading partial results and continuing from the dataset generation stage. With checkpointing enabled, dataset generation progress is also preserved.
Optimizing Generation Performance
Balance batch-size with your API rate limits and system resources. Larger batches increase throughput but consume more memory and may trigger rate limiting. Start with smaller batches and increase based on your provider's capabilities.
Output Validation¶
After generation completes, verify your outputs: