# Evaluation
DeepFabric includes an evaluation system for testing fine-tuned models on tool-calling tasks.
## Prerequisites
The examples in this section require DeepFabric to be installed with the training extra, which includes optional dependencies such as PyTorch and PEFT.
## Workflow
```mermaid
graph LR
    A[Generate Dataset] --> B[Train Model]
    B --> C[Evaluate]
    C --> D[Review Metrics]
    D --> E[Improve]
    E --> A
```
- Generate a dataset with a train/eval split (see the sketch after this list)
- Train the model on the training split
- Evaluate on the held-out eval split
- Review the metrics and improve
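The split itself can be as simple as shuffling and slicing the generated samples. Below is a minimal sketch; `samples`, the split fraction, and the function name are illustrative assumptions, not a DeepFabric API:

```python
import random

def train_eval_split(samples: list[dict], eval_fraction: float = 0.1, seed: int = 42):
    """Shuffle and slice samples into train and held-out eval splits."""
    rng = random.Random(seed)
    shuffled = samples[:]  # copy so the original order is preserved
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)

# train_dataset, eval_dataset = train_eval_split(samples)
```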
## Quick Example
```python
# evaluation_example.py
from deepfabric.evaluation import Evaluator, EvaluatorConfig, InferenceConfig

config = EvaluatorConfig(
    inference_config=InferenceConfig(
        model="./output/checkpoint-final",
        backend="transformers",
    ),
)

evaluator = Evaluator(config)

# eval_dataset is the held-out split produced during dataset generation
results = evaluator.evaluate(dataset=eval_dataset)

print(f"Tool Selection: {results.metrics.tool_selection_accuracy:.2%}")
print(f"Parameter Accuracy: {results.metrics.parameter_accuracy:.2%}")
print(f"Overall Score: {results.metrics.overall_score:.2%}")
```
## Using In-Memory Models
**Avoid OOM errors.** After training, pass the model directly without reloading it from disk:
```python
from deepfabric.evaluation import Evaluator, EvaluatorConfig, InferenceConfig

# After training with SFTTrainer...
FastLanguageModel.for_inference(model)  # Unsloth: switch the trained model to inference mode

config = EvaluatorConfig(
    inference_config=InferenceConfig(
        model=model,          # Pass the model object directly
        tokenizer=tokenizer,  # Required when passing an in-memory model
    ),
)

evaluator = Evaluator(config)
results = evaluator.evaluate(dataset=eval_dataset)
```
This avoids OOM errors and speeds up the train-evaluate workflow.
See Running Evaluation for details.
## What Gets Evaluated
For each sample in the evaluation dataset, the evaluator:

- Extracts the ground truth from the sample's expected tool calls
- Runs inference with the user message and available tools
- Compares the predicted tool selection and parameters against the ground truth
- Scores the result (a minimal scoring sketch follows this list)
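To make the comparison step concrete, here is a minimal sketch of scoring one sample. The `expected`/`predicted` dictionary shapes and the helper name are illustrative assumptions, not DeepFabric internals:

```python
def score_sample(expected: dict, predicted: dict) -> dict:
    """Compare a predicted tool call against the ground-truth call (illustrative only)."""
    tool_match = predicted.get("tool") == expected["tool"]

    expected_params = expected.get("parameters", {})
    predicted_params = predicted.get("parameters", {})
    # Fraction of expected parameters the model reproduced exactly
    if expected_params:
        param_accuracy = sum(
            predicted_params.get(name) == value
            for name, value in expected_params.items()
        ) / len(expected_params)
    else:
        param_accuracy = 1.0

    return {"tool_match": tool_match, "param_accuracy": param_accuracy}

# score_sample(
#     {"tool": "get_weather", "parameters": {"city": "Paris"}},
#     {"tool": "get_weather", "parameters": {"city": "Paris"}},
# )  # -> {"tool_match": True, "param_accuracy": 1.0}
```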
## Key Metrics
| Metric | Description | Weight |
|---|---|---|
| Tool Selection Accuracy | Did the model pick the right tool? | 40% |
| Parameter Accuracy | Are the parameter types correct? | 35% |
| Execution Success Rate | Would the call execute successfully? | 25% |
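The `overall_score` printed in the quick example is presumably the weighted combination of these metrics. A minimal sketch using the weights from the table (the standalone function is illustrative; DeepFabric computes this internally):

```python
def overall_score(tool_selection: float, parameter: float, execution: float) -> float:
    """Weighted overall score: 40% tool selection, 35% parameters, 25% execution."""
    return 0.40 * tool_selection + 0.35 * parameter + 0.25 * execution

# overall_score(1.0, 0.8, 1.0) -> 0.93
```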
## Next Steps
- **Running Evaluation**: configuration and usage for model evaluation
- **Metrics**: understanding evaluation metrics in depth