Evaluation¶
DeepFabric includes an evaluation system for testing fine-tuned models on tool-calling tasks.
Workflow¶
1. Generate a dataset with a train/eval split (a minimal split example follows this list)
2. Train model on training split
3. Evaluate on held-out eval split
4. Review metrics and improve
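How you perform the split in step 1 is up to you; the sketch below is a minimal, DeepFabric-agnostic version that assumes the generated samples were saved as JSONL (the file names, seed, and 90/10 ratio are illustrative, not part of the DeepFabric API):

```python
import json
import random

# Load the generated samples (hypothetical path).
with open("dataset.jsonl") as f:
    samples = [json.loads(line) for line in f]

# Shuffle and hold out 10% of samples for evaluation (illustrative ratio).
random.seed(42)
random.shuffle(samples)
split = int(len(samples) * 0.9)
train_samples, eval_samples = samples[:split], samples[split:]

# Write the two splits back out as JSONL.
for path, rows in [("train.jsonl", train_samples), ("eval.jsonl", eval_samples)]:
    with open(path, "w") as f:
        f.writelines(json.dumps(row) + "\n" for row in rows)
```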
Quick Example¶
from deepfabric.evaluation import Evaluator, EvaluatorConfig, InferenceConfig
config = EvaluatorConfig(
    inference_config=InferenceConfig(
        model_path="./output/checkpoint-final",
        backend="transformers",
    ),
)
evaluator = Evaluator(config)
results = evaluator.evaluate(dataset=eval_dataset)
print(f"Tool Selection: {results.metrics.tool_selection_accuracy:.2%}")
print(f"Parameter Accuracy: {results.metrics.parameter_accuracy:.2%}")
print(f"Overall Score: {results.metrics.overall_score:.2%}")
What Gets Evaluated¶
For each sample in the evaluation dataset:
- Extract ground truth from the sample's expected tool calls
- Run inference with the user message and available tools
- Compare predicted tool selection and parameters against ground truth
- Score the result (a simplified scoring sketch follows this list)
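The real comparison logic lives in `deepfabric.evaluation`; the sketch below only illustrates the idea for a single sample, and the field names (`tool_name`, `arguments`) and matching rules are assumptions, not DeepFabric internals:

```python
from typing import Any


def score_sample(expected: dict[str, Any], predicted: dict[str, Any]) -> dict[str, float]:
    """Illustrative per-sample scoring; see deepfabric.evaluation for the real metrics."""
    # Tool selection: did the model call the expected tool by name?
    tool_match = float(predicted.get("tool_name") == expected["tool_name"])

    # Parameter accuracy: fraction of expected parameters present with matching values.
    expected_params = expected.get("arguments", {})
    predicted_params = predicted.get("arguments", {})
    if expected_params:
        matched = sum(1 for k, v in expected_params.items() if predicted_params.get(k) == v)
        param_score = matched / len(expected_params)
    else:
        param_score = 1.0

    return {"tool_selection": tool_match, "parameter_accuracy": param_score}


# Example: right tool, one of two parameters correct.
print(score_sample(
    {"tool_name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}},
    {"tool_name": "get_weather", "arguments": {"city": "Paris", "unit": "fahrenheit"}},
))
# {'tool_selection': 1.0, 'parameter_accuracy': 0.5}
```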
Key Metrics¶
| Metric | Description | Weight |
|---|---|---|
| Tool Selection Accuracy | Did the model pick the right tool? | 40% |
| Parameter Accuracy | Are the parameter types correct? | 35% |
| Execution Success Rate | Would the call execute successfully? | 25% |
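If the overall score is a weighted sum of these three metrics using the weights above (an assumption inferred from the weight column, not a documented formula), the combination looks like this:

```python
def overall_score(tool_selection: float, parameter: float, execution: float) -> float:
    """Weighted combination using the weights from the table above."""
    return 0.40 * tool_selection + 0.35 * parameter + 0.25 * execution


# Example: perfect tool selection, 80% parameter accuracy, 90% execution success.
print(f"{overall_score(1.0, 0.80, 0.90):.2%}")  # 90.50%
```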
Next Steps¶
- Running Evaluation - Configuration and usage
- Metrics - Understanding the metrics