Metrics¶
Understanding DeepFabric's evaluation metrics.
Overview¶
| Metric | Weight | Description |
|---|---|---|
| Tool Selection Accuracy | 40% | Correct tool chosen |
| Parameter Accuracy | 35% | Correct parameter types |
| Execution Success Rate | 25% | Valid executable call |
| Response Quality | 0% | Not used for tool evaluation |
Tool Selection Accuracy¶
Measures whether the model selected the correct tool.
Calculation: (correct selections) / (total samples)
Example
- Expected: read_file
- Predicted: read_file
- Result: Correct (1.0)
Common Issues
- Model selects wrong tool for task
- Model doesn't make a tool call when expected
- Model hallucinates non-existent tools
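The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not DeepFabric's implementation: the function name `tool_selection_accuracy` and the tool names other than `read_file` are hypothetical, and a missing tool call is represented as `None`.

```python
from typing import Optional, Sequence


def tool_selection_accuracy(
    expected: Sequence[str],
    predicted: Sequence[Optional[str]],
) -> float:
    """Fraction of samples where the predicted tool name equals the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0  # assumption: an empty evaluation set scores 0
    # A missing call (None) or a hallucinated tool name never equals the
    # expected name, so both count as incorrect selections.
    correct = sum(1 for e, p in zip(expected, predicted) if p == e)
    return correct / len(expected)


expected = ["read_file", "write_file", "list_dir", "read_file"]
# In order: correct tool, wrong tool, no tool call, hallucinated tool.
predicted = ["read_file", "read_file", None, "open_file"]

accuracy = tool_selection_accuracy(expected, predicted)
print(accuracy)  # 0.25
```

Each of the three common issues lowers the score in the same way here, so inspecting individual samples is still needed to tell them apart.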
Parameter Accuracy¶
Measures whether the model provided correct parameter types.
Calculation: Validates that:
- All required parameters are present
- Parameter types match the schema
Example
Type Checking Only
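Parameter values are not compared against the expected values. The evaluation checks only that each value has the type declared in the schema, so a call with a different but correctly typed value still passes.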
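As a rough sketch of what a type-only check looks like, the snippet below validates arguments against a JSON-Schema-style parameter definition. The function `parameters_valid`, the `JSON_TYPES` mapping, and the example schema are assumptions made for illustration; DeepFabric's own validation may differ, for example in how it treats parameters that are not in the schema.

```python
from typing import Any, Dict

# Assumption: the schema uses JSON Schema type names, one string per parameter.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def parameters_valid(schema: Dict[str, Any], arguments: Dict[str, Any]) -> bool:
    """True if every required parameter is present and every supplied
    parameter has the type declared for it in the schema."""
    for name in schema.get("required", []):
        if name not in arguments:
            return False
    properties = schema.get("properties", {})
    for name, value in arguments.items():
        spec = properties.get(name)
        if spec is None:
            continue  # assumption: parameters unknown to the schema are ignored
        expected_type = JSON_TYPES.get(spec.get("type"))
        if expected_type is None:
            continue  # no recognised type declared, nothing to check
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and spec["type"] in ("integer", "number"):
            return False
        if not isinstance(value, expected_type):
            return False
    return True


# Hypothetical schema for a read_file tool.
schema = {
    "properties": {
        "path": {"type": "string"},
        "max_lines": {"type": "integer"},
    },
    "required": ["path"],
}

print(parameters_valid(schema, {"path": "notes.txt"}))  # True: value differs, type matches
print(parameters_valid(schema, {"path": 42}))           # False: wrong type
print(parameters_valid(schema, {"max_lines": 10}))      # False: required "path" missing
```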
Execution Success Rate¶
Measures whether the tool call could execute successfully.
A call is valid if:
- Correct tool is selected
- All required parameters are present
- Parameter types are correct
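The three conditions combine with a logical AND, which can be sketched as follows. The `CallCheck` structure and `execution_success_rate` function are hypothetical, and the sketch assumes the rate is the fraction of samples whose call satisfies all three conditions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CallCheck:
    """Outcome of the three validity checks for one sample (hypothetical structure)."""
    tool_correct: bool
    required_present: bool
    types_correct: bool

    @property
    def executable(self) -> bool:
        # A call counts as valid only if every condition holds.
        return self.tool_correct and self.required_present and self.types_correct


def execution_success_rate(checks: Sequence[CallCheck]) -> float:
    """Fraction of samples whose tool call passed all three checks."""
    if not checks:
        return 0.0  # assumption: an empty evaluation set scores 0
    return sum(c.executable for c in checks) / len(checks)


checks = [
    CallCheck(True, True, True),    # valid call
    CallCheck(True, True, False),   # right tool, wrong parameter type
    CallCheck(False, True, True),   # wrong tool
    CallCheck(True, True, True),    # valid call
]
rate = execution_success_rate(checks)
print(rate)  # 0.5
```

Because a failure in either of the first two metrics also fails this one, the execution success rate can never exceed tool selection accuracy under these assumptions.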
Overall Score¶
Weighted combination of metrics:
Score calculation
overall = (
    tool_selection * 0.40 +
    parameter_accuracy * 0.35 +
    execution_success * 0.25
)
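A worked example makes the weighting concrete. The helper `overall_score` and the metric values below are made up for illustration; only the weights come from the table in the overview.

```python
from typing import Dict

# Default weights from the overview table.
DEFAULT_WEIGHTS = {
    "tool_selection": 0.40,
    "parameter_accuracy": 0.35,
    "execution_success": 0.25,
}


def overall_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-metric scores (each expected in the 0 to 1 range)."""
    return sum(metrics[name] * weight for name, weight in weights.items())


# Illustrative metric values, not real evaluation results.
metrics = {
    "tool_selection": 0.90,
    "parameter_accuracy": 0.80,
    "execution_success": 0.70,
}

score = overall_score(metrics, DEFAULT_WEIGHTS)
# 0.90 * 0.40 + 0.80 * 0.35 + 0.70 * 0.25 = 0.36 + 0.28 + 0.175
print(round(score, 3))  # 0.815
```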
Custom Weights¶
Custom metric weights
config = EvaluatorConfig(
    ...,
    metric_weights={
        "tool_selection": 0.50,
        "parameter_accuracy": 0.30,
        "execution_success": 0.20,
        "response_quality": 0.00,
    },
)
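If the overall score should stay in the 0 to 1 range, the weights need to be non-negative and sum to 1. This page does not say whether `EvaluatorConfig` enforces that, so a small check before building the config can help. The `check_weights` helper below is a hypothetical sketch and not part of DeepFabric.

```python
import math
from typing import Dict


def check_weights(weights: Dict[str, float], tol: float = 1e-9) -> Dict[str, float]:
    """Raise ValueError unless the weights are non-negative and sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    # Compare with a tolerance because decimal fractions are inexact in floating point.
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total:.3f}, expected 1.0")
    return weights


metric_weights = check_weights({
    "tool_selection": 0.50,
    "parameter_accuracy": 0.30,
    "execution_success": 0.20,
    "response_quality": 0.00,
})
```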
Interpreting Results¶
| Score Range | Interpretation |
|---|---|
| 90-100% | Excellent - model is production-ready |
| 75-90% | Good - may need more training data |
| 50-75% | Fair - review failure cases |
| <50% | Poor - training issues likely |
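The table can be turned into a small lookup for reports or CI checks. The ranges in the table share their boundary values (90% and 75% each appear in two rows), so this sketch assumes a boundary score belongs to the higher band. The `interpret` function is hypothetical.

```python
def interpret(score: float) -> str:
    """Map an overall score in the 0 to 1 range to the bands in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Assumption: a score exactly on a boundary falls into the higher band.
    if score >= 0.90:
        return "Excellent"
    if score >= 0.75:
        return "Good"
    if score >= 0.50:
        return "Fair"
    return "Poor"


print(interpret(0.815))  # Good
print(interpret(0.42))   # Poor
```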
Debugging Low Scores¶
Low Tool Selection¶
Check These
- Training data has clear tool usage patterns
- Tools have distinct use cases
- System prompt explains when to use each tool
Low Parameter Accuracy¶
Check These
- Training examples show correct parameter formats
- Required vs optional parameters are clear
- Complex parameter types (lists, dicts) are handled
High Processing Errors¶
Check These
- Model output format matches expected chat format
- Model is generating valid JSON for tool calls
- Inference configuration (temperature, max_tokens) is appropriate
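To check the second item quickly, you can try parsing raw model outputs yourself. The sketch below assumes a tool call is a JSON object with a string `name` and an object `arguments`. That shape is an assumption for illustration, so adjust it to the chat format your model actually emits. `parse_tool_call` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Optional


def parse_tool_call(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed tool call, or None if the output is not usable."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON, e.g. output truncated by max_tokens
    if not isinstance(call, dict):
        return None  # valid JSON, but not an object
    # Assumed shape: {"name": <str>, "arguments": <object>}
    if not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return call


good = parse_tool_call('{"name": "read_file", "arguments": {"path": "notes.txt"}}')
truncated = parse_tool_call('{"name": "read_file", "arguments": {"path": ')
print(good is not None)   # True
print(truncated is None)  # True
```

A high share of truncated outputs usually points at `max_tokens`, while malformed but complete JSON points at the training data or the temperature setting.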
Sample Evaluation Details¶
Access individual sample results: