Metrics¶
Understanding DeepFabric's evaluation metrics.
Overview¶
| Metric | Weight | Description |
|---|---|---|
| Tool Selection Accuracy | 40% | Correct tool chosen |
| Parameter Accuracy | 35% | Correct parameter types |
| Execution Success Rate | 25% | Valid executable call |
| Response Quality | 0% | Not used for tool evaluation |
Tool Selection Accuracy¶
Measures whether the model selected the correct tool.
Calculation: (correct selections) / (total samples)
Example:
- Expected: read_file
- Predicted: read_file
- Result: Correct (1.0)
Common issues: - Model selects wrong tool for task - Model doesn't make a tool call when expected - Model hallucinates non-existent tools
Parameter Accuracy¶
Measures whether the model provided correct parameter types.
Calculation: Validates that: 1. All required parameters are present 2. Parameter types match the schema
Example:
# Tool schema
{
"name": "read_file",
"parameters": [
{"name": "file_path", "type": "str", "required": True}
]
}
# Predicted call
{"file_path": "config.json"} # Correct - string provided
{"file_path": 123} # Wrong - integer instead of string
Note: Values aren't compared exactly. The evaluation checks types, not whether the specific value matches.
Execution Success Rate¶
Measures whether the tool call could execute successfully.
A call is valid if: - Correct tool is selected - All required parameters are present - Parameter types are correct
Overall Score¶
Weighted combination of metrics:
Custom Weights¶
config = EvaluatorConfig(
...,
metric_weights={
"tool_selection": 0.50,
"parameter_accuracy": 0.30,
"execution_success": 0.20,
"response_quality": 0.00,
},
)
Interpreting Results¶
| Score Range | Interpretation |
|---|---|
| 90-100% | Excellent - model is production-ready |
| 75-90% | Good - may need more training data |
| 50-75% | Fair - review failure cases |
| <50% | Poor - training issues likely |
Debugging Low Scores¶
Low Tool Selection¶
Check if: - Training data has clear tool usage patterns - Tools have distinct use cases - System prompt explains when to use each tool
Low Parameter Accuracy¶
Check if: - Training examples show correct parameter formats - Required vs optional parameters are clear - Complex parameter types (lists, dicts) are handled
High Processing Errors¶
Check if: - Model output format matches expected chat format - Model is generating valid JSON for tool calls - Inference configuration (temperature, max_tokens) is appropriate
Sample Evaluation Details¶
Access individual sample results: