Evaluation

DeepFabric includes an evaluation system for testing fine-tuned models on tool-calling tasks.

Prerequisites

The examples in this section require DeepFabric to be installed with the training extra, which includes optional dependencies such as PyTorch and PEFT.

pip install "deepfabric[training]"
# or, with uv:
uv add "deepfabric[training]"
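
To confirm the optional dependencies are available, a quick import check works. This is a minimal sketch; it only verifies that torch and peft resolve:

# Sanity check that the training extras (PyTorch, PEFT) are importable.
import torch
import peft

print("torch", torch.__version__)
print("peft", peft.__version__)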

Workflow

[Workflow diagram: Generate Dataset → Train Model → Evaluate → Review Metrics → Improve → back to Generate Dataset]
1. Generate a dataset with a train/eval split (see the splitting sketch after this list)
2. Train the model on the training split
3. Evaluate on the held-out eval split
4. Review the metrics and improve
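
The split itself can be produced however you prefer. Here is a minimal sketch that shuffles a generated JSONL file into train and eval files; the file names and the 90/10 ratio are assumptions for illustration, not DeepFabric defaults:

# Split a generated JSONL dataset into train/eval files.
import json
import random

with open("dataset.jsonl") as f:  # assumed output file name
    samples = [json.loads(line) for line in f]

random.seed(42)  # reproducible split
random.shuffle(samples)

cut = int(len(samples) * 0.9)  # assumed 90/10 split
train, eval_split = samples[:cut], samples[cut:]

for name, split in [("train.jsonl", train), ("eval.jsonl", eval_split)]:
    with open(name, "w") as f:
        for sample in split:
            f.write(json.dumps(sample) + "\n")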

Quick Example

evaluation_example.py
from deepfabric.evaluation import Evaluator, EvaluatorConfig, InferenceConfig

config = EvaluatorConfig(
    inference_config=InferenceConfig(
        model="./output/checkpoint-final",
        backend="transformers",
    ),
)

evaluator = Evaluator(config)

# eval_dataset is the held-out eval split produced earlier in the workflow
results = evaluator.evaluate(dataset=eval_dataset)

print(f"Tool Selection: {results.metrics.tool_selection_accuracy:.2%}")
print(f"Parameter Accuracy: {results.metrics.parameter_accuracy:.2%}")
print(f"Overall Score: {results.metrics.overall_score:.2%}")

Using In-Memory Models

Avoid OOM errors: after training, pass the model object directly instead of reloading the checkpoint from disk:

# After training with SFTTrainer (model and tokenizer come from that run)...
FastLanguageModel.for_inference(model)  # switch the Unsloth model to inference mode

config = EvaluatorConfig(
    inference_config=InferenceConfig(
        model=model,          # Pass model object directly
        tokenizer=tokenizer,  # Required with in-memory model
    ),
)

evaluator = Evaluator(config)
results = evaluator.evaluate(dataset=eval_dataset)

This avoids OOM errors and speeds up the train-evaluate workflow.

See Running Evaluation for details.

What Gets Evaluated

For each sample in the evaluation dataset:

1. Extract the ground truth from the sample's expected tool calls
2. Run inference with the user message and available tools
3. Compare the predicted tool selection and parameters against the ground truth (sketched below)
4. Score the result
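
For illustration only, here is the kind of comparison step 3 performs on a single sample. The dictionary shapes and field names are assumptions, not DeepFabric's actual schema:

# Hypothetical ground-truth and predicted tool calls for one sample.
expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
predicted = {"name": "get_weather", "arguments": {"city": "Paris"}}

# Tool selection: did the model call the right function?
tool_match = predicted["name"] == expected["name"]

# Parameters: naive check for missing argument names.
missing = set(expected["arguments"]) - set(predicted["arguments"])

print(f"tool selected correctly: {tool_match}")
print(f"missing parameters:      {sorted(missing)}")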

Key Metrics

Metric                    Description                             Weight
Tool Selection Accuracy   Did the model pick the right tool?      40%
Parameter Accuracy        Are the parameter types correct?        35%
Execution Success Rate    Would the call execute successfully?    25%
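
The overall score appears to be a weighted combination of these three metrics. A sketch of that formula, inferred from the weights in the table rather than taken from DeepFabric's source:

# Weighted overall score, using the 40% / 35% / 25% weights above.
def overall_score(tool_selection: float, parameters: float, execution: float) -> float:
    return 0.40 * tool_selection + 0.35 * parameters + 0.25 * execution

print(f"{overall_score(0.92, 0.88, 0.95):.2%}")  # 91.35%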

Next Steps