Bonus Unit 2 — Observability & Evaluation¶

Overview¶

As agents become more complex, observability — the ability to inspect what the agent did and why — becomes essential for debugging and improvement.

Key questions observability answers¶

Which tool calls were made, in what order?
How long did each step take?
Where did the agent fail or hallucinate?
Which sub-task caused a wrong final answer?

Tracing with OpenTelemetry¶

smolagents integrates with OpenTelemetry-compatible backends (Langfuse, Arize Phoenix, etc.):

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

# Configure your exporter (e.g. Langfuse, Phoenix)
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(your_exporter))

SmolagentsInstrumentor().instrument(tracer_provider=provider)

Evaluation metrics¶

Metric	Description
Exact match	Is the final answer exactly correct?
F1 / ROUGE	Partial credit for text overlap
Tool accuracy	Did the agent call the right tools?
Steps to answer	Efficiency of the trajectory
Cost	Total tokens consumed

LLM-as-judge¶

from smolagents import HfApiModel

judge_model = HfApiModel(model_id="meta-llama/Meta-Llama-3-8B-Instruct")

def llm_judge(question: str, answer: str, reference: str) -> bool:
    prompt = f"""Question: {question}
Reference answer: {reference}
Agent answer: {answer}
Is the agent answer correct? Reply only YES or NO."""
    response = judge_model(prompt)
    return "YES" in response.upper()

Notes & experiments¶

Add your observability setup and evaluation results here.