Unit 4: Capstone Project

Overview

The capstone project brings together all concepts from the course. The goal is to build an agent that can be automatically evaluated on a leaderboard of tasks.

Project goals

Build an agent capable of answering diverse questions (text, code, web search, math)
Integrate at least two different tool types
Achieve a passing score on the GAIA benchmark subset used in the course
Submit results to the public leaderboard

GAIA Benchmark

GAIA is a benchmark designed to test the real-world reasoning capabilities of AI agents. It requires:

Multi-step reasoning
Tool use (web search, code execution, file parsing)
Common sense and factual knowledge

Agent architecture (planned)

graph TD
    U[User query] --> M[Manager Agent]
    M --> S[Web Search Tool]
    M --> C[Code Execution Tool]
    M --> R[RAG Tool]
    M --> F[File Parser Tool]
    S & C & R & F --> M
    M --> A[Final Answer]
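
The routing in the diagram above can be sketched in plain Python. This is a minimal illustration rather than the project's implementation: `ManagerAgent`, the keyword routing table, and the four stub tools are all hypothetical names, and a real manager would most likely let a language model choose the tools instead of matching keywords.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A tool is modelled as a plain function from a text request to a text result.
Tool = Callable[[str], str]


@dataclass
class ManagerAgent:
    """Hypothetical manager that routes a query to tools and merges the results."""

    tools: Dict[str, Tool]
    # Keyword-based routing table (an assumption made for this sketch).
    routes: Dict[str, List[str]] = field(default_factory=dict)

    def plan(self, query: str) -> List[str]:
        """Return the names of the tools whose keywords appear in the query."""
        lowered = query.lower()
        return [
            name
            for name, keywords in self.routes.items()
            if name in self.tools and any(k in lowered for k in keywords)
        ]

    def run(self, query: str) -> str:
        """Call each selected tool, then combine the observations into one answer."""
        observations: List[Tuple[str, str]] = [
            (name, self.tools[name](query)) for name in self.plan(query)
        ]
        if not observations:
            return "No tool matched the query."
        return "; ".join(f"{name}: {result}" for name, result in observations)


# Stub tools standing in for the four tool boxes in the diagram.
def web_search(query: str) -> str:
    return f"search results for '{query}'"


def run_code(query: str) -> str:
    return "code executed"


def rag_lookup(query: str) -> str:
    return "retrieved passages"


def parse_file(query: str) -> str:
    return "parsed file contents"


manager = ManagerAgent(
    tools={
        "web_search": web_search,
        "code": run_code,
        "rag": rag_lookup,
        "file": parse_file,
    },
    routes={
        "web_search": ["latest", "who", "when"],
        "code": ["compute", "calculate"],
        "rag": ["according to the course"],
        "file": [".csv", ".pdf", ".xlsx"],
    },
)

print(manager.run("Calculate the sum in data.csv"))
```

Keeping every tool behind the same `str -> str` signature means the manager never needs to know which tool it is calling, so a stub can later be swapped for a real search, code execution, RAG, or file parsing backend without touching the routing logic.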

Evaluation

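# NOTE: `evaluate_agent` is kept from the original notes as a placeholder. It is
# not confirmed to be part of the smolagents public API, so verify the import
# against the installed version or replace it with your own scoring helper.
from smolagents import evaluate_agent

# `my_agent` is the agent built for this project and must be defined beforehand.
score = evaluate_agent(
    agent=my_agent,
    dataset="gaia-benchmark/GAIA",  # dataset ID on the Hugging Face Hub
    split="validation",
)
# The format spec assumes the helper returns a fraction between 0 and 1.
print(f"Score: {score:.1%}")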
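
To make the scoring step concrete, here is a small self-contained sketch of how a fraction-correct score could be computed locally. It is an assumption-laden stand-in, not the official GAIA scorer: `normalize`, `exact_match_score`, `toy_agent`, and the toy question set are hypothetical, and the leaderboard applies its own matching rules, which this does not reproduce.

```python
import re
from typing import Callable, Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding spaces and periods."""
    text = answer.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip(" .")


def exact_match_score(
    agent: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of questions whose normalized answer matches."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(
        normalize(agent(question)) == normalize(expected)
        for question, expected in examples
    )
    return correct / len(examples)


# Hypothetical toy data and agent, for illustration only (not GAIA tasks).
toy_examples = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]


def toy_agent(question: str) -> str:
    answers = {"What is 2 + 2?": "4", "Capital of France?": " paris. "}
    return answers.get(question, "unknown")


score = exact_match_score(toy_agent, toy_examples)
print(f"Score: {score:.1%}")  # prints "Score: 66.7%"
```

Running a local check like this on a handful of questions with known answers is a cheap way to catch formatting mistakes in the agent's final answers before submitting to the leaderboard.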

Notes & results

Add your capstone notes, intermediate results and leaderboard submissions here.