Unit 4 — Capstone Project

Overview

The capstone project brings together all the concepts from the course. The goal is to build an agent that can be evaluated automatically on a set of benchmark tasks and ranked on a leaderboard.

Project goals

  • Build an agent capable of answering diverse questions (text, code, web search, math)
  • Integrate at least two different tool types
  • Achieve a passing score on the GAIA benchmark subset used in the course
  • Submit results to the public leaderboard

GAIA Benchmark

GAIA is a benchmark designed to test the real-world reasoning capabilities of AI agents. Solving its tasks requires:

  • Multi-step reasoning
  • Tool use (web search, code execution, file parsing)
  • Common sense and factual knowledge

Agent architecture (planned)

graph TD
    U[User query] --> M[Manager Agent]
    M --> S[Web Search Tool]
    M --> C[Code Execution Tool]
    M --> R[RAG Tool]
    M --> F[File Parser Tool]
    S & C & R & F --> M
    M --> A[Final Answer]
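
The routing in the diagram can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the `ManagerAgent` class, the four stub tools, and the keyword-based `route` method are hypothetical stand-ins, and a real manager would more likely ask an LLM which tool to call and loop until it has a final answer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical stand-ins for the four tools in the diagram. Real tools would
# call a search API, a sandboxed interpreter, a retriever, and a file parser.
def web_search(query: str) -> str:
    return f"[search results for: {query}]"


def run_code(source: str) -> str:
    # Evaluates a simple expression with builtins disabled.
    # This is NOT a real sandbox; it only keeps the sketch short.
    return str(eval(source, {"__builtins__": {}}, {}))


def rag_lookup(query: str) -> str:
    return f"[retrieved passages for: {query}]"


def parse_file(path: str) -> str:
    return f"[parsed contents of: {path}]"


@dataclass
class ManagerAgent:
    tools: Dict[str, Callable[[str], str]]
    trace: List[str] = field(default_factory=list)

    def route(self, query: str) -> str:
        """Pick a tool name with simple keyword rules (assumed, not LLM-driven)."""
        lowered = query.lower()
        if lowered.startswith("compute "):
            return "code"
        if lowered.startswith("file "):
            return "file"
        if "course notes" in lowered:
            return "rag"
        return "search"

    def run(self, query: str) -> str:
        name = self.route(query)
        # Strip the routing prefix so the tool sees only its argument.
        argument = query.split(" ", 1)[1] if name in ("code", "file") else query
        observation = self.tools[name](argument)
        self.trace.append(f"{name}: {observation}")
        # One tool call, then answer; the diagram also allows repeated calls.
        return observation


manager = ManagerAgent(
    tools={"search": web_search, "code": run_code, "rag": rag_lookup, "file": parse_file}
)
print(manager.run("compute 6 * 7"))
print(manager.run("file report.pdf"))
```

Keeping the tools behind a plain name-to-callable mapping means a stub can later be swapped for a real implementation without touching the manager.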

Evaluation

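# Assumption: evaluate_agent is a project-local helper that you write yourself;
# smolagents is not known to export a function with this name.
from evaluation import evaluate_agent  # hypothetical module

score = evaluate_agent(
    agent=my_agent,                  # the agent built for this project
    dataset="gaia-benchmark/GAIA",   # dataset ID on the Hugging Face Hub
    split="validation",
)
print(f"Score: {score:.1%}")  # expects score as a fraction between 0 and 1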
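
One way such a score could be computed is a simple exact-match loop, sketched below under stated assumptions. The function names `normalize` and `exact_match_score`, the `question`/`answer` field names, and the toy tasks are all hypothetical; they are not GAIA data, and the leaderboard may apply its own normalization rules.

```python
from typing import Callable, Iterable, Mapping


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(answer).lower().split())


def exact_match_score(
    answer_fn: Callable[[str], str],
    tasks: Iterable[Mapping[str, str]],
) -> float:
    """Return the fraction of tasks whose normalized answer equals the reference."""
    tasks = list(tasks)
    if not tasks:
        return 0.0
    correct = sum(
        normalize(answer_fn(task["question"])) == normalize(task["answer"])
        for task in tasks
    )
    return correct / len(tasks)


# Toy tasks with hypothetical field names; not taken from GAIA.
toy_tasks = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Largest planet?", "answer": "Jupiter"},
    {"question": "3 * 3", "answer": "9"},
]


def toy_agent(question: str) -> str:
    # Stand-in for the real agent: answers three of the four toy questions correctly.
    canned = {
        "2 + 2": "4",
        "Capital of France?": " paris ",
        "Largest planet?": "Saturn",
        "3 * 3": "9",
    }
    return canned[question]


score = exact_match_score(toy_agent, toy_tasks)
print(f"Score: {score:.1%}")  # prints "Score: 75.0%"
```

Passing the agent in as a plain callable keeps the scoring logic independent of any particular agent framework, so the same loop can be reused while the agent itself changes.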

Notes & results

Add your capstone notes, intermediate results and leaderboard submissions here.