After class · at home Workshop

Build a research assistant agent

Take everything from class and build something you'll actually use: an agent that researches a topic for your work. You'll grow it in six levels — skeleton, a simple tool, a real tool, parallel researchers you merge into one summary, then the harness pieces that make it real: observability and evals.

🎯 The goal

Give the agent a research question (“What are the trade-offs of vector databases for RAG?”, “Summarize recent work on X for my thesis”). It splits the question into subtopics, researches each — in parallel — and returns one clean, structured brief. Build it with Copilot at your side, using the same ask → test → improve loop from class.

Before you start

Have your free GEMINI_API_KEY set (see the Agents page), and pip install pydantic-ai. Keep a request cap on every run so you never blow through the free tier.

The six levels

Level 1 · Skeleton

A minimal agent that returns structured output

Start with the smallest thing that runs. Define the shape of a research brief and get the agent to fill it in — no tools yet.

research_agent.py

```
from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.usage import UsageLimits

class Brief(BaseModel):
    topic: str
    key_points: list[str] = Field(description="3-5 concise findings")
    summary: str

agent = Agent(
    "google-gla:gemini-2.0-flash",
    output_type=Brief,
    system_prompt="You are a rigorous research assistant. Be concise and factual.",
)

if __name__ == "__main__":
    out = agent.run_sync(
        "Give me a research brief on vector databases for RAG.",
        usage_limits=UsageLimits(request_limit=5),
    )
    print(out.output)
```

Ask Copilot: “Explain what output_type does here and what happens if the model returns invalid data.”

Level 2 · A simple tool

Give it its first action

Add one easy tool so the model stops relying only on memory. Start with something trivial to prove the loop works — then you'll trust it with a real one.
research_agent.py — add a tool
```
from datetime import date

@agent.tool_plain
def today() -> str:
    """Return today's date as YYYY-MM-DD, for grounding time-sensitive claims."""
    return date.today().isoformat()
```
Ask Copilot: “Write a quick test that runs the agent and asserts the Brief.topic is non-empty.” Then run it and watch the model decide whether to call today().

Level 3 · A real tool

Let it reach the outside world

Now a tool that actually fetches information — a web search. Use any search API you like (Tavily, Brave, DuckDuckGo, SerpAPI). The agent calls it, reads the results, and grounds its brief in them.
research_agent.py — real tool
```
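import os

import httpx

@agent.tool_plain
async def web_search(query: str) -> list[str]:
    """Search the web and return the top result snippets for the query."""
    # Use the async client: a blocking httpx.post() inside an async tool
    # would stall the event loop and serialize the parallel runs in Level 4.
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            "https://api.tavily.com/search",
            json={"api_key": os.environ["TAVILY_API_KEY"],
                  "query": query, "max_results": 5},
        )
    resp.raise_for_status()
    return [r["content"] for r in resp.json().get("results", [])]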
```
Vibe-code it: ask Copilot to handle the case where the API returns no results by raising ModelRetry("No results — try a broader query.") so the agent reformulates instead of crashing. This is the “ask → test → improve” loop from class, on your own tool. (Tavily has a free tier; any search API works.)
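
One way the empty-results case could look is sketched below. The helper name snippets_or_retry is hypothetical (it is not part of PydanticAI), and the payload shape is assumed to match the Tavily response used above: a "results" list whose items carry a "content" field. The try/except import is only there so the sketch also runs where pydantic-ai is not installed.

```python
try:
    from pydantic_ai import ModelRetry
except ImportError:
    # Stand-in so this sketch still runs without pydantic-ai installed.
    class ModelRetry(Exception):
        pass


def snippets_or_retry(payload: dict) -> list[str]:
    """Hypothetical helper: return result snippets, or ask the model to retry."""
    snippets = [r["content"] for r in payload.get("results", []) if r.get("content")]
    if not snippets:
        # Raising ModelRetry from a tool sends this message back to the model,
        # which can then call the tool again with a reformulated query.
        raise ModelRetry("No results. Try a broader query.")
    return snippets
```

Inside web_search, the last line would then become return snippets_or_retry(resp.json()). Keeping the check in a plain function also makes it easy to test without touching the network.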

Level 4 · Parallel + aggregate

Many researchers at once, one merged answer

The real power move: split the question into subtopics, run a research agent on each concurrently with asyncio.gather, then feed all the briefs to a final agent that synthesizes one report. Because the runs overlap, the research phase takes roughly as long as the slowest subtopic rather than the sum of all of them; the synthesis call is then added on top.

parallel_research.py

```
import asyncio
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.usage import UsageLimits

# ... (Brief model + `agent` with web_search tool from Levels 1-3) ...

class Report(BaseModel):
    question: str
    briefs: list[str]
    final_summary: str

# A separate agent whose only job is to merge findings.
synthesizer = Agent(
    "google-gla:gemini-2.0-flash",
    output_type=Report,
    system_prompt="Merge the research briefs into one coherent, non-repetitive report.",
)

async def research_one(subtopic: str) -> Brief:
    out = await agent.run(
        f"Research this subtopic and return a brief: {subtopic}",
        usage_limits=UsageLimits(request_limit=6),
    )
    return out.output

async def main(question: str, subtopics: list[str]) -> Report:
    # 1. Fan out: all subtopics researched at the same time.
    briefs = await asyncio.gather(*(research_one(s) for s in subtopics))

    # 2. Fan in: hand every brief to the synthesizer to combine.
    joined = "\n\n".join(f"## {b.topic}\n{b.summary}" for b in briefs)
    out = await synthesizer.run(
        f"Question: {question}\n\nBriefs:\n{joined}",
        usage_limits=UsageLimits(request_limit=5),
    )
    return out.output

if __name__ == "__main__":
    report = asyncio.run(main(
        "Should my team adopt a vector database for RAG?",
        ["performance & scaling", "cost", "alternatives to a dedicated vector DB"],
    ))
    print(report.final_summary)
```

✅ You just built a mini research pipeline

Fan out (many agents in parallel) → fan in (one agent merges). That pattern scales from 3 subtopics to 30. Cap every run, and log how long the parallel version takes vs. running them one by one.
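
To get a feel for the timing comparison before spending any API requests, a rough sketch like the one below may help. Everything in it is a stand-in: fake_research is a hypothetical replacement for research_one() that only sleeps, and the durations are made up. It uses just the standard library.

```python
import asyncio
import time


async def fake_research(subtopic: str, seconds: float) -> str:
    """Hypothetical stand-in for research_one(): waits, then returns a string."""
    await asyncio.sleep(seconds)
    return f"brief on {subtopic}"


async def run_sequential(jobs: dict[str, float]) -> float:
    """Run the jobs one after another; return elapsed seconds."""
    start = time.perf_counter()
    for subtopic, seconds in jobs.items():
        await fake_research(subtopic, seconds)
    return time.perf_counter() - start


async def run_parallel(jobs: dict[str, float]) -> float:
    """Run all jobs concurrently with asyncio.gather; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_research(s, t) for s, t in jobs.items()))
    return time.perf_counter() - start


async def compare(jobs: dict[str, float]) -> tuple[float, float]:
    """Return (sequential_seconds, parallel_seconds) for the same jobs."""
    sequential = await run_sequential(jobs)
    parallel = await run_parallel(jobs)
    return sequential, parallel


if __name__ == "__main__":
    # Made-up durations, in seconds, standing in for real research calls.
    jobs = {"performance & scaling": 0.3, "cost": 0.2, "alternatives": 0.1}
    sequential, parallel = asyncio.run(compare(jobs))
    print(f"sequential {sequential:.2f}s, parallel {parallel:.2f}s")
```

Swapping fake_research for the real research_one gives the same comparison on live runs; the same time.perf_counter() bracketing works around the gather call in main.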

Level 5 · See inside it

Observability — trace every step

Right now your agent is a black box. Add tracing so you can see every prompt, tool call, retry, token count, and error. This is the harness piece that turns “it's broken somewhere” into “here's the exact call that failed.” Two lines with Logfire (free tier, made by the PydanticAI team):
research_agent.py — top of file
```
import logfire

logfire.configure()             # sign in once with `logfire auth`
logfire.instrument_pydantic_ai()  # now every agent run is traced

# ...define your agents and tools as before...
```
Run the agent, then open your Logfire dashboard and watch the whole tree: each parallel researcher, every web_search call, and the synthesizer run that merges them. Ask Copilot: “Where is most of the time spent?” and read the answer off the trace.

Level 6 · Prove it works

A tiny eval — catch regressions before they ship

“Seemed fine” isn't good enough. Write a handful of cases and check the agent still passes them every time you change a prompt or a tool. Start dead simple:

eval_agent.py

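```
import asyncio

from pydantic_ai.usage import UsageLimits

from research_agent import agent  # the agent built in Levels 1-3

# (input question, a keyword the good answer should contain)
CASES = [
    ("Research briefly: what is RAG?", "retrieval"),
    ("Research briefly: what is a vector database?", "embedding"),
]

async def run_evals() -> None:
    passed = 0
    for question, must_contain in CASES:
        # Cap eval runs too, so a looping tool call can't drain the free tier.
        out = await agent.run(question, usage_limits=UsageLimits(request_limit=6))
        text = out.output.summary.lower()
        ok = must_contain in text
        print(("PASS" if ok else "FAIL"), "-", question)
        passed += ok  # True counts as 1, False as 0
    print(f"{passed}/{len(CASES)} cases passed")

if __name__ == "__main__":
    asyncio.run(run_evals())
```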

Keyword checks are a starting point. When you outgrow them, look at pydantic-evals — or add an LLM-as-judge that scores each answer. Either way: an eval you can re-run is what makes improvement measurable instead of vibes.
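
A small step between a single keyword and a full judge might be a coverage score over several expected keywords. The sketch below is one possible shape, not something from pydantic-evals; the names keyword_score and passes and the 0.5 threshold are illustrative assumptions.

```python
def keyword_score(text: str, keywords: list[str]) -> float:
    """Fraction of the expected keywords found in the text (case-insensitive)."""
    if not keywords:
        raise ValueError("need at least one keyword")
    lowered = text.lower()
    hits = sum(1 for k in keywords if k.lower() in lowered)
    return hits / len(keywords)


def passes(text: str, keywords: list[str], threshold: float = 0.5) -> bool:
    """A case passes when at least `threshold` of its keywords appear."""
    return keyword_score(text, keywords) >= threshold
```

In eval_agent.py, each case would then carry a list of keywords instead of one, and the ok line would become passes(out.output.summary, keywords).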

Stretch goals

📎

Cite sources

Have web_search return URLs too, and make the Brief include a sources list.

🧭

Auto-plan subtopics

Add a planner agent that turns the question into the subtopic list — so you only pass the question.

🔁

Self-check

Add a tool or step that flags weak/contradictory findings and re-researches them.

💾

Save the report

Write the final report to a Markdown file you can drop into your notes.
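
For the “Save the report” goal, a minimal sketch could look like this. The function names and the Markdown layout are assumptions for illustration; the three arguments mirror the question, briefs, and final_summary fields of the Report model from Level 4, passed as plain values so the snippet needs only the standard library.

```python
from pathlib import Path


def report_to_markdown(question: str, briefs: list[str], final_summary: str) -> str:
    """Render the three Report fields as a small Markdown document."""
    lines = [f"# {question}", "", "## Summary", "", final_summary, "", "## Briefs", ""]
    lines += [f"- {b}" for b in briefs]
    return "\n".join(lines) + "\n"


def save_report(path: str, question: str, briefs: list[str], final_summary: str) -> Path:
    """Write the rendered report to `path` as UTF-8 and return the path."""
    target = Path(path)
    target.write_text(report_to_markdown(question, briefs, final_summary), encoding="utf-8")
    return target
```

With the Level 4 pipeline, a call might be save_report("report.md", report.question, report.briefs, report.final_summary).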

Deliverable checklist

Aim to tick all six levels.

Level 1: skeleton agent returns a valid Brief (structured output works)
Level 2: a simple tool the model actually calls (e.g. today())
Level 3: a real tool that fetches outside information (web search or an API)
Level 4: parallel research with asyncio.gather + a synthesizer (fan out, fan in)
Level 5: Logfire tracing wired in; I can see the call tree (observability)
Level 6: a re-runnable eval with at least 2 cases (measurable quality)
Every run has a UsageLimits request cap (free-tier safe)
I compared parallel vs. sequential timing and noted the difference

🚀 The takeaway

You vibe-coded a real, useful agent — from a one-shot skeleton to a parallel research pipeline — the same way professionals do: small steps, tools, tests, and a tight loop with the AI. That's the whole course in one project.
