Module 10 · ~55 min

Observability & evaluation

You cannot improve what you cannot measure, and agents fail in ways traditional monitoring will miss. LangSmith gives you traces (debugging) and evals (quality measurement) — the two things you need to ship and iterate on agents safely.

1. Tracing — free, automatic, essential

Set three environment variables and every LangChain/LangGraph call is automatically traced into LangSmith. Each "trace" is a tree of every LLM call, tool call, and node, with full inputs, outputs, token counts, latencies, and costs.

# In your .env or shell — no code changes needed
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2-...
LANGSMITH_PROJECT=jarvis-prod

# Optional: tag a specific run for findability
from langsmith import traceable

@traceable(name="weekly-digest", tags=["scheduled"])
def run_weekly_digest():
    ...

// In your .env or shell — no code changes needed
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2-...
LANGSMITH_PROJECT=jarvis-prod

import { traceable } from "langsmith/traceable";

const runWeeklyDigest = traceable(
  async () => { /* ... */ },
  { name: "weekly-digest", tags: ["scheduled"] },
);

Open the LangSmith UI. Every run shows the full agent tree: click any LLM call to see its exact prompt and response, replay a run, or fork it from any step. This single feature pays for the entire stack on day one.

If you take one thing from this module

Turn on tracing on day one. Then when something weird happens in production at 11pm, you don't have to guess.

2. Datasets — turning real traffic into a test set

The best test cases come from real users. LangSmith lets you build a dataset from production traces:

In the UI, filter traces (e.g. "all runs where the supervisor routed to hr").
Click "Add to dataset". Capture input + the ideal output (the actual output, edited if needed).
Repeat until you have 50–200 representative cases.

You can also build datasets programmatically:

from langsmith import Client

ls = Client()
ds = ls.create_dataset(dataset_name="jarvis-hr-cases", description="HR-routed user requests with expected answers.")
ls.create_examples(
    dataset_id=ds.id,
    inputs=[
        {"messages": [{"role": "user", "content": "How much leave do I have left?"}]},
        {"messages": [{"role": "user", "content": "What is parental leave policy?"}]},
    ],
    outputs=[
        {"expected_route": "hr", "expected_tool_called": "leave_balance"},
        {"expected_route": "hr", "expected_kb_cited": True},
    ],
)

import { Client } from "langsmith";

const ls = new Client();
const ds = await ls.createDataset("jarvis-hr-cases", { description: "HR routed cases." });
await ls.createExamples({
  datasetId: ds.id,
  inputs: [
    { messages: [{ role: "user", content: "How much leave do I have left?" }] },
  ],
  outputs: [
    { expected_route: "hr", expected_tool_called: "leave_balance" },
  ],
});

3. Evaluators — three kinds

An evaluator grades a run's output. Three common shapes:

Kind	How it grades	Best for
Heuristic	Hand-written code: "did the agent call the `open_it_ticket` tool?"	Verifiable facts, tool-use checks.
LLM-as-judge	Another LLM scores 0–1 against a rubric.	Quality, tone, helpfulness — anything subjective.
Human	You or an annotator labels in the UI.	Building the gold dataset; periodic spot-checks.

from langsmith.evaluation import evaluate

# Heuristic evaluator — pure code
def called_open_ticket(run, example) -> dict:
    tool_calls = [m for m in run.outputs["messages"] if m.get("tool_calls")]
    called = any(tc["name"] == "open_it_ticket" for m in tool_calls for tc in m["tool_calls"])
    return {"key": "called_open_ticket", "score": int(called)}

# LLM-as-judge evaluator
import json

from langchain_anthropic import ChatAnthropic

judge = ChatAnthropic(model="claude-haiku-4-5-20251001", temperature=0)

# Literal braces in the JSON example are doubled so str.format() leaves them alone.
JUDGE = """Score 0-1 on whether the assistant's final reply is HELPFUL and SAFE.
Question: {input}
Reply: {output}
Return JSON: {{"score": <0|1>, "reason": ""}}"""

def helpfulness(run, example):
    out = judge.invoke(JUDGE.format(
        input=example.inputs["messages"][-1]["content"],
        output=run.outputs["messages"][-1]["content"],
    ))
    r = json.loads(out.content)  # assumes the judge replies with bare JSON
    return {"key": "helpfulness", "score": r["score"], "comment": r["reason"]}

# Run the eval
results = evaluate(
    lambda inputs: jarvis.invoke(inputs),
    data="jarvis-hr-cases",
    evaluators=[called_open_ticket, helpfulness],
)

import { evaluate } from "langsmith/evaluation";
import { ChatAnthropic } from "@langchain/anthropic";

function calledOpenTicket(run: any) {
  const tcs = (run.outputs.messages ?? []).flatMap((m: any) => m.tool_calls ?? []);
  return { key: "called_open_ticket", score: tcs.some((tc: any) => tc.name === "open_it_ticket") ? 1 : 0 };
}

const judge = new ChatAnthropic({ model: "claude-haiku-4-5-20251001", temperature: 0 });

async function helpfulness(run: any, example: any) {
  const out = await judge.invoke(`Score 0-1 helpfulness. Q: ${example.inputs.messages.at(-1).content}
A: ${run.outputs.messages.at(-1).content}
Return JSON {"score": 0|1, "reason": "..."}`);
  const r = JSON.parse(out.content as string);
  return { key: "helpfulness", score: r.score, comment: r.reason };
}

await evaluate(async (inputs) => jarvis.invoke(inputs), {
  data: "jarvis-hr-cases",
  evaluators: [calledOpenTicket, helpfulness],
});
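The judge evaluators above parse the model's reply directly as JSON, which fails if the judge wraps its answer in Markdown fences or adds a sentence of prose. A minimal sketch of a more forgiving parser follows; `parse_judge_reply` is a hypothetical helper, not part of LangSmith, and returning `None` for an unusable score is an assumption about how you want to handle bad replies.

```python
import json
import re


def parse_judge_reply(text: str) -> dict:
    """Pull {"score", "reason"} out of a judge reply, tolerating fences and extra prose."""
    # Take the span from the first "{" to the last "}" and ignore anything around it.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return {"score": None, "reason": "no JSON object in judge reply"}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        return {"score": None, "reason": f"invalid JSON: {exc.msg}"}
    score = data.get("score")
    # Reject booleans, non-numbers, and values outside the 0-1 rubric.
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0 <= score <= 1:
        return {"score": None, "reason": f"score out of range: {score!r}"}
    return {"score": float(score), "reason": str(data.get("reason", ""))}
```

Inside `helpfulness`, you could call this on `out.content` and decide whether a `None` score should be skipped or counted as a failure.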

4. Offline vs. online evals

Offline evals — run before every deploy against your dataset. Catch regressions. Wire into CI: fail the build if helpfulness drops by > 5%.
Online evals — run continuously against a sample of production traffic. Detect drift, prompt-injection spikes, tool errors. Alert on regressions.

Both matter. Offline catches "did my change break the gold set?". Online catches "did the world change?".
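The CI gate for offline evals can be sketched in a few lines of plain Python. This assumes you have already pulled per-example helpfulness scores for the last released version and for the candidate, and it reads "drops by > 5%" as a relative drop in the mean score; `passes_gate` and `gate_exit_code` are hypothetical names, not LangSmith functions.

```python
def mean_score(scores: list[float]) -> float:
    """Average of per-example scores; refuses to average an empty run."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores) / len(scores)


def passes_gate(baseline: float, candidate: float, max_relative_drop: float = 0.05) -> bool:
    """True when the candidate's mean is within the allowed relative drop from baseline."""
    if baseline <= 0:
        # Nothing to regress from, so let the first run set the baseline.
        return True
    relative_drop = (baseline - candidate) / baseline
    return relative_drop <= max_relative_drop


def gate_exit_code(baseline_scores: list[float], candidate_scores: list[float],
                   max_relative_drop: float = 0.05) -> int:
    """0 if the build may proceed, 1 if helpfulness regressed too far."""
    ok = passes_gate(mean_score(baseline_scores), mean_score(candidate_scores), max_relative_drop)
    return 0 if ok else 1
```

A CI step would end with `sys.exit(gate_exit_code(baseline_scores, candidate_scores))`, so a non-zero exit fails the build.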

5. What to actually measure for an agent

Five families of metric every agent should track:

Family	Examples
Task completion	% of runs where the user goal was met; % needing human escalation.
Tool-use correctness	Right tool chosen? Args valid? Tool error rate?
Latency & cost	P50/P95 turn latency. Tokens/turn. $ per resolved task.
Quality	Helpfulness, tone, citation accuracy (judge or human).
Safety	Policy violations, prompt-injection caught, approval-bypass attempts.
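As a rough illustration of the first and third families, the sketch below computes completion rate, escalation rate, P50/P95 latency, and cost per resolved task from a list of run records. The `RunRecord` shape is an assumption (one record per run, exported however you like), and the percentile uses the simple nearest-rank method rather than whatever interpolation your dashboard applies.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    # Hypothetical export format: one record per agent run.
    latency_s: float
    cost_usd: float
    resolved: bool
    escalated: bool


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: list[RunRecord]) -> dict:
    """Aggregate task-completion, latency, and cost metrics over a batch of runs."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    resolved = sum(r.resolved for r in records)
    latencies = [r.latency_s for r in records]
    total_cost = sum(r.cost_usd for r in records)
    return {
        "completion_rate": resolved / n,
        "escalation_rate": sum(r.escalated for r in records) / n,
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        # Total spend divided by resolved runs, so failed runs still count as cost.
        "cost_per_resolved_usd": total_cost / resolved if resolved else None,
    }
```

Dividing total cost by resolved runs, rather than averaging cost over all runs, means failed attempts make each success look more expensive, which is usually the number you want to watch.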

For Jarvis specifically

Add a "routing accuracy" eval — did the supervisor send the request to the right specialist? It's the most common failure mode in supervisor systems and the cheapest to fix (usually a prompt tweak).
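A routing-accuracy check fits the heuristic evaluator shape used earlier in this module. The sketch below compares against the `expected_route` field from the dataset examples; it assumes the graph's final output exposes the chosen specialist under a `route` key, which is hypothetical and depends on how your supervisor records its decision.

```python
from types import SimpleNamespace


def routing_accuracy(run, example) -> dict:
    """Score 1 when the supervisor's chosen specialist matches the example's expected_route."""
    expected = (example.outputs or {}).get("expected_route")
    # ASSUMPTION: the graph output carries the chosen specialist under "route".
    actual = (run.outputs or {}).get("route")
    matched = expected is not None and actual == expected
    return {"key": "routing_accuracy", "score": int(matched)}


# Stand-in objects with the same .outputs attribute the evaluator reads.
demo_run = SimpleNamespace(outputs={"route": "hr"})
demo_example = SimpleNamespace(outputs={"expected_route": "hr"})
demo_result = routing_accuracy(demo_run, demo_example)
```

You would add `routing_accuracy` to the `evaluators` list alongside the other evaluators.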

6. Annotation queues — getting humans in the loop on evals

LangSmith has annotation queues: send a sample of production traces to a queue, your domain experts review them in the UI, and the labels feed back into datasets and evaluators. Use this to grow your gold dataset over time without writing scripts.
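One way to choose which production traces go to reviewers is deterministic sampling keyed on the trace ID, so the same trace is always in or out regardless of which worker makes the decision. This is a sketch of the selection step only; `should_sample` is a hypothetical helper, and how the selected traces reach the queue (UI rule or SDK call) is outside its scope.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces, keyed on the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Treat the first 8 bytes as an integer in [0, 2**64) and compare to the threshold.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


# Example: keep about 5% of a batch of trace IDs for human review.
trace_ids = [f"trace-{i}" for i in range(1000)]
for_review = [t for t in trace_ids if should_sample(t, 0.05)]
```

Because the decision depends only on the ID, re-running the job never sends a different subset of the same traces.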

7. Cost & latency dashboards

The Platform's monitoring tab plots LLM cost per project, per assistant, per user. Alerts fire when costs/latency spike. Common gotchas to watch:

Long threads quietly pushing more history into every turn: per-turn cost grows roughly linearly with thread length, so the total cost of a thread grows roughly quadratically.
A specialist getting stuck in a tool-call loop because of an error string the LLM keeps misreading.
A new prompt subtly making the supervisor pick the wrong specialist, so cases get re-routed (and double-billed).
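The first gotcha is easy to underestimate, so here is a back-of-the-envelope model. It assumes the full history is resent on every turn and that each turn adds a fixed number of tokens; it ignores output tokens, summarization, and prompt caching. The token counts are illustrative, not measured.

```python
def thread_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> list[int]:
    """Input tokens sent on each turn when the whole history is resent every time."""
    # Turn k carries the system prompt plus k turns' worth of accumulated messages.
    return [system_tokens + k * tokens_per_turn for k in range(1, turns + 1)]


short_thread = thread_input_tokens(turns=10, tokens_per_turn=500, system_tokens=1000)
long_thread = thread_input_tokens(turns=40, tokens_per_turn=500, system_tokens=1000)

# Per-turn input grows linearly: the last turn of the long thread sends 21,000 tokens.
last_turn_tokens = long_thread[-1]
# Totals grow much faster: 4x the turns costs 12x the input tokens here.
growth_factor = sum(long_thread) / sum(short_thread)
```

Trimming or summarizing history caps the per-turn term, which is why it shows up so clearly on a cost dashboard.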

8. Jarvis observed

★ Jarvis status

Every production turn is now: traced (debuggable), sampled for online evals, and weighed against a 200-case gold dataset on every deploy. When something regresses you know within minutes and you know which prompt/model/version caused it. Next: production hardening — making the whole thing actually safe and cheap at scale.

Quick check