Observability & evaluation
You cannot improve what you cannot measure, and agents fail in ways traditional monitoring will miss. LangSmith gives you traces (debugging) and evals (quality measurement) — the two things you need to ship and iterate on agents safely.
1. Tracing — free, automatic, essential
Set three environment variables and every LangChain/LangGraph call is automatically traced into LangSmith. Each "trace" is a tree of every LLM call, tool call, and node, with full inputs, outputs, token counts, latencies, and costs.
# In your .env or shell — no code changes needed
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2-...
LANGSMITH_PROJECT=jarvis-prod
# Optional: tag a specific run for findability
from langsmith import traceable
@traceable(name="weekly-digest", tags=["scheduled"])
def run_weekly_digest():
    ...
import { traceable } from "langsmith/traceable";
const runWeeklyDigest = traceable(
  async () => { /* ... */ },
  { name: "weekly-digest", tags: ["scheduled"] },
);
Open the LangSmith UI: every run shows the full agent tree. Click any LLM call to see its exact prompt and response, replay a run, or fork it from any step. This single feature pays for the entire stack on day one.
Turn on tracing on day one. Then when something weird happens in production at 11pm, you don't have to guess.
2. Datasets — turning real traffic into a test set
The best test cases come from real users. LangSmith lets you build a dataset from production traces:
- In the UI, filter traces (e.g. "all runs where the supervisor routed to hr").
- Click "Add to dataset". Capture input + the ideal output (the actual output, edited if needed).
- Repeat until you have 50–200 representative cases.
You can also build datasets programmatically:
from langsmith import Client

ls = Client()
ds = ls.create_dataset(
    dataset_name="jarvis-hr-cases",
    description="HR-routed user requests with expected answers.",
)
ls.create_examples(
    dataset_id=ds.id,
    inputs=[
        {"messages": [{"role": "user", "content": "How much leave do I have left?"}]},
        {"messages": [{"role": "user", "content": "What is parental leave policy?"}]},
    ],
    outputs=[
        {"expected_route": "hr", "expected_tool_called": "leave_balance"},
        {"expected_route": "hr", "expected_kb_cited": True},
    ],
)
import { Client } from "langsmith";

const ls = new Client();
const ds = await ls.createDataset("jarvis-hr-cases", { description: "HR routed cases." });
await ls.createExamples({
  datasetId: ds.id,
  inputs: [
    { messages: [{ role: "user", content: "How much leave do I have left?" }] },
  ],
  outputs: [
    { expected_route: "hr", expected_tool_called: "leave_balance" },
  ],
});
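As a rough sketch of the UI workflow above done in code, the helper below turns trace-like records into the parallel `inputs` and `outputs` lists that `create_examples` takes. The record shape (`input_messages`, `route`, `tools_called`) and the helper name are assumptions made for illustration, not a LangSmith export format; in practice you would map whatever fields your traces actually carry.

```python
from typing import Any


def traces_to_examples(
    traces: list[dict[str, Any]], route: str
) -> tuple[list[dict], list[dict]]:
    """Build parallel inputs/outputs lists from trace-like records.

    Each record is assumed (hypothetically) to look like:
    {"input_messages": [...], "route": "hr", "tools_called": ["leave_balance"]}
    """
    inputs: list[dict] = []
    outputs: list[dict] = []
    seen: set[str] = set()
    for trace in traces:
        if trace.get("route") != route:
            continue  # keep only runs the supervisor sent to this specialist
        messages = trace["input_messages"]
        question = messages[-1]["content"].strip().lower()
        if question in seen:
            continue  # skip repeated questions so one case is not over-represented
        seen.add(question)
        inputs.append({"messages": messages})
        expected: dict[str, Any] = {"expected_route": route}
        tools = trace.get("tools_called", [])
        if tools:
            expected["expected_tool_called"] = tools[0]
        outputs.append(expected)
    return inputs, outputs
```

The two returned lists line up index by index, so they can be passed straight to `create_examples` as `inputs=` and `outputs=`. Expected outputs copied from real runs still deserve a human pass before they count as gold.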
3. Evaluators — three kinds
An evaluator grades a run's output. Three common shapes:
| Kind | How it grades | Best for |
|---|---|---|
| Heuristic | Hand-written code: "did the agent call the open_it_ticket tool?" | Verifiable facts, tool-use checks. |
| LLM-as-judge | Another LLM scores 0–1 against a rubric. | Quality, tone, helpfulness — anything subjective. |
| Human | You or an annotator labels in the UI. | Building the gold dataset; periodic spot-checks. |
import json

from langchain_anthropic import ChatAnthropic
from langsmith.evaluation import evaluate

# Heuristic evaluator: pure code
def called_open_ticket(run, example) -> dict:
    # Assumes the messages in run.outputs are dicts; keep those carrying tool calls.
    tool_calls = [m for m in run.outputs["messages"] if m.get("tool_calls")]
    called = any(tc["name"] == "open_it_ticket" for m in tool_calls for tc in m["tool_calls"])
    return {"key": "called_open_ticket", "score": int(called)}

# LLM-as-judge evaluator
judge = ChatAnthropic(model="claude-haiku-4-5-20251001", temperature=0)

# Literal JSON braces are doubled so str.format() leaves them alone.
JUDGE = """Score 0-1 on whether the assistant's final reply is HELPFUL and SAFE.
Question: {input}
Reply: {output}
Return JSON: {{"score": <0|1>, "reason": ""}}"""

def helpfulness(run, example):
    out = judge.invoke(JUDGE.format(
        input=example.inputs["messages"][-1]["content"],
        output=run.outputs["messages"][-1]["content"],
    ))
    r = json.loads(out.content)
    return {"key": "helpfulness", "score": r["score"], "comment": r["reason"]}

# Run the eval
results = evaluate(
    lambda inputs: jarvis.invoke(inputs),
    data="jarvis-hr-cases",
    evaluators=[called_open_ticket, helpfulness],
)
import { ChatAnthropic } from "@langchain/anthropic";
import { evaluate } from "langsmith/evaluation";

// Heuristic evaluator: pure code
function calledOpenTicket(run: any) {
  const tcs = (run.outputs.messages ?? []).flatMap((m: any) => m.tool_calls ?? []);
  return { key: "called_open_ticket", score: tcs.some((tc: any) => tc.name === "open_it_ticket") ? 1 : 0 };
}

// LLM-as-judge evaluator
const judge = new ChatAnthropic({ model: "claude-haiku-4-5-20251001", temperature: 0 });

async function helpfulness(run: any, example: any) {
  const out = await judge.invoke(`Score 0-1 helpfulness. Q: ${example.inputs.messages.at(-1).content}
A: ${run.outputs.messages.at(-1).content}
Return JSON {"score": 0|1, "reason": "..."}`);
  const r = JSON.parse(out.content as string);
  return { key: "helpfulness", score: r.score, comment: r.reason };
}

// Run the eval
await evaluate(async (inputs) => jarvis.invoke(inputs), {
  data: "jarvis-hr-cases",
  evaluators: [calledOpenTicket, helpfulness],
});
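Both `helpfulness` evaluators above parse the judge's reply directly as JSON, which raises if the model wraps the object in prose. A minimal sketch of a more defensive parser follows. The name `parse_judge_reply` and the choice to return a `None` score with an explanatory reason on failure are assumptions for illustration, not part of LangSmith; you might prefer to raise or retry instead.

```python
import json
import re


def parse_judge_reply(text: str) -> dict:
    """Extract {"score", "reason"} from a judge reply, tolerating extra text."""
    # Take everything from the first "{" to the last "}" so surrounding prose is ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return {"score": None, "reason": "judge reply contained no JSON object"}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"score": None, "reason": "judge reply was not valid JSON"}
    score = data.get("score")
    if score not in (0, 1):
        return {"score": None, "reason": f"unexpected score: {score!r}"}
    return {"score": int(score), "reason": str(data.get("reason", ""))}
```

Inside the Python `helpfulness` evaluator, this would stand in for the bare `json.loads(out.content)` call, so one malformed judge reply does not abort the whole eval run.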
4. Offline vs. online evals
- Offline evals — run before every deploy against your dataset. Catch regressions. Wire into CI: fail the build if helpfulness drops by > 5%.
- Online evals — run continuously against a sample of production traffic. Detect drift, prompt-injection spikes, tool errors. Alert on regressions.
Both matter. Offline catches "did my change break the gold set?". Online catches "did the world change?".
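One way to wire the offline check into CI, sketched under assumptions: given the per-example helpfulness scores from the current run and a stored baseline mean, the gate passes only if the mean has not fallen by more than 5% relative to that baseline. The function name, the relative (rather than percentage-point) reading of "5%", and where the scores come from are all assumptions here.

```python
from statistics import mean


def regression_gate(
    scores: list[float], baseline_mean: float, max_relative_drop: float = 0.05
) -> bool:
    """Return True if the current mean score is within the allowed drop."""
    if not scores:
        return False  # an empty eval run should never pass silently
    if baseline_mean <= 0:
        return True  # no meaningful baseline to regress from
    drop = (baseline_mean - mean(scores)) / baseline_mean
    return drop <= max_relative_drop


# Hypothetical numbers; in CI these would come from your eval results.
current_scores = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print("helpfulness gate:", "pass" if regression_gate(current_scores, 0.82) else "FAIL")
```

In a CI step you would exit with a non-zero status when the gate returns `False`, which is what fails the build.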
5. What to actually measure for an agent
Five families of metric every agent should track:
| Family | Examples |
|---|---|
| Task completion | % of runs where the user goal was met; % needing human escalation. |
| Tool-use correctness | Right tool chosen? Args valid? Tool error rate? |
| Latency & cost | P50/P95 turn latency. Tokens/turn. $ per resolved task. |
| Quality | Helpfulness, tone, citation accuracy (judge or human). |
| Safety | Policy violations, prompt-injection caught, approval-bypass attempts. |
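To make a few of these families concrete, here is a small sketch that computes completion rate, escalation rate, P50/P95 latency, tokens per turn, and dollars per resolved task from a list of run records. The `RunRecord` fields are hypothetical; they stand in for whatever you export from your traces, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    # Hypothetical per-run fields, e.g. exported from your traces.
    latency_s: float
    tokens: int
    cost_usd: float
    resolved: bool
    escalated: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    if not runs:
        raise ValueError("need at least one run to summarize")
    n = len(runs)
    resolved = sum(r.resolved for r in runs)
    latencies = [r.latency_s for r in runs]
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "completion_rate": resolved / n,
        "escalation_rate": sum(r.escalated for r in runs) / n,
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "tokens_per_turn": sum(r.tokens for r in runs) / n,
        # All spend is charged to resolved tasks; infinite if nothing was resolved.
        "usd_per_resolved_task": total_cost / resolved if resolved else float("inf"),
    }
```

Dividing total spend by resolved tasks, rather than by all runs, makes failed and escalated runs show up as a higher price for each success.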
Add a "routing accuracy" eval — did the supervisor send the request to the right specialist? It's the most common failure mode in supervisor systems and the cheapest to fix (usually a prompt tweak).
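A routing-accuracy evaluator can be a few lines of heuristic code. The sketch below compares the specialist the supervisor picked with the `expected_route` field used in the dataset examples earlier. It assumes, hypothetically, that your graph exposes the chosen specialist under a `route` key in its outputs; adjust that to wherever your supervisor records its decision.

```python
from types import SimpleNamespace


def routing_accuracy(run, example) -> dict:
    """Score 1 when the supervisor's route matches the dataset's expected_route."""
    expected = example.outputs.get("expected_route")
    # ASSUMPTION: the graph writes the chosen specialist to outputs["route"].
    actual = (run.outputs or {}).get("route")
    return {
        "key": "routing_accuracy",
        "score": int(expected is not None and actual == expected),
        "comment": f"expected={expected!r} actual={actual!r}",
    }


# Stand-in objects exposing the same .outputs attribute the evaluators above read.
run = SimpleNamespace(outputs={"route": "it", "messages": []})
example = SimpleNamespace(outputs={"expected_route": "hr"})
print(routing_accuracy(run, example))
```

It takes the same `(run, example)` arguments as the other evaluators, so it can be added to the `evaluators` list alongside them.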
6. Annotation queues — getting humans in the loop on evals
LangSmith has annotation queues: send a sample of production traces to a queue, your domain experts review them in the UI, and the labels feed back into datasets and evaluators. Use this to grow your gold dataset over time without writing scripts.
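Choosing which traces go to reviewers is a sampling decision. One simple approach, sketched here as an assumption rather than a LangSmith feature, is to hash each run ID into a number between 0 and 1 and keep the runs that fall below the sampling rate. The sample is then stable across reruns, and the same function can pick the sample for online evals. Sending the selected runs to a queue is not shown.

```python
import hashlib


def in_review_sample(run_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of runs, keyed on the run ID."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


run_ids = [f"run-{i}" for i in range(10_000)]  # hypothetical IDs
picked = [r for r in run_ids if in_review_sample(r, rate=0.02)]
print(len(picked))  # roughly 2% of the runs, and the same runs on every execution
```

Hashing instead of calling a random number generator means a given run is either always in the sample or never in it, which keeps reviewer workloads and dashboards consistent.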
7. Cost & latency dashboards
The Platform's monitoring tab plots LLM cost per project, per assistant, per user. Alerts fire when costs/latency spike. Common gotchas to watch:
- Long threads quietly pushing more history into every turn: per-turn cost grows linearly with thread length, so the cumulative cost of a thread grows roughly quadratically.
- A specialist getting stuck in a tool-call loop because of an error string the LLM keeps misreading.
- A new prompt subtly making the supervisor pick the wrong specialist, so cases get re-routed (and double-billed).
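The first gotcha is easy to underestimate, so here is a back-of-the-envelope sketch. It assumes the full history is resent on every turn, that each exchange adds a fixed number of tokens, and that there is a fixed system prompt; the figures (500 tokens per exchange, 1,000 system tokens, 40 turns) are made up for illustration.

```python
def input_tokens_per_turn(
    turns: int, tokens_per_exchange: int, system_tokens: int = 0
) -> list[int]:
    """Input tokens sent on each turn when the whole history is resent every time."""
    return [system_tokens + tokens_per_exchange * k for k in range(1, turns + 1)]


per_turn = input_tokens_per_turn(turns=40, tokens_per_exchange=500, system_tokens=1_000)
print(per_turn[0], per_turn[-1])  # 1500 21000: the last turn costs 14x the first
print(sum(per_turn))              # 450000 input tokens over the whole thread
```

Under these assumptions the last turn alone sends 14 times the input of the first, which is why trimming or summarizing history matters on long threads.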
8. Jarvis observed
Every production turn is now traced (debuggable) and sampled for online evals, and every deploy is evaluated against a 200-case gold dataset. When something regresses, you know within minutes, and you know which prompt, model, or version caused it. Next: production hardening, making the whole thing actually safe and cheap at scale.
Quick check
1. What does it take to start tracing every LangGraph run into LangSmith?
2. Best evaluator type for "is the reply helpful and in the right tone"?
3. Why do both offline and online evals?
4. Single most useful metric to track for a supervisor system?