AI agents that do the work, not just chat
Tool-using, multi-step AI agents wired into your real systems — with the evals, guardrails, and human-in-the-loop checkpoints that make autonomy safe in production.
An agent demo that books a flight in a video is easy. An agent that reliably runs a real workflow against your real data — without going off the rails, leaking secrets, or burning your token budget — is an engineering problem. That's the one I solve.
Eval-gated
No agent change ships without passing the golden task set
Human-in-loop
Approval checkpoints on every high-risk action
60%
Typical token-cost cut via model routing + caching
Full traces
Every agent run logged and replayable
Everything included in every engagement
No upsells. No surprise change orders. One scope, one price.
Agent architecture & orchestration
Planner-executor, tool-calling, and multi-agent patterns built on LangGraph, the OpenAI Agents SDK, or the Claude Agent SDK — chosen for your task, not for hype.
Real tool & system integration
Agents that actually do things: call your APIs, query your database, hit Slack, send email, update a CRM. Each tool is typed, permissioned, and audited.
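As a rough illustration of what "typed, permissioned, and audited" can mean in practice, here is a minimal sketch of a tool registry. Everything in it is an assumption made for the example: the `Tool` and `ToolRegistry` classes, the scope strings such as `crm:read`, and the two stub tools are hypothetical, not part of any particular framework.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    func: Callable[..., Any]
    required_scope: str          # hypothetical scope string, e.g. "crm:write"
    arg_types: dict[str, type]   # expected argument names and their types


@dataclass
class ToolRegistry:
    granted_scopes: set[str]
    tools: dict[str, Tool] = field(default_factory=dict)
    audit_log: list[dict[str, Any]] = field(default_factory=list)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        tool = self.tools[name]
        # Audited: log every attempt up front, so denied calls are recorded too.
        entry = {"tool": name, "args": kwargs, "ts": time.time(), "status": "denied"}
        self.audit_log.append(entry)
        # Permissioned: the agent must hold the scope the tool requires.
        if tool.required_scope not in self.granted_scopes:
            raise PermissionError(f"{name} requires scope {tool.required_scope!r}")
        # Typed: reject missing, extra, or wrongly typed arguments.
        if set(kwargs) != set(tool.arg_types) or not all(
            isinstance(kwargs[arg], expected) for arg, expected in tool.arg_types.items()
        ):
            entry["status"] = "invalid_args"
            raise TypeError(f"{name} expects arguments {tool.arg_types}")
        entry["status"] = "error"  # overwritten below if the tool returns normally
        result = tool.func(**kwargs)
        entry["status"] = "ok"
        return result


# Stub tools standing in for real CRM calls.
registry = ToolRegistry(granted_scopes={"crm:read"})
registry.register(Tool("lookup_contact", lambda email: {"email": email}, "crm:read", {"email": str}))
registry.register(Tool("delete_contact", lambda email: None, "crm:write", {"email": str}))
```

With this setup, `registry.call("lookup_contact", email="ada@example.com")` succeeds, while `registry.call("delete_contact", ...)` raises `PermissionError` because the agent was never granted `crm:write`. Both attempts end up in `audit_log`.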
Guardrails & human-in-the-loop
Approval checkpoints on risky actions, scoped permissions, input/output filtering, and prompt-injection defenses so an agent can't be talked into deleting prod.
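An approval checkpoint can be sketched in a few lines. This is a simplified illustration under stated assumptions: the action names in `HIGH_RISK_ACTIONS` are hypothetical, and `approve` stands in for whatever channel a human actually uses to say yes or no (a Slack button, a dashboard, a ticket).

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical action names; a real list would come from your own risk review.
HIGH_RISK_ACTIONS = {"delete_record", "send_email", "issue_refund"}


@dataclass(frozen=True)
class Action:
    name: str
    payload: dict[str, Any]


def run_with_checkpoint(
    action: Action,
    execute: Callable[[Action], Any],
    approve: Callable[[Action], bool],
) -> Any:
    """Run low-risk actions directly; hold high-risk ones for a human decision."""
    if action.name in HIGH_RISK_ACTIONS and not approve(action):
        raise PermissionError(f"{action.name} was not approved")
    return execute(action)


# Stubs for illustration: the executor echoes the action, the approver says no.
def execute_stub(action: Action) -> str:
    return f"executed {action.name}"


def deny_all(action: Action) -> bool:
    return False
```

The point of the design is that the gate sits in code, outside the model: no prompt, however persuasive, can skip the `approve` call for an action on the high-risk list.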
Memory & retrieval
Short-term context management plus long-term memory and RAG so the agent remembers what matters and grounds its actions in your real knowledge base.
Evals & observability
A golden task set, automated evals on every change, full trace logging, and cost dashboards. You only ship a new prompt or model when the evals stay green.
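To make the idea of an eval gate concrete, here is a minimal sketch, assuming the agent can be called as a plain function from prompt to output. The `GoldenTask` type, the two sample tasks, and the stub agents are all hypothetical; a real golden set would be built from your own workflows and checked with richer graders.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Agent = Callable[[str], str]


@dataclass(frozen=True)
class GoldenTask:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def pass_rate(agent: Agent, tasks: Sequence[GoldenTask]) -> float:
    if not tasks:
        raise ValueError("golden task set is empty")
    return sum(task.check(agent(task.prompt)) for task in tasks) / len(tasks)


def eval_gate(agent: Agent, tasks: Sequence[GoldenTask], threshold: float = 1.0) -> bool:
    """True when the candidate agent may ship."""
    return pass_rate(agent, tasks) >= threshold


# Hypothetical golden tasks and stub agents, for illustration only.
GOLDEN_TASKS = [
    GoldenTask("Refund order 1001", lambda out: "refund" in out.lower()),
    GoldenTask("What is 2 + 2?", lambda out: "4" in out),
]


def current_agent(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "Refund issued"


def regressed_agent(prompt: str) -> str:
    return "I cannot help with that"
```

In CI, a script could exit non-zero when `eval_gate` returns `False`, which blocks the merge for a prompt or model change that regresses the golden set.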
Cost & latency control
Smaller models for routine steps, strong models reserved for hard reasoning, caching, and step limits — so a runaway loop can't quietly cost you hundreds of dollars.
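The three levers named here (routing, caching, step limits) can be sketched together. This is an illustration only: the model names are placeholders rather than real model IDs, `call_model` is a stub for a provider call, and the exact-match cache shown is only appropriate for steps where the same prompt should always give the same answer.

```python
from functools import lru_cache
from typing import Callable, Optional

CHEAP_MODEL = "small-model"    # placeholder names, not real model IDs
STRONG_MODEL = "strong-model"
HARD_STEP_KINDS = {"plan", "reason"}
MAX_STEPS = 8

upstream_calls: list[tuple[str, str]] = []  # calls that were not served from cache


def call_model(model: str, prompt: str) -> str:
    """Stub for a real provider call."""
    upstream_calls.append((model, prompt))
    return f"{model}:{prompt}"


@lru_cache(maxsize=1024)
def cached_call(model: str, prompt: str) -> str:
    # Exact-match cache: identical (model, prompt) pairs are paid for only once.
    return call_model(model, prompt)


def pick_model(step_kind: str) -> str:
    # Route hard reasoning to the strong model, everything else to the cheap one.
    return STRONG_MODEL if step_kind in HARD_STEP_KINDS else CHEAP_MODEL


class StepLimitExceeded(RuntimeError):
    pass


Step = tuple[str, str]  # (step kind, prompt)


def run(next_step: Callable[[list[str]], Optional[Step]], max_steps: int = MAX_STEPS) -> list[str]:
    outputs: list[str] = []
    for _ in range(max_steps):
        step = next_step(outputs)
        if step is None:  # the agent decided it is done
            return outputs
        kind, prompt = step
        outputs.append(cached_call(pick_model(kind), prompt))
    # A loop that never finishes is cut off here instead of running up a bill.
    raise StepLimitExceeded(f"stopped after {max_steps} steps")
```

Here `next_step` stands in for the agent's planner: it sees the outputs so far and returns the next step, or `None` when finished. A planner stuck in a loop hits `StepLimitExceeded` after `max_steps` iterations, and repeated identical prompts never reach the provider twice.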
The tools I actually use in production
Modern, battle-tested, and chosen for fit — not hype.
Frameworks
- LangGraph
- OpenAI Agents SDK
- Claude Agent SDK
- Vercel AI SDK
Models
- GPT-4o
- Claude
- Llama 3.1
- Mistral
Memory/RAG
- pgvector
- Pinecone
- Redis
- Cohere Rerank
Ops
- LangSmith
- Promptfoo
- Helicone
- Inngest
How we'll work together
Predictable, written-down, no surprises.
- 01 Scope the workflow
  Map the task the agent should own, where autonomy helps vs. hurts, and which steps need a human checkpoint. Some 'agents' should just be a script.
- 02 Prototype + evals
  A working agent against a golden task set so quality and cost are measurable from day one.
- 03 Harden
  Guardrails, permissions, retries, fallbacks, step limits, and prompt-injection defenses: the work that separates a demo from production.
- 04 Ship + monitor
  Trace and cost dashboards, eval gates in CI, and prompt versioning so the agent stays reliable as models change.
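The retries and fallbacks mentioned in the Harden step can be sketched as one small helper. This is a simplified illustration, assuming each provider is a zero-argument callable; the function name `call_with_fallback` and its parameters are hypothetical, and a production version would likely retry only on specific error types.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(
    providers: Sequence[Callable[[], T]],
    retries: int = 2,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each provider in order, retrying with exponential backoff before moving on."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider()
            except Exception as exc:  # sketch only: narrow this in real code
                last_error = exc
                if attempt < retries:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all providers failed") from last_error
```

Passing `sleep` as a parameter keeps the helper testable: a test can record the requested delays instead of actually waiting.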
Pricing that matches the work
Starting prices. Final quote in writing after a 30-minute scoping call.
Agent Prototype
Validating one agent workflow
$3,500 starting
- Single workflow, 2–4 tools
- Golden task set + basic evals
- Delivered in 2–3 weeks
Production Agent
Shipping an agent to real users/ops
$11,000 starting
- Multi-step agent + real integrations
- Guardrails + human-in-the-loop
- Evals, tracing, cost dashboards
Retainer
Evolving agents over time
$3,500/mo starting
- New tools + workflows
- Model migrations + eval upkeep
- Cost + reliability monitoring
Me vs. an agency vs. hiring in-house
Three ways to get this built. Here's the honest comparison.
| | Solo Dev (me), best value: $80–$120/hr or fixed | Agency: $150–$300/hr blended | In-house hire: $80–$120K/yr + benefits |
|---|---|---|---|
| Start date | 1–2 weeks from quote | 4–8 weeks onboarding | 8–16 weeks to hire |
| Who writes the code | Senior dev — every single line | Junior assigned to your account | Whoever you manage to hire |
| Communication | Direct — you talk to who codes | Via account manager first | Direct, but management overhead |
| Flexibility | Scale up or down any time | Locked to contract length | Fixed headcount, hard to change |
| Code ownership | 100% yours, full handover docs | Depends on contract terms | Yours, but bus factor risk |
| Risk | Weekly demos, fixed scope | Scope creep & handoff gaps | Wrong hire = months lost |
Questions I get asked first
What's the difference between an AI agent and a chatbot?
A chatbot answers questions. An agent takes actions — it uses tools, queries systems, and completes multi-step tasks, often with limited human oversight. If you mainly need Q&A over your content, a RAG chatbot (see /services/ai-chatbot-development) is simpler and cheaper.
Are autonomous agents actually reliable enough for production?
For narrow, well-scoped workflows with guardrails and human checkpoints — yes. For open-ended 'do anything' autonomy — not yet, and I'll tell you so. The engineering is in scoping tightly, adding approval gates, and evaluating relentlessly.
How do you stop an agent from doing something harmful?
Scoped tool permissions, human approval on irreversible actions, input/output filtering, prompt-injection defenses, and hard step/cost limits. An agent should be incapable of the worst outcomes, not just discouraged from them.
OpenAI, Claude, or open-source for agents?
Claude and GPT-4o are both strong at tool use and reasoning; I benchmark on your task. Open-source (Llama, Mistral) when privacy or cost demands it. The orchestration layer is model-agnostic so you can switch as the frontier moves.
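One way to picture a model-agnostic orchestration layer is a small interface that every provider adapter implements. The sketch below is an assumption about how that could look, not a description of any SDK: `ChatModel`, `EchoModel`, and `summarize_ticket` are hypothetical names, and `EchoModel` is a stand-in for a real provider adapter.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the orchestration code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in adapter; a real one would wrap a provider's client."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


def summarize_ticket(model: ChatModel, ticket: str) -> str:
    # Workflow code never names a vendor, so swapping models is a one-line change.
    return model.complete(f"Summarize: {ticket}")


hosted = EchoModel("hosted-model")
self_hosted = EchoModel("open-source-model")
```

Calling `summarize_ticket(hosted, ...)` and `summarize_ticket(self_hosted, ...)` runs the same workflow against two backends, which is also what makes benchmarking several models on one task cheap to set up.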
Let's scope your project
Tell me what you're building. I'll reply with a written estimate within 24 hours — no sales call required.
Related services
Often paired with AI agent development.
AI Integration
OpenAI, Anthropic Claude, and open-source LLMs wired into your app with RAG, structured outputs, evals, and the discipline that keeps it cheap and reliable at scale.
AI Chatbot Development
Custom RAG assistants grounded in your docs and data, with citations, streaming, and guardrails. Not a generic widget — an assistant that answers from your knowledge, not the open internet.
Backend Development
Typed Node.js and NestJS APIs with PostgreSQL or MongoDB, Redis caching, structured logs, and the boring discipline that keeps p95 latency under 100ms.
API Development
Well-versioned, well-documented REST or GraphQL APIs with auth, rate limiting, and webhooks. Built to be consumed by partners and customers — not only your own frontend.