LLMs in production, not just demos
OpenAI, Anthropic Claude, and open-source LLMs wired into your app with RAG, structured outputs, evals, and the discipline that keeps it cheap and reliable at scale.
A demo with GPT-4 takes an afternoon. An LLM feature that doesn't hallucinate on edge cases, doesn't leak prompts, costs less than your hosting bill, and doesn't break when the model is deprecated — that's a real engineering project. That's the project I take.
60%
Average token cost reduction via model switching + caching
Promptfoo
Automated evals on every PR
<1s
TTFT (time to first token) targeted on streamed responses
0
Prompts leaked in production endpoints
Trusted by founders & teams in
Everything included in every engagement
No upsells. No surprise change orders. One scope, one price.
Model selection that fits the job
GPT-4o, Claude Sonnet, Haiku, Llama 3.1, Mistral — picked on cost, latency, and the actual task. Often Haiku or Llama 70B in production with GPT-4 reserved for retries.
RAG done right
Chunking strategy, embedding model selection, reranking, hybrid (BM25 + vector) search. Pinecone, pgvector, or Weaviate — picked by data size and ops capacity.
Structured outputs & function calling
Tool use, JSON schema enforcement, OpenAI structured outputs, Claude tool_use. No more 'parse the markdown the LLM hopefully returned'.
Prompt injection & safety
Input sanitisation, output filtering, rate limiting per user, abuse detection. Your LLM endpoint isn't a back door to your prod database.
Evals & regression testing
Golden dataset, automated eval suite, A/B between models on every PR. You upgrade the model only when the evals say it's safe.
Streaming + token cost optimisation
Server-Sent Events for token-by-token streaming, prompt caching, context-window discipline. Bills that don't surprise you.
The tools I actually use in production
Modern, battle-tested, and chosen for fit — not hype.
Models
- GPT-4o
- Claude 3.5
- Llama 3.1
- Mistral
RAG
- pgvector
- Pinecone
- Weaviate
- Cohere Rerank
Orchestration
- LangChain
- LlamaIndex
- Vercel AI SDK
- Inngest
Quality
- Promptfoo
- LangSmith
- Braintrust
- Helicone
How we'll work together
Predictable, written-down, no surprises.
- 01
Feature scoping
Where does the LLM actually help vs hurt? Some 'AI features' should not exist. We answer that first.
- 02
Prototype + eval set
Working prototype + a golden dataset to measure quality. You can compare models objectively from day one.
- 03
Productionise
Streaming, retries, fallback model, cost budget, observability — the boring stuff that makes the demo a product.
- 04
Ship + monitor
Cost dashboards, eval dashboards, prompt versioning. New model? Re-run evals, deploy if green.
Pricing that matches the work
Starting prices. Final quote in writing after a 30-minute scoping call.
Prototype
Validating one AI feature
$2,500starting
- Single feature, single model
- Golden dataset + basic eval
- Delivered in 1–2 weeks
Production AI
Shipping AI to real users
$8,500starting
- RAG + structured outputs
- Streaming, retries, fallback model
- Cost + eval dashboards
Retainer
Ongoing LLM evolution
$3,000/mostarting
- Model migrations + evals
- Prompt iteration
- Cost watch + optimisation
Questions I get asked first
OpenAI or Anthropic?+
Depends on the task. Claude is currently stronger at long-context reasoning and tool use; GPT-4o at multimodal and tight latency. I'll benchmark both on your golden dataset.
Can you build a ChatGPT for our docs?+
Yes — that's a classic RAG project. Embedding pipeline + vector store + grounded retrieval + citations in the UI so users know where answers come from.
How do you control costs?+
Smaller model by default, GPT-4 / Claude Opus only on retry. Prompt caching, response caching where safe, streaming so you bail early. Monthly budget alerts.
What about open-source / self-hosted LLMs?+
Yes — Llama 3.1, Mistral, Qwen via Together AI, Groq, or self-hosted on AWS. Right when privacy, cost, or compliance demands it. Often slower to integrate than OpenAI/Anthropic, so we measure tradeoffs honestly.
Let's scope your project
Tell me what you're building. I'll reply with a written estimate within 24 hours — no sales call required.
Related services
Often paired with ai integration.
API Development
Well-versioned, well-documented REST or GraphQL APIs with auth, rate limiting, and webhooks. Built to be consumed by partners and customers — not only your own frontend.
Backend Development
Typed Node.js and NestJS APIs with PostgreSQL or MongoDB, Redis caching, structured logs, and the boring discipline that keeps p95 latency under 100ms.
Web Development
From the database schema to the deployed Next.js frontend, I ship modern web apps designed to rank, convert, and scale. One engineer, full ownership.
SaaS Development
End-to-end SaaS builds with Stripe billing, multi-tenant auth, role-based access, onboarding flows, and admin dashboards — built to take real paying customers.