Building reliable systems around models.
AI engineering is the work of turning foundation models into useful software: prompts, retrieval, tools, evals, inference, feedback loops, and production boundaries.
The model matters, but the system around the model usually decides whether the product is trustworthy, fast, affordable, and useful.
The shift: ML engineering to AI engineering
Classic ML engineering starts with training a model. AI engineering often starts with a powerful existing model, then builds the product scaffolding around it.
The daily work changes. You still care about data, tests, latency, cost, reliability, and monitoring. But now the uncertain component is a general-purpose model that can reason, hallucinate, call tools, ignore instructions, or behave differently after a model upgrade.
ML engineering
- Train or tune a model for one task.
- Optimize metrics on labeled datasets.
- Serve predictions through an API.
- Monitor drift and retrain when needed.
AI engineering
- Compose a model with prompts, tools, memory, and retrieval.
- Evaluate behavior across messy real requests.
- Control output shape, safety, latency, and cost.
- Monitor failures and close the feedback loop.
Foundation models
A foundation model is a general model trained on broad data, then adapted by prompting, tools, retrieval, or finetuning.
| Concept | Plain meaning |
|---|---|
| Token | A chunk of text the model reads and writes. |
| Context window | The number of tokens, input plus output, the model can consider at once. |
| Sampling | The model choosing the next token from likely options. |
| Temperature | A control for how varied or conservative the output is. |
| Structured output | Constraining the answer into JSON or another exact schema. |
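
To make sampling and temperature concrete, here is a minimal sketch of temperature-scaled sampling over a toy vocabulary. The vocabulary, the scores, and the function names are made up for illustration; real models do this over tens of thousands of tokens and often add other controls.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature, rng):
    """Pick one token at random, weighted by its temperature-scaled probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["reset", "restart", "reboot"]  # hypothetical three-token vocabulary
logits = [2.0, 1.0, 0.5]                # made-up scores for the next token

cold = softmax_with_temperature(logits, 0.2)  # conservative: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # varied: probabilities flatten out
print(cold)
print(hot)
print(sample_token(vocab, logits, 0.7, random.Random(0)))
```

At low temperature almost all the probability sits on the highest-scoring token; at high temperature the alternatives get a real chance of being picked.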
Evaluation: the actual hard part
If you cannot tell whether the AI system got better or worse, you do not have an engineering loop yet.
Evals are test suites for model behavior. They include inputs, expected facts, unacceptable behaviors, graders, human review, and production feedback. The goal is not perfect truth. The goal is to catch regressions before users do.
```json
{
  "input": "How do I reset my warehouse password?",
  "expected_facts": [
    "route through internal identity provider",
    "never expose raw database credentials",
    "link to approved runbook"
  ],
  "checks": [
    "factuality",
    "policy_compliance",
    "citation_quality",
    "tone"
  ]
}
```

- Use deterministic checks for schema, citations, and policy rules.
- Use model graders for fuzzy qualities like helpfulness or completeness.
- Keep a small golden set that humans trust.
- Add failed production examples back into the eval set.
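
The deterministic part of an eval can be sketched in a few lines. This is an assumption-heavy illustration, not a grading framework: the answer shape (`text` plus `citations`), the `forbidden` list, and the substring matching are all hypothetical, and substring checks are deliberately crude. Fuzzy qualities such as tone would still need a model grader or a human.

```python
import json

# Hypothetical eval case; the fact and leak strings are illustrative only.
CASE = {
    "input": "How do I reset my warehouse password?",
    "expected_facts": ["identity provider", "runbook"],
    "forbidden": ["db_password="],
}

def grade(case, raw_answer):
    """Run deterministic checks on one raw model answer; return {check: passed}."""
    try:
        answer = json.loads(raw_answer)
    except json.JSONDecodeError:
        return {"schema": False}
    schema_ok = (
        isinstance(answer, dict)
        and isinstance(answer.get("text"), str)
        and isinstance(answer.get("citations"), list)
    )
    if not schema_ok:
        return {"schema": False}
    text = answer["text"].lower()
    return {
        "schema": True,
        "facts": all(fact in text for fact in case["expected_facts"]),
        "policy": not any(bad in text for bad in case["forbidden"]),
        "citations": len(answer["citations"]) > 0,
    }

good = json.dumps({
    "text": "Reset it through the internal identity provider; see the approved runbook.",
    "citations": ["runbook-17"],
})
bad = json.dumps({"text": "Just use db_password=hunter2.", "citations": []})

print(grade(CASE, good))
print(grade(CASE, bad))
print(grade(CASE, "not json"))
```

Running the same cases before and after a prompt or model change is what turns this into a regression test.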
Prompt engineering, treated seriously
A prompt is not magic wording. It is the interface contract between your software and the model.
| Prompt part | Why it exists |
|---|---|
| Role | Sets the job the model is doing. |
| Task | States the actual work to complete. |
| Context | Gives data the model should use. |
| Constraints | Define what not to do. |
| Output format | Makes downstream parsing reliable. |
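
One way to treat the prompt as a contract is to build it from named parts instead of one long string. The sketch below assumes a hypothetical `PromptSpec` class and a simple section layout; the heading markers are an arbitrary choice, not a format any particular model requires.

```python
from dataclasses import dataclass, field

@dataclass
class PromptSpec:
    """Hypothetical container for the five prompt parts in the table above."""
    role: str
    task: str
    context: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    output_format: str = ""

    def render(self) -> str:
        """Join the non-empty parts in a fixed order so prompts are diffable."""
        sections = [
            ("Role", self.role),
            ("Task", self.task),
            ("Context", "\n".join(f"- {c}" for c in self.context)),
            ("Constraints", "\n".join(f"- {c}" for c in self.constraints)),
            ("Output format", self.output_format),
        ]
        return "\n\n".join(f"## {name}\n{body}" for name, body in sections if body)

spec = PromptSpec(
    role="You are an internal IT support assistant.",
    task="Explain how to reset a warehouse password.",
    context=["Password resets go through the internal identity provider."],
    constraints=["Never reveal raw database credentials."],
    output_format='JSON with keys "text" and "citations".',
)
prompt = spec.render()
print(prompt)
```

Keeping the parts separate makes it easier to version them, test them individually, and see exactly what changed when behavior shifts.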
RAG: retrieval-augmented generation
RAG gives the model relevant material at answer time instead of trying to bake all knowledge into model weights.
```text
user question
  -> query rewrite
  -> retrieve candidate chunks
  -> rerank
  -> build grounded prompt
  -> generate answer
  -> cite sources
  -> evaluate answer quality
```

The hard parts are chunking, metadata, retrieval quality, reranking, citation behavior, freshness, and knowing when no answer should be given. A weak retriever makes even a strong model look careless.
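
The retrieval and prompt-building steps can be sketched without any model or vector store. Everything here is a stand-in: the two documents, the stopword list, and the word-overlap score are hypothetical and far weaker than a real retriever and reranker. The point is the shape, including returning nothing when no source is relevant.

```python
import re

# Hypothetical document store: id -> text.
DOCS = {
    "runbook-17": "Reset warehouse passwords through the internal identity provider.",
    "faq-02": "The cafeteria opens at eight on weekdays.",
}
STOPWORDS = {"the", "a", "on", "at", "do", "i", "my", "how", "is", "what"}

def tokens(text):
    """Lowercase word set with a tiny stopword filter."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def retrieve(question, docs, min_overlap=2, top_k=2):
    """Score each document by shared words; keep the best ones above a threshold."""
    q = tokens(question)
    scored = [(len(q & tokens(body)), doc_id) for doc_id, body in docs.items()]
    hits = sorted((pair for pair in scored if pair[0] >= min_overlap), reverse=True)
    return [doc_id for _, doc_id in hits[:top_k]]

def build_grounded_prompt(question, docs):
    """Return a prompt that cites source ids, or None when nothing relevant was found."""
    hits = retrieve(question, docs)
    if not hits:
        return None  # better to decline than to answer without sources
    sources = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in hits)
    return (
        "Answer using only these sources and cite their ids.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("How do I reset my warehouse password?", DOCS))
print(build_grounded_prompt("What is the weather on Mars?", DOCS))
```

The second call returns `None`, which the calling code can turn into an honest "I don't know" instead of an ungrounded answer.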
Agents
An agent is a loop where a model can decide what to do next, often by calling tools and observing results.
Agentic systems are useful when the path is not fully known in advance: research, debugging, triage, multi-step operations, or workflow automation. They are risky when actions are expensive, irreversible, security-sensitive, or hard to verify.
| Agent control | Reason |
|---|---|
| Tool schemas | Limit what the model can request. |
| Budgets | Cap steps, tokens, time, and money. |
| Approval gates | Require human sign-off before sensitive actions. |
| State logs | Make reasoning and tool results inspectable. |
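
The four controls fit into one small loop. In this sketch the model is replaced by a scripted `policy` function, and the tool names, argument sets, and `approve` callback are all hypothetical; a real agent would call a model to choose each action.

```python
# Hypothetical tool registry: allowed argument names, approval flag, implementation.
TOOLS = {
    "lookup_order": {
        "args": {"order_id"}, "needs_approval": False,
        "run": lambda order_id: {"order_id": order_id, "status": "shipped"},
    },
    "refund_order": {
        "args": {"order_id"}, "needs_approval": True,
        "run": lambda order_id: {"order_id": order_id, "refunded": True},
    },
}

def run_agent(policy, approve, max_steps=5):
    """Loop: ask the policy for an action, check it, run it, log it."""
    log = []
    for step in range(max_steps):  # budget: hard cap on steps
        action = policy(log)
        if action["tool"] == "finish":
            log.append({"step": step, "event": "finish"})
            return log
        spec = TOOLS.get(action["tool"])
        if spec is None or set(action["args"]) != spec["args"]:
            # tool schema: unknown tools or wrong arguments are never executed
            log.append({"step": step, "event": "rejected", "action": action})
            continue
        if spec["needs_approval"] and not approve(action):
            # approval gate: sensitive actions need an explicit yes
            log.append({"step": step, "event": "denied", "action": action})
            continue
        result = spec["run"](**action["args"])
        log.append({"step": step, "event": "tool_result", "action": action, "result": result})
    log.append({"step": max_steps, "event": "budget_exhausted"})
    return log

def scripted_policy(log):
    """Stand-in for the model: follow a fixed plan, one action per log entry."""
    plan = [
        {"tool": "lookup_order", "args": {"order_id": "A1"}},
        {"tool": "refund_order", "args": {"order_id": "A1"}},
        {"tool": "finish", "args": {}},
    ]
    return plan[min(len(log), len(plan) - 1)]

trace = run_agent(scripted_policy, approve=lambda action: False)
print([entry["event"] for entry in trace])
```

Because every decision lands in the log, a denied refund or an exhausted budget can be inspected after the fact instead of guessed at.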
Finetuning and dataset engineering
Finetuning changes model behavior. Retrieval changes model context. Most teams need better context and evals before they need finetuning.
- Use prompting when behavior changes are small and easy to state.
- Use RAG when the model needs private or changing knowledge.
- Use finetuning when you need repeated style, format, or task behavior.
- Use distillation when a smaller model should imitate a larger one.
Dataset engineering is the unglamorous core: collect examples, label them consistently, remove leakage, balance edge cases, and keep train/eval splits honest.
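
Two of those chores, removing duplicates and keeping splits honest, can be sketched with a hash of the normalized input. The normalization rule and the 20 percent eval fraction are assumptions for illustration; near-duplicates that differ in wording would need stronger matching than this.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def split_of(text, eval_fraction=0.2):
    """Assign a split from a hash of the input, so it never changes between runs."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map to [0, 1)
    return "eval" if bucket < eval_fraction else "train"

def build_splits(examples, eval_fraction=0.2):
    """Drop exact duplicates after normalization, then route each example to one split."""
    seen = set()
    splits = {"train": [], "eval": []}
    for example in examples:
        key = normalize(example["input"])
        if key in seen:
            continue  # a duplicate could otherwise leak across train and eval
        seen.add(key)
        splits[split_of(example["input"], eval_fraction)].append(example)
    return splits

examples = [
    {"input": "Reset my password", "label": "identity"},
    {"input": "reset   my PASSWORD", "label": "identity"},  # duplicate after normalization
    {"input": "Where is my order?", "label": "orders"},
    {"input": "Cancel my subscription", "label": "billing"},
]
splits = build_splits(examples)
print({name: len(rows) for name, rows in splits.items()})
```

Hashing the content, instead of shuffling with a random seed, means an example stays in the same split even as the dataset grows.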
Inference: latency, cost, throughput
Production AI is constrained by waiting time and money. Every token costs something, and users feel every delay.
| Lever | What it improves |
|---|---|
| Smaller model | Lower latency and cost. |
| Shorter context | Faster calls and fewer tokens. |
| Streaming | Better perceived speed. |
| Caching | Cheaper repeated answers. |
| Batching | Higher throughput for offline jobs. |
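
Caching is the easiest of these levers to sketch. The class below is a hypothetical exact-match cache wrapped around any `call_model(model, prompt)` function; it assumes that returning the same answer for the same prompt is acceptable, which is not true for every product.

```python
import hashlib

class CachedModel:
    """Exact-match response cache in front of a model call (illustrative only)."""

    def __init__(self, call_model):
        self._call_model = call_model  # any function (model, prompt) -> str
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model, prompt):
        # Key on model name plus trimmed prompt so a model upgrade never reuses old answers.
        key = hashlib.sha256(f"{model}\n{prompt.strip()}".encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        answer = self._call_model(model, prompt)
        self._cache[key] = answer
        return answer

calls = []

def fake_model(model, prompt):
    """Stand-in for a real model call; records each invocation."""
    calls.append(prompt)
    return f"answer to: {prompt.strip()}"

cached = CachedModel(fake_model)
first = cached.complete("small-model", "What is RAG?")
second = cached.complete("small-model", "What is RAG?  ")  # same prompt after trimming
print(first == second, len(calls), cached.hits, cached.misses)
```

The hit and miss counters double as a cheap metric: a low hit rate suggests the cache is not earning its complexity.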
Architecting the whole system
A reliable AI product is a normal software system with a probabilistic component inside it.
- Receive the request and classify intent.
- Load user, policy, and product context.
- Retrieve documents or call tools if needed.
- Generate structured output.
- Validate output before showing or acting.
- Log enough detail to debug failures.
- Feed errors back into evals and product design.
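
The validate-and-log steps are where the probabilistic component meets ordinary code. This sketch assumes a hypothetical output contract (an `action` from a small allowed set plus a non-empty `message`) and a plain list as the trace log; a real system would use its own schema and logging stack.

```python
import json

ALLOWED_ACTIONS = {"reply", "escalate"}  # hypothetical action set

def validate_output(raw):
    """Return (ok, payload_or_error) for one raw model output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top level must be an object"
    if data.get("action") not in ALLOWED_ACTIONS:
        return False, "unknown action"
    if not isinstance(data.get("message"), str) or not data["message"].strip():
        return False, "message must be a non-empty string"
    return True, data

def handle(raw, trace_log):
    """Validate before acting, log either way, and fall back safely on failure."""
    ok, result = validate_output(raw)
    trace_log.append({"raw": raw, "ok": ok, "error": None if ok else result})
    if not ok:
        return {"action": "escalate", "message": "Sorry, I could not complete that. A person will follow up."}
    return result

trace_log = []
print(handle('{"action": "reply", "message": "Done."}', trace_log))
print(handle('{"action": "delete_all", "message": "x"}', trace_log))
print(handle("oops", trace_log))
print([entry["ok"] for entry in trace_log])
```

The logged raw outputs that failed validation are ready-made candidates for the eval set.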
The classic failure is launching a chatbot with no evals, no citations, no trace logs, and no way to reproduce bad answers. The demo feels alive. Production feels haunted. Build the feedback loop before the launch party.
Decision frameworks and mental models
| Question | Default answer |
|---|---|
| Do we need an agent? | Only if the steps cannot be known ahead of time. |
| Do we need finetuning? | Not until prompts, RAG, and evals are insufficient. |
| Do we need a vector database? | Only if retrieval quality and scale require it. |
| What should we test first? | The failures that would embarrass or harm users. |
| What should we monitor? | Cost, latency, refusal rate, tool errors, and eval regressions. |