// AI_ENGINEERING.exe

Building reliable systems around models.

AI engineering is the work of turning foundation models into useful software: prompts, retrieval, tools, evals, inference, feedback loops, and production boundaries.

The model matters, but the system around the model usually decides whether the product is trustworthy, fast, affordable, and useful.

// TABLE_OF_CONTENTS

01. The shift
02. Foundation models
03. Evaluation
04. Prompts
05. RAG
06. Agents
07. Finetuning and data
08. Inference
09. System architecture
10. Decision frameworks

// SECTION_01

The shift: ML engineering to AI engineering

Classic ML engineering starts with training a model. AI engineering often starts with a powerful existing model, then builds the product scaffolding around it.

The daily work changes. You still care about data, tests, latency, cost, reliability, and monitoring. But now the uncertain component is a general-purpose model that can reason, hallucinate, call tools, ignore instructions, or behave differently after a model upgrade.

ML engineering

Train or tune a model for one task.
Optimize metrics on labeled datasets.
Serve predictions through an API.
Monitor drift and retrain when needed.

AI engineering

Compose a model with prompts, tools, memory, and retrieval.
Evaluate behavior across messy real requests.
Control output shape, safety, latency, and cost.
Monitor failures and close the feedback loop.

// SECTION_02

Foundation models

A foundation model is a general model trained on broad data, then adapted by prompting, tools, retrieval, or finetuning.

Concept	Plain meaning
Token	A chunk of text the model reads and writes.
Context window	The maximum number of tokens, input plus output, the model can consider at once.
Sampling	How the model picks each next token from a probability distribution over candidates.
Temperature	A control for how varied or conservative the output is.
Structured output	Constraining the answer into JSON or another exact schema.
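
To make sampling and temperature concrete, here is a minimal sketch of temperature-scaled sampling over a toy three-token vocabulary. The vocabulary, the logit values, and the function names are illustrative assumptions; real models sample over a much larger vocabulary and providers often apply extra filters on top of this.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature, rng):
    """Pick one token at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Toy values, assumed for illustration only.
vocab = ["yes", "no", "maybe"]
logits = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.2)  # nearly all mass on "yes"
hot = softmax_with_temperature(logits, 2.0)   # mass spread across all three
token = sample_token(vocab, logits, 1.0, random.Random(0))
```

At low temperature the top-scoring token dominates, which suits extraction and classification. Higher temperature trades consistency for variety.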

Bigger is not automatically better. Pick models by task: reasoning, latency, cost, context size, tool use, privacy, and failure tolerance.

// SECTION_03

Evaluation: the actual hard part

If you cannot tell whether the AI system got better or worse, you do not have an engineering loop yet.

Evals are test suites for model behavior. They include inputs, expected facts, unacceptable behaviors, graders, human review, and production feedback. The goal is not perfect truth. The goal is to catch regressions before users do.

{
  "input": "How do I reset my warehouse password?",
  "expected_facts": [
    "route through internal identity provider",
    "never expose raw database credentials",
    "link to approved runbook"
  ],
  "checks": [
    "factuality",
    "policy_compliance",
    "citation_quality",
    "tone"
  ]
}

Use deterministic checks for schema, citations, and policy rules.
Use model graders for fuzzy qualities like helpfulness or completeness.
Keep a small golden set that humans trust.
Add failed production examples back into the eval set.
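
The deterministic half of that list can be sketched in a few lines. This is a minimal example, not a full eval framework: `EvalCase`, `grade`, `run_suite`, and the `forbidden_phrases` field are hypothetical names, and literal phrase matching is a crude stand-in for checking the expected facts above.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    input: str
    required_phrases: list
    forbidden_phrases: list = field(default_factory=list)

def grade(case, answer):
    """Deterministic check: every required phrase present, no forbidden phrase present."""
    text = answer.lower()
    missing = [p for p in case.required_phrases if p.lower() not in text]
    violations = [p for p in case.forbidden_phrases if p.lower() in text]
    return {"passed": not missing and not violations,
            "missing": missing,
            "violations": violations}

def run_suite(cases, generate):
    """Run every case through `generate` (any callable str -> str) and report the pass rate."""
    results = [grade(case, generate(case.input)) for case in cases]
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

case = EvalCase(
    input="How do I reset my warehouse password?",
    required_phrases=["identity provider", "runbook"],
    forbidden_phrases=["db_password"],
)

good = "Reset it through the internal identity provider; see the approved runbook."
bad = "Just use db_password=hunter2 from the config file."

good_result = grade(case, good)
bad_result = grade(case, bad)
pass_rate, _ = run_suite([case], lambda question: good)  # stub model for the sketch
```

Tracking `pass_rate` per prompt or model version is what turns this into a regression signal. A model grader would slot in as another function with the same return shape.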

// SECTION_04

Prompt engineering, treated seriously

A prompt is not magic wording. It is the interface contract between your software and the model.

Prompt part	Why it exists
Role	Sets the job the model is doing.
Task	States the actual work to complete.
Context	Gives data the model should use.
Constraints	Define what the model must not do.
Output format	Makes downstream parsing reliable.

Prompts should live in version control, have tests, and change with the same care as code. If a prompt is important enough to affect product behavior, it is production logic.
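
One way to treat a prompt as production logic is to build it from a versioned template and pin its shape with a test. The sketch below assumes a hypothetical `build_prompt` helper, a made-up version label, and example wording; only the five prompt parts come from the table above.

```python
PROMPT_VERSION = "support-answer-v1"  # hypothetical label, bumped on every prompt change

TEMPLATE = (
    "Role: {role}\n"
    "Task: {task}\n"
    "Context:\n{context}\n"
    "Constraints:\n{constraints}\n"
    "Output format: {output_format}"
)

def build_prompt(task, context_docs, constraints):
    """Assemble the five prompt parts in a fixed order so downstream behavior is reproducible."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(context_docs, start=1))
    rules = "\n".join(f"- {rule}" for rule in constraints)
    return TEMPLATE.format(
        role="You are a support assistant for internal tools.",
        task=task,
        context=context,
        constraints=rules,
        output_format='JSON with keys "answer" and "sources"',
    )

def test_prompt_contract():
    """Fails if a section is dropped or reordered, like any other regression test."""
    prompt = build_prompt(
        task="Answer the user's question.",
        context_docs=["Passwords are reset via the identity provider.", "See runbook 7."],
        constraints=["Never reveal credentials.", "Say so when the context has no answer."],
    )
    headers = ["Role:", "Task:", "Context:", "Constraints:", "Output format:"]
    positions = [prompt.index(h) for h in headers]
    assert positions == sorted(positions)
    assert "[2] See runbook 7." in prompt
    assert "- Never reveal credentials." in prompt

test_prompt_contract()
```

Logging `PROMPT_VERSION` with every request makes it possible to tie a bad answer back to the exact prompt that produced it.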

// SECTION_05

RAG: retrieval-augmented generation

RAG gives the model relevant material at answer time instead of trying to bake all knowledge into model weights.

user question
  -> query rewrite
  -> retrieve candidate chunks
  -> rerank
  -> build grounded prompt
  -> generate answer
  -> cite sources
  -> evaluate answer quality

The hard parts are chunking, metadata, retrieval quality, reranking, citation behavior, freshness, and knowing when no answer should be given. A weak retriever makes even a strong model look careless.

Most RAG failures are retrieval failures disguised as generation failures. Inspect the retrieved chunks before blaming the model.
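
A toy version of the retrieve-then-ground steps shows where to look first. This sketch scores chunks by word overlap, which is an assumption made to keep it dependency-free; production systems typically use embeddings or BM25 plus a reranker. The chunk ids, texts, and `min_score` threshold are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase word set; deliberately simple for the sketch."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, top_k=2, min_score=0.2):
    """Score each chunk by the fraction of question words it contains; drop weak matches."""
    q = tokenize(question)
    scored = []
    for chunk in chunks:
        score = len(q & tokenize(chunk["text"])) / len(q) if q else 0.0
        if score >= min_score:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def build_grounded_prompt(question, hits):
    """Return None when nothing relevant was retrieved, so the caller can decline to answer."""
    if not hits:
        return None
    sources = "\n".join(f"[{chunk['id']}] {chunk['text']}" for _, chunk in hits)
    return (
        "Answer using only these sources and cite their ids.\n"
        f"{sources}\n"
        f"Question: {question}"
    )

chunks = [
    {"id": "runbook-7", "text": "Reset your warehouse password through the internal identity provider."},
    {"id": "hr-2", "text": "Vacation requests are approved by your manager."},
]

hits = retrieve("How do I reset my warehouse password?", chunks)
prompt = build_grounded_prompt("How do I reset my warehouse password?", hits)
no_answer = build_grounded_prompt(
    "What is the cafeteria menu today?",
    retrieve("What is the cafeteria menu today?", chunks),
)
```

Printing `hits` with their scores is the inspection step: if the right chunk is missing or ranked low there, no prompt change will fix the answer.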

// SECTION_06

Agents

An agent is a loop where a model can decide what to do next, often by calling tools and observing results.

Agentic systems are useful when the path is not fully known in advance: research, debugging, triage, multi-step operations, or workflow automation. They are risky when actions are expensive, irreversible, security-sensitive, or hard to verify.

Agent control	Reason
Tool schemas	Limit what the model can request.
Budgets	Cap steps, tokens, time, and money.
Approval gates	Require humans before sensitive actions.
State logs	Make reasoning and tool results inspectable.
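
The four controls in the table fit into a small loop. In this sketch a scripted `decide` function stands in for the model, and every tool name, argument, and log field is a hypothetical chosen for illustration; it shows the control flow, not a real agent framework.

```python
def run_agent(decide, tools, sensitive, approve, max_steps=5):
    """Agent loop with a tool allowlist, a step budget, an approval gate, and a state log."""
    log = []
    for step in range(max_steps):
        action = decide(log)  # the model would choose the next action here
        name, args = action["tool"], action.get("args", {})
        if name == "finish":
            log.append({"step": step, "event": "finish", "result": args.get("answer")})
            return log
        if name not in tools:  # tool schema: only registered tools may run
            log.append({"step": step, "event": "rejected", "tool": name})
            continue
        if name in sensitive and not approve(action):  # approval gate
            log.append({"step": step, "event": "denied", "tool": name})
            continue
        result = tools[name](**args)
        log.append({"step": step, "event": "tool_result", "tool": name, "result": result})
    log.append({"step": max_steps, "event": "budget_exhausted"})
    return log

tools = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "refund": lambda order_id: {"order_id": order_id, "refunded": True},
}

# Scripted stand-in for the model: one action per loop iteration.
script = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "refund", "args": {"order_id": "A1"}},
    {"tool": "finish", "args": {"answer": "Order A1 has shipped; refund needs human approval."}},
]

log = run_agent(
    decide=lambda history: script[len(history)],
    tools=tools,
    sensitive={"refund"},
    approve=lambda action: False,  # a human reviewer would decide here
)
events = [entry["event"] for entry in log]
```

Because every branch appends to `log`, a failed run can be replayed step by step. The same budget idea extends to tokens, time, and money.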

// SECTION_07

Finetuning and dataset engineering

Finetuning changes model behavior. Retrieval changes model context. Most teams need better context and evals before they need finetuning.

Use prompting when behavior changes are small and easy to state.
Use RAG when the model needs private or changing knowledge.
Use finetuning when you need repeated style, format, or task behavior.
Use distillation when a smaller model should imitate a larger one.

Dataset engineering is the unglamorous core: collect examples, label them consistently, remove leakage, balance edge cases, and keep train/eval splits honest.
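
Two of those chores, removing leakage and keeping splits honest, can be sketched with a content hash. The approach below is one common option, assumed here for illustration: near-duplicate inputs are collapsed after normalization, and each example's split is derived from a hash of its text so it never moves between train and eval across runs. The field name `input` and the 20% eval fraction are arbitrary choices.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants count as the same example."""
    return " ".join(text.lower().split())

def assign_split(text, eval_fraction=0.2):
    """Deterministic split: the same text always lands in the same bucket."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map the first 32 bits to [0, 1)
    return "eval" if bucket < eval_fraction else "train"

def split_dataset(examples, eval_fraction=0.2):
    """Drop duplicates, then route each unique example to train or eval by hash."""
    seen = set()
    splits = {"train": [], "eval": []}
    for example in examples:
        key = normalize(example["input"])
        if key in seen:
            continue  # a duplicate in both splits would leak eval data into training
        seen.add(key)
        splits[assign_split(example["input"], eval_fraction)].append(example)
    return splits

examples = [
    {"input": "Reset my password", "label": "account"},
    {"input": "  reset MY   password ", "label": "account"},  # duplicate after normalization
    {"input": "Where is my order?", "label": "shipping"},
    {"input": "Cancel my subscription", "label": "billing"},
]
splits = split_dataset(examples)
```

Hash-based assignment also means new examples can be added later without reshuffling the ones already in the eval set.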

// SECTION_08

Inference: latency, cost, throughput

Production AI is constrained by waiting time and money. Every token costs something, and users feel every delay.

Lever	What it improves
Smaller model	Lower latency and cost.
Shorter context	Faster calls and fewer tokens.
Streaming	Better perceived speed.
Caching	Cheaper repeated answers.
Batching	Higher throughput for offline jobs.
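
Two of these levers are easy to show in code: the per-call cost arithmetic and an exact-match cache. The prices below are placeholder numbers, not any provider's real rates, and `CachedModel` is a hypothetical wrapper around whatever function actually calls the model.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost of one call in dollars, given per-million-token prices."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

class CachedModel:
    """Exact-match cache: identical prompts (ignoring whitespace) reuse the stored answer."""

    def __init__(self, call_model):
        self._call_model = call_model  # any callable str -> str
        self._cache = {}
        self.misses = 0

    def complete(self, prompt):
        key = " ".join(prompt.split())
        if key not in self._cache:
            self.misses += 1  # only cache misses pay for a model call
            self._cache[key] = self._call_model(prompt)
        return self._cache[key]

# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
cost = estimate_cost(2_000, 500, 3.00, 15.00)  # (6000 + 7500) / 1e6 = 0.0135 dollars

model = CachedModel(lambda prompt: f"answer to: {prompt.strip()}")  # stub model
first = model.complete("What is our refund policy?")
second = model.complete("What is  our refund policy? ")  # same prompt after normalization
```

Exact-match caching only pays off when the same prompt recurs and a repeated answer is acceptable. Shortening the context lowers the `input_tokens` term directly.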

// SECTION_09

Architecting the whole system

A reliable AI product is a normal software system with a probabilistic component inside it.

Receive the request and classify intent.
Load user, policy, and product context.
Retrieve documents or call tools if needed.
Generate structured output.
Validate output before showing or acting.
Log enough detail to debug failures.
Feed errors back into evals and product design.
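
The validate-before-acting step is where the probabilistic component meets ordinary software. A minimal sketch, assuming a hypothetical output schema with an `action` and a `message` field and a fixed set of allowed actions:

```python
import json

ALLOWED_ACTIONS = {"reply", "escalate"}  # assumed schema for this sketch

def validate_output(raw):
    """Return (data, None) when the model output is usable, else (None, reason)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None, "unknown action"
    message = data.get("message")
    if not isinstance(message, str) or not message.strip():
        return None, "missing message"
    return data, None

def handle(raw, trace):
    """Act only on validated output; fall back to escalation and log the reason otherwise."""
    data, error = validate_output(raw)
    if error is not None:
        trace.append({"raw": raw, "error": error})  # enough detail to reproduce the failure
        return {"action": "escalate", "message": "Routing to a human agent."}
    return data

trace = []
ok = handle('{"action": "reply", "message": "Use the identity provider."}', trace)
fallback = handle('{"action": "delete_everything", "message": "done"}', trace)
```

Entries in `trace` are natural candidates for new eval cases, which closes the loop described in the last step.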

The classic failure is launching a chatbot with no evals, no citations, no trace logs, and no way to reproduce bad answers. The demo feels alive. Production feels haunted. Build the feedback loop before the launch party.

// SECTION_10

Decision frameworks and mental models

Question	Default answer
Do we need an agent?	Only if the steps cannot be known ahead of time.
Do we need finetuning?	Not until prompts, RAG, and evals are insufficient.
Do we need a vector database?	Only if retrieval quality and scale require it.
What should we test first?	The failures that would embarrass or harm users.
What should we monitor?	Cost, latency, refusal rate, tool errors, and eval regressions.
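
The monitoring row can be turned into a small summary over request logs. The record fields below (`latency_ms`, `refused`, `tool_error`, `cost_usd`) are assumed names for illustration, and the percentile uses the simple nearest-rank method.

```python
import math

def summarize(records):
    """Aggregate per-request logs into the handful of numbers worth alerting on."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    latencies = sorted(r["latency_ms"] for r in records)
    p95_index = max(0, math.ceil(0.95 * n) - 1)  # nearest-rank percentile
    return {
        "requests": n,
        "p95_latency_ms": latencies[p95_index],
        "refusal_rate": sum(r["refused"] for r in records) / n,
        "tool_error_rate": sum(r["tool_error"] for r in records) / n,
        "total_cost_usd": sum(r["cost_usd"] for r in records),
    }

# Invented sample records.
records = [
    {"latency_ms": 120, "refused": False, "tool_error": False, "cost_usd": 0.01},
    {"latency_ms": 300, "refused": True, "tool_error": False, "cost_usd": 0.02},
    {"latency_ms": 150, "refused": False, "tool_error": True, "cost_usd": 0.01},
    {"latency_ms": 900, "refused": False, "tool_error": False, "cost_usd": 0.05},
]
summary = summarize(records)
```

Comparing these numbers before and after a prompt or model change gives an early warning alongside the eval suite.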

The highest-leverage AI engineering habit is humility: assume the model can fail, make failures visible, and design the system so the wrong answer is caught before it matters.
