Charlotte, NC
v2026.05
clt_AIGuy
// AI_ENGINEERING.exe

Building reliable systems around models.

AI engineering is the work of turning foundation models into useful software: prompts, retrieval, tools, evals, inference, feedback loops, and production boundaries.

The model matters, but the system around the model usually decides whether the product is trustworthy, fast, affordable, and useful.

// SECTION_01

The shift: ML engineering to AI engineering

// AUDIO_GUIDE

Reliable AI Engineering Beyond the Prototype

A podcast-style walkthrough of evals, RAG, agents, inference, and the production scaffolding that turns model demos into reliable systems.


Classic ML engineering starts with training a model. AI engineering often starts with a powerful existing model, then builds the product scaffolding around it.

The daily work changes. You still care about data, tests, latency, cost, reliability, and monitoring. But now the uncertain component is a general-purpose model that can reason, hallucinate, call tools, ignore instructions, or behave differently after a model upgrade.

ML engineering
  • Train or tune a model for one task.
  • Optimize metrics on labeled datasets.
  • Serve predictions through an API.
  • Monitor drift and retrain when needed.
AI engineering
  • Compose a model with prompts, tools, memory, and retrieval.
  • Evaluate behavior across messy real requests.
  • Control output shape, safety, latency, and cost.
  • Monitor failures and close the feedback loop.
// SECTION_02

Foundation models

A foundation model is a general model trained on broad data, then adapted by prompting, tools, retrieval, or finetuning.

| Concept | Plain meaning |
| --- | --- |
| Token | A chunk of text the model reads and writes. |
| Context window | The amount of input and output the model can consider at once. |
| Sampling | The model choosing the next token from likely options. |
| Temperature | A control for how varied or conservative the output is. |
| Structured output | Constraining the answer into JSON or another exact schema. |

Bigger is not automatically better. Pick models by task: reasoning, latency, cost, context size, tool use, privacy, and failure tolerance.
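To make sampling and temperature concrete, here is a minimal sketch of temperature-scaled sampling over raw scores (logits). The three-token vocabulary and its scores are made up for illustration. Real models work over vocabularies of many thousands of tokens and usually layer other controls on top.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature, rng):
    """Pick one token at random, weighted by the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and scores, for illustration only.
vocab = ["reset", "restart", "banana"]
logits = [2.0, 1.5, -1.0]

cold = softmax_with_temperature(logits, 0.2)  # conservative: mass piles onto the top token
hot = softmax_with_temperature(logits, 2.0)   # varied: unlikely tokens get more of a chance
token = sample_token(vocab, logits, 0.7, random.Random(0))
```

At low temperature the top-scoring token takes almost all of the probability mass. At high temperature the unlikely token becomes a live option, which is the "varied versus conservative" trade-off in the table.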
// SECTION_03

Evaluation: the actual hard part

If you cannot tell whether the AI system got better or worse, you do not have an engineering loop yet.

Evals are test suites for model behavior. They include inputs, expected facts, unacceptable behaviors, graders, human review, and production feedback. The goal is not perfect truth. The goal is to catch regressions before users do.

{
  "input": "How do I reset my warehouse password?",
  "expected_facts": [
    "route through internal identity provider",
    "never expose raw database credentials",
    "link to approved runbook"
  ],
  "checks": [
    "factuality",
    "policy_compliance",
    "citation_quality",
    "tone"
  ]
}
  • Use deterministic checks for schema, citations, and policy rules.
  • Use model graders for fuzzy qualities like helpfulness or completeness.
  • Keep a small golden set that humans trust.
  • Add failed production examples back into the eval set.
// SECTION_04

Prompt engineering, treated seriously

A prompt is not magic wording. It is the interface contract between your software and the model.

| Prompt part | Why it exists |
| --- | --- |
| Role | Sets the job the model is doing. |
| Task | States the actual work to complete. |
| Context | Gives data the model should use. |
| Constraints | Defines what not to do. |
| Output format | Makes downstream parsing reliable. |

Prompts should live in version control, have tests, and change with the same care as code. If a prompt is important enough to affect product behavior, it is production logic.
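One way to treat a prompt as a versioned contract is to keep it as a template whose parts mirror the table above, so a missing part fails loudly instead of silently shipping. This is a sketch using the standard library's `string.Template`. `PROMPT_VERSION`, `build_prompt`, and the example values are assumptions, not a prescribed layout.

```python
from string import Template

PROMPT_VERSION = "support-answer-v1"  # hypothetical label, bumped on every change

PROMPT = Template(
    "Role: $role\n"
    "Task: $task\n"
    "Context:\n$context\n"
    "Constraints: $constraints\n"
    "Output format: $output_format\n"
)

def build_prompt(**parts):
    """Fill every part of the template; a missing part raises KeyError."""
    return PROMPT.substitute(**parts)

prompt = build_prompt(
    role="You are a support assistant for warehouse staff.",
    task="Answer the user's question.",
    context="[runbook-12] Reset passwords through the identity provider portal.",
    constraints="Use only the context. Never reveal credentials.",
    output_format='JSON with keys "text" and "citations".',
)
```

Because the template is ordinary code, a unit test can assert that each part is present and that a missing part raises an error. Logging `PROMPT_VERSION` with each request makes a bad answer traceable to the prompt that produced it.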
// SECTION_05

RAG: retrieval-augmented generation

RAG gives the model relevant material at answer time instead of trying to bake all knowledge into model weights.

user question
  -> query rewrite
  -> retrieve candidate chunks
  -> rerank
  -> build grounded prompt
  -> generate answer
  -> cite sources
  -> evaluate answer quality

The hard parts are chunking, metadata, retrieval quality, reranking, citation behavior, freshness, and knowing when no answer should be given. A weak retriever makes even a strong model look careless.

Most RAG failures are retrieval failures disguised as generation failures. Inspect the retrieved chunks before blaming the model.
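The sketch below shows the retrieve, build-prompt, and "no answer" steps in miniature. Word overlap stands in for a real retriever (embeddings, BM25, or a reranker), and the chunk ids, `min_overlap` threshold, and function names are all assumptions chosen for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, top_k=2, min_overlap=2):
    """Rank chunks by shared-word count and drop chunks below min_overlap."""
    q = tokenize(question)
    scored = [(len(q & tokenize(c["text"])), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s >= min_overlap]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def build_grounded_prompt(question, hits):
    """Return None when retrieval found nothing, so the caller can decline to answer."""
    if not hits:
        return None
    sources = "\n".join(f"[{c['id']}] {c['text']}" for c in hits)
    return ("Answer using only the sources below and cite their ids.\n"
            f"{sources}\nQuestion: {question}")

chunks = [
    {"id": "runbook-12", "text": "Reset a warehouse password through the identity provider portal."},
    {"id": "faq-3", "text": "The cafeteria opens at eight on weekdays."},
]

question = "How do I reset my warehouse password?"
hits = retrieve(question, chunks)
prompt = build_grounded_prompt(question, hits)
no_hits = retrieve("What is the refund policy?", chunks)
```

Printing `hits` before generation is the cheap inspection step: if the right chunk is not in that list, no prompt wording will fix the answer.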
// SECTION_06

Agents

An agent is a loop where a model can decide what to do next, often by calling tools and observing results.

Agentic systems are useful when the path is not fully known in advance: research, debugging, triage, multi-step operations, or workflow automation. They are risky when actions are expensive, irreversible, security-sensitive, or hard to verify.

| Agent control | Reason |
| --- | --- |
| Tool schemas | Limit what the model can request. |
| Budgets | Cap steps, tokens, time, and money. |
| Approval gates | Require humans before sensitive actions. |
| State logs | Make reasoning and tool results inspectable. |
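The four controls above fit into a small loop. In this sketch the model is replaced by a scripted function, so the example runs without any API. `run_agent`, the tool names, and the action format (`{"tool": ..., "args": ...}`) are hypothetical, and only the step budget is shown; token, time, and money caps would sit alongside it.

```python
def run_agent(decide, tools, needs_approval, approve, max_steps=5):
    """Loop: ask the model for an action, gate it, run the tool, log everything."""
    log = []
    observation = None
    for step in range(max_steps):  # budget: hard cap on steps
        action = decide(observation)
        if action["tool"] == "finish":
            log.append({"step": step, "event": "finish", "result": action["args"]})
            return action["args"], log
        if action["tool"] not in tools:  # only registered tools may run
            observation = f"error: unknown tool {action['tool']}"
        elif action["tool"] in needs_approval and not approve(action):
            observation = "error: action denied by reviewer"  # approval gate
        else:
            observation = tools[action["tool"]](**action["args"])
        log.append({"step": step, "action": action, "observation": observation})
    return None, log  # budget exhausted without finishing

# Scripted stand-in for the model's decisions.
script = iter([
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "refund", "args": {"order_id": "A1"}},
    {"tool": "finish", "args": {"summary": "refund was not approved"}},
])

def scripted_model(observation):
    return next(script)

tools = {
    "lookup_order": lambda order_id: f"order {order_id}: delivered",
    "refund": lambda order_id: f"refunded {order_id}",
}

result, log = run_agent(scripted_model, tools,
                        needs_approval={"refund"},
                        approve=lambda action: False)
```

Here the reviewer denies the refund, the denial is fed back to the model as an observation, and `log` records every action and result for later inspection.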
// SECTION_07

Finetuning and dataset engineering

Finetuning changes model behavior. Retrieval changes model context. Most teams need better context and evals before they need finetuning.

  • Use prompting when behavior changes are small and easy to state.
  • Use RAG when the model needs private or changing knowledge.
  • Use finetuning when you need repeated style, format, or task behavior.
  • Use distillation when a smaller model should imitate a larger one.

Dataset engineering is the unglamorous core: collect examples, label them consistently, remove leakage, balance edge cases, and keep train/eval splits honest.
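Two of those chores, removing leakage and keeping splits honest, can be sketched with the standard library. The approach below deduplicates on normalized input and assigns each example to a split by hashing its text, so the same example always lands in the same split across runs. The function names and the `"input"` field are assumptions, and exact-match normalization will not catch paraphrased duplicates.

```python
import hashlib

def normalize(text):
    """Collapse case and whitespace so near-identical examples compare equal."""
    return " ".join(text.lower().split())

def assign_split(text, eval_fraction=0.2):
    """Hash the normalized text so an example's split never changes between runs."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to a number in [0, 1]
    return "eval" if bucket < eval_fraction else "train"

def split_dataset(examples, eval_fraction=0.2):
    """Deduplicate on normalized input, then split deterministically."""
    seen, train, evaluation = set(), [], []
    for ex in examples:
        key = normalize(ex["input"])
        if key in seen:
            continue  # drop duplicates so one example cannot sit in both splits
        seen.add(key)
        target = evaluation if assign_split(ex["input"], eval_fraction) == "eval" else train
        target.append(ex)
    return train, evaluation

examples = [
    {"input": "How do I reset my password?", "label": "account"},
    {"input": "how do I  reset my password?", "label": "account"},  # duplicate after normalizing
    {"input": "Where is my order?", "label": "shipping"},
    {"input": "Cancel my subscription", "label": "billing"},
]

train, evaluation = split_dataset(examples)
```

Hash-based assignment avoids the common accident where reshuffling with a new random seed moves yesterday's eval examples into today's training set.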

// SECTION_08

Inference: latency, cost, throughput

Production AI is constrained by waiting time and money. Every token costs something, and users feel every delay.

| Lever | What it improves |
| --- | --- |
| Smaller model | Lower latency and cost. |
| Shorter context | Faster calls and fewer tokens. |
| Streaming | Better perceived speed. |
| Caching | Cheaper repeated answers. |
| Batching | Higher throughput for offline jobs. |
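As one illustration of the caching lever, the sketch below wraps a model call in an exact-match cache keyed on model name and prompt. `CachedModel` and `fake_model` are hypothetical. The approach assumes that returning the same answer for an identical prompt is acceptable, which may not hold when outputs are sampled or depend on fresh data.

```python
import hashlib

class CachedModel:
    """Wrap a model call so identical (model, prompt) pairs are answered from memory."""

    def __init__(self, call_model):
        self.call_model = call_model
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model, prompt):
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        answer = self.call_model(model, prompt)  # the only place a real call is made
        self.cache[key] = answer
        return answer

calls = []

def fake_model(model, prompt):
    calls.append(prompt)  # count the calls that would cost money
    return f"answer to: {prompt}"

client = CachedModel(fake_model)
first = client.complete("small-model", "What are your hours?")
second = client.complete("small-model", "What are your hours?")
third = client.complete("small-model", "Where are you located?")
```

The hit and miss counters double as a quick signal of whether the cache is earning its keep on real traffic.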
// SECTION_09

Architecting the whole system

A reliable AI product is a normal software system with a probabilistic component inside it.

  1. Receive the request and classify intent.
  2. Load user, policy, and product context.
  3. Retrieve documents or call tools if needed.
  4. Generate structured output.
  5. Validate output before showing or acting.
  6. Log enough detail to debug failures.
  7. Feed errors back into evals and product design.
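Step 5 is the one most often skipped, so here is a minimal sketch of validating structured output before showing it or acting on it. The schema (an `action`, a `text`, and `citations` for answers) and the allowed actions are assumptions made for this example. A production system might use a schema library instead of hand-written checks.

```python
import json

ALLOWED_ACTIONS = {"answer", "escalate"}  # hypothetical action set

def validate_output(raw):
    """Parse model output and check its shape; return (payload, error)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return None, "expected a JSON object"
    action = payload.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None, "unknown or missing action"
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        return None, "missing text"
    if action == "answer" and not payload.get("citations"):
        return None, "answers must cite at least one source"
    return payload, None

good, good_err = validate_output(
    '{"action": "answer", "text": "Use the portal.", "citations": ["runbook-12"]}'
)
_, parse_err = validate_output("not json")
_, action_err = validate_output('{"action": "delete_db", "text": "x"}')
_, cite_err = validate_output('{"action": "answer", "text": "x"}')
```

Returning an error string rather than raising keeps the failure easy to log (step 6) and to turn into a new eval case (step 7).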

The classic failure is launching a chatbot with no evals, no citations, no trace logs, and no way to reproduce bad answers. The demo feels alive. Production feels haunted. Build the feedback loop before the launch party.

// SECTION_10

Decision frameworks and mental models

| Question | Default answer |
| --- | --- |
| Do we need an agent? | Only if the steps cannot be known ahead of time. |
| Do we need finetuning? | Not until prompts, RAG, and evals are insufficient. |
| Do we need a vector database? | Only if retrieval quality and scale require it. |
| What should we test first? | The failures that would embarrass or harm users. |
| What should we monitor? | Cost, latency, refusal rate, tool errors, and eval regressions. |

The highest-leverage AI engineering habit is humility: assume the model can fail, make failures visible, and design the system so the wrong answer is caught before it matters.
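Making failures visible starts with aggregating per-request traces into the monitoring numbers listed in the table. The sketch below assumes each trace is a dictionary with `latency_ms`, `cost_usd`, `refused`, and `tool_error` fields. Those field names and the sample values are invented for illustration.

```python
import math

def summarize(traces):
    """Aggregate per-request trace records into a few monitoring numbers."""
    n = len(traces)
    latencies = sorted(t["latency_ms"] for t in traces)
    p95_index = max(0, math.ceil(0.95 * n) - 1)  # nearest-rank percentile
    return {
        "requests": n,
        "total_cost_usd": round(sum(t["cost_usd"] for t in traces), 6),
        "p95_latency_ms": latencies[p95_index],
        "refusal_rate": sum(t["refused"] for t in traces) / n,
        "tool_error_rate": sum(t["tool_error"] for t in traces) / n,
    }

# Hypothetical trace records, for illustration only.
traces = [
    {"latency_ms": 300, "cost_usd": 0.001, "refused": False, "tool_error": False},
    {"latency_ms": 1500, "cost_usd": 0.004, "refused": False, "tool_error": True},
    {"latency_ms": 200, "cost_usd": 0.001, "refused": True, "tool_error": False},
    {"latency_ms": 400, "cost_usd": 0.002, "refused": False, "tool_error": False},
]

summary = summarize(traces)
```

Comparing a summary like this before and after a prompt or model change is a simple way to notice that refusals or tool errors have moved.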
