Engineering

How to Build Reliable AI Agents for Lab Instruments

June 15, 2026 · 10 min read · Iacob Marian


An LLM that can hold a conversation is not an agent that can drive an instrument. Those are different problems. The first one impresses people in a demo. The second one is what a lab actually needs - an agent that does exactly what was asked, every single time, with no surprises on run number 400.

This is the gap most lab automation AI projects fall into. The demo works. Then someone phrases the request differently, or the task gets one step longer, and the agent improvises. In a chat window, improvisation is charming. On a liquid handler holding a real sample, it is a ruined experiment.

We build AI agents for lab instruments for a living, and what follows is what we actually learned, in roughly the order we learned it. None of it is theoretical. It is how you get from "look, it talked to the instrument" to lab automation you can trust.

Reliability Is the Product, Not the Demo

Start by being honest about the bar. A lab does not need an agent that is usually right. It needs an agent that is right in a way you can measure and defend.

That means two things most chat-first thinking ignores:

Determinism over cleverness. If a step can be computed, compute it. Do not ask the model to do arithmetic, geometry, or sequencing that a few lines of code do perfectly every time.
Bounded behavior. The agent should have a small, well-understood set of things it can do. The smaller and clearer that surface, the more reliable the system - and the easier it is to validate.

Everything that follows is in service of that bar.
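To make "if a step can be computed, compute it" concrete, here is a minimal sketch of the kind of geometry that belongs in code rather than in a prompt: turning a well name into deck coordinates. The function name and the plate constants are assumptions for illustration (a 96-well plate with 9 mm pitch), not values from any particular instrument.

```python
import re

# Assumed 96-well plate geometry; illustrative values only.
ROWS = "ABCDEFGH"
COLUMNS = 12
PITCH_MM = 9.0                 # assumed centre-to-centre well spacing
A1_OFFSET_MM = (14.38, 11.24)  # assumed x, y of well A1 from the plate corner


def well_to_xy(well: str) -> tuple[float, float]:
    """Convert a well name such as 'B3' into x, y coordinates in millimetres."""
    match = re.fullmatch(r"([A-H])(\d{1,2})", well.strip().upper())
    if match is None:
        raise ValueError(f"not a valid 96-well name: {well!r}")
    row = ROWS.index(match.group(1))
    column = int(match.group(2))
    if not 1 <= column <= COLUMNS:
        raise ValueError(f"column out of range: {well!r}")
    x = A1_OFFSET_MM[0] + (column - 1) * PITCH_MM
    y = A1_OFFSET_MM[1] + row * PITCH_MM
    return (round(x, 2), round(y, 2))
```

The model only ever says "B3". The same input gives the same coordinates on run 1 and on run 400, and an invalid well name fails loudly instead of being guessed at.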

The Foundation: An MCP Server Done Right

The foundation of a reliable agent for lab instruments is the MCP (Model Context Protocol) server. MCP is an open standard for exposing tools to an AI agent. The protocol is the easy part. The hard part - and the part that decides whether your agent is reliable - is what you expose and at what altitude.

The single most important design decision is this: expose a few high-level intent tools at one altitude, and compute everything computable in deterministic server-side code.

By "intent altitude" we mean tools that match how a scientist actually thinks:

transfer_liquid
place_consumables
read_plate

Not the layer underneath:

move_to_position
lower_tips
set_pump_speed
raise_z_axis

The intent tools are what the model should see. The primitives are what your server code orchestrates, deterministically, once the model has expressed intent. The model decides what to do. Your code decides how, the same way every time.
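The split between "what" and "how" can be sketched in a few lines. This is a hedged illustration, not our server code: the registry dictionary stands in for whatever MCP SDK you use, the argument names are assumptions, and the `aspirate` and `dispense` primitives are hypothetical additions to the primitive names listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    """One low-level instrument operation; never exposed to the model."""
    name: str
    args: tuple = ()


def expand_transfer_liquid(source: str, dest: str, volume_ul: float) -> list[Primitive]:
    """Deterministically expand one transfer_liquid intent into primitives."""
    if volume_ul <= 0:
        raise ValueError("volume_ul must be positive")
    return [
        Primitive("move_to_position", (source,)),
        Primitive("lower_tips"),
        Primitive("aspirate", (volume_ul,)),   # hypothetical primitive
        Primitive("raise_z_axis"),
        Primitive("move_to_position", (dest,)),
        Primitive("lower_tips"),
        Primitive("dispense", (volume_ul,)),   # hypothetical primitive
        Primitive("raise_z_axis"),
    ]


# The only surface the model sees: intent names mapped to server-side expanders.
INTENT_TOOLS = {"transfer_liquid": expand_transfer_liquid}
```

Calling `INTENT_TOOLS["transfer_liquid"]("A1", "B1", 50.0)` always yields the same eight-step sequence. The model supplies three arguments and never chooses the order of the moves.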

This is the inversion of the obvious approach. The obvious approach is to expose everything the instrument can do and let the smart model figure it out. That is exactly what you must not do.

Figure: Intent-level MCP architecture. The agent sees only high-level intent tools, while low-level primitives stay hidden behind deterministic server expansion.

The agent sees only the intent tools. The primitives below the line are real, but the server orchestrates them deterministically once the model has expressed intent - the model never touches them. This is the same pattern we describe in how to connect AI agents to lab instruments with MCP, pushed one step further into reliability.

Why Mixing Altitudes Breaks Agents - a Measured Finding

Here is the concrete finding that made this non-negotiable for us.

Point a capable, generic agent at a tool surface that mixes high-level intents and low-level primitives, and it goes off the rails. It starts composing the primitives by hand - move here, lower there, check this, adjust that - reasoning step by step through work the server should have done in one shot.

In our own measured testing, one such request produced around 13 tool calls and burned roughly 70,000 tokens for a job that should have been a single intent call. The model was not broken. It was doing exactly what a mixed-altitude surface invites: improvising a procedure out of primitives. More calls mean more places to drift, more latency, more cost, and more ways for a physical action to come out wrong.

Curating the tool surface to one intent altitude was the single highest-leverage design decision we made. Not the model. Not the framework. The tool surface.
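One cheap way to keep a surface at a single altitude is to check it mechanically, for example in CI. The sketch below assumes you maintain two explicit name sets. The function name and the "unreviewed tool" rule are our illustration, not part of MCP.

```python
# Tool names taken from the examples earlier in this article.
INTENT_TOOLS = {"transfer_liquid", "place_consumables", "read_plate"}
PRIMITIVES = {"move_to_position", "lower_tips", "set_pump_speed", "raise_z_axis"}


def check_tool_surface(exposed: set[str]) -> list[str]:
    """Return a list of problems; an empty list means a single-altitude surface."""
    problems = []
    for name in sorted(exposed & PRIMITIVES):
        problems.append(f"primitive exposed to the model: {name}")
    for name in sorted(exposed - INTENT_TOOLS - PRIMITIVES):
        problems.append(f"unreviewed tool: {name}")
    return problems
```

A surface of `{"read_plate", "lower_tips"}` is flagged because `lower_tips` is a primitive, which is exactly the mixed-altitude case described above.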

Reliability Lives in the Architecture, Not the Framework

Once you have an intent-level MCP, the reliable pattern is short and it is mostly not "AI":

Intent-level MCP - the model expresses what the scientist wants in a few curated tools.
Deterministic expansion - server-side code turns that intent into the exact, validated sequence of instrument operations.
Plan-validate-repair loop - generate the plan, validate it against real constraints (volumes, positions, tip state, deck layout), and repair or reject before anything physical happens.

Figure: The plan-validate-repair loop, with a human approval gate before deterministic execution on a lab instrument.

The plan is an ordered list of intent steps - the MCP contract defines what steps are possible, the agent fills in which ones for this request, and a human approves before anything executes. A plan that fails validation is repaired and validated again, or rejected outright; it is never run.
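A stripped-down version of the loop might look like the sketch below. Everything specific in it is an assumption for illustration: the `Step` fields, the 200 µl tip capacity, the 96-well name set, and the single repair rule (splitting an over-capacity transfer into equal chunks). A real repair step may instead send the validation errors back to the agent for re-planning.

```python
import math
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Step:
    tool: str
    source: str
    dest: str
    volume_ul: float


MAX_TIP_VOLUME_UL = 200.0  # assumed tip capacity
VALID_WELLS = {f"{row}{col}" for row in "ABCDEFGH" for col in range(1, 13)}


def validate(plan: list[Step]) -> list[str]:
    """Check every step against the constraints; return human-readable errors."""
    errors = []
    for i, step in enumerate(plan):
        if step.source not in VALID_WELLS or step.dest not in VALID_WELLS:
            errors.append(f"step {i}: unknown well")
        if step.volume_ul <= 0:
            errors.append(f"step {i}: non-positive volume")
        elif step.volume_ul > MAX_TIP_VOLUME_UL:
            errors.append(f"step {i}: volume exceeds tip capacity")
    return errors


def repair(plan: list[Step]) -> list[Step]:
    """Split over-capacity transfers into equal chunks; nothing else is repaired."""
    repaired = []
    for step in plan:
        if step.volume_ul > MAX_TIP_VOLUME_UL:
            chunks = math.ceil(step.volume_ul / MAX_TIP_VOLUME_UL)
            chunk = replace(step, volume_ul=step.volume_ul / chunks)
            repaired.extend([chunk] * chunks)
        else:
            repaired.append(step)
    return repaired


def plan_validate_repair(plan: list[Step], max_repairs: int = 2) -> Optional[list[Step]]:
    """Return a validated plan for human approval, or None if it must be rejected."""
    for attempt in range(max_repairs + 1):
        if not validate(plan):
            return plan
        if attempt < max_repairs:
            plan = repair(plan)
    return None
```

A 500 µl transfer comes back as three validated transfers of about 166.7 µl each, while a plan that names a well that does not exist is rejected. Either way, only a plan with zero validation errors reaches the approval gate.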

We measured the same reliability across multiple agent frameworks running on top of this design. That result is worth sitting with: when the architecture is right, the framework is a detail. People spend weeks arguing about which agent framework to adopt. In our testing, the framework was not what moved the reliability number. The intent-level MCP plus deterministic expansion plus the validate loop was.

So if you are choosing where to spend your engineering effort: spend it on the tool surface and the deterministic layer, not on the framework bake-off.

Evaluate Before Anything Reaches a Lab UI

You do not earn the right to put an agent in front of a scientist by getting one demo to work. You earn it by measuring.

Before any UI, define your use cases - and for each use case, collect many real user phrasings. Scientists do not speak in canonical commands. "Move 50 microliters from A1 to B1," "transfer 50 ul, well A1 into B1," and "pipette fifty microlitres across to B1" are the same intent in three sentences. Your benchmark has to cover that spread.
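In practice that spread can live in a small data structure: one use case, many phrasings, one expected tool call. The sketch below is a minimal shape for it. The class name, field names, and argument names (`source`, `dest`, `volume_ul`) are assumptions, and the phrasings are the three from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class UseCase:
    """One benchmark use case: many phrasings, one expected tool call."""
    name: str
    phrasings: tuple[str, ...]
    expected_tool: str
    expected_args: dict = field(default_factory=dict)


TRANSFER_50UL = UseCase(
    name="transfer_50ul_a1_to_b1",
    phrasings=(
        "Move 50 microliters from A1 to B1",
        "transfer 50 ul, well A1 into B1",
        "pipette fifty microlitres across to B1",
    ),
    expected_tool="transfer_liquid",
    expected_args={"source": "A1", "dest": "B1", "volume_ul": 50.0},
)


def benchmark_cases(use_cases: list[UseCase]) -> list[tuple[str, str]]:
    """Flatten use cases into (use case name, phrasing) pairs to run."""
    return [(uc.name, phrasing) for uc in use_cases for phrasing in uc.phrasings]
```

Every phrasing is scored against the same expected call, so a new way of saying the same thing becomes one more line of data rather than a new test.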

Then measure three things on every use case:

Accuracy - does the agent pick and use the right tools, with the right arguments, for this intent?
Reliability - run it N times. It must pass every time. A test that passes 9 times out of 10 has failed.
Cost and latency - tokens and seconds per task. This is where mixed-altitude designs quietly bleed out, and where intent-level designs stay cheap and fast.

All of this happens before anything reaches a lab UI. The evaluation harness is not a phase you do at the end. It is the thing that tells you whether you have a product or a demo.
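The three measurements fit in one small function. The sketch below assumes the agent can be wrapped as a callable that takes a phrasing and returns the tool call it made plus a token count. `AgentResult`, `evaluate`, and `stub_agent` are hypothetical names, and the stub stands in for a real agent session.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentResult:
    tool: str
    args: dict
    tokens: int


def evaluate(
    agent: Callable[[str], AgentResult],
    phrasing: str,
    expected_tool: str,
    expected_args: dict,
    runs: int = 10,
) -> dict:
    """Run one phrasing several times and report accuracy, reliability, and cost."""
    passes = 0
    tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        result = agent(phrasing)
        tokens += result.tokens
        if result.tool == expected_tool and result.args == expected_args:
            passes += 1
    elapsed = time.perf_counter() - start
    return {
        "passes": passes,
        "runs": runs,
        "reliable": passes == runs,  # 9 out of 10 counts as a failure
        "mean_tokens": tokens / runs,
        "mean_seconds": elapsed / runs,
    }


def stub_agent(phrasing: str) -> AgentResult:
    """Placeholder agent that always makes the same call; replace with a real one."""
    return AgentResult(
        "transfer_liquid", {"source": "A1", "dest": "B1", "volume_ul": 50.0}, 1200
    )
```

The `reliable` flag is deliberately strict: it is true only when every run passes, which matches the bar set above.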

Grow With Skills, and Let the Count Be a Signal

Start with a zero-skill agent: just the intent-level MCP, nothing else. Run it against your benchmark. This tells you the raw quality of the MCP itself, unassisted. If the agent struggles on simple intents with zero scaffolding, the problem is the tool surface, and no amount of scaffolding above it will fix that cleanly.

As tasks get more complex, add skills - routed recipes that guide the agent through a specific multi-step task using the same intent tools underneath. Skills are how you climb the reliability curve for harder workflows without expanding the tool surface or loosening determinism.
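As a rough sketch of what "routed recipe" can mean in code: a skill is a named, ordered list of intent tools plus something that decides when it applies. The keyword routing and the `serial_dilution` skill here are assumptions made to keep the example small. Your router may well be the model itself.

```python
from dataclasses import dataclass
from typing import Optional

# The intent tools named earlier in this article.
INTENT_TOOLS = {"transfer_liquid", "place_consumables", "read_plate"}


@dataclass(frozen=True)
class Skill:
    """A routed recipe: an ordered sequence of intent tools for one task."""
    name: str
    keywords: tuple[str, ...]
    recipe: tuple[str, ...]


SKILLS = (
    Skill(
        name="serial_dilution",  # hypothetical skill
        keywords=("serial dilution", "dilution series"),
        recipe=("place_consumables", "transfer_liquid", "transfer_liquid", "read_plate"),
    ),
)


def uses_only_intent_tools(skill: Skill) -> bool:
    """A skill may guide the agent, but it must not widen the tool surface."""
    return all(step in INTENT_TOOLS for step in skill.recipe)


def route(request: str) -> Optional[Skill]:
    """Pick the first skill whose keyword appears in the request, if any."""
    text = request.lower()
    for skill in SKILLS:
        if any(keyword in text for keyword in skill.keywords):
            return skill
    return None
```

A request that matches no skill returns `None` and falls through to the plain zero-skill agent. The `uses_only_intent_tools` check keeps skills from quietly reintroducing primitives.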

There is a quiet diagnostic here: the number of skills an MCP needs to reach a given reliability is itself a quality signal. A clean, well-shaped intent MCP needs few skills to be reliable. If you find yourself writing a skill for every other task just to keep the agent on track, that is the MCP telling you its altitude or its boundaries are wrong. Listen to it.

The Payoff: Automation You Can Trust

Put it together and the recipe is not mysterious:

A few intent tools at one altitude, with everything computable computed deterministically.
A plan-validate-repair loop so nothing physical happens on an unvalidated plan.
An evaluation harness that measures accuracy, reliability, and cost on real phrasings before a UI exists.
Skills added deliberately, with the skill count read as a signal about the MCP underneath.

A quality MCP plus a measured agent is how you get lab automation AI you can trust in a real lab - not a clip that looks good once. The reliability is in the design, and the design is the work.

Key Takeaways

Reliability, not conversational fluency, is the bar for AI agents that drive lab instruments.
The foundation is an MCP server that exposes a few high-level intent tools at one altitude and computes everything else in deterministic server-side code.
Mixing intent tools with low-level primitives sends agents off the rails - in our testing, around 13 tool calls for a job that should be one.
Reliability lives in the architecture (intent MCP + deterministic expansion + plan-validate-repair), not in the choice of agent framework.
Evaluate before any UI: define use cases, gather real phrasings, and measure accuracy, reliability (pass every time, not most times), and cost.
Start with a zero-skill agent to test the MCP itself, then add skills - the number of skills needed to be reliable is a quality signal about the MCP.

Frequently Asked Questions

What makes an AI agent for a lab instrument reliable?

Reliability comes from the architecture, not the model. Three pieces do the work: an intent-level MCP server, deterministic expansion of those intents into validated instrument operations, and a plan-validate-repair loop that rejects bad plans before anything physical happens. The agent decides what to do; deterministic code decides how, the same way every time.

Why is MCP the right foundation for lab automation AI?

The Model Context Protocol is an open standard for exposing tools to AI agents. For lab instruments, it lets you expose a small, curated set of intent tools that match how a scientist thinks, while hiding the mechanical primitives. That curated tool surface is the single biggest factor in whether the agent is reliable.

Should I expose every instrument capability to the AI agent?

No. Exposing low-level primitives alongside high-level intents invites the agent to improvise procedures step by step, which is slower, more expensive, and more error-prone. Expose intent tools only and orchestrate the primitives deterministically in server code.

How do you test an AI agent for a lab instrument before deployment?

Define use cases, and for each collect many real user phrasings. Then measure accuracy (right tools and arguments), reliability (it must pass every repeated run, not most), and cost and latency. All of this happens before the agent reaches a lab UI, against the MCP rather than the physical instrument.

What are skills, and how many should an agent have?

Skills are routed recipes that guide an agent through a specific multi-step task using the same intent tools. Start with zero and add them only as tasks get more complex. If you need a skill for nearly every task, that is a signal the MCP's tool surface is wrong.

For the foundations, see how to connect AI agents to lab instruments with MCP and the MCP protocol for lab automation. For where this is heading, see agentic AI for lab workflows.

Iacob is the Technical Lead at QPillars, a Zurich-based company building intelligent software infrastructure for life sciences laboratory instruments. We ship instrument software that is agent-ready by default - intent-level MCP servers, evaluated agents, and digital twins for safe iteration. Reach out at iacob@qpillars.com.

Iacob Marian

Technical Lead & Co-founder at QPillars

Iacob builds intelligent software infrastructure for life sciences laboratories, with a focus on Rust for instrument control and agentic AI for lab automation.

Published June 15, 2026

Tags: AI agents for lab instruments, lab automation AI, MCP for lab instruments, agentic AI, AI agent reliability, evaluation

Related Articles

How to Design AI Agents for Lab Automation: Start From the Question, Not the API

The Instrument Contract: What a Lab Instrument Must Expose for an AI Agent to Close the Loop

Agentic AI for Lab Automation: Why a Lab Instrument Is Not Just Another Tool
