
Questions to Ask an AI Agent Builder's References (12 That Actually Reveal the Truth)

Most reference calls are a waste of time. These 12 questions — organized by what you're actually trying to learn — cut through polished answers and reveal whether a candidate can actually build production-grade AI agents.

By HireAgentBuilders

Why Reference Calls Fail (and How to Fix Them)

Most reference calls produce nothing useful.

The candidate selects their references. The reference knows they're being called. The questions are generic — "how was it working with them?" "would you work with them again?" — and the answers are too.

You walk away with three glowing testimonials and zero actual signal.

For AI agent builders, this is especially dangerous. Agentic AI projects have a specific failure profile: demos that look great, code that breaks under production load, and failures that only emerge weeks after delivery. A reference can either surface that failure profile or validate genuine capability — but only if you ask the right questions.

These 12 questions are designed to produce honest, specific answers. They work because they're concrete, they require the reference to recall actual events, and they're hard to fake.


Before You Ask Anything: Set the Frame

Start every reference call with this:

"I'm considering [Name] for an AI agent project. I want to understand specifically how they work — not whether they're a good person. I may ask some detailed questions. Is that okay?"

This does two things: it signals you're serious, and it gives the reference permission to be direct without feeling like they're undermining someone they like.


Section 1: Verifying the Work

Q1: What specifically did they build for you?

What you're listening for: A concrete description — not a summary. Specifics like frameworks used, data volumes handled, tool integrations built, and deployment environment.

Red flag: Vague answers. "They helped with our AI roadmap." "They were involved in the agent project." If the reference can't describe what was built, either the work wasn't substantial or they weren't paying attention — both are problems.

Follow-up: "Was it in production? How long has it been running?"


Q2: How did the project end? Did it go to production?

What you're listening for: Deployment confirmation. Too many AI agent projects produce impressive-looking prototypes that never ship.

Red flag: "We handed it off to our internal team." "We're still evaluating it." These may be real answers, but probe further. What was the handoff state? Was the system documented well enough to hand off?

Follow-up: "What was the handoff package — docs, evals, runbooks?"


Q3: What was the scope when they started vs. when they finished?

What you're listening for: Scope management discipline. Did they push back on scope creep? Did they surface technical complexity early or absorb it silently and then miss timelines?

Red flag: "The scope kept growing and they just kept going." That's a sign of someone who can't say no — which leads to overrun projects. Or: "The scope shrank because they couldn't do what we originally needed." That's a capability signal.


Section 2: Technical Depth and Reliability

Q4: Did anything break in production? How did they handle it?

What you're listening for: Every production system fails. The question is what happens next. Strong builders have runbooks, monitoring, and a calm, systematic debug process. They communicate what's happening before you ask.

Red flag: "Nothing broke." Either the system wasn't in real production, or the reference doesn't know because the builder handled it in silence — which could mean either competence or opacity.

Better answer pattern: "Yes, we had a couple of issues in the first two weeks. They had tracing in place and identified the root cause within a few hours. They pushed a fix and followed up with a post-mortem."


Q5: What was their observability setup?

What you're listening for: Did they instrument the agent? Did they set up tracing (LangSmith, Langfuse, Braintrust, etc.), logging, evals, or alerting?

Red flag: The reference doesn't know what observability means in this context. That's forgivable if they're non-technical. But if they are technical and still can't answer, the builder probably didn't set anything up.

Why it matters: An AI agent with no observability is a black box in production. When something goes wrong, you're blind. Good builders treat this as non-optional.
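
If you want a concrete picture of what "instrumented" looks like, here is a minimal, vendor-neutral sketch: every tool call the agent makes leaves a structured record with inputs, output, latency, and errors, tied together by a trace ID. The run_tool helper and the tool name are hypothetical; real projects typically lean on a tracing SDK (LangSmith, Langfuse, Braintrust) rather than hand-rolled logging, but the shape of the data is the same.

```python
# Hypothetical, hand-rolled agent instrumentation sketch. In practice a tracing
# SDK does this for you, but the record it produces looks roughly like this.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent.trace")

def run_tool(trace_id, tool_name, tool_fn, **kwargs):
    """Run one agent tool call and log inputs, output, latency, and errors."""
    start = time.monotonic()
    record = {"trace_id": trace_id, "tool": tool_name, "input": kwargs}
    try:
        result = tool_fn(**kwargs)
        record.update(status="ok", output=str(result)[:500])
        return result
    except Exception as exc:
        record.update(status="error", error=repr(exc))
        raise
    finally:
        record["latency_ms"] = round((time.monotonic() - start) * 1000, 1)
        log.info(json.dumps(record))

# One trace_id per agent run ties every step of that run together.
trace_id = str(uuid.uuid4())
run_tool(trace_id, "lookup_order", lambda order_id: {"status": "shipped"}, order_id="A-1042")
```

If a reference can describe something like this (per-step records, a trace ID, latencies, errors), the builder almost certainly set up real observability.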


Q6: Did they write evals? How did they test the agent's outputs?

What you're listening for: Evaluation discipline. Strong builders define success criteria per task and test outputs against those criteria. They can explain what "good" looks like for a given agent action.

Red flag: "We just tested it manually." Manual testing for a one-shot demo is fine. For a production agent that runs thousands of times, it's a liability.

Why this separates tiers: Most builders can get an agent working. Fewer can prove it works reliably at scale. Evals are the proof layer.
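
For readers who haven't written one, here is a minimal sketch of what "evals" means in practice: a set of cases with explicit, machine-checkable success criteria, run against the agent on every change. The agent_answer function and the cases below are hypothetical stand-ins; real suites are much larger and often add LLM-as-judge scoring for open-ended outputs, but even this simple shape beats manual spot checks.

```python
# Minimal eval harness sketch (hypothetical agent and cases).
# The point: success criteria are written down and checked automatically,
# not eyeballed once during a demo.

def agent_answer(question: str) -> str:
    """Stand-in for the deployed agent; in practice this calls the real system."""
    return "Your order A-1042 shipped on May 3 and arrives May 7."

EVAL_CASES = [
    # (question, substrings the answer must contain to pass)
    ("When does order A-1042 arrive?", ["A-1042", "May 7"]),
    ("Has order A-1042 shipped yet?", ["shipped"]),
]

def run_evals() -> None:
    failures = []
    for question, must_contain in EVAL_CASES:
        answer = agent_answer(question)
        missing = [s for s in must_contain if s.lower() not in answer.lower()]
        if missing:
            failures.append((question, missing))
    print(f"{len(EVAL_CASES) - len(failures)}/{len(EVAL_CASES)} eval cases passed")
    for question, missing in failures:
        print(f"FAIL: {question!r} (answer missing {missing})")

if __name__ == "__main__":
    run_evals()
```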


Section 3: Communication and Collaboration

Q7: How often did they update you without you asking?

What you're listening for: Proactive communication rhythm. The best builders send updates before you wonder what's happening. They flag blockers as soon as they hit them, not after they've tried everything else.

Red flag: "I usually had to chase them for updates." This is one of the strongest indicators of a difficult working relationship. It doesn't mean they can't build — it means managing them costs you time.

Scale the ask: "On a scale of 1 to 10, how much real-time visibility did you feel you had into where the project stood?"


Q8: Did they push back on anything? How did they handle disagreement?

What you're listening for: Intellectual honesty and engineering judgment. Strong builders tell you when a technical approach is wrong, when scope is too aggressive, or when a deadline is unrealistic — before it's a problem.

Red flag: "They said yes to everything." A builder who never pushes back either lacks the experience to know when something won't work, or lacks the confidence to say so. Both lead to the same outcome: you discover the problem after the work is done.

Good answer pattern: "Early on they told us our timeline was too aggressive for the eval pipeline we wanted. We pushed back, they explained why, and they were right. It saved us about three weeks of rework."


Q9: How did they handle not knowing something?

What you're listening for: Epistemic honesty. Agentic AI is moving fast. Nobody knows everything. Strong builders say "I don't know, let me find out" and come back with research. Weak builders guess and don't tell you they're guessing.

Red flag: "They seemed to have an answer for everything immediately." Sometimes this is real expertise. Often it's overconfidence. Ask: "Were there ever cases where their first answer turned out to be wrong?"


Section 4: Impact and Fit

Q10: What's running in production right now because of their work?

What you're listening for: Lasting, working output. Not "what did they build" but "what's still working six months later." Production longevity is the best proxy for build quality.

Red flag: "We decided to rebuild it." Full rebuilds happen — requirements change, tech stacks shift — but if the reference can't give a reason, explore it. "What drove the rebuild decision?" A technical rebuild decision post-delivery often signals quality issues.


Q11: What would you have done differently in how you worked with them?

What you're listening for: This question is for you, not about them. It reveals workflow preferences, communication gaps, and project setup failures. But it also sometimes reveals builder-specific information the reference didn't think to share.

Useful signals: "We should have given them clearer requirements upfront" (scope management on your side), "We should have pushed them harder for documentation" (documentation weakness), "We should have set up a faster feedback loop on their output" (iteration preference).


Q12: Would you hire them for a project where failure meant real business damage?

What you're listening for: Stakes-calibrated endorsement. Most references will say they'd "work with them again." That's a low bar. Raise the bar: critical path project, production traffic, real financial exposure.

Red flag: Hesitation on a high-stakes scenario from someone who enthusiastically endorsed them for standard work. That gap tells you something about where this builder operates well vs. where they reach their ceiling.

Variant: "Would you hire them to build something you were betting a revenue quarter on?"


How to Weight the Answers

Not all references are equal:

  • Technical peer or engineering manager: Highest value for signal on craft quality, code standards, debugging behavior
  • Product manager or operator: Best signal on communication, scope discipline, delivery follow-through
  • Founder or executive: Often less technical, but strong signal on reliability, judgment, and ability to work with uncertainty

A candidate who gets strong endorsements from all three types — especially on high-stakes framing — is genuinely rare.


What to Do if References Are Sparse

Some strong builders are individual contributors, contractors, or open-source contributors who don't have a traditional manager reference chain. If that's the case:

  1. Ask for the GitHub repo or observable production system — and actually look at it
  2. Ask for a worked example: "Walk me through the hardest debugging session you had on this"
  3. Ask for eval samples or system design documentation from a prior project
  4. Look for HN, blog, or talk history — builders who explain how they work in public are usually more reliable

References are one signal, not the only signal.


Starting with a Pre-Checked Pool

The reference problem is partly a sourcing problem. If you start with a pool that's already been pre-screened against technical depth criteria — evaluated portfolio, framework fluency, eval discipline, production track record — you spend your reference call confirming, not discovering.

At HireAgentBuilders, every builder on our shortlist has been scored against a 6-dimension vetting framework before you see their name. We pass about 15% of candidates. Reference checks on our pool tend to confirm rather than surface new concerns.

Request a free preview — 2–3 matched profiles, no deposit required.

Get matched with a vetted AI agent builder →
