Test Driven Development for Prompt Engineering
Build AI applications with guarantees
Building reliable software comes down to two things: specification (defining what it should do) and verification (checking that it actually does it). When your software is an AI prompt, this gets tricky. The prompt itself is the specification, but how do you consistently verify its behavior?
Test-Driven Development (TDD) offers a simple solution. It combines specification and verification into a single, practical workflow. By writing a test before you write the prompt, you create a clear, executable specification. Running that test then automatically verifies the AI’s output against your specification. This approach brings structure and predictability to prompt engineering, establishing it as a reliable discipline.
The Core Cycle: Red-Green-Refactor
The beauty of TDD is its simple, three-step loop. You repeat this cycle until your prompt works exactly as intended.
1. Red: Define What You Want (Write a Failing Test)
Start by writing a test for a single, specific behavior. This test will fail because you haven’t written the prompt yet—that’s the point. You’re defining success before you try to achieve it.
Let’s say you’re building a customer service bot. You want it to be empathetic when customers complain.
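The tests that follow call a `getAIResponse(message, prompt)` helper. Its exact shape depends on your SDK; here is a minimal sketch, assuming a hypothetical `callModel` wrapper around your provider's API (stubbed to echo its input so the sketch runs offline):

```javascript
// Hypothetical stand-in for a real provider call (e.g. a chat-completions
// request). Stubbed to echo its input so this sketch runs offline.
async function callModel({ system, user }) {
  return `(${system}) ${user}`;
}

// The helper the tests assume: pass the user message plus the current
// prompt under test, get back the model's reply as a trimmed string.
async function getAIResponse(userMessage, systemPrompt = "") {
  const reply = await callModel({ system: systemPrompt, user: userMessage });
  return reply.trim();
}
```

In a real project, only `callModel` changes when you swap providers; the tests keep calling `getAIResponse`.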
// Jest Test Case
test('responds empathetically to complaints', async () => {
  const userComplaint = "My order arrived damaged and I'm really frustrated!";
  const response = await getAIResponse(userComplaint, /* current prompt */);

  // Check for empathetic keywords
  expect(response.toLowerCase()).toContain("sorry");
  expect(response.toLowerCase()).toContain("frustrating");
});

This test will fail with a generic prompt like “You are a customer service agent.”
2. Green: Make It Work (Update Your Prompt)
Now, modify your prompt just enough to make the test pass.
// The Prompt
const promptV1 = `You are a customer service agent. When a customer expresses frustration, acknowledge their feelings ("I understand how frustrating that is") and apologize sincerely ("I'm sorry to hear that").`;

Run the test again. Once it passes, you’ve successfully implemented the desired behavior.
3. Refactor: Make It Better (Without Breaking It)
With a passing test as your safety net, you can now improve the prompt. You could make it more concise or add an example for clarity. After each change, you run the test again to ensure it still works.
// The Refactored Prompt
const promptV2 = `
You are an empathetic customer service agent.
When a customer reports a problem, always start by acknowledging
their frustration and apologizing.
Example:
Customer: "My package was damaged!"
You: "I'm so sorry to hear that. I understand how frustrating that must be."
`;

How to Test AI Output: Two Approaches
Unlike traditional code, AI responses vary from run to run. Two main testing strategies handle this variability.
1. Pattern Matching: For Predictable, Structured Output
Use this when you need the AI to extract or format information in a precise way. This is perfect for checking for keywords, JSON structures, or regular expressions.
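One practical wrinkle: models sometimes wrap JSON in markdown code fences or surrounding prose, which makes a bare `JSON.parse` throw. A small defensive parser is a common workaround; a sketch, with a hypothetical helper name:

```javascript
// Parse JSON from model output, tolerating markdown code fences
// and surrounding prose. Hypothetical helper name.
function parseModelJson(text) {
  // Strip ```json ... ``` fences if present.
  const fenced = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = fenced ? fenced[1] : text;
  // Fall back to the first {...} span in the text.
  const braced = candidate.match(/\{[\s\S]*\}/);
  return JSON.parse(braced ? braced[0] : candidate);
}
```

Swapping `JSON.parse(responseText)` for a helper like this keeps a test like the one below from failing on formatting rather than substance.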
Example: A sales tool that analyzes a conversation to qualify a lead.
test('correctly identifies a qualified lead from a transcript', async () => {
  const conversation = `
Customer: We need a solution for 500 employees. Our budget is approved.
Sales Rep: Great. What's your timeline?
Customer: We need to implement it by Q2.
`;
  const prompt = `
Analyze the conversation and determine if the lead is qualified.
A lead is qualified if they have a budget, timeline, and company size over 100.
Return ONLY a JSON object with "qualified": "yes" or "no".
`;
  const responseText = await getAIResponse(conversation, prompt);
  const result = JSON.parse(responseText);
  expect(result.qualified).toBe("yes");
});

2. AI-based Evaluation: For Nuanced, Subjective Output
Use this when you care more about the quality or tone of the response than specific words. A second evaluator AI can score the output against your criteria.
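Because a second model does the scoring, the scores themselves vary from run to run, so a single sample can flake. One common mitigation is to score several times and assert on the average; a sketch, where `scoreFn` stands in for your evaluation call:

```javascript
// Average several evaluator runs to smooth out score variance.
// scoreFn is any async function that resolves to a numeric score.
async function averageScore(scoreFn, runs = 3) {
  let total = 0;
  for (let i = 0; i < runs; i++) {
    total += await scoreFn();
  }
  return total / runs;
}
```

In a test, you would wrap the evaluation call in `averageScore` before asserting against your thresholds, trading a few extra API calls for a far less flaky suite.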
Example: An AI that generates follow-up emails.
test('generates a follow-up email with a warm, professional tone', async () => {
  const generationPrompt = `
Write a warm, professional follow-up email to a potential client named Sarah
who requested pricing information 3 days ago. Do not be pushy.
`;
  const email = await getAIResponse(generationPrompt);

  // Use another AI to evaluate the generated email's tone
  const evaluationPrompt = `
Rate the following email on a scale of 0.0 to 1.0 for "professionalism"
and "pushiness". Return ONLY a JSON object. Email: "${email}"
`;
  const evaluationText = await getAIResponse(evaluationPrompt);
  const toneScores = JSON.parse(evaluationText);
  expect(toneScores.professionalism).toBeGreaterThan(0.8);
  expect(toneScores.pushiness).toBeLessThan(0.3);
});

Conclusion
TDD transforms prompt engineering into a systematic process. By writing tests first, you:
Define clear expectations for your AI.
Catch regressions before they affect users.
Build confidence in your AI’s behavior.
Create living documentation through your test suite.
Start small. Pick one critical behavior in your AI application, write a test for it, and then make it pass. You’ll be surprised how quickly this improves both your prompts and your peace of mind.

