How to evaluate a product with a non-deterministic output

Written by Christopher McTurk-Starkie | May 2026

In 2024, a Deloitte Australia team submitted a government workforce report worth AU$440,000. Upon audit, it was discovered that most of the academic references and quoted passages were fabricated by the GPT-4o model used to draft it.

This was not an isolated incident. AllAboutAI's research suggests AI hallucinations cost businesses $67.4 billion globally in 2024.

Generative AI chatbots, recommendation engines, content personalization features, fraud detection models, and AI-powered search tools all produce outputs that are inherently variable. Send the same prompt twice, and you'll get two different but potentially equally valid responses.

There is no canonical "correct" answer to check against.

This is the core challenge of evaluating non-deterministic products. And most QA teams are still trying to solve it with a toolkit that was built for a different era. Our team at Global App Testing (GAT) addresses this by leveraging crowdsourced testing on real devices to uncover hidden inconsistencies that scripted tests miss.

This article explores how to evaluate products with non-deterministic behavior. We will look at practical testing methods, multi-run evaluation techniques, benchmarking strategies, and the key performance metrics.

What makes a product non-deterministic and why it changes everything

In a deterministic system, the relationship between input and output is fixed. You can write a test, run it a thousand times, and get the same result each time. The entire architecture of traditional software testing, unit tests, regression suites, and automated assertions is built on this assumption.

Non-deterministic systems break this contract at the foundation. Feed the large language model (LLM) the same prompt twice, and it will produce different tokens, different phrasings, different examples, and sometimes different conclusions.

This means the system's outputs are probabilistic rather than fixed. Their outputs are influenced by statistical prediction and inference rather than strict rule-based execution. To understand how this behavior shows up in practice, it is important to look at where the variability actually comes from.

Where the variability comes from in non-deterministic systems

Because non-deterministic AI products generate probabilistic outputs instead of fixed responses, the same input can produce different results across runs. To evaluate these systems properly, teams need to understand what causes that variability. The main sources of variation are:

Temperature and sampling: Language models generate responses by sampling from probability distributions. Higher temperature creates more varied and creative outputs, while lower temperature produces more consistent responses, though not perfectly deterministic ones.
Prompt sensitivity: Small wording changes can produce very different answers. Asking a model to “think step by step” or changing the order of options can significantly affect results.
Context window interactions: Earlier conversation context affects later responses. The same question may receive different answers depending on what came before it.
Training data distribution: Models know some topics better than others. Outputs in weaker or less-represented domains tend to be more inconsistent and error-prone.
Agentic compounding: In multi-step AI agents, variability grows across each step. Differences in execution paths, tool usage, or memory retrieval can affect the reliability of the final outcome.

Real incidents show how AI failures increase the legal, reputational, and compliance risks.

For instance, in 2024, a Canadian court held Air Canada legally liable after its customer service chatbot provided a passenger with false information about refund policies. The company argued the chatbot was a "separate legal entity." However, the court rejected the defense.

It set a precedent that organizations are accountable for what their AI systems say, regardless of how those systems are built.

Why traditional QA methods break down

Traditional quality assurance (QA) was built for systems that behave predictably. When software became more complex, QA scaled with it. But AI systems introduced probabilistic behavior that does not fit into old testing models.

In real-world testing environments at GAT, we often see this gap when teams shift from rule-based applications to LLM-powered systems. So, what worked for deterministic QA is no longer viable.

This is amplified when we break down the specific ways traditional QA methods fail when applied to non-deterministic AI systems.

Traditional QA fails with non-deterministic AI

1. The pass/fail problem

Automated testing frameworks operate on assertions. If you send input X, the output must satisfy condition Y. This works well when outputs are stable and predefined.

In non-deterministic (more accurately, probabilistic) AI systems, this approach breaks down because:

A single prompt can yield dozens of equally valid responses,
"Correct" is often contextual, cultural, or subjective,
Two outputs can be semantically equivalent but textually different, and
Model behavior drifts over time even without code changes.

2. Benchmarks are not enough

Benchmarks are useful during development, but they often fail in real environments. A model that scores 95% accuracy on a benchmark will still fail specific user groups, misread cultural context, or produce something no user can trust.

This happens because benchmarks are limited and controlled; real users are not. A recent MIT study on enterprise AI shows 95% of enterprise AI pilots fail to deliver meaningful ROI because models that perform well in tests do not always perform reliably in real-world conditions.

3. The infinite input space problem

Traditional software testing assumes a limited and predictable input space. You can test key scenarios and cover most cases.

Natural language AI products have no such boundaries. The range of things a user might ask a system, or the queries a search model might receive, or the scenarios an agentic AI might encounter, is effectively infinite.

Any test suite built on predetermined test cases will always leave critical edge cases unexplored and those edge cases are often exactly where failures are most harmful.

Research on AI agent testing notes that traditional software testing assumes deterministic outputs and binary pass/fail verdicts, both of which are ill-suited to non-deterministic systems.

The same paper demonstrated that binary pass/fail testing has 0% detection power for behavioral regressions in non-deterministic AI agents, whereas statistical behavioral fingerprinting achieves 86% detection power.

4. Model drift

Another key issue is that AI systems do not stay stable over time, even if the product code does not change. This is often called model drift, but it includes the following as well:

Data drift: User inputs change over time. For example, users start asking new types of questions.
Concept drift: The real-world meaning behind data changes. What is “correct” or “safe” may evolve.
Model or provider updates: The underlying AI model changes due to silent updates or retraining by the provider.

All of this can change system behavior without any visible change in your application.

Our experience testing non-deterministic products across markets shows that critical failures are rarely caught in staging. They surface only when real users, devices, languages, and expectations interact with the product in production. We provide real-world testing to detect these drift-related issues early by validating behavior across diverse environments at scale.

A framework for evaluating non-deterministic products

At Global App Testing, we’ve observed that effective evaluation of AI systems depends on clearly defined behavioral expectations that are validated across diverse real-world environments.

This is why the first step in any non-deterministic evaluation framework is to explicitly define what “good” looks like before testing begins.

Define acceptance criteria

Unlike deterministic testing, evaluation of non-deterministic products requires teams to use predefined quality thresholds to judge acceptable model behavior.

Defining success for a non-deterministic product means answering questions like:

What does a “good” output look like within an acceptable range of responses (not a single fixed answer)?
What outputs are considered harmful, misleading, or unacceptable (test oracle for failure conditions)?
Which quality dimensions matter for acceptance: accuracy, tone, safety, cultural appropriateness, latency, and completeness?
Who acts as the evaluation oracle: the model itself, human raters, or domain experts?

In QA terms, this forms the foundation of test design, where teams structure evaluation datasets, edge cases, adversarial prompts, and expected behavioral outcomes. The NIST AI Risk Management Framework provides a useful governance structure. It organizes AI governance into four functions:

Govern
Map
Measure
Manage

The Measure function explicitly centers on evaluation. This framework makes clear that evaluation is not a technical afterthought but a governance discipline that requires upfront investment in criteria definition, stakeholder alignment, and documentation.

Image source- AI Risk Management Framework

Without clear definitions, evaluation produces data without insight and unactionable results. This phase closely aligns with traditional test planning, where teams define acceptance criteria, risk areas, evaluation goals, and expected behavior ranges before execution begins.

Choose the right evaluation methods

Evaluation metrics for non-deterministic AI products fall into three broad families:

Automated reference-based metrics,
Model-based evaluation,
Human evaluation.

Each has its place, and none is sufficient alone.

a) Automated metrics (baseline layer)

These metrics compare model outputs against a reference "gold standard," a human-written example of the correct output. They provide useful signals and can be embedded in CI/CD pipelines to catch obvious regressions. Automated metrics are fast, cheap, and scalable.

Common metrics for non-deterministic outputs include:

BLEU / ROUGE scores: Measure how similar the output is to a reference text based on word overlap. It works for basic comparison but does not capture meaning well, so it is weak for modern LLM evaluation.

Semantic similarity (cosine similarity): Checks whether outputs mean the same thing, even if wording differs.

Hallucination rate: Measures how often the model produces unsupported or incorrect claims.

Perplexity: Measures how confidently the model predicts text, helping estimate how well the model fits or understands a given language pattern.
Confidence calibration: Checks whether the model’s confidence matches its actual accuracy.
METEOR: Improves overlap scoring by including synonyms and word order.
BERTScore: Uses embeddings to measure semantic similarity, so it captures meaning better than surface-level overlap.

Use these metrics as early warning systems, not as the arbiter of product quality.

b) LLM-as-a-judge

Using a second AI model to evaluate the primary model's outputs is common due to the scale involved. The judge model is given the input, the output, and a rubric, and asked to score the output on one or more dimensions.

However, LLM-as-a-judge has important failure modes. In expert domains such as medicine, law, and specialized technical content, subject matter experts agree with LLM judges only about 64–68% of the time, substantially lower than human-to-human agreement within the domain.

Moreover, the judge model inherits the biases, blind spots, and hallucinations of the underlying model. In QA pipelines, this can lead to false positives or false negatives in CI/CD, where good outputs are flagged as failures or poor outputs are incorrectly accepted.

Therefore, LLM-as-judge can be a scalable middle layer. It should be calibrated against a smaller set of high-quality human annotations, rather than as a standalone quality gate. This mirrors QA validation workflows, where automated checks are supplemented with human review before release decisions are made. Our team can help to strengthen this calibration through crowd-based validation, where diverse human reviewers help ground model judgments in real-world expectations and improve reliability.

c) Red teaming and adversarial exploration

Red teaming involves probing the system with challenging, ambiguous, edge-case, and boundary-pushing inputs, designed to reveal failure modes. It also includes testing against adversarial techniques such as prompt injection attacks, jailbreak attempts, and manipulative inputs that try to bypass system safeguards.

Our diverse crowd testers across different regions, languages, and usage patterns help simulate real-world adversarial behavior that internal teams and synthetic test suites may overlook.

This is especially important for safety, bias, and hallucination detection. To this end, major AI labs such as Anthropic and OpenAI are concerned and have conducted cross-model evaluations using human raters to evaluate each other's public models for sycophancy, whistleblowing tendencies, and self-preservation behaviors.

This evaluation shows a growing industry focus on shared safety testing methods.

d) Human evaluation (Gold standard layer)

Structured human judgment remains the most reliable method for evaluating the quality dimensions that matter most. These include cultural appropriateness, brand voice consistency, domain-specific accuracy, and subjective quality that automated metrics cannot capture.

It is also a gold standard for assessing safety. However, it can be expensive and difficult to scale to the sample sizes required for statistically reliable conclusions.

Our GAT AI GroundTruth routes structured evaluation techniques, including preference ranking, prompt evaluation, safety review, bias detection, and adversarial exploration to a crowd of 100,000+ diverse real-world testers across 190+ countries.

This gives enterprise teams the statistical depth and cultural breadth that internal evals simply cannot replicate.

Test across dimensions, not just inputs

The most common failure in AI product evaluation is optimizing for one dimension while neglecting others. This is why evaluating non-deterministic products requires testing across multiple dimensions, not just multiple prompts.

Real-world users introduce constant variation through:

Dimension	Why it matters
Languages & locales	AI behavior can degrade significantly across non-English languages.
Demographics	Bias and fairness gaps often only appear in specific user groups.
Devices & OS versions	The rendering and performance of AI features vary by hardware.
Network conditions	Latency affects perceived quality of generative responses.
Cultural context	Outputs appropriate in one market may be offensive or confusing in another.
Edge case inputs	Corner cases and adversarial prompts reveal safety and robustness gaps.

This is where real-world crowdtesting becomes especially useful. Our distributed testers network across different countries, devices, languages, and network conditions helps teams evaluate how AI products behave in live conditions that are difficult to detect in controlled lab environments.

Use statistical methods, not single-run verdicts

Because non-deterministic systems produce variable outputs, single-run testing is statistically meaningless.

Robust evaluation of non-deterministic systems requires testing the same input multiple times, not just once. This helps teams measure consistency and understand how much outputs vary across runs.

Statistical techniques like confidence intervals, bootstrapping, and Monte Carlo simulations help determine whether performance changes are meaningful or simply random variation.

Beyond initial evaluation, non-deterministic systems also require continuous monitoring and regression testing to detect drift, behavioral changes, and emerging failure patterns over time.

Designing your test suite

Evaluation is only as good as the inputs you test against. A test suite that only covers the happy path will give you false confidence. Here is how to build one that actually stress-tests your system.

Designing a test suite

The evaluation mindset

Instead of validating one fixed output, design evaluation processes that measure how reliably the system behaves across diverse scenarios and repeated runs. The goal is to evaluate accuracy, consistency, safety, and overall reliability despite the probabilistic nature of AI systems.

Building your test data strategies

A strong eval dataset is built from multiple types of inputs:

Adversarial cases: Inputs designed to break the system, including ambiguous or harmful prompts.
Golden cases: High-quality, verified examples that define expected behavior and serve as a baseline for smoke/sanity.
Edge cases: Outlier scenarios and boundary conditions that fall outside the "happy path." Integrating them into a dataset is essential for training models to handle rare but critical failures.
Diverse coverage: Tests across languages, user types, and complexity levels to avoid blind spots.
Real conversation data: Actual user interactions (where privacy allows), which best reflect real-world usage patterns.

Real interaction data often reveals behavior that synthetic datasets fail to capture, making it one of the most valuable inputs for evaluation design.

Regression testing for non-deterministic systems

AI systems can change behavior even when only small components are updated. A prompt tweak or model upgrade may improve some cases while degrading others.

To manage this, teams should:

Run full eval suites before and after every change
Compare overall score trends, not just single metrics
Track shifts in failure types and affected user segments

Some emerging approaches, such as behavioral fingerprinting, help detect subtle regressions that aggregate scores can miss by comparing system behavior patterns over time.

However, regression testing alone is not enough. Because AI behavior can drift after deployment, teams also need ongoing production monitoring and observability to detect emerging issues, quality degradation, and unexpected behavior patterns in real user environments.

Our team offers regression testing for AI systems, strengthened by combining structured evals with real-world testing across devices, regions, and user environments. This helps identify changes that may not appear in controlled test runs but become visible at production scale.

For example, GAT helped Flip reduce regression test duration by 1.5 weeks through targeted crowdtesting, ensuring stability across features.

Ready to master non-deterministic evaluation?

Non-deterministic systems don’t fail in neat, predictable ways. They fail across languages, devices, cultures, and real-world conditions that no single test suite can fully simulate.

If you’re building or scaling AI products, you need evaluation that goes beyond lab testing and benchmark scores. Global App Testing can help you validate AI and digital products in real-world conditions by combining structured QA with distributed, on-demand testing across global environments.

Talk to the GAT team to explore how distributed, real-device testing can help you validate non-deterministic AI systems at scale and close the gap between lab performance and production reality.

View full post