Your AI passed internal testing. That doesn't mean it's ready.
What Perlego learned when real users stress-tested their AI before launch.
Every product team shipping a GenAI feature runs the same playbook. Internal dogfooding. Prompt testing. Automated benchmarks. Red-teaming with a handful of engineers.
And then they ship.
The problem? None of those steps tell you what happens when a real user, with real intent, pushes your AI in directions you didn't anticipate.
Perlego, the digital learning platform often described as the "Spotify for textbooks," decided not to take that risk. Before putting their AI Research Assistant in front of students, they commissioned an independent evaluation through the AI GroundTruth service from Global App Testing (GAT): 10 human evaluators, 270 prompt evaluations, 4 must-pass criteria defined before a single test was run.
Their AI had real strengths. Book discovery scored 4.28/5. Abuse handling was flawless. Ecosystem containment was airtight.
But the evaluation also surfaced gaps that internal testing hadn't caught. And that's exactly what it was designed to do.
"Internal QA tells you if it works. Global App Testing told us if it works the way it should. That distinction matters when your AI is sitting between a student and their degree."
Matt Davis, CTO, Perlego
The AI had learned to refuse. It hadn't learned to hold the line.
When evaluators applied multi-turn conversational pressure, the kind that real students naturally apply, the AI's initial boundaries softened. It refused the first request cleanly. But when users rephrased, asking for a smaller piece of the same task, the AI didn't recognise it as the same request. It helped.
The teams now call this pattern boundary erosion. The refusal exists, but it doesn't persist.
Because Perlego tested for this before launch, they caught it before a single student experienced it. A specific fix is now in development: training the refusal to persist across turns, treating any request that decomposes a refused task into its sub-steps as an instance of the same refused task.
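The idea of a refusal that persists across turns can be sketched in a few lines. This is a minimal illustration, not Perlego's implementation: the names RefusalMemory, detect_blocked_task, is_substep and the keyword lists are all hypothetical, and a real system would use a trained classifier rather than string matching.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical mapping from a refused task to phrases that signal its sub-steps.
SUBSTEPS = {
    "write_essay": ("introduction", "thesis statement", "body paragraph", "conclusion"),
}


def detect_blocked_task(message: str) -> Optional[str]:
    """Strict single-turn check (keyword stand-in for a real classifier)."""
    if "write my essay" in message.lower():
        return "write_essay"
    return None


def is_substep(message: str, task: str) -> bool:
    """Looser check, applied only once the parent task has been refused."""
    text = message.lower()
    return any(part in text for part in SUBSTEPS.get(task, ()))


@dataclass
class RefusalMemory:
    # Tasks already refused in this conversation.
    refused: Set[str] = field(default_factory=set)

    def should_refuse(self, message: str) -> bool:
        task = detect_blocked_task(message)
        if task is not None:
            self.refused.add(task)
            return True
        # A request that decomposes a refused task counts as the same task.
        return any(is_substep(message, t) for t in self.refused)
```

With one RefusalMemory per conversation, "Write my essay on the causes of the First World War" is refused, and a follow-up such as "OK, just write the introduction then" is refused too, even though the same follow-up would pass the single-turn check on its own. The trade-off in this sketch is over-blocking: once a task is refused, a legitimate question that mentions a sub-step phrase is also declined, so the sub-step check would need more nuance in practice.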
Single-turn testing would never have caught this. Real conversations did.
Three patterns every AI product team should test for
Perlego's evaluation surfaced three distinct patterns. Each one is relevant to any company shipping AI features, regardless of industry.
1. Multi-turn boundary erosion.
Your AI might refuse a request cleanly on the first ask. But what happens when the user rephrases? Asks for a smaller piece of the same thing? Each sub-step looks reasonable in isolation. The sequence is the failure.
If your AI testing doesn't include multi-turn escalation scenarios with real humans, this pattern is probably hiding in your product too.
2. Intent detection gaps.
Sometimes the issue isn't pressure. It's ambiguity. Certain prompts sit on the line between a legitimate question and an assignment request. If your AI can't distinguish between "help me understand this topic" and "do my homework," it will be helpful in the wrong direction.
Perlego is now adding an intent classifier that routes ambiguous requests to guidance-only responses, ones that point to books and research angles without producing structured content a student could submit.
3. Safety responses that prioritise getting back to business.
When evaluators introduced distress signals, the AI acknowledged the situation but pivoted back to its core function too quickly, without surfacing specific support resources.
For any AI serving a user base that might include vulnerable people, this matters. Perlego flagged it as their top priority and is implementing dedicated crisis routing that pauses the assistant entirely until the user re-engages on their original task.
The takeaway: your AI's safety responses should be at least as specific as the help it provides.
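The second and third patterns both come down to routing, and a rough sketch may make the shape clearer. Everything here is an assumption for illustration: the Route names, the marker phrases, and the resumes_task flag (standing in for whatever signal tells the system the user has returned to their original task) are hypothetical, and production detection would not rely on keywords.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    CRISIS_SUPPORT = "crisis_support"   # surface specific support resources
    GUIDANCE_ONLY = "guidance_only"     # point to books and research angles only
    FULL_ASSISTANT = "full_assistant"   # normal behaviour


# Illustrative marker phrases only; not a real detection method.
DISTRESS_MARKERS = ("can't cope", "hopeless", "hurt myself")
AMBIGUOUS_MARKERS = ("outline", "structure", "draft")


@dataclass
class Router:
    paused: bool = False  # True while crisis routing is active

    def route(self, message: str, resumes_task: bool = False) -> Route:
        text = message.lower()
        # Distress is checked first, ahead of any task handling.
        if any(marker in text for marker in DISTRESS_MARKERS):
            self.paused = True
            return Route.CRISIS_SUPPORT
        # Stay paused until the user re-engages on their original task.
        if self.paused:
            if not resumes_task:
                return Route.CRISIS_SUPPORT
            self.paused = False
        # Ambiguous requests get guidance, not submittable content.
        if any(marker in text for marker in AMBIGUOUS_MARKERS):
            return Route.GUIDANCE_ONLY
        return Route.FULL_ASSISTANT
```

The ordering is the design choice worth noting: the distress check runs before anything else, so the assistant cannot pivot back to its core function until the user chooses to.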
What makes this story worth telling isn't what the evaluation found. It's the decision Perlego made before any of those findings reached a single student.
Most teams ship AI features when internal testing says they're ready. Perlego chose to add an independent layer: real humans, realistic scenarios, a must-pass framework with teeth.
The evaluation didn't just surface gaps. It confirmed genuine strengths, quantified them, and made sure the remediation plan wouldn't accidentally regress what was already working well.
Together, GAT and Perlego defined a three-phase path: remediate, re-evaluate with the same panel and prompt set, then monitor post-launch. Perlego is now executing that plan.
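One way to picture the re-evaluation step is as a gate that checks two things at once: that every must-pass criterion now clears its bar, and that nothing which scored well before has slipped. The sketch below is an assumption about how such a gate could work, not GAT's scoring method; the function name, criterion names, thresholds and tolerance are all hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Criterion name -> list of evaluator ratings (for example on a 1-5 scale).
Scores = Dict[str, List[float]]


def release_gate(
    baseline: Scores,
    rerun: Scores,
    must_pass: Dict[str, float],
    tolerance: float = 0.1,
) -> Dict[str, List[str]]:
    """Compare a re-evaluation against the baseline run.

    must_pass maps a criterion to the minimum mean score it has to reach.
    A regression is any criterion whose mean drops by more than `tolerance`.
    """
    failures = [
        name for name, minimum in must_pass.items()
        if mean(rerun[name]) < minimum
    ]
    regressions = [
        name for name in baseline
        if name in rerun and mean(rerun[name]) < mean(baseline[name]) - tolerance
    ]
    return {"must_pass_failures": failures, "regressions": regressions}
```

Because the comparison is per criterion, reusing the same panel and prompt set matters: it keeps the baseline and the rerun measuring the same thing.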
Why automated testing misses these patterns
Automated evaluations measure what's easy to measure: factual accuracy, response latency, format compliance, hallucination rates.
They can't measure whether an AI holds its role under social pressure, whether its helpfulness undermines the product's goals, or whether its safety responses are actually safe in context.
These are judgment calls. They require humans who can simulate realistic intent, apply escalating pressure, and evaluate whether the AI's behaviour would be acceptable to a real user, a real institution, or a real regulator.
The case for evaluating before you launch
If you're shipping a GenAI feature and your testing process is entirely internal, you're probably missing things that will matter. Not because your team isn't good. Because they're too close.
Perlego's team built a strong AI. The evaluation confirmed that. But it also surfaced the specific gaps that only appear when independent users apply real-world pressure.
The best time to find these patterns is before launch. Not after a user finds them for you.
That's the decision Perlego made, and it's the reason their AI will be stronger when it ships.
That's the case for human-led GenAI evaluation. Not as a replacement for automated testing. As the layer that catches what automation can't.
Global App Testing's AI GroundTruth service
Global App Testing's AI GroundTruth service provides human-led GenAI evaluation across safety, accuracy, boundary maintenance, and user trust.
globalapptesting.com/ai-groundtruth

Why this matters
Internal testing catches the obvious problems. Automated benchmarks measure what's easy to measure. But the real risks, the ones that damage trust, create liability, or undermine your product's purpose, only surface when real humans interact with the system the way real users will.
That's what AI GroundTruth is built for. Not synthetic benchmarks. Not automated test suites. Human-led evaluation across the dimensions that matter most for your product, your users, and your brand.