Using AI to Rethink and Improve Software Testing – DevOps Online

Author: Ash Gawthorp, Co-founder and CTO at Ten10 

Software testing has long relied on precision, process and pattern recognition. But as systems become more complex and interconnected, traditional quality engineering (QE) methods are being pushed to their limits. The rise of generative AI presents a real opportunity to evolve how teams approach QE, not by replacing testers, but by enhancing how they work.

AI is already helping teams save time and reduce repetitive effort across test creation, execution and maintenance. According to Gartner, by 2028, 90 percent of enterprise software engineers will use AI code assistants, up from just 14 percent in early 2024. As AI becomes more embedded across the development life cycle, the real challenge lies in adopting these tools responsibly, without losing the judgment and human insight that effective testing depends on.

Closing the Experience Gap

AI may be most transformative in how it supports learning and independence. For junior testers or those entering the field from non-traditional backgrounds, knowing where to start can be the biggest hurdle. They may need to consider how to test for a missing password, an incorrect input string, or edge cases they haven’t even thought of yet.

Treating AI as an interactive tutor, an always-available rubber duck if you like, can help bridge this gap. Testers can ask questions, get examples and explore variations without fear of getting it wrong, which matters in environments where confidence grows slowly and access to senior team members isn’t always immediate.

Of course, with this accessibility comes risk. Over-reliance can lead to surface-level understanding. If testers are just copying and pasting whatever a model gives them, they may not spot when it’s incorrect or irrelevant. To use AI effectively, testers now need to build a core skill in prompting thoughtfully and critically assessing responses, just as they would when learning testing frameworks or scripting languages.

Natural Language Testing as a New Frontier

One promising area of innovation is natural language test generation. Instead of writing code, testers describe what they want: “Reject additional cookies, confirm the message appears, click ‘Start now’ and perform a search.” From there, an AI system can identify the elements on the page, match them semantically and build a working automation script. It’s not flawless: quality is variable and depends on strong prompt-engineering techniques to enforce abstraction through the page-object model and produce useful code. Even so, it’s a meaningful step toward lowering the barrier to entry.
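
To make this concrete, here is a hand-written sketch of the kind of script such a system might produce for the description above, using Playwright for Python and a simple page-object abstraction. The URL, roles and button labels are illustrative assumptions rather than output from any particular tool.

```python
# Illustrative sketch only: the URL, selectors and labels are assumptions.
from playwright.sync_api import sync_playwright, Page


class SearchPage:
    """Page object wrapping the interactions described in plain English."""

    def __init__(self, page: Page):
        self.page = page

    def reject_additional_cookies(self):
        self.page.get_by_role("button", name="Reject additional cookies").click()

    def confirmation_message(self) -> str:
        return self.page.get_by_role("alert").inner_text()

    def start_and_search(self, term: str):
        self.page.get_by_role("button", name="Start now").click()
        self.page.get_by_role("searchbox").fill(term)
        self.page.keyboard.press("Enter")


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.gov.uk/service")  # illustrative URL
    search = SearchPage(page)
    search.reject_additional_cookies()
    assert "cookies" in search.confirmation_message().lower()
    search.start_and_search("driving licence")
    browser.close()
```

The page-object layer is precisely what the prompt engineering has to enforce: locators live in one class, so regenerated or hand-edited steps change in one place rather than in every test that uses them.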

Although interest in this capability is growing, many teams still face challenges in knowing where to start. At a recent software testing forum, one of our lead engineers demonstrated a practical application by running a live coding session that auto-generated Playwright scripts from natural language prompts using Model Context Protocol (MCP). The session attracted a packed room and sparked considerable engagement, not only because of the novelty of seeing the process in action, but also because it addressed a common gap: many testers are under pressure to adopt AI in their workflows yet lack clear, hands-on guidance. The demonstration showed that with the right approach, the concept is both accessible and achievable.

Used well, this opens the door to greater collaboration. Business analysts, product owners and other stakeholders can describe scenarios in plain English, while testers refine these into robust automation scripts. This creates a more inclusive testing process, bridging the gap between technical and non-technical team members.

Testing the AI Itself

As more systems embed AI into their core functionality, QA teams are being asked to test not just the code but the behaviour of the AI models themselves. This is uncharted territory, as outputs are probabilistic, not deterministic. The same prompt might produce a slightly different answer tomorrow. 

At Ten10, we’ve been exploring how to validate these systems. It’s no longer just about whether it works but whether it responds consistently, clearly and fairly. We use techniques like echo testing (rephrasing prompts to see if the answer holds up), semantic similarity checks and scoring metrics like BLEU and ROUGE to measure quality. Sometimes a second AI system, a judge model, is used to assess whether the answer makes sense.
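
As an illustration, the sketch below shows one way an echo test backed by a semantic similarity check could be wired up in Python, using the open-source sentence-transformers library for embeddings. Here ask_model() is a hypothetical placeholder for whatever call reaches the system under test, and the 0.8 threshold is an arbitrary starting point that teams would tune for their own application.

```python
# Illustrative sketch: ask_model() is a placeholder, the threshold is arbitrary.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def ask_model(prompt: str) -> str:
    # Placeholder: replace with a real call to the AI system under test.
    return "You can reset your password from the account settings page."


def echo_test(prompt: str, rephrased: str, threshold: float = 0.8) -> bool:
    """Ask the same question two ways and check the answers mean the same thing."""
    answers = [ask_model(prompt), ask_model(rephrased)]
    first, second = embedder.encode(answers, convert_to_tensor=True)
    similarity = util.cos_sim(first, second).item()
    return similarity >= threshold


assert echo_test(
    "How do I reset my password?",
    "What are the steps to change a forgotten password?",
)
```

The same structure extends naturally: swap the cosine check for BLEU or ROUGE scoring against a reference answer, or pass both answers to a judge model and assert on its verdict.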

Furthermore, significant consideration needs to be given to continuous monitoring. The assumption that testing once guarantees consistent behaviour is outdated. AI models can “drift” in production, making answers less relevant over time. Robust processes must be in place to detect this drift and take corrective action quickly.
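
One way to approach this, sketched below under the same assumptions as the previous example, is to keep a small set of “golden” prompt-and-answer pairs captured when the model was last validated, periodically compare fresh production answers against them, and alert when average similarity drops below a tuned threshold.

```python
# Illustrative drift check: the golden set, stub and threshold are assumptions.
import statistics

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

GOLDEN = {
    # prompt -> answer captured when the model was last validated
    "How do I reset my password?":
        "You can reset your password from the account settings page.",
}


def ask_model(prompt: str) -> str:
    # Placeholder: replace with a call to the production model.
    return "Go to account settings and choose 'Reset password'."


def drift_score() -> float:
    """Average semantic similarity between current answers and the golden set."""
    scores = []
    for prompt, golden_answer in GOLDEN.items():
        current, baseline = embedder.encode(
            [ask_model(prompt), golden_answer], convert_to_tensor=True
        )
        scores.append(util.cos_sim(current, baseline).item())
    return statistics.mean(scores)


if drift_score() < 0.75:  # threshold tuned per application
    print("Possible drift detected: trigger a review or rollback")
```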

This is still an emerging field, but one thing is clear: testing AI with AI requires a new mindset. Traditional regression tests often miss subtle formatting differences or shifts in meaning, so QA teams must start thinking more like researchers and less like those who simply follow a checklist. 

The Data Dilemma

The use of LLMs increases the risk of exposing sensitive data, not just in testing phases but also when running in production. As larger frontier models become more powerful and are tasked with handling more data, this risk becomes even more critical. One common mitigation strategy is to use smaller, less powerful agents for specific, well-defined tasks, orchestrated by another agent that allocates work between them. This limits the data each agent has access to and makes it easier to test them and to implement guardrails around what they can and cannot do.
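
The sketch below illustrates the pattern rather than any particular framework: a routing layer hands each task to a narrowly scoped agent and strips the payload down to the fields that agent is allowed to see. The task names, field lists and stubbed agents are purely illustrative.

```python
# Illustrative orchestration sketch: task types, fields and agents are made up.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    kind: str      # e.g. "summarise_ticket", "classify_refund"
    payload: dict  # full record, potentially containing sensitive fields


# Each specialist agent declares the only fields it is allowed to see.
ALLOWED_FIELDS = {
    "summarise_ticket": {"ticket_id", "description"},
    "classify_refund": {"amount", "currency", "reason_code"},
}

# Stubbed specialist agents; in practice these would wrap calls to smaller models.
AGENTS: dict[str, Callable[[dict], str]] = {
    "summarise_ticket": lambda data: f"Summary of {data['ticket_id']}",
    "classify_refund": lambda data: "standard" if data["amount"] < 100 else "review",
}


def orchestrate(task: Task) -> str:
    """Route the task to its specialist agent, redacting everything else."""
    allowed = ALLOWED_FIELDS[task.kind]
    redacted = {k: v for k, v in task.payload.items() if k in allowed}
    return AGENTS[task.kind](redacted)


print(orchestrate(Task("classify_refund", {
    "amount": 250, "currency": "GBP", "reason_code": "damaged",
    "card_number": "4111 1111 1111 1111",  # never reaches the agent
})))
```

Because each agent only ever sees a fixed, minimal slice of the data, its behaviour is far easier to test exhaustively than that of a single all-purpose agent with access to everything.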

In testing, there is also a risk of sending sensitive data to LLMs, or even further downstream to other systems, if those guardrails are not properly implemented. One solution to this challenge is to use synthetic data for testing. While synthetic data is already common in regulated environments, it is often overlooked elsewhere because generating high-quality synthetic data is extraordinarily difficult. It requires preserving complex multi-dimensional statistical relationships and implicit business rules across hundreds of variables, while maintaining the delicate balance between privacy protection and data utility. All of this must be achieved at scale within the tight timelines demanded by modern Continuous Integration / Continuous Delivery (CI/CD) workflows, where updates are built, tested and deployed in rapid cycles.

AI is transforming the way organisations address this challenge through techniques such as copula models, which capture both the individual distributions of multiple random variables and the dependence structure between them, and use this to automatically learn and replicate complex data patterns. For example, if you have payment datasets where amount, currency and processing time are related, a copula model learns these relationships and can generate millions of new synthetic payments that maintain the same realistic patterns, such as higher amounts taking longer to process or certain currencies appearing together, without copying any actual payment records. This approach is particularly useful in performance testing, where datasets often need to be far larger than those available in the real world.
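
For readers who want to see the mechanics, the sketch below builds a bare-bones Gaussian copula by hand with NumPy and SciPy on a small, made-up payments dataset: each column is converted to uniform scores via its empirical distribution, the correlation of the corresponding normal scores is estimated, and new correlated samples are mapped back through the original marginals.

```python
# Bare-bones Gaussian copula sketch on a made-up two-column payments dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical "real" data: processing time loosely depends on amount.
amount = rng.lognormal(mean=4.0, sigma=1.0, size=5_000)
proc_time = 0.002 * amount + rng.gamma(shape=2.0, scale=1.5, size=5_000)
real = np.column_stack([amount, proc_time])

# 1. Convert each column to uniform scores via its empirical CDF (the marginals).
uniform_scores = stats.rankdata(real, axis=0) / (len(real) + 1)

# 2. Map the uniforms to standard normal scores and estimate their correlation:
#    this matrix is the Gaussian copula's dependence structure.
normal_scores = stats.norm.ppf(uniform_scores)
corr = np.corrcoef(normal_scores, rowvar=False)

# 3. Sample new correlated normal scores, push them back to uniforms, then
#    invert each empirical marginal to obtain synthetic payments.
n_synth = 100_000  # scale up as needed for performance testing
samples = rng.multivariate_normal(mean=np.zeros(2), cov=corr, size=n_synth)
uniforms = stats.norm.cdf(samples)
synthetic = np.column_stack(
    [np.quantile(real[:, i], uniforms[:, i]) for i in range(real.shape[1])]
)

print(synthetic[:5])  # synthetic (amount, processing time) pairs
```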

In practice, open-source Python libraries such as SDV (Synthetic Data Vault), developed at MIT, generate synthetic data using a range of techniques. These include classical statistical approaches like the Gaussian copula and deep-learning methods such as GAN-based models. This flexibility allows teams to choose between faster, more efficient methods and more computationally intensive approaches that capture highly complex patterns.
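
As an example, a typical SDV workflow for the payments scenario above might look like the sketch below. It assumes the SDV 1.x single-table API and a hypothetical payments_sample.csv file; exact class and module names can differ between versions.

```python
# Sketch assuming the SDV 1.x single-table API; module names may vary by version.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_payments = pd.read_csv("payments_sample.csv")  # hypothetical source file

# Describe the table so the synthesizer knows column types and constraints.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_payments)

# Fit a Gaussian copula model to the real data, then sample a much larger
# synthetic dataset for performance testing.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_payments)
synthetic_payments = synthesizer.sample(num_rows=1_000_000)

synthetic_payments.to_csv("synthetic_payments.csv", index=False)
```

Swapping in one of SDV's GAN-based synthesizers follows the same fit-and-sample pattern, trading longer training time for the ability to capture more complex patterns.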

Using AI to Strengthen Testing, Not Replace It

If there’s one takeaway from all this, it’s that AI in testing is not about cutting corners. It’s about enhancing judgment, expanding capability and reducing toil. But the fundamentals still matter. A poorly designed test suite won’t be fixed by an LLM, and a misunderstood API schema won’t be magically corrected by a prompt.

Successfully adopting AI in quality assurance is less about the tools themselves and more about the people using them. Testers who thrive in this environment are those who stay curious, think critically, and understand when to trust an AI-generated suggestion and when to dig deeper or reframe the question. AI can make testing faster and more flexible, but it still takes skilled people to make it meaningful.
