AI & Accessibility Testing

The Evolution of Accessibility Testing: From Static Rules to Intelligent Automation


A comic-style illustration of Michaël Vanderheyden holding a futuristic digital magnifying glass with a divided lens. The left side shows standard code with green pass markers. The right side reveals a glowing purple and blue neural network overlay, symbolizing AI detecting semantic and layout issues. The entire background is solid white.

AI is everywhere, and test automation is no exception. Accessibility (a11y) testing is right in the middle of that shift. For years, we’ve relied on rule-based scanners that act like rigid gatekeepers: great at checking if markup and attributes are present, but still weak at understanding how a page actually renders, how layout affects meaning, and whether the content makes sense for real users.

We’re finally moving past “presence” and into “context.” That’s a big shift. Instead of running disconnected checks, we can connect semantic structure, visual rendering, and language quality in one pass and get feedback that is much closer to what a human reviewer would catch. In this area, AI improvements go beyond scanning: they improve the whole flow, from detection to triage to context-aware fixes.

Where AI Breaks the “Rule-Based” Ceiling

Standard automation struggles with WCAG success criteria that require human-level judgment. But AI shouldn’t replace static rules: they are fast and cheap, while AI is expensive by comparison. The point is that AI should handle what static rules can’t and enhance them through context awareness.

First, let’s look at what’s already working. SC 1.1.1 (Non-text Content): AI experiments have already shown that we can validate whether alt text accurately describes the image, not just whether it is present. As I discussed in a previous post about AI-enhanced scanners, this capability moves us from checking “is there alt text?” to “does this alt text make sense?” Static rules nail presence; AI handles semantic match.

Screenshot of an accessibility tool reporting a mismatch error. It compares an image of a warehouse worker with the existing alt text "Parts being packed," flagging it as inaccurate.
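To make the split between presence and semantic match concrete, here is a minimal sketch using only Python's standard library. It is not how any particular scanner works: `AltTextCollector` and `split_alt_findings` are hypothetical names, and the semantic check itself (comparing the image with its alt text) is left out because it would need an image-capable model.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Sorts <img> elements by what a static rule can and cannot decide."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []   # static finding: no alt attribute at all
        self.needs_review = []  # (src, alt) pairs a semantic check could verify

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        if "alt" not in attributes:
            self.missing_alt.append(src)
        elif attributes["alt"]:
            # Presence passes, but accuracy is still unknown.
            self.needs_review.append((src, attributes["alt"]))
        # An empty alt marks a decorative image and is skipped here.


def split_alt_findings(html):
    """Return (images missing alt, images whose alt text needs a semantic check)."""
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.missing_alt, collector.needs_review
```

The first list is what a rule-based scanner reports today. The second list is the queue an AI-assisted step would work through, which keeps the expensive check limited to images that already passed the cheap one.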

Based on my experience, the following is a first batch of easy wins. Some capabilities already exist in adjacent workflows and “just” need to be integrated into accessibility scanners, while others are personal expectations based on AI’s strengths and potential for improvement. The list is, of course, not exhaustive, and I’m sure there are other creative ideas where AI could add value.

SC 1.3.1 - Info and Relationships

The first easy win here is hierarchy integrity: does heading structure progress logically, and are levels skipped or missing in ways that break document structure? This is already something AI code reviewers can detect in HTML and Markdown markup today.

So again, this is less about inventing a brand-new capability and more about integrating an existing capability into accessibility scanners.
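The skipped-level part of the hierarchy check can be sketched without any AI at all. The example below is illustrative only: `find_skipped_levels` is a hypothetical helper, and it assumes that a document should start at h1, which is a convention rather than a WCAG requirement.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingLevelCollector(HTMLParser):
    """Records heading levels in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.levels.append(int(tag[1]))


def find_skipped_levels(html):
    """Return (previous_level, level) pairs where the outline jumps more than one level down."""
    collector = HeadingLevelCollector()
    collector.feed(html)
    skips = []
    previous = 0  # assumption: the outline is expected to start at h1
    for level in collector.levels:
        if level > previous + 1:
            skips.append((previous, level))
        previous = level
    return skips
```

For `<h1>`, `<h3>`, `<h2>`, `<h3>` this reports a single jump from level 1 to level 3. Moving back up the outline, such as h3 to h2, is not flagged.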

A natural extension would be detecting “fake” headings, either from rendered styles (text that visually looks like a heading) or even from heuristic signals like heading-like class names without semantic heading markup.
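The class-name heuristic could look roughly like the sketch below. The pattern in `HEADING_LIKE_CLASS` and the helper `find_fake_headings` are assumptions made for illustration, not an established rule set. Detecting headings from rendered styles would need computed font sizes and weights from a browser, which this sketch does not cover.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Hypothetical pattern: whole class-name segments such as "title" or "card-heading".
HEADING_LIKE_CLASS = re.compile(r"(^|[-_])(heading|headline|title)([-_]|$)", re.IGNORECASE)


class FakeHeadingCollector(HTMLParser):
    """Flags non-heading elements whose class names suggest a heading."""

    def __init__(self):
        super().__init__()
        self.suspects = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in HEADING_TAGS or attributes.get("role") == "heading":
            return  # already exposed as a heading
        for class_name in (attributes.get("class") or "").split():
            if HEADING_LIKE_CLASS.search(class_name):
                self.suspects.append((tag, class_name))
                break


def find_fake_headings(html):
    """Return (tag, class_name) pairs that deserve a closer look."""
    collector = FakeHeadingCollector()
    collector.feed(html)
    return collector.suspects
```

A hit is only a signal, not a violation. A `card-title` class on a `div` may be perfectly fine in context, which is exactly the kind of judgment an AI-assisted review could add on top of the heuristic.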

SC 3.1.4 & 3.1.5 - Abbreviations and Reading Level

Cognitive accessibility matters, and this is another area where AI code review can already help today. It can detect abbreviations and acronyms in HTML or Markdown content, then suggest plain language alternatives or at least proper markup that links an abbreviation to its definition.

Like the previous example, this feels less like a greenfield problem and more like integrating an already useful capability into the broader accessibility test flow.
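As a rough illustration of the abbreviation detection described above, the sketch below lists acronyms that appear in the text but are never wrapped in an `abbr` element. Treating every run of two or more capital letters as an acronym is a simplifying assumption, and `find_unexplained_acronyms` is a hypothetical name. Suggesting a plain-language alternative is the part that would fall to a language model.

```python
import re
from html.parser import HTMLParser

ACRONYM = re.compile(r"\b[A-Z]{2,}\b")  # assumption: 2+ capital letters is an acronym


class AcronymCollector(HTMLParser):
    """Separates acronyms inside <abbr> from acronyms in plain text."""

    def __init__(self):
        super().__init__()
        self.abbr_depth = 0
        self.marked_up = set()
        self.plain = []

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            self.abbr_depth += 1

    def handle_endtag(self, tag):
        if tag == "abbr" and self.abbr_depth:
            self.abbr_depth -= 1

    def handle_data(self, data):
        found = ACRONYM.findall(data)
        if self.abbr_depth:
            self.marked_up.update(found)
        else:
            self.plain.extend(found)


def find_unexplained_acronyms(html):
    """Return acronyms, in first-seen order, that never appear inside <abbr>."""
    collector = AcronymCollector()
    collector.feed(html)
    collector.close()
    unexplained = []
    for acronym in collector.plain:
        if acronym not in collector.marked_up and acronym not in unexplained:
            unexplained.append(acronym)
    return unexplained
```

An acronym that is marked up once is considered explained for the whole document, so only terms with no definition anywhere are reported.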

SC 1.3.2 - Meaningful Sequence

Modern CSS layout capabilities like Flexbox, Grid, and now grid-lanes (a.k.a. Masonry) give us tremendous visual flexibility, but they also mean that markup order and visual reading order are no longer necessarily in sync. A grid item placed last in the DOM might render first visually, and nothing in the current tooling will flag that as an accessibility concern.

That disconnect is exactly what motivated browser engineers like Rachel Andrew to push for a proper reading-order solution before Masonry layout shipped. She wrote about the problem and the initial ideas to solve it in Reading order and CSS layout, which is worth a read if you haven’t seen it. The result is the new reading-flow property.

CodePen demo: “Visual Order as Reading Order” by th3s4mur41 (@th3s4mur41).

A minimal demo showing how reading-flow can align keyboard and screen-reader navigation with the visual layout. The author info is visually placed under the heading but semantically lives in the footer; reading-flow resolves the mismatch without changing the DOM.

I believe AI could detect mismatches between visual reading order and DOM order, especially in these complex layout scenarios. Historically, when layout and markup order diverged, the fix usually meant restructuring the layout itself, a trade-off designers and developers were often reluctant to make. Flagging the issue would not necessarily have made it easier to resolve. But reading-flow is changing that: it gives developers a way to declare the intended reading sequence without touching the visual layout.

That makes AI detection of these mismatches much more useful in practice. It’s no longer just a warning with no good fix, but a prompt to apply a targeted, non-breaking correction.
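The core comparison behind such a check can be sketched in a few lines. Everything here is an assumption for illustration: `RenderedBox` stands in for position data a browser would provide, the visual order is approximated as top-to-bottom then left-to-right (a left-to-right, horizontal writing mode), and `row_tolerance` is an arbitrary pixel threshold for treating boxes as the same row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderedBox:
    name: str
    dom_index: int  # position in source order
    top: float      # rendered position in CSS pixels
    left: float


def visual_order(boxes, row_tolerance=8.0):
    """Approximate reading order: rows from top to bottom, left to right within a row."""
    rows = []
    for box in sorted(boxes, key=lambda b: b.top):
        if rows and abs(box.top - rows[-1][0].top) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b.left)]


def reading_order_mismatches(boxes, row_tolerance=8.0):
    """Return (dom_name, visual_name) pairs for positions where the two orders differ."""
    dom = sorted(boxes, key=lambda b: b.dom_index)
    visual = visual_order(boxes, row_tolerance)
    return [(d.name, v.name) for d, v in zip(dom, visual) if d.name != v.name]
```

With a heading, a body, and an author block that sits last in the DOM but renders directly under the heading, the function reports that the body and author positions are swapped. A geometric comparison like this cannot tell whether the divergence is harmful, and that judgment is where AI would come in.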

To be clear: it’s not AI’s job to apply reading-flow. The role of AI here is detection and raising awareness, while the fix remains a deliberate developer decision.

SC 1.4.3 & 1.4.11 - Contrast

I do see potential for AI to improve contrast testing, but let’s be honest: this is the most fragile of all the checks in this list. That said, the same is true for manual contrast testing—it’s genuinely hard to evaluate in complex rendering scenarios.

Current static tools already struggle in some well-known situations: when the background isn’t a plain color, or when content and background elements aren’t in the same stacking context. As I described in a previous post on contrast tool limitations, these edge cases often produce incorrect results or are simply skipped. AI could potentially close those gaps by analyzing the actual rendered output rather than relying on computed CSS values alone.
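For reference, the contrast ratio itself is simple arithmetic. The sketch below implements the relative luminance and contrast ratio formulas as defined in WCAG 2.x, then adds a hypothetical `worst_case_contrast` helper. The helper assumes that the background pixels behind a piece of text have already been sampled from a rendered screenshot, which is the hard part and is not shown here.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Return the WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def worst_case_contrast(foreground, background_samples):
    """Lowest ratio across sampled background pixels, e.g. behind text on an image."""
    return min(contrast_ratio(foreground, sample) for sample in background_samples)
```

Taking the minimum over the samples is a deliberately strict choice: white text over a photo passes only if it stays readable against the lightest sampled pixel. SC 1.4.3 asks for at least 4.5:1 for normal-size text, and SC 1.4.11 asks for at least 3:1 for non-text elements.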

The challenge is that this kind of pixel-level analysis across breakpoints and layout states would likely be fragile in early versions and require careful calibration to be useful rather than noisy.

A Word on False Positives

Based on what I’ve seen, AI-powered scanning enhancements are more likely to carry false positives than rule-based checks, especially in early versions. The ratio of false positives to real findings could be critical to adoption. Too many false alarms breed alert fatigue and rejection. This is why AI enhancements should always build on top of static rules, not replace them.
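One way to put a number on that ratio is precision, the share of reported findings that turn out to be real. The sketch below also shows what layering on top of static rules could mean in practice. It assumes that AI findings carry a `confidence` score between 0 and 1, which is a hypothetical field, and the 0.8 default is an arbitrary starting point rather than a recommendation.

```python
def precision(true_positives, false_positives):
    """Share of reported findings that were real issues."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


def merge_findings(static_findings, ai_findings, min_confidence=0.8):
    """Keep every static finding; add AI findings only above a confidence threshold."""
    confident = [f for f in ai_findings if f["confidence"] >= min_confidence]
    return list(static_findings) + confident
```

If a review of 40 AI findings confirms 30 of them, `precision(30, 10)` is 0.75, meaning one in four alerts was noise. Raising `min_confidence` trades missed issues for fewer false alarms, and the static findings are unaffected either way.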

From Findings to Action: Triage and Fixes

Finding the bug is only 20% of the battle. The real “developer friction” happens when turning a list of violations into a clear, actionable plan. This applies whether findings come from automated scanners, manual testing, or formal accessibility audits. This is where AI has just as much to offer as it does in scanning.

Grouping and Prioritization

AI can even help before triage begins, by extracting violations from audit reports and documentation and turning them into prioritizable items in the team’s existing tracking workflow. Getting findings into the right place is half the battle.

But having the tickets is only the start. In a recent project, we had three separate tickets: one for a video player’s focus indicator, one for control verbosity, and one for a specific button. The issue was not the number of tickets, but the missing link between them. Without that connection, developers may work on each issue independently, which can lead to redundant work or even conflicting solutions.

AI excels at pattern recognition. It can look at those three violations, see they all originate from the same VideoPlayer component, and group them into a single “Component Refactor” task. This turns a cluttered backlog into a clear, prioritized roadmap.
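Once each finding is mapped to a component, the grouping step itself is straightforward. The sketch below assumes that mapping has already happened, for example through an AI-assisted step that fills a hypothetical `component` field. The field names and the `min_findings` threshold are illustrative.

```python
from collections import defaultdict


def group_by_component(findings):
    """Map each component name to the titles of the findings that touch it."""
    groups = defaultdict(list)
    for finding in findings:
        groups[finding["component"]].append(finding["title"])
    return dict(groups)


def refactor_candidates(findings, min_findings=2):
    """Components with several findings, which may deserve one combined task."""
    grouped = group_by_component(findings)
    return {name: titles for name, titles in grouped.items() if len(titles) >= min_findings}
```

Applied to the three tickets from the example, all tagged with `VideoPlayer`, this returns a single entry listing the focus indicator, control verbosity, and button issues together.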

A recent real-world example comes from GitHub itself. In Continuous AI for accessibility: How GitHub transforms feedback into inclusion, Carie Fisher describes how GitHub uses AI-assisted triage to enrich incoming accessibility reports with structured metadata, route them to the right owners, and connect external feedback to existing audit findings. That matters because prioritization is rarely just about severity in isolation; it is also about recognizing duplicates, clustering related signals, and weighing real user impact.

Context-Aware Fixes vs. Copy-Paste Disasters

We’ve all seen it: a developer sees an a11y error, copies the “Suggested Fix” from a static documentation site, and inadvertently makes the experience worse because the suggestion didn’t account for the surrounding code.

AI doesn’t just provide a fix; it provides a harmonious fix. Given the specific code snippet, the WCAG violation, and custom instructions (such as general a11y best practices or framework-specific accessibility practices), it can generate a solution that respects your existing framework, state management, and ARIA patterns. It moves us from “potential fixes” to “ready-to-merge code.”
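The three inputs mentioned above can be combined into a single request. The sketch below only assembles the text: `build_fix_prompt` is a hypothetical helper, the wording of the prompt is an assumption, and sending it to a model is left out because that depends on the tool in use.

```python
def build_fix_prompt(snippet, violation, instructions):
    """Combine code, violation, and project rules into one context-rich request."""
    rules = "\n".join(f"- {rule}" for rule in instructions)
    return (
        "Fix the accessibility violation in the code below.\n\n"
        f"Violation: {violation}\n\n"
        f"Project instructions:\n{rules}\n\n"
        f"Code:\n{snippet}\n\n"
        "Keep the existing framework, state management, and ARIA patterns."
    )
```

A call might pass a component snippet, a violation such as "SC 4.1.2: button has no accessible name", and a short list of team conventions. Keeping the project instructions in a separate list makes it easy to reuse them across every finding.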

Bridging the Gap Between Audit and Code

Audits are often the “source of truth,” but they come with a major hurdle: testers are rarely developers.

Testers can describe what they expect—a keyboard-navigable media player, a properly announced button, a logical focus order—but it’s not their role to specify how that gets implemented. That’s the job of developers and accessibility experts, who need to weigh technical constraints, framework conventions, and sometimes the limits of native browser behavior.

A finding might lead to a small fix, a component refactor, or even a deliberate move away from a native element toward a custom solution. AI can help bridge that gap by comparing audit findings against the actual codebase and suggesting technically grounded directions, rather than leaving developers to guess from a tester’s description alone.

The “Gatekeeping” Challenge: Looking Toward WCAG 3.0

As Karl Groves pointed out in his What I Like About WCAG 3.0 article, WCAG 3.0 introduces nuances like clear language and algorithm fairness that are inherently resistant to traditional automation.

If we don’t evolve our tools, we risk pushing “gatekeeping” to the very end of the development cycle (the audit phase). To keep feedback loops short and “shift left” effectively, we must use AI to automate the “un-automatable.”

While AI will never replace the lived experience of a human tester, it can handle the heavy lifting across the entire flow: richer context-aware scanning, smarter triage, technically grounded fix suggestions, and a tighter bridge between audit findings and the developer’s workflow. The humans can then focus on what matters most: the complex, subjective nuances that truly define an inclusive experience.


What do you think? Are you ready to embrace AI-enhanced accessibility testing across the full workflow? And perhaps the more provocative question: would you accept some level of false positives in exchange for catching more issues early, or do you prefer the strict zero-false-positive approach that most current tools are built on? Let’s discuss.