How Do You Actually Conduct a Bias Audit?


Written by Shea Brown

Posted on 04/03/2026

Bias audits for AI hiring tools are increasingly a compliance and procurement requirement. In this article, Shea Brown, CEO of BABL AI, explains how a bias audit works in practice, including how to choose the right metric, what data to use, how NYC Local Law 144 fits in, and how BABL AI conducts independent audits under ISAE 3000.

 

Bias Audits for AI Hiring Tools

 

The question I get more than any other, from prospective clients, in our training programs, and from people I meet at conferences, is some version of this: “How do you actually conduct a bias audit?” It comes up so often that I wanted to write a clear, practical answer (from my point of view, of course).

I’m the CEO of BABL AI, an independent AI assurance firm founded in 2018. We specialize in technical AI audit and assurance for high-risk AI systems, with a particular focus on bias audits for AI hiring tools under NYC Local Law 144 and the growing body of similar regulations in Colorado, California, and the EU. Our auditors are trained and certified through our own rigorous internal certification program (built to the standards expected by Big 4 auditors and regulators), and our engagements are conducted under ISAE 3000, an international assurance standard from the same family that governs financial auditing, which means our reports are designed to hold up to regulatory scrutiny and enterprise due diligence. We have conducted these audits for numerous vendors and employers over the years, and we conduct a significant amount of research into methodologies for identifying and measuring bias in AI systems. What follows is my attempt to answer some common questions we get in plain terms.

Because we work with vendors a lot, I’m going to tailor this to them; however, most of these points are relevant for employers as well.

The fact is, if you build or sell an AI hiring tool, a bias audit is no longer a theoretical exercise. Under recent regulations, it is a compliance prerequisite, and enterprise procurement teams are pushing vendors to conduct these audits even without direct regulatory pressure. But compliance aside, a well-designed audit also tells you something genuinely useful about how your tool performs in the real world.

 

What Does “Bias” Actually Mean in These Audits?

 

In the regulatory context, bias refers to systematic performance differences in an AI system across protected categories: race, gender, ethnicity, disability status, and similar classes covered by anti-discrimination law. A bias audit measures whether your tool produces materially different outcomes for different demographic groups.

In practice, I find it useful to think about bias in a somewhat broader sense. Any systematic shift in a performance metric (see below) across different segments of the input space is, structurally, the same measurement problem. That framing connects bias testing to the broader discipline of model evaluation and validation, and it means that a rigorous audit can surface both discrimination risk and general performance gaps simultaneously.

 

How to Choose the Right Performance Metric for Your Bias Audit

 

Every bias audit starts with a metric (well, it actually starts with a careful risk/impact assessment, but we’ll leave that for another article). For AI hiring tools, the most commonly used is selection rate: the probability that a candidate who applies is advanced, recommended, or selected by the system. But depending on what your tool does, other metrics may be equally relevant and useful, e.g.:

  • False positive rate (incorrectly advancing an unqualified candidate)
  • False negative rate (incorrectly filtering out a qualified candidate)
  • True positive rate / recall (how often qualified candidates are correctly advanced)
  • Score distributions across demographic groups

The right metric depends on what your tool does and what the downstream hiring decision looks like. This is one of the first conversations you should have, because the choice of metric directly impacts the quality of the audit; you must identify which metric is most defensible for your specific use case, and why.
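To make the metric choices above concrete, here is a minimal Python sketch of how the listed rates are typically computed from labeled outcomes. The field names (`advanced`, `qualified`) are my own illustrative assumptions, not terms drawn from any regulation or any particular tool:

```python
# Illustrative sketch: computing common bias-audit metrics from labeled
# outcomes. "advanced" = the tool advanced the candidate; "qualified" =
# ground-truth qualification. Both names are hypothetical examples.

def selection_rate(advanced):
    """Fraction of applicants the tool advanced."""
    return sum(advanced) / len(advanced)

def rates(advanced, qualified):
    """False positive, false negative, and true positive rates."""
    fp = sum(a and not q for a, q in zip(advanced, qualified))
    fn = sum(q and not a for a, q in zip(advanced, qualified))
    tp = sum(a and q for a, q in zip(advanced, qualified))
    tn = sum(not a and not q for a, q in zip(advanced, qualified))
    return {
        "fpr": fp / (fp + tn) if fp + tn else None,  # unqualified wrongly advanced
        "fnr": fn / (fn + tp) if fn + tp else None,  # qualified wrongly filtered out
        "tpr": tp / (tp + fn) if tp + fn else None,  # recall on qualified candidates
    }
```

In an audit, each of these would be computed per demographic group and the cross-group differences compared; the point of the sketch is simply that each listed metric reduces to a ratio over the confusion matrix.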

In the case of NYC Local Law 144, the published guidelines require impact ratios derived either from selection rate or “scoring rate,” which is the rate at which a particular group scores above the median score of the overall applicant pool.
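A minimal sketch of the scoring-rate impact ratio described above, assuming each record is a (group, score) pair; the group labels and scores here are hypothetical illustration data, not from any real audit:

```python
# Sketch of an LL144-style "scoring rate" impact ratio: for each group,
# the fraction scoring above the overall pool median, normalized by the
# highest group's rate. Illustration only.
from statistics import median

def scoring_rate_impact_ratios(records):
    """records: iterable of (group, score). Returns {group: impact_ratio}."""
    scores = [s for _, s in records]
    med = median(scores)  # median of the overall applicant pool
    groups = {}
    for g, s in records:
        groups.setdefault(g, []).append(s)
    # Scoring rate: fraction of the group scoring above the pool median.
    rate = {g: sum(s > med for s in ss) / len(ss) for g, ss in groups.items()}
    top = max(rate.values())
    return {g: r / top for g, r in rate.items()} if top else rate
```

Note that the median is taken over the whole applicant pool, not per group; normalizing by the highest-scoring group is what turns per-group rates into the impact ratios the guidelines ask for.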

 

What Data Should You Use for a Bias Audit?

 

Once you have a metric, the next question is the one I find most practically challenging for vendors: What data do you actually have? The answer shapes everything else about the audit, and it varies more than people expect.

The best-case scenario is that you have real applicant data with self-declared demographic information. Some vendors get there through a special arrangement with employer clients who share their ATS data for audit purposes. Others build their own intake process, collecting consent from applicants to use their data specifically for bias testing. Either path can work well, but both require careful attention to data use agreements, consent language, and applicable privacy law.

More often, vendors come to us with little or no data that includes demographic information. This is common, and it is not disqualifying. In these situations, which NYC LL144 refers to as “test data” scenarios, you have a few options: you can attempt to infer demographic characteristics from available signals, or you can source synthetic or representative test data from elsewhere to run the audit against. Both approaches carry tradeoffs, and both require documentation of your methodology. I would also strongly recommend talking to a privacy professional before going down either path; the rules around demographic inference and synthetic data use are jurisdiction-specific and still evolving.

Beyond the demographic data question, there are additional scoping decisions that must be made and justified: which geographies does the audit cover, which industries, which job levels or functions? A tool used for hourly warehouse hiring behaves differently than one used for professional services roles, and the audit should reflect that. These are not just technical choices; they are decisions that affect whether the results are credible and defensible to a regulator or a sophisticated enterprise buyer.

The throughline across all of this is documentation. Whatever data you use and whatever scoping decisions you make, write them down and explain why. In my experience, regulators and clients alike respond well to good-faith efforts that are clearly reasoned and transparent. What they respond poorly to is a gap with no explanation. Do your best with what you have available; that genuinely matters.

 

Can a Bias Audit Go Beyond Protected Categories?

 

A rigorous bias audit need not stop at demographic slices. The same analytical method (measuring performance metric variance across input segments) applies to other dimensions that affect how your tool behaves in production:

  • Job description characteristics: industry, seniority level, how requirements are phrased
  • Geographic variation: does the tool perform consistently across NYC, California, and Colorado deployments? What about EU member states?
  • Data quality and input format: resume formatting, completeness, language

This is where bias testing and general model evaluation and validation converge. A systematic performance drop on a particular job type is structurally identical (from a measurement perspective) to a disparity across racial groups; it is a signal that the model is not behaving uniformly across its intended operating context. Both matter for vendor credibility, and both matter for the employers who are deploying your tool and relying on it to make consequential decisions about people.
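Because the measurement problem is the same, the code is the same too. Here is a generic sketch where the segment key could be a job family, a geography, or a resume-format flag just as easily as a demographic group; all names are assumptions for illustration:

```python
# Illustrative sketch: computing a selection rate across arbitrary input
# segments. "segment" is any slicing key (job family, geography, resume
# format, or demographic group); names here are hypothetical.

def selection_rate_by_segment(rows):
    """rows: iterable of (segment, advanced_flag). Returns {segment: rate}."""
    counts = {}
    for seg, adv in rows:
        n, k = counts.get(seg, (0, 0))
        counts[seg] = (n + 1, k + bool(adv))
    return {seg: k / n for seg, (n, k) in counts.items()}
```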

 

Do AI Hiring Tool Vendors Need Their Own Bias Audit Under NYC LL144?

 

Another question I hear regularly is whether a deployer’s audit “covers” the vendor, or vice versa. In general, it does not. Under most new laws and regulations, both vendors and deployers bear independent compliance obligations. NYC LL144 places obligations on deployers (employers), but it does allow audits to happen at the vendor level under very specific circumstances: deployers provide their historical data (from use of that tool) to the independent auditors to analyze. Deployers who did not contribute their own data to your audit remain independently obligated, and your vendor-level audit does not automatically satisfy their requirements.

This distinction has real consequences for your sales process. Enterprise clients in NYC, California, and Colorado are increasingly requiring audit documentation as a procurement prerequisite. Following the December 2025 Comptroller audit, NYC’s Department of Consumer and Worker Protection (DCWP) has signaled a more proactive enforcement posture, which means the credibility and independence of your audit is now a due diligence question your buyers will ask directly.

From what I have seen, vendors who treat the audit as a checkbox tend to end up redoing it. Vendors who engage with it as a genuine assessment of their tool’s performance tend to get more durable value from it, both in terms of compliance and in terms of what they learn about their product.

 

How BABL AI Conducts a Bias Audit

 

At BABL AI, every engagement follows a consistent four-step structure under ISAE 3000 assurance standards. First, we scope the engagement by identifying the relevant normative framework (whether that is NYC Local Law 144, NIST AI RMF, EU AI Act Article 9, or another applicable standard) and agreeing the terms with the client. Second, we design our procedures based on a risk assessment of where material issues are most likely to arise. Third, we execute those procedures against BABL AI’s proprietary AI System-Level Assurance criteria, which are organized around three pillars: testing and evaluation (system definition, test data, metric definitions, and results), risk assessment (completion, analysis, and prioritization), and governance (accountable party, defined duties, and performance of duties). Fourth, we form our conclusion and issue a formal assurance report.

Engagements take one of two forms. In an attestation engagement, you have already tested your system and we independently verify your reported results. In a direct engagement, we independently measure and evaluate the AI system ourselves and report on those results. This is the more thorough path, and what organizations typically need when demonstrating genuine due diligence to an external audience. Most engagements run five to eight weeks from kickoff to final report, and our reports are designed to satisfy multiple regulatory regimes simultaneously so that vendors operating across jurisdictions do not need to repeat the exercise for each one.

 

The Bottom Line

 

A bias audit, at its core, is a structured measurement exercise: pick a metric, define your population, segment the input space, and measure systematic differences. The regulatory requirements under LL144, Colorado SB24-205, California’s ADS rules, and the EU AI Act each specify how that exercise must be documented and disclosed (and who’s allowed to do it), but the underlying logic is consistent across all of them.

For AI hiring tool vendors operating across multiple regulated jurisdictions, the most efficient path is a single audit engagement designed to satisfy all of them simultaneously, conducted by a credible independent auditor with a methodology that holds up to scrutiny.

 

Frequently Asked Questions:

 

What is NYC Local Law 144, and to whom does it apply?

 

NYC Local Law 144 requires employers and employment agencies that use automated employment decision tools (AEDTs) in hiring or promotion decisions affecting NYC-based candidates to conduct an annual independent bias audit and publicly post the results. It applies to the employers who deploy AEDTs, and it is enforced by the NYC Department of Consumer and Worker Protection (DCWP).

How often does an AEDT bias audit need to be renewed?

 

Under LL144, bias audits must be conducted and published annually. Colorado SB24-205 and California’s ADS regulations do not strictly require independent audits, though California’s regulations strongly encourage them. The annual cadence makes the auditor relationship an ongoing compliance partnership rather than a one-time engagement, which is something I always flag for vendors planning their compliance budgets.

Does my vendor’s LL144 audit cover me as a deployer?

 

Generally, no. A vendor audit covers the tool’s performance on the vendor’s historical data, potentially across many employers. If you, as a deployer, did not contribute your own workforce data to that audit, you likely have an independent obligation to conduct your own. Laws other than LL144 may also cover more protected characteristics, so even if you are covered by the vendor’s audit for LL144 purposes, that coverage may not transfer to other regulatory regimes.

What is the four-fifths rule in AI hiring bias testing?

 

The four-fifths rule, drawn from EEOC adverse impact analysis, holds that a selection rate for any demographic group that is less than 80% of the rate for the highest-selected group signals potential adverse impact. Under LL144, auditors must calculate and report these impact ratios by race, ethnicity, and sex. A ratio below 0.80 does not automatically mean a violation, but it is the standard threshold that triggers closer scrutiny.
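The threshold check itself is simple to express in code. A minimal sketch, with hypothetical group names and rates chosen purely for illustration:

```python
# Sketch of the four-fifths check: flag any group whose selection rate
# falls below 80% of the highest group's rate. Group names and rates
# are illustrative, not from any real audit.

def four_fifths_flags(selection_rates, threshold=0.8):
    """selection_rates: {group: rate}. Returns {group: impact_ratio} for
    groups whose impact ratio falls below the threshold."""
    top = max(selection_rates.values())
    return {g: rate / top
            for g, rate in selection_rates.items()
            if top and rate / top < threshold}
```

For example, with rates of 0.50 for one group and 0.35 for another, the second group has an impact ratio of 0.70 and would be flagged for closer scrutiny, consistent with the description above: a flag, not an automatic violation.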

What is an independent bias auditor, and why does independence matter?

 

An independent bias auditor is a third party with no financial or organizational relationship to the AI system being evaluated. Independence matters because regulators, enterprise procurement teams, and courts all treat self-reported testing differently from third-party verified results. Under ISAE 3000, the standard BABL AI uses, independence is a formal requirement, not just a preference. An audit conducted by the vendor’s own team, or by a firm with a consulting relationship to the vendor, will face credibility challenges in any regulatory inquiry.

What is ISAE 3000, and why does it matter for AI audits?

 

ISAE 3000 is the International Standard on Assurance Engagements for subject matter other than financial statements. It is the same framework used by accounting firms for non-financial assurance work. Applying it to AI audits means the engagement follows defined standards for evidence gathering, reporting, and auditor independence — which makes the resulting report more defensible to regulators and enterprise buyers than an audit conducted under no formal standard at all.

What is the difference between an attestation engagement and a direct engagement?

 

In an attestation engagement, the organization has already tested its system and documented the results; the auditor independently verifies those reported results against applicable criteria. In a direct engagement, the auditor independently conducts the testing and measurement themselves, then reports on those results. Direct engagements are more thorough and provide stronger evidence of due diligence; they are generally what organizations need when the audience is a regulator, an enterprise procurement team, or a sophisticated legal or compliance function.

What happens if a deployer fails to conduct a bias audit under LL144?

 

DCWP has enforcement authority under LL144, including the ability to issue fines for non-compliance. Following the December 2025 Comptroller audit, DCWP has signaled a more proactive enforcement posture, including using EEOC and FTC filings as leads for enforcement inquiries. Large, identifiable employers with high-volume NYC hiring are at the most material risk. Beyond regulatory penalties, non-compliance increasingly creates procurement friction, as enterprise contract requirements and vendor questionnaires now routinely ask for audit documentation.

Does the EU AI Act require bias audits for AI hiring tools?

 

The EU AI Act classifies AI systems used in employment, worker management, and access to self-employment as high-risk systems under Annex III. High-risk systems are subject to conformity assessment requirements before deployment, which include technical documentation, risk management systems, data governance standards, and human oversight requirements. While the EU AI Act does not use the term “bias audit,” the conformity assessment requirements closely parallel what a rigorous independent audit addresses, and US vendors selling AI hiring tools into European markets face these obligations directly. Note that, as of now, Annex III systems do NOT need an external conformity assessment; it can be conducted internally by the provider.

Can a single audit satisfy compliance obligations across multiple jurisdictions?

 

Yes, if the audit is designed with that goal from the outset. BABL AI’s engagements are specifically structured to address the requirements of LL144, Colorado SB24-205, California’s ADS regulations, and EU AI Act Article 9 simultaneously. The key is ensuring that the methodology, documentation, and scope are calibrated to the most demanding of the applicable standards, which typically means the audit ends up exceeding the minimum requirements of each individual regime.

BABL AI is an independent AI assurance firm specializing in bias audits and technical AI assurance for high-risk AI systems. We have conducted audits for AI hiring tool vendors, large employers, and recruitment process outsourcing firms across the US and EU. If you have questions about your specific situation, contact us.
