Algorithmic bias is pervasive in health care. It needn’t be

The World Health Organization issued its first global report on artificial intelligence in late June, highlighting concerns about algorithmic bias in health care applications of AI. The report accompanies a growing number of news stories exposing AI’s shortfalls.

AI has come of age through the alchemy of cheap parallel (cloud) computing combined with the availability of big data and better algorithms. Problems that seemed unconquerable a few years ago are being solved, at times with startling gains: think instant language translation, self-driving cars, and human-like robots. AI’s arrival in health care, however, has been markedly slower. Perhaps the industry’s resistance to fundamental change is to blame, or its sluggish digital transformation. But the reason may be simpler: The stakes are much higher.

The complexity and criticality of health care’s issues far outweigh those of other applications. Nobody dies if an iPhone’s ad fails to inspire someone’s inner consumer. In fact, the arrival of AI may be less about health care being “late” and more about AI finally being ready for health care.


That’s good, because the industry is at a tipping point. Health care organizations are confronting forces so strong that they may yield only when the power of AI is brought to bear. At the same time, the industry’s level of spending is unsustainable and is converging into a crisis of access, affordability, and equity. Health care spending reached $3.8 trillion in 2019, the last year with full details, and is projected to top $6.2 trillion by 2028. A staggering 25% of this, currently $1 trillion per year, is wasted and does nothing to produce better health. And unacceptable health disparities exist in the U.S.: Covid-19 both revealed and deepened them as it devastated and disproportionately affected already disadvantaged groups and communities of color.

Health care’s AI algorithms

The transformation to value-based care is key to addressing these issues. AI will play an increasingly central role in targeting interventions, rewarding performance, and distributing resources, including funding.


Here’s the problem: If health care’s algorithms are biased — something that researchers at the Center for Applied Artificial Intelligence at the University of Chicago’s Booth School of Business have concluded — then AI solutions designed to drive better outcomes can make things worse.

The good news is that experts also say that although algorithmic bias is currently pervasive, it is not inevitable. And they are creating playbooks to define the processes and tools that can help measure and address it. The first step to combating bias is knowing what gives rise to it.

How bias creeps in

To identify bias, it helps to distinguish between two main types: subgroup invalidity and label choice bias. Subgroup invalidity occurs when an algorithm is aimed at an ideal target but fails to hit it for a certain group, perhaps because the data underrepresent that group or exclude specific factors that predict its risk. Label choice bias occurs when an algorithm is aimed from the start at something other than an ideal target. That can happen when data about the ideal target are not available to train the algorithm and a proxy is used instead, such as health care costs as a stand-in for an individual’s health risk. Label choice bias is far more common than subgroup invalidity.

When algorithms are designed to predict their ideal target, the specific calculations that assess accuracy and how well the model is calibrated for each subgroup help confirm the presence of bias. But those same basic checks fail when an algorithm’s target is a biased proxy. High accuracy is not a useful signal when the proxy itself encodes bias, which is why label choice bias often goes undetected.
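As a rough illustration of the per-subgroup check described above, the sketch below compares each group's average predicted risk with its observed event rate. The function name, the record layout, and the toy data are all hypothetical. Note that a check like this is only meaningful when the recorded outcome is the ideal target: if the outcome is a biased proxy, the numbers can line up nicely while the bias remains.

```python
from collections import defaultdict


def calibration_by_group(records):
    """Summarize calibration per subgroup.

    `records` is an iterable of (group, predicted_probability, outcome)
    tuples, where outcome is 0 or 1. Returns a dict mapping each group to
    (mean predicted probability, observed event rate, group size).
    """
    totals = defaultdict(lambda: [0.0, 0, 0])  # [sum of predictions, events, n]
    for group, predicted, outcome in records:
        entry = totals[group]
        entry[0] += predicted
        entry[1] += outcome
        entry[2] += 1
    return {g: (t[0] / t[2], t[1] / t[2], t[2]) for g, t in totals.items()}


# Hypothetical toy data: both groups receive the same predictions,
# but group B experiences the event twice as often as group A.
records = [
    ("A", 0.10, 0), ("A", 0.20, 0), ("A", 0.30, 1), ("A", 0.20, 0),
    ("B", 0.10, 0), ("B", 0.20, 1), ("B", 0.30, 1), ("B", 0.20, 0),
]

for group, (mean_pred, observed, n) in sorted(calibration_by_group(records).items()):
    print(f"group {group}: mean predicted={mean_pred:.2f}, observed={observed:.2f}, n={n}")
```

A large gap between the two rates for one group, but not for others, is the kind of signal that points toward subgroup invalidity.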

It’s also important to distinguish between algorithmic bias and fairness. Fairness is less about the algorithm and more about how it is used and what decisions it informs. Reducing bias and differential performance may not be sufficient to eliminate fairness concerns, especially when predictions are used to prioritize who gets access to scarce health care resources and higher (or lower) scores clearly lead to favorable (or unfavorable) decisions. In such instances, it is important to assess whether groups are being prioritized in proportion to their actual need.

A common metric used to evaluate algorithmic fairness is called disparate impact. The metric calculates the proportion of each group that receives a positive benefit (offered a job interview, prioritized for care management, and the like) and compares those proportions between groups. The standard benchmark is the four-fifths rule: If one group’s rate of receiving a positive benefit is less than 80% of another group’s rate, the algorithm or model is deemed unfair.
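The calculation can be sketched in a few lines. The function names and the toy selection data below are hypothetical, and the 0.8 threshold is the four-fifths benchmark mentioned above.

```python
def selection_rate(selected_flags):
    """Proportion of a group that received the positive benefit (1 = selected)."""
    return sum(selected_flags) / len(selected_flags)


def disparate_impact_ratio(group_a, group_b):
    """Ratio of the lower selection rate to the higher one (1.0 = parity)."""
    rate_a = selection_rate(group_a)
    rate_b = selection_rate(group_b)
    higher = max(rate_a, rate_b)
    if higher == 0:
        return 1.0  # nobody was selected in either group
    return min(rate_a, rate_b) / higher


def passes_four_fifths(group_a, group_b, threshold=0.8):
    """True if the less-selected group's rate is at least 80% of the other's."""
    return disparate_impact_ratio(group_a, group_b) >= threshold


# Hypothetical data: 30% of group A and 20% of group B are selected.
group_a = [1] * 30 + [0] * 70
group_b = [1] * 20 + [0] * 80

ratio = disparate_impact_ratio(group_a, group_b)
print(f"ratio={ratio:.2f}, passes four-fifths rule: {passes_four_fifths(group_a, group_b)}")
```

Nothing in this calculation looks at how often each group actually needs the benefit; it compares selection rates only.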

Disparate impact metrics, which are commonly used in U.S. labor law, are completely unsuited to health care. They assume that different groups have the same predicted event rate and assess fairness based on that assumption. Using such a metric in health care could lead to nonsensical results and policies. For example, it would imply that because we should expect similar rates of breast cancer in men and women, policies and programs should strive for men and women to receive the same number of mammograms. In health care, the ability to handle differences in event rates is paramount. Moreover, while some differences are expected, like the number of mammograms by sex, others are at times the very issue under scrutiny, such as pregnancy-related deaths by race.

In our submission for the Centers for Medicare and Medicaid Services’ Artificial Intelligence Health Outcomes Challenge, my colleagues and I demonstrated a new fairness metric called group benefit equality (GBE) that explicitly addresses these and other shortcomings. Group benefit equality accommodates and adjusts for differences in disease prevalence that can exist by age, sex, race, geography, or other variables, and quantifies the extent to which groups are, or aren’t, allocated resources in proportion to their actual need. This helps health care organizations adjust program criteria and enrollment thresholds, if necessary.

For example, the model we submitted for the challenge predicted the risk of 12 adverse events occurring over a 30-day period, including unplanned hospital or skilled nursing facility admissions, serious fall-related injuries, and hospital-acquired infections. We highlighted two groups out of 11 with GBE values substantially less than 1.0, meaning their predicted risk was less than their actual risk. If such a model were then used to prioritize care management, those groups would be underrepresented in the high-risk group and targeted less often to receive additional care support services. We then demonstrated how the GBE metric could be used to calculate an appropriate adjustment to each group’s enrollment threshold so the disparity would disappear. In this example, the two groups would each attain a GBE value of 1.0 if the program targeted the top 7.7% of risk for the first group and the top 6.3% for the other, instead of the top 5%.
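The article does not spell out the GBE formula, so the sketch below rests on an assumption: that a group's GBE is the number of its members flagged for the program divided by the number who actually experienced the event. Under that assumption, a value below 1.0 means the group is being targeted less than its need warrants. The function names and the toy group are hypothetical, and the figures do not reproduce the challenge results.

```python
def group_benefit_equality(flagged, had_event):
    """Assumed form of GBE: members flagged / members who had the event."""
    events = sum(had_event)
    if events == 0:
        raise ValueError("GBE is undefined for a group with no observed events")
    return sum(flagged) / events


def flag_top_fraction(scores, fraction):
    """Flag the highest-scoring `fraction` of a group (1 = flagged)."""
    count = round(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:count])
    return [1 if i in top else 0 for i in range(len(scores))]


# Hypothetical group of 20 people with distinct risk scores;
# two of them actually experienced the adverse event.
scores = [i / 20 for i in range(20)]
had_event = [0] * 15 + [1] + [0] * 3 + [1]

gbe_at_5_percent = group_benefit_equality(flag_top_fraction(scores, 0.05), had_event)
gbe_at_10_percent = group_benefit_equality(flag_top_fraction(scores, 0.10), had_event)
print(f"GBE when targeting top 5%:  {gbe_at_5_percent:.2f}")
print(f"GBE when targeting top 10%: {gbe_at_10_percent:.2f}")
```

In this toy group, widening the enrollment threshold from the top 5% to the top 10% moves the assumed GBE from below 1.0 to exactly 1.0, which mirrors the kind of per-group threshold adjustment described above.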

The importance of explainable AI

Despite the pervasiveness of bias, experts have identified several ways to counterbalance it. Some say the key is having AI that is explainable, meaning it moves away from simply giving risk scores. Instead, it surfaces and quantifies the specific factors that contribute most to a person’s health risk and that best explain why they got a particular score. Explainable AI gives health care organizations visibility into the inner workings of their AI systems, allowing them to design and implement unbiased algorithms and to fix issues that arise. They need to be able to measure when bias is present, know what factors drive it, test and choose less-biased alternatives, and continuously monitor algorithms for drift, meaning a degradation in accuracy or a worsening of bias due to changes in data, populations, or business practices. In health care, such capabilities are must-haves, not nice-to-haves.
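To make the idea of surfacing contributing factors concrete, here is a minimal sketch for the simplest possible case, a linear model on the log-odds scale, where each factor's contribution is just its weight times its value. The weights, factor names, and patient record are invented for illustration; production models are usually nonlinear and need dedicated attribution methods, which this sketch does not attempt.

```python
import math

# Hypothetical model: weights on the log-odds scale (not from any real system).
WEIGHTS = {"age_over_75": 0.9, "prior_admissions": 0.6, "lives_alone": 0.4}
INTERCEPT = -3.0


def risk_score(features):
    """Predicted probability from a logistic model with the weights above."""
    log_odds = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-log_odds))


def top_factors(features, n=2):
    """Return the n factors with the largest absolute contribution to the log-odds."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


# Hypothetical patient record.
patient = {"age_over_75": 1, "prior_admissions": 3, "lives_alone": 0}

print(f"risk score: {risk_score(patient):.3f}")
for name, contribution in top_factors(patient):
    print(f"  {name}: {contribution:+.2f} log-odds")
```

Reporting the ranked factors alongside the score is what lets a reviewer ask whether a driver is clinically sensible or a stand-in for something it should not be.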

Timnit Gebru, a computer scientist and advocate for diversity in technology and a co-founder of Black in AI, once wrote that “in the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet describing its operating characteristics, test results, recommended usage, and other information” and made the point that something similar should exist for AI.

Health care should insist on it. Health care organizations are under tremendous pressure and face unprecedented challenges, many of which will require AI-enabled solutions. They need AI they can trust to help serve their missions to improve health, reduce costs, and promote equity. To do so requires AI algorithms that are fully transparent, deeply explainable, completely traceable, and able to be audited. Anything less is unacceptable.

Carol McCall is the chief health care analytics officer at