AI as the evaluator: Do algorithms replicate human bias?

IZA@LISER Network | May 12, 2026
New research reveals that while AI mimics the human habit of being "too nice" in subjective reviews, it significantly outperforms us when evaluations are grounded in objective data.

AI is often assumed to be entirely objective. A new study by Rainer Michael Rilke and Dirk Sliwka provides the first systematic evidence on how large language models (LLMs) behave when evaluating human performance, and on whether they replicate or reduce the well-known biases observed when human managers rate employees.

Why AI hesitates to give low ratings

The authors show that when performance information is subjective or ambiguous, LLMs tend to behave much like human supervisors: they avoid the lowest rating categories, cluster heavily around the midpoint of the scale, and display a clear tendency toward leniency. This becomes especially visible when the model is asked to rate S&P 500 CEOs. Even when instructed to assign 20 percent of CEOs to each rating category, the LLM almost never uses the lowest category, mirroring the reluctance of human evaluators to issue very negative assessments.

Judging groups vs. individuals

The authors also tested whether LLMs become more discerning when they evaluate several individuals at once rather than one at a time. The results mirror decades of psychological research on human raters: the model differentiates more when assessing groups of three or five CEOs simultaneously. Ratings spread out more, and relative differences become clearer. Yet the fundamental leniency persists, suggesting that the model's learned habits, shaped by overwhelmingly positive or neutral human-written texts, continue to dominate whenever objective standards are missing.

The job application experiment

To introduce clearer benchmarks, the researchers also tested the AI on job applications whose quality levels were artificially constructed. An LLM evaluated these applications without knowing their true quality. Once again, individual evaluations show strong leniency and limited use of the lower categories. Comparative evaluations, however, lead to more variation and better alignment with the intended distribution, especially when the rating scale explicitly ties each score to a percentile range. Still, the model remains hesitant to classify any application as belonging to the bottom 20 percent, even when prompted to do so.
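To make the idea of a percentile-tied scale concrete, here is a minimal sketch of what a strict forced distribution looks like when each of five scores covers one 20 percent band. The function name, the numeric quality scores, and the tie-breaking rule are illustrative assumptions and are not taken from the study.

```python
def forced_distribution_ratings(scores, n_categories=5):
    """Assign ratings 1..n_categories so that each category covers an
    equal share of the ranked items (20 percent each for five categories).

    Hypothetical helper for illustration; not the study's procedure.
    """
    n = len(scores)
    # Rank items from lowest to highest score; ties keep their input order.
    order = sorted(range(n), key=lambda i: scores[i])
    ratings = [0] * n
    for rank, idx in enumerate(order):
        # rank / n is the share of items ranked below this one, so the
        # bottom 20 percent map to 1, the next 20 percent to 2, and so on.
        ratings[idx] = rank * n_categories // n + 1
    return ratings


# Ten made-up quality scores (assumed data, for illustration only).
scores = [62, 88, 45, 71, 93, 55, 79, 67, 84, 50]
ratings = forced_distribution_ratings(scores)
print(ratings)
```

Under this rule the lowest category is always used for the bottom fifth of the ranking, which is exactly the category the model is reported to avoid.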

The power of objective data

The most decisive evidence comes from a controlled experiment in which human raters evaluated workers based on noisy but objective performance signals. Here, the LLM receives exactly the same information as the human evaluators. In this setting, the model performs remarkably well. It produces ratings that are substantially more accurate than those of human raters, shows no leniency bias, and closely approximates the mathematical ideal that represents the best possible use of the available information. Unlike humans, the LLM is unaffected by whether its rating influences a worker’s bonus, indicating that it does not display the social concerns or favoritism that often distort human evaluations.
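The article does not spell out the mathematical ideal. One common way to formalize "the best possible use" of a noisy signal is a Bayesian posterior mean, and the sketch below assumes that reading: true performance is normally distributed around a known average, and the signal adds independent normal noise. All names, the normality assumption, and the example numbers are hypothetical and may differ from the benchmark used in the study.

```python
def posterior_mean(signal, prior_mean, prior_var, noise_var):
    """Best guess of true performance given one noisy signal.

    Assumes a normal prior and normal noise (an illustrative assumption,
    not necessarily the study's benchmark).
    """
    # The noisier the signal, the less weight it gets and the more the
    # estimate is pulled back toward the average performance.
    weight = prior_var / (prior_var + noise_var)
    return prior_mean + weight * (signal - prior_mean)


def leniency_bias(ratings, benchmarks):
    """Average amount by which ratings exceed the benchmark values."""
    diffs = [r - b for r, b in zip(ratings, benchmarks)]
    return sum(diffs) / len(diffs)


# Example with made-up numbers: signal of 80 against an average of 50,
# with equal prior and noise variance, so the signal gets weight 0.5.
estimate = posterior_mean(signal=80, prior_mean=50, prior_var=100, noise_var=100)
print(estimate)
```

In this framing, a positive leniency_bias means a rater systematically scores workers above what the signals justify, while a value near zero corresponds to the absence of leniency reported for the LLM.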

What this means for management

Taken together, the findings reveal a clear pattern. When performance is subjective and evaluators must rely on general impressions, LLMs reproduce familiar human biases. When performance information is structured, comparable, and at least partly objective, LLMs can significantly outperform human raters. They process information more consistently and without social or emotional distortions. The results highlight both the promise and the limitations of using LLMs in organizational performance management. They are not a remedy for the challenges of subjective evaluation, but they can meaningfully improve accuracy in settings where objective signals exist and can be systematically interpreted.
