AI as the evaluator: Do algorithms replicate human bias?
AI is often assumed to be entirely objective, but a new study by Rainer Michael Rilke and Dirk Sliwka provides the first systematic evidence on how large language models (LLMs) behave when evaluating human performance—and whether they replicate or reduce well-known biases commonly observed when human managers rate employees.
Why AI hesitates to give low ratings
The authors show that when performance information is subjective or ambiguous, LLMs tend to behave much like human supervisors: they avoid the lowest rating categories, cluster heavily around the midpoint of the scale, and display a clear tendency toward leniency. This becomes especially visible when the model is asked to rate S&P 500 CEOs. Even when instructed to assign 20 percent of CEOs to each rating category, the LLM almost never uses the lowest category, mirroring the reluctance of human evaluators to issue very negative assessments.
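To make the pattern concrete, the following minimal sketch shows one way to compare an observed rating distribution with the 20-percent-per-category target. The function names and the sample ratings are hypothetical illustrations, not data or code from the study.

```python
from collections import Counter

def category_shares(ratings, categories=(1, 2, 3, 4, 5)):
    """Return the share of ratings that fall into each category."""
    counts = Counter(ratings)
    n = len(ratings)
    return {c: counts[c] / n for c in categories}

def deviation_from_target(shares, target=0.20):
    """Difference between each observed share and the target share."""
    return {c: round(s - target, 2) for c, s in shares.items()}

# Hypothetical ratings (1 = lowest, 5 = highest), invented for illustration.
hypothetical_ratings = [3, 4, 3, 3, 5, 4, 3, 2, 4, 3]

shares = category_shares(hypothetical_ratings)
gaps = deviation_from_target(shares)
print(shares)
print(gaps)
```

In this made-up sample the lowest category is never used and the midpoint is heavily overused, which is the kind of gap the article describes.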
Judging groups vs. individuals
The researchers also tested whether LLMs become more discerning when they evaluate several individuals at once rather than one at a time, and the results mirror decades of psychological research on human raters. The model differentiates more when assessing groups of three or five CEOs simultaneously: ratings spread out more, and relative differences become clearer. Yet the fundamental leniency persists, suggesting that the model’s learned habits, shaped by overwhelmingly positive or neutral human-written texts, continue to dominate whenever objective standards are missing.
The job application experiment
To introduce clearer benchmarks, the researchers also tested the AI on job applications whose quality levels were artificially constructed. An LLM evaluated these applications without knowing their true quality. Once again, individual evaluations show strong leniency and limited use of the lower categories. Comparative evaluations, however, lead to more variation and better alignment with the intended distribution, especially when the rating scale explicitly ties each score to a percentile range. Still, the model remains hesitant to classify any application as belonging to the bottom 20 percent, even when prompted to do so.
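A rating scale that ties each score to a percentile range can be sketched as follows. This is only an illustration under stated assumptions: it presumes a numeric quality score is available for every application, and the function name and sample scores are hypothetical rather than taken from the study.

```python
def quintile_ratings(scores):
    """Map scores to ratings 1..5 by rank, so each rating covers 20 percent.

    Rating 1 is the bottom 20 percent, rating 5 the top 20 percent.
    Ties are broken by position in the input list.
    """
    n = len(scores)
    # Indices ordered from the lowest to the highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    ratings = [0] * n
    for rank, idx in enumerate(order):
        # rank / n is the share of items ranked below this one;
        # multiplying by 5 and flooring selects the quintile (0..4).
        ratings[idx] = rank * 5 // n + 1
    return ratings

# Hypothetical quality scores for ten applications.
scores = [55, 90, 72, 61, 83, 47, 95, 68, 77, 50]
ratings = quintile_ratings(scores)
print(ratings)
```

A purely rank-based rule like this always fills the bottom category. The article's point is that the LLM, even when given such a scale, remains reluctant to do so.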
The power of objective data
The most decisive evidence comes from a controlled experiment in which human raters evaluated workers based on noisy but objective performance signals. Here, the LLM receives exactly the same information as the human evaluators. In this setting, the model performs remarkably well. It produces ratings that are substantially more accurate than those of human raters, shows no leniency bias, and closely approximates the mathematical ideal that represents the best possible use of the available information. Unlike humans, the LLM is unaffected by whether its rating influences a worker’s bonus, indicating that it does not display the social concerns or favoritism that often distort human evaluations.
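One common way to formalize the "best possible use" of a noisy signal is the Bayesian posterior mean. The sketch below assumes that true performance is normally distributed and that the observed signal equals true performance plus independent normal noise. These modelling choices, the function name, and the numbers are assumptions for illustration; the article does not specify the benchmark used in the study, which may differ.

```python
def posterior_mean(signal, prior_mean, prior_var, noise_var):
    """Best estimate of true performance under a normal-normal model.

    Assumes true performance ~ N(prior_mean, prior_var) and
    signal = true performance + noise, with noise ~ N(0, noise_var).
    """
    # Weight on the signal: close to 1 for a precise signal,
    # close to 0 for a very noisy one.
    weight = prior_var / (prior_var + noise_var)
    return prior_mean + weight * (signal - prior_mean)

# Hypothetical numbers: average performance 60, observed signal 80,
# and equal variance of true performance and noise.
estimate = posterior_mean(signal=80, prior_mean=60, prior_var=100, noise_var=100)
print(estimate)  # 70.0
```

Under these assumptions, a rater who consistently rates above this estimate is lenient, and one who rates close to it is using the available information efficiently.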
What this means for management
Taken together, the findings reveal a clear pattern. When performance is subjective and evaluators must rely on general impressions, LLMs reproduce familiar human biases. When performance information is structured, comparable, and at least partly objective, LLMs can significantly outperform human raters. They process information more consistently and without social or emotional distortions. The results highlight both the promise and the limitations of using LLMs in organizational performance management. They are not a remedy for the challenges of subjective evaluation, but they can meaningfully improve accuracy in settings where objective signals exist and can be systematically interpreted.