Gen AI Could Fix Performance Reviews—or Make Them Even Worse

Summary:

Generative AI can greatly improve the value of performance reviews, but most companies are just using it to produce polished versions of traditional narrative reviews more quickly rather than improve them. A better approach is to use gen AI to surface direct evidence of performance—decisions, influence, mentorship, and problem solving embedded in everyday work. Done well, AI could shift reviews from persuasive storytelling to verifiable behavioral evidence, creating more accurate, transparent, and development-oriented evaluation systems.

Enterprises are rapidly deploying generative AI systems to streamline performance reviews. Citi’s Performance Assist pulls data from across its organization to draft evaluations. JPMorgan’s LLM Suite supports the writing of year-end reviews. Boston Consulting Group’s internal AI assistant reportedly cuts review-writing time by 40%. These systems demonstrate impressive capability, but so far most organizations are using them simply to produce polished versions of traditional narrative reviews faster.

Polished, however, is not the same as reliable. By smoothing out how managers describe performance, these systems can make evaluations feel more consistent and credible than they are, masking the inconsistencies and blind spots that have long defined them. Yet the same technology could do something far more valuable: shift performance reviews away from narratives about work and toward direct evidence of work in action, meaning what people actually did, decided, and influenced.

Deficiencies That Gen AI Could Address

Performance review narratives have always suffered from inconsistent evaluation and incomplete evidence. Research shows that different managers often describe identical performance in dramatically different ways, shaped by personal relationships, selective memory, and storytelling ability. When organizations have tried to make evaluations more objective, they have leaned on what is easiest to measure: outputs, metrics, and formal deliverables.

These approaches often miss the higher-order contributions that define exceptional performance: the strategic insight that redirects a failing initiative, the mentorship that accelerates others’ growth, the conflict resolution that keeps teams moving forward. These capabilities remain largely invisible, creating a persistent gap between what organizations can measure and what actually drives success.

Gen AI could be used to surface these contributions—I discuss how in this article. But most organizations are overlooking this opportunity and are pointing AI in the wrong direction.

The Wrong Approach

To ease the burden of performance evaluations, many companies are channeling AI into drafting review narratives. On the surface, this seems like progress. In reality, it risks amplifying the underlying problem. Documents produced with AI tend to converge toward the same fluent, confident tone. The variation that once distinguished careful evaluation from generic praise is collapsing into a persuasive but standardized voice.

This homogenization affects presentation, not substance. Managers are still working from incomplete observations and subjective impressions. AI simply makes all narratives sound equally convincing. The result is reviews that appear more dependable than they are, making poor information harder to detect.

A Better Path

Instead of helping managers write more compelling stories about performance, AI could help them surface and examine actual episodes of work. The same technology that is making weak evidence sound stronger could make strong evidence easier to find.

What does that shift look like in practice? Instead of asking an AI assistant to help compose a more compelling paragraph about an employee’s “strategic thinking,” an organization asks AI to surface the decision memos, project pivots, and cross-functional emails where that strategic thinking becomes visible. The performance review becomes anchored not in evaluative language but in primary source material. Managers and promotion committees examine the artifacts themselves: the original documents where judgment was exercised, influence was demonstrated, and outcomes were shaped.

Not “demonstrates strategic leadership” but the specific memo where a flawed assumption was surfaced and corrected. Not “navigates ambiguity” but the postmortem where a failed initiative was restructured. Not “shows exceptional cross-organizational influence” but the verbatim directives that drove a regional restructuring.

Behavioral evaluation becomes practical at scale: Thanks to AI, the previously high cost of retrieving and analyzing the pertinent evidence has collapsed. AI systems could analyze patterns of employee interaction to surface evidence of higher-order competencies that traditional metrics miss entirely. They could examine communication networks to identify employees who consistently help others solve problems, analyze decision-making patterns in project discussions to spot strategic thinking, or map influence flows across email threads and meeting transcripts to reveal leadership in action.

This shift is no longer theoretical. Alongside the AI deployments already underway, a parallel trend has been quietly laying the groundwork: More organizations are experimenting with evidence-based evaluation. Sales organizations have long evaluated individual reps against system-captured pipeline data (quota attainment, win rates, deal velocity, activity volumes) pulled directly from CRM dashboards. Amazon restructured its Forte review process to require its corporate employees to submit three to five concrete accomplishments, such as projects delivered, goals met, initiatives launched, or process improvements, as a key input to self-assessments.

Such efforts represent progress, but they remain constrained by what is easiest to measure. Metrics capture outputs, not the deeper contributions that shape outcomes, such as architectural foresight, mentorship, and collaborative leadership. That is a gap AI can now begin to close.

Tapping the Potential of Gen AI

The building blocks are demonstrably in place. Here’s what senior leaders can do now to turn what is possible into a reality:

Reframe the performance conversation around consequential moments rather than trait labels and assertions of excellence.

Instead of asking managers to describe an employee’s “leadership,” “strategic thinking,” or “ability to navigate ambiguity,” performance reviews should ask a different question: What moments in this person’s work most clearly reveal those capabilities? A single consequential episode—where an employee challenged a flawed assumption, redirected a failing project, or aligned stakeholders around a difficult trade-off—often reveals more about capability than a page of evaluative language. AI can help identify these episodes by scanning project records, communications, and artifacts to surface the inflection points where judgment mattered most. The performance discussion then shifts from debating adjectives to examining evidence: what decision was made, what reasoning supported it, what alternatives were considered, and what happened as a result.

Monday morning action: Ask different questions in your next performance conversation. Instead of “Rate your strategic thinking,” ask “What moment this quarter best revealed your strategic thinking?” Instead of “How would you assess your leadership?” ask “When did your influence most clearly change the direction of a project?”

Direct AI tools already deployed across the enterprise to surface behavioral evidence rather than polish narrative.

Most enterprises already deploy AI assistants embedded in everyday work tools—Microsoft Copilot, Google Workspace AI, Claude Cowork, or proprietary internal knowledge assistants. Today these systems are typically used to summarize documents or help managers produce more polished narratives. Organizations should instead direct them toward a different task: analyzing patterns of employee interaction to surface evidence of higher-order competencies.

Monday morning action: Use your existing AI assistant to search for moments where employees influenced decisions or helped solve problems, rather than asking it to draft review language. Try: “Find examples where [employee’s name] changed the direction of a project or helped someone solve a technical problem” instead of “Help me write a performance review for [employee’s name].”

Build governance that balances transparency with employee control and prevents surveillance drift.

The shift toward AI-curated behavioral evidence requires governance that balances verification, employee control, and clear boundaries.

Verification: AI should serve as a curator that points to verifiable sources, while humans retain all interpretive judgment. Every AI-surfaced piece of evidence must link directly back to its source artifact so managers can independently verify what occurred.
Employee control: Give employees control over their own evidence portfolio. Here’s how that could work: AI systems would identify potentially relevant behavioral episodes and present them first to the employee, who would then choose which pieces of evidence to include in their performance review.
Clear boundaries to prevent scope creep: Performance reviews should rely on formal work artifacts—design documents, project retrospectives, client proposals, technical specifications—rather than casual communications or private messages. Organizations should specify which systems are in scope, how far back evidence extends, and which artifact types are appropriate. AI should never generate performance ratings or automated decisions, only curate evidence that humans interpret within broader context.

Monday morning action: Define which communication channels and document types are fair game for performance evidence and which are off-limits. Make it explicit: “Performance reviews can draw from project documents, meeting notes, and design decisions, but not from Slack direct messages (DMs), personal emails, or casual conversations.”
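
For teams that want to make these boundaries concrete, the three governance rules above can be written down as a simple filter. The following is only a minimal sketch under stated assumptions: the names `Evidence`, `in_scope`, and `build_portfolio`, the artifact-type labels, and the one-year lookback window are all hypothetical illustrations, not features of any particular product or policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy values; each organization would define its own.
ALLOWED_TYPES = {
    "design_document",
    "project_retrospective",
    "client_proposal",
    "technical_specification",
    "meeting_notes",
}
LOOKBACK_DAYS = 365  # assumed window for how far back evidence may extend


@dataclass(frozen=True)
class Evidence:
    summary: str
    artifact_type: str
    source_link: str  # verification: must point back to the source artifact
    created: date


def in_scope(item: Evidence, today: date) -> bool:
    """Return True only if the item passes every boundary in the policy."""
    has_source = bool(item.source_link.strip())
    allowed_type = item.artifact_type in ALLOWED_TYPES
    age = today - item.created
    recent = timedelta(0) <= age <= timedelta(days=LOOKBACK_DAYS)
    return has_source and allowed_type and recent


def build_portfolio(candidates, approved_links, today):
    """Keep in-scope items that the employee has chosen to include."""
    return [
        item
        for item in candidates
        if in_scope(item, today) and item.source_link in approved_links
    ]
```

In this sketch the system only curates: anything without a source link, outside the allowed artifact types, or older than the window is dropped, and nothing enters the portfolio unless the employee has approved it. No rating or decision is computed anywhere.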

. . .

These steps create a path from today’s measurement limitations to tomorrow’s organizational possibilities. Organizations that make this choice will not just improve their performance reviews. They will build systems that recognize, develop, and reward the full spectrum of human capability that drives sustainable success.

The technology exists. The organizational precedents are emerging. The only question is whether leaders will use AI to perpetuate a broken system or transform it into something that finally captures what makes their best people exceptional.
