Evaluation · Series Part 2 of 6

Have You Stopped Thinking?

A Parable of Fool’s Gold

The second in a series of six articles exploring evaluation, measurement, and why we keep getting it wrong.

I want to start with a thank you.

After I published my first article last week, introducing the TriAxis Model, Steve Beckley (Director and Co-founder of J21A) left a comment that stopped me in my tracks. Not because he disagreed. Because he agreed, and in doing so, illustrated something more important than anything I had written.

Steve observed, thoughtfully and correctly, that some training (compliance and IT security, for instance) seems primarily aimed at Level 3 Learning, while the work his organisation does needs to shoot for Level 4 Results, with measurable economic benefit. He concluded generously that the TriAxis Model could work as well as, or better than, Kirkpatrick’s framework for encouraging evaluation.

Steve, thank you. Genuinely. Because your comment, sharp and well-intentioned as it is, perfectly demonstrates the trap I want to explore today.

Without necessarily realising it, you mapped different types of training to different levels of evaluation. As if the level you target is somehow a fixed property of the training category itself. Compliance training lives at Level 3. Commercial training lives at Level 4.

That assumption is Kirkpatrick’s ghost. And it’s haunting us.

Have You Stopped Thinking?

I ask that with affection rather than accusation, because the honest answer for most organisations is: yes. Somewhere between designing the training, delivering it, and filing the evaluation form, actual thinking quietly left the building.

And the culprit, more often than not, is the KPI.

Key Performance Indicators are one of the most widely used and least examined tools in organisational life. They feel rigorous. They feel scientific. Numbers on a dashboard suggest control, understanding, mastery. What they actually represent, when nobody is paying attention, is a proxy. A shortcut. A convenient stand-in for the thing you actually care about but cannot easily quantify.

Proxies are not inherently dishonest. Sometimes they are the best available tool. The problem arrives when the proxy gets promoted, quietly and gradually and without anyone signing off on the decision, from indicator to predictor. From “this suggests something about performance” to “this is performance.”

At that point, you are no longer measuring reality. You are measuring the measurement.

Goodhart Saw This Coming

In the 1970s, economist Charles Goodhart observed something troubling about monetary policy: the moment a government started targeting a specific economic indicator to control inflation, the relationship between that indicator and inflation broke down. The act of targeting it changed the behaviour it was supposed to measure.

Marilyn Strathern later distilled this into the formulation that should be tattooed on every manager’s forearm:

“When a measure becomes a target, it ceases to be a good measure.”

It sounds almost too simple. It is not. It is one of the most reliably destructive forces in organisational life, and it operates whether you are aware of it or not.

The examples are almost comedically instructive.

When the British colonial administration in Delhi offered a bounty per dead cobra to reduce the city’s snake population, local entrepreneurs responded with characteristic ingenuity: they started breeding cobras to collect the reward. When the scheme was cancelled and the cobras became worthless, the breeders released them. The cobra population ended up larger than before. The metric was consistently met. The actual problem got worse.

Soviet factory managers, given production targets measured in tonnes, manufactured enormous, unusable nails. When the target was switched to the number of nails, they produced vast quantities of tiny, unusable tacks. The target was always hit. The shelves stayed empty.

Wells Fargo set aggressive targets for the number of new customer accounts opened. Staff responded by opening millions of fraudulent accounts in customers’ names without their knowledge or consent. The metrics looked extraordinary. The underlying purpose — genuine customer relationships built on trust — was being systematically destroyed. The eventual scandal cost the bank billions, and its reputation something considerably harder to price.

In each case, nobody set out to be dishonest or stupid. They were, in fact, being entirely rational: optimising for the thing they were being measured on.

The stupidity belonged entirely to whoever designed the measurement system and forgot to keep thinking about what it was actually for.
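To make the mechanism concrete, here is a minimal sketch of the nail factory in Python. The designs, weights, and output figures are invented for illustration (they are not historical data), and `best_design` is a hypothetical helper. The manager in this toy model does nothing devious: they simply pick whichever design scores highest on the metric they have been handed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NailDesign:
    name: str
    weight_g: float        # weight of one nail, in grams (invented figure)
    units_per_shift: int   # how many the factory can make per shift (invented figure)
    usable: bool           # could a carpenter actually use it?


# Hypothetical product options open to the factory manager.
DESIGNS = [
    NailDesign("tiny tack", 0.5, 200_000, False),
    NailDesign("standard nail", 5.0, 50_000, True),
    NailDesign("giant spike", 2000.0, 1_000, False),
]


def output_kg(design):
    """Total weight produced per shift, in kilograms (the 'tonnes' style target)."""
    return design.weight_g * design.units_per_shift / 1000


def output_count(design):
    """Number of items produced per shift (the 'number of nails' target)."""
    return design.units_per_shift


def usable_count(design):
    """What the target was meant to stand for: nails someone can use."""
    return design.units_per_shift if design.usable else 0


def best_design(metric):
    """A rational manager picks the design that maximises the given metric."""
    return max(DESIGNS, key=metric)


for label, metric in [
    ("weight target", output_kg),
    ("count target", output_count),
    ("usable nails", usable_count),
]:
    print(f"{label}: {best_design(metric).name}")
```

Under these made-up numbers, the weight target selects the giant spike and the count target selects the tiny tack. Only the third metric, the one nobody was asked to report, selects the nail anyone can use.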

The Intellectual Skirt Problem

Here is a more contemporary example, drawn from a conversation I had this week that I have not been able to stop thinking about.

There is a recruitment technique circulating, apparently used in live interviews, where candidates are asked to open their ChatGPT account on the spot and request an analysis of what kind of person they are, based on their conversation history.

It seems clever. I will grant it that. Your AI history is, in theory, an unfiltered record of how you actually think: the problems you bring to it, the questions you ask, the way you frame things when nobody is performing for an audience. Unlike a CV or a personality questionnaire, you cannot retrospectively curate it to present your best self.

But here is the problem: this technique is looking up the candidate’s intellectual skirt and assuming it will reveal the whole person.

What it actually reveals is their ChatGPT usage. Which is not the same thing at all.

The heavy, purposeful user — someone who uses AI as a genuine thinking partner for complex professional and intellectual work — will produce a rich, revealing, analytically useful history. There are probably fifteen of these people in every hundred candidates.

The light or instrumental user treats AI like a slightly more fashionable Google alternative. Their history tells you they asked about the weather, checked a recipe, and once asked it to fix a cover letter. You learn almost nothing.

Contrast this individual, then, with the category that apparently nobody designing this technique considered: the person whose history is sparse or absent because they find consumer AI tools redundant. Because they have moved beyond them entirely. Because they are running their own locally hosted AI infrastructure (open-weight models, custom fine-tuning, their own architecture) and have no particular need for ChatGPT.

The technique would identify that person as an AI non-user. Yet they might be the most technically sophisticated candidate in the room.

The proxy collapsed. The measure became the target. The interviewer saw in the mirror exactly what they expected to see, and missed the reality entirely.

Back to Steve, and Back to the TriAxis

Steve’s instinct that IT security compliance training “lives at” Level 3 while commercial training “lives at” Level 4 is not wrong, exactly. It reflects how most organisations use Kirkpatrick in practice. But that is precisely the problem.

IT security compliance training absolutely should be evaluated at the economic benefit level. The cost of a data breach is quantifiable. The reduction in incidents post-training is measurable.

The assumption that compliance training only needs to demonstrate knowledge retention lets organisations off the hook for asking whether it actually works: whether the behaviour change it produces is sufficient to reduce real-world risk at a scale that justifies the investment.
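As a rough illustration of what that arithmetic might look like, here is a minimal Python sketch. Every figure in it is hypothetical, `net_benefit` is a name invented for this example, and it assumes (generously) that the whole drop in incidents can be attributed to the training rather than to anything else that changed in the same period.

```python
def net_benefit(incidents_before, incidents_after, cost_per_incident, training_cost):
    """Estimated net economic benefit of a training programme over one period.

    Assumes the full reduction in incidents is caused by the training,
    which a real evaluation would need to test rather than take on trust.
    """
    incidents_avoided = incidents_before - incidents_after
    return incidents_avoided * cost_per_incident - training_cost


# Hypothetical figures, for illustration only.
benefit = net_benefit(
    incidents_before=12,       # security incidents in the year before training
    incidents_after=7,         # incidents in the year after
    cost_per_incident=40_000,  # average cost of one incident
    training_cost=60_000,      # total cost of designing and delivering the training
)
print(f"Net benefit: {benefit}")
```

With these invented numbers, five avoided incidents at 40,000 each come to 200,000, leaving a net benefit of 140,000 once the 60,000 training cost is subtracted. The point is not the figures but that the question can be asked at all.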

Kirkpatrick’s hierarchy encourages organisations to stop climbing when the numbers look acceptable. Level 2 scores are fine, nobody is asking about Level 3, so we file the evaluation and move on. The measurement became the target. Goodhart’s ghost smiled quietly and went back to work.

The TriAxis Model does not solve this problem by itself. No framework does. What it attempts is something more modest and more important: it separates the dimensions of evaluation so that you cannot hit one number and call the job done. It asks different questions simultaneously and independently, making it structurally harder to optimise for a single proxy at the expense of the whole picture.

But the framework is only as good as the thinking behind it. Which brings us back to where I started.

The Real Question

KPIs are not inherently fool’s gold. They are a tool. The fool’s gold is the assumption that having them means you understand what is happening: that the dashboard is the reality rather than a simplified, gameable, potentially misleading representation of it.

The cobra breeders were not stupid. The Soviet nail manufacturers were not stupid. The Wells Fargo employees were not stupid. They were all responding rationally to the incentive structure they had been given.

The question worth asking, in your organisation, your training programme, your recruitment process, your performance management system, is whether anyone is still thinking about what the numbers are actually for.

Or whether the numbers have quietly become the point.

Julian Franklin is an L&D practitioner with more than 30 years of experience and someone who thinks rather too hard about things most people accept without question. The TriAxis Model is available for discussion, challenge, and improvement, preferably by people who have read this far.