AI vs Human Creativity: Study Shows Generative Models Beat the Average Person

Mar 16, 2026•25 min

Episode description

A large-scale study from the University of Montreal tested advanced generative AI against more than 100,000 people using the Divergent Association Task.

Models such as GPT-4 outperformed the average human in generating original word associations, marking a major milestone in machine creativity. However, they fell short of the top 10% of highly imaginative individuals—especially in complex domains like poetry and storytelling.

The results suggest that while AI is becoming a powerful creative assistant, peak human innovation remains unmatched.

This episode includes AI-generated content.

Transcript

Speaker 1

Welcome to the Sentient Code, where intelligence is engineered, autonomy is emerging, and the line between human and machine grows thinner. Each episode, we decode the algorithms, explore the robotics, and examine the ideas shaping the future of artificial minds.

Speaker 2

If you sit down at a terminal right now and you try to force a genuinely original concept into existence, you are acutely aware of the cognitive friction involved.

Speaker 3

Oh absolutely, it's exhausting, right.

Speaker 2

It takes time, it takes actual metabolic energy. You're consciously fighting against every cliche and predictable thought pattern your brain has accumulated over a lifetime.

Speaker 3

Because your brain wants to take the path of least resistance.

Speaker 2

Exactly. But now consider the reality that while you are wrestling with that single initial concept, a generative model is instantiating complex narrative architecture, synthesizing completely divergent ideas, and mapping out one hundred different structural angles in the time it takes you to blink.

Speaker 3

It's unsettling. It immediately forces an uncomfortable evaluation of our own cognitive utility. Really, it does.

Speaker 2

If the algorithmic architecture can iterate at that scale and speed, does that mean the underlying mechanics of human imagination are becoming obsolete?

Speaker 3

That is the exact question we're looking at today.

Speaker 2

We are mapping the exact shifting boundary between biological intelligence and artificial generation. And we aren't just philosophizing here. The empirical data that emerged in January 2026 provides a definitive, massive-scale evaluation of this exact boundary.

Speaker 3

Yeah, the scale of this is just staggering.

Speaker 2

We're looking at a direct, quantifiable showdown: one hundred thousand human minds evaluated directly against the architectures of GPT-4, Claude, and Gemini.

Speaker 3

And the scale of that January 2026 data is what fundamentally shifts this whole conversation from theoretical philosophy to hard, quantifiable cognitive science.

Speaker 2

Not just a parlor trick anymore, not at all.

Speaker 3

This evaluation represents a heavily vetted structural analysis. It's backed by a massive collaborative infrastructure, with the University of Montreal, Concordia University, the University of Toronto Mississauga, the Quebec AI Institute, and Google DeepMind all involved.

Speaker 2

That is some serious institutional weight.

Speaker 3

Right, when you have that level of backing, spearheaded by principal investigators like Professor Karim Jerbi alongside Antoine Bellemare-Pepin, François Lespinasse, and deep learning pioneer Yoshua Bengio, you are no longer just testing whether an AI can write a clever email.

Speaker 2

You're putting the very architecture of thought under a microscope.

Speaker 3

Exactly. The researchers established a strict baseline of cognitive evaluation to determine precisely where the synthetic architecture outperforms the biological one and crucially where it catastrophically fails.

Speaker 2

And to understand the parameters of that baseline, you really have to look at the specific cognitive battlefield they mapped out. They didn't just ask the models to solve math equations or write code.

Speaker 3

No, that's too easy.

Speaker 2

The entire evaluation hinges on the hard distinction between convergent and divergent problem solving.

Speaker 3

Which is a vital distinction to make.

Speaker 2

Yeah, convergent thinking is straightforward optimization. It's synthesizing existing data to isolate the single objectively correct conclusion. Deductive reasoning, basically.

Speaker 3

And neural networks have excelled at convergent optimization for years because there's a right answer to find. Exactly. If you define a closed system with a definitive optimal state, algorithmic processing will always outpace human calculation. It's just a matter of compute. But divergent creativity is an entirely different neurocognitive mechanism.

Speaker 2

It's not about finding the one right answer.

Speaker 3

No, it's not about narrowing the operational parameters at all. It is about taking a single initialization point and exploding outward.

Speaker 2

I love that phrasing. Exploding outward.

Speaker 3

That's what it is. Divergence requires the generation of highly diverse, nonlinear, and statistically improbable concepts. It is the absolute core mechanism of innovation.

Speaker 2

So how do you even measure that? Because convergent thinking is easy to grade: you either got the math problem right or you didn't, right?

Speaker 3

But to measure divergence, the researchers utilized a highly standardized metric called the Divergent Association Task, or the DAT, developed by Jay Olson.

Speaker 2

The methodological mechanics of the DAT are fascinating just because of how strict they are. The task requires the subject to generate exactly ten lexical items: ten words.

Speaker 3

Just ten words, so it's incredibly simple.

Speaker 2

It sounds so simple, but the critical constraint is that those ten words must demonstrate maximum semantic dissociation.

Speaker 3

Meaning they have to be as unrelated as possible.

Speaker 2

Right. You have to produce ten words that are as completely unrelated in meaning and categorical classification as mathematically possible, and you only get about two to four minutes to complete it.

Speaker 3

Which is precisely the logistical efficiency that allowed the researchers to capture a statistically massive sample size. One hundred thousand human participants is a staggering data set.

Speaker 2

It really creates a robust population-level baseline to benchmark GPT-4, Claude, and Gemini against.

Speaker 3

And the scoring of those ten words is entirely objective. It relies on analyzing the semantic distance between the concepts in high-dimensional vector space.

Speaker 2

So they basically map out how far apart the words live in the human language network. Exactly.
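
The scoring idea described here, semantic distance between words in a vector space, can be sketched in a few lines of Python. This is a minimal illustration, not the study's pipeline: it assumes each word already has an embedding vector, that distance is measured as cosine distance, and that the score is the average over all word pairs scaled by 100. The three-dimensional vectors and the names `TOY_EMBEDDINGS` and `dat_style_score` are invented for the example; a real scorer would load pretrained embeddings with hundreds of dimensions.

```python
import itertools
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dat_style_score(words, embeddings):
    """Average pairwise cosine distance between the words, scaled by 100.

    Assumption: this mirrors the general shape of a semantic-distance
    score; it is not claimed to be the exact published procedure.
    """
    pairs = list(itertools.combinations(words, 2))
    distances = [cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs]
    return 100.0 * sum(distances) / len(distances)


# Hypothetical toy vectors, chosen by hand so that fruits sit close together.
TOY_EMBEDDINGS = {
    "apple": [0.9, 0.1, 0.0],
    "orange": [0.8, 0.2, 0.0],
    "banana": [0.85, 0.15, 0.05],
    "galaxy": [0.0, 0.1, 0.9],
    "fork": [0.1, 0.9, 0.0],
}

clustered = dat_style_score(["apple", "orange", "banana"], TOY_EMBEDDINGS)
dispersed = dat_style_score(["apple", "galaxy", "fork"], TOY_EMBEDDINGS)
print(f"clustered list: {clustered:.1f}")
print(f"dispersed list: {dispersed:.1f}")
```

With these toy vectors the list of three fruits scores far lower than the list that mixes a fruit, a galaxy, and a fork, which is the behavior the task rewards.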

Speaker 3

To illustrate what an apex score looks like in this framework, we can examine a remarkably high-scoring lexical sequence that was actually captured during the evaluation.

Speaker 2

Oh, I have the sequence right here. Yeah, listen to the conceptual jumps required to produce this sequence.

Speaker 3

Ready, go for it.

Speaker 2

Galaxy, fork, freedom, algae, harmonica, quantum, nostalgia, velvet, hurricane, photosynthesis.

Speaker 3

It's almost dizzying to listen to.

Speaker 2

I really want to pause on the sheer cognitive friction of that specific sequence. Just transitioning from galaxy, a macroscopic astrophysical structure, directly to fork, a localized utilitarian tool.

Speaker 3

It requires a massive reallocation of conceptual processes.

Speaker 2

Yes, your brain has to entirely abandon the spatial and semantic network it just activated for galaxy. And it doesn't stabilize either.

Speaker 3

No, it immediately forces another jump.

Speaker 2

From fork, the next item is freedom, jumping from a tangible physical object to an abstract sociopolitical state, and then immediately to algae, a concrete biological organism.

Speaker 3

That constant jarring reallocation is exactly what the DAT measures. It is testing the subject's ability to evade categorical clustering.

Speaker 2

Categorical clustering. That's the trap, right?

Speaker 3

It is the ultimate trap. In standard neurocognitive functioning, categorical clustering is the default state. The human brain is biologically wired for metabolic efficiency.

Speaker 2

We want to save calories always.

Speaker 3

If you access the concept of a specific fruit, say an apple, the neural pathways connecting to other fruits or perhaps agricultural concepts are preactivated. They're already warmed up.

Speaker 2

So your brain wants to say orange or banana next.

Speaker 3

Exactly. It requires significantly less energy to cluster related items than to force the network to retrieve an entirely dissociated concept from a distant semantic neighborhood.

Speaker 2

Like jumping from apple to carburetor.

Speaker 3

Right. Overcoming that biological inclination for efficiency is the hallmark of advanced cognitive divergence.

Speaker 2

That makes perfect sense when you map it onto real-world innovation, because, you know, the DAT is technically just a linguistic constraint task, but it operates as a proxy for complex problem solving across entirely disparate disciplines.

Speaker 3

It's a foundational cognitive mechanism. Yeah.

Speaker 2

The neurocognitive processes required to bridge the semantic distance between galaxy and fork are the exact same mechanisms required to synthesize conflicting variables in an engineering crisis or to develop a completely novel economic model.

Speaker 3

If your neural architecture can sustain that level of divergence, you have the capacity for high level innovation. Period.

Speaker 2

So the critical data point becomes the performance of the generative models against that one hundred thousand person human baseline. How did the machines actually do?

Speaker 3

The statistical conclusion drawn from that data establishes a mathematical juncture that the principal investigators identify as a Turing point.

Speaker 2

A Turing point that sounds ominous.

Speaker 3

It's a massive milestone. Generative AI systems, specifically GPT-4, Claude, and Gemini, now consistently surpass the median human output in divergent linguistic creativity. The algorithms score higher on this specific divergent task than the exact midpoint of the human statistical distribution.

Speaker 2

That is a staggering realization. If you are operating at the statistical average of human brainstorming, the algorithm has already eclipsed your baseline capacity.

Speaker 3

It's faster and mathematically more divergent than the average person.

Speaker 2

But, and this is the massive caveat that the analysis of the data stratifications by Bellemare-Pepin and Lespinasse reveals, the machines don't just infinitely scale upward. They hit a definitive performance ceiling.

Speaker 3

And that ceiling is perhaps the most critical finding of the entire evaluation.

Speaker 2

Walk us through that. Where do they hit the wall?

Speaker 3

When the researchers isolated the comparative data to examine only the top fifty percent of human participants, the algorithmic superiority vanished. It just disappeared completely. The aggregate scores of that upper half of human subjects entirely eclipsed every single artificial model tested.

Speaker 2

So if you are in the top half of creative thinkers, you are still beating the most advanced AI on the planet.

Speaker 3

Yes, and it gets even more pronounced. When you analyze what the researchers term the decile gap, that is, the top ten percent of highly creative individuals, the quantitative disparity is profound.

Speaker 2

The machines can't even get close to them.

Speaker 3

The models cannot reach the baseline of that top human decile, and mathematically, the statistical gap between the machine's computational limit and human apex creativity is actually expanding.

Speaker 2

I'm so curious about the mechanics of that failure. If the AI can process billions of parameters and map the entire vector space of human language, why does it mathematically fail to cross that decile gap? Why does human cognition win at the apex?

Speaker 3

It comes down to the fundamental difference between probabilistic calculation and what the researchers call the semantic leap.

Speaker 2

The semantic leap, yes.

Speaker 3

High level human intellect makes intuitive, nonlinear jumps across concepts that entirely bypass statistical probability.

Speaker 2

Because we aren't just doing math in our heads.

Speaker 3

Right. Generative algorithms at their core optimize for expected value within the latent space. They calculate the mathematically most probable token sequence based on their training weights.

Speaker 2

They are prediction engines.

Speaker 3

Exactly, but true apex creativity, that semantic leap, is inherently about identifying the connection that is statistically highly improbable yet profoundly meaningful once established.

Speaker 2

So the AI is almost too logical for its own good.

Speaker 3

The algorithm is constrained by its own predictive optimization. It cannot easily prioritize the improbable without descending into chaotic noise.

Speaker 2

But evaluating the models solely on the generation of ten isolated words invites a pretty structural critique, doesn't it? Generating single words in a vacuum is a very specific type of constraint.

Speaker 3

It is, it's highly artificial, right.

Speaker 2

So how does this mathematical limitation translate to complex, extended creative generation, like doing actual work in the real world?

Speaker 3

Here the researchers absolutely anticipated the limitations of the DAT as a standalone metric. To validate the existence of that computational ceiling, they transitioned the methodology from isolated lexical tasks to highly structured, context-dependent modalities.

Speaker 2

They leveled up the testing.

Speaker 3

Exactly. They benchmarked the models using three advanced writing challenges: first, haikus; second, cinematic plot summaries; and third, full short fictional narratives.

Speaker 2

Those three modalities require entirely different constraint satisfaction mechanisms. A haiku is a rigid three-line structure requiring strict syllabic constraints.

Speaker 3

You're forcing the AI to optimize for syllables rather than just semantic meaning.

Speaker 2

Yeah, that's a totally different math problem for it. And then cinematic plot summaries demand narrative arcs, thematic cohesion, and structural resolution, all within a highly concise format.

Speaker 3

Right, you need a beginning, middle, and end that actually makes sense together.

Speaker 2

And authoring a full short fiction narrative tests the system's ability to sustain an architecture over thousands of tokens.

Speaker 3

And what became glaringly evident across all three of those modalities is that the statistical limitations observed in the simple ten-word DAT scaled directly into these richer formats.

Speaker 2

So the AI struggled with the longer formats too.

Speaker 3

Heavily. When forced to sustain structural coherence, emotional resonance, and thematic depth over an extended sequence, the artificial models suffer from acute computational degradation.

Speaker 2

Computational degradation, that is a crucial concept to dissect here. It isn't just that the AI gets tired like a human would.

Speaker 3

No, not at all. It's a mechanical failure of the transformer architecture itself.

Speaker 2

Because generative models rely on localized predictive text generation.

Speaker 3

Precisely. The attention mechanisms, the actual mathematical functions that determine how much weight a specific word should have on the next word being generated, are highly optimized for the immediate context window.

Speaker 2

It's only really looking right in front of its own face.

Speaker 3

Exactly. When the model is generating token number two thousand, the mathematical influence of the thematic setup established in token number ten has been massively diluted.

Speaker 2

It essentially forgets why it started the story in the first place.

Speaker 3

It loses the thread of the global narrative architecture because it is hyper-focused on the statistical probability of the immediate sentence it is constructing right now.
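
The dilution argument can be made concrete with a toy calculation. The sketch below assumes a single attention step where weights come from a softmax over one score per context token, and where every token gets the same score unless a `boost` is applied. Under that assumption, one token's share is simply 1 divided by the context length. The function names and the `boost` parameter are hypothetical; real transformers learn their scores, so a distant token can still receive a large weight, as the last line shows.

```python
import math


def attention_weights(scores):
    """Softmax over raw attention scores (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def share_of_token(position, context_length, boost=0.0):
    """Attention weight landing on one token.

    Assumption: all tokens score 0.0 except the chosen one, which
    scores `boost`. This is a deliberately crude model of attention.
    """
    scores = [0.0] * context_length
    scores[position] = boost
    return attention_weights(scores)[position]


short_ctx = share_of_token(10, 20)             # early in generation
long_ctx = share_of_token(10, 2000)            # two thousand tokens in
boosted = share_of_token(10, 2000, boost=8.0)  # same token, higher learned score

print(f"share at 20 tokens:   {short_ctx:.4f}")
print(f"share at 2000 tokens: {long_ctx:.4f}")
print(f"share when boosted:   {boosted:.4f}")
```

The uniform case illustrates the dilution described in the conversation, while the boosted case is a reminder that the effect depends on the scores the model has learned rather than on sequence length alone.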

Speaker 2

Contrast that localized prediction with human episodic memory. When a skilled human creator constructs a complex narrative, they are not linearly guessing the next probable word.

Speaker 3

No, they possess a hierarchical mental model of the entire global architecture simultaneously.

Speaker 2

Right. A human embeds multi-layered significance because they can draw upon lived episodic memory and apply complex sociocultural contextualization.

Speaker 3

You can construct a thematic metaphor early in a structure with the explicit, predetermined intention of resolving it much later.

Speaker 2

But the AI can't do that.

Speaker 3

The AI cannot possess a predetermined intention. It is a sequential prediction engine. It completely lacks the episodic memory required to generate genuine, layered emotional resonance.

Speaker 2

That introduces a highly relevant variable regarding how we actually interact with these models on a daily basis. If the fundamental architecture is a rigid sequential prediction engine, can the operational parameters be modulated to force a higher degree of divergence?

Speaker 3

You mean, can we pop the hood and change how it thinks?

Speaker 2

Yeah, we know these systems aren't entirely static.

Speaker 3

They are highly mutable, actually, and the primary mathematical adjustment is the parameter known as temperature. Yes, temperature directly modulates the probability distribution of token selection within the large language model.

Speaker 2

So what happens in a low temperature state?

Speaker 3

In a low temperature state, the algorithm is constrained to prioritize the highest probability tokens. The output is conventional, highly predictable, and the semantic distance between generated concepts is minimized.

Speaker 2

It plays it safe.

Speaker 3

Exactly. It prioritizes structural safety.

Speaker 2

So low temperature is the optimization for convergent tasks, synthesizing data without taking any statistical risks. But adjusting to a high temperature state flattens that probability curve.

Speaker 3

It does. You are rescaling the model's output distribution at sampling time so that the most obvious choice no longer dominates.

Speaker 2

You're forcing the algorithm to select lower probability tokens. It forces exploratory, unconventional associations.
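
The flattening effect is easy to see numerically. The sketch below assumes the common formulation in which temperature divides the model's raw scores (logits) before a softmax turns them into probabilities; the trained weights are untouched. The four logits in `TOY_LOGITS` are invented for illustration, and `entropy` is included only as a rough measure of how spread out the distribution is.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical scores for four candidate next tokens.
TOY_LOGITS = [4.0, 2.0, 1.0, 0.5]

low = softmax_with_temperature(TOY_LOGITS, 0.5)
high = softmax_with_temperature(TOY_LOGITS, 2.0)

print("low temperature: ", [round(p, 3) for p in low])
print("high temperature:", [round(p, 3) for p in high])
print("entropy low/high:", round(entropy(low), 3), round(entropy(high), 3))
```

At the low setting nearly all probability mass sits on the top-scoring token; at the high setting the lower-ranked tokens become realistic choices, which is where both the unconventional associations and the noise come from.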

Speaker 3

But here's a crucial question. Does merely flattening the probability curve actually equate to the semantic leap we discussed earlier or is it just introducing mathematical randomness?

Speaker 2

That is the big question. Does high temperature equal creativity or just chaos?

Speaker 3

That is the exact distinction the researchers emphasized. High temperature alone does not generate apex creativity. It simply introduces chaos.

Speaker 2

So it just spits out gibberish if it gets too hot.

Speaker 3

Essentially, yes. The critical, non-negotiable variable is the human operator. Sophisticated prompt engineering is required to provide the structural parameters that harness that high temperature state.

Speaker 2

And the evaluation provides a brilliant empirical example of this dynamic through the use of etymological prompts.

Speaker 3

Yes, the etymological prompts are fascinating.

Speaker 2

I found the mechanics of that so interesting. The researchers didn't just instruct the model to "be more creative" or "increase your creativity metrics." No. They provided a structural constraint that forced the model to actively process lexical origins and morphological structures. They basically instructed it to evaluate the historical root pathways of the language itself.

Speaker 3

By imposing that specific structural constraint, the human operator forces the algorithmic processing out of its standard predictive vernacular pathways.

Speaker 2

It can't just guess the next conversational word anymore, right.

Speaker 3

The model is forced to intersect completely different conceptual nodes in its vector space: academic, historical, and structural data rather than standard conversational probabilities.

Speaker 2

And this human guided intervention resulted in highly unpredicted associations that significantly elevated the creativity metrics of the output.

Speaker 3

It perfectly illustrates what the researchers term the dependency paradigm.

Speaker 2

A dependency paradigm.

Speaker 3

Yes, it is a strict master servant dynamic. The algorithmic architecture is entirely dependent on precise, imaginative human guidance.

Speaker 2

So the machine cannot achieve meaningful divergence in a vacuum.

Speaker 3

Absolutely not. The absolute boundaries of the generated output are dictated entirely by the sophistication of the human operator's initial contextual parameters.

Speaker 2

And that dependency paradigm exposes one of the most significant dangers in our interaction with generative systems, the illusion of intellect.

Speaker 3

It is incredibly easy to mistake structural coherence for authentic conscious thought.

Speaker 2

Because it sounds so confident and grammatically perfect.

Speaker 3

Exactly, we must maintain a rigorous distinction between synthetic reasoning and biological comprehension. Consider the data point regarding the GPT-3 architecture demonstrating the statistical capacity to match collegiate-level logic scores.

Speaker 2

When you look at the mechanics of that, it is a perfect example of the illusion. The model matching those logic scores is engaging in pure syntactic manipulation.

Speaker 3

It's just shifting symbols around, right.

Speaker 2

It is shifting symbols based on highly complex, mathematically optimized rules, but it possesses zero biological comprehension of the abstract concepts those symbols represent.

Speaker 3

It doesn't know what a logic puzzle actually is.

Speaker 2

No, it is executing a function without any internal representation of meaning.

Speaker 3

And this lack of internal representation becomes a severe liability when generative models are deployed in professional settings without rigorous human oversight.

Speaker 2

The evaluation details specific documented failures of this.

Speaker 3

Yes, notably the phenomenon of ChatGPT generating entirely fictitious references in medical research contexts.

Speaker 2

Let's really examine the underlying architecture of those medical hallucinations, because it isn't a glitch.

Speaker 3

No, it's not a bug.

Speaker 2

It's the model functioning exactly as designed. When a generative model produces a fictitious medical citation, complete with a fabricated DOI, a plausible author list, and a nonexistent journal title, it is doing so because the architecture prioritizes structural plausibility over factual accuracy.

Speaker 3

Precisely, the model's primary objective function is to generate an output that statistically resembles its training data.

Speaker 2

It wants to blend in.

Speaker 3

It has analyzed millions of medical papers, so it possesses a flawless mathematical map of what a medical citation is supposed to look like.

Speaker 2

Structurally, it knows the exact formatting.

Speaker 3

It generates a structurally perfect imitation because that reduces its mathematical loss function, but it possesses no biological comprehension of truth.

Speaker 2

And no independent mechanism to verify if that synthesized string of characters actually maps to external reality.

Speaker 3

It mimics the structural shape of wisdom without possessing the capacity for factual verification.

Speaker 2

And the research explicitly highlights the real-world implications of prioritizing that structural plausibility, particularly regarding the perpetuation of demographic biases in clinical decision-making.

Speaker 3

This is a critical mathematical vulnerability. If the historical data utilized to train the model contains demographic biases regarding clinical outcomes or diagnostic frequencies.

Speaker 2

Which we know historical medical data absolutely does.

Speaker 3

Right, then the algorithm will inherently synthesize and reproduce those biases.

Speaker 2

It just echoes it back.

Speaker 3

It does so because those biased correlations are statistically probable within its latent space. It lacks the moral or biological comprehension to recognize the bias as an error.

Speaker 2

It just sees a pattern and repeats it.

Speaker 3

Exactly. It simply executes the statistical reproduction of its training environment.

Speaker 2

Which brings us back to the broader socioeconomic apprehension surrounding this technology. There's this pervasive anxiety that if these models can execute both convergent optimization and baseline divergent generation at this massive scale, human intellectual labor is facing absolute obsolescence.

Speaker 3

People are definitely worried about being replaced.

Speaker 2

Professor Jerbi directly addresses that apprehension, though, by articulating a completely different paradigm. He calls it the utility framework.

Speaker 3

Yes, the utility framework. He explicitly refutes the concept of AI as a replacement mechanism or a direct competitor.

Speaker 2

So what is it then?

Speaker 3

Within the utility framework, generative AI is classified as a cognitive prosthesis.

Speaker 2

A cognitive prosthesis. I like that.

Speaker 3

It is a highly advanced tool engineered to amplify human imagination, but it remains fundamentally inert without human structural parameters, prompt engineering, and oversight of its function.

Speaker 2

And that leads directly into the AI paradox.

Speaker 3

The paradox is fascinating.

Speaker 2

We are currently observing a massive structural shift where routine, lower level divergent tasks are being easily and efficiently automated. You don't need a human to brainstorm ten basic marketing taglines anymore.

Speaker 3

The machine can do that in two seconds.

Speaker 2

Right. But precisely because that baseline tier of generation is now ubiquitous and automated, the market and intellectual premium placed on apex human creativity is increasing exponentially.

Speaker 3

The automation of the baseline does not devalue the apex. It isolates it as the only remaining differentiator.

Speaker 2

If everyone has access to the same algorithmic baseline, the only way to stand out is to jump past it.

Speaker 3

Exactly. Because generative models suffer from computational degradation over extended architectures, and because they fundamentally lack biological comprehension and episodic memory, the human cognitive capacity to master complex, resonant global narratives is more critical now than at any prior point in history.

Speaker 2

The bar has just been raised dramatically. This necessitates a fundamental reevaluation of our educational infrastructures.

Speaker 3

It absolutely has to change.

Speaker 2

If the algorithmic baseline mathematically exceeds the median human capacity for divergent tasks, an educational model optimized to train individuals merely to reach that median is totally obsolete.

Speaker 3

We can't just train people to be average anymore.

Speaker 2

No. To navigate an architecture-driven future, the explicit focus must be on cultivating the human capacity for the semantic leap. We have to push cognitive development into that top decile where the generative model structurally cannot follow.

Speaker 3

That is the definitive reality mapped by this evaluation. We have empirical quantification of machine creativity.

Speaker 2

Now the numbers are in.

Speaker 3

They are. We know the exact Turing point where the algorithm defeats the human median in lexical association, but we also have a mathematically proven computational ceiling.

Speaker 2

The machines hit a wall.

Speaker 3

They do. The artificial models are measurably and distinctly subordinate to human ingenuity in complex, structurally rich modalities. They are masters of syntactic manipulation, but they require the structural parameters of a biological mind to achieve true innovation.

Speaker 2

Which leaves you with a profound structural question to evaluate as you integrate these generative systems into your own complex cognitive workflows.

Speaker 3

What's the takeaway here?

Speaker 2

As the convenience and speed of artificial generation become ubiquitous, are we going to allow our baseline of creative thought to homogenize? Are we going to let ourselves be constrained by the exact same algorithmic probabilities as everyone else?

Speaker 3

That's the danger of the default settings.

Speaker 2

Right. Or, and this is the alternative, will the demanding necessity for highly sophisticated high-temperature prompt engineering inadvertently cultivate a radically new, specialized stratum of human cognitive diversity?

Speaker 3

Are we going to learn to think weirder just to guide the machines better?

Speaker 2

Exactly. That is the boundary we are currently navigating. Thank you for joining us on this exploration of the architecture of thought. Keep analyzing the structures around you, keep pushing your cognitive boundaries, and we will see you next time.

Transcript source: Provided by creator in RSS feed