‘Speech computations of the human superior temporal gyrus’ with Eddie Chang

Sep 20, 2022 · 1 hr 22 min · Season 2 · Ep. 23

Episode description

In this episode, I talk with Eddie Chang, Professor of Neurological Surgery at the University of California, San Francisco, about his recent paper ‘Speech computations of the human superior temporal gyrus’.

Chang lab website

Bhaya-Grossman I, Chang EF. Speech computations of the human superior temporal gyrus. Annu Rev Psychol 2022; 73: 79-102. [doi | pdf]

Chang EF, Rieger JW, Johnson K, Berger MS, Barbaro NM, Knight RT. Categorical speech representation in human superior temporal gyrus. Nat Neurosci 2010; 13: 1428-33. [doi]

Sjerps MJ, Fox NP, Johnson K, Chang EF. Speaker-normalized sound representations in the human auditory cortex. Nat Commun 2019; 10: 2465. [doi]

Leonard MK, Baud MO, Sjerps MJ, Chang EF. Perceptual restoration of masked speech in human cortex. Nat Commun 2016; 7: 13619. [doi]

Hamilton LS, Edwards E, Chang EF. A spatial map of onset and sustained responses to speech in the human superior temporal gyrus. Curr Biol 2018; 28: 1860-71. [doi]

Oganian Y, Chang EF. A speech envelope landmark for syllable encoding in human superior temporal gyrus. Sci Adv 2019; 5: eaay6279. [doi]

Transcript

Stephen Wilson

Welcome to Episode 23 of the Language Neuroscience Podcast. I'm Stephen Wilson. Thanks for listening. Today I have a return guest, my friend Eddie Chang, Professor and Chair of Neurological Surgery at the University of California, San Francisco. Eddie is a neurosurgeon and neuroscientist who studies the brain mechanisms of speech perception and production, primarily using electrocorticography. That is, direct recordings from the

cortex during surgery. This is a technique that Eddie has been at the forefront of developing, and it offers incredible spatial and temporal resolution. Eddie is doing some of the most original and exciting work in our field. I've asked him back to talk about his new review paper with first author Ilina

Bhaya-Grossman, entitled

Speech Computations of the Human Superior Temporal Gyrus, which just came out in Annual Review of Psychology. This fascinating paper is a synthesis of many of the most important studies from the Chang lab over the last decade or so. Okay, let's get to it. Hi, Eddie. How are you?

Edward Chang

Good. Hi, Stephen, great to see you.

Stephen Wilson

Good to see you again, too. So, as you know, because we've just been messing around with it, I'm joining you from the special location of my daughter's closet. (Laughter) Because my neighbors, who are very dear friends of mine, are getting an extra part built onto their house, and it's a little bit noisy today. So I had to relocate.

Edward Chang

Awesome.

Stephen Wilson

How about yourself?

Edward Chang

Ah, I’m in my office, and it's just a regular foggy day here in San Francisco.

Stephen Wilson

Yep. Like all the days. And what have you been up to lately?

Edward Chang

Um, just our usual thing, taking care of patients in the neurosurgery ward and on other days in lab, trying to make progress.

Stephen Wilson

Uh huh.

Edward Chang

Slow and steady slog.

Stephen Wilson

Yeah, not that slow.

Edward Chang

Sometimes it feels like that for sure.

Stephen Wilson

Yeah. I hear you. Okay, so today we're going to talk about your recent paper in the Annual Review of Psychology. Um, can you tell me the name of your co-author?

Edward Chang

Sure. Yeah. The co-author is Ilina Bhaya-Grossman.

Stephen Wilson

Aha, and is she in your lab currently?

Edward Chang

Yeah. So, Ilina is a graduate student in the lab. She was a linguistics graduate at Berkeley and she joined our lab about two years ago. And I think it was the perfect timing, because we’ve been working on speech perception related work for about, you know, 12 years now. And it was a good time to sort of do a synthesis of all the findings that we've had, and try to put it all together in a

review. And a great project actually, for graduate students to take on because you know, a lot of it is to look at the literature, think big picture about what are the most important questions, and then try to put it together.

Stephen Wilson

Yeah, I liked it, too. It's like a really nice kind of roadmap to the work that you've done over the last decade. And a little more than a decade, kind of focused on your lab, but bringing in stuff from other labs too. Definitely an ECOG perspective, but again, bringing in data from other methodologies. So, I think it's a really neat paper, and I'm looking forward to talking about

it. I thought we could start by, I kind of want to start at the end a little bit, and then kind of go back and fill in the details. And by the end, I mean, the theory that you guys ended up proposing. So, you kind of have this different take on

speech perception. It's not a hierarchical take, in contrast to the main models. I was wondering if maybe you could start by outlining the big picture theoretical position that you're putting forward in this paper, and then we'll kind of go back and talk about some of the studies that inform that position.

Edward Chang

Sure. So this review is about, essentially, you know, our idea of the encoding of speech in the human Superior Temporal Gyrus. And we think that this is a really, really special area for speech processing, primarily at the intersection between the auditory cortical system and the

linguistic one. And it really summarizes a lot of the empirical work that's been done from the lab, trying to build up, you know, looking critically at theories of speech perception and trying to see if that sort of fits with the data that we see when people are actually listening to speech and doing

perceptual tasks. And really one of the motivations for this is that, as desperately hard as we’ve tried to look for clear examples of the standard classical feedforward hierarchy, which we have really conceptualized as the key process for moving from sounds to words to meanings, it doesn't really look

like that, really at all. And, to me, that's been a surprise over time, you know, just to continue to talk about it, and to continue to look for it, but not really see it. As an example, and we can get back to this later, it's just not clear that we see single neuron or single electrode responses to things like invariant representations

of words. And so I think that these observations and more pose deeper questions actually about the validity of the classic, you know, feedforward hierarchical model, that you go from low level sound representations into higher levels of abstraction, in language using like a primary feedforward process across these different levels of representation. One of the things that we recognized was that that actually has been a relatively successful model for explaining vision.

Yeah, so in this review, I think we're proposing that we should be thinking about perhaps alternative mechanisms, what we call recurrent ones, that are really more time based, and thinking about things that maybe the auditory system may be more specialized for.

Stephen Wilson

Right. Um, and so your model doesn't really shy away from different levels, though, right? I mean, you still acknowledge that there's differences between, you know, say, acoustic phonetic features, and phonemes and lexical representations. But instead of locating them in a sequence of brain areas, you have a different idea of where in the brain they are represented.

Edward Chang

Yeah, that's exactly right. So, you know, one potential sort of spatial layout of the classic model of auditory word recognition is that, as you go from one piece of cortex to the next, you get this increasing abstraction and level of representation. So, in the paper we basically described that, you know, one popular way of thinking about this is that, essentially, the whole subcortical system is involved with low level sound processing, including the primary auditory

cortex. And then by the time you get to the STG, it's doing something like spectrotemporal or phoneme extraction, and then you'd go to syllables, maybe in the STS, and then at another level, like word representation, in the MTG. So that would be kind of like a quite literal translation of what we think is going on in the ventral stream of vision, where you just go from one level of representation from one piece of cortex to the next adjacent one.

And what we are, yeah, go ahead Stephen.

Stephen Wilson

I mean, I was just gonna say, like, yeah, there's just these classic figures that I think we've all seen that kind of show this, you know, how the receptive fields get larger, and the nature of the visual representations gets, you know, more complex, from dots to lines to shapes to entities, as you progress anteriorly along that ventral visual stream. And I think what you're getting at is that we've all, in the language world, kind of hoped that one day we would have a model like that.

And you're just kind of coming around to the view that that's not the way it's gonna be.

Edward Chang

Well, I think that I'm very influenced by the data. And we've looked very hard for something that is clear proof of evidence for that, and I think, as a field, we've kind of bought that idea hook, line, and sinker, you know, as a basic approach. But I don't think we've done enough to really think about what the alternatives to that are. And, again, the evidence for it, I think, is not

fantastic. And so, it really got us thinking about what the alternatives are, and the kind of models that we're really trying to explore right now are ones that really think about how you can have multiple levels of representation actually within one given area. In this particular review, we're really thinking about the Superior Temporal Gyrus and also the

Superior Temporal Sulcus. A lot of our work has naturally focused on the STG, because it's where we have most of our recordings. The STS is the immediately adjacent area that probably has some overlapping but also distinct processing.

But the basic idea is that instead of thinking about these levels of representation sort of changing from one brain area to the next, we're really thinking about the levels of representation changing as a function of time and as processing within a given area.

And so, the way that that can happen is that, as sound comes into an area, for example, the Superior Temporal Gyrus, there is a level of feature extraction, extraction of the phonetic elements, the phonetic features in particular, both phonetic and prosodic, as well as some other acoustic landmarks that we'll talk about later. But it also has mechanisms of memory, and to some degree prediction that are integrated

with those inputs. And so, the detection and processing of those features, with context, memory and prediction, are then fed back recurrently within the same part of the brain to generate and interpret the next bit of information that comes in. So that’s something we broadly call recurrent. And that's a very different kind of representation than one that is just sort of looking at larger receptive fields and integrating more

information. At a computational level, it could look the same, because if you unroll a recurrent system, it could look like, you know, it's just integrating more information over time, but the way that it does it, I think, is really fundamentally different and the predictions that you would have about how these things lay out across brain area, are also

different. So, this is something that we're exploring pretty deeply and we're trying to understand, you know, how, how this could happen, you know, at a mechanistic level.
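
As a rough illustration of the distinction being drawn here, a recurrent computation that carries an internal state forward in time versus a feedforward one that simply integrates over a longer window, here is a minimal toy sketch in Python. The update rules and parameters are illustrative assumptions, not the lab's model.

```python
import numpy as np

def recurrent_readout(inputs, w_in=0.6, w_rec=0.8):
    """Toy recurrent unit: the current state mixes the new input with a
    decaying memory of everything heard so far (assumed update rule)."""
    state = 0.0
    states = []
    for x in inputs:
        state = np.tanh(w_in * x + w_rec * state)  # context feeds back into the next step
        states.append(state)
    return np.array(states)

def windowed_readout(inputs, window=5):
    """Feedforward alternative: just average the last `window` samples.
    Unrolled in time, the two can look similar, but only the recurrent
    version carries an explicit internal state between time steps."""
    padded = np.concatenate([np.zeros(window - 1), inputs])
    return np.array([padded[i:i + window].mean() for i in range(len(inputs))])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sound = rng.standard_normal(50)   # stand-in for an acoustic input stream
    print(recurrent_readout(sound)[:5])
    print(windowed_readout(sound)[:5])
```
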

Stephen Wilson

Right. Okay, so that's a good outline of the sort of big picture of the theory. And in your paper, you kind of get to that at the end, after going through some of the important empirical observations. So let's kind of step back now into that part of the paper and talk about some of these findings that ended up

motivating this model. And I guess I've been, you know, as I tried to think about it, it seemed to me that one of the central themes of the paper and of your work in general is all the different ways in which the STG is not just representing sounds veridically but is shaping them in ways that are influenced

by the linguistic system. And so when we talked last year, we talked about your first paper on this topic, which was back in 2010, on categorical perception, and you do mention that again here. You know, you've got this evidence that when you present people with a ‘ba’, ‘da’, ‘ga’ continuum, the neural representations in the STG kind of fell into three categories, corresponding to the phonemes that are perceived, rather than just kind of representing all the scale

points. So I guess we won't kind of talk about that again, because we, well, actually, maybe we should just kind of brush up on it, because it is kind of the the entry point into this paper too. So, do you want to kind of just recap on that one?

Edward Chang

Sure. I think it's an important starting point and I think it's very relevant because it's shaped our perspective on this over time.

You know, what we did back then was, I think, a very entry-level look into what you would do, what you would look for, for example, if you had high density recordings from STG. And this was one of the first studies methodologically that was done using high density ECOG recordings during awake brain surgeries to record from the Superior Temporal Gyrus, where people are listening to speech sounds. And the high density part actually turns out to be very

important, because it allowed us to change the nature of the research from kind of like the broader questions, like is this area involved in speech perception, to asking more finely detailed questions, like what's being coded by, you know, a millimeter size piece of cortex and how does that compare to the next site that's only a

couple millimeters away, looking at this as a population thing, in both spatial and temporal resolution. So, for us the natural starting point was to look at some tasks that historically have been very relevant for psycholinguistics, and one of the central ones was categorical perception, really trying to understand some properties of nonlinear perception, basically, where you can take a stimulus and vary some parameter of it, like, you

know, the formant, like, the starting point, for example, for these consonants like ‘ba’, ‘da’ and ‘ga’. And what you can see really clearly on a perceptual level is that people will really perceive three consonants, even though the stimulus is parametrically and continuously

varied. And so, that was the first clue that, you know, things are not veridically represented, and it was actually the first of many other kinds of evidence that we've seen since then that what we see in the STG really corresponds to what we perceive and what we experience.

Stephen Wilson

As opposed to the physical stimulus?

Edward Chang

As opposed to the physical stimulus. That's exactly right. And so a couple examples of that would be, for example, when you're listening to two pieces of speech from different speakers that are mixed and overlapping, and you pay attention to just one of them, the STG is really selectively processing the one that you're paying attention to. That's like a high level attentional filtering for what

we perceive. There are other things, like how we normalize for things like speaker identity, and other kinds of categorization. Or another example is phoneme restoration: you mask a phoneme in a word, like ‘faster’ and ‘factor’. If you mask that middle phoneme out with noise and ask what people hear on a single trial basis, about 50% of the time they report one word and 50% the other.

And we can see in ECOG recordings that what they perceive actually corresponds to whether the brain is activating that ‘k’ sound in ‘factor’ or the fricative ‘s’ sound in ‘faster’. So, I think over and over again, we've seen really clear evidence that what's happening in the STG really does correspond to what we experience in perception.

Stephen Wilson

Yeah, let's, let's dig into all of those in a little bit more depth. Because you know, I think that, you know, you just talked about like four or five papers, each of which are important in their own right. And I think our listeners would be interested in some of

the details there. So maybe we could start with speaker normalization, and maybe you could talk about that in the context of your study on vowel representations, and, um, the way that you saw evidence for representations of distinct vowels that were also modulated by the speaker in such a way that the representations capture vowel identity rather than just the simple formants.

Edward Chang

Right, so um, kind of early on, we teamed up with a linguist, Matthias Sjerps from Amsterdam and he was able to join us in collaboration with Keith Johnson at Berkeley. And he'd been very, very interested actually, in this phenomenon called speaker normalization.

The basic idea with speaker normalization, is that, because of the vocal tract properties, all of us as individual speakers, like you and me, Stephen, we have our you know, voice properties that are governed by the pitch of our voice, but also the shape of our mouth and that's what gives rise to some of this, some of the specific acoustic attributes of

our speech. And there's this thing that's very, very interesting and important about speaker normalization for vowels, which is that, for the most part, the cues related to the shape of your mouth overlap with some of the cues that are generated when you speak

different vowels. And that poses a really interesting problem to the perceptual system because, when you're trying to understand what someone is saying, to some degree, you want to get rid of the speaker identity related stuff, just so you can extract

the vowels themselves. And it turns out, through, you know, years of psycholinguistic studies, that's a really robust phenomenon that allows us to extract vowel identity while sort of ignoring, I wouldn't say not processing, but being able to hear vowels irrespective of that speaker identity. And in order to do that, what you have to do is to essentially contextualize the speaker identity, and then process the formant information from the

vowel. So formants are the really important resonance frequencies that are created by the shape of the vocal tract when we create vowels. And you have to do like a very quick computation that essentially normalizes for who's speaking, basically making an adjustment for the formant properties of a given speaker in order to arrive at the correct vowel identity.

Stephen Wilson

Yep, I just want to kind of make it even more concrete. So, you know, I think many of our listeners will know this already, but, you know, most vowels can be discriminated pretty well by their first two formants, called F1 and F2. So, you can almost imagine all the possible vowels falling into this two dimensional space based on the first and second

formants. But the tricky thing, as you're alluding to, is that formant frequencies depend not just on vowel identity, but also on the length of the speaker's vocal tract, which is probably the most salient sort of non linguistic influence on them, right? So at a simple level, what you have to do is not just interpret the formants in absolute space, but kind of relative to the speaker’s vocal tract length, which I guess the listener has to infer based on other aspects of their speech, right?

Edward Chang

Exactly. Exactly. It's an example of taking contextual information about the speaker identity, essentially, the cue of vocal tract length, and incorporating that in the computation of vowel identity. So it's not just about the formants themselves, it's about this additional information about the speaker identity that helps us normalize for the vowel.
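
A minimal sketch of the general idea of speaker normalization, assuming a simple Lobanov-style z-scoring of formants within each speaker. The formant values are invented and this is not the analysis in Sjerps et al.

```python
import numpy as np

# Rough, made-up F1/F2 values (Hz) for two vowels from two speakers whose
# vocal tract lengths differ; the longer tract shifts formants downward.
tokens = [
    # (speaker, vowel, F1, F2)
    ("long_tract",  "oo", 300, 850),
    ("long_tract",  "oh", 450, 900),
    ("short_tract", "oo", 380, 1050),
    ("short_tract", "oh", 560, 1120),
]

def lobanov_normalize(tokens):
    """Z-score each speaker's formants against that speaker's own mean and SD
    (Lobanov-style normalization) so vowels can be compared across speakers."""
    out = []
    for speaker in {t[0] for t in tokens}:
        rows = [t for t in tokens if t[0] == speaker]
        f = np.array([[r[2], r[3]] for r in rows], dtype=float)
        z = (f - f.mean(axis=0)) / f.std(axis=0)
        out += [(r[0], r[1], *zrow) for r, zrow in zip(rows, z)]
    return out

for speaker, vowel, z1, z2 in sorted(lobanov_normalize(tokens), key=lambda t: t[1]):
    # After normalization the same vowel lands in roughly the same place for
    # both speakers, even though the raw formants differ.
    print(f"{vowel:>2s} {speaker:<11s} F1z={z1:+.2f} F2z={z2:+.2f}")
```
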

Stephen Wilson

So can you tell me how you set up the experiment to look at how this kind of information is represented in the STG?

Edward Chang

Well, so the basic idea is that there's a psychophysical experiment where, you know, we generated these synthesized speech sounds in order to make them really well controlled. And we again created a continuum, so we can define some psychometric function, basically, for identifying one vowel, for example, ‘oo’ versus ‘o’, and we identify sort of the identification boundary between those two vowels across all these stimuli.

Importantly, these are embedded in the context of, let's say, a sentence. So the listener can first hear, you know, the preceding speech, and then they have to identify the vowel that's embedded in it. It's that context of the surrounding speech that gives listeners a cue to, what's the context? What's the baseline? And we run the same thing, basically, while we're recording

the brain activity. And so, the basic approach is to see if we can correlate what we're seeing in the psychophysics with what we see in the brain activity. And in that particular study that Matthias carried out, there actually was a really good correspondence between what the listeners perceived and what we saw in the neural activity.
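
A toy sketch of the kind of psychometric analysis being described: fit a logistic identification function to responses along a synthesized continuum and read off the category boundary. The identification proportions below are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented identification data: proportion of "oo" responses at each of
# eight steps along a synthesized vowel continuum.
steps = np.arange(1, 9)
p_oo = np.array([0.98, 0.96, 0.90, 0.72, 0.30, 0.10, 0.04, 0.02])

def psychometric(x, boundary, slope):
    """Logistic identification function; `boundary` is the 50% crossover."""
    return 1.0 / (1.0 + np.exp(slope * (x - boundary)))

params, _ = curve_fit(psychometric, steps, p_oo, p0=[4.5, 1.0])
boundary, slope = params
print(f"category boundary near step {boundary:.2f}, slope {slope:.2f}")
```
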

Stephen Wilson

Right, so the STG activity kind of followed the perceptual judgments about the vowels, rather than following the physical formants that were contained in the vowels.

Edward Chang

That's exactly right. And so, it’s results like that, and more, that have contributed to us thinking about the STG as a really important area for phonological processing. And what I mean by phonological processing is that this is an area that's important for speech sound processing. It's not just spectrotemporal. It's not just acoustic. In a lot of ways people use those terms to

describe kind of a low level. But in reality, it's a really high level of extraction and computation that's necessary to do these things. And so, we use this term phonological in a

very, very specific way. And to me, it's also important because it has a lot of resonance even with much earlier work, where Wernicke had described and postulated that, you know, the Superior Temporal Gyrus and the posterior temporal lobe had a really important role in this, quote unquote, sound word image, that this is an area that was really important for the generation of not just the sound, but the sound word, or the object level representation

of words. Not the low level, not the high level linguistic, but really this word form representation. So that's what we refer to as phonological. And the key evidence for me is that the neural recordings there really correlate with perception.

Stephen Wilson

Yeah, I think you've, you show it in lots of different ways. And let's talk about the second one that you mentioned briefly just there, because I think it's a lovely experiment by Matt Leonard and your group on phoneme restoration. So this one's from 2016, and you kind of run through it real quickly, you know, five minutes ago, but can

you flesh it out a little? So you talked about how, you know, when a noise interferes and blocks out a phoneme completely, listeners are not even able to report which phoneme was blocked out, they restore the missing phoneme, and are almost, like, barely aware

that something was missing. And then you can create really interesting experiments around this by constructing situations where there are actually two possible completions, and the example that you gave was ‘faster’ and ‘factor’, which actually doesn't work at all in my dialect, because they have different vowels. But in your dialect, you would say ‘faster’ and ‘factor’ and they’re a minimal pair. And so you can block out the ‘s’ or the ‘k’ and then see what

people perceive. Right?

Edward Chang

Yeah.

Stephen Wilson

So yeah, so you want to take it from there and kind of tell us more about the neural and behavioral findings?

Edward Chang

Sure. So previously we were just talking about a kind of nonlinear processing that generates behavioral effects like speaker normalization. Matt’s paper on phoneme restoration, I think, alludes to another really basic computational property in STG that has to do with dynamics. And what I mean by dynamics is something that has to do with a contextual processing that is time dependent, and his paper really

gets into this a bit. But yeah, so the basic idea is Richard Warren, you know, decades ago, did this classic experiment that showed that speech perception is really robust: you can mask out certain sounds, or you don't even have to mask them, you can actually delete them and replace them with noise, and people can actually perceive the word in

its entirety. In fact, with the stimuli that we created, that's what we did: we actually spliced out some of the segments and replaced them with white noise, and you get a clear perception of the word. And sometimes, strangely, you actually hear the noise burst, like, independently, it's like your auditory system is processing a word plus a noise burst. Sometimes it's so robust that you perceive the noise burst to actually follow the word.

Stephen Wilson

Wow!

Edward Chang

It's like, so robust, you know, our auditory system, it’s like a mechanic taking apart the word and then reassembling it, taking the noise out and putting in one…

Stephen Wilson

Yeah.

Edward Chang

Yeah, and then…

Stephen Wilson

And that reminds me of some of those old dichotic listening experiments, you know, where they would take, like, one particular formant transition and play it through one ear when it was missing from the other ear, and what would be heard was the reconstructed combined stimulus and also, separately, a chirp.

Edward Chang

Yeah.

Stephen Wilson

Right? So it's kind of along those lines.

Edward Chang

It really is. And that's another classic example of how robust this is, and how really remarkable the perceptual system is to do that. So what Matt did, which I think was quite ingenious, was not just to do that effect, he really wanted to prove that this could correlate with the single trial dynamics. And he wanted a more interesting behavior. So, he actually, you know, did something new, which was he created a series of word pairs.

So ‘factor’ and ‘faster’ is one, where you have one phoneme that's different between the two words that you can mask out, and other examples like ‘babies’ and ‘rabies’. Okay, so just two words where you can replace one phoneme with the other. And then ask….

Stephen Wilson

‘Babies’ and ‘rabies’, yeah, both things that can complicate your life. (Laughter)

Edward Chang

Really different words, really different meanings, from just a switch of a single phoneme. And what he found was that, in many cases, when you mask this, on a given trial, a single trial, a subject would hear one word, and on other trials they would hear the other word, even though it was the same stimulus.

Stephen Wilson

Did you need to do anything to kind of push them one way or the other, or was it just kind of stochastic?

Edward Chang

A little bit of both. So, what we found was that this doesn't work for all words. For example, there were some word pairs where we masked it, and they always heard it as one form or the other, for whatever reason. So there was this smaller subset where we felt that it was kind of ambiguous,

you know. So we really focused on that and I think that it was really critical that, on a single trial level, we could look at those trials and what was happening in the brain when people were reporting ‘factor’ versus when they were reporting ‘faster’. And because of previous work that we had done in the lab, we knew that we could find electrodes, for example, that were tuned to, that encoded, fricative information like the ‘s’ sound in ‘faster’, or the plosive sound, the ‘k’, the velar

plosive sound in ‘factor’. And so we were looking for that, we were looking for evidence that when you perceive these sounds, do you activate those phonetic features, even when they're not actually in the stimulus. And I think the thing that was really cool was that Matt showed that it really does activate those things, even though they're not actually in the stimulus at all.

Stephen Wilson

And the timing of the, that activation was like…

Edward Chang

What was interesting about that was that the timing of the activation, and what we call the restoration, for the deleted phoneme actually happened sort of where you would expect it in the word, you know, in its context in the word. So you found it, for example, in ‘factor’, ‘faster’ right in the middle, you know, where we predicted that kind of restoration. So it was happening

in real time online. It wasn't something like the STG was putting it in after the fact, after thinking about it and guessing. It was happening online, as we were listening, or as the subjects were listening to the sounds. And related to that question about timing, we looked at a separate question, which was….

Stephen Wilson

Hang on, before we get to the separate question, I mean, that's really interesting, because I think that when this effect was first talked about, in the psycholinguistic literature, I think there was a lot of discussion as to whether it was kind of like an active process or a post hoc interpretive process, right? And it seems like you have very clear evidence that it's very much of an online process.

Edward Chang

That's right. The debate was really, is this like post processing versus online processing? And why that's a really important distinction is, in order to do it online, you actually have to use context and predict, you know, to some degree, what's going to happen. And so that's what led to the second set of analyses that Matt did, which was to understand, using a

decoding approach, if you look at the population activity across not just STG, but all across the brain, when was it clear that someone was going to say, or report back, ‘faster’ or ‘factor’ as what they heard? And the thing that I thought was really quite interesting and surprising at the time, was that we could predict what word someone was going to perceive on a single trial basis, even before they heard that critical missing deleted phoneme.
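
A toy sketch of the flavor of decoding analysis being described, not the actual analysis in Leonard et al.: a cross-validated linear classifier trained on simulated pre-phoneme neural activity, predicting which word the listener will go on to report. The simulated data, the number of electrodes, and the classifier choice are all assumptions; above-chance accuracy here just stands in for the real finding.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Simulated single trials: high-gamma activity on n_electrodes, averaged over
# a window *before* the masked phoneme. A weak built-in bias makes trials that
# will be reported as "faster" differ slightly from "factor" trials.
n_trials, n_electrodes = 120, 40
reported = rng.integers(0, 2, size=n_trials)          # 0 = "factor", 1 = "faster"
bias = np.zeros(n_electrodes)
bias[:5] = 0.8                                         # a few predictive electrodes
X = rng.standard_normal((n_trials, n_electrodes)) + np.outer(reported, bias)

# Cross-validated decoding: above-chance accuracy is the toy analogue of
# predicting the reported word from activity preceding the deleted phoneme.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, reported, cv=5)
print(f"decoding accuracy before the masked phoneme: {scores.mean():.2f} (chance = 0.50)")
```
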

Stephen Wilson

Yeah, I loved that finding. I thought it was, it's, it's so sci-fi. It’s like, it's like, we've gone beyond mind reading and now we're gonna read like, what you're gonna think in 300 milliseconds.

Edward Chang

Yeah, I think you're right. It was hard to understand at the time. And you're right, what it is, is predicting. And I think the interpretation that we have, and we're still trying to understand it more, is that the way this information unfolds is not necessarily just like an instantaneous serial readout of the input. But in fact, the system is making predictions.

And it's doing it way ahead of time, like perhaps even like the very first, you know, phoneme in the word, it's starting to make some predictions about what it's going to hear. So we started thinking about this in terms of dynamical systems.

Stephen Wilson

Hang on, which, you have to, which brain area did you find these predictive signals in first?

Edward Chang

Yeah, great question. So what we found was that it was not just the STG. We were predicting it might be just the STG, but we in fact looked everywhere, used a decoder over all of it. And we looked at the weights that were really relevant for it. And it turns out that additional areas were very relevant, and one of them happened to be frontal cortex. In particular, areas of what we now call the mid precentral gyrus, areas that you, Stephen, very early on identified as having auditory processing

properties. It’s hard to know whether this is truly causal for this, but activity in those particular brain areas actually could predict what word someone was going to hear later on.

Stephen Wilson

Oh, really? I, I didn't realize, so it was kind of that dorsal precentral area more so than Broca's area?

Edward Chang

I would say, you know, in the paper we interpreted it, at the time, and I think because of how we were thinking, we didn't really understand what was happening in that dorsal auditory area. But it was actually both, both contained information.

Stephen Wilson

Okay, that's fascinating.

Edward Chang

Yeah, so in that particular case, it may be that, you know, it's not just STG, but it's actually STG in combination with some other areas that may have some like, generative phonological functions that help guide, you know, our predictions for, for speech.

Stephen Wilson

Yeah, I mean, like, you know, both you and I are very temporal-centric in our concept of language in the brain. But I think we both, you know, would give some nod to the traditional concept of frontal expressive function and temporal receptive function, and this kind of plays into that.

Edward Chang

Yeah, I think one of the reasons we are, is because we spend a lot of time also thinking about, like, the causal things, like what happens to speech perception when you lesion the frontal areas versus the temporal. It’s clearly more impactful when, you know, things are happening to the STG, especially on the left side. There is this real primacy of the left STG.

Stephen Wilson

Okay, so I think you were, before I asked the anatomical question you were going to talk about, you know, what it means that this prediction happens in advance?

Edward Chang

Well, I think what it means is that, instead of thinking about the STG and related speech areas as, like, passive speech detectors, like feature detectors, it alludes to the fact that what's going on is very dynamic and very

contextually dependent. Because, if you can essentially restore an activated phonetic element representation, even when it's not in the stimulus at all, because you have the context to do that, it means that the processing is not happening in this kind of very simple, instantaneous way, you know, you hear the feature, and then it activates, and then it's gone. What it means is that we're holding on to the information, we're using that memory, we're

actually making predictions. And it can be very, very powerful: the trajectory of a sequence of sounds and words gets started, and that object starts moving, and it has its trajectory from memory, let's say, and it's looking for evidence to fit something like that. And then, if it's not there, it can still continue even without the direct evidence

to support that phoneme. So I think it's an important observation about, like, the nature of the computations there, and it really is different than this idea that we're just doing this millisecond by millisecond feature detection. We're doing a lot more, we're integrating over time.

Stephen Wilson

Right. Yeah, and I think all three of these studies that we've talked about, kind of paint this overall picture of this area as being a lot more dynamic, a lot more linguistic, than just a simple spatiotemporal detection array,

right? I mean, whether it's, you know, categorical perception, adjusting to the boundaries of the phonemes, whether it's speaker normalization, whether it's filling in missing phonemes using information from other parts of the language network perhaps, it's all kind of painting this picture of a much more linguistic lateral STG than perhaps we had in mind a decade ago. Is that a fair summation of what you're saying?

Edward Chang

Yeah, I think it is, I think it's a lot more interesting and complex than just a station, you know, in a hierarchical model that's just doing detection of different levels of representation. There is computation that's going on and in ways that just don't fit the classic model. So, I think that, that gives us a lot of clues as to you know, what are the next steps in thinking about

the nature of processing. So, you know, I alluded earlier to this idea that there's this

primacy of the left STG. And what I mean by that is that there is something very important happening in the left STG that is important for speech perception. The evidence that to me is very salient in supporting that, is that when I electrically stimulate the left STG, and not the right STG, especially in the mid and sometimes the posterior STG, you can interfere with

speech perception. Stimulate in areas of the mid STG and people will give you a, you know, ‘What did you say?’ type of thing. They’ll say, ‘I heard something, but I couldn't understand the word’, and you don't see that effect on tone perception, for example, in that area. Dana Boatman did some incredible early experiments that showed this. We've replicated a lot of this…

Stephen Wilson

Have you published, you've published that stuff, yet? You're….

Edward Chang

Published some of it with Matt Leonard, and we're looking at it even more granularly now with Deb Levy. So we're following that up, we're trying to understand it even more. I was actually very surprised to find that even Penfield had not really described the effects of left STG stimulation on speech perception or even auditory comprehension.

Stephen Wilson

No, he looked much more at production, you know, he didn't really get much into perception. I mean, maybe George, George Ojemann maybe did more, touching on it, but….

Edward Chang

Yeah, I mean, the tasks that they were using were really just three or four. It was counting, you know, picture naming, maybe some repetition, actually they didn't do a lot of repetition, and reading. And so those are the main tasks that were being done. Actually, if you look really carefully, there was not a lot of work done on speech perception as such.

Stephen Wilson

I mean, it was really Boatman, like you said, that was the strongest evidence of this from the classic literature. Is twenty years ago classic now? I think it is. (Laughter)

Edward Chang

There is like, hundred year old classic and then there is twenty year classic.

Stephen Wilson

Yeah, so um, you know, it's fascinating that you're saying that stimulation really only interferes with perception when it's on the left, because as far as I know, from reading most of your papers, a lot of these effects that you describe with ECOG are pretty bilateral, right? Half the time you barely even report which hemisphere you're recording from. Is that true? Are all of these ECOG findings similar in the left and right hemisphere?

Edward Chang

It's true. Stephen, it's absolutely true. So…

Stephen Wilson

Okay, so why, have you thought about why it is that only the left can interfere?

Edward Chang

I mean, I have thought about it, like probably everybody, you know, who studies fMRI in comparison to what you see with stroke patients, which is that there is this pretty big disconnection between what you see activated versus what we find to be causally involved, like, you know, essential for the processing. So with both ECOG and fMRI, you can see a lot of bilateral activation for speech, but the lesion literature, and also the stimulation literature, is a lot

more lateralized. And so, I think that this is a really fascinating question. My own perspective on it is that we have bilateral representations, and the non-dominant ones are kind of like vestigial. They have a lot of the same machinery, so that in extreme situations where, you know, let's say someone loses one side, maybe the right side can take on that function, because it's got the machinery, but the left one is maybe what we prefer to use and is the

default. And there may be some time limits actually on reorganization or activation on the right side. So, you know, that's just my way of thinking about it, really through the lens of making decisions, like, when can we remove these areas in people having brain surgery, for example.

Stephen Wilson

Yeah. Because you'd be very reluctant to remove somebody's left STG, wouldn't you?

Edward Chang

I'd be very reluctant to do that.

Stephen Wilson

Yeah.

Edward Chang

We do it routinely on the right side.

Stephen Wilson

Right. Yeah. I mean, without even doing it awake, right? I mean, you'll do, you’ll do an asleep craniotomy on the right side.

Edward Chang

Yep. The patient, right after surgery, understands every word you say, just says thank you, goes home the next day, and it's just very dramatically different.

Stephen Wilson

Yeah. Is it possible? You know, from my perspective, I see this bilaterality in the STG too with fMRI, like you mentioned. I’m sure you know that I'm gonna ask you about the STS, because in the STS I do not see symmetry, right. The STS is where I start to see asymmetry, right, I see activation of the left STS

but not the right. I mean, it's a matter of degree, there's some in the right. That makes me wonder, to what extent could these findings in the lateral STG be sort of back-projected from the more linguistically specialized circuitry in the STS? Could it be feedback, in the same way that, you know, we know through other sensory systems that there's kind of feedforward and feedback in equal amounts? Is it possible that that's what's going on?

Edward Chang

I definitely think that that is possible. And I think your work has been really instructive to me, because we don't really sample from the STS that well. The nature of the ECOG recordings we do is really on the superficial brain surface, the gyral surface, not the sulcal surface that's deep in the sulci. And so, obviously with imaging, you have much better access to them. My own personal feeling is that there is something different

about the STS. Like, I think that it does have maybe more visual inputs than the STG itself, even though we can see visual inputs to STS, STG during reading and other things as well. So I think the STS does have some different inputs. But whether the STS is like necessarily a higher station in a hierarchy, could be true, I don't think it necessarily is

true. I think that, at least in my worldview right now, there are probably overlapping functions between the two and that they're both really important for these processes. I don't see it as like there's a huge difference. And I don't think, for example, when you look at it architecturally and in other ways, that you see something that is radically different in the STS in terms of the kind of computations that it does versus the STG.

Stephen Wilson

Yeah. Do you ever get a chance to record from STS in any special circumstances?

Edward Chang

We have. About five years ago, I had a case where we removed a low grade tumor from the middle temporal gyrus, in someone who was diagnosed with a low grade glioma in the posterior middle temporal gyrus. And during the surgery, the awake surgery, we had a really unique opportunity to put down an electrode array over the superior bank of the STS.

Stephen Wilson

The inferior bank was removed as part of the surgery?

Edward Chang

The inferior bank was removed as part of the surgery. So the tumor had occupied the MTG and the inferior bank of the STS. And, okay, number one, she was able to talk quite well, despite that.

Stephen Wilson

Okay.

Edward Chang

So I don't think that the MTG necessarily was critical, and she was able to comprehend quite well there too. So we did do a recording there and we did analyze it, just preliminarily. The stuff that we saw there was really not that much different than what we see in the STG. So it's not like all of a sudden we're seeing, like, word encoding. That’s number

one. Number two is that when I stimulated there, it had some very similar effects to what happens when we stimulate the lateral STG, you know, the gyral surface. You’d see a lot of effects on repetition, you would see some effects, to some degree, on, like, phonological perception, that kind of thing. And so, it seemed more similar than different. And we would need a lot more data, I think. We’re slowly accumulating

the data to understand this. But we would need more data to try to look at this more granularly. But at least by now, it looks more similar than different. I think it's a really important thing to figure out.

Stephen Wilson

Yeah, and it's a fascinating and unanswered question, because, yeah, your technique, which has the spatial and temporal resolution that we need, has that limitation of being, in most cases, restricted to the surface, right? So we're just gonna have this challenge.

Edward Chang

Yeah. I do have to say, though, like, I've been moving away from thinking about how this process, let's say of speech processing or speech comprehension, works in terms of these anatomical boundaries, like STG versus STS versus MTG, whatever, to thinking more empirically, like, what does the data show us in

terms of their separations? And that analysis to me, it's not mutually exclusive, but I think we've learned a lot of new things from looking at the data more: if you look at the variance in the data, what does that teach us about how some of these areas differ?

Stephen Wilson

Yeah, I guess, like, on that point, one of my takeaways from your paper was that you are still working with the units that everybody else is working with, you know, acoustic phonetic features, phonemes, words, but you're not trying to tie them to anatomical locations. Instead, you make the point that they can be realized simultaneously in the same region at different spatial scales? Can you kind of talk

about that concept? Because I think that's maybe what you're getting at now.

Edward Chang

Yeah. So this actually goes back to the very beginning. So when we first were getting into this, like we said, we looked at this categorical perception effect. And then we very quickly went to natural speech, natural continuous speech, with some of the studies that Nima Mesgarani worked on. He was really my first postdoc in the lab, incredible colleague

and partner and friend. So what he had done was, we looked at natural speech and the natural, you know, the sort of like, natural starting point was to say, okay, do we find evidence for phonemes? Like, that's where we started our journey, we're looking for phonemes, could we see any evidence, you know, that phonemes were this fundamental

unit of speech. And what we found was, at the level of ECOG, and ECOG is probably recording from tens of thousands of neurons under, like, a one and a half millimeter sized electrode on the brain surface, we never really found invariant representations of phonemes themselves at a single electrode. And what we found instead was evidence of, like, an acoustic

phonetic feature. So, acoustic phonetic feature, in the sort of jargon that we're using now, refers to this intermediate level of representation. It clearly is tied to high level auditory acoustic features, but it's phonetic because it's very specific to properties of speech, like these kinds of things that we've been talking about, like categorization or normalization, all these other things. They are linguistically relevant, they're speech relevant. And so that's why we

call it acoustic phonetic. It's not that we're trying to avoid, you know, pinning it down, I really think that it lies at that level. It’s hard for us to say whether it's one or the other. But one of the things that we realized was that features, among other things, are, again, what we have found empirically: we went in looking for phonemes, didn't find them, and we found levels of representation at the level of

features. And so, over time, we've started to think about this long standing question about fundamental units of speech, which has been, you know, sort of a holy grail kind of question in linguistics for decades: like, what is the compositional unit? What is this fundamental thing that we should be focusing on? And people have tried to study that behaviorally for decades. But over time, I think it's a neurobiological question.

I think that it's really ill posed. Because, you know, what if there really is no fundamental unit, who says that actually, there has to be one? Is there a law? Is there a rule that says that that must be the

case? And over the last couple of years, I've really shifted my framework, my reference point, to really avoid that kind of question. Instead of trying to pin down, like, this one unit, whether it be phonemes, syllables, features, words, whatever, we should really be trying to explain speech behavior at all of these levels, like, how does the brain level data account for how we have perception of features, which we know are relevant to speech

perception. But we also know that there are things that are relevant at the level of phonemes. And so, that's where these questions about the neural code are really important. And that's really the core of what my lab is doing right now, understanding the neural code. What I mean by that is how neural activity, expressed both in terms of spatial coding and temporal coding, but also single electrode versus population level coding, works to account for speech

perception. So, we're really trying to solve the connection between how the electrical activity that's expressed at the level of neurons and populations of neurons gives rise to, you know, the experience and perception of speech. And so, with regard to this specific question, features we can see clearly encoded at single electrodes at the ECOG level. We don’t see phonemes. However, we can see how you could generate a phoneme level response by looking at the population of feature-selective

local sites. So, it's an issue of, like, local versus population coding. At the local level, we see things that are these acoustic phonetic features and other acoustic cues, but at the population level, you can see things that are at the level of phonemes. And that theme, I think, is starting to become more and more clear, as we look at even higher levels of, let's say, abstraction, things like syllables, and potentially even words, that are…
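
To make the local-versus-population idea concrete, here is a minimal toy sketch, not the lab's actual analysis: simulated electrodes are each tuned to a single assumed phonetic feature, yet a readout across the whole population can still recover phoneme identity, because each phoneme is a distinct bundle of features. The feature assignments, noise level, and nearest-template decoder are all illustrative assumptions.

```python
import numpy as np

# Simplified binary feature bundles for four phonemes
# (rows: phonemes; columns: assumed features [plosive, fricative, voiced, labial]).
phonemes = ["p", "b", "s", "z"]
features = np.array([
    [1, 0, 0, 1],   # /p/
    [1, 0, 1, 1],   # /b/
    [0, 1, 0, 0],   # /s/
    [0, 1, 1, 0],   # /z/
], dtype=float)

rng = np.random.default_rng(1)
n_electrodes = 12
# Local coding: each simulated electrode is tuned to exactly one feature
# (electrodes cycle through the four features), never to a whole phoneme.
tuning = np.eye(4)[np.arange(n_electrodes) % 4]

def population_response(phoneme_idx):
    """Noisy response of the whole electrode population to one phoneme token."""
    return tuning @ features[phoneme_idx] + 0.2 * rng.standard_normal(n_electrodes)

# Population readout: match the whole response pattern to phoneme templates.
# Phoneme identity is recoverable from the population even though no single
# electrode codes it on its own.
templates = np.stack([tuning @ f for f in features])
trials, correct = 200, 0
for _ in range(trials):
    idx = int(rng.integers(0, 4))
    decoded = int(np.argmin(np.linalg.norm(templates - population_response(idx), axis=1)))
    correct += decoded == idx
print(f"population readout accuracy: {correct / trials:.2f}")
```
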

Stephen Wilson

Yeah, it's fascinating. It's just a different way of thinking about a hierarchy. I mean, you could call it non hierarchical, I guess, or you could just call it a different way of thinking about a hierarchy. Like instead of thinking about a sequence of regions, you're kind of thinking about a series of scales, spatial scales, and probably ultimately temporal scales as well, right?

Edward Chang

Yeah, that's exactly right. It's a little bit orthogonal, actually, to thinking about this in the classical serial feedforward way, it's just not compatible with that. It’s telling us that you can have levels of representation within the same area, just by configuration and the level of how closely you look versus how globally you look at the processing.

Stephen Wilson

Yeah.

Edward Chang

And I think that the reason the wider linguistic endeavor, the, you know, sort of linguistic behavioral or psychology endeavor to figure this out, stopped was because there was good evidence that, as listeners, we actually use all of those things when we listen. There are correlates, and they do have, like, psychological and mental

realities to them. So it was never really, you know, really defined, I think, even behaviorally. And so it's really shifted my thinking about our objective with the neural analysis.

Stephen Wilson

Yeah. Okay, so the last part of your paper that we haven't really talked about yet is to do with this coding of amplitude transitions and onsets. Um, can you talk about how that fits into this coding framework you've been talking about?

Edward Chang

Okay. So again, a lot of our approach has been empirical. And what I was referring to earlier, it's like, you know, we look at how the brain is processing all these different elements of speech. Let's use that as the guide.

Let's try to understand the variability: you listen to natural speech, you look at the neural responses, and instead of imposing, you know, a model, let's say, on that, try to understand the variance, the structure of the neural responses when somebody is listening to a series of different sentences. And so, in addition to a lot of our work that's really been focused on, you know, explicit features like pitch, prosody, you know, the phonetic features, etc., that’s

really model driven, a complementary approach is something that's more data driven, right? So what's the structure in the data telling us about how these things are organized? And so, one of the things that we discovered using a data driven approach, using a technique called unsupervised learning, which essentially just gives you, sort of, what are the big structures in the data? What are the patterns that are really driving a lot of the variance in

the data? If you run that kind of analysis, one of the things that we found was that there was this zone in the posterior STG that was clearly physiologically different than what was happening in the rest of the STG. And when we looked at what was different about it, it was that its primary response type was that it was very strongly activated at the beginning of each sentence. Okay, so that was the thing that was interesting about it.

Stephen Wilson

So it’s coding the onset of speech?

Edward Chang

Yeah, it's like coding the onset of speech. And it's not on during, like, let's say, the ongoing sustained part of it.

Stephen Wilson

Okay.

Edward Chang

What we found was that you have this zone in the posterior, it’s really the posterior superior part of the STG, that is strongly activated by the onset of speech, and a lot of the areas around it are coding the ongoing sound. So this is something we never had any hypothesis about. It's like the data is saying, if you want to understand me, I'm going to tell you, this is the biggest thing that stands out in the data.

Stephen Wilson

So there's like a lot of variance here in …..

Edward Chang

A lot of variance. This is, like, one of the first two dimensions of structure when you look at the variance explained in the STG: there is this one zone. And it didn't have to be that way. So for example, you could have seen this response type kind of scattered around the STG. With the analysis we did, we just put in all of the individual electrode responses as independent, you know, data points, and then what it showed us was, yeah, there's this one response type that

looks like onset. And then of course, when we looked at where those electrodes were, they all clustered in this one particular area in the posterior superior temporal gyrus. Okay, so then the next question is, what the heck does this mean? Okay, so that's the challenge when you do something data driven, which is, you know, it will tell you something important about the structure, but the interpretation, I think, can be

challenging. And so, what we found was, and we modeled this in different ways, but in all of our subjects' STGs, there's this one zone that is really, really strongly activated when the onset of speech happens. What I mean by onset, and something that I think is really important about the physiology we see there, is that it's not just the onset of speech as a signal, but the onset of speech that's following a period of silence.
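
A minimal sketch of this kind of data-driven decomposition. The published analysis used convex NMF; here, plain NMF on simulated electrode responses is used purely for illustration: one component recovers an onset-like time course and the other a sustained one, and the weights say which simulated electrodes belong to each group.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_time = 100                       # time points in a simulated sentence

# Two idealized response profiles: a transient burst at sentence onset and a
# sustained response for the rest of the sentence.
onset_profile = np.exp(-np.arange(n_time) / 5.0)
sustained_profile = np.concatenate([np.zeros(10), np.ones(n_time - 10)])

# Simulated electrodes: half behave like onset sites, half like sustained ones.
profiles = [onset_profile] * 20 + [sustained_profile] * 20
X = np.array([p + 0.05 * np.abs(rng.standard_normal(n_time)) for p in profiles])

# Unsupervised decomposition: without being told anything about onsets, the two
# components recover onset-like and sustained-like time courses, and the weights
# indicate which electrodes belong to which group.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
weights = model.fit_transform(X)           # electrodes x components
components = model.components_             # components x time

dominant = weights.argmax(axis=1)
print("electrodes assigned to each component:", np.bincount(dominant))
print("component peak times:", components.argmax(axis=1))
```
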

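A rough sketch, in Python, of the kind of data-driven decomposition being described here. The synthetic response matrix, the choice of scikit-learn's NMF, and all shapes and parameters are illustrative assumptions, not the analysis used in the actual study.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic stand-in for trial-averaged high-gamma responses:
# rows = electrodes, columns = time points across a sentence (non-negative).
rng = np.random.default_rng(0)
n_electrodes, n_time = 64, 300
t = np.arange(n_time)

onset_profile = np.exp(-t / 30.0)                            # transient burst at sentence onset
sustained_profile = 1.0 / (1.0 + np.exp(-(t - 50) / 10.0))   # ramps up and stays on

X = np.zeros((n_electrodes, n_time))
X[:16] = np.outer(rng.uniform(0.5, 1.5, 16), onset_profile)      # "onset-like" electrodes
X[16:] = np.outer(rng.uniform(0.5, 1.5, 48), sustained_profile)  # "sustained" electrodes
X += 0.05 * rng.random(X.shape)                              # noise; keeps the matrix non-negative

# Unsupervised decomposition: X ~ W @ H, with no labels or hypotheses supplied.
# H holds temporal response profiles; W holds each electrode's loading on them.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(X)   # shape (electrodes, components)
H = model.components_        # shape (components, time)

# The component whose temporal profile peaks earliest is the onset-like one;
# the electrodes that load heavily on it are the candidate "onset zone".
onset_comp = int(np.argmin(np.argmax(H, axis=1)))
onset_electrodes = np.where(W[:, onset_comp] > W[:, onset_comp].mean())[0]
print("electrodes loading on the onset-like component:", onset_electrodes)
```

The point of the sketch is only that the grouping falls out of the data: nothing tells the model which electrodes are which, yet the onset-like electrodes separate cleanly onto one component.
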
Stephen Wilson

And how much silence do you need to get this neural response?

Edward Chang

Like 200 to 300 milliseconds.

Stephen Wilson

Okay, so like a new phrase?

Edward Chang

Yeah. So basically, the transition from silence to the onset of speech strongly activates this. And it turns out this is not a speech-selective thing. Okay? So we characterized this with other stimuli, including backwards speech. This is like a general acoustic thing. So, we started thinking about what this could mean, and one of the things that we think it's really important for is helping to initialize not just speech processing, but any kind of

stream of sound, right? So, computationally, it's actually very important to know when something is beginning, because you can initialize the computation there, at many different levels. At the acoustic level, you're starting to predict, like, the phonetic information that's coming. At a syntactic level, you have to know when things are going to begin and when they're not, and the brain has an acoustic level of representation for trying to

figure that out. So, what that means is that sounds that are occurring at the beginning of the sentence are encoded very, very differently than sounds that occur later on in the sentence.

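A minimal sketch of the silence-to-sound criterion being described: flag envelope onsets only when they follow a sufficiently long stretch of silence. The toy envelope, the threshold, and the 250 ms silence requirement are assumptions chosen only to match the ballpark figure mentioned above.

```python
import numpy as np

fs = 100                                    # envelope sampling rate (Hz), an assumption
t = np.arange(0, 4.0, 1 / fs)

# Toy envelope: speech from 0.5-1.5 s, a 100 ms gap, more speech at 1.6-2.2 s,
# then a long silence and a new sentence starting at 3.0 s.
envelope = np.zeros_like(t)
envelope[(t >= 0.5) & (t < 1.5)] = 1.0
envelope[(t >= 1.6) & (t < 2.2)] = 1.0
envelope[(t >= 3.0) & (t < 3.8)] = 1.0

threshold = 0.1
min_silence = 0.25                          # seconds of silence required before an onset
is_sound = envelope > threshold

onsets = []
for i in range(1, len(is_sound)):
    if is_sound[i] and not is_sound[i - 1]:              # silence-to-sound transition
        silence_start = max(0, i - int(min_silence * fs))
        if not is_sound[silence_start:i].any():          # preceded by enough silence
            onsets.append(t[i])

# The onsets at 0.5 s and 3.0 s qualify; the one at 1.6 s does not,
# because it follows only a 100 ms gap.
print("onsets preceded by at least 250 ms of silence:", onsets)
```
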
Stephen Wilson

Uh huh.

Edward Chang

And why that's important is that it's an acoustic landmark that gives you temporal context.

Stephen Wilson

Okay.

Edward Chang

Like a cue that's telling you something just began, and that the information that's going to follow is the beginning of this stream of information. You know, it happens here, and the brain has a system to be able to do this. Now, what followed from that, and what Liberty Hamilton and Eric Edwards had shown, was that, in fact, if you looked at the response there, you could decode, like up to 90% and sometimes even higher, when a sentence was beginning, just from this

response alone. So it turns out it's a really, really robust way for the brain to code some important information about when speech begins. But not just speech, I think any sort of acoustic stream of information.

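A minimal sketch of that kind of onset decoding: train a simple classifier to tell windows that start at a sentence onset from windows taken mid-sentence, using one onset-zone electrode's response. The synthetic data, window length, and the logistic-regression decoder are illustrative assumptions, not the analysis from the study being described.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a posterior-STG "onset zone" electrode: 500 ms windows
# of high-gamma activity, half starting at a sentence onset (label 1) and half
# taken from mid-sentence (label 0).
rng = np.random.default_rng(1)
n_windows, n_samples = 200, 50            # 50 samples ~ 500 ms at 100 Hz
t = np.arange(n_samples)

labels = np.repeat([1, 0], n_windows // 2)
onset_burst = np.exp(-t / 10.0)           # transient response right after an onset
X = 0.3 * rng.standard_normal((n_windows, n_samples))
X[labels == 1] += onset_burst             # only onset windows carry the burst

# A plain linear classifier on the raw window is enough to separate the two.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, labels, cv=5)
print(f"cross-validated onset-detection accuracy: {scores.mean():.2f}")
```
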
I got really interested in this idea of the onset of speech and what makes it special. A new postdoc named Yulia Oganian had joined the lab, and I asked Yulia to think about how to follow this up. We came up with a new series of experiments where we were asking, essentially, what makes an onset? You know, how fast does the onset have to happen? How much silence do you

need, etc.? And what Yulia basically discovered was that, in order for an onset response to happen, it really does need to come from silence, number one. You can't have some, like, preceding noise or some other sound. It really does require the silence to see those onset responses. But an unexpected thing that we also found was that, you know, she created this set of stimuli that had different ramp times, different intensities of

ramping. Some were ramping up really slowly, others were ramping up really quickly. And what she found was a different set of responses, not in this onset area but in surrounding areas, and especially the middle superior temporal gyrus, that were very much cued to how quickly the amplitude envelope, the signal intensity, ramps up. And we started to look at, you know, why is

this relevant to speech? Well, what we found was that if you look at when the intensity in the speech signal ramps up really quickly, so for example, if you look at a speech waveform and look at the envelope, you know, the speech envelope basically looks like it's got a lot of

peaks and valleys. And so, what we found was, as the intensity goes up towards the peak, not the peak itself, but as it's ramping up to it, there's something we call the peak rate, because it's the peak derivative of the amplitude envelope. The steepest part of that is where you're kind of climbing up towards the peak. It turns out the middle STG has a population of neurons that is very, very sensitive to that property.

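A minimal sketch of extracting peak rate events, i.e., local maxima of the first derivative of the amplitude envelope. The toy two-bump "speech" signal, the Hilbert-envelope smoothing, and the peak-picking thresholds are illustrative assumptions.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks

fs = 16000
t = np.arange(0, 1.0, 1 / fs)

# Toy "speech": two vowel-like bursts, the second one weaker (think unstressed).
envelope_true = (np.exp(-((t - 0.25) / 0.05) ** 2) +
                 0.6 * np.exp(-((t - 0.70) / 0.05) ** 2))
waveform = envelope_true * np.sin(2 * np.pi * 150 * t)

# Amplitude envelope: magnitude of the analytic signal, lightly smoothed (20 ms).
envelope = np.abs(hilbert(waveform))
win = int(0.02 * fs)
envelope = np.convolve(envelope, np.ones(win) / win, mode="same")

# Peak rate: positive peaks in the envelope's temporal derivative, which land
# on the steep rise toward each envelope peak rather than on the peak itself.
d_env = np.gradient(envelope, 1 / fs)
peaks, props = find_peaks(np.clip(d_env, 0, None),
                          height=0.1 * d_env.max(),
                          distance=int(0.1 * fs))   # at most one event per 100 ms

for i, height in zip(peaks, props["peak_heights"]):
    print(f"peak-rate event at {t[i]:.3f} s, magnitude {height:.1f}")
```
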
Stephen Wilson

Okay, and that would seem to have some relation to the timing of syllables, right?

Edward Chang

That's right. So, we looked at when these events were happening, and it turns out the peak rate event happens at a very specific time point in the syllable, which is the transition point from the onset to the nucleus.

Stephen Wilson

Yeah, that would make sense acoustically.

Edward Chang

So, yeah. So acoustically, there is this thing called the sonority principle in linguistics and phonology. And the basic idea behind the sonority principle, which is very important to understanding how syllables are constructed and how they work, is that basically, across all languages, within a syllable you go from low intensity to high intensity, it peaks at the nucleus, which is the vowel, and

then kind of goes down. And that intensity, or sonority, principle governs essentially the sequencing of the consonants and the vowels in a given syllable. They have to be, you know, sort of arranged according to this sonority trajectory. So, things started to click once we started to make those connections, that this new acoustic cue that we call peak rate turns out to be a really, really handy way of figuring out this transition point between the onset consonant and the

nucleus vowel. And it's a very different kind of theory about how syllables work. Because a lot of us kind of think of, or at least I thought going into this, that syllables were like a different, bigger unit, essentially diphones or triphones, you know, a very specific sequence of the onset consonants, the vowel and the coda consonants, right?

Stephen Wilson

Yeah.

Edward Chang

Like, there's this unit, you know, that has all of its phoneme constituents, that the brain is responding to as a unit. Um, and I think what we're thinking now is that the syllable is really an emergent property of this ongoing phonetic-level feature processing in the STG, that we described in earlier work, in combination with a temporal landmark feature, this peak

rate. What's really, you know, cool about peak rate is that it tells you when the syllables are. So, this one discrete event, the peak derivative of the amplitude envelope, called peak rate, tells you when the syllable happens. So that's timing information. But then the magnitude of the peak rate tells you whether it's stressed or not.

Stephen Wilson

Oh, okay.

Edward Chang

So it's like this really important, we think, phonological cue. So, the syllables that have a really high peak rate are stressed syllables, and the ones that have lower peak rates are the unstressed syllables. So it tells you both when the syllable is and whether it's stressed.

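A minimal sketch of reading off that cue: each peak-rate event marks one syllable, and a larger event magnitude is taken as a stress cue. The event values and the relative threshold below are made up for illustration; in practice they would come from a peak-rate detector like the one sketched earlier.

```python
# Peak-rate events as (time in seconds, magnitude of the envelope derivative).
events = [(0.21, 14.0), (0.48, 5.5), (0.66, 12.0), (0.95, 4.8)]

# One syllable per event; the relative-magnitude threshold for stress is an assumption.
threshold = 0.5 * max(m for _, m in events)
for time, magnitude in events:
    stress = "stressed" if magnitude >= threshold else "unstressed"
    print(f"syllable at {time:.2f} s -> {stress}")
print(f"syllable count: {len(events)}")
```
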
Stephen Wilson

It seems like these neurons could provide some kind of skeleton for the whole speech signal, almost.

Edward Chang

That's what we put forward in our paper: that this is potentially a scaffold, like a prosodic, syllabic-level scaffold, that the phonetic information is happening in parallel with; and that a syllable is not, like, a different part of the brain that is sensitive to specific sequences. Rather, the syllable is an emergent property of this temporal landmark in combination with the phonetic features.

Stephen Wilson

Yeah, that makes sense to me. I've always wondered about how privileged syllables should be as linguistic units. I mean, even things like the sonority hierarchy that you talked about, it's not hard to see how it sort of falls out as a consequence of the fact that, you know, there are vowels and consonants and that you have to, like, open and close your mouth. I know it's not that simple, but I think this makes sense.

Edward Chang

I think the reason why it actually has deeper significance is that speech is not like a random concatenation of the segmental units, of phonemes. Right? It's not random at all, it's extremely far from it, and there's this really precise structure that we call syllables. So syllable structure does govern speech, and in every language there is a set of rules about the sequences in which things can occur.

Stephen Wilson

Yeah.

Edward Chang

And I think that…

Stephen Wilson

And most languages are a lot more rigid than English, that's for sure. I think as an English speaker, it's easier to think it's like a free-for-all. Whereas, like, there are so many languages that only allow consonant-vowel syllables, for instance.

Edward Chang

Right. Yeah, that's true. So, what we're trying to do is essentially provide an acoustic, I guess neurobiological, theory of how we can use acoustic cues to generate something like a syllable. And it's interesting, if you look at older theories, let's say neurobiologically, the oscillation theories or others, there are only so many theories about how we chunk at the

syllable boundaries. There's this idea that, for example, theta rhythms are a good marker of syllable boundaries. I think an emerging idea from some of that data is that one of the things that's really driving the quote, unquote, oscillation, or the energy and the timing of the phase information in the oscillations, is this peak rate activating the local field potential and

organizing around that. So in those theories, the model is that the oscillations are encoding the syllable boundaries. It turns out, acoustically and also behaviorally, we're really not that good at detecting where the boundaries are. But we're reasonably good at actually knowing the number of syllables and whether they're stressed or

not. And so, it's something that we're really interested in pursuing more, that syllabic processing is really about detecting this acoustic event and the phonetic features around it. I think it's part of this bigger concept of how temporal integration, you know, happens. It's not just the serial concatenation of, again, random vowel and consonant elements; there's structure.

And I think there's a reason for that structure, which is that the rate at which change happens in speech is so fast. Like, when you think about phonetic segments, it's so quick, on the order of tens of milliseconds, that some people have argued, like, you really can't sequence it, if you were to look at a random sequence of different sounds in that way. You need these temporal landmarks in order to preserve

the sequence information. Like, if you don't have the landmarks, we as listeners aren't able to actually extract the higher-order abstraction over time, because we need the landmarks in order to do that. And without them, the individual elements may get scrambled in perception.

Stephen Wilson

Yeah, it sort of brings it back to what we were talking about before, about to what extent we can analogize our whole field to vision, right? Like, in vision, okay, the world is moving too, but it sort of makes sense to think about, like, the perception of a static scene, whereas in speech there's just no real analog. Like, you know, you could have a representation of a

word, and that'd be awesome. But like, 50 milliseconds later, you're gonna need a representation of another word. So, you know, there's no getting away from this sort of temporal dynamic nature of it.

Edward Chang

It's so fundamentally different in nature. And I bet you that, as time goes on, you know, vision is going to move to more, like, recurrent models, as we move away from core object recognition, which is, you know, just looking at pictures and trying to identify them, to thinking about more naturalistic things like moving images, and how things naturally

move in the real world. You know, if we were to think about an analogy to this, even in vision, we know that the visual system actually is very sensitive to these kinds of spatial contrast cues, like the contrast between the edges of an object, for example, and the space around them; those are very,

very salient cues. And I think the kinds of things that we're seeing, like this onset detector, or these peak rate things, they're really important cues for telling us when things are happening, and they give us the context of when the phonetic information is happening.

Stephen Wilson

Yeah. So, I guess, in wrapping up this review paper, maybe the last thing we could talk about: you sort of talked about this old debate of whether speech is special, you know, contrasting Liberman and Mattingly, with their idea that there's kind of this hyper-modular processor, versus Diehl, Holt, Lotto, etc., arguing that speech perception is really just a subtype of general auditory perception, and you try to propose a third way. Can you kind of tell me about how

you see that debate being resolved?

Edward Chang

I think this is a classic situation where you've got two extreme views, right? Like, that speech is special, and there's something really dedicated to speech processing; and then the other extreme, it's like, there's nothing special about it, it's just that everything that we're seeing is a part of

general auditory processing. And I think that what a lot of the data show is that there is potentially this middle road: on the one hand, a lot of what's happening does follow general auditory principles, like nonlinearity in sensory processing, or these dynamical processes, which I don't think are necessarily specific to speech. But the things that do make it specific are the features, like, where are those parameters? And where are the boundaries? And how are they being shaped? Where

do the boundaries lie? There, I do think it really is actually quite specialized for speech, and I think therein lie some of the things that we can see in terms of the specializations. Like, why are some of these areas really strongly responsive to speech over non-speech sounds? I think it has to do with some of these statistical properties of the parameters. So I don't think a general auditory model can

explain this. On the other hand, a lot of what we see, like what we think of as this high level of speech abstraction, can be explained by complex sound processing.

Stephen Wilson

Yeah. That makes sense. Well, the floor of my daughter's closet is starting to get to my not that young, not very foldable legs. (Laughter) So, this was a lot of fun and I'm really glad we got to talk in some more depth about the stuff that you've been up to recently.

Edward Chang

Thanks, Stephen. Thanks for the opportunity to talk about this synthesis of, you know, a lot of work that has gone on in our lab, and in many others that have inspired our work, over the last 10 years.

It's been kind of a special experience, actually, to work on this particular review that you asked me to talk about, because, you know, I think as we do our work, it's oftentimes focused on individual papers, just incrementally thinking, what's the next interesting step? But until you have to do something like this, it's hard to put it all

together. And it was a fantastic experience working with Ilina on this, and really important for me scientifically to think about this first chapter of the lab and where we are right now.

Stephen Wilson

Yeah, it's a great paper. Yeah, so take care and I hope to catch up with you before too long.

Edward Chang

Thanks Stephen. Can’t wait to see you again in person.

Stephen Wilson

Yeah, I wonder when that will be?

Edward Chang

Hopefully, hopefully soon.

Stephen Wilson

Yeah. All right. Well, see you later.

Edward Chang

Yeah, see you.

Stephen Wilson

Bye. Okay, well, that's it for episode 23. Thank you very much to Eddie for coming on the podcast again. I've linked the papers we talked about in the show notes and on the podcast website at langneurosci.org/podcast. I'd like to thank Neurobiology of Language for supporting transcription, and Marcia Petyt for transcribing this episode. See you next time.
