Efficient Machine Learning Systems for Signal Processing

Jul 16, 2025 · 1 hr 3 min · Season 1, Ep. 5

Summary

Nir Shlezinger and Yonina Eldar discuss model-based deep learning, an approach combining signal processing principles with modern data-driven techniques. They highlight its advantages in designing efficient systems by integrating physical models and mathematical structures, addressing challenges like computational complexity, interpretability, and adaptivity faced by traditional deep learning in signal processing applications. The discussion covers methodologies like deep unfolding and augmenting classic algorithms such as the Kalman filter and Viterbi algorithm.

Episode description

In this episode of the IEEE Signal Processing Society podcast, Nir Shlezinger from Ben-Gurion University and Yonina C. Eldar from the Weizmann Institute of Science discuss the design of machine learning systems that are inherently efficient.

Nir Shlezinger and Yonina C. Eldar

Nir Shlezinger is an Assistant Professor in the School of Electrical and Computer Engineering at Ben-Gurion University of the Negev, Israel. His research spans signal processing, machine learning, and communications. He has been recognized with several prestigious awards, including the IEEE Communications Society Fred W. Ellersick Prize and the 2024 Krill Award.

Yonina C. Eldar is a Professor at the Weizmann Institute of Science, where she heads the Center for Biomedical Engineering and Signal Processing. She is also a member of the Israel Academy of Sciences and Humanities and an IEEE Fellow.

In this episode, Dr. Shlezinger and Dr. Eldar engage in a rich discussion on model-based deep learning, an approach that combines classical signal processing principles with modern data-driven techniques. This framework promotes efficiency not only through computational improvements, but by designing learning algorithms that naturally align with physical models and mathematical structures. They explore the key principles behind this methodology, its practical advantages, and its growing impact across a range of signal processing applications.

Transcript

Welcome and Episode Introduction

Hello everyone, and welcome to Our Digital Life, a podcast series by the IEEE Signal Processing Society. Founded as an IEEE society in 1948, the Signal Processing Society is the world's premier association for signal processing engineers and industry professionals. In this series, we'll have candid discussions about developments in various areas of signal processing, which are the underlying ubiquitous technologies for today's modern society. In this fifth episode, Nir Shlezinger from Ben-Gurion University and Yonina Eldar from the Weizmann Institute of Science discuss the design of machine learning systems that are inherently efficient.

Welcome. Welcome to Our Digital Life. It's a podcast series by the IEEE Signal Processing Society. My name is Nir Shlezinger, and I would first like to thank the Signal Processing Society for giving us this stage and this opportunity. It's a great honor to be beginning this podcast, and an even greater honor to be doing it alongside Yonina Eldar. Yonina, it's a pleasure to have you here.

Thank you so much, Nir. The honor is all mine. And it's great to have this opportunity to discuss a lot of the work going on in this area, and especially a lot of the work, Nir, that you're leading in this area together. So I'm really excited to be here. I'm looking forward to the discussion.

Understanding Efficient Machine Learning

Great. So today's episode is entitled Efficient Machine Learning for Signal Processing. So this is obviously a topic that we've been hearing quite a lot about. You don't have to be a signal processing scientist to be hearing about efficient machine learning. You can just open any public media outlet and you'll be hearing about the need for these kinds of things. What we really need to do is put things in proper context.

Yeah, so that's a really good point, Nir, because there's a lot of discussion today in the machine learning community and deep learning community about how to make these really large architectures easy to deploy. There's a lot of discussion about large language models, LLMs, of course, that we all got accustomed to using through ChatGPT and other commercial tools that are available to us.

But that's actually not what we're going to be talking about today.

Yeah, exactly. So this is not really going to be the topic of our podcast. We're going to look at a very specific form of efficient machine learning, which we humbly believe is of great interest to the listeners of this podcast, specifically to people who are interested in signal processing. And this is a form of what we call model-based deep learning.

So maybe just to put things in a high-level perspective, what we're going to talk about today is not how you can take some big large language model or some big deep learning architecture and try to compress it or train it more efficiently. We're going to take a slightly different paradigm, and we're going to look at how one can design machine learning algorithms, comprising both the architectures and the training data, and do it by imitating or drawing inspiration from how classic algorithms in signal processing operate.

So essentially, this is what we like to call, with relatively non-conventional terminology, model-based deep learning.

Right. So model-based deep learning, the way we're referring to it, is really trying to integrate signal processing, statistical signal processing, optimization theory, the more traditional model-based ideas that we're all familiar with and that many of the algorithms from, let's say, a decade ago used as a basis for their operation, with data-driven techniques.

Nir, just like you said, rather than taking, let's say, an LLM or some architecture and trying to deploy it efficiently, we're going to restructure the networks to begin with, so that already in the architecture design, and in the metric for design, and of course in the training, we're incorporating model-based ideas. So we're incorporating whatever we know about the system, we're incorporating whatever we know about the data, we're incorporating the physics of the problem already into the design. So we're not taking an existing network and then trying to make it efficient, but rather using models, signal processing, and statistical ideas to design architectures that take them into account to begin with. And in that way, they're going to be more efficient to begin with. They'll use training data more efficiently.

And most importantly, in engineering systems, where we have a lot of knowledge, we have a very structured way of incorporating that knowledge, not by post-processing, but by inherently integrating it into the architecture, into the optimization, and into the training techniques. And that's basically what we're referring to in a nutshell when we say model-based deep learning.

Yeah, so that's all, I guess, a very broad, high-level description. We're trying to make this informative and helpful for signal processing researchers and practitioners, so we're going to go a bit further into details.

AI, ML, and Deep Learning Explained

So maybe before we start explaining how one can implement and design model-based learning algorithms, let's start with a bit of motivation for that. So, I mean, first we need to understand why we need this. Why is this relevant? So we know that machine learning today is getting a lot of popularity, and we're somehow interchangeably using the terms AI, which is all over the media these days, deep learning, and machine learning. So maybe we can start by putting things in context and in the correct terminology.

Yeah, that's really a good idea. Thanks, Nir. So I think, and of course, we're going to talk about this very briefly and very abstractly, but AI is really more of the broad concept of how we use machines to try to simulate and emulate human intelligence. So it's a very broad concept. It's not necessarily a specific computational tool, but a concept of using machines to simulate human intelligence. When we think of machine learning, that's really a subset of AI that focuses on how we use data to do learning. And machine learning, again, is a traditional field that's been studied for tens of years. There are many classical algorithms in machine learning. SVM is probably one of the popular ones, where we use data in order to, for example, perform classification or many other types of tasks.

So that's what we refer to as machine learning. Deep learning, at least the way we're viewing it, is a more specific computational tool where we're using highly parameterized neural networks in order to compute the tasks that we want to compute. So we could think of deep learning as a specific architecture or concept within the broader field of machine learning, which sits within the broader field of AI, where deep learning, at least if we look at it in the past decade, is really referring to the specific type of architectures that we're used to using today. And we're using a lot of training data and a lot of different computational tools in order to train them in a very structured, parameterized form.

So I guess you said it's from the last decade. We like to think of deep learning as if it's something very new, but this whole idea of using neural networks and these forms of brain-inspired computing to design algorithms that learn their mapping from data is actually not that new. It dates back, I think, about 80 years. One can possibly trace the origin to the McCulloch-Pitts model for the artificial neuron. So, Pitts was a logician, McCulloch was a... and they came up together with a very simplified mathematical model for the most basic unit in the human brain, which is the neuron. And based on that, it started triggering some interest in designing software or systems that imitate the operation of the human brain, based on how one mathematically models the operation of the neuron. So there was a lot of hype, I guess, in the 1950s. There were works by Frank Rosenblatt, which gained a lot of attention, showing the ability to identify patterns with these kinds of machinery. But if one looks at the systems that were built back in the 1950s, they looked like huge machines with tons of wires. And at some point, I guess it became one of those candidate machine learning architectures, in parallel to other forms of machine learning models. I think the moment where the deep learning comet really hit planet Earth was in 2012, in the ImageNet competition, where the AlexNet architecture, a form of convolutional neural network trained using parallel processing units, really knocked down the stage in this competition. They beat all their competitors and kind of proved that there's something about this: that if you can take a lot of data and you have sufficient computational resources, you can carry out tasks that beforehand were considered very, very difficult. And, you know, we're seeing it all over today. So this notion of deep learning is actually not that new, but most of the revolutions and the really great achievements are from the last decade or so.

Deep Learning Paradigm and SP Challenges

Yeah, that's a really good point. And a lot of that comes from the fact that we have tremendous computational power today that we didn't have before.

We have a lot of training data that we didn't have before. But I think just to emphasize, and this is repeating a little bit, Nir, what you said, but to put it in the context of what we're going to talk about in model-based deep learning: in a sense, the way deep learning is viewed today, the first step is really choosing an architecture. So there's architectures, you mentioned AlexNet, there's ResNet, there's LLMs, there's many different architectures that people have come up with over the years. We know, of course, that there's many different software packages that will do this for you. You don't even have to do it yourself. There's a variety of different architectures that have become popular, and they each have their different trade-offs. They're good for different types of problems, different types of data. And in a sense, today deep learning is really about choosing an architecture, and the architecture could be very complex. Each layer in this architecture is typically very simple, right? We could have a convolutional layer, for example, which is just performing a convolution. We could have skip connections or not. We could have various different types of basic nonlinearities. So each layer is actually very simple, but the way they're structured together, the interconnections between them, the skip connections between them, the number of these layers, this is really defining the overall architecture.

So basically today in deep learning, you choose an architecture. And again, many of these are predefined. And that's really the first point. Then the second point is using training data, and typically huge amounts of training data, in order to train this network, learning the parameters from the data. And this is typically done by empirical risk minimization. So we have some function that we're going to use.

So often if we have a lot of training data, for example, the function could just be a very simple error function that's matching between the output of our network and the true data. We could add different regularizers. If we know something about the problem or about the data, we might want to add that as a regularizer. We might want to regularize the weights in the network. So we use different forms of empirical risk minimization, with different regularizers either on the weights of the network or on the output itself. And that's the training phase. So we use that in order to learn the parameters of this large network. And today, in fact, for many of the architectures that are available, there are actually pre-trained networks. So if, for example, you want to perform some image classification task, there's many pre-trained networks that were trained on huge servers with millions of training points. And you can either use that as it is, or you might want to fine-tune it, add a few layers, or fine-tune to your data. But there's a lot of available architectures that were trained with different empirical risk functions.
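To make the training step described above concrete, here is a minimal sketch of regularized empirical risk minimization in Python with NumPy. Everything in it is an assumption made for illustration and is not from the episode: a linear model stands in for the network, the loss is a squared error, the regularizer is an L2 penalty on the weights, and the names `empirical_risk`, `train`, `lam`, and `step` are hypothetical.

```python
import numpy as np

def empirical_risk(w, X, y, lam):
    """Average squared error over the training set plus an L2 penalty on the weights."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def train(X, y, lam=1e-2, step=0.1, n_steps=500):
    """Minimize the regularized empirical risk with plain gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_steps):
        # Gradient of the data-fit term plus gradient of the weight regularizer.
        grad = (2.0 / n_samples) * X.T @ (X @ w - y) + 2.0 * lam * w
        w = w - step * grad
    return w

# Synthetic training set (assumed): noisy linear measurements of known weights.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(200)

w_hat = train(X, y)
print("learned weights:", w_hat)
print("regularized risk:", empirical_risk(w_hat, X, y, 1e-2))
```

A deep network replaces the linear map with many parameterized layers and the plain gradient step with a stochastic optimizer, but the structure of the objective, data-fit term plus regularizer, is the same.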

And this is really what we refer to today when we're talking about deep learning. So I think that's kind of, you know, the essence of it. And the nice thing is that, of course, there's mathematical justification: one can show, and it also makes intuitive sense, that this type of abstract structure, using a lot of data and these building blocks that you just concatenate, can actually be proven mathematically to be able to approximate any function. So you could take an arbitrary mathematical function. For example, if you have some, you know, complicated f of x, you can show that by adding enough of these layers and training them with enough data, you will actually be able to approximate any abstract function.

And that's kind of the mathematical justification that's used for a lot of these networks. Of course, although mathematically you can make that statement, in practice it's not so easy, right? In practice, we have to use a lot of training data. We're just guaranteed that there is some way of approximating the function, but we're not guaranteed that our specific network is going to approximate that function. So these are some of the issues to take into account. But there is the mathematical foundation that's very important.

Now, I think this touches on a very important point that I would like to highlight. Why, for us as signal processing researchers, not just adopt architectures from deep learning and use them to replace the algorithms that we're comfortable with and currently working with?

On a personal level, I can tell my own personal story. My PhD was in information theory and communication theory. And when I started my postdoc, in early 2017, my postdoc supervisor, you may know her, encouraged me to start looking into deep learning, because she understood where the wind was blowing.

And on my side, I started looking into deep learning for signal processing, and several issues that may not be that dominant in traditional deep learning domains, such as language and image processing, became very obvious when you try to apply it to signal processing and communication tasks.

So first, there's obviously what you, Yonina, talked about: the need to train, and the fact that it's very complex and highly parametrized. And, you know, in signal processing, we often implement our algorithms on devices that are hardware limited. They may be mobile. And that is a very big constraint. Also, in terms of timing and latency, we sometimes have to carry out tasks very, very rapidly.

We're processing data from sensors. We may have to do things on the order of milliseconds. I would also say that the notion of adaptivity is something which is quite critical. We're dealing with a dynamic world, with tasks that change all the time. Now, translating English to Chinese probably doesn't change over time all that often. You know, some people may argue that language evolves, but not on the time scales at which dynamic channels and sensing tasks evolve each time the environment changes. And another thing that comes to mind is the notion of interpretability. We do like to understand the algorithms that we're working with. We like to understand what goes on there under the hood. We like to be able to explain them, to associate them with very concrete mathematical signals and something that has an operational meaning. Now, that's not something that we usually have in deep learning. Deep learning usually provides us with architectures where what we can really assign a concrete meaning to is the input and the output. And what goes on in between is essentially a set of features that are relatively hard to understand and describe. So I do think that these main challenges of complexity, training, adaptivity, and interpretability are to some extent very challenging and limiting factors, and ones that lead us to maybe take a different approach.

Traditional Signal Processing Approaches

Now, in order to understand a different approach, maybe we need to discuss how we used to solve problems in signal processing before we had this new powerful tool in our toolbox called deep learning.

So this is really important in order to understand model-based deep learning. If we think about how traditional signal processing approaches different problems, right, in image, in speech, in radar, in communication, in any other application, what we typically do is model the problem that we have, right? So we have some sort of model for the data that we observe. In a typical signal processing problem, you're given data, and it could be even a single data point. And from that data, you want to extract some parameters, right? So that would be a typical signal processing study. And a very important feature of essentially all or most signal processing and statistical signal processing algorithms is that, to begin with, we have some assumption that relates the data we're observing to the parameters we're interested in. So we have some function that's describing the data as a function of those parameters. And once we have that description, we could set up a relevant metric, which could be based on statistical considerations of the data. It could be based on just the deterministic channel that we know, and then adding noise to it. So we'll get into that later on, but there's many different ways of setting up signal processing problems, and all of them share the fact that we're assuming some known model between the data that we're observing and the parameters that we're interested in, where part of that model, of course, could be statistical, to describe the parameters that we're not sure about deterministically. Once we have this description, we set up a metric, right? It could be some error function, some norm, some metric that will give us the difference between the parameters we're going to estimate and the desired parameters. And then typically that's set up as an optimization problem. And then we can use conventional optimization tools in order to go ahead, optimize that metric, and get our desired parameters. And again, of course, there's many different variations on that theme, but essentially, on a very abstract level, I think most of the large body of work in signal processing and statistical signal processing can be viewed as setting up a relationship between the unknown parameters and the data, setting up a metric that describes how well our outcome, our estimated parameters, are going to either approximate the data or approximate the unknown parameters if we have a statistical model, and then using optimization tools to optimize that metric.

So it relies heavily on a model, because without that model, we don't have a starting point and we don't have a metric. And of course, if we're solving it using optimization tools, there are different optimization tools to solve that problem, to optimize the metric that we define over the model.
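One way to write down the three ingredients just described (a model, a metric, and an optimization problem) is sketched below. The notation is an assumption introduced here for illustration and does not come from the episode: $y$ is the observed data, $\theta$ the unknown parameters, $h$ the assumed forward model, $w$ the noise, and $\mathcal{L}$ the chosen metric. Additive noise and a squared-error metric are just one common instance.

```latex
% Assumed notation, for illustration only.
\begin{align}
  y &= h(\theta) + w
      && \text{(model relating the data to the parameters)} \\
  \mathcal{L}(y, \theta) &= \lVert y - h(\theta) \rVert_2^2
      && \text{(one possible metric)} \\
  \hat{\theta} &= \arg\min_{\theta} \, \mathcal{L}(y, \theta)
      && \text{(optimization problem handed to a solver)}
\end{align}
```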

So there's a model, there's a metric, and then there's the optimization tools to solve them. So actually, I think the optimization tools can be kind of divided into... so maybe we can do this through an example, okay? So let's take a very classic signal processing example: we have a set of sensors.

We're measuring signals from the sensors and we're trying to recover some information from them. For instance, it can be the location of the emitting sources. It can be the signal that was transmitted, or we may even generate some kind of an image, as we do in ultrasound. And then I guess the mathematical relationship between what it is that we're trying to recover and what it is that we measure, and the objective that we define for ourselves, dictates how we're going to solve this.

For instance, if what we're trying to recover can be faithfully modeled as jointly Gaussian with what we measured, then we can do something very simple and get a closed-form expression using, you know, very standard stochastic signal processing techniques. And we may have a more complicated model, but one that would allow us to formulate an objective. Let's say our signal is a very high-dimensional one, but we assume it's sparse in some domain. Then we can formulate an optimization problem, which we know can be guaranteed to solve this problem, and then tackle it using some iterative optimizer: proximal gradient descent, or the iterative soft-thresholding algorithm, everyone's favorite for sparse recovery tasks. And there's also the family of algorithms which are just plain heuristics. Now, heuristic algorithms usually do stem, for instance, from the statistical model describing the data. For instance, in array processing, some of the most widely used subspace-based methods are essentially based on heuristics. They don't stem from formulating a concrete optimization problem and then deriving a solution based on that. They follow from some understanding. We understand that if we're trying to localize sources, then it makes sense that the steering vectors would be orthogonal to the noise subspace, which is the basis for many algorithms in array signal processing. So with that in mind, maybe then comes the question: why replace them? Why augment them? Or why even use machine learning in signal processing?
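The orthogonality idea mentioned above, that steering vectors of the true sources are orthogonal to the noise subspace, can be sketched in a few lines of Python with NumPy. The episode does not name a specific method; the sketch below uses a MUSIC-style pseudo-spectrum as one well-known subspace-based example. The half-wavelength uniform linear array, the number of sensors and snapshots, the noise level, and the function names `steering_vector` and `music_spectrum` are all assumptions made for illustration.

```python
import numpy as np

def steering_vector(theta_rad, n_sensors):
    """Steering vector of a uniform linear array with half-wavelength spacing (assumed geometry)."""
    k = np.arange(n_sensors)
    return np.exp(-1j * np.pi * k * np.sin(theta_rad))

def music_spectrum(snapshots, n_sources, grid_rad):
    """Pseudo-spectrum that peaks where steering vectors are nearly orthogonal to the noise subspace."""
    n_sensors, n_snapshots = snapshots.shape
    R = snapshots @ snapshots.conj().T / n_snapshots      # sample covariance
    _, eigvecs = np.linalg.eigh(R)                        # eigenvalues in ascending order
    noise_subspace = eigvecs[:, : n_sensors - n_sources]  # eigenvectors of the smallest eigenvalues
    spectrum = np.empty(len(grid_rad))
    for i, theta in enumerate(grid_rad):
        a = steering_vector(theta, n_sensors)
        spectrum[i] = 1.0 / np.linalg.norm(noise_subspace.conj().T @ a) ** 2
    return spectrum

# Simulated scene (assumed): two uncorrelated sources, 8 sensors, 200 snapshots.
rng = np.random.default_rng(0)
n_sensors, n_snapshots = 8, 200
true_deg = np.array([-20.0, 30.0])
A = np.stack([steering_vector(np.deg2rad(t), n_sensors) for t in true_deg], axis=1)
S = (rng.standard_normal((2, n_snapshots)) + 1j * rng.standard_normal((2, n_snapshots))) / np.sqrt(2)
N = 0.1 * (rng.standard_normal((n_sensors, n_snapshots))
           + 1j * rng.standard_normal((n_sensors, n_snapshots))) / np.sqrt(2)
Y = A @ S + N

grid_deg = np.arange(-90.0, 90.5, 0.5)
P = music_spectrum(Y, n_sources=2, grid_rad=np.deg2rad(grid_deg))

# Pick the two largest local maxima of the pseudo-spectrum as direction estimates.
is_peak = (P[1:-1] > P[:-2]) & (P[1:-1] > P[2:])
peak_idx = np.where(is_peak)[0] + 1
top_two = peak_idx[np.argsort(P[peak_idx])[-2:]]
est_deg = np.sort(grid_deg[top_two])
print("estimated directions (degrees):", est_deg)
```

Note that nothing here minimizes an explicit objective: the estimator follows from the structure of the covariance under the assumed array model, which is the sense in which such methods are described as heuristics.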

So that's a really good point. And I think, you know, we could first zoom in to the different approaches that you already mentioned, and maybe on a high level, describe maybe some of their limitations.

But there's also a more fundamental issue that I'll get to at the end. So, you know, roughly speaking, you mentioned a few approaches. There's some problems where we could really simplify the assumptions and get, you know, very simple closed form solutions. So those are great. They're easy to implement.

But of course, they rely on very simplifying assumptions. And when those assumptions are correct, definitely, I think even today, we should use them, right? If there's a simple solution and it works well, then, you know, at least in my opinion, there is no reason to go to deep learning.
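As an illustration of the simple closed-form case, here is a minimal sketch of the linear estimator that results when the unknown and the measurements are modeled as jointly Gaussian. The linear model `y = A x + w`, the zero-mean prior with covariance `C_x`, the white noise with variance `noise_var`, and the function name `lmmse_estimate` are assumptions made for this example, not details given in the episode.

```python
import numpy as np

def lmmse_estimate(y, A, C_x, noise_var):
    """Closed-form estimate of x from y = A x + w, with zero-mean Gaussian x and white Gaussian w."""
    C_y = A @ C_x @ A.T + noise_var * np.eye(A.shape[0])  # covariance of the measurements
    return C_x @ A.T @ np.linalg.solve(C_y, y)            # C_x A^T C_y^{-1} y

# Monte Carlo check on synthetic data that matches the assumed model.
rng = np.random.default_rng(0)
m, n, n_trials = 6, 3, 2000
A = rng.standard_normal((m, n))
C_x = np.eye(n)
noise_var = 0.5
X = rng.standard_normal((n, n_trials))                                 # one unknown per column
Y = A @ X + np.sqrt(noise_var) * rng.standard_normal((m, n_trials))    # noisy measurements

X_hat = lmmse_estimate(Y, A, C_x, noise_var)
mse = np.mean((X_hat - X) ** 2)
prior_power = np.mean(X ** 2)
print("estimation MSE:", mse, "vs. prior power:", prior_power)
```

The estimator is a single matrix expression, which is why it is so easy to implement. Its quality, however, rests entirely on `A`, `C_x`, and `noise_var` matching the data.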

But the issue is that often it doesn't work well, because it relies on very simplifying assumptions that may not fit the data that we actually have. There's, of course, more elaborate methods that you mentioned, using various different optimization techniques. And, you know, those could be very slow. They could take a long time to converge. They may not converge. Very often the problems that we have are not convex, and the methods don't converge or they depend very heavily on the starting point. So they might not be very robust. They might be very sensitive to different choices of parameters. So with very complicated iterative solvers, we have various issues of robustness, dependence on the initial conditions, convergence time, and complexity. All of these could be issues when we're looking at iterative solvers. When we're looking at the various heuristics that are also very popular in different areas, again, those can have the same problems as the iterative solvers, but in addition, they don't guarantee optimality, right? So at least with closed-form solutions and the various solvers, at the end we know, or at least could bound, what we're going to converge to. We could say something about the solution we get to. Whereas in heuristic methods, of course, we lose optimality, which is one of the appealing features of signal processing methods, I think one of the most important.

Why Model-Based Deep Learning Matters

So, you know, there's different drawbacks to different techniques, but at least in my view, one of the biggest motivators for model-based deep learning, and one of the biggest drawbacks of these different signal processing methods, is that often we just don't have a good and accurate model. So signal processing is super powerful for problems where we do have good models. And again, putting aside whether we solve them using one method or another, and there's different trade-offs, at the end of the day, with the computational power we have today, with our knowledge in optimization and in robustifying optimization problems, in my opinion, if we have a relatively good model, then using signal processing and optimization tools is still really a very good way to go about them. The issue is that today we look at problems that are more and more complicated. We want to solve tasks that we never dreamt of solving before. And a lot of these tasks just don't have a good model.

Now, of course, if we have no model and we have no way at all of describing the relationship between our data and the problem we want to solve, then using model-based deep learning is going to be an issue, right? But very often we do have approximate models, right? We could often have a good description of the noise. So we have some model of the noise. We could often have a relatively fair description of the forward model. By forward model, I mean the relationship between the choices of parameters and the data that we're going to generate.

So often we do have descriptions, or we can approximate the forward model and the noise, but not very exactly. And in my opinion, this is really where model-based deep learning could be powerful, because we could get the benefit of both worlds: the benefit of having at least partial or approximate models for both the forward model, the relationship between the parameters and the data, and the noise. But then we can use the power of learning from data, and the power of deep learning in general, in order to improve these methods and get good results even when the models are not exact. And then, of course, an added value is that often this will lead to methods that are computationally much more efficient and much more robust. So not only do we overcome the modeling issue, but we can also overcome a lot of the computational issues at the same time, by combining the power of models with learning from data.

Okay, so maybe just to put that in a proper frame before we go into concrete methodologies, I think we can say that classic signal processing on its own is not something that does not use data at all. It just imposes a statistical model and uses the data to find the best model that creates this data. For instance, we have some measurements from a channel, and we're trying to estimate the noise level. Then we substitute this estimate into the statistical model and into our algorithm, which relies on this model. The deep learning approach, on the other hand, does not impose any statistical model at all, but rather uses data to directly find the best mapping from the input to the output, based on the data that was observed before.
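The contrast between the two ways of using data can be shown on a toy problem. The sketch below is only an illustration under assumed conditions that are not from the episode: a scalar signal `x` with unit variance is observed as `y = x + w`, and the estimator is restricted to a single gain applied to `y`. The model-based route estimates the noise level and plugs it into the formula that the assumed model implies; the data-driven route fits the gain directly from input-output pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
true_noise_var = 0.5          # unknown to both approaches
n = 5000
x = rng.standard_normal(n)    # signal with unit variance (assumed prior)
y = x + np.sqrt(true_noise_var) * rng.standard_normal(n)

# Model-based route: impose the model y = x + w with var(x) = 1, use data only
# to estimate the missing model parameter, then plug it into the known formula.
noise_var_hat = max(np.var(y) - 1.0, 0.0)     # var(y) = 1 + noise variance under the model
gain_model_based = 1.0 / (1.0 + noise_var_hat)

# Data-driven route: no statistical model, fit the input-output mapping directly
# from labeled pairs (y_i, x_i) by least squares over the gain.
gain_data_driven = np.dot(x, y) / np.dot(y, y)

print("model-based gain:", gain_model_based)
print("data-driven gain:", gain_data_driven)
```

When the assumed model is correct, the two gains agree up to estimation error. The model-based route needs no labels but inherits any mismatch in the model, while the data-driven route needs labeled pairs but makes no modeling assumption beyond the form of the mapping.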

Now, in that sense, I think we can relate those two paradigms to the classic approaches of what's called generative learning versus discriminative learning. So generative learning, if you look at the more classical machine learning literature dating back 25 years or so, was more about using data in order to learn the distribution that describes the data in the best way.

So for instance, when we're today using the term generative AI, we're usually talking about how you can learn to generate new samples from your data and implicitly learning the underlying distribution. Discriminating machine learning, on the other hand, It's about directly learning and mapping from your input to the output without having to go through how those inputs were created from the labels. So in that sense, I think a lot of what we're trying to do in model-based machine learning...

is to try to take our classic algorithms from signal processing and use them as some form of an inductive bias, namely converting to machine learning architectures. But instead of using our data to learn the statistical model, We're using our data by viewing our algorithms as discriminating machine learning models. Namely, we're using our data to tune our algorithm to operate the best it can.

Deep Unfolding: Algorithms as Networks

on the given data set. So, so far, I think it's been relatively, you know, hand-waving, I would say. And so model-based deep learning is not just an idea. It's a set of methodologies. It's a relatively broad set of methodologies. And there's actually a very broad spectrum. We have in our papers, we have these diagrams, which we always show on how different approaches vary in their specificity and their abstractness. So I think the one we should start with, it's probably...

the most well-structured model-based learning methodology, which is that of deep unfolding. Now, deep unfolding is a relatively established model-based deep learning methodology. It's, in fact, ancient history in deep learning terms. It dates back to the work of Gregor and LeCun from 2010, which is, you know, pre-AlexNet.

And deep unfolding is a form of model-based deep learning, which, as we discussed, is a lot about converting classical algorithms into machine learning algorithms. Here the idea is that the family of signal processing algorithms which we are focusing on are essentially iterative optimizers. So the overarching idea of deep unfolding is to take an iterative optimizer, fix the number of iterations, and then just...

treat each one of those iterations as a layer of a neural network. So to understand how deep unfolding works, let's go back to some of the things we already mentioned before about how, in general, an optimization-based algorithm works. So typically, when we set up a signal processing problem as an optimization problem, we have an objective function, right? That's, of course, the essence

of what we're going to be looking at. So we have an objective function, and that objective function itself often has parameters. Let's say we have a least squares objective, so, you know, the norm of y minus Ax. And then we add a regularizer on x. So if we want x to be sparse, for example, we'll add an L1 regularizer. Or if we want x to be norm-bounded, we'll add an L2 regularizer.

And when we add that regularizer to the objective function, we typically have a parameter that weights the regularizer with respect to the error metric, right? So we'll have lambda times the L1 norm of x, for example. So the objective function itself will typically have parameters in it, and we have to decide how we're going to fix those parameters in a typical optimization problem. So we have the metric with its parameters, and now we're going to use a solver

to solve that metric. So we could use, you know, gradient descent, proximal gradient, ADMM, and many other choices of iterative algorithms. And those algorithms themselves have hyperparameters associated with them. So even in, let's say, the simplest form of gradient descent, right, we take the previous iterate and move along the (negative) gradient direction, but we typically scale that move

by some weighting factor. Let's call it mu, for example. So that mu is a hyperparameter of the algorithm. So at the end, when we're setting this up as an optimization problem, we have parameters that come with the objective itself, and those are kind of essential to the objective. And then we have hyperparameters that come from the iterative solver that we choose to use. So these, in general, are parameters that we have to choose.

in this process. And now, when we talk about deep unfolding, what we're basically doing is we're using one of these solvers on a problem that we've set up as an optimization problem. So if I have an objective function and I choose my favorite solver, I'll get an iteration of the solver, right? So I have a relationship between the output of the solver and the previous step. And that iteration I could think of as a layer in a network.

So basically, a layer in the network now is actually one of the steps of my iterative solver. Now, clearly, that layer is going to depend on the objective that I chose, and it's also going to depend on the solver that I chose. Okay, so now I have a layer in a network that depends on my objective, where my model comes in. It depends on the solver, where again, there's some modeling approaches there. And now the question is, how do I build a network from that layer?
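As a rough illustration of the idea that one solver iteration is one layer, here is a minimal NumPy sketch. It assumes the plain least-squares objective 0.5 * ||y - Ax||^2 and gradient descent as the solver; the names `gd_layer` and `unfolded_gd`, the problem sizes, and the choice of five layers are hypothetical and not taken from the episode.

```python
import numpy as np

def gd_layer(x, y, A, mu):
    """One gradient-descent step on 0.5 * ||y - A x||^2, viewed as one layer."""
    grad = A.T @ (A @ x - y)
    return x - mu * grad

def unfolded_gd(y, A, step_sizes):
    """Stack len(step_sizes) layers; layer k uses its own step size mu_k."""
    x = np.zeros(A.shape[1])
    for mu in step_sizes:
        x = gd_layer(x, y, A, mu)
    return x

# Toy problem (assumed sizes): 20 measurements of a 5-dimensional x.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
y = A @ x_true

# Textbook-style step size: one over the largest eigenvalue of A^T A.
mu_safe = 1.0 / np.linalg.norm(A, 2) ** 2
x_hat = unfolded_gd(y, A, [mu_safe] * 5)  # a 5-layer "network"
```

The layer depends on the objective (through `A` and `y`) and on the solver (through `mu`), which is the dependence described above.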

Deep Unfolding: Learning Hyperparameters

We could do that in several different ways, and I'll turn it over to Nir to explain in more detail. Yeah, thanks. So I think deep unfolding based on this rationale is a relatively powerful tool. I think it's not fully understood and fully appreciated how diverse it is and how many different levels of abstraction it provides. And I think based on the description that, Yonina, you gave just a few minutes ago, we can actually pinpoint several different ways to do deep unfolding.

So the first one, I guess the simplest one, you can call it shallow unfolding or hacking the algorithm or whatever, but it's essentially about taking the hyperparameters of the optimization algorithm. Now, those hyperparameters, those step sizes, if we go into optimization textbooks, they usually give some conditions such that if you set it to be, I don't know,

smaller than one over the maximal eigenvalue or something like that, then the algorithm converges, and this is how we should tune it. But in practice, tuning these hyperparameters has a very dramatic effect when you're operating with a fixed number of iterations.

So if all you need is, you know, guaranteed asymptotic convergence, the actual value may not matter that dramatically. But if you're already saying, OK, I have a problem and I want to solve it with only five iterations, then perhaps if I have some tool that allows me to properly tune those hyperparameters, I can get a notable improvement. I may be able, with five iterations, to imitate the performance of the algorithm that required, I don't know, 1,000 or 10,000 iterations to run.
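To make the "tune the step sizes for a fixed number of iterations" idea concrete, here is a small sketch under stated assumptions. It uses a toy least-squares problem and, to stay within NumPy, a derivative-free coordinate search over per-iteration step sizes; that search is a stand-in for the gradient-based training discussed in the episode, not the speakers' method. All names and sizes are hypothetical.

```python
import numpy as np

def unfolded_gd(y, A, step_sizes):
    """Run len(step_sizes) gradient steps on 0.5 * ||y - A x||^2 from x = 0."""
    x = np.zeros(A.shape[1])
    for mu in step_sizes:
        x = x - mu * (A.T @ (A @ x - y))
    return x

def dataset_loss(step_sizes, A, Y, X):
    """Mean squared estimation error of the truncated solver over a data set."""
    errors = [np.sum((unfolded_gd(y, A, step_sizes) - x) ** 2)
              for y, x in zip(Y, X)]
    return float(np.mean(errors))

# Toy training set (assumed): 50 pairs (y_i, x_i) generated by y = A x + noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
X = rng.standard_normal((50, 5))
Y = X @ A.T + 0.01 * rng.standard_normal((50, 20))

K = 5  # fixed iteration budget
mu_textbook = 1.0 / np.linalg.norm(A, 2) ** 2
step_sizes = [mu_textbook] * K
loss_textbook = dataset_loss(step_sizes, A, Y, X)

# Coordinate search: change one step size at a time, keep it only if the
# data-set loss decreases. The loss can therefore never get worse.
candidates = mu_textbook * np.linspace(0.5, 4.0, 15)
best = loss_textbook
for _ in range(3):                 # a few sweeps over the K step sizes
    for k in range(K):
        for c in candidates:
            trial = step_sizes[:k] + [c] + step_sizes[k + 1:]
            loss = dataset_loss(trial, A, Y, X)
            if loss < best:
                best, step_sizes = loss, trial
loss_tuned = best
```

Comparing `loss_tuned` with `loss_textbook` shows how much a five-iteration budget gains from data-driven step sizes on this toy problem; the numbers depend entirely on the assumed setup.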

So the most basic approach, I would say, is just to take what we have already tuned by hand, like using some graduate student descent algorithm, and now tune it with data using a stochastic gradient descent algorithm. So maybe we can do this with a concrete example. This is actually something that we've been working on quite extensively. For instance, one of the problems we've been looking into was the problem of hybrid beamforming,

which is a problem where you have an antenna array and some parameters of it that you want to tune. Specifically, it's one of those arrays where you have some interface between the analog part and the digital part, but these are technical details. At the end of the day,

it's an optimization problem, which dictates how you're going to adapt your antenna architecture to the given environment. You can actually formulate it in closed form and try to solve it with some algorithm, let's say projected gradient descent or whatever. But still, when you're trying to solve this problem, you come to the conclusion that it has to be...

re-solved over and over again, each time that the environment changes, which can be on the order of milliseconds. So you don't only want to be able to solve this problem, you want to do it very rapidly so that your solution would mean something in the real world.

You have to do it within, let's say, the order of the coherence duration of a wireless channel. So in that sense, being able to limit the number of iterations is a very, very powerful tool. So here, for instance, you can just fix the number of iterations in advance, treat your hyperparameters as trainable parameters, and then use past data in order to find the hyperparameters which best suit this data. Now, there's no notion of convergence here, okay? Everything is...

Nothing is asymptotic here. It's a very predefined and fixed number of iterations, but this allows you to dramatically reduce the overall number of iterations and dramatically reduce latency. Now, people may say, okay, that's kind of a hack. I would say that there's more than just this low latency gain that you get based on this approach. For instance, one of the things we've also recently shown...

is that even if each iteration on its own is complicated, you can still gain. Take the hybrid beamforming example I just discussed: some of these computations involve matrix inversions. So each iteration on its own can be something which is complicated. Okay.

So you can approximate these computations, and then use the fact that what you learn with may not necessarily be what you actually compute: you take gradients through the algorithm to find the hyperparameters for which the approximated algorithm best suits your data.

And that's pretty cool, because if you got, like, two orders of magnitude reduction in latency before, you can now get even three orders of magnitude reduction in latency. And there are also additional gains if you compare it to, you know, black-box architectures, which you usually have to train for a fixed input size. Then, if the input size changes, you need to do some tricks in order to cope with that, or even train a new neural network. Here, it's an algorithm.

You completely preserve the entire algorithm that you started with. You know it. You trust it. You understand what it does. And it's invariant to the dimensionality. So we can learn once and then apply it for hybrid beamforming with an array of one size and for hybrid beamforming with an array of another size. Now, reducing iterations is not just about latency. We also use optimization a lot in distributed settings, where each iteration is actually an exchange of messages between agents. So reducing...

iterations actually boils down to reducing communications and saving power, which is something you cannot just cope with by saying, okay, I'll just buy more sophisticated hardware and be done with it. So I think there are several gains that you can get from just this very, very simple form of unfolding, even with this, you know, super simple approach of replacing whatever tweaking you did before by checking in simulations

with an automatic solver that does it for you from data. But that's not the only way of doing unfolding, right?

Deep Unfolding: Adaptive Metric Learning

Yeah, so the cool thing is that beyond just learning hyperparameters, you could actually use unfolding to, in a sense, get an adaptive metric. So one of the issues that we talk about, which is really, really key...

is that we often don't have an exact metric, and that comes from the fact that we don't know the exact regularizer that we want, and we don't know the exact forward model. Now, the really cool thing about deep unfolding is that we could, in a sense, adapt the metric by learning from data. So how do we do that? Well, we spoke before about how we start with a metric and we get a layer in the network. Let's just be concrete and take the really well-known example of LISTA,

which is the learned iterative soft-thresholding algorithm. This is really the original algorithm by Gregor and LeCun, which looks at ISTA, the iterative soft-thresholding algorithm, a very, very popular algorithm by Ingrid Daubechies from, you know, over 20 years ago. What ISTA basically solves is a linear least squares model. OK, so it approximates, you know, y equals Ax, assuming that x is sparse.

It's solving a least squares metric, the squared norm of the error between y and Ax, with an L1 regularizer added on x. And this is a very popular method. If you use proximal gradient to solve that objective, you end up getting an algorithm which, at each iteration, moves in the direction of the gradient, which involves knowing the forward model A.

And then it applies a soft-threshold operation, which comes from the proximal gradient over the L1 norm. Okay, so each iteration looks like a set of linear blocks that are implementing the gradient of the squared error of y minus Ax, of course with an appropriate step size. And then we have soft thresholding coming from the proximal gradient that's taking the L1 norm into account. And this is a super popular algorithm used, you know, in many, many different applications.
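The ISTA iteration just described (a gradient step on the squared error followed by soft thresholding) can be sketched in a few lines of NumPy. This assumes the objective 0.5 * ||y - Ax||^2 + lam * ||x||_1 and a fixed step size of one over the largest eigenvalue of A^T A; the toy dimensions and the value of `lam` are arbitrary choices for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink every entry toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(y, A, lam, n_iter):
    """Proximal gradient for 0.5 * ||y - A x||^2 + lam * ||x||_1."""
    mu = 1.0 / np.linalg.norm(A, 2) ** 2          # step size
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        gradient_step = x - mu * (A.T @ (A @ x - y))   # linear block
        x = soft_threshold(gradient_step, mu * lam)    # nonlinearity
    return x

# Toy sparse recovery problem (assumed sizes and sparsity).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
y = A @ x_true
lam = 0.1
x_hat = ista(y, A, lam, n_iter=200)
```

Each pass through the loop is one linear block followed by one soft threshold, which is the structure that gets unfolded into layers.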

Of course, this works well if indeed we have a simple linear model and an L1 regularizer, or sparsity assumption, on x. Now, when we do unfolding, instead of just learning the hyperparameters, we can also learn the soft-threshold parameter. Okay, so that basically means that instead of having a fixed lambda times the L1 norm of x, we're actually learning lambda in each layer.

But we could go beyond that, and we can actually learn the linear blocks. So rather than keeping the linear blocks fixed, which assumes that we know the gradient in each iteration, we learn them, and of course we learn them differently in each layer. What we're essentially doing is learning a hierarchical decomposition of the simple objective we started with, right? So we started with a simple linear objective and an L1 regularizer.

But if I do 10 layers where I'm learning lambda and I'm learning the linear blocks of the gradient, then if I try to concatenate it back together, it is no longer a simple gradient of a linear function. And it's no longer a simple proximal gradient of an L1 regularizer. So essentially, it's as if I started from my original problem, but learned the metric adaptively from data as I'm going through the optimization.

And this is really super powerful because it means that you could start from a simple linear approximation and a simple sparsity assumption on the data. But then as you go through the method, it's able to adapt to the more complex

and nonlinear features of the data. And that's very, very hard to do by hand. So taking the data in advance and trying to learn a model to approximate it would be very difficult. But here, just starting from a simple linear objective and a simple L1 regularizer, I have the power to build it up hierarchically so that essentially I'm using a much more complicated nonlinear function to describe the data, but in an adaptive, controlled form that I understand.

And the other beautiful thing is that if I was just using ISTA, typically I would need, you know, 10,000 or 20,000 iterations. But when we use unfolding, we typically look at a very small, finite number of iterations.

In most of the applications we looked at, for example, we never used more than 10 iterations. So 10 iterations of the simple building blocks, where we have, you know, simple linear blocks that are approximating the gradient and simple soft thresholds approximating the proximal gradient. This simple form, 10 layers like this, gives it tremendous adaptability to the actual data that we have, and does it in a very computationally efficient way. That's a very nice way to put it, I think.

I would also say there's one more gain of this form of unfolding. Because, I mean, you can look at what you end up getting by parameterizing both the objective and the hyperparameters. It will start looking like a set of...

something that looks more like a neural network. Like if you look at LISTA, it's affine transformations followed by an activation. It's a very specific activation, the soft-thresholding one, but at the end of the day it really looks like a neural network. So one can say: why not just use a neural network with 10 layers instead of 10 iterations of LISTA? There's one key point here, which is the fact that you actually understand what your initialization should be.

You know that if you configure your parameters in a specific manner, what you end up with is the original solver. So you can already start with something that you trust and then use data in order to only improve it, which I think is a very considerable gain, particularly for those of us who have experienced the agony of training deep neural networks and, you know, observing those training curves, keeping your fingers crossed and wishing that they go down.
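The initialization point can be checked numerically. The sketch below uses a common LISTA-style parameterization, x <- soft_threshold(W y + S x, theta) with separate (W, S, theta) per layer; that specific form, the helper names, and the toy sizes are assumptions for illustration. Setting W = mu * A^T, S = I - mu * A^T A and theta = mu * lam in every layer reproduces ISTA exactly, so training would start from the trusted solver.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(y, A, lam, n_iter):
    """Reference solver: plain ISTA with step size 1 / ||A||_2^2."""
    mu = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - mu * (A.T @ (A @ x - y)), mu * lam)
    return x

def lista_forward(y, layers):
    """Unfolded network: each layer has its own (W, S, theta)."""
    x = np.zeros(layers[0][1].shape[0])
    for W, S, theta in layers:
        x = soft_threshold(W @ y + S @ x, theta)   # affine map + activation
    return x

def init_from_ista(A, lam, n_layers):
    """Parameter values for which the network coincides with ISTA."""
    mu = 1.0 / np.linalg.norm(A, 2) ** 2
    W = mu * A.T
    S = np.eye(A.shape[1]) - mu * (A.T @ A)
    return [(W.copy(), S.copy(), mu * lam) for _ in range(n_layers)]

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60)
x_true[[5, 20]] = [1.0, -1.0]
y = A @ x_true

layers = init_from_ista(A, lam=0.1, n_layers=10)
x_net = lista_forward(y, layers)       # untrained 10-layer network
x_ista = ista(y, A, lam=0.1, n_iter=10)
```

From this starting point, the per-layer `W`, `S`, and `theta` would be adjusted from data; the training loop itself is omitted here.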

Here you have something which allows you to start from a very principled point and make your training much easier. And we will talk a bit more about why training...

Deep Unfolding: Greedy Approaches

unfolded architectures is much easier. So I guess so far we discussed one way, which just learns hyperparameters, and another way, which kind of learns a surrogate objective per iteration. Is there any other form of deep unfolding worth discussing, you think? So I think a really important version of deep unfolding, which is very important in signal processing and communication algorithms,

is looking at algorithms that are based on different greedy approaches. So we have many methods in signal processing that are not necessarily directly solving an optimization problem, but rather they're addressing a problem that has some concrete objective in a greedy form. So if we go back to LISTA, which is an algorithm for solving sparsity problems, another popular approach to solve sparsity problems is, of course, matching pursuit or orthogonal matching pursuit.

Where in each iteration, we kind of look for the strongest contribution to the data. And then we look for the influence of that particular element in X, remove it and continue. We have a very, very similar approach when we look at communication problems. For example, interference cancellation or soft interference cancellation, which works in a very, very similar fashion when we have a multi-user channel.

We basically want to find the contribution of the different users, which is hard to do simultaneously, right? Solving the optimization problem all at once, knowing when each user transmitted, when we have interference between the different users.

Instead, what we could do is look for the strongest user, right? If it's much stronger than everyone else, for example, we can understand this intuitively. We can find the strongest user, find their contribution, remove that contribution from the data, and then go back to the next user. So of course, there's principled ways of doing this in a more sophisticated fashion that also takes into account, you know, the different probabilities of the data, maybe the different...

power or assumptions of power on the different users, but essentially follows the same principle of a greedy approach or successive interference cancellation, right? We find the strongest user, which is the interference, we cancel it and continue.
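The greedy detect-and-subtract loop described above can be written down for a toy multi-user channel. This sketch assumes BPSK symbols (plus or minus one), known per-user signature vectors stored as the columns of `H`, and ordering users by signature norm; these modelling choices and all names are assumptions made for illustration.

```python
import numpy as np

def sic_detect(y, H):
    """Successive interference cancellation for BPSK symbols.

    H has one column per user (that user's known channel signature).
    """
    residual = y.astype(float).copy()
    s_hat = np.zeros(H.shape[1])
    order = np.argsort(-np.linalg.norm(H, axis=0))   # strongest user first
    for k in order:
        # Detect user k from what is left of the received signal.
        s_hat[k] = 1.0 if H[:, k] @ residual >= 0 else -1.0
        # Remove that user's contribution before moving to the next one.
        residual -= H[:, k] * s_hat[k]
    return s_hat

# Toy channel (assumed): 3 users with clearly different powers.
rng = np.random.default_rng(0)
H = rng.standard_normal((64, 3)) * np.array([4.0, 2.0, 1.0])
s = np.array([1.0, -1.0, 1.0])
y = H @ s + 0.1 * rng.standard_normal(64)
s_hat = sic_detect(y, H)
```

Each pass of the loop is one greedy stage, and it is the quantities used inside a stage (here the signature `H[:, k]` and the hard decision rule) that an unfolded version would free up and learn from data.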

So there are a lot of methods like this in signal processing, optimization, and communication, and these are really, really nice to use with unfolding, right? Because we could look at each such iteration, when we're looking at a particular user and its contribution.

That, in a sense, is a layer in the network. And now we can learn the parameters of that layer, right, of that greedy iteration from the training data. So here we're not directly solving an optimization problem, but rather we're looking... at different greedy methods and looking at their different parameters and freeing them, learning them from data. So rather than, again, assuming that we know the distributions, the prior probabilities, the power, or even the channel.

of each user, these are all parameters that we're going to learn from data. But here we could start with known solvers. Again, these are typically greedy approaches. And we just free the different parameters in each of these steps and learn them from data. So it's conceptually similar to freeing the parameters in each iteration of an optimization solver. But here the iterations are actually coming from the greedy steps in our algorithm.

You can even go further. You can actually replace the iterations with relatively compact neural networks. So we just have some kind of an interconnection of small neural networks, which imitates the operation of the greedy algorithm that we started with.

So now one can ask the question, okay, now you just have a set of neural networks. You actually have new ones. You cannot even initialize it to be the original algorithm you started with. So why would you do that? So actually, I think we can claim quite confidently that the fact that what you obtain is a modular architecture, where each one of those compact building blocks has a very concrete operational meaning, actually brings forth several key gains. First...

Again, think about this. It's a deep neural network where you actually understand what the internal features are. They should be, for instance, in soft interference cancellation, gradually improved estimates of what it is that you're trying to estimate. So that means you can train it very efficiently. Because when you train a neural network, you usually only observe its input and its output. Here you can actually penalize your learning based on the intermediate features.

And that makes learning much more powerful, more data efficient, and much less prone to vanishing or exploding gradients, which is something that often happens when you just give a loss based on the output and the input. It is also something which is very useful for adaptivity.

For instance, let's again consider a communication setup. You're trying to solve an interference cancellation problem, and you're doing it by parameterizing interference cancellation. Now, the channel changes. You need to change something. Are you going to retrain the whole thing? No, now we can specifically monitor each building block using tools, I don't know, for concept drift or something like that. And you don't have to adapt everything. You just have to adapt a very specific part.

And also, some learning algorithms become much more efficient when you only have a limited set of parameters. So instead of having a learning algorithm applied to all the parameters jointly, you can have the learning algorithm applied multiple times to small sets of parameters, and this can really contribute to reducing computational complexity. And this is why I think these kinds of architectures are very attractive

for settings where you do want abstractness, like with neural networks, which goes way beyond what you get from classic algorithms, but you also want some level of adaptivity and interpretability.

Augmenting Algorithms: Kalman Filter

Now, I would say that we talked mostly about deep unfolding because I guess deep unfolding is the most structured form of model-based deep learning. It is obviously not the only one. The thing about deep unfolding is that it's very structured. You start from an optimization algorithm, you fix the number of iterations, and then you get some kind of a recipe, like those three recipes that we just talked about.

And I would say that you don't have to go this way. You don't have to look at iterative algorithms. Effectively, for any kind of signal processing algorithm which relies on a statistical model, if you can identify the part of it which relies on the unknown or the complex part of the statistical characterization,

it makes a lot of sense to augment that with deep learning tools. Now, that's not necessarily a structure that's very specific to the algorithm, but it is a very powerful approach. And maybe I'll say something about that, Nir. So I think, on a very high conceptual level, what deep unfolding basically does is it ends up solving the problem using a deep network. But the specific network is coming from the algorithm that you started with.

And it's typically not going to be so deep, right? We're going to use a small number of layers. In the second approach that Nir referred to now, where you start with an algorithm and you augment it, end-to-end we're actually not using a deep network, right? End-to-end, we're using an algorithm,

your favorite algorithm that you used before. Just within that algorithm, anytime there's a part that depends on unknowns, that particular block is augmented with a network. So there's a network embedded within the algorithm, but end to end, you're using an algorithm. In the first approach, end to end, you're using a network, but each layer in the network is basically implementing the solver that comes from the knowledge that you have.

Okay, so the example that we're going to discuss now is about the Kalman filter. The Kalman filter is the minimum mean squared error estimator for tracking a dynamic system based on the description of the system as a state-space model. Now, state-space models are very widely established models for dynamic systems. They're modeling the change of a target over...

or a state that you're estimating in time as a first-order Markov process, and the relation to the observation as some kind of a hidden Markov model. Now, this algorithm is probably implemented in everyone's cell phone at least seven times, probably even more. And it relies on the ability to describe these dynamics in a very specific form. It assumes that the state that we're tracking evolves linearly and that everything we don't know can be modeled as Gaussian noise.

Now, this is a very established model. It comes from first-order physics, for instance. Like, I'm tracking a moving object. It makes a lot of sense that its next position would be pretty much the previous position plus the sampling interval times the velocity. That makes a lot of sense. But that's an approximation. And everything we don't know, we have to fit in and describe as if it's Gaussian noise. Now, the question is...

Okay, so is there any way we can implement the Kalman filter without relying on our ability to characterize the stochasticity, without trying to tame it as something which is Gaussian? Now, in order to do that, you can kind of try to open...

the Kalman filter algorithm and identify the part which depends on this stochasticity. And that's specifically the computation of what's called the Kalman gain, which is the part which depends on our ability to propagate the second-order moments of the prediction and of the noise signals. And this part is what actually requires you to be able to fully characterize your state-space model and to have a closed-form description for the stochasticity, preferably a Gaussian one.
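To show where the Kalman gain sits inside the filter, here is a minimal sketch of one predict/update step with the gain computation isolated in its own function. The constant-velocity model follows the "previous position plus sampling interval times velocity" description above; the noise covariances `Q` and `R`, the sampling interval, the observations, and the `gain_fn` hook are all hypothetical choices for illustration.

```python
import numpy as np

def model_based_gain(P_pred, H, R):
    """Standard Kalman gain; needs the second-order moments P_pred and R."""
    S = H @ P_pred @ H.T + R                 # innovation covariance
    return P_pred @ H.T @ np.linalg.inv(S)

def kalman_step(x, P, y, F, H, Q, R, gain_fn=model_based_gain):
    """One predict/update step; only gain_fn depends on noise statistics."""
    x_pred = F @ x                           # state prediction
    P_pred = F @ P @ F.T + Q                 # covariance prediction
    K = gain_fn(P_pred, H, R)
    x_new = x_pred + K @ (y - H @ x_pred)    # correct with the innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity model: state = [position, velocity], position observed.
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 1e-3 * np.eye(2)        # assumed process-noise covariance
R = np.array([[0.25]])      # assumed measurement-noise covariance

x, P = np.zeros(2), np.eye(2)
for y_t in [0.11, 0.19, 0.32]:               # made-up position readings
    x, P = kalman_step(x, P, np.array([y_t]), F, H, Q, R)
```

Everything outside `gain_fn` uses only the state-evolution and observation matrices, so the gain function is the one block that needs the noise statistics. A different gain function would generally take different inputs than the three shown here.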

So one of the things you can do, instead of saying, okay, my Kalman filter doesn't work anymore because my data doesn't come from a linear Gaussian state-space model, is to only augment the computation of the Kalman gain with a trainable machine learning model, let's say a recurrent neural network, and then train the whole thing end-to-end. Now, when you're training the whole thing end-to-end, you don't have a ground-truth Kalman gain. You're training it as a Kalman filter,

based on its ability to track the latent state. This is one way of embedding deep learning into the Kalman filter. There are several other approaches to do that. We actually...

We're supposed to have a magazine paper just about that, which will soon be published in the Signal Processing Magazine. But I do think that specifically this approach that I described now is a very classic approach of how you can take an algorithm that you trust and you feel very comfortable with, and how we can augment it.

with deep learning to provide it with new capabilities, which gives a lot of added value. It's not just performance that you get, and the ability to cope with approximated models. You get relatively low complexity. You can actually implement this on very limited computational resources. You can even get improved latency because, for instance, the extended Kalman filter and similar algorithms require a matrix inversion at each time instance.

Instead of computing the Kalman gain with a matrix inversion, computing it with some recurrent neural network can sometimes be even more computationally efficient and faster to do. And there are also various other gains from having a Kalman filter that is trainable. Now, you often employ a Kalman filter, you know, with some feature extraction. Let's say you're tracking something from an image, from a video stream. So now you can train your Kalman filter, not based on your...

state-space model, but such that it best matches what you're extracting from your data. So you can jointly train the Kalman filter alongside the feature extractor. You sometimes use a Kalman filter with downstream tasks, which is very often the case, right? You're tracking something and then you do something on top of that. You're tracking a stock and you want to make, you know...

predictions based on that. You're doing a lot of these kinds of things, which are very often carried out with a Kalman filter. And then having a Kalman filter which is trainable gives you a lot of added value.

ViterbiNet and Future Outlook

Now, this is one example. There are several other examples, obviously. So just very briefly, one example, and then we'll kind of wrap up and try to summarize what hopefully we've all learned here today together.

Another very brief example is what we refer to as the ViterbiNet, where we take the Viterbi algorithm, and, you know, we all have Viterbi algorithm implemented in our phones, so we know that it's a very, very powerful algorithm for detecting symbols that were sent over a communication channel.

So, you know, I send symbols to a receiver and the receiver gets the data and has to decode what the symbols were that were sent to them. And, you know, one of the most popular methods of doing it that's implemented, again, in all of our phones.

is the well-known Viterbi decoder, which basically is a maximum likelihood solver or a heuristic to approximate the maximum likelihood solver. And it assumes a Markovian channel, okay? So that's kind of the fundamental assumption. But of course, it also assumes that we know the channel. So what the Viterbi algorithm basically does is, when each symbol arrives, it looks at the likelihood of the guess of that symbol given the past. And then, once it has the likelihood, it feeds it

into a dynamic programming approach, right, which is basically the trellis algorithm, which uses these likelihoods to compute the most likely path, which gives us the symbols that were actually transmitted. So that's kind of a very abstract description of the Viterbi solver. Now, the nice thing about the Viterbi solver is that you could really think of this as two blocks. There's the likelihood block,

which is computing the likelihood of each symbol given the past. And that, of course, assumes knowledge of the channel, right? Without knowing the channel, we can't compute the likelihood. But then there's the dynamic programming part, which is just taking these likelihoods and basically running a dynamic program over them. Now, that part, that trellis computation part, actually, there's no need to replace that with a deep network, right? Like that part...

makes a lot of sense, right? We have these different likelihoods. We're tracing them back to the different symbols. And we want to know which path makes the most sense, which path has the highest score, which will give us the symbols. Now, that description makes a lot of sense, right? Why replace that with a network? Why try to approximate that with some ResNet or CNN, right? The dynamic programming part is great. The problem is that we don't have the likelihoods.

So rather than taking these two blocks and saying, okay, we don't have the channel, so we're going to take these two blocks and replace both of them by this massive network, right? And train it with a lot of data, etc. What we're saying is, why?

The trellis makes a lot of sense. Keep it. It's cool. It makes sense. It's efficient. It is actually the optimal or close to the optimal thing to do under different assumptions. And it works really well. And it's implemented in all of our phones. So we know how to do that efficiently. The issue is the likelihoods.

So just learn the likelihoods. And learning likelihoods is something very simple to do, right? I mean, it's a simple classification problem. So this leads to ViterbiNet, where we say, rather than replacing all of that end-to-end by some crazy deep network, keep the dynamic programming approach. And just to feed in the likelihoods, use a very simple two- or three-layer network that's basically a classifier and replaces the likelihood calculation.
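The two-block view can be made explicit in code: the trellis search only consumes a table of per-symbol log-likelihoods and does not care where that table came from. The sketch below is a generic Viterbi search in NumPy, not the ViterbiNet implementation; the function name, the two-state toy example, and the uniform transition scores are assumptions. The `log_lik` table could be filled from a known channel model or from the output of a small learned classifier.

```python
import numpy as np

def viterbi(log_lik, log_trans, log_prior):
    """Most likely state path given per-time-step log-likelihoods.

    log_lik[t, s]   : log-likelihood of observation t under state s
    log_trans[i, j] : log-probability of moving from state i to state j
    log_prior[s]    : log-probability of starting in state s
    """
    T, S = log_lik.shape
    score = log_prior + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans         # cand[i, j]: arrive at j from i
        back[t] = np.argmax(cand, axis=0)         # best predecessor of each j
        score = cand[back[t], np.arange(S)] + log_lik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):                 # trace the trellis back
        path[t - 1] = back[t, path[t]]
    return path

# Toy example: 2 states, 4 observations, uniform transitions and prior.
log_lik = np.log(np.array([[0.9, 0.1],
                           [0.2, 0.8],
                           [0.3, 0.7],
                           [0.6, 0.4]]))
log_trans = np.log(np.full((2, 2), 0.5))
log_prior = np.log(np.array([0.5, 0.5]))
path = viterbi(log_lik, log_trans, log_prior)
```

Swapping the source of `log_lik` leaves the dynamic programming code untouched, which is the modularity described above.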

So that's another example where by using this approach, using the signal processing and communication models, knowledge, algorithms, we're not replacing all of it by a network. We're just replacing the particular block that we need by a very simple network that has a very simple task.

And therefore, we don't need something very complicated. We don't need a lot of training data. And the beauty that we get is that, first of all, of course, the performance is really good. We could compare this to deep networks and, not surprisingly, it performs much better.

But more importantly, it's super adaptive, right? I mean, if the channel changes a little bit, we don't have to train this whole massive deep network again, right? All we have to do is learn those likelihoods, which we do very efficiently. It's super adaptive. It's computationally super efficient.

We're using the chips that we already have because we're not doing dynamic programming all from scratch. So I don't need these complicated AI chips, but I could use standard communication chips and I'm just augmenting them with different inputs.

To wrap up kind of a lot of what we said, these are very important things to think about, right? I mean, yeah, we can use a massive deep network with massive training data and massive compute time and probably solve all of these problems. But that's not our goal.

Our goal is to do it in a way that we could implement efficiently, that we can use existing infrastructure, existing chips, that we could get interpretability, we can actually understand what we get. We could get adaptivity, right?

Because if we're thinking of communication problems or medical problems, things change locally, right? We don't want to have to retrain when we cross the street or go to a new hospital, right? We want to use the same ideas. We want to be able to understand the results we get. We don't want to have to have massive training data, right? Like that's not available to us in many different problems. And we want to do this in a very, very efficient way.

So the ideas that we described here today are different ways of incorporating what we already know, right? Rather than saying, let's replace signal processing and statistical signal processing with deep learning, what we're saying is, no, let's take the beauty of both worlds. Signal processing is as relevant as ever. We need the fundamentals. We need the fundamentals for modeling. We need the fundamentals for thinking of different statistical considerations, of considerations of efficiency, computational efficiency, robustness, interpretability. All of these are key. But now let's augment them with the power of learning from data. And rather than replacing, we're augmenting. And that really gives us the power of everything we spoke about today.

Well, I mean, I think it's time to wrap up. It was really a lot of fun. There was a lot of stuff that we didn't discuss. Model-based deep learning is so broad. But I guess we can give our listeners an opportunity to also check out our monograph. It was published by Foundations and Trends in Signal Processing. It's entitled Model-Based Deep Learning.

And a lot of the high-level hand-waving stuff we discussed here appears there as closed-form mathematical descriptions, with code that is fully available on GitHub, and it also includes some of the methodologies that we didn't discuss here.

Yes. Thanks, Nir. We definitely encourage all of you to look at some of the published material on this and also to help us think together about some of the issues that we weren't able to address here today, which are, for example, proving a lot of the claims that we made. So having theory is always important. And besides the fact that it's good to be able to prove our claims, having theory also gives us guidelines on how to choose. For example, how do we know how many layers to choose?

How do we know when to stop in terms of training? There's a lot of open practical questions that theory could really help us answer. So that's definitely an important thing that we'd like to look at going forward. How can we use less training data? How could we use physics and modeling for the training data, not just for solving the architectures themselves? There's a lot of open opportunities here in many different applications. And, you know, we're definitely interested in continuing.

We both strongly believe in this approach and we encourage all of you to reach out, connect. There's a lot of open challenges that we'd really like to solve together. But I think maybe the most important message that we want to get through is that we truly believe that the fundamentals of signal processing and statistical signal processing are more relevant than ever, and getting good methods for learning, we believe, can really rely on the basics of statistical signal processing, signal processing and optimization. So we encourage you all to continue to learn those topics and to think about how we can merge them with the power of deep learning rather than replacing them with deep learning.

That's beautifully said. Yonina, thank you so much. Thank you. It's been a real pleasure.

Thank you, the Signal Processing Society, for giving us this stage. And, you know, stay tuned for more content on signal processing and AI in our digital life. Thank you very much.

And there you have it, listeners. Thank you so much for listening to our fifth episode of the Digital Life podcast series. A big thank you to the team that makes every episode possible: the Cactus Production team, Michelle and Bill from the IEEE staff, and Marcelo, the Outreach and Visibility Committee Chair. In closing, we welcome you to explore our other episodes on all leading podcast platforms such as Apple, Amazon and Spotify. We further welcome you to watch our video version on YouTube and on the IEEE Signal Processing Society website. Until next time, keep learning and innovating. Bye-bye.

This transcript was generated by Metacast using AI and may contain inaccuracies.