AI Testing and Evaluation: Learnings from Science and Industry

Jun 23, 2025 · 20 min

Episode description

In the introductory episode of this new series, host Kathleen Sullivan and Senior Director Amanda Craig Deckard explore Microsoft’s efforts to draw on the experience of other domains to help advance the role of AI testing and evaluation as a governance tool.

Transcript

KATHLEEN SULLIVAN

Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I'm your host, Kathleen Sullivan.

As generative AI continues to advance, Microsoft has gathered a range of experts—from genome editing to cybersecurity—to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we'll explore how these insights might help guide the future of AI development, deployment, and responsible use.

[MUSIC ENDS]

KATHLEEN SULLIVAN

For our introductory episode, I'm pleased to welcome Amanda Craig Deckard from Microsoft to discuss the company's efforts to learn about testing in other sectors.

Amanda is senior director of public policy in the Office of Responsible AI, where she leads a team that works closely with engineers, researchers, and policy experts to help ensure AI is being developed and used responsibly. Their insights shape Microsoft's contribution to public policy discussions on laws, norms, and standards for AI.

Amanda, welcome to the podcast.

AMANDA CRAIG DECKARD

Thank you.

SULLIVAN

Amanda, let's give the listeners a little bit of your background. What's your origin story? Can you talk to us a little bit about maybe how you started in tech? And I would love to also learn a little bit more about what your team does in the Office of Responsible AI.

CRAIG DECKARD

Sure. Thank you. I'd say my [LAUGHS] path to tech, to Microsoft, as well, was a bit, like, circuitous, maybe. You know, I thought for the longest time I was going to be a journalist. I studied forced migration. I worked in a sort of state level sort of trial court in Indiana, a legal service provider in India, just to give you a bit of a flavor.

I made my way to Microsoft in 2014 and have been here since, working in cybersecurity public policy first and now in responsible AI. And the way that our Office of Responsible AI has really, sort of, structured itself is bringing together the kind of expertise to really work on defining policy and how to operationalize it at the same time.

And, you know, that means that we have been working through this, you know, real challenge of defining internal policy and practice, making sure that's deeply grounded in the work of our colleagues at Microsoft Research, and then really closely working with engineering to make sure that we have the processes, that we have the tools, to implement that policy at scale.

And I'm really drawn to these kind of hard problems where they have the character of two things being true or there's like, you know, real tension on both sides and in particular, in the context of those kinds of problems, roles in which, like, the whole job is actually just sitting with that tension, not necessarily, like, resolving it and expecting that you're done.

And I think, really, there are two reasons why tech is so, kind of, representative of that kind of challenge that I've always found fascinating. You know, one is that, of course, tech is, sort of, ubiquitous. It's really impacting so many people's lives. But also, you know, because, as I think has become part of our vernacular now, but, you know, is not necessarily immediately intuitive, is like the fact that technology is both a tool and a weapon. And so that's just, like, another reason why, you know, we have to continuously work through that tension and, sort of, like, sit with it, right, and even as tech evolves over time.

SULLIVAN

You bring up such great points, and this field is not black and white. I think that even underscores, you know, this notion that you highlighted that it's impacting everyone. And, you know, to set the stage for our listeners, last year, we pulled in a bunch of experts from cybersecurity, biotech, finance, and we ran this large workshop to study how they're thinking about governance and those playbooks. And so I'd love to understand a little bit more about what sparked that effort—and, you know, there's a piece of this which is really centered around testing—and to hear from you why the focus on testing is so important.

CRAIG DECKARD

If I could rewind a little bit and give you a bit of history of how we even arrived at bringing these experts together, you know, we actually started on this journey in 2023. At that time, there were, like, a lot of these big questions swirling around about, you know, what did we need in terms of governance for AI? Of course, this was in the immediate aftermath of the ChatGPT sort of wave and everyone recognizing that, like, the technology was going to have a different level of impact in the near term. And so, you know, what do we need from governance? What do we need at the global level, in particular, of governance?

And so at the time, in early 2023 especially, there were a lot of attempts to sort of draw analogies to other global governance institutions in other domains. So we actually in 2023 brought together a different workshop than the one that you're referring to specifically focused on testing last year. And we, kind of, had two big takeaways from that conversation.

One was, what are the actual functions of these institutions and how do they apply to AI? And, actually, one of the takeaways was they all sort of apply. [LAUGHS] There's, like, a role for, you know, any of the functions, whether it be sort of driving consensus on research or building industry standards or managing, kind of, frontier risks, for thinking about how those might be needed in the AI context.

And one of the other big takeaways was that, you know, there are also limitations in these analogies. You know, each of the institutions grew up in its own, sort of, unique historical moment, like the one that we sit in with AI right now. And in each of those circumstances, they don't exactly translate to this moment. And so, yeah, there was like this kind of, OK, we want to draw what we can from this conversation and then we also want to understand, what is also very important that's just different for AI right now?

We published a book with the lessons from that conversation in 2023. And then we actually went on a bit of a tour [LAUGHS] with that content where we had a number of roundtables actually all over the world where we gathered feedback on how those analogies were landing, how our takeaways were landing. And one of the things that we took from them was a gap that some of the participants saw in the analogies that we chose to focus on. So across multiple conversations, other domains kept being raised, like, why did you not also study pharmaceuticals? Why did you also not study cybersecurity, for example? And so that, you know, naturally got us thinking about what further lessons we could draw from those domains.

At the same time, though, we also saw a need to, again, go deeper than what we went and really, like, focus on a narrower problem.

So that's really what led us to trying to think about a more specific problem where we could think across levels of governance and bring in some of these other domains. And, you know, testing was top of mind. Continues to be a really important topic in the AI policy conversation right now, I think, for really good reason. A lot of policymakers are focused on, you know, what we need to do to, kind of, have there be sufficient trust, and testing is going to be a part of that—really better understand risk, enable everyone to be able to make more, kind of, risk-informed decisions, right. Testing is an important component for governance and AI and, of course, in all of these other domains, as well.

So I'll just add the other, kind of, input into the process for this second round was exploring other analogies beyond those that we, kind of, got feedback on. And one of the early, kind of, examples of another domain that would be really worthwhile to study that came to mind from, sort of, just studying the literature was genome editing. You know, genome editing was really interesting through the process of thinking about other kind of general-purpose technologies. We also arrived at nanoscience and brought those into the conversation.

SULLIVAN

That's great. I mean, actually, if you could double-click, I mean, you just named a number of industries. I'd love to just understand which of those worlds maybe feels the closest to what we're wrestling with, with AI and maybe which is kind of the farthest off, and what makes them stand out to you?

CRAIG DECKARD

Oh, such a good question. For this second round, we actually brought together eight different domains, right. And I think we actually thought we would come out of this conversation with some bit of clarity around, Oh, if we just, sort of, take this approach for this domain or that domain, we'll sort of have—at least for now—really solved part of the puzzle. [LAUGHS] And, you know, our public policy team the day after the workshop, we had a, sort of, follow-on discussion, and the very first thing that we started with in that conversation was like, OK, so which of these domains? And fascinatingly, like, everyone was sort of like, Ahh! [LAUGHS] None of them are applying perfectly. I mean, this is also speaking to the limitations of analogies that we already acknowledged.

And also, you know, all of the experts from across these domains gave us really interesting insights into, sort of, the tradeoffs and the limitations and how they were working. None are really applying perfectly for us. But all of them do offer a thread of insight that is really useful for thinking about testing in AI, and there are some different dimensions that I think are really useful as framing for that.

I mean, one is just this horizontal-versus-vertical, kind of, difference in domains and, you know, the horizontal technology like genome editing or nanoscience just being inherently different and seemingly very similar to AI in that you want to be able to understand risks in the technology itself and there is just so much contextual, sort of, factor that matters in the application of those technologies for how the risk manifests that you really need to, kind of, do those two things at once—of understanding the technology but then really thinking about risk and governance in the context of application versus, you know, a context like or a domain like civil aviation or nuclear technology, for example.

You know, even in the workshop itself that we hosted late last year, where we brought together this second round of experts, it was really interesting. We actually started the conversation by trying to understand how those different domains defined risks, where they were able to set risk thresholds. That's been such a part of the AI policy conversation in the last year. And, you know, it was really instructive that the more vertical domains were able to, sort of, snap to clearer answers much more quickly. [LAUGHS] But, like, the horizontal nanoscience and genome editing were not because it just depends, right. So anyway, the horizontal-vertical dimension seems like a really important one to draw from and apply to AI.

The couple of others that I would offer is just, you know, thinking about the different kinds of technologies. You know, obviously, there's some of the domains that we studied that they're just inherently, sort of, like, physical technologies … a mix of physical and digital or virtual in a lot of cases because all of these are, of course, applying digital technology. But like, you know, there is just a difference between something like an airplane or a medical device or, you know, the more kind of virtual or intangible sort of technologies even, you know, of course, AI and some of the other like cyber and genome editing but also like, you know, financial services having some of that quality. And again, I think the thing that's interesting to us about AI is to think about AI and risk evaluation of AI as being, you know, having a large component of that being about the kind of virtual or intangible technology. And also, you know, there is a future of robotics where we might need to think about the, kind of, physical risk evaluation kind of work, as well.

And then the final thing I'd maybe say in terms of thinking about which domains have the lessons for AI that are most applicable is just how they've grappled with these different kind of governance questions. Things like how to turn the dial in terms of being more or less prescriptive on risk evaluation approaches, how they think about the balance of, kind of, pre-market versus post-market risk evaluation in testing, and what the tradeoffs have been there across domains has been really interesting to kind of tease out. And then also thinking about, sort of, who does what?

So, you know, in each of these different domains, it was interesting to hear about, like, you know, the role of industry, the role of governments, the role of third-party experts in designing evaluations and developing standards and actually doing the work, and, kind of, having the pull through of what it means for risk and governance decisions. There were, again, there was a variety of, sort of, approaches across these domains that I think were interesting for AI.

SULLIVAN

You mentioned that there's a number of different stakeholders to be considering across the board as we're thinking about policy, as we're thinking about regulation. Where can we collaborate more across industry? Is it academia? Regulators? Just, how can we move the needle faster?

CRAIG DECKARD

I think all of the above [LAUGHTER] is needed. But it's also really important to have all of that, kind of, expertise brought together, you know, and I think, you know, one of the things that we certainly heard from multiple of the domains, if not all of them, was that same actual interest and need and the same sort of ongoing work to try to figure that out.

You know, even where there had been progress in some of the other domains with bringing together, you know, some industry stakeholders or, you know, industry and government, there was still a desire to actually do more there. Like, if there was some progress in industry and government, the need was, And more kind of cross-jurisdiction government conversation, for example. Or some progress on, you know, within the industry but needing to, like, strengthen the partnership with academia, for example. So, you know, I think it speaks to, like, the quality of your question, to be honest, that, you know, all of these domains are actually still grappling with this and still seeing the need to grow in that direction more.

What I'd say about AI today is that we have made good progress with, you know, starting to build some industry partnerships. You know, we were a founding member of the Frontier Model Forum, or FMF, which has been a very useful place for us to work with some peers on really trying to bring forward some best practices that apply across our organizations. You know, there are other forums as well, like MLCommons, where we're working with others in industry and broader, sort of, academic and civil society communities. Partnership on AI is another one I think about that, kind of, fits that mold, as well, in a really positive way. And, like, there are a lot of different, sort of, governance needs to think through and where, you know, we can really think about bringing that expertise together is going to be so important.

I think about almost, like, in the near to mid-term, like three issues that we need to address in the AI, kind of, policy and testing context.

One is just building kind of, like, a flexible framework that allows us to really build trust while we continue to advance the science and the standards. You know, we are going to need to do both at once. And so we need a flexible framework that enables that kind of agility, and advancing the science and the standards, that is going to be something that really demands that kind of cross-discipline or cross kind of expertise group coming together to work on that—researchers, academics, civil society, governments and, of course, industry.

And so I think that is, actually, the second problem is, like, how do we actually build the kind of forums and ways of working together, the public-private partnership kind of efforts that allow all of that expertise to come together and fit together over time, right. Because when these are really big, broad challenges, you kind of have to break them down incrementally, make progress on them, and then bring them back together.

And so I think about, like, one example that I, you know, really have been reflecting on lately is, you know, in the context of building standards, like, how do you do that, right? Again, standards are going to benefit from that whole community of expertise. And, you know, there are lots of different kinds of quote-unquote standards, though, right. You kind of have the “small s” industry standards. You have the kind of “big S” international standards, for example. And how do you, kind of, leverage one to accelerate the other, I think, is part of, like, how we need to work together within this ecosystem. And, like, I think what we and others have done in an organization like C2PA [Coalition for Content Provenance and Authenticity], for example, where we've really built an industry specification but then built on that towards an international standard effort is one example that is interesting, right, to point to.

And then, you know, I actually think that bridges to the third thing that we need to do together within this whole community, which is, you know, really think again about how we manage the breadth of this challenge and opportunity of AI by thinking about this horizontal-vertical problem. And, you know, I think that's where it's not just the sort of tech industry, for example. It's broader industry that's going to be really applying this technology that needs to get involved in the conversation about not just, sort of, testing AI models, for example, but also testing how AI systems or applications are working in context. And so, yes, so much fun opportunity!

[MUSIC]

SULLIVAN

Amanda, this was just fantastic. You've really set the stage for this podcast. And thank you so much for sharing your time and wisdom with us.

CRAIG DECKARD

Thank you.

SULLIVAN

And to our listeners, we're so glad you joined us for this conversation. An exciting lineup of episodes are on the way, and we can't wait to have you back for the next one.

[MUSIC FADES]

Transcript source: Provided by creator in RSS feed