Abstracts: November 4, 2024

Nov 04, 2024

Episode description

In their 2024 SOSP paper, researchers explore a common—though often undertested—software system issue: retry bugs. Research manager Shan Lu and PhD candidate Bogdan Stoica share how they’re combining traditional program analysis and LLMs to address the challenge.

Read the paper

Transcript

GRETCHEN HUIZINGA

Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

GRETCHEN HUIZINGA

Today I'm talking to Dr. Shan Lu, a senior principal research manager at Microsoft Research, and Bogdan Stoica, also known as Bo, a doctoral candidate in computer science at the University of Chicago. Shan and Bogdan are coauthors of a paper called “If at First You Don't Succeed, Try, Try, Again …? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems.” And this paper was presented at this year's Symposium on Operating Systems Principles, or SOSP. Shan and Bo, thanks for joining us on Abstracts today!

SHAN LU

Thank you.

BOGDAN STOICA

Thanks for having us.

HUIZINGA

Shan, let's kick things off with you. Give us a brief overview of your paper. What problem or issue does it address, and why should we care about it?

LU

Yeah, so basically from the title, we are looking at retry bugs in software systems. So what retry means is that people may not realize for big software like the ones that run in Microsoft, all kinds of unexpected failures—software failure, hardware failure—may happen. So just to make our software system robust, there's often a retry mechanism built in. So if something unexpected happens, a task, a request, a job will be re-executed. And what this paper talks about is, it's actually very difficult to implement this retry mechanism correctly. So in this paper, we do a study to understand what are typical retry problems and we offer a solution to detecting these problems.
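
[ILLUSTRATIVE SKETCH] The retry mechanism described above can be sketched roughly as follows. This is a minimal example and not code from the paper: the names `TransientError` and `run_with_retry`, the attempt cap, and the exponential backoff are all assumptions chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a brief network drop."""


def run_with_retry(task, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Re-execute `task` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                # Give up: retrying without a cap would hide a persistent failure.
                raise
            # Wait a little longer before each re-execution (exponential backoff).
            sleep(base_delay * 2 ** (attempt - 1))
```

Even in this small form there are several decisions to get right: which errors count as transient, how many attempts to allow, and how long to wait between them.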

HUIZINGA

Bo, this clearly isn't a new problem. What research does your paper build on, and how does your research challenge or add to it?

STOICA

Right, so retry is a well-known mechanism and is widely used. And retry bugs, in particular, have been identified in other papers as root causes for all sorts of failures but never have been studied as a standalone class of bugs. And what I mean by that, nobody looked into, why is it so difficult to implement retry? What are the symptoms that occur when you don't implement retry correctly? What are the causes of why developers struggle to implement retry correctly? We built on a few key bug-finding ideas that have been looked at by other papers but never in this context. We use fault injection. We repurpose existing unit tests to trigger this type of bugs as opposed to asking developers to write specialized tests to trigger retry bugs. So we’re, kind of, making the developer's job easier in a sense. And in this pipeline, we also rely on large language models to augment the program and the code analysis that goes behind the fault injection and the reutilization of existing tests.

HUIZINGA

Have large language models not been utilized much in this arena?

LU

I want to say that, you know, actually this work was started about two years ago. And at that time, large language model was really in its infancy and people just started exploring what large language model can help us in terms of improving software reliability. And our group, and together with, you know, actually same set of authors from Microsoft Research, we actually did some of the first things in a workshop paper just to see what kind of things that we were able to do before like, you know, finding bugs can now be replicated by using large language model.

HUIZINGA

OK …

LU

But at that time, we were not very happy because, you know, just use large language model to do something people were able to do using traditional program analysis, I mean, it seems cool, right, but does not add new functionality. So I would say what is new, at least when we started this project, is we were really thinking, hey, are there anything, right, are there some program analysis, are there some bug finding that we were not able to do using traditional program analysis but actually can be enabled by large language model.

HUIZINGA

Gotcha …

LU

And so that was at, you know, what I feel like was novel at least, you know, when we worked on this. But of course, you know, large language model is a field that is moving so fast. People are, you know, finding new ways to using it every day. So yeah.

HUIZINGA

Right. Well, in your paper, you say that retry functionality is commonly undertested and thus prone to problems slipping into production. Why would it be undertested if it's such a problem?

STOICA

So testing retry is difficult because what you need is to simulate the systemwide conditions that lead to retry. That often means simulating external transient errors that might happen on the system that runs your application. And to do this during testing and capture this in a small unit test is difficult.

LU

I think, actually, Bogdan said this very well. It's like, why do we need a retry? It's, like, when unexpected failure happen, right. And this is, like, something like Bogdan mentioned, like external transient error such as my network card suddenly does not work, right. And this may occur, you know, only for, say, one second and then it goes back on. But this one second may cause some job to fail and need retry. So during normal testing, these kind of unexpected things rarely, rarely happen, if at all, and it's also difficult to simulate. That's why it's just not well tested.

HUIZINGA

Well, Shan, let's talk about methodology. Talk a bit about how you tackled this work and why you chose the approach you did for this particular problem.

LU

Yeah, so I think this work includes two parts. One is a systematic study. We study several big open-source systems to see whether there are retry-related problems in this real system. Of course there are. And then we did a very systematic categorization to understand the common characteristics. And the second part is about, you know, detecting. And in terms of method, we have used, particularly in the detecting part, we actually used a hybrid of techniques of traditional static program analysis. We used this large language model-enabled program analysis. In this case, imagine we just asked a large language model saying, hey, tell us, are there any retry implemented in this code? If there is, where it is, right. And then we also use, as Bogdan mentioned, we repurposed unit test to help us to execute, you know, the part of code that large language model tell us there may be a retry. And in addition to that, we also used fault injection, which means we simulate those transient, external, environmental failures such as network failures that very rarely would occur by itself.
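
[ILLUSTRATIVE SKETCH] The combination of repurposed unit tests and fault injection described above might look something like the following. This is a simplified sketch under stated assumptions, not the authors' tool: `Client`, `ExistingTest`, `inject_transient_faults`, and `run_test_under_faults` are hypothetical names, and the choice of `Client.fetch` as the injection point stands in for a location that an analysis step (static or LLM-based) would have flagged as being retried.

```python
import unittest
from unittest import mock


class Client:
    """Hypothetical application code with a built-in retry."""

    def fetch(self):
        return "data"  # stand-in for a network call

    def fetch_with_retry(self, attempts=3):
        for i in range(attempts):
            try:
                return self.fetch()
            except ConnectionError:
                if i == attempts - 1:
                    raise  # out of attempts: surface the failure


class ExistingTest(unittest.TestCase):
    """An ordinary unit test that was not written with retry in mind."""

    def test_fetch(self):
        self.assertEqual(Client().fetch_with_retry(), "data")


def inject_transient_faults(target_cls, method_name, failures):
    """Make the first `failures` calls to the method raise a transient error."""
    original = getattr(target_cls, method_name)
    state = {"calls": 0}

    def faulty(self, *args, **kwargs):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("injected transient fault")
        return original(self, *args, **kwargs)

    return mock.patch.object(target_cls, method_name, faulty), state


def run_test_under_faults(failures):
    """Re-run the existing test with faults injected; report outcome and call count."""
    patcher, state = inject_transient_faults(Client, "fetch", failures)
    with patcher:
        suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExistingTest)
        result = unittest.TestResult()
        suite.run(result)
    return result.wasSuccessful(), state["calls"]
```

Calling `run_test_under_faults(2)` exercises the retry path that the unmodified test never reaches, while `run_test_under_faults(3)` shows what happens once the injected failures exhaust the attempt budget.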

HUIZINGA

Well, Bo, I love the part in every paper where the researchers say, “And what we found was ...” So tell us, what did you find?

STOICA

Well, we found that implementing retry is difficult and complex! Not only find new bugs because, yes, that was kind of the end goal of the paper but also try to understand why these bugs are happening. As Shan mentioned, we started this project with a bug study. We looked at retry bugs across eight to 10 applications that are widely popular, widely used, and that the community is actively contributing to them. And the experiences of both users and developers, if we can condense that—what do you think about retries?—is that, yeah, they're frustrated because it's a simple mechanism, but there's so many pitfalls that you have to be aware of. So I think that's the biggest takeaway.

Another takeaway is that when I was thinking about bug-finding tools, I was having this somewhat myopic view of, you know, you instrument at the program statement level, you figure out relationships between different lines of code and anti-patterns, and then you build your tools to find those anti-patterns. Well, with retry, this kind of gets thrown out the window because retry is a mechanism. It's not just one line of code. It is multiple lines of code that span multiple functions, multiple methods, and multiple files. And you need to think about retry holistically to find these issues. And that's one of the reasons we used large language models, because traditional static analysis or traditional program analysis cannot capture this. And, you know, large language models turns out to be actually great at this task, and we try to harness the, I would say, fuzzy code comprehension capabilities of large language models to help us find retry bugs.

HUIZINGA

Well, Shan, research findings are important, but real-world impact is the ultimate goal here. So who will this research help most and why?

LU

Yeah, that's a great question. I would consider several groups of people. One is hopefully, you know, people who actually build, design real systems will find our study interesting. I hope it will resonate with them about those difficulties in implementing retry because we studied a set of systems and there was a little bit of comparison about how different retry mechanisms are actually used in different systems. And you can actually see that, you know, this different mechanism, you know, they have pros and cons, and we have a little bit of, you know, suggestion about what might be good practice. That's the first group. The second group is, our tool actually did find, I would say, a relatively large number of retry problems in the latest version of every system we tried, and we find these problems, right, by repurposing existing unit tests. So I hope our tool will be used, you know, in the field by, you know, being maybe integrated with future unit testing so that our future system will become more robust. And I guess the third type of, you know, audience I feel like may benefit by reading our work, knowing our work is the people who are thinking about how to use large language model. And as I mentioned, I think a takeaway is large language model can repeat, can replace some of things we were able to do using traditional program analysis and it can do more, right, for those fuzzy code comprehension–related things. Because for traditional program analysis, we need to precisely describe what I want. Like, oh, I need a loop. I need a WRITE statement, right. For large language model, it's imprecise by nature, and that imprecision sometimes actually match with the type of things we're looking for.

HUIZINGA

Interesting. Well, both of you have just, sort of, addressed nuggets of this research. And so the question that I normally ask now is, if there's one thing you want our listeners to take away from the work, what would it be? So let's give it a try and say, OK, in a sentence or less, if I'm reading this paper and it matters to me, what's my big takeaway? What is my big “aha” that this research helps me with?

STOICA

So the biggest takeaway of this paper is not to be afraid to integrate large language models in your bug-finding or testing pipelines. And I'm saying this knowing full well how imprecise large language models can be. But as long as you can trust but verify, as long as you have a way of checking what these models are outputting, you can effectively insert them into your testing framework. And I think this paper is showing one use case and bring us closer to, you know, having it integrated more ubiquitously.
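
[ILLUSTRATIVE SKETCH] One way to read "trust but verify" is to treat the model's answers as candidates and confirm each one with a cheap dynamic check. The sketch below assumes this reading and is not the paper's implementation: no real model is called, `llm_candidates` is a hard-coded stand-in for model output, and `reexecutes_after_fault`, `with_retry`, and `without_retry` are hypothetical names.

```python
def reexecutes_after_fault(call_site, failures=1):
    """Return True if `call_site` calls the operation again after an injected fault."""
    calls = 0

    def operation():
        nonlocal calls
        calls += 1
        if calls <= failures:
            raise ConnectionError("injected transient fault")
        return "ok"

    try:
        call_site(operation)
    except ConnectionError:
        pass  # the call site gave up; the call count still shows what happened
    return calls > failures


def with_retry(operation):
    """Candidate that really does retry."""
    for attempt in range(3):
        try:
            return operation()
        except ConnectionError:
            if attempt == 2:
                raise


def without_retry(operation):
    """Candidate the model flagged incorrectly: it calls the operation only once."""
    return operation()


# Stand-in for locations a model claimed contain retry logic (assumed, not real output).
llm_candidates = {"with_retry": with_retry, "without_retry": without_retry}

# Keep only the claims that the dynamic check confirms.
confirmed = sorted(
    name for name, site in llm_candidates.items() if reexecutes_after_fault(site)
)
```

The imprecise step (the model's guess) is kept, but only guesses that survive the check are passed on to the rest of the pipeline.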

HUIZINGA

Well, Shan, let's finish up with ongoing research challenges and open questions in this field. I think you've both alluded to the difficulties that you face. Tell us what's up next on your research agenda in this field.

LU

Yeah, so for me, personally, I mean, I learned a lot from this project and particularly this idea of leveraging large language model but also as a way to validate its result. I'm actually working on how to leverage large language model to verify the correctness of code, code that may be generated by large language model itself. So it's not exactly, you know, a follow-up of this work, but I would say at idea, you know, philosophical level, it is something that is along this line of, you know, leverage large language model, leverage its creativity, leverage its … sometimes, you know … leverage its imprecision but has a way, you know, to control it, to verify it. That's what I'm working on now.

HUIZINGA

Yeah … Bo, you're finishing up your doctorate. What's next on your agenda?

STOICA

So we're thinking of, as Shan mentioned, exploring what large language models can do in this bug-finding/testing arena further and harvesting their imprecision. I think there are a lot of great problems that traditional code analysis has tried to tackle, but it was difficult. So in that regard, we're looking at performance issues and how large language models can help identify and diagnose those issues because my PhD was mostly focused, up until this point, on correctness. And I think performance inefficiencies are such a wider field and with a lot of exciting problems. And they do have this inherent imprecision and fuzziness to them that also large language models have, so I hope that combining the two imprecisions maybe gives us something a little bit more precise.

HUIZINGA

Well, this is important research and very, very interesting.

[MUSIC]

HUIZINGA

Shan Lu, Bogdan Stoica, thanks for joining us today. And to our listeners, thanks for tuning in. If you're interested in learning more about this paper, you can find a link at aka.ms/abstracts. And you can also find it on the SOSP website. See you next time on Abstracts!

[MUSIC FADES]

Transcript source: Provided by creator in RSS feed