Developing better code with automated testing

Jun 10, 2020, 45 min

Episode description

Graham Lee, Research Software Engineer, Oxford RSE Group, gives a talk for the Department of Statistics on 22nd May 2020. Abstract: If we want reliable, reproducible simulations and data analysis software, we need to know that we have implemented our code correctly. Further, we need to be confident that changes we make to the code do not introduce unintended flaws. Automated testing is a technique widely used in industry to capture information about the expected behaviour of software and ensure that the system retains that behaviour through its evolution. In this talk, Graham explores the application of the technique to scientific software.

Transcript

OK, good. It gives me great pleasure to welcome Graham Lee from Computer Science, who is going to be talking about automated testing, so please take it away. Great, thank you very much. Hopefully, what you're now looking at is the slides. And yes, I will rely on Garrett to moderate the chat and the Q&A. I welcome questions at any point.

So feel free to nudge when you've got a question, but I'm going to be focussing on my presenter notes and my slides, so I probably won't notice, and I'll rely on him to let me know if anything's come up. So yes, as was said, I'm in the Research Software Engineering group, which is based in the Computer Science department, although my background is in professional software engineering.

So I did a degree in Oxford in physics back in 2004 and then moved into commercial computing. I've worked at a bunch of companies, but I have always had software testing as a sort of thread running through my career. A larger number of years ago than I care to think about, I wrote the book Test-Driven iOS Development, which is about how developers writing software apps for the iPhone and the iPad can test their software.

I was the manager of an engineering group at Facebook who developed the mobile testing frameworks that Facebook used, and one of the things you have to ask in doing testing is: what are the benefits? Obviously we hope that by having test coverage, by knowing something about the behaviour of our code, we're going to have increased confidence in its correctness and in its function.

But there are fringe benefits as well. At Facebook, we actually reduced the time it took to release a new update to the Facebook mobile app, so the Facebook app for iOS and Android, from four weeks to one and a half weeks, and it's probably even shorter now, just by taking some of the testing that was being done manually every time there was a new release candidate of the app and automating that, so that we could very quickly get information on the quality of the software.

And I also did some work for Apple last year documenting their test tools and the techniques for using software testing with that technology. That's on the Apple developer website. So that's the bit where I just tell you what my CV is, so there's some legitimacy for me being the guy giving this talk.

And then the question, you know, who is this Oxford Research Software Engineering group? We're effectively a sort of service or facility in the university for helping researchers to achieve their goals using custom software and bespoke software development. That often means that we get involved in a project, like a grant-funded research project or a spin-out, actually writing software. But that's not the only thing we do. We also do outreach, like these seminars, and we do teaching.

And really, one of our main goals is to bring up the standard of software development across the university. So rather than just being some sort of gatekeeper or central clearing house for software, we're actually building a community of expertise and practice across the university. So that's really why I'm very happy to be given the opportunity to waffle on on a Friday afternoon about software testing, but with less waffle, because there's only me between us and the weekend now. So let's get going.

So this is really an introduction to the idea of software testing in a scientific or computational research context. I'm going to try to stick mostly to principles about how we think about testing, why we think about testing and how to plan for creating tests for your software.

I'm not going to go in depth on any particular tools or technologies, partly because I think telling you how to use something before you've got motivation for using it is kind of off-putting and not relevant or useful, and partly because there's a wealth of different technologies out there and it depends on what you're trying to do.

If you're writing some sort of data manipulation in R, you're going to have a very different experience from if you're writing a web application in JavaScript. So picking any one of those would lose a bunch of the audience and not necessarily even be useful for the people who are using that particular technology. So what are we doing when we test software? What are we trying to get out of this thing?

And I've come up with four goals for testing, which I've called continuity, correctness, reproducibility and recovery. So let's take a look at those. By continuity, I mean that what the software does today should be somewhat related to what it's going to do tomorrow. We obviously do evolve software: we add new things, we fix bugs. The idea is that any of these should be an improvement. It's very rare for us to deliberately remove capabilities from software.

It does happen: sometimes we realise that we're supporting an old platform that's no longer relevant, or we've got some old algorithm that the community has moved on from and we don't need to have that algorithm anymore. But those are specific events that we can plan for. What we really don't want are unplanned breakages or losses of functionality, which are called regressions in the software industry.

You can imagine that if you've published research based on a code that performs a simulation or does some analysis of the data, and someone comes along and wants to replicate that analysis or rerun that simulation, they may want to do it in a newer context. They may want to try new ideas. They may want to use newer techniques. But they want the thing to basically work, so they still want to get the results that they were able to get before.

So one thing that having tests gives us is not only the knowledge that our software works now; it's knowledge about whether future versions of the software still have that earlier capability and that earlier behaviour, because we can always keep these tests and run them against new versions of the software. Correctness is perhaps the one that makes a lot of people doing scientific computation stop and wonder whether testing is really relevant for them.

You know: I'm doing research, I'm trying to find out the answer to a question for which I don't know the answer. By definition, if I knew what the answer to the question was, it wouldn't be research. So how can I write a test when I don't know what the outcome is going to be? And that is a good question. It's an important question. We could have some complex problem domain that we're trying to model and a new context to explore with that model.

And while we may not know what the outcome is going to be in terms of the research problem, we want to at least have an idea that the model we have come up with conceptually is correctly implemented in our code. So if I'm simulating something that looks like a many-body problem, maybe in gravitation, well, we have models of many-body problems in gravitation and we know how a model like this behaves over time.

We know that if we set it into some initial condition and then progress it by some amount, we know where everything should end up. And if we know where everything should end up, we also know that if it does end up there, it is correct, and if it doesn't end up there, then something has gone wrong.

Obviously, we're not building physical models, we're building software models, but software models of complex gravitational problems still have the aspect that they are implementing some part of a simulation of a problem domain. And if we design the simulation, then we can know how that simulation behaves and we can validate that it is behaving in the way that we expect. And there are things that we can do to help that.

So in a many-body problem, we know what happens when there are two bodies in a gravitational interaction. Or in a more complex system, it's easier to work in, say, the sort of low-velocity relativity domain, where space and time are basically constant and don't change, than it is in the sort of Einstein domain, with velocities approaching the speed of light. That's not necessarily the problem that we're trying to solve here.

It's just an example. And this brings us on to a principle that software testers use called equivalence partitioning. Another problem we may have is that the scientific problem we're trying to model could be incredibly large. Just to give a different example, Garrett and I were talking before the start of the seminar about the behaviour of particular proteins in a biological system.

Now, one thing that we might want to do with a computer is simulate the structure of these proteins by saying: they're built of all of these components, all of these atoms organised in this particular way. What structure is that going to collapse into when the various electromagnetic forces on the different ions and atoms in the thing are in a stable state? And that is a common computational problem to solve.

But when you're working with proteins, if you take something big like a virus and say, how is this going to fold? (a) You may not know the answer in advance; (b) it could take a very long time, even on a supercomputer cluster, to find out what the answer is. But let's take a simpler problem. We know what the angle between the two hydrogen-oxygen bonds in a water molecule is. Does our model get that right?

If it doesn't, then we probably shouldn't be particularly confident in using it for any more complex problem. If it works for water, then let's try sugar. Then let's try a really simple protein and see whether it gets the right answer. We're not doing anything weird in a software context here. What you're saying is: for this problem, where I know what the conditions are, I also know what the outcome is, and I can run my software and then verify that I get the same outcome.

If I run this multiple times with different inputs and always get the expected outcomes, then I increase my confidence that my software is correct. And then the two remaining goals are reproducibility and recovery. Reproducibility is obviously very important in research; that's why we have the reproducible research network.

We want someone who's running our analysis with our code to get the same results. That may mean that they're running it on our computer; it may mean that they're running it on a different computer. It may just mean that they're doing exactly the same thing at a different time. But we would expect to get the same results in that context. Someone running our analysis, but with different data, should get consistent results in many circumstances.
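Illustrative sketch (not from the talk): one way to express "same code, same inputs, same results" as a test is to make any randomness explicit. The function `estimate_pi`, the seed value and the tolerance below are all hypothetical choices made for this example.

```python
import random


def estimate_pi(samples, seed):
    """Monte Carlo estimate of pi using a private, seeded random generator."""
    rng = random.Random(seed)  # no hidden global state, so reruns are repeatable
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


def test_same_seed_gives_identical_result():
    # Reproducibility: running exactly the same thing twice gives the same answer.
    assert estimate_pi(10_000, seed=42) == estimate_pi(10_000, seed=42)


def test_result_is_consistent_with_known_value():
    # Correctness: the estimate should be close to the known answer (loose tolerance).
    assert abs(estimate_pi(10_000, seed=42) - 3.14159) < 0.1
```

Passing the seed in as a parameter is a design choice: it lets someone else rerun the analysis with the same seed, or deliberately with a different one.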

If you're running a simulation in different but related domains, and the simulation correctly behaves and represents the outcome of the model in those particular domains, then the results should be comparable somehow. We would also like someone who takes our ideas, takes our model and recodes it, to get consistent results, or at least, if they don't get consistent results, for it to be possible for us to investigate where the disparity comes from.

And that's another thing that we're going to get from automated testing that we'll look at later: a bit more fine-grained information about how the different parts of our software system interact and which bits of it are behaving in particular ways.

And there's a really important thing to bear in mind when we're talking about reusing software, reproducing the results we get out of software, and recovering the behaviour of software that we used a long time ago: sometimes the poor person having to deal with the software is me or you. It's the same person who wrote it. We get distracted by another project, or we get some teaching that we have to do for a term.

And a few months later, we come back and we don't quite remember what we were doing. There was obviously some stroke of genius when we wrote that function there, but why did we write it that way? What does it do? When we've got a collection of tests that say, here's what this part of the software does in these circumstances, that's more documentation. That's more help, both to us to recover our mental model of what the software does, and to other people to reconstruct that mental model and get an idea of how this software works, so that they can either reuse it or develop it.

So how would testing help in these scenarios? Say someone else wants to use our code and run it with the same data.

If we've got a collection of tests that explain what the software does and how it should behave for particular inputs, then before someone else runs the simulation or runs the analysis and checks whether they get the same outcome, they can run those tests and see whether they all pass, to see whether all of our expectations are satisfied. And if any of those things fails, that's going to give some information about what the assumption is that isn't satisfied in this new context.

Maybe the software expects some files to be present, like configuration files or inputs set up in a particular way, that I've got in my home directory and that definitely works for me. But someone else needs to know that information and set the same thing up in the same way if they want to get compatible results.

Someone using different data wants to get consistent results. Again, if we can prove that the software does correctly implement the model, or the other scientific concepts that we're trying to embody, then they can be somewhat confident that the results they get out from using it with their data are the result of our model being applied to that data, and not the result of something weird going on with some code or of there being some mistake in the behaviour somewhere.

And then suppose someone else wants to take and re-implement our model. That might be just for a cross-check, to make sure that they understand what the model is. It might be because they're using a different context: maybe their supercomputer or their cluster doesn't have the same libraries as ours and they want to build a version that is compatible with their setup, or maybe they're using a Mac and we were using Linux.

For whatever reason they want to rebuild it, if they can see the tests and the expected results, and they can compare the results they get from their implementation with the results that they get from our implementation, then they know something about the compatibility of those without having to run the whole experiment and see what the outcome is at the end.

So really, all of these things are about increasing confidence in the software and increasing the rate at which we get feedback that informs that confidence in the software. That was my strategic pause, just to check whether there were any questions. Obviously not right now, so I'll carry on. So how do we design a software test?

You can think of the behaviour of any software as being a form of contract that you make with the user of the software, whether that's yourself or other people in your group or members of the public or whoever's using the software. You can think about a form of contract where you say: if you arrange for this collection of things to be true, and then use the software, then I will make this guarantee about the outcome, or, if there is a calculation, about the result of using the software.

And that design principle holds whatever tools you're using to write your tests and at whatever level (we'll come on later to discuss the different levels of abstraction that exist in designing tests): this idea of a contract is universal. If you set the world up in this way and then use my software, I will do this thing as a result.

And so we could think back to the many-body problem. I can say: if you have a point mass, a mass m1 here and a mass m2 over here, and the distance between them is r, and you then ask my software to calculate the gravitational force on the first mass, it's going to say that the result is G m1 m2 over r squared, which is the Newtonian gravitational force equation, in the direction from the first point to the second point.

If you set the mass of one of these things to be negative, then my software is going to generate an error, because we haven't worked out how to do negative mass, or we decided that negative mass isn't a problem we want to solve.

If you have multiple masses in your simulation and you ask what the force due to gravity is over here, we're going to work out each of those individual contributions and sum them. And so you can see that this idea of the contract is coming into play. If there is a mass here and a mass there, and then you ask for the gravitational forces, then what the software is going to do is give you this answer.

And so a test takes that contract, takes that idea of the preconditions, then the action, and then the postconditions, and what you do is create a single concrete example of that, where you know what the answer is for a given question. So if the mass of the other object is zero, then the gravitational force is zero. That's a super simple one, and it is a valid case, and we can write that as a test.
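Illustrative sketch (not from the talk): the contract described above could be written down roughly as follows. The function name `gravitational_force`, its signature and the choice of `ValueError` are assumptions made for this example; only the Newtonian formula, the zero-mass case and the negative-mass error come from the talk.

```python
G = 6.674e-11  # gravitational constant in N m^2 / kg^2 (rounded)


def gravitational_force(m1, m2, r):
    """Magnitude of the Newtonian force between two point masses.

    Hypothetical contract: masses are non-negative, separation is positive.
    """
    if m1 < 0 or m2 < 0:
        raise ValueError("negative mass is outside the model")
    if r <= 0:
        raise ValueError("separation must be positive")
    return G * m1 * m2 / r**2


def test_zero_mass_gives_zero_force():
    # Precondition: the other object has no mass. Postcondition: no force.
    assert gravitational_force(5.0, 0.0, 2.0) == 0.0


def test_negative_mass_is_rejected():
    # Precondition violated: the contract says an error is generated.
    try:
        gravitational_force(-1.0, 1.0, 1.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for negative mass")
```

Each test is one concrete example of the contract: a known setup, one action, and one checked outcome.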

If the masses are one kilogram each and the distance is one metre, then the force is just the gravitational constant G. Again, another example. And we start to think, well, aren't there an infinity of examples? Taking this scenario of the many-body gravitational problem, I could have anywhere from zero to an infinite number of different masses, each at any of infinitely many points in space and with infinitely many initial velocities.

Do I really have to write that many infinities of different tests? What software testers do instead is look for the meaningfully distinct regions, the distinct domains in the problem space. And then they write tests that capture one example from each of those regions.

So there's the trivially degenerate case where there are no masses in your many-body simulation; there's a very simple answer there. Having one mass is again a trivial situation, of which there's one example. Having two masses is another similarly simple example.

And then the idea that, as you add more, what happens is that you sum the contributions to gravity from each mass tells us that as soon as you've got more than two, if it works for any number more than two, it works for all numbers more than two, because it's just got to do the same maths but with more inputs. So a tester would write a test for zero, one, two and three masses and would then be entirely happy.
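Illustrative sketch (not from the talk): equivalence partitioning on the number of masses might look like the following, with one test per partition. To keep it short the bodies sit on a line; `net_forces` and its `(mass, x)` input format are hypothetical.

```python
import math

G = 6.674e-11  # gravitational constant in N m^2 / kg^2 (rounded)


def net_forces(bodies):
    """Net gravitational force on each body; bodies are (mass, x) at distinct x."""
    forces = []
    for i, (mi, xi) in enumerate(bodies):
        total = 0.0
        for j, (mj, xj) in enumerate(bodies):
            if i == j:
                continue
            dx = xj - xi
            # Attractive force, pointing from body i towards body j.
            total += math.copysign(G * mi * mj / dx**2, dx)
        forces.append(total)
    return forces


def test_no_masses():
    assert net_forces([]) == []


def test_one_mass_feels_no_force():
    assert net_forces([(1.0, 0.0)]) == [0.0]


def test_two_masses_attract_equally_and_oppositely():
    left, right = net_forces([(1.0, 0.0), (1.0, 1.0)])
    assert math.isclose(left, G) and math.isclose(right, -G)


def test_three_masses_sum_contributions():
    left, middle, right = net_forces([(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)])
    assert math.isclose(left, 1.25 * G)             # G/1^2 + G/2^2
    assert math.isclose(middle, 0.0, abs_tol=1e-20)  # pulls cancel by symmetry
    assert math.isclose(right, -1.25 * G)
```

Four tests cover the four partitions named in the talk: zero, one, two, and "more than two" masses.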

That works really well for discrete variables like that. Where you have continuous variables, there's a related idea: as well as there being different ranges or domains in the problem space, you also do what's called boundary value analysis, where at the boundary between two of these regimes you ask: are the results effectively continuous through the boundary? Does it do the right thing as you move from one domain to the other?

And so if we had some simulation that had, say, relativistic corrections, then at small velocities it would just use normal linear space and time, and when you went to higher velocities it would use the relativistic corrections.

There would be some point in between where it started to use these corrections, and the tester would look at what happens just below this point, what happens on that point and what happens just after, and make sure that there are three consistent values, which makes sure that the transition between the regimes is smooth.
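Illustrative sketch (not from the talk): boundary value analysis for a simulation that switches on relativistic corrections above some speed. The switch-over at 1% of the speed of light and the tolerance of 1e-4 are hypothetical choices for this example.

```python
import math

C = 299_792_458.0      # speed of light in m/s
THRESHOLD = 0.01 * C   # hypothetical speed at which corrections switch on


def lorentz_factor(v):
    """Return 1.0 in the low-velocity regime, the relativistic factor above it."""
    if abs(v) < THRESHOLD:
        return 1.0
    return 1.0 / math.sqrt(1.0 - (v / C) ** 2)


def test_transition_is_smooth_at_threshold():
    just_below = lorentz_factor(THRESHOLD * (1 - 1e-9))
    on_boundary = lorentz_factor(THRESHOLD)
    just_above = lorentz_factor(THRESHOLD * (1 + 1e-9))
    # The three values should agree to within the accuracy we care about.
    assert abs(on_boundary - just_below) < 1e-4
    assert abs(just_above - on_boundary) < 1e-4
    # And the factor should never decrease as the speed increases.
    assert just_below <= on_boundary <= just_above
```

If the threshold were set much higher, the jump at the boundary would exceed the tolerance and this test would fail, which is exactly the kind of discontinuity it exists to catch.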

And this is so common in software testing that there are a couple of little mantras, used either in particular technologies or by particular communities, to encapsulate this idea of the contract: the preconditions, the action and the postconditions.

One of them, which is very common in communicating the meaning of a test between software developers and, say, problem domain experts such as researchers who are working on the software, is the phrase "given, when, then". Given that this set of conditions was created, when the software does this, then this is the outcome. So again, we've got stuff that happened first: given this set of initial conditions. We've got an action: when this happens in the software. And we've got an outcome: then this will be the result.

People using unit test frameworks, which are a way of testing small components as little pieces of a bigger software system, use the phrase "assemble, act, assert". Again, the precondition is that you have assembled this thing in this state; the act is the action that the software takes; and asserting is saying, I am telling you that the result of the software will be this.
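Illustrative sketch (not from the talk): the given/when/then structure laid out in a test written with Python's standard `unittest` module. The `step` function is a hypothetical one-step integrator invented for this example.

```python
import unittest


def step(position, velocity, acceleration, dt):
    """Advance one explicit Euler step and return (new_position, new_velocity)."""
    new_position = position + velocity * dt
    new_velocity = velocity + acceleration * dt
    return new_position, new_velocity


class StepTest(unittest.TestCase):
    def test_particle_at_rest_gains_velocity(self):
        # Given (assemble): a particle at rest at the origin, constant acceleration.
        position, velocity, acceleration = 0.0, 0.0, 2.0

        # When (act): the simulation advances by half a second.
        new_position, new_velocity = step(position, velocity, acceleration, dt=0.5)

        # Then (assert): it has picked up speed but not yet moved.
        self.assertEqual(new_velocity, 1.0)
        self.assertEqual(new_position, 0.0)
```

Saved in a file such as `test_step.py` (a hypothetical name), this can be run with `python -m unittest test_step`.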

So a software test always has a binary outcome: it is an assertion of what the correct behaviour should be, and a failure to satisfy that assertion means not having confidence in the software; it means believing that something has gone wrong. So with that idea of how to build a test in mind, what's the best way to get started?

The easiest thing to do is just take your existing software and think about this given-when-then, think about this idea of the contract, and apply it to running the software as a whole. This is called an end-to-end or E2E test in the jargon of professional software testing. If you've got a big problem, like a machine learning training problem or a massive supercomputer simulation, that's going to take a long time and a lot of resources to run, then this is not necessarily the optimal thing to do. You may end up waiting a very long time, or even costing a large amount of money, just to get the results of your tests.

And this is why we look for sample problems, toy problems, smaller datasets: something where not only do we know what the outcome is, but also the computational effort in getting to that outcome is going to be small, because the less time it takes to run through your tests, the more frequently you're going to do it. That's just the way that people work.

If it takes more than a few minutes to do something, we get distracted and go and look at something else: we go and check social media or our emails, or go and make a cup of coffee, or whatever. And so we tend to save this for: oh, it's lunchtime, I'm going to set off my tests and then go and get some lunch and then come back and see the result.

But if I'm only running my tests every lunchtime, then if they passed on Monday lunchtime, passed on Tuesday lunchtime and fail on Wednesday lunchtime, the only thing I know is that I did something on Tuesday afternoon or Wednesday morning that made this software behave in a way that I believe is incorrect. So I've now got to go back through the entire set of changes I made over that day and try to understand what it was.

If it takes me a minute to run through the tests, then I might just do it every time I've changed the software. Then if I run them and find that something's failed, I've only got to go back to the thing I was doing a minute ago, which is still fresh in my head: (a) I know when the change was, and (b) I know what I was changing, because there's a limited amount of stuff you can do in that time and I know what I was trying to achieve.

So I've got some idea of what I introduced that could make the thing go wrong. So we tend not to build massive batteries of end-to-end tests; we tend to build a small number of highly important tests that show that basic things work and that our system is basically glued together the right way. So if I think back to the work I was doing at Facebook, we would have a smoke test that was: can I launch the iPhone app, log into Facebook and then post some text as a status to my newsfeed?

And that would get run by every developer every time they made a change to the application. Most of these changes weren't going to break that behaviour, but as soon as someone did break that behaviour, you wanted to know about it, because if you had a version of the Facebook app where you couldn't post to your newsfeed, that wouldn't be useful to almost anybody using the application. So this is a very high-value test, a very small, focussed piece of the functionality that we were exploring.

And these tests typically don't need any changes to your software. If you manage to change your dataset or your problem specification so that you're running a very small problem, you're just using your existing software. No real design changes are required, and you can just run through these with a script, if they're simulation tools that you just run from the command line, or if you've got something with a user interface, you can find a tool for automatically pressing buttons on the user interface that will just run your software as is. These tests are very useful because they tell you whether your software is all plumbed together properly and is dealing with data as you would expect.

But they're also very low signal, in that if one goes wrong, what you know is that there's a problem in your software somewhere. Think about that Facebook example. Let's say that the ability to post to a feed didn't work. Is that because the little submit button in the UI is broken? Is it because the thing that sends the data to the network is broken, and how is it broken? Because it can't connect to the network?

Or is it because it isn't reading the data out of the UI? Or is it that it all got sent, and then the UI does not update to show the new results? Or is it that the server ignored this data coming in? There are so many different ways in which this test could fail that all we really know is that the software is broken. We don't have any information on what is broken, or any way to narrow down our investigation into how to fix it.
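Illustrative sketch (not from the talk): an end-to-end test that runs a whole command-line program on a toy dataset with a known answer. The three-line `PROGRAM` below is a hypothetical stand-in for "your existing software"; in practice you would point the test at your real script.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for an existing analysis script: prints the mean of the numbers in a file.
PROGRAM = """\
import sys
values = [float(line) for line in open(sys.argv[1]) if line.strip()]
print(sum(values) / len(values))
"""


def run_end_to_end(program_source, input_text):
    """Run the program as a user would, from the command line, and return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "analysis.py"
        data = Path(tmp) / "toy_input.txt"
        script.write_text(program_source)
        data.write_text(input_text)
        completed = subprocess.run(
            [sys.executable, str(script), str(data)],
            capture_output=True,
            text=True,
            check=True,  # a crash in the program fails the test
        )
        return completed.stdout.strip()


def test_mean_of_toy_dataset():
    # Small known input, known output: quick enough to run after every change.
    assert float(run_end_to_end(PROGRAM, "1.0\n2.0\n3.0\n")) == 2.0
```

As the talk notes, a failure here only says that something in the program is broken, not which part.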

So the community has this idea of the test pyramid. Going back to my example of Kerbal Space Program, we certainly could test the Space Shuttle by building a space shuttle and then seeing whether the Space Shuttle works. But that's a really expensive and time-consuming way to test the Space Shuttle.

It's built out of all of these different components, right? One of the earliest things that NASA did when they were building, not the Kerbal shuttle, but the actual real Space Shuttle, was to build the bit that's got the delta wings. Don't bother putting any engines on it. Just strap that to the back of a jumbo jet, take off, let go of the thing and then try to land it.

And that tells you whether the aerodynamics and the controls work without having to build these massive solid rocket boosters or the main engine, without having to assemble all of that and then stick it on a launch pad, fuel it up and send the thing up, without even having to build the little control engines there on the back of the main body.

So they took a component of this complete system, isolated that component, set it into some reasonable starting condition and then saw how it behaved once they initiated some action, which was landing the Space Shuttle. And we can do that kind of thing with software as well. So now we do get into the stage where we're having to think about the design of our software. Which components are actually distinct functionality or distinct behaviour in this software, responsible for some subset of the overall system?

How are those related to each other? If we were to take out all of the rest of the software, what would we have to supply for this thing to have enough information to be able to work? What would it expect to be able to do? Does it want to read from a file or write to a file?

Does it expect a database to be present? Does it expect some variables in the program that are outside of its control to be set? So we are now making changes to our software, but these changes are themselves potentially useful, because what we're doing is taking each of these components and reusing it outside the domain of our immediate science problem and in the domain of a test. This means the changes we make are changes to the reusability of this module.
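Illustrative sketch (not from the talk): one common way to isolate a component that "wants to read from a file" is to have it accept any text stream, so that a test can supply the input directly. The function `load_masses` and its one-mass-per-line format are hypothetical.

```python
import io


def load_masses(stream):
    """Parse one mass per line from any text stream, skipping blanks and # comments."""
    masses = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        masses.append(float(line))
    return masses


def test_load_masses_ignores_comments_and_blank_lines():
    # The test supplies an in-memory "file", so no real file or directory is needed.
    fake_file = io.StringIO("# masses in kg\n1.5\n\n2.5\n")
    assert load_masses(fake_file) == [1.5, 2.5]
```

In the real program the same function can be called with an open file, for example `with open(path) as f: load_masses(f)`, which is the reusability gain described above.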

We can now take this software, this component, and apply it to different contexts, because we now know what we need to do and how to set this thing up so that we can use it elsewhere. And what we get from doing this is much, much more precise feedback when a test fails. We know there is a failure in this component. If I ran the landing-the-Space-Shuttle test and my shuttle didn't land, I wouldn't need to check the solid rocket boosters, because I didn't use them.

I only used the aerodynamic part, and you can imagine going even smaller. So the cockpit windscreen on the front of the shuttle: you could test how impact-resistant that is just by exposing it to a large force, like the equivalent of hitting it with a hammer. If it breaks, then you know that the problem is with the windscreen, not with the rest of the shuttle, and certainly not with all of the engines and other components.

So we're getting much, much more localised and immediately actionable feedback from our test results. But what we're not doing is answering the question: does this software actually solve a problem I have? Yes, I need this unit, which is just a way of describing a class or a function, some very small part of software behaviour, to work in particular ways.

But it's only going to be providing a valuable contribution to solving my overall problem if that working in particular ways is then used by the rest of the units in a way that enables my problem to be solved.

So it's very common in commercial software, for example, to find projects that have a large number of unit tests at a very high level of coverage (by which we mean the fraction of the statements or the different logic flows through the program that are tested), so that the tests at the unit level, the very small, separate component tests, are very well specified and covered, because it's easy for a programmer to think, what do I need this function to do?

But there are then gaps as you get further up the pyramid into the integration and end-to-end levels, such that you actually don't know whether the program works. You know that every function in it does what the programmer thought it needed to do, but you don't know: does this actually solve a problem that anybody has?

So the motivation for having this pyramid graphic is to say: here is a good idea for how you should spend your testing effort. Lots at the small level, which gives you high-fidelity, or sorry, high-precision, actionable feedback. And then some at the top level that say that you actually are able to achieve your goals using the software.

And then some bits in the middle that provide the impedance match between "these separate functions work" and "this whole software works": these bits, when assembled together, are also correct. So as I said, this was really an introduction to the concepts of testing. Here are some specific tools you can look at that are relevant to particular programming languages. I've tried to cover most of the things I've seen in the world of scientific computing.

There may be others. I apologise wholeheartedly to any Fortran programmers who feel left out at the moment, but I don't have experience with testing Fortran, so I don't have any recommendations for tools to examine there.

So quick summary, the reason we want testing in a scientific context is partly to improve the confidence that we have in our software and partly to improve the reproducibility of results that we get with the software because we know how the software acts and how it responds to particular inputs.

Even if we don't know what our scientific outcomes are going to be, we should at least understand what conceptual model we're trying to express in our software, and can say that we have correctly expressed this model, even if we can't say a priori what the scientific outcomes are going to be. And the way that we design tests is using the idea of a contract, that given-when-then idea that if I set things up in a particular way and then use my software, I will get this outcome.

And that expression is an assertion. If it is satisfied, then the software is correct for that case. If it is not satisfied, then the software is incorrect: it fails to meet our expectations. The easiest way to get started is just to run your entire program with a known input.

That way, you know what outcome to expect. And of course, the RSE group can help: we run these things called software surgeries, which are a sort of half-hour discussion with one or two research software engineers about your software projects. So if you want some help getting started with testing, or finding out how to use software tests in a particular way, drop us a line. There's our email address; you can find that out.

So that's me done. And I guess now it's time for some questions. All right. Thank you very much. It was wonderful. Thank you. OK. So I'm going to stop the recording and then people can freely ask questions.

Transcript source: Provided by creator in RSS feed: download file