Introduction to Advanced Research Computing at Oxford

00:04

So it gives me great pleasure to introduce Dai Jenkins and Andy Gittings. Dai is the head of research computing and support services for Oxford's Advanced Research Computing service, and Andy is the scientific research software advisor for ARC. Today they're going to tell us about the amazing supercomputing facilities we have at our fingertips here in Oxford and beyond. I hope that from this we get an overview of what's possible and how to use it. So over to you. OK, thanks.

00:46

And thanks for having us along today to talk about ARC. What we're trying to strike a balance on in this presentation is, well, hopefully all of it is going to be useful in some kind of way. It's not meant to be a fully in-depth training course; what it is meant to do is give you an overview of the facilities that we have available. Why would you want to use high performance or advanced computing in general, not just through ARC?

01:15

And also, how can you access our services? We'll be looking at support, what's available, and so on and so forth. So, getting into the slides, just a quick overview: key facts about ARC. ARC is the University of Oxford Advanced Research Computing team. We provide the central high performance computing resource in Oxford University, and we are hosted within IT Services.

01:47

We are the central university resource, but there are significant other high performance computing resources based around the university. We are, however, the only one available to all four divisions that is free at the point of access to researchers. That means that, as a member of the university, you can request a user account on ARC through a project.

02:14

You can then start doing work on the ARC resource without having to find any income, any funding, to provide to IT Services in order to gain access to it. In terms of numbers, the team is quite small: it's only four staff and myself. Two of those are the systems administrators for the service, and they keep the systems fed and watered, up and running.

02:40

And then there's Andy and one other colleague, who provide the applications and user support for the service. Andy will be giving more of an overview of what's available there in his half of this talk. In terms of the hardware we have available to us at the moment, we have two principal clusters.

03:01

One of those is for high throughput applications. The other is what was billed as being the capability cluster, which is used for what were deemed to be true high performance applications: those that are closely coupled in terms of inter-processor communications, memory available, and so on and so forth.

03:23

I'll go into that in a bit more detail. Those are connected through to a high performance, fast filesystem that we have connected to the clusters. In terms of the user base that we have, as I said earlier on, we are free to all four divisions within the university.

03:42

But not all divisions use us as much as others. MPLS uses around about 90 percent of the compute cycles and core hours that are provided by the service. That's then followed by social sciences division and medical sciences division, and then humanities, who use us very rarely. So that's how usage falls across the divisions. Now for user numbers.

04:11

We have 2,500 registered users across the university, and that's increasing by around about 600 users per year. Out of all of those, 400 are active and submitting jobs to the system at any one given moment in time. The service is quite busy: we're running around about 50,000 jobs per month, and increasing, which means that our clusters and the resource that's available with them are around about 80 percent utilised at all times.

04:39

And that's around about the maximum level that we like to run them at, so that there's sufficient headroom to turn around jobs within the scheduler and have sufficient throughput on the system. So, just answering the question of why 80 percent and not 100 percent: it's so that things don't become almost gridlocked within the system.
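
One way to see why a busy shared system is run with headroom is a textbook queueing model. The sketch below is an illustration only: it assumes the cluster behaves like a single M/M/1 queue, which a real scheduler does not, and the function name is made up for this example. In that model the mean number of jobs in the system is utilisation / (1 - utilisation), which grows without bound as utilisation approaches 100 percent.

```python
def mean_jobs_in_system(utilisation: float) -> float:
    """Mean number of jobs queued or running in an M/M/1 queue.

    This is the standard M/M/1 result L = rho / (1 - rho). It is a toy model,
    assumed here for illustration, not a description of how ARC is scheduled.
    """
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in the range [0, 1)")
    return utilisation / (1.0 - utilisation)


if __name__ == "__main__":
    # Watch how sharply the backlog grows beyond roughly 80 percent utilisation.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        backlog = mean_jobs_in_system(rho)
        print(f"utilisation {rho:.0%}: about {backlog:.1f} jobs in the system")
```

The exact numbers are not meaningful for a real cluster; the shape of the curve is the point.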

05:04

So what services do we offer? Well, first of all, there's access to the cluster resources and also the research applications. We have a range of cluster resources: x86 nodes, GPU nodes and high memory nodes in the system. Andy will talk more about this, but we have over 400 centrally installed applications on the system that we take care of, curate and make available for you to use in order to do your research. That's as well as providing the hardware and support for it.

05:35

There's also the user support and training that goes along with it. We run face-to-face training, although in COVID times we do most of this online. If you do have any ad hoc queries, then a ticket can be submitted through to support@arc.ox.ac.uk, and we will deal with anything that comes across on that by email. And if things do need to be followed up on that,

06:02

we can then do a Teams call in order to help you out. There's also what we call premium services that we offer on the clusters, and these are in addition to the free resources I talked about earlier on. Examples of these are node reservations, where we can effectively book out a set of nodes for you to do a specific piece of work on. You're given a reservation on those, and they will be yours to use for a specific period of time.

06:32

We do test the case for those quite rigorously, so that's not something we just give as a matter of course, because it takes that reserved resource away from the wider user population. There's also priority time, which is a paid-for service that we have, and that allows your jobs to be pushed forward in the queue ahead of other people who just have the standard service.

07:03

The reasons why you would want to do that are if you are coming very close to a paper submission deadline, or it's coming close to your thesis being submitted, or anything like that. Then you can purchase priority time, and it will allow you to go ahead in the queue and get that work done ahead of others. The final premium service, which we use a lot, is co-investment, and that's where a researcher will buy nodes.

07:33

Those nodes, in effect, will be given over to the ARC team. We will install them in the system and we will feed, water and operate them for the researchers. The benefit that comes back to ARC is that when the researcher who bought those nodes isn't using them, we can do what's called backfilling of jobs onto those nodes, which benefits the wider user community as well as the researcher who bought the nodes in the first place.

08:03

And a large number of our GPU nodes have been bought under that co-investment model. What we can also do for external users of the system is provide access for them, and, as it says here, it's open access for collaborators and commercial partners. For academic collaborators, we can arrange free access to ARC to work with you and your teams on request. For commercial partners, we charge them for access to the system.

08:42

And finally on this, access to ARC is incredibly easy, and it's free. In order to access ARC, the first thing you need to do is to set up a project, and that can be done by a principal investigator or group lead. That's done by our web form, and the link for that is shown on the slide. Once that project has been created, you can then make an individual user account request, and then we can set you up with a user account,

09:08

which is then linked to that project account. And then you can start using the ARC service. So it's a fairly easy, simple process to go through, and from start to finish it can probably be done in about two or three days, more or less, in normal circumstances. So that's a very high level view of what we provide. But what is high performance computing, and why use high performance computing?

09:38

So first of all, what is high performance computing? There's no single, all-encompassing definition of what high performance computing is or isn't. But for the purposes of framing this topic, it's really something that can't be performed on a desktop or workstation easily, or that just takes so much time to do on that kind of resource.

10:02

It becomes impractical, if not impossible, to do it in a useful manner. Or it's something that may be amenable to being carried out on multiple processors in parallel. That can either be multiple instances of the same kind of job, but with a slight tweak to the problem.

10:18

Or we can be using multiple processors in parallel to sort out a truly complex problem: a single job that requires access to multiple processors exchanging information with each other via MPI to come to a solution on that problem. But overall, I think it's agreed that HPC should allow a person to do a unit of work in less time, or more work in the same time, or to do something that is otherwise impossible.

10:47

And the reason why these images have been chosen, on the left and right hand side of my slide, is to exemplify that this isn't really something that is new or that has only been done in the last five or 10 years. The picture at the bottom of the slide is one of the Bombes that was used to find out the rotor positions of the Enigma machine.

11:12

And that would be a classic example of a high throughput computing type job, where you're working on lots of jobs that are not all completely interconnected, but you have the time pressure that you have to figure out what those rotor positions are within a 24 hour period, so that the information can actually be of any use. Because as soon as you step outside of that period,

11:38

then of course all the work you've done previously is absolutely null and void. And then there's the work of Seymour Cray, that's the nattily dressed guy standing next to the big round thing at the top, which is a Cray machine. That machine was primarily meant to do calculations where, again, the time-to-completion requirement wasn't so great.

11:59

But where the computation was so complicated that it would take a huge amount of time for somebody to do it on their own with a slide rule, or working in teams with slide rules. So high performance computing and high throughput computing are not necessarily very new concepts at all. But one thing that all high performance computing usually has in common is that it's normally a large system somewhere in a data centre.

12:29

And our data centre is at Begbroke Park, which is based just outside of Oxford. So what can research computing be used for? Generally, there are four types of research computing. First, there's compute intensive, surprise surprise. That's doing things requiring a large amount of compute with high performance inter-processor communication.

12:58

This is what we would term traditional, high performance, heroic supercomputing, where you're doing modelling and simulation problems with things like fluid dynamics, climate modelling, molecular simulations, et cetera. Where we see most of this kind of work coming onto our systems is through researchers in MPLS and medical sciences division. Then we can move through to data intensive high performance computing, and this does as it says on the tin.

13:29

It's applications requiring or operating on large amounts of data, with fast, efficient ingress and egress of data. So in this case, it's not so much performance within the computational part of the node. It's performance there, but also in the data storage hierarchy, in order to be able to haul data into the compute and back out again.

13:54

Applications that we see in this area are basically around bioinformatics, genomics and machine learning, which are using, manipulating and operating on increasingly large amounts of data in order to come to a solution. Then comes high throughput computing. We said that high performance computing is stuff that we can't really do on the desktop.

14:25

This kind of work, though, is stuff that you can do on a desktop, except that what you use high performance computing for is to harness the sheer amount of resource that's kept in a single place, to operate almost as a thousand, two thousand or four thousand laptops working on things like parameter sweep experiments. You could do those individually and in serial on a laptop, but it would take you so much time that it would be impractical to do so.

14:50

So what you can do is run those types of experiments on ARC many, many times in parallel, and that decreases your overall time to completion immeasurably. So that's high throughput computing. It's used in a number of different application areas, and it's prevalent across all of the divisions within the university. And then we finally get to memory intensive computing, which in some cases is a bit of a tweak on high throughput computing.
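
The parameter-sweep pattern described above is often run as a scheduler job array, where every task runs the same program with a different index. The following is a minimal sketch, assuming the scheduler is Slurm with job arrays enabled; the parameter grid and function name are hypothetical. Slurm sets the environment variable `SLURM_ARRAY_TASK_ID` for each task of a job submitted with `sbatch --array=...`.

```python
import itertools
import os

# Hypothetical parameter grid, for illustration only.
TEMPERATURES = [280, 300, 320]
PRESSURES = [1.0, 2.0]


def parameters_for_task(task_id: int) -> tuple:
    """Map an array task index onto one (temperature, pressure) combination."""
    grid = list(itertools.product(TEMPERATURES, PRESSURES))
    if not 0 <= task_id < len(grid):
        raise IndexError(f"task_id {task_id} is outside 0..{len(grid) - 1}")
    return grid[task_id]


if __name__ == "__main__":
    # Falls back to task 0 when run outside the scheduler, e.g. on a laptop.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    temperature, pressure = parameters_for_task(task_id)
    print(f"task {task_id}: temperature={temperature}, pressure={pressure}")
```

With six combinations in the grid, a submission along the lines of `sbatch --array=0-5 sweep.sh` would run one task per combination, where `sweep.sh` is a hypothetical batch script that calls this program.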

15:21

Applications are many and varied, and the requirement in these cases mostly is for all of your data to be in memory in order to be operated on. So either the input is incredibly large and has to sit in memory, or the outputs from what you are doing require a huge amount of memory to be pushed into.

15:42

And there are some examples on the slide. But the overall take-home message from this is that ARC provides compute resource that goes across all four of these areas, and provides a general high performance computing service to the university. So here is one of the extreme examples that we sometimes give to illustrate what's required. A desktop PC can deliver of the order of tens of gigaflops.

16:17

That is useful for most day-to-day applications, but in some cases that's just not going to be able to cut it. An extreme example is things like short range weather forecasting, and that's short range in terms of time rather than in terms of distance, which brings different complications. So these are short-term predictions for the next day, which you want to run today in order for them to be available for tomorrow.

16:43

For those things, the Met Office requires a compute system of circa one petaflops in order to be able to turn around those calculations on a useful timescale, so that they are available for the next working day. So that's an example of where you require extreme compute in order to do something which we all take for granted nowadays.
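
To put the desktop-versus-petaflops comparison in numbers, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the desktop is taken as 50 gigaflops (a value within "tens of gigaflops"), the forecast is taken to need one hour on a one petaflops machine, and the work is assumed to scale perfectly, which real codes do not.

```python
DESKTOP_FLOPS = 50e9             # assumed: a desktop delivering 50 gigaflops
CLUSTER_FLOPS = 1e15             # circa one petaflops, as quoted in the talk
FORECAST_HOURS_ON_CLUSTER = 1.0  # hypothetical run time, not a Met Office figure


def hours_on_desktop(hours_on_cluster: float,
                     cluster_flops: float = CLUSTER_FLOPS,
                     desktop_flops: float = DESKTOP_FLOPS) -> float:
    """Scale a run time by the ratio of sustained floating point rates."""
    return hours_on_cluster * cluster_flops / desktop_flops


if __name__ == "__main__":
    hours = hours_on_desktop(FORECAST_HOURS_ON_CLUSTER)
    print(f"{hours:,.0f} hours, or roughly {hours / 24 / 365:.1f} years, on the desktop")
```

Under these assumptions a next-day forecast would take years on a desktop, which is why it is only practical on a large system.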

17:12

Just to give an illustration of the size of these computers: the fastest supercomputer in November 2021, as shown on the Top500 list, is around about 442 petaflops, which is massive and staggering. Consider that when I first started in high performance computing, back in around about 2007, we had the national service, HECToR, which I think was 60 teraflops in size.

17:47

So scrolling on around about 13, 14, 15 years, we have this size of computer. HECToR in 2006 or 2007 was in the top 10. This system is the fastest on the list, and on rough calculations it is probably around 7,000 times faster than the HECToR system was back then. So that's how much things have come on in leaps and bounds, because the applications are driving us in this direction. It gives you food for thought. Right.
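
The "around 7,000 times faster" figure can be checked directly from the two numbers quoted: about 442 petaflops in 2021 against about 60 teraflops in 2007. The sketch below also derives the implied average growth per year, which is a derived illustration rather than something stated in the talk.

```python
FASTEST_2021_FLOPS = 442e15   # about 442 petaflops, as quoted
HECTOR_2007_FLOPS = 60e12     # about 60 teraflops, as quoted
YEARS = 2021 - 2007


def speedup(new_flops: float, old_flops: float) -> float:
    """Ratio of the two peak floating point rates."""
    return new_flops / old_flops


def annual_growth(total_speedup: float, years: int) -> float:
    """Average year-on-year growth factor that compounds to the total speedup."""
    return total_speedup ** (1.0 / years)


if __name__ == "__main__":
    ratio = speedup(FASTEST_2021_FLOPS, HECTOR_2007_FLOPS)
    print(f"speedup: about {ratio:,.0f} times")
    print(f"average growth: about {annual_growth(ratio, YEARS):.2f} times per year")
```

The ratio comes out a little above 7,000, consistent with the rough calculation mentioned.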

18:22

So, in terms of the hardware. But before I go on, does anybody have any immediate questions? OK, so I'll just take a quick canter through the actual hardware resources that we have available, before turning over to Andy, who will talk more about the softer elements of the service. We operate two clusters as part of ARC. One is the high throughput cluster, HTC, which I'm going to discuss in more detail, but in some ways it's just a bunch of nodes.

19:00

And then we have the ARC cluster, which is our capability cluster, and I'm going to explain more about what these are in a minute. The way that most users interact with the system and submit jobs is through the Slurm scheduler. You submit a job to Slurm, and it then assesses it. Then, when the resource becomes available, and other factors being equal, it will launch the job onto the system and it will run. All of our clusters are based out at the Begbroke Park data centre.

19:34

That's just outside Oxford. And these are then connected back into the university's core Odin network. So, taking each cluster in turn, first the high throughput computing cluster. This cluster is purposely set up to prefer small jobs, which in this case are less than one node in size.

19:57

And this is to make sure that only small jobs, high throughput jobs, actually get through on the system, and that the system is not going to get swamped by jobs that are larger than that, which should be targeted at the ARC system. As well as providing a high throughput system, it also provides a hosting infrastructure: a mix of CPU and GPU nodes that were bought under co-investment sit within this system.
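
Submitting work through the scheduler usually means writing a short batch script. The sketch below renders one for a small, single-node job of the kind the high throughput cluster prefers. It assumes Slurm; the `#SBATCH` options used (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`, `--partition`) are standard `sbatch` options, while the job name, command and any partition name are hypothetical placeholders rather than ARC's actual settings.

```python
from typing import Optional


def render_batch_script(job_name: str,
                        command: str,
                        cores: int = 4,
                        walltime: str = "01:00:00",
                        partition: Optional[str] = None) -> str:
    """Build the text of a single-node Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",                     # stay within one node
        f"#SBATCH --ntasks-per-node={cores}",
        f"#SBATCH --time={walltime}",            # HH:MM:SS wall-clock limit
    ]
    if partition is not None:
        # Partition names are site specific; check the local documentation.
        lines.append(f"#SBATCH --partition={partition}")
    lines += ["", command]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_batch_script("demo", "python my_analysis.py", cores=8))
```

The rendered text would be saved to a file and handed to the scheduler with `sbatch`, for example `sbatch demo.sh`.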

20:24

We have two high memory nodes available to users, each of which has three terabytes of memory. We have a whole slew of GPU nodes in that system as well, which you can get full details on at the link. They range from very high performance servers, like the one shown on the slide, which is an NVIDIA DGX MAX-Q, which has eight NVIDIA V100 GPUs all connected via a very high bandwidth NVLink interconnect, and which has two Intel processors within it.

21:00

So this is basically a server that's completely dedicated to high-end machine learning and, in some cases, simulation research that you can do on that system, and the GPUs in it can also do double precision workloads. We have a number of those in the system.

21:18

But what we also have is a large number of more prosumer or consumer type cards, almost like gaming cards, within the system, with varying amounts of graphics memory that go along with those. And those are more geared towards out-and-out machine learning applications.

21:39

On top of that, we also have another 40 standard CPU nodes within that system for the high throughput workloads. These servers are exactly the same specification as those in the ARC capability machine, but they do not have the InfiniBand interconnect linking each of the nodes together. With regard to the co-investment nodes that are put into the system, as I said earlier on, the researcher buys them and we feed and water them.

22:11

But when the researcher is not using them, they're available to the wider university. What that means is that jobs can backfill onto those nodes, but the maximum length of the jobs that backfill onto those nodes is 12 hours. Full details of those co-investment nodes and their specifications are also available via the link on this slide.
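
The 12 hour backfill rule above can be expressed as a simple check on a job's requested wall time. This is a sketch of the rule only, not of the scheduler's real policy engine: the function names are made up, and the parser handles just two of Slurm's time formats, `HH:MM:SS` and `D-HH:MM:SS`.

```python
from datetime import timedelta

BACKFILL_LIMIT = timedelta(hours=12)  # maximum job length quoted in the talk


def parse_walltime(text: str) -> timedelta:
    """Parse 'HH:MM:SS' or 'D-HH:MM:SS' into a timedelta."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def can_backfill(walltime: str) -> bool:
    """True if a job of this length fits within the backfill limit."""
    return parse_walltime(walltime) <= BACKFILL_LIMIT


if __name__ == "__main__":
    for requested in ("02:30:00", "12:00:00", "1-00:00:00"):
        print(requested, "eligible" if can_backfill(requested) else "too long")
```

In practice this means that asking for a wall time of 12 hours or less gives a job more places where it could run.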

22:36

Moving on to the ARC cluster, which is the capability cluster. This is dedicated, surprise surprise, to multi-node jobs that are very much larger than one node in size. So when they go into the scheduler, depending on the size of the job, there is going to be a prioritisation as to when it goes on the system. In the ARC cluster, we have 258 compute nodes.

23:06

Those compute nodes are all arranged into seven islands, as we call them, within the system. Within each island there are around 40 nodes, so 40 to 44 nodes, and they're all connected by an InfiniBand HDR100 interconnect. Within each island there is one-to-one, non-blocking communication, and that provides the highest performance.

23:37

These are the best communications that we can provide between nodes within the system, and the performance of that interconnect would be comparable to what you would see on the national services, so things like ARCHER. You can run jobs between these islands, but there is a slightly higher communications overhead on that.

23:58

That's where it goes from being non-blocking to having a three-to-one contention ratio in the network connectivity between islands. The higher those contention ratios go, the lower your performance is going to be from that particular interconnect. The operating system on this is CentOS.

24:21

CentOS 8, that is. But we also run some of the nodes within that system in a legacy configuration, which looks a bit like Arcus-B: that runs CentOS 7.7, and it's specifically set up for legacy applications that cannot run on CentOS 8. As I said before, the scheduler that we have on the system is Slurm. This system is, in the grand scheme of things, a modest 12,300 cores.

24:52

But in the Oxford sense, this system provides a significant improvement over what we had previously with Arcus-B and the old HTC system, which collectively had, I think, probably about 5,800 cores between them. So we have more than doubled the capacity on this system over our previous arrangements. Full details on the system are again given on the ARC website, if you want to look at things in more detail.
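
Two of the figures from this part of the talk lend themselves to quick arithmetic: the three-to-one contention between islands and the growth in core count. The sketch below assumes a nominal 100 Gb/s InfiniBand link per node, and reads a three-to-one ratio as meaning that, in the worst case where every node talks across island boundaries at once, each gets about a third of its link rate. Real performance depends on traffic patterns and is not modelled here.

```python
LINK_GBPS = 100.0          # assumed nominal per-node link rate
INTER_ISLAND_RATIO = 3.0   # three-to-one contention between islands
NEW_CORES = 12_300         # current system, as quoted
OLD_CORES = 5_800          # previous systems combined, as quoted


def worst_case_bandwidth(link_gbps: float, contention_ratio: float) -> float:
    """Per-node bandwidth when every node uses the contended links at once."""
    return link_gbps / contention_ratio


if __name__ == "__main__":
    within = worst_case_bandwidth(LINK_GBPS, 1.0)   # non-blocking inside an island
    between = worst_case_bandwidth(LINK_GBPS, INTER_ISLAND_RATIO)
    print(f"within an island: up to {within:.0f} Gb/s per node")
    print(f"between islands, worst case: about {between:.0f} Gb/s per node")
    print(f"core count growth: about {NEW_CORES / OLD_CORES:.1f} times")
```

This is why tightly coupled jobs are best kept within one island where they fit, and why the capacity is described as more than doubled.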

25:22

So that's the compute side of things. Then we have the storage that goes behind all of this. We operate a very high performance file system called Universe, and we have two petabytes of that available to us. I don't think there's really much more to say on that. There aren't any restrictions on use of it apart from quotas, and that's pretty much all there is to say on that one.

25:48

I won't go into full details on the performance of that, because time is getting on. So those last three slides are what we have as a team based in the Begbroke Park data centre. But what we also have access to is the JADE service, and JADE stands for the Joint Academic Data Science Endeavour. This is a grant that was funded by EPSRC, led by Oxford, for a four or five million pound system that is based entirely on the NVIDIA DGX MAX-Q product.

26:33

So essentially it's owned by the University of Oxford. We procured it on behalf of researchers across the UK, on behalf of EPSRC: EPSRC provided the money and we bought it. The system is hosted at the Hartree Centre in the north west of England. The system itself is based on 63 DGX MAX-Q boxes. It's very similar to the previous JADE 1 service, except that there's just a larger amount of it.

27:10

The only other real difference between JADE 1 and JADE 2, apart from the increased capacity of the system, is the fact that it's connected through to a much higher performance file system on that particular iteration of the cluster.

27:29

What this system is mainly geared to providing is a higher level of resource, and specifically the ability to do multi-GPU work in multiples of DGX boxes, which most universities don't have the resources to be able to do. We are quite fortunate at Oxford in that we have around about five of these systems.

27:52

DGX MAX-Q systems ourselves, that is. But there are a number of universities that do not have access to this kind of kit. That is the facility that JADE is meant to be, and that JADE 1 was before it. So that's all I have to say on those elements. I'm happy to take any questions, but if there aren't any, then I can hand over to Andy. OK. Sorry, can I just ask, Dai: you mentioned there are different types of nodes on ARC, including GPUs.

28:35

Mm hmm. Maybe Andy, you're going to say a little bit about this, but if one wanted to run an ARC job on a particular class of node, how would one go about doing that? You can specify it in Slurm. We'll cover it a little bit more in the talk, but it won't be in too much detail. But yeah. OK, thanks. Yeah. OK, I will carry on then. Thanks. Thanks, Joe. So, yeah, I'm Andy Gittings, and I lead the software applications group within the ARC team.
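
On the question of targeting a particular class of node: the answer given is that it is specified through the scheduler. As a hedged illustration, the sketch below assembles an `sbatch` command line using standard Slurm options for this purpose: `--partition`, `--gres=gpu:N`, `--constraint` and `--mem`. The helper function, the script name and any partition or feature names are hypothetical; the values that actually work on a given cluster come from its own documentation.

```python
import shlex
from typing import List, Optional


def sbatch_args(script: str,
                gpus: int = 0,
                constraint: Optional[str] = None,
                partition: Optional[str] = None,
                mem: Optional[str] = None) -> List[str]:
    """Build an sbatch command line that requests a particular kind of node."""
    args = ["sbatch"]
    if partition:
        args.append(f"--partition={partition}")
    if gpus > 0:
        args.append(f"--gres=gpu:{gpus}")         # generic resource request
    if constraint:
        args.append(f"--constraint={constraint}")  # node feature, site specific
    if mem:
        args.append(f"--mem={mem}")               # memory per node, e.g. 500G
    args.append(script)
    return args


if __name__ == "__main__":
    # One GPU, no other restrictions.
    print(shlex.join(sbatch_args("train.sh", gpus=1)))
    # A large-memory request; the value is illustrative only.
    print(shlex.join(sbatch_args("assemble.sh", mem="500G")))
```

The same options can equally be written as `#SBATCH` lines at the top of the batch script instead of on the command line.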

29:07

I'm going to cover some of the usual application support type things that ARC do. Specifically, we'll look at things like the training courses, user documentation, and what we do with software applications. And obviously, as has already been mentioned, one of the things that we do quite a lot of is providing general user assistance via our email address, which is support@arc.ox.ac.uk.

29:39

We do often repeat that, so that people can send us any kind of support request, and that goes into our general ticketing system. Usually we're very responsive on that. One member of the ARC team is usually dedicated to answering tickets each day of the week, but there are only four of us, so we have to double up occasionally. And I've got...

30:07

Let me go on. So, training and engagement. The kinds of things we do regularly are the training courses, the main one being the Introduction to ARC, which is for new users. If anybody is particularly interested in actually getting an ARC account, feel free to go onto the training course website, which is shown there, arc.ox.ac.uk/training, and book a place. It's presented twice a month and gives a really good overview of how to use the system and what capabilities there are.

30:48

There's also another course that we do called Effective Use of Clusters for Non-Programmers, and this is for users who have maybe been on the Introduction to ARC course, have a couple of apps they use, and want to get a little bit more out of the system.

31:05

The reason we say non-programmers is because it could be something where a user has been given some code from GitHub, or has just found something, and they want to build it and make it work on ARC, and this course will give some pointers. It also works for people who are just running commercial applications or any other applications that are pre-installed on the system. We try to run that around once per term, and I think we're planning to have one later this term.

31:35

The other thing that's quite important is that all of the ARC systems run Linux. We stated that they run CentOS 8, and that can come as a little bit of a shock to some users who are more used to the Windows environment. So we do have a link on this training page to quite a nice web-based training course. It's not ours, but it's a nice one for getting a good grounding in Linux.

32:10

Something else we do is drop-in sessions, and we tend to schedule these around the day after an Introduction to ARC course. This is just some time where a couple of the team members will be available on Teams for people to drop in and talk to us about any issues that they may have on the system. It tends to be primarily for questions that don't really sit very well within a support@arc type email request.

32:41

Maybe they need some help going through building some software and they've got a bit stuck, and it's a bit more interactive. So that's something where we have set times when people can join us. Or of course, if you put a support request in, we can just do those kinds of things ad hoc, as I said previously. And the other kinds of engagement that we do?

33:05

Just lost that page. Those are attending student welcome events and running presentations like this one on request. So we have the user documentation. Currently it's all hosted on www.arc.ox.ac.uk, and that's quite a nice site, but we're gradually starting to break that out and review what we have there in terms of the ARC user guide and software guide, and migrate them to a Read the Docs type page,

33:41

which is hosted on GitHub. That'll make it a little bit easier for users to take information, export it as a PDF and turn it into a hard copy, as some people like to read those types of documents in that way. The documentation, as I think was mentioned, has all the links for getting a project and user accounts, and it also covers all the sections on priority credits and the service level agreements for things like quality of service for different types of users.

34:24

So we have quite a varied audience for ARC, just from the jobs that run on the ARC systems, as you can probably see. We're not experts in any one of these areas, but there's quite a wide range of both applications on the system and experience of users. And there are lots and lots of common issues that users face in these types of HPC environments, on ARC for example. So there are lots of challenges there, but they tend to be the same types of ones.

34:57

They are mainly the ones users have moving from their workstation to a system where they're not the only person running on it. They have to queue things up. They have to write batch scripts. So these are the kinds of things that we can help with. On the software we've got available, here is quite a nice list of what we have.

35:24

You might recognise some of the applications, for example R, Anaconda, those types of things. With R, for example, we have about 1,100 libraries available in our R installation, and we make it quite easy for you to install your own locally as well. So if you're particularly using R, that's useful.

35:52

Anaconda is a nice way of packaging Python code, and you can use virtual environments or conda environments to package that up, and you can run a virtual environment in your own areas. So with those types of things, we don't have to worry about installing software for you. We give you advice, and there's a nice walkthrough example of how to create conda environments on ARC. GCC and Intel.

36:28

They're on here because they're two of our main compiler toolchains, for when you've got code you want to compile for yourself: GCC and Intel, either of those two. And those will come with things like the MKL maths libraries for Intel, or the OpenBLAS or LAPACK types of maths libraries for GCC. So this is just a nice slide showing the types of applications with their domains.

37:07

So, you know, I know that many people probably won't be looking at some of the chemistry or genomics stuff, maybe. But, you know, there's quite a large number of applications on the system. As I say, there's over 500 now. I think we had 400 earlier on, but it's now over 500 application modules available that are specifically for applications. There's over a thousand in total with all the supporting dependencies that these applications use.

37:40

So when it comes to software support, we tend to be involved in looking at the applications that somebody wants to use, working out whether we can actually install it, and updating it on request. We can also provide software build assistance to users, and that tends to be a little bit like I said earlier, where people have found software on the internet, they've found a GitHub repository. They're not very confident in building it themselves. They would just say to us, "Could you install this for us?"

38:18

And if we think it's something that a few users might find useful, then we'll install it centrally. Otherwise we can help them install it in their own areas. On average, we get about four or five application requests a week. Some of them are quite easy, and we can just deal with them in a matter of hours. You know, it's quite easy to put on.

38:42

Some of them have been somewhat more problematic, and it's taken weeks to get a piece of software working as we'd want it to work on the system. One of the differences from, maybe, the system that you'd work on at your own workstation is that we use a feature called Environment Modules to manage the software applications on the system.

39:10

So you do a module load and then the name of the application that you want, and that sets the entire environment up so that you can actually run that piece of software. I'll give you a quick example, I think, here. So in this example, someone's trying to run R, and it's just not there. But if you then load the R module and then run R again, you can actually see that it's all on the system. You can run something and it works fine. And then you can unload the module.

39:43

It just completely disappears. So that's quite a nice way of managing your environment, which, if you had to do it manually, would be quite cumbersome, especially with the number of modules that we have and the fact that they can actually interfere with each other, because they've all been built with different libraries, different compilers and other dependencies. The fact that the module system allows us to make that consistent is very nice.
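As a rough illustration of what loading and unloading a module does, here is a toy Python model that only adjusts PATH. It is a sketch, not how Environment Modules is implemented, and the install prefix and R version shown are hypothetical rather than real ARC paths.

```python
import os

# Toy model of what "module load" / "module unload" do to the environment.
# The install prefix below is hypothetical, not a real ARC path.
HYPOTHETICAL_PREFIXES = {"R": "/apps/R/4.1.2/bin"}


def module_load(env, name):
    """Return a copy of env with the module's bin directory prepended to PATH."""
    new_env = dict(env)
    parts = [p for p in new_env.get("PATH", "").split(os.pathsep) if p]
    prefix = HYPOTHETICAL_PREFIXES[name]
    if prefix not in parts:
        parts.insert(0, prefix)
    new_env["PATH"] = os.pathsep.join(parts)
    return new_env


def module_unload(env, name):
    """Return a copy of env with the module's bin directory removed from PATH."""
    new_env = dict(env)
    prefix = HYPOTHETICAL_PREFIXES[name]
    parts = [p for p in new_env.get("PATH", "").split(os.pathsep)
             if p and p != prefix]
    new_env["PATH"] = os.pathsep.join(parts)
    return new_env


base = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
loaded = module_load(base, "R")        # R's directory is now first on PATH
unloaded = module_unload(loaded, "R")  # back to the starting environment
```

A real module file typically sets more than PATH (library paths, licence variables and so on), but the load and unload symmetry is the same idea.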

40:12

When it comes to actual software installation, we make our own lives a little bit easier by using a framework called EasyBuild, and that allows us to use known recipes for building, for the most part, open source software. It also allows us to install some commercial codes as well, but it's a little bit different for that. It has a predefined set of known compiler and library toolchains. These are updated twice a year, an A and a B version each year.

40:49

So that has the latest and greatest versions of GCC, Intel, PGI, or rather, well, NVIDIA HPC SDK type toolchains, and it makes those available. And then there are other recipes built on those in order to allow an application to be installed. And this really helps with reproducibility, because a lot of other academic institutions, in Europe particularly, use EasyBuild.

41:20

And so it gives you a known environment, so that, you know, if you're going to load that particular version of a module, it will work. Or at least it should. Also, when you build with EasyBuild, you've got a basic level of assurance that the application will function, because there are inbuilt tests at the end of the build. However, we do have a minority of applications that we have to install manually.

41:52

These tend to be the restricted licence applications, things like MATLAB and ANSYS and other types of codes. Also, you'll get codes that are licensed as source. VASP is one that comes to mind as a source-licensed code, as is CASTEP. And so they have to be restricted in their access. Commercial codes are also licensed differently.

42:24

So you'll find that the module files will also point to a particular licence server, because ARC don't run very many, or don't own many, licences. We tend to license the Intel compilers, Stata and a couple of other things, but for everything else it's usually another department or group that owns that licence. And so that needs to be protected so that the wrong people aren't using the wrong code.

42:56

When it comes to assisting users who want to build their own code, it tends to be limited to customisation for ARC itself. We don't have the bandwidth, given the number of staff that we have, to provide a large amount of RSE-type effort to get into code and help people parallelise code. We can give ideas and a little bit of help, but it's not something that we can easily get into.

43:23

It does take a lot of time. One thing that we do find ourselves doing is actually optimising commercial codes. One example is ANSYS CFX, which is a CFD code, and we found it was running quite poorly on our new system.

43:45

And by changing the MPI stack, which is the message passing interface that the system uses to communicate between nodes, from the one that ANSYS supplied to our own one built locally on our system, we were able to dramatically improve its performance. The ideal being a nice linear scaling, we've got it pretty close. Previously, it was all over the place. It was horrible. Have we got the graph?

44:18

Unfortunately not, but it was very poor. Something else that's become very popular recently is the concept of software containers. People have heard quite a lot about Docker and Docker images. We don't support Docker on our system natively, due to some of the security issues with it. But we can convert Docker images into Singularity containers, as we run Singularity on ARC, and

44:54

That's just a very nice way of being able to package up an application on your own workstation and just know that it will actually run on the system. You probably wouldn't be able to get a parallel run that way, but it certainly would be great for running multiple instances and doing some kind of Monte Carlo type simulation. And also, I have put on that slide that Singularity has been renamed to Apptainer, due to its joining the Linux Foundation.
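The Docker-to-Singularity conversion mentioned here is usually done with the `build` subcommand and a `docker://` source. The Python sketch below only assembles that command line; the image name and output file are hypothetical examples, and whether the command is called `singularity` or `apptainer` on a given system is an assumption to check locally.

```python
import shutil
import subprocess


def build_command(docker_image, sif_path, tool="singularity"):
    """Assemble the command that converts a Docker image into a SIF container."""
    return [tool, "build", sif_path, f"docker://{docker_image}"]


def convert(docker_image, sif_path):
    """Run the conversion with whichever container tool is installed, if any.

    Returns the name of the tool used, or None if neither is on PATH.
    """
    for tool in ("apptainer", "singularity"):
        if shutil.which(tool):
            subprocess.run(build_command(docker_image, sif_path, tool), check=True)
            return tool
    return None


# Hypothetical example: a public Python image converted to python.sif.
cmd = build_command("python:3.11-slim", "python.sif")
print(" ".join(cmd))
```

The resulting `.sif` file is a single image that can be copied to the cluster and run there without Docker.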

45:28

As of November. Dai touched on the fact that we use Slurm, the Simple Linux Utility for Resource Management, and that's how a user specifies what resources they require for their job. So you will need to craft a little script to run the application that you want to use, and in that you have to specify a number of resources that the job requires. That could be the number of CPUs it requires, or a large amount of memory.

46:07

Maybe it needs a GPU, and maybe you have a reservation or some other similar quality of service requirement. Once you feed all of that information into your script, Slurm will see that and it will make a sensible, hopefully, decision as to where that job runs. And it may be that that job could run immediately somewhere, or it may have to be queued until the resources are available. So that's what the resource management does for us.
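To make the resource requests concrete, here is a minimal Python sketch that writes out a Slurm batch script. The `#SBATCH` options used are standard Slurm ones, but the job name, the values, the module name and the `analysis.R` script are hypothetical, and the settings appropriate for ARC should be taken from the ARC documentation or the intro course.

```python
def render_sbatch(job_name, nodes, ntasks_per_node, mem, time_limit,
                  gpus=0, reservation=None, qos=None, commands=()):
    """Return the text of a Slurm batch script requesting the given resources."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={ntasks_per_node}",
        f"#SBATCH --mem={mem}",          # memory per node
        f"#SBATCH --time={time_limit}",  # wall-clock limit, HH:MM:SS
    ]
    if gpus:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    if reservation:
        lines.append(f"#SBATCH --reservation={reservation}")
    if qos:
        lines.append(f"#SBATCH --qos={qos}")
    lines.append("")
    lines.extend(commands)
    return "\n".join(lines) + "\n"


# Hypothetical job: 4 tasks on one node, 16 GB of memory, one GPU, one hour.
script = render_sbatch(
    "example", nodes=1, ntasks_per_node=4, mem="16G", time_limit="01:00:00",
    gpus=1, commands=["module load R", "Rscript analysis.R"],
)
print(script)
```

Saved to a file, a script like this would be submitted with `sbatch`, and Slurm then decides where and when it runs.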

46:38

For any more information on that, and if you want to work out how to create a Slurm script, the best place to learn that is on the Intro to ARC courses, which we run every couple of weeks. Dai also mentioned JADE 2. Now, we as a team provide local support for Oxford users of JADE. Unfortunately, well, I say unfortunately, that's not fair. We can't provide direct systems administration of the JADE system.

47:11

It's not run by us, so we don't quite have the same level of access as we have on the ARC system. When it comes to helping users on ARC with a problem, we usually can get in and help you: go into your directories, have a look at logs, make changes and help that way. But we can't do that on JADE, so we have to ask you to raise a ticket with Hartree, which we can then monitor and help with and add information to.

47:42

But in order to maybe help with that, we also have some DGX MAX-Q nodes, as I mentioned earlier on, in our HTC cluster. So it's probably quite a good idea to run any jobs locally on ARC first, before then scaling up and running them on the JADE system. Because, once again, the DGXs in JADE tend to run containers produced by NVIDIA, which we can also use. So I think that's it from me. Any questions?

48:25

Great. Thank you very much, Andy and Dai. That was a wonderful presentation, and a round of applause. Thank you, thank you.

Transcript source: Provided by creator in RSS feed: download file
