
#11 - Stefano De Angelis: A brief history of consensus protocol design

Dec 29, 2022 | 1 hr 20 min | Season 1, Ep. 11

Episode description

Hello and welcome to The AwesomeAlgo podcast!

Today's guest is Stefano De Angelis. He is a Solution Architect at Algorand with a PhD in Cybersecurity.

The episode is structured as follows:

  1. Overview of Stefano's background and experience in the field of computer science, including his academic pursuits at Sapienza University of Rome and the University of Southampton and the topics of his theses. We will also delve into his first project in the blockchain space and how he discovered and became interested in Algorand.

  2. In the second part of the episode, we will be taking a deep dive into the history of consensus mechanisms, starting with the earliest distributed systems in the second half of the 20th century and ending with modern protocols like Algorand's PPoS. We will discuss the state machine replication problem and the various networking models that have influenced protocol design, as well as the Byzantine generals problem. We will also touch on some of the first well-known consensus protocols, such as Paxos, and how the rise of blockchain technology introduced new consensus protocols like Proof of Work and Proof of Stake.

  3. Stefano's advice for aspiring software engineers who are interested in developing on Algorand and his predictions for the future of the blockchain space in general.

If you would like to appear on the podcast and you have an interesting project related to Algorand or web3, submit your application at https://bit.ly/awesomealgo.

To support the podcast, please subscribe and follow new episodes on your favorite podcasting platforms. Thank you for listening and we hope you enjoy this episode! 🙂

Resources:

https://developer.algorand.org/u/stefano.deangelis

https://www.amazon.com/Introduction-Reliable-Secure-Distributed-Programming/dp/3642152597

https://awesomealgo.com/

https://ask.awesomealgo.com/ - refer here to ask your questions to upcoming guests.

https://github.com/aorumbayev/awesome...

https://bit.ly/awsmalgovideo

Transcript

Hello, and welcome to episode number 11 of the Awesome Algo podcast. Today's guest is Stefano De Angelis, a solution architect at Algorand with a PhD in cybersecurity. Stefano has a lot of experience in software engineering and a very strong academic background, which will allow us to do a very interesting episode covering the history of consensus protocols.

To be more precise, I would say it's a higher-level overview of that brief history, because there are just too many awesome things that have happened over the past 50 years. We can't cover them all, but we'll do our best.

Don't expect us to dive too deep into the details; essentially, the goal of this episode is to highlight the fascinating overlap between the fields of distributed systems and fault-tolerant systems, and how both contributed a lot to consensus protocol design. And with that, I would like to give the stage to Stefano, and let's start with the introduction.

If you could tell us a little bit about yourself, and what was the first domain of computer science that sparked your interest the most? Yeah, thank you. Thank you, first of all, for inviting me to this great podcast. I'm super excited. So let me give you a brief introduction of myself and my background. I'm Stefano De Angelis, as you said. I work at Algorand as a research solution architect.

I'm Italian, so I spent most of my life in Italy during my education. As a background, I studied computer engineering for my bachelor's degree at La Sapienza University in Rome. During my first degree, I was also working as a full-stack software engineer at a digital company in Rome, mostly doing web application development and also mobile application development for Android.

So the first programming language I had was Java. And it was OK, I liked it, I enjoyed it. But at the end of my bachelor's, I said, OK, maybe programming is not what I really want for my future. So for my master's, also in Rome at La Sapienza, I decided to go deeper into topics like cybersecurity and distributed systems. Why these two topics? Well, because at that time cloud computing was on the rise.

And cloud computing was quite fascinating for me, because I was asking how these quite complex systems can ensure reliability and fault tolerance for applications and help people consume services in such an efficient way. I was very interested in this stuff, so I said, let's try to go deeper into distributed computing.

And eventually, during my master's, I came across an amazing book called Introduction to Reliable and Secure Distributed Programming, written by professor Christian Cachin of the University of Bern, Switzerland. This book was great because it introduced the basic problems of distributed systems, like how messages can be broadcast in a distributed network, how shared memory works across replicas, and, of course, the problem of consensus protocols.

So this was my first approach to distributed systems, and that's my main research area. I will make sure to include the links to these resources. I think it's a great book. And a nice reference to Java, by the way; I think it's a great programming language for diving into object-oriented programming. So, you mentioned pursuing your degrees at Sapienza and your PhD at the University of Southampton.

How do you think the experience of pursuing your degrees there shaped your academic background, your future goals and aspirations? Yeah, let me tell you the story. Sapienza is probably one of the largest universities in Europe, and there is this big faculty of engineering just close to the Colosseum in Rome, full of history; a lot of scientists started there. It hosts all the engineering disciplines.

Electronics, mechanical engineering, computer science, all of them. I had the chance to improve my weak math skills there, which was quite helpful for me, and I had the chance to work with brilliant colleagues and professors. I collaborated with a cybersecurity research group there, and through the contacts I made, one of the professors offered me the possibility to go abroad for the final project of my master's thesis. I won a scholarship at the time.

So I had the possibility to go abroad, and the professor I met in the research group in Rome offered me this opportunity with Southampton. Southampton is a city on the south coast of the UK. The University of Southampton is a very big and good university; it's part of the Russell Group, which comprises 24 of the best universities in the UK. I was super excited to join, and I moved to Southampton.

I joined the cybersecurity research group at the University of Southampton, which is an academic centre of excellence for research in cybersecurity. The people working there were interested in distributed computing, security, and blockchains. So during my time at the University of Southampton, I had the chance to work on various projects, mostly related to cybersecurity and blockchains.

But the fascinating thing is that the university was extremely exposed to the institutional and military sectors. So I did research on blockchains and distributed systems from a theoretical point of view, but then I had the chance to put things into practice with practical projects. I see. That's very nice. Oh, sorry, I thought there was a continuation there. But that's a very interesting background.

And you're mentioning the field of cybersecurity, which at first glance could be a rather broad topic to cover. So let's slowly move a bit towards the blockchain space, because I'm also curious why that particular domain within cybersecurity interested you the most.

So perhaps if you could also mention one of the first projects you worked on specifically in the blockchain space, and, aside from cybersecurity, what was the big fascination that got you involved with blockchain specifically? Yeah, I definitely agree with you. I focused on the blockchain space, and, as I told you, I moved to Southampton for the final project of my master's thesis, which was connected with the University of Rome.

At that time, I was studying the security of consensus protocols proposed for private blockchains; that was the final project of my master's thesis. I moved to Southampton because the group there was quite experienced with blockchains, and when I arrived, the research group was working on a European project under Horizon 2020 (H2020).

They were investigating the adoption of blockchain technology for implementing a federated pan-European cloud infrastructure for public administration. It was a big cloud system composed of all the European countries, and it used the blockchain as a shared source of trust for the information. In that context, I worked as a research fellow on the design of the blockchain infrastructure and the application use cases.

I was coming from a background of investigating the security of consensus protocols for blockchains from a theoretical point of view, but with this project I had to put things into practice, starting to implement the infrastructure and smart contracts. I see. Interesting.

And going a little bit further to the topic of Algorand specifically: going back to some insights from previous episodes, we know it's a rather small space if we look at the total engineering force in blockchain development in general, up to 300,000 people total, spread across multiple companies. So there's a vast number of projects to pick from. Why specifically Algorand? How did you first discover it?

And what about its technology piqued your interest and made you want to get involved? Yeah, sure. So when I moved to Southampton, there was this big European project. To be honest, the team was working with Ethereum at the time. It was late 2017, and they were investigating Ethereum as a platform for the project. But Ethereum was too slow, and it was an extremely poor fit in terms of infrastructure cost. I'm sorry, which year was this?

It was late 2017. At that time, Ethereum wasn't feasible for the project, so we decided to switch to another platform, a private blockchain, Hyperledger Fabric, because it was more efficient and also more feasible; as I told you, we were building a pan-European cloud federation, and at that time private blockchains were the best solution for building a federation. But after the project, I started my PhD, and I'm talking about early 2018.

The project ended, and I started my PhD. Once the project finished, I started asking myself the following question: why was it so difficult to achieve scalability and security for that project? Why was it so difficult to use a blockchain with a certain level of performance and security in that system? This was my first research question, and it was quite fascinating at the time. Then I started comparing the solutions the market was proposing.

In early 2018, there were plenty of solutions, both private and public blockchains, but they weren't well documented either. So I decided to start studying these platforms and their consensus protocols. Eventually, I came across a paper from Silvio Micali proposing a solution for how to solve consensus in large-scale decentralized networks.

That paper impressed me a lot, because Silvio proposed an extremely elegant way to solve consensus in such a large-scale network. So I started playing around with Algorand during my PhD, both from a theoretical point of view, by investigating the consensus protocol, and from a practical point of view. Algorand was launched in 2019, and at the time there were already some features you could experiment with, like smart contracts.

During my research, I started using Algorand and also proposing it for the projects my research group was working on. For example, in early 2020, together with my professor at Southampton, I proposed Algorand for a very interesting project with the European public administration. I was involved with the Ministry of Finance, Defense, and the Social Security Institute; we were trying to build a blockchain system for the public administration in Italy.

And we designed a prototype with Algorand. And we used Algorand for the infrastructure, but also for the application use cases, like how can we issue a loan on Algorand's smart contracts, or how can we issue a social security for citizens on Algorand's smart contracts. So it was quite in the middle between investigation and application, real application use cases.

Interesting. And just to recap for the listeners, the interesting highlight from Stefano's background here is that there was real research done to find a good chain and a good consensus protocol that could satisfy a very specific problem, which in this case was the project you outlined. And going back to the dates, this was 2017, and there was a big boom in the crypto market back then.

Lots of chains also showed the limits of their scalability. But it's interesting to hear that Algorand came in as a theoretically good option for the implementation, and that it then proved to be the case in practical application. Are there any other, perhaps minor, open source or other projects within the Algorand space that you want to highlight, that you had the chance to work on while exploring or learning the chain?

Otherwise, we can essentially dive into the main topic of the talk and talk about the consensus. So do you mean before starting Algorand? No, I mean, aside from the project that you worked on during the research, just if there's anything else within the Algorand space or Algorand ecosystem that you worked on or contributed to that you wanted to highlight, otherwise we can proceed. I'm sure there's a lot of things that you got involved with in the ecosystem as well.

During my PhD, when I was proposing Algorand as a solution for a research project, I also jumped in as a developer ambassador with the Algorand Foundation. I took time to propose a couple of solutions on the Algorand developer portal; I think they are still available online. One solution was quite interesting because it was a tutorial on how to build a private instance of Algorand, like building a sidechain of Algorand on your own premises or in the cloud. I see.

It was like a big fork of Algorand, although it's quite complex. Yeah, exactly. It was like a parallel network, and I built it just for research purposes. There is a tutorial that explains how to configure the nodes, the relay nodes and participation nodes, for doing that. This was quite interesting because I had the chance to really understand the code base of Algorand and, at the infrastructure level, how to handle the commands and configurations required for node communication.

Yeah, I must agree that that's one of the best practical ways to learn how the Algorand network functions. And to recall a guest from a previous episode, Patrick Bennett of TxnLab, the team behind NFD: one of his ways of learning was also essentially cloning the Algorand Go implementation and going over the intricate details of the implementation to learn about the network itself. So once again, Stefano, thanks for this amazing introduction.

This certainly sets the stage with your background and experience. What I'm most interested in during this episode is to talk about the brief history of consensus mechanisms. But before we dive in, I feel it might be important to first establish the main definitions and terminology that we will be using.

So perhaps let's start with the very fundamental questions, such as: can you explain what consensus protocols are and what role they play in distributed systems? Yeah, sure. That's a good point. I think before trying to define a consensus protocol, we should define what a distributed system is in general. You can think of a distributed system as a set of connected processes that cooperate towards a certain goal or to accomplish a common task.

From a user standpoint, we are used to computer systems having a server and a client. From the client side, the distributed system is transparent: you just send your request to a server and you receive an answer. You don't care that under the hood there is a more complex system composed of several processes that coordinate to serve your request and produce an output. So this is what a distributed system is. Why do we need consensus?

Because these processes together need to agree on a consistent state. And to do so, they need a protocol, which is the consensus protocol. This protocol is executed in a distributed fashion: everyone has the code of the protocol, and they, by which I mean the nodes of the network, all know how to execute it. The first primitive of a consensus protocol is atomic broadcast. What is that?

Well, when you have a consensus protocol, you have these processes that need to communicate with each other, and to do so they need an algorithm, a protocol. The protocol for these communications is atomic broadcast, which guarantees that all replicas in the network are on the same page: they have the same set of messages, and this set of messages is in the same order. So consensus is quite crucial for complex systems.

We can imagine the importance of consensus when we talk about cloud computing, distributed databases, where consistency across replicas can be very important, and then blockchains, of course. When describing distributed systems, there is a theoretical, mathematical model that is often used to build the fundamental notion of how to approach consensus mechanism design.

What I'm referring to is the state machine replication problem, which I think is another good piece of terminology to set up before we dive further, if you could briefly describe state machine replication. Another important aspect to cover is the environment in which the system using the consensus protocol operates, right? When we're talking about networking in this particular case, there is a set of different assumptions.

There are ideal theoretical assumptions that relax a lot of conditions, where we have a fully synchronous network with some sort of centralized clock. Then we have the more realistic, real-world model under which most modern consensus protocols operate, which is the partially synchronous model. And then we have the asynchronous model, for which it has been proven impossible to design a deterministic consensus mechanism.

And there are assumptions in the protocol design, so let's go over those terms as well; there will be one more before we can dive into the proper history. So let's start with state machine replication, if you could briefly define this problem. Right. So it's a big question; let's try to be as quick as possible. Sure. We need to condense like 30 years or more of distributed systems theory.

First of all, you mentioned state machine replication, but what is replication? I briefly gave you an introduction to what a distributed system is, and replication is quite a similar concept. In a traditional client-server architecture, the server represents a single point of failure. This means that if the server goes down or there are outages, the service provided by the server is no longer available to the client.

That's why, if we want to provide robustness for our service, we need replication. What is replication? You take the server and you replicate your service across different servers. This way, if a server goes down because of faults or for whatever reason, the system can keep operating correctly and the client can still be served. This concept was first theorized by Leslie Lamport in the 70s as state machine replication.

So what is state machine replication? Imagine again our distributed network of processes, where each process runs a deterministic state machine replicated across all processes. Every process starts in the same state and then executes the requests coming from clients in the same order, thanks to atomic broadcast.

So basically given an input to the state machine replication, all the processes do the computation and then they produce a single output which is equal for any replica of the state machine. How can we achieve that? Well, by running a distributed consensus protocol across replicas.
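To make this concrete, here is a minimal Python sketch of the principle just described; it is a toy illustration under stated assumptions (a key-value state machine and a pre-agreed log standing in for atomic broadcast), not any real system's implementation:

```python
# Toy state machine replication: a deterministic state machine applied
# to the same ordered log yields the same state on every replica.

class KeyValueStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, request):
        # Deterministic transition: same request, same resulting state.
        op, key, value = request
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

# Atomic broadcast guarantees every replica sees the same ordered log.
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]

replicas = [KeyValueStateMachine() for _ in range(3)]
for replica in replicas:
    for request in log:
        replica.apply(request)

# All replicas converge to the identical state.
assert all(r.state == {"y": 2} for r in replicas)
```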

So state machine replication is quite important in distributed systems, and it is just as important as fault tolerance, because you can have a state machine replicated across the nodes, but without fault tolerance your operations can be faulty if something goes wrong. Fault tolerance is the concept of guaranteeing that operations from the system are served correctly even if some processes go down.

We call this kind of system f-fault-tolerant: if the system can tolerate up to f faulty nodes, we call it f-fault-tolerant. Of course, you need a certain number of honest nodes to keep the system processing operations. Against this background, we then have several considerations, as you were mentioning, about the execution of your fault-tolerant system. What do we have to consider?

First of all, the network model, because we are in a distributed system and our processes need to communicate with each other. Communication happens across a network. In an optimal situation, in a perfect world, communication is synchronous: every time process A sends a message to process B, the message is instantly, synchronously delivered. The processes can also rely on synchronized clocks, so everyone has the same view of time.

This is great for consensus, because it's very easy to achieve consensus in such an environment. Sadly, it is unfeasible in the real world, because real networks are asynchronous, so every replica may have a different perception of time.

Indeed, I don't know if you've heard about it, but there is this very nice result from Michael Fischer and colleagues, a theorem called the FLP impossibility, which states that in an asynchronous distributed system it is impossible to guarantee consensus within a bounded time if replicas can crash. So, in the presence of crash faults. That's why we cannot model consensus in a fully asynchronous network.

That's why, as you were mentioning, the most accepted network model is the partially synchronous network. A partially synchronous network is basically an asynchronous network that eventually behaves like a synchronous one: messages will eventually be delivered to all the honest nodes. Under this assumption, a lot of protocols have been developed.

Sorry, I just wanted to add some additional information here for listeners who are interested in the practicality of what this assumption means in a partially synchronous model. In other words, you can imagine that the consensus protocol essentially makes some assumption about the time it takes for a message to be delivered.

In the simplest example, we could pick some constant T bounding message delivery time in the network: even if the network is under attack, it is assumed to eventually reach a state of synchronicity within that bound.

So this assumption we're outlining here mostly refers to the theoretical design of the consensus protocol, because it's very important to define the environment for which you are designing it. But yeah, please go ahead. Yeah, that's correct. And it's very important, because from there you can reason about your protocol. The second pillar of the analysis is faults. Your processes may fail somehow; you may be subject to faulty processes.

Or maybe your processes can be subverted by an adversary and act maliciously against the protocol, or even be disconnected for a certain period, like in a partitioning of the network. All these faults are called Byzantine faults, and the term Byzantine comes from the Byzantine generals problem, formulated again by the genius Leslie Lamport.

The Byzantine generals problem is an allegory for a distributed system, represented as a set of generals surrounding a village they want to attack. They need to decide on a strategy: either attack the village at once or retreat. Leslie Lamport showed us that you need enough generals to reach a consensus on whether to attack or retreat. And why do you need that? Because these generals, to communicate with each other, must rely on messengers.

And these messengers, again, can be disconnected or delayed. And the generals themselves can be traitors. So they can cheat. Exactly. To achieve consensus, Lamport demonstrated that you cannot solve this problem if you have just three generals. Or, more generally, you cannot have one-third or more of the generals be Byzantine.
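These tolerance bounds can be written down directly. A toy sketch, assuming n total processes: crash-tolerant protocols need a strict majority of correct processes (f < n/2), while Byzantine tolerance requires n > 3f, i.e. strictly fewer than one-third faulty:

```python
# Fault-tolerance thresholds from the discussion (toy helpers).

def max_crash_faults(n: int) -> int:
    # Crash faults: need a strict majority of correct processes.
    return (n - 1) // 2

def max_byzantine_faults(n: int) -> int:
    # Byzantine faults: need n > 3f, i.e. fewer than one-third faulty.
    return (n - 1) // 3

assert max_crash_faults(5) == 2      # 5 nodes survive 2 crashes
assert max_byzantine_faults(4) == 1  # 4 generals tolerate 1 traitor
assert max_byzantine_faults(3) == 0  # 3 generals cannot tolerate any
```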

And when we say that a node experiences a Byzantine fault, in other words, we could say this is an aggregate description for a variety of different errors that could happen: a software failure, a hardware failure, and it also includes the possibility of malicious intent behind the node itself.

So when you hear the term Byzantine fault, it is essentially a way of highlighting this array of very unpredictable behaviors that can occur within a node. A consensus protocol can usually guarantee very strong security properties if it is able to withstand Byzantine faults. And with this little recap for the listener, we have just covered the main terminology that we will be touching on in the next parts.

Stefano was kind enough to provide the fundamentals here regarding what consensus is and what distributed systems are. We covered state machine replication, and there was a very nice mention of the FLP impossibility as well. All the references to what Stefano mentions are going to be provided in the description of the episode.

FLP stands for Fischer, Lynch, and Paterson, the surnames of the academics who came up with this theorem. And perhaps for folks from other software engineering backgrounds: if you're familiar with the CAP theorem, this is something that, I would say, mirrors the CAP theorem a little bit, although it comes from a slightly different domain of engineering.

And we also covered the networking models, which is synchronous, partially synchronous, and asynchronous, as well as the Byzantine generals problem and Byzantine faults. So I think with this in our baggage, we should be able to appreciate the history of the consensus now. So diving into one of the first well-known consensus protocol designs, and you mentioned the work of Leslie Lamport, who certainly gave a tremendous contribution to this field.

So any specific protocols that you could highlight in the early era of the rise of this domain? Yeah, so of course the contribution of Lamport was great, but we first need to divide our story of consensus into two big families of consensus protocols. One is the consensus protocols of the first stage of development, which are called crash-tolerant consensus protocols, because these protocols are only able to tolerate crash faults.

That is, when your replica goes down for any reason. Then there is the second, more complex family of consensus protocols, the Byzantine fault-tolerant protocols, which we will see in a minute. Let's start with crash fault-tolerant consensus protocols. At that stage, there is this big trust assumption. I mentioned a system of n participants; with n participants, the assumption is that we can tolerate up to f faulty processes, where f is a theoretical number.

In crash-tolerant protocols, this f must be a minority of all participants in the distributed system; f must be lower than half of the participants. In this context, the first implementation of a consensus protocol was proposed by, guess who, Leslie Lamport, in the 80s. The protocol was called Paxos. Paxos is actually a very complex protocol; let me try to give you just the intuition.

Paxos defines three main actors: proposers, acceptors, and learners. The proposers are in charge of broadcasting the new value to be agreed upon by the network; they broadcast the message to acceptors and learners. The acceptors collect the information and then send back an acknowledgement to the proposer.

If the proposer receives a majority of acknowledgements, a quorum in the network, it is happy that everyone is on the same page, because everyone has the same value to update on their system, and all together they can consider the value committed to the storage, or whatever the shared memory is. Learners are just silent nodes that observe the communications, and eventually, if something is approved by the proposer and acceptors, they go along with it.

So they decide to update their system to the latest proposed value. If we have something like concurrent proposers, the acceptors may decide to update the storage to the latest proposed version of a value in a certain round, and this ensures that everyone stays on the same page. Of course, this is quite a complex protocol, and it was very difficult for developers to implement and for scientists to digest.
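For intuition only, here is a toy Python sketch of the quorum idea at the core of Paxos; real Paxos also needs proposal numbers and a two-phase exchange, both omitted here:

```python
# Toy sketch of the Paxos quorum intuition: a proposer considers a
# value decided once a majority (quorum) of acceptors acknowledge it.

class Acceptor:
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.accepted = None

    def accept(self, value):
        if not self.reachable:          # crashed or partitioned
            return False
        self.accepted = value
        return True                     # acknowledgement

def propose(value, acceptors):
    acks = sum(1 for acceptor in acceptors if acceptor.accept(value))
    quorum = len(acceptors) // 2 + 1    # strict majority
    return acks >= quorum               # True -> value is decided

acceptors = [Acceptor(), Acceptor(), Acceptor(reachable=False)]
assert propose("v1", acceptors)  # 2 of 3 acks: a quorum, so decided
```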

More recently, a new, simplified version of Paxos was proposed, called Raft. Raft is a great consensus protocol because it's one of the first to introduce the concept of a leader in a distributed system. How does it work? Raft has a leader election: the system elects one leader, and this leader is a trusted authority responsible for telling the others what the next value to be updated in the shared memory is.

The leader is also the main entity exposed to clients, so all requests coming from clients are concentrated at the leader. The leader decides the order of these requests and which requests to broadcast to the rest of the network. The replicas listen for messages from the leader and, once they arrive, send back acknowledgements. In each round of the protocol, the leader proposes a new value until a majority of replicas agree on that value.

Once the leader collects the majority, it can consider the value committed. And why do you keep hearing about majorities? The point is that in crash-tolerant systems we have this big assumption that no more than half of the replicas are faulty. In such a system, with this assumption, you can be sure that if you receive a majority of votes, you are on the same page as the other honest replicas.
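Again as a toy sketch, this is the leader-based flow just described, with elections and terms omitted; the names are illustrative, not Raft's actual API:

```python
# Toy sketch of Raft's leader-based flow: the leader orders client
# requests and commits an entry once a majority of replicas (itself
# included) have appended it.

class Follower:
    def __init__(self):
        self.log = []

    def append(self, request):
        self.log.append(request)
        return True  # acknowledgement back to the leader

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log, self.commit_index = [], -1

    def replicate(self, request):
        self.log.append(request)
        acks = 1 + sum(1 for f in self.followers if f.append(request))
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:      # majority appended
            self.commit_index = len(self.log) - 1
            return True                    # entry committed
        return False

leader = Leader([Follower(), Follower()])
assert leader.replicate("set x=1")  # 3/3 appended -> committed
```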

And those are the two main implementations of crash-tolerant protocols. But now we can dive into the more interesting part, the Byzantine fault-tolerant protocols. Let me spend a few words on why it's very important to have a protocol that is Byzantine fault-tolerant and why this matters so much in a blockchain system. In a blockchain system, there is no trust in the network, right? Ideally, trustless should mean permissionless. Exactly.

Ideally, it's trustless and permissionless. Even in permissionless networks you need a bit more trust, but you are still in a decentralized system. But let's stick to the worst case, right? The permissionless network, where everyone can join, jump in, jump out. Everyone can also just download the source code of the protocol, which is usually open source, change it a bit, and connect to the main network.

And when I say change it a bit, that could mean corrupting the code and doing things that go against the protocol, like producing bad blocks or duplicate messages. Double-spending. There's an entire array of different attacks you can perform, yeah. Exactly. In that scenario, it's quite important that the underlying protocol of your network is resilient to possible Byzantine behavior.

That's why it's very important to adopt something like Byzantine fault-tolerant protocols. Going back to the history of distributed systems, after the work on crash-tolerant protocols, theoretical studies showed that it is possible to practically solve the Byzantine generals problem proposed by Leslie Lamport.

The way it was implemented is through a protocol called Practical Byzantine Fault Tolerance, PBFT, proposed in 1999 by two academics, Castro and Liskov. This was great because they were able to implement a protocol from the assumption proposed by Lamport that in a Byzantine environment no more than one-third of your network can be faulty, and from this assumption they demonstrated that they could achieve consensus in a partially synchronous network.

The PBFT protocol, just to give you the intuition, works in the following way. The idea is that we still have a leader, and then the replicas exchange a lot of messages trying to reach agreement on a certain value. In crash-tolerant protocols we have few message rounds, because we don't have Byzantine behavior; we trust that when we receive a message, the message is trustworthy. In this scenario, that is not true.

So once I, as a replica, receive a message, I also want to know what message all my colleagues received before committing that message to my shared memory. That's why we need more rounds of communication. To be precise, there are three rounds of communication: first, the leader proposes a block to all replicas, and then all the replicas exchange messages in a certain way to reach consensus on the message received.
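A toy sketch of the three-phase quorum pattern just described; the numbers follow the standard PBFT bound n = 3f + 1 with quorums of 2f + 1, but view changes and cryptography are omitted:

```python
# Toy PBFT round: each replica waits for 2f + 1 matching messages
# before advancing, so any two quorums overlap in an honest replica.

def pbft_quorum(n: int, f: int) -> int:
    assert n >= 3 * f + 1
    return 2 * f + 1

def run_round(n, f, prepare_votes, commit_votes):
    quorum = pbft_quorum(n, f)
    # Phase 1 (pre-prepare): the leader broadcasts the proposed block.
    # Phase 2 (prepare): replicas echo what they received.
    if prepare_votes < quorum:
        return "stuck (safe): not enough matching prepares"
    # Phase 3 (commit): replicas confirm that a quorum prepared.
    if commit_votes < quorum:
        return "stuck (safe): not enough commits"
    return "committed"

assert run_round(n=4, f=1, prepare_votes=3, commit_votes=3) == "committed"
# A partition can stall progress, but never commits conflicting values.
assert "stuck" in run_round(n=4, f=1, prepare_votes=2, commit_votes=0)
```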

Now you might ask me, what happens if there are delays? If you are connected and receive some messages with a certain delay, you may decide this leader is not honest anymore and you want to change leader, so you can start a new voting process to change the leader. Here again there is a voting process, composed of three phases, where replicas can ask the network to change the leader. But what if there is a partition and nobody can communicate with each other?

What happens in that case? Well, the protocol has been demonstrated to be resilient to that scenario. And when I say resilient, it means the protocol is able to ensure consistency across replicas despite the partition. Your protocol cannot proceed, we are stuck at a certain round, but nobody does anything: we don't update our storage, there is no change to the shared memory or anything. So we are safe. We are all on the same page, and we are stuck there.

Once connectivity returns and we have enough honest nodes to talk to, we can start running the protocol again. And this is a quick introduction to crash-tolerant protocols and Byzantine fault-tolerant protocols. Interesting. So I think I could eventually sketch a little something as a supplement to this episode. But in terms of the hierarchy we've just outlined: once again, we have the crash-tolerant protocols.

Those are systems such as Paxos and, introduced later, Raft. Their implementations are something that all the big tech companies rely on, because crash tolerance is not exclusively the domain of blockchain applications; distributed systems are not blockchain-only. A lot of big corporations, such as Google, for example, rely on things like Raft and Paxos in some of their offerings.

Essentially, this is something that, I would say, also powered a lot of infrastructure in the Web2 space in general. But talking about Byzantine fault tolerance, this is an entirely different branch of problems that specifically deal with, well, Byzantine faults. Folks familiar with the terminology of Web3 and blockchain might be curious about when we are going to get to proof of stake and proof of work.

I believe this is something that is going to be covered as well. But aside from Byzantine fault tolerance, looking at that research area: what do you think about the rise of blockchain technology introducing perhaps new ways to look at consensus design in general? For the longest time, there has been a lot of innovation in consensus protocols based on Byzantine fault tolerance.

But consider things such as proof of work, which embraces the longest chain rule, something that sacrifices consistency a little but never sacrifices availability. What do you think about the embrace of longest-chain-rule-based consensus protocols, and how have they affected the blockchain and research space in general?

And do you think it is still a significant contender to the more classic Byzantine fault-tolerant consensus protocols? Great question. Let's think about how we arrived at proof of work, why we needed it. Proof of work was proposed in the Bitcoin paper to solve a specific issue in distributed systems. We had this theorization of the blockchain, which is a decentralized network, and a decentralized network is a special class of distributed system.

Because if you think about a distributed system as a cloud infrastructure, you still have a central authority in control of the system. The authority can decide the number of replicas and who is joining. In that situation, a protocol like Paxos, as you were mentioning, is perfect; it's very efficient and you are happy with it. But if we think about a decentralized network, it's quite different, right? Because the scale of your system is very large.

And also everyone can join in or can leave the network as they want. So in that situation, how do we achieve consensus? So basically, how do we elect the leader? So we don't have control because maybe I elect you as a leader, but after I elect you as a leader, you leave the network. So there is a problem there. So in a decentralized system and in blockchain space, the proof of work was great.

It was great because it introduced the concept of randomly selecting leaders to propose new blocks, new messages. The randomness was given by solving a mathematical puzzle that was quite hard and required a lot of computation: mining. We all know about mining. When a solution was found, the node that found it became the proposer of the block.
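For intuition, here is a toy proof-of-work puzzle in Python; the difficulty is deliberately tiny so it runs in a moment, unlike real mining:

```python
# Toy proof-of-work puzzle: find a nonce such that the block hash
# falls below a target set by the difficulty.
import hashlib

def mine(block_data: bytes, difficulty_bits: int) -> int:
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce  # the winning "lottery ticket": propose the block
        nonce += 1

nonce = mine(b"block payload", difficulty_bits=16)  # easy toy difficulty
print("found nonce:", nonce)
```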

The puzzle is so difficult to solve that it is quite improbable for two miners to solve it at the same time. So imagine we are in a network: I solve the problem with a certain probability, and then I can spread my block throughout the network. But what happens if you and I are both lucky and find a solution at the same time? I start proposing my block to my neighborhood and you start proposing your block to your neighborhood.

In that case, there is a big issue, because the network starts to be partitioned, not in terms of communication but in terms of the consistency of its storage. And now that we've reached blockchains: when I talk about storage, I mean the ledger. In that case, we need a rule that governs our system and tells your neighbors or my neighbors which block they need to adopt.

In Bitcoin, we have the longest chain rule: keep building on the longest ledger you know about. You can consider a block finalized once six blocks have been committed on top of it, because at that point it would be quite hard to revert the chain and remove that block from the network's chain. So proof of work is great for the random election of leaders in a decentralized system.
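The fork-choice and finalization heuristics just mentioned can be sketched like this (a toy model: real clients compare accumulated work, not raw chain length):

```python
# Toy fork choice under the longest-chain rule, plus the 6-block
# confirmation heuristic. Chains are just lists of block ids.

def choose_chain(chains):
    return max(chains, key=len)  # adopt the longest known chain

def is_finalized(chain, block_id, confirmations=6):
    index = chain.index(block_id)
    return len(chain) - 1 - index >= confirmations

chain_a = ["g", "b1", "b2", "b3", "b4", "b5", "b6", "b7"]
chain_b = ["g", "b1", "x2"]           # losing fork
assert choose_chain([chain_a, chain_b]) == chain_a
assert is_finalized(chain_a, "b1")    # 6 blocks built on top of b1
assert not is_finalized(chain_a, "b4")
```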

The problem is that it was inefficient. Why? Because the larger your network, the more difficult the puzzle has to be. If the puzzle is easy, it is easier for miners to solve, and then you might have a lot of concurrent blocks in the blockchain. But if it's very difficult, the chain will be very slow as well. And I'm not even talking about the energy demands of the mining process.

So proof of work is great, and it's probably a good solution for Bitcoin, but maybe it's not a good solution for smart contract platforms or for decentralized computation. There is also the big issue we discussed before, which is that proof of work tends to become centralized. What's the point here?

Well, since solving the mathematical puzzle is quite hard, nodes tend to join forces to increase their computational power. This is the phenomenon of mining pools, where people pool their computation to achieve the proof of work. And if you have, say, three mining pools concentrating all the mining power in the world, those mining pools are probably controlling your network.

And you might not be happy with that. So this is the situation with proof of work, and that's why proof of stake arrived. Let's talk a little more about proof of stake, but first a quick recap of what you just said regarding the longest chain rule.

So as Stefano just outlined, proof of work was a great innovation in terms of introducing randomness into node selection and adding an economic incentive, and it was certainly a fitting design for a chain that didn't need to also run Turing-complete computation and smart contracts.

But in the real world, after Bitcoin had been running for many years, what a lot of people saw was the centralization of control into different mining pools. Essentially, if you put an economic incentive on compute, eventually you will see your network slowly becoming centralized, because there are only so many players that can allocate that many resources to pay for the compute.

Which leads us to proof of stake, an interesting alternative to that. We covered a lot of proof-of-stake terminology in the previous episode, so we don't need to spend long setting the stage for it. But let's talk a bit about proof of stake now and what the main innovation there was. And I suppose some things from proof of work certainly served as an inspiration.

Yeah. So proof of stake replaced the computing power required for leader election in proof of work with stake. What does that mean? Well, we don't have miners anymore; we have validators. These are nodes in charge of participating in the consensus protocol. How we select these nodes is the big problem of proof-of-stake networks.

First of all, the nodes entitled to be validators are the ones willing to commit something of worth, in terms of stake, to the network. There are some limitations. I will split the problem into two parts. The first part is leader election: once you have these validators, you need to elect one leader. And once you elect the leader, you need to achieve consensus on the message the leader is proposing.

These are two big problems, because let's bear in mind that we are in a decentralized network, which is a very large network. If we go back to PBFT, the practical Byzantine fault-tolerant implementation, we know that we can achieve consensus in such an environment. The problem with Byzantine fault-tolerant protocols is that they require a lot of messages to be exchanged between processes.

If we are in a decentralized network with thousands, hundreds of thousands of nodes, it will probably take a lot of time. So although your validators are committing their stake and can join the network, it will probably take ages to achieve consensus. Some solutions have been proposed; there are different variants of proof of stake. The first variant is bonded proof of stake.

In that case, the stakeholders lock their funds, and for the funds they lock they get voting rights. With these voting rights, they can select the validator nodes that can be elected as leader. The node with the most votes is the leader, and this leader will propose the block to the network. An alternative, which reduces the complexity of leader election, is delegated proof of stake. In delegated proof of stake, you don't have a lot of validators.

You just have a small set of validators, like a few hundred or a couple of thousand, and these validators are known to the network. Stakeholders may decide either to run their own validator or to delegate some of their stake to an existing validator. Why would I want to do that? Because validators with more stake have a higher chance of being elected as leader.
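As a toy sketch of the stake-weighted election just described (parameters and validator names are invented for illustration):

```python
# Toy stake-weighted leader election: a validator's chance of
# proposing is proportional to the stake delegated to it. Real
# protocols add randomness beacons, slashing, and epochs.
import random

def elect_leader(delegated_stake: dict) -> str:
    total = sum(delegated_stake.values())
    ticket = random.random() * total
    for validator, stake in delegated_stake.items():
        ticket -= stake
        if ticket <= 0:
            return validator
    return validator  # fallback for floating-point edge cases

stakes = {"val-A": 500, "val-B": 300, "val-C": 200}
winners = [elect_leader(stakes) for _ in range(10_000)]
# val-A should win roughly half the rounds over many elections.
print(winners.count("val-A") / len(winners))
```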

The validator with a large stake can propose blocks and dominate the network. But maybe that sounds like centralization again, because you are concentrating the power of many into a few. This is probably a trade-off, right? We need to do it because BFT protocols are so complex in terms of message exchanges that we need to reduce the problem to a smaller scale. Now, if we think about proof of work, there is the concept of randomness.

In both bonded proof of stake and delegated proof of stake, I haven't mentioned randomness, although some solutions are trying to adopt randomness to select leaders. In Algorand, there is the pure proof of stake algorithm. What does pure proof of stake mean? The idea behind it, in the paper proposed by Silvio, is that solving consensus requires a fair leader election.

A fair leader election means that everyone can join the network and can be elected as leader. There is no special validators' committee or anything like that; everyone who has some stake can be elected as leader. The likelihood of being elected leader is proportional to the stake you have in your wallet, but you are not committing that stake to any validator; you are not delegating it to others.

The idea of randomness is that these leaders are elected randomly from all the stakeholders in the network. The random election is governed by a cryptographic function called a verifiable random function, and the idea behind it is the same as in proof of work with mining: you do some work, some computation, and you obtain a result. This result may give you the winning ticket of a lottery. With proof of work, we do that with mining.

In pure proof of stake, you do it with verifiable random functions. You run this verifiable random function with your stake, and eventually you are elected as leader. When you are elected as leader, as in proof of work, you can spread the block you want to propose to the network. Then there is the second phase of proof-of-stake consensus protocols, which is agreement on the message you receive.

In bonded proof of stake and delegated proof of stake, you just receive the message from the leader and commit it: I trust the leader because it was elected by the stakeholders, so I will commit the message. If there are two leaders at the same time, there will probably be a fork in your network, and we can use the longest chain rule or something like that to eventually agree on a consistent state.

But in pure proof of stake, what happens is that once the leader has been elected randomly with the verifiable random function, there is a committee of nodes, again selected randomly, that runs a traditional Byzantine fault-tolerant protocol, as I explained before, with three stages of message exchanges. What does this committee do? It just validates and verifies the block received from the leader.

If everyone agrees on the block, then the block can be considered committed to the chain, and in that case there is no fork, right? If the committee accepts the block, the block is committed to the blockchain. This is probably the best way to achieve consensus in a very large-scale network, because you reduce the problem from a large number of nodes to a small committee, but the committee is randomly selected, just as the leader is.
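Here is a toy sketch of the VRF-style sortition idea described above; a keyed hash stands in for a real VRF, which would additionally produce a publicly verifiable proof, and the parameters are invented for illustration:

```python
# Toy VRF-style sortition: each account evaluates a private
# pseudorandom function on the round seed and wins with probability
# proportional to its share of the total stake.
import hashlib, hmac

def sortition(secret_key: bytes, seed: bytes, my_stake: int,
              total_stake: int, expected_winners: float) -> bool:
    digest = hmac.new(secret_key, seed, hashlib.sha256).digest()
    draw = int.from_bytes(digest, "big") / 2**256    # uniform in [0, 1)
    # Win probability scales with the account's share of total stake.
    p_win = expected_winners * my_stake / total_stake
    return draw < p_win

# Each node runs this locally and cheaply; only winners reveal
# themselves, by which time the block is already being gossiped.
selected = sortition(b"my-secret", b"round-42-seed",
                     my_stake=1_000, total_stake=10_000_000,
                     expected_winners=20)
print("selected for this round:", selected)
```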

And we use randomness for that reason. I really like the fact that one of the interesting innovations pure proof of stake introduces is that the selection of the proposers and of the committee members happens through the VRF, which is executed locally and securely. It's a very cheap cryptographic operation if you compare it with, for example, running massive racks of GPUs and ASICs in a mining pool.

The beauty of cryptography in this particular case is that it's an O(1), constant-time operation you can run locally. And once you actually have the result, as Professor Micali has outlined in various keynotes, at that point it's pretty much equivalent to publishing something on WikiLeaks: nodes are using the gossip protocol to spread the message.

Once you have the result, there's no point in attacking it, because it's already spreading over the network. So that aspect, the ability to execute it cheaply without revealing the result until you have it, was also a very interesting element of pure proof of stake. To continue, and once again, thanks for the great coverage here, I think we did indeed cover a very big branch of different consensus mechanisms.

And I think it's important for the listeners out there to understand the terminology and the foundational primitives required to understand consensus. It's at the heart of any blockchain. So if you're studying or trying to work with a particular blockchain project, it's very important first to understand its consensus.

To continue: we covered a lot of the different innovations that have happened since the works of Leslie Lamport and the Paxos protocol. Given your research experience, what do you think is the current state of the field of consensus protocol design in general? Where do you think the field is headed in the future, and what is the biggest challenge for the current state of the science?

I think before blockchain we were at a sort of plateau; we had reached an optimal, or at least acceptable, level of efficiency in distributed systems. But then blockchains arrived with decentralization, and the problems were many: in terms of efficiency and security, there are a lot of open questions there. I think with pure proof of stake we are at a good stage, if you think about the concept of randomness for electing leaders that we adopted in Algorand.

Now a lot of blockchains are switching to that concept as well. For example, in Cardano there is the concept of randomness, and Polkadot is also adopting VRFs for electing leaders. So it has become something of a standard, at least in the blockchain space. But in my opinion there is a lot of work to do to improve scalability, because networks will get larger and larger, and it is ever more complex to achieve consensus.

Even if we create small committees and try to represent the whole network with a randomly elected committee, it will still be complex to proceed with consensus there. So I think there is a lot of room for improvement in the performance and security trade-offs. Of course, consensus protocols in the future may be more performant; in Algorand, for example, we can produce a finalized block in 3.7 seconds on average. So I think there is a lot of room for improvement there as well.

Another field that I think is very interesting is cross-chain communication. I imagine that in the future we might have an ecosystem of blockchains. There is some work on that, but we need something smoother, more flexible, something very easy to use; say I like the toolkit proposed by Algorand, but I also want to use something on Cardano, and these are two completely different ecosystems.

Now we are trying to implement bridges, and in Algorand we are doing well with state proofs. But I think there is a lot of room for improvement on the consensus side. What if I had a protocol that enabled direct communication, enabling consensus between different worlds, let me say: the world of Algorand and the world of another chain. What if I had consensus there? How could I achieve it, and what would its complexity be?

I think this is a good direction for the research field in the future. Yeah, I would definitely agree; I also personally think there is going to be a lot more research and, of course, practical application of collaboration in this space. I think that's the big factor for success, for blockchains to reach adoption levels where they can provide a lot of value. And it's not necessarily just about competing with each other.

It's about setting a good standard for the competition by actually collaborating on standards and on protocols that will unblock cross-chain collaboration. And there are certainly great efforts being made by Algorand, with this year's announcements regarding state proofs, which also allow for post-quantum security.

So with that, I think this has been great coverage; hopefully we didn't bore some of our listeners by diving too deep into some of the aspects, but I can assure you this is as high-level as we can go. Ideally this should spark some interest in the variety of things happening in this particular space at the moment. Before we proceed to the final question, I wanted to highlight one question that was asked on Ask AwesomeAlgo.

This is a platform where listeners can ask questions by submitting them on the Algorand blockchain. There was one particular question I wanted to ask; let me read it right now, actually. What are other prominent alternatives to BFT and longest-chain-rule-based consensus protocols that are popular these days, in your opinion?

We mentioned BFT, we mentioned chains based on the longest chain rule, but I believe there are some rather exotic consensus protocols being designed and researched as well. Are there any particular protocols you can highlight that may be in the minority but are in fact complete alternatives to what we've just outlined in the discussion?

Well, we discussed different approaches to consensus in blockchains with proof of work and proof of stake. In particular, for proof of stake we have this concept of leader election, with round-robin or other kinds of models, and there we have protocols like Polkadot, and Ethereum as well after the Merge, adopting proof of stake. They have a sort of BFT protocol to achieve consensus.

They also adopt the longest chain rule, because they might have forks in the network, and so they need a way to reconcile them, to converge to the same state. There is this protocol called the GHOST protocol that enables, let's say, the longest-chain-rule model adopted in Bitcoin, but in a proof-of-stake environment. But still, we are talking about the longest chain, we are talking about BFT.

Something very different from my experience, and fascinating, I admit, is the protocol behind Avalanche. Avalanche is a layer-one blockchain like the others I mentioned here, and they are trying to propose something new and very efficient as well: a protocol that probabilistically accepts a consistent state across the replicas just by observing a local view of transactions.

Just for the intuition: if I'm a node and I receive transactions, then there is a probabilistic threshold after which I can consider a transaction accepted by the rest of the network and finalized. It's a sort of Byzantine fault-tolerant protocol and a sort of longest chain rule, but with a different approach, based on directed acyclic graphs. So it's quite an interesting protocol as well.
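As a rough illustration of the repeated-subsampling idea behind Avalanche's Snowball family (parameters invented, DAG structure omitted):

```python
# Toy Avalanche-style subsampling: a node polls small random samples
# of peers, flips its preference toward the majority answer, and
# finalizes after enough consecutive confident rounds.
import random

class Peer:
    def __init__(self, preference):
        self.preference = preference

def snowball_decide(peers, sample_size=5, threshold=4, beta=10):
    preference, consecutive = None, 0
    while consecutive < beta:
        votes = [p.preference for p in random.sample(peers, sample_size)]
        top = max(set(votes), key=votes.count)
        if votes.count(top) >= threshold:
            consecutive = consecutive + 1 if top == preference else 1
            preference = top
        else:
            consecutive = 0
    return preference  # finalized with overwhelming probability

peers = [Peer("tx-A")] * 80 + [Peer("tx-B")] * 20
print(snowball_decide(peers))  # almost surely "tx-A"
```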

I see. Well, I will certainly make sure to highlight that answer to the person who asked this question. To continue to the final part: this is something that is asked of all the guests of this podcast, because some listeners are aspiring engineers who may not be familiar with blockchain development but would like to get into it.

Is there any advice you can give to aspiring engineers who want to try their hand at blockchain development on Algorand or get into the Web3 space in general? So, I hope I can differentiate my answer from the others. What I really like about blockchain is that it's a fascinating industry, right? And the technological foundation of blockchain is very broad; there are a lot of topics involved.

We talk about cryptography, distributed systems, telecommunications, economics, a lot of things. So almost everyone can jump in and get involved in the ecosystem. Apart from that, from a technological point of view, for engineers there is much more we can do, right? There are toolkits, there are SDKs; you can start developing smart contracts very easily because the tools are improving a lot.

For example, in Algorand, if you want to get involved, it's very easy, because there are plenty of resources on the Algorand developer portal: getting-started tutorials, and the contributions from the developer ambassadors I mentioned, with a lot of examples of real use cases you can build with Algorand, using Algorand smart contracts, standard assets, and all the technology Algorand offers.

And it's quite straightforward: if you're a software engineer, you go there and start playing a bit. So my suggestion is to try to understand the fundamentals of the technology, to understand what you are doing, because we are changing the paradigm here from traditional client-server applications to decentralized applications. And then try to build things using the resources you can find online.

And get in touch if you want because I'm more than happy to help everyone. Awesome. Well, Stefano, I think this has been a great episode. We did cover a lot of topics.

I hope this episode highlights some aspects from the theoretical angle, which is not brought up too often in conversations; most of the time the angle is more industrial, explaining what blockchain is, or more practical and development-specific.

But this was essentially to show that a lot of the work that has been done is built on a very strong theoretical background, one that started in the 80s with distributed system protocols such as Paxos. And with that, thank you for listening, and I will see you in the next episode. Thank you.
