BFT Economy: Why Our Distributed Tokenized web3 Future Relies On Fault Tolerance
The more incredible our technology becomes, the more complicated problems emerge with its implementation and upkeep. While it is hard to ignore Bitcoin, it is easy to overlook the multiple factors at play that ultimately allowed its rise and success. Without Bitcoin, the decentralized financial tools and the burgeoning secondary economy online couldn’t have happened. The Byzantine’s generals problem is important to understand to truly appreciate the decades-long history behind achieving consensus and enabling decentralized trustless networks. While Bitcoin utilizes proof-of-work (POW) as a trustless consensus mechanism, it is also Byzantine fault tolerant. What does that mean, and why does it matter?
In this article, we’re going to look at the history behind interactive consistency, showcase the Byzantine generals problem, what it means to be ‘fault tolerant’, how this is implemented within modern blockchains, and the theoretical future of quantum byzantine agreements.
Knowledge is power. This article is more technically dense than others I’ve done in the past; if you’re interacting with blockchains and web3 on a regular basis and have a reverence for the space, some of the information here could interest you. I’ve tried to keep this as technically light as possible while conveying the importance and implications behind this technology. This is technical, but I promise it is interesting. Achieving consensus forms the backbone of the parallel economy of web3. The future of money and the internet is directly related to the information presented here.
Backstory
The story begins, like many great stories in engineering, with NASA. In the computer science lab at SRI International in 1978, the SIFT project began. SIFT stands for software implemented fault tolerance. At the time, NASA was primarily concerned with the inevitable future of computers flying commercial aircraft and spacecraft, and making sure they were reliable enough to perform that task. John Wensley and Leslie Lamport spearheaded this endeavor, with an example of their early work in 1978 here. In 1980, Marshall Pease, Robert Shostak, and Lamport co-wrote ‘Reaching Agreement in the Presence of Faults’ for SRI.
Within the context of aircraft control, fault isolation under a SIFT paradigm is achieved by using redundant bus systems to interconnect processing units. Software is used specifically for error detection and subsequent system reconfiguration, with iterative tasks recursively executed. Lastly, these iterated tasks are voted upon before being executed. Thus, a single point of failure can be tolerated by redundant execution. Notice the parallels here between Bitcoin’s POW mechanism. Distributed computing was essential for the progression and safety of aerospace technology. The ‘off-the-shelf’ minicomputers referred to in the 1980 paper by SIFT project engineers has the same spirit as Bitcoin in the early 2010s, when any regular consumer processor could mine the currency.
As noted by Lamport, prior to the writing of the aforementioned paper, it was “generally assumed that a three-processor system could tolerate one faulty processor. This paper shows that “Byzantine” faults, in which a faulty processor sends inconsistent information to the other processors, can defeat any traditional three-processor algorithm.”
A few years later in 1982, a similar paper was released entitled ‘The Byzantine Generals Problem’. This paper was successful and historically notable because of the author’s use of the allegory. This is an important reminder for the writer of any abstract idea: use the idea in a fictional story to make the point stick. A great recent example of using allegory might be Nick Bostrom’s story, The Fable of the Dragon-Tyrant. There’s a good reason why the chief founder of Ethereum, Vitalik Buterin, has this story front and center on his twitter profile. In the case of the Dragon-Tyrant story, it was used to highlight how humanity now has the tools to fight the ‘monster of death’ (in this case referring to aging). Within the context of decentralization, Buterin may be referencing it on his Twitter to make a comment on how the fundamentals of financials are changing. The Byzantine Generals allegory was used, ultimately, to make it easier for people to understand the importance of interactive consistency.
The Byzantine Generals Problem
The best way to understand this problem is to utilize the original analogy:
A group of generals plan to attack a castle, and they have to decide collectively if they should attack or retreat. Some of these generals may wish to attack, while others may wish to retreat. Most importantly, all of the generals have to decide what to collectively do, because a halfhearted attack will not be effective and will result in failure.
This problem is exacerbated by the potential for betrayal by some of the generals, who may selectively cast a vote for a suboptimal strategy by intention. Additionally, it is important to note that under this scenario the generals are physically separated, and send their votes via messengers who may change votes or fail to deliver them.
For instance: if there are nine generals, with 4 voting for an attack and 4 voting for a retreat, the ninth general has the potential for sending a vote of retreat to those in favor of retreating, and a vote of attacking for those in favor of attacking. Under this scenario, chaos will ensue if the ninth general does this and confuses the rest of the generals by falsely leading them to presume that a true consensus was made. Some generals attack, others retreat, and the day is lost.
How is this resolved?
Byzantine fault tolerance (as in, being able to avoid this scenario or one similar to it) is achievable if loyal generals can come to an agreement on their strategy. Most commonly, default vote values can be assigned to missing messages, where a ‘null’ value can be pre-assigned as a vote of ‘retreat’.
If we extend this analogy out to hardware and software, we can think of computers as ‘generals’ and internet communication protocols as ‘messengers’. In the analogy, the problem is solved purely with decision-making and having policies on how to operate. Within the realm of electronics, cryptographic digital signatures alone are not enough to solve the problem, because of the potential of voltage differences, ultimately (although rarely) resulting in erroneous encryption.
Byzantine Fault Tolerance (BFT)
The analogy gives us good footing for how to think about solving this problem using software.
Let’s break this down:
Byzantine faults are any faults that present differing symptoms to different observers.
Byzantine failures are any loss of a system service due to a Byzantine fault in any system that requires consensus.
Let’s define it: “The objective of Byzantine fault tolerance is to be able to defend against failures of system components with or without symptoms that prevent other components of the system from reaching an agreement among themselves, where such an agreement is needed for the correct operation of the system.”
The components that remain operationally correct within a BFT system will continue on with business as usual (presuming there’s enough components operating accurately to do so). Byzantine failures are relevant in that they are the most difficult class of failures among failure causes. Byzantine failures aren’t necessarily a security problem with people interfering with processes, usually issues arise from software faults or electrical interference. But hey, a problem is still a problem.
The analogy focuses on (or at least places a great emphasis on) deliberately misleading results. But it is important to note that there are five different ways a Byzantine failure can occur:
Fail stop - The node (processor) fails and stops operating
A failure to return a result
Responding with an inaccurate result
Responding with an intentionally misleading result
Responding with a different result to a different part of the system
A Formal Definition
Note: This definition is referenced directly from this paper. This is a simplified version of the problem. There are various proofs, algorithms, theorems, and definitions for the more ambitious mathematically inclined folks here.
Setting: Given a system of n components, t of which are dishonest, and assuming only point-to-point channel between all the components.
Whenever a component A tries to broadcast a value x, the other components are allowed to discuss with each other and verify the consistency of A's broadcast, and eventually settle on a common value y.
Property: The system is said to resist Byzantine faults if a component A can broadcast a value x, and then:
If A is honest, then all honest components agree on the value x.
In any case, all honest components agree on the same value y.
Variants: The problem has been studied in the case of both synchronous and asynchronous communications.
Within the formal definition of the Byzantine agreement (BA), there are two distinct variations: broadcast and consensus.
Broadcast (protocol): (or the Byzantine generals problem) is to have some designated player, called the sender, consistently send an input value (or message) to all other players.
Consensus (protocol): every player starts with an input value with the goal to make all players agree on the same output value. If all correct players hold the same input value then the output value is required to be the same as this input value.
As noted by Matthias Fitzi, when we take a closer look at these models, especially those with an active adversary (dishonest participants), the definition of consensus can allow only for a strict minority of corrupted players. This contrasts with the definition of broadcast, which allows for any number of players that are corrupted, because the validity conditions only depend on the correctness of the single sender of the protocol.
With that said, there are both synchronous and asynchronous versions of both broadcast and consensus protocols. As communication becomes async, it necessarily becomes weaker in the broadcast model. However, consensus in its original definition is still achievable with async communication.
Having covered the basics of Byzantine agreements, it is important to note that over the last 40 years there have been numerous papers and research surrounding the field of cryptographic protocols and fault-tolerance distributed computing. BA can be thought of as a primitive of these ideas.
Variations of BA include:
Interactive consistency: initially mentioned with the founding paper, the goal is to have all players broadcast a value to everyone and all correct players decide on the same vector of broadcast values.
Firing squad (problem): the goal is to make all correct players synchronously perform a common action during the same round, but players initially do not agree on a point of time when this action is to be performed.
Strong validity
Weak broadcast
Graded broadcast
King consensus
So, let’s get closer to Bitcoin. We know that communication cannot be synchronous, but we want to make the ‘right’ decision every time. How do we achieve that?
Practical Byzantine Fault Tolerance (pBFT)
Let’s briefly discuss practical Byzantine Fault Tolerance (pBFT) as it relates to Bitcoin. pBFT is not proof-of-work, but it is similar. Work began on this consensus algorithm in the late 90s with Miguel Castro and Barbara Liskov. It is specifically designed to work async, and with asynchronous systems, and is optimized for low overhead time. While BFT exists to safeguard against system failures, pBFT expands upon that by being faster (as it is newer).
pBFT’s consensus rounds are similar to Bitcoin’s PoW, but it has a number of advantages and disadvantages.
Advantages of pBFT:
It is energy efficient. Unlike PoW, it can achieve consensus without carrying out complex mathematical computations. Blockchains like Zilliqa, Hyperledger Fabric, and Tendermint employ the usage of pBFT in combination with PoW to be more energy efficient.
Transactions have faster finality, and do not require multiple confirmations (unlike Bitcoin).
Lower reward variances. The reward structure is set up differently, with every node being incentivized for responding to requests by the client (leading to lower variance).
Limitations of pBFT:
Sybil attacks. The primary mechanisms for pBFT are susceptible to these sorts of attacks.
Scaling in general. The more nodes added, the longer it takes to respond to requests. 0(n^k) where n is the messages and k is the number of nodes.
Scaling is the biggest issue for using pBFT in its purest form, which explains why platforms that utilize aspects of it often combine it with PoW or DPoS (Delegated Proof-of-Stake). The more nodes you add, the longer it takes for responses to happen within the network. This is where Bitcoin’s PoW steps in and really shines.
Bitcoin
Bitcoin is notable for tons of different reasons, but let us focus on just how it relates to BFT. Simply put, Bitcoin has PoW, which solves the Byzantine Generals Problem, because it achieves a majority agreement without a central authority. The network is trustless, so participants can be unknown, and the network is asynchronous. Bitcoin has been written about to death, so if you really want to dive into how the particularities work, I encourage you to start by reading the infamous whitepaper.
The PoW algorithm for Bitcoin is based on the SHA-256 hashing function to validate and confirm transactions and issue new bitcoins. The winner of a round of hashing records and aggregates transactions from the mempool into the next block. The ‘winner’ is randomly chosen proportional to the work done, incentivizing participants on the network to act honestly and only record true transactions.
The definition is in the name. Proof-of-work requires that nodes on a network provide evidence they have expanded computational work to achieve consensus, preventing bad actors from overtaking the network and achieving consensus in a decentralized manner.
PoS, DPoS, PoET, DAG, PoC, and HBAR
Bitcoin is BFT. However, there are other notable examples of BFT consensus mechanisms, some of which are better than others at actually achieving success. Taking a quick look at these different mechanisms highlights how thinking in this field of study has evolved over time.
Proof-of-Stake (PoS): a consensus algorithm that randomly assigns a node that will mine or validate transactions based on how many coins that particular node holds. The more tokens in a wallet, the more mining power granted. Many of the most popular blockchains today utilize PoS.
Delegated Proof-of-Stake (DPoS): users of the network vote and elect delegates to validate the next block. Delegates are also called witnesses or block producers. An example of a blockchain that uses DPoS is Tron.
Proof of Elapsed Time (PoET): a consensus algorithm developed by Intel Corporation that enables permissioned blockchain networks to determine who creates the next block. PoET follows a lottery system that spreads the chances of winning equally across network participants, giving every node the same chance.
Directed Acyclic Graphs (DAG): Unlike a blockchain, which consists of blocks, directed acyclic graphs have vertices and edges. Thus, crypto transactions are recorded as vertices. These transactions are then recorded on top of one another. Similar to a blockchain, however, transactions are also submitted to the DAG via nodes.
Proof of Coverage (PoC): The Helium Network uses Proof of Coverage (PoC), a novel work algorithm that rewards users for verifying coverage, thus proving location and network connectivity. Proof of Coverage is built on top of the Helium Consensus Protocol, which incorporates the HoneyBadgerBFT multi-party computation consensus protocol.
Hashgraph Consensus (HBAR): Unlike blockchains, hashgraphs do not bundle data into blocks or use miners to validate transactions. Instead, hashgraphs use a "gossip about gossip" protocol where the individual nodes on the network "gossip" about transactions to create directed acyclic graphs that time-sequence transactions. Each "gossip" message contains one or more transactions plus a timestamp, a digital signature, and cryptographic hashes of two earlier events. This makes Hashgraph form an asynchronous Byzantine Fault-Tolerant (aBFT) consensus algorithm.
Even diving into a single one of these alternatives could take weeks to truly grasp. Hashgraph is definitely the least well known, so I encourage anyone seeing this for the first time to learn more about Hedera and the work they are doing.
The Future: Quantum Protocols
If you’re still here, that means you’ve nearly made it to the end. That’s great. While the world is currently utilizing many of the protocols above to achieve fault tolerance, it is important to look to the future to see how blockchain technology might evolve, especially keeping in mind the pending singularity and inevitable collapse of our modern economy as we know it today.
When we think about the word ‘quantum’, keep in mind its definition as being ‘the minimum amount of any physical entity (physical property) involved in an interaction.’
Because the experiments and initial papers for these ideas are complicated, I’ll try to boil it down within the context of BFT. The whole idea is finding a way to achieve consensus using the basic quantum unit of qubits. Qubits are quantum bits, the quantum version of the classical version of a bit of information. Qubits are what are referred to in quantum mechanics as two-state. For an electron, you can think of the two states as being ‘spin up’ or ‘spin down’, and the photon equivalent would be vertical or horizontal polarization.
An important distinction is that within the field of quantum mechanics, qubits can be in a coherent superposition of both states at the same time. This forms the fundamental backbone of quantum computing.
For proposed quantum Byzantine agreements, the protocol utilizes a four-qubit entangled state. The protocol for the original algorithm presented by Shostak and Lamport is transformed into a quantum protocol.
The original paper is here. This quantum protocol is more or less a replication of the original problem with an emphasis placed on coin-flipping (you can think of this as an analogy for quanta being two-state). In 2007, the experiment was completed on this idea, which shows that while a method like this to achieve agreement wouldn’t currently be scalable, it might eventually be utilized if quantum computing evolves to the point where quantum computers are manufactured in bulk and running a distributed network. Needless to say, we are a solid twenty years away from anything resembling that happening. However, a quantum computer with ~1.9 billion qubits would be able to crack Bitcoin’s encryption within ~10 minutes. As we look into the 2030s, modifications may need to be made to prevent financial and technological collapse as a result of using quantum computers to crack encryption. What will decentralized trust look like by 2040?
Conclusion
We covered a lot of ground in this article. We’ve learned about the Byzantine Generals Problem, the backstory for the SIFT paradigm, what it means to be fault tolerant, and the various modern permutations that relate to interactive consistency at large. The future of our modern economy will come to increasingly rely on versions of these protocols as time progresses, and it is important to have at least a rudimentary understanding of how it all works, and how we got here. The ingenuity and intelligence of humanity shows no bounds, as clearly evident by the jaw-dropping implications of utilizing quantum computing as applied to these problems. Hopefully some of the takeaways from this article will give you a greater appreciation and reverence for the mechanisms at play that make our modern web3 economy possible.
References:
https://www.crypto.ethz.ch/publications/files/Fitzi03.pdf
https://www.microsoft.com/en-us/research/publication/sift-design-analysis-fault-tolerant-computer-aircraft-control/
https://www.researchgate.net/profile/Brendan-Hall/publication/226020071_Byzantine_Fault_Tolerance_from_Theory_to_Reality/links/5444f2570cf2a76a3ccdc6fc/Byzantine-Fault-Tolerance-from-Theory-to-Reality.pdf
https://academy.binance.com/en/articles/byzantine-fault-tolerance-explained
https://www.sciencedirect.com/science/article/pii/S1877050916304641
https://lamport.azurewebsites.net/pubs/byz.pdf