Chapter 2

Learning about Blockchain

IN THIS CHAPTER

Bullet Understanding distributed applications

Bullet Examining Bitcoin’s solution to the distributed dilemma

Bullet Building blockchains

Bullet Contrasting blockchains and databases

Bullet Describing ways to use blockchain

Blockchain technology is basically a distributed ledger that is shared among lots of computers and can run verifiable software to control how data is added. Blockchain technology depends on the capability to distribute data and software to many computers, using a technique called distributed processing. Distributed processing is the practice of spreading applications across multiple computers. It is a different way of looking at where data is stored and where application code runs than the more traditional centralized model.

Software applications have to run somewhere. Today’s applications can run on endpoint computers and devices, or on servers you connect to through a network. Regardless of where software runs, the computer or device running it has limited capacity. Growth has always been a challenge for computing environments, and at some point, users will probably want services faster than the computer running an application can handle. That’s where distributed processing comes into play.

In distributed processing, computers work together in teams to solve problems. If done well, distributed processing can help address the increasing demands that growth causes. However, it turns out that getting computers to work together in teams is hard.

Fortunately, a really smart researcher found a way to enable groups of computers that don’t trust each other to work together in a manageable way. This new approach to distributed processing and data storage is called blockchain technology, and it has revolutionized the way people think about distributed processing and trust.

In this chapter you learn about how cool blockchains are, how they are built, why they are different from anything in the past, and most importantly, what you can do with them.

Exploring Distributed Applications

Way back in the early days of computing, it became clear that computers couldn’t do everything. They could do some things really fast, such as solving math problems, but even when doing what they do best, computers would eventually run out of processing capability. The Apollo 11 moon landing almost didn’t happen due to a computer overload. The navigation computer in the lunar module was getting radar data too fast and threw 1201 and 1202 alarms. Those alarms basically meant that the computer couldn’t keep up with the data it was receiving. NASA engineers quickly determined that the error wasn’t bad enough to abort the mission, so the landing attempt continued. But for a few seconds, a computer overload almost caused NASA to scrub the landing.

Technical stuff Rumor has it that a deviation from the official NASA checklist ended up causing the lunar module 1201 and 1202 program alarms. According to the checklist, the docking radar should have been turned off once the lunar module undocked from the command module. The astronauts turned on the landing radar and left the docking radar on as well in case anything bad happened and they had to return to the command module. The navigation computer couldn’t handle input from two radars at once, so it triggered the program alarms.

Digging into distributed processing

One solution to application overload is to split up the computing load among multiple computers. What would have happened if the lunar module had been equipped with two computers? Maybe each one could have handled a different radar and no errors would have occurred. Of course, computers in 1969 were far larger and heavier than today’s devices. Adding a second computer at that time was just too heavy and expensive.

Today things are quite different. Our smartphones are way more powerful than the computers the astronauts took with them to the moon. And they’re far smaller and lighter too. Because computers are so small, fast, and affordable, we see distributed processing all the time in today’s applications. And networks are faster and cheaper to access. Most applications that run in a web browser or on a mobile device are distributed. That means part of the program runs in the browser or on the mobile device, and another part runs on a server.

For instance, when you shop online, your web browser connects to a web server to fetch a list of products. The web server probably connects to an application server and a database server to get the data, and then returns it to your web browser. If you try to fetch more data from the same website, it is highly likely that you’d end up connecting to a different server. The entire process is transparent because it appears that you're running software on one big computer. That’s the beauty of distributed processing.

Web applications are just one example of distributed processing. Other examples include specialized servers, such as graphics processing servers, and parallel processing, in which multiple CPUs or computers split up data and work on each part of the data at the same time. The goal in each case is the same: allow more users to run an application than is possible when using a single computer.

Tip Even though parallel processing really is a type of distributed processing, it's generally considered a separate type of computing. In traditional parallel processing, all processors have access to the same area of shared memory. Traditional distributed processing, however, uses multiple computers, each with its own separate memory.

Several popular architectures of distributed systems exist. The difference between architectures is in which components carry out different types of processing. The main distributed processing architectures follow:

  • Client-server: A capable client computer does much of the work, while relying on the server only to store and manage shared data. You’ll find this architecture in small offices that run software on workstations connected to a central database server.
  • Three-tier: Simple websites use this approach, in which a client connects to a server, such as a web server, to get some content. The web server often needs to get data from a database server, which might also handle some of the processing.
  • n-tier: This architecture is an extension of the three-tier architecture, where jobs are clearly defined and multiple servers are used for specific tasks. Server types in an n-tier architecture can include web servers, application servers, database servers, and other servers that provide specialized services. Most of today’s websites, such as shopping sites, are web applications running on n-tier architecture.
  • Peer-to-peer: In this architecture, all nodes, or participating computing components, are considered equal. Storage and processing is shared among nodes. Examples of peer-to-peer networks include file-sharing networks and the Linux software and updates distribution network.

Figure 2-1 shows the four main distributed processing architectures.


FIGURE 2-1: Distributed processing architectures.

Exploring problems with distributed processing

Distributed processing all sounds good, but there are problems with distributing programs and data. First, all computing nodes (computers running parts of a distributed application) have to trust one another. That’s not a problem if one company owns all of the computing nodes, but computers owned by competing companies simply do not trust one another. How can you trust that your competitor calculated that discount for your customer properly? Or even worse, suppose that your competitor saw a transaction for one of your customers and decided to cancel the transaction and then steal your customer? Lots of trust problems arise when attempting to distribute processing across multiple computing nodes that are not centrally controlled.

Scheduling and availability are other common problems. If you don’t own and manage all of the computing nodes, how can you make sure that they are always available when you need them? Could they be tied up running someone else’s applications? Or could one or more computing nodes be turned off or unavailable for some maintenance reason? These are just some of the problems with distributed processing.

Think of it this way: what if your family grew and you couldn’t fit everyone into your car anymore? If you couldn’t afford a bigger car, you’d have to do something to get your family from point A to point B. If your neighbor has no kids and a huge SUV, that could help solve your problem! All you have to do is get your neighbor to agree to share the SUV and coordinate your trips with your neighbor. But what if your neighbor doesn’t want to go where you want to go? How do you solve that problem? And what if one of the vehicles breaks down? Or what if your neighbor wants to sleep late on Saturday but your kids have a 7 a.m. game? Coordinating computers is at least as hard as coordinating cars.

When it comes to distributed processing, four main problems must be solved (and all of these problems relate directly to trust):

  • Launching remote processes: How a process on one computer launches a process on another computer.
  • Communicating between remote processes: How processes running on different computers communicate and coordinate activities.
  • Storing one version of data in multiple locations: How to store and update data on different computers without running into confusing differences.
  • Getting multiple computers to work together: How different computers handle issues such as resolving conflicts between computers and handling system load and outages.

Launching remote processes

Distributed processing makes it possible for one computer to spread the computing load by running part of the application on other computers. That means computer A has to launch part of the application on computer B. Security immediately becomes a problem. Traditionally, an operating system authenticates users that log in to that computer and then authorizes those users to run some programs. When a program run request comes from a different computer, figuring out how to limit who can run which programs can be difficult.

Assuming that you resolve the security issues, you have to define a protocol for how one machine requests that some process runs on another machine. You have to define what type of message should be sent and what data has to be included so that the target computer understands what it is being asked to do.

Communicating between remote processes

After computers can remotely launch processes on other computers, distributed systems have to communicate to work together. That means some process on machine A has to talk to another process on machine B, formally called inter-process communication (IPC), to get any work done between the two computers. The main problem with IPC is that all participating computers have to agree on the format of messages they want to exchange and the rules to communicate. Computer B may encounter problems or take longer than expected. In those cases, it has to be able to communicate back to computer A that things aren’t going well. And if things do go well, computer B needs to know how to send its results back to computer A.

Storing and synchronizing one version of data in multiple locations

One of the more difficult problems with distributing applications is storing data in multiple locations and keeping all the copies of data the same. Centralized data storage is a lot easier because there is only one copy of the data. Suppose Mary’s checking account balance on computer A is decreased (she bought a large cappuccino at the local coffee shop). At the moment the data is changed, Mary’s checking account balance stored on computer B is incorrect. If Mary then uses a mobile app to transfer money into her account and that mobile app happens to be running on computer B, her balance could be all messed up. If computer B’s balance is considered to be correct, the cost of the cappuccino hasn’t been deducted from Mary’s account and she has more money in her account than she should have. If computer A’s balance is considered correct, there is no record of the deposit and Mary has less money in her account than she should have.
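Mary’s situation can be sketched in a few lines of Python. This is only an illustration (the account name, amounts, and dictionary layout are made up for the example): each node holds its own copy of the balance, each update lands on only one node, and the two copies drift apart.

```python
# Two nodes each hold their own copy of Mary's checking balance.
node_a = {"mary": 100.00}
node_b = {"mary": 100.00}

node_a["mary"] -= 4.50   # the cappuccino lands only on computer A...
node_b["mary"] += 20.00  # ...while the deposit lands only on computer B

# Neither copy reflects both updates; the true balance should be 115.50.
diverged = node_a["mary"] != node_b["mary"]
```

Whichever copy you pick as "correct," one of Mary’s transactions is lost, which is exactly the dilemma described above.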

Getting multiple computers to work together

The last remaining big problem is just getting computers to work together nicely. Computers work independently quite well, but it takes effort to get them to work together. For instance, if two computers store copies of the same data and both run the same programs, users expect that both computers will keep their data the same. But if computer A crashes and users change data on computer B while computer A is down, the data will be different. When computer A boots, its data will be old and inaccurate, and it becomes difficult to get computers A and B to coordinate to get their data back in sync.

Even if one of the computers doesn’t crash, anytime users try to change the same data but on different computers, the two computers must negotiate to see which change should be allowed. These types of problems happen frequently and make distributed processing more difficult as you add more computers and users.

Presenting some solutions to distributed processing problems

Computer scientists have been working on the problems with distributed processing for several decades. No one has completely solved all of the problems, but there are solutions to each problem you just learned about.

Launching remote processes

Remote Procedure Call (RPC) and Remote Method Invocation (RMI) are just two ways to define how computer A can launch a process on computer B. These two approaches simply set the communication rules and message formats for how two computers can run distributed processes. These aren’t the only solutions to remote process launching, but they have been around for a while and lay the foundation for process distribution.
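At their core, RPC-style approaches boil down to agreeing on a message format. The Python sketch below is a hypothetical, hand-rolled protocol (not real RPC or RMI code): computer A serializes a request naming a procedure and its arguments, and computer B looks the procedure up in a registry and runs it.

```python
import json

# Procedures the "remote" computer is willing to run for callers.
REGISTRY = {"add": lambda x, y: x + y}

def make_request(proc, *args):
    """Computer A: encode which procedure to run and with what data."""
    return json.dumps({"proc": proc, "args": args})

def handle_request(message):
    """Computer B: decode the request and run the named procedure."""
    request = json.loads(message)
    proc = REGISTRY[request["proc"]]  # an authorization check belongs here
    return proc(*request["args"])

result = handle_request(make_request("add", 2, 3))
```

Real RPC and RMI systems add networking, authentication, and error handling on top of this basic idea, but the agreed-upon message format is the foundation.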

Communicating between remote processes

The capability for processes running on different computers to communicate with one another is formally called inter-process communication (IPC). IPC is necessary to get any work done between the two computers. The main problem with IPC is that all participating computers have to agree on the format for exchanging messages and the rules for communication. Different standards, each with its pros and cons, exist. As with all distributed processing issues, all participating computing nodes must agree on how and when they communicate and what the messages look like.

Storing one version of data in multiple locations

Lots of approaches to synchronizing multiple copies of data exist. The biggest question is how to handle stale copies of data when one copy gets changed. One method is to mark the unchanged copies as “bad” or “stale” until the changed copy of data is written to the other copies. This approach raises all kinds of problems with timing and concurrency. Eventually, two users will update the same data on different computers at about the same time. A set of rules must be in place to govern which user wins.

Other methods for keeping data in sync are to apply locks to data before updates are allowed and to support merging multiple copies of data. Yet another approach is to place a timestamp on all data updates and resolve all conflicts by accepting the earliest change to the data. All existing approaches make developing and using distributed data applications more difficult, which is why computer scientists continue to search for a better way.
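The timestamp approach just described fits in a few lines of Python. This is a sketch only (the update records are made-up examples): every update carries a timestamp, and a conflict is resolved by accepting the earliest change.

```python
# Conflicting updates to the same record, captured on different nodes.
updates = [
    {"node": "B", "timestamp": 1007, "value": "shipped"},
    {"node": "A", "timestamp": 1003, "value": "cancelled"},
]

def resolve(conflicting_updates):
    """Accept the earliest change, per the timestamp rule above."""
    return min(conflicting_updates, key=lambda u: u["timestamp"])

winner = resolve(updates)
```

Node A’s change wins here because it happened first, and node B’s later change is discarded, which shows why every such rule leaves somebody’s update on the cutting-room floor.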

Getting multiple computers to work together

The last problem has perhaps the fewest standard solutions. In most cases, coordination among distributed computing nodes is based on one of two approaches: temporary dominance or consensus. Temporary dominance means that one computing node becomes a node with authority and decides a course of action. Some approaches arbitrarily assign nodes to have decision authority in a round-robin approach, and others have nodes vote for a leader when they have a conflict. Either way, this approach depends on granting one computing node the authority to decide.

The other main approach is to have all participating computing nodes engage in some game to come up with a decision. When a majority of nodes agree on some outcome, the group has reached consensus and accepts the majority decision. Many types of consensus “games” exist, and many are based on having computers solve puzzles. You learn more about consensus later in this chapter. Consensus is a major part of the “big solution.”

Examining the Bitcoin Solution to the Distributed Dilemma

In 2008, Satoshi Nakamoto published “Bitcoin: A Peer-to-Peer Electronic Cash System.” That paper contained a description of a new system of handling electronic currency. It described a data structure that consisted of a chain of special blocks, called a blockchain. This new approach makes it possible for many nodes that do not trust one another to exchange currency without a central authority.

Satoshi Nakamoto is a fictional name. Even today, we still don’t know who wrote that paper. The author could be a single person or a group of people. Regardless, Nakamoto started a revolution in distributed computing.

Nakamoto proposed block chain (now known as blockchain) technology to implement the new cryptocurrency called bitcoin. In a few short years, bitcoin has become a viable currency and blockchain has started changing the way we look at distributed processing.

Remember One common mistake when you’re new to blockchain is to confuse bitcoin and blockchain. They were proposed at the same time in the same paper, but they aren’t the same. Bitcoin is a cryptocurrency that is an implementation of blockchain technology. Blockchain can be implemented in many ways, not just to support bitcoin. The subject of this book, Ethereum, is another wildly popular implementation of blockchain.

Let’s take a look at how blockchain provides a solution to each of the problems with distributed processing.

  • Launching remote processes: Blockchain technology is based on a collection of computing nodes connected in a peer-to-peer network. That means no node has more authority than any other node. Each blockchain node runs as a completely independent computing device and doesn’t support launching remote processes on other nodes.
  • Communicating between remote processes: This one is easy. Remote processes don’t communicate in blockchain technology because there are no remote processes.

    Technical stuff It looks like we’ve just ignored the first two problems with distributed processing. That’s because blockchain technology is out to solve only a few problems, not all of them. By ignoring remote processes, blockchain simplifies its approach to distributed processing and storage.

  • Storing one version of data in multiple locations: Perhaps the greatest contribution of blockchain is how the central data structure is constructed. A blockchain is an ever-growing chain of blocks, with the blocks linked into a chain structure. New blocks can be added only if a majority of the nodes agree to each addition. After a block has been added to the blockchain, it can't be modified. This feature solves the problem of keeping old data in sync. The only problem left to solve is coordinating how the blockchain expands.
  • Getting multiple computers to work together: The other large contribution of blockchain is in defining how peers — nodes that operate at the same authority — work together. They have to agree on when to add blocks and under what rules. The blockchain definition sets up these rules in a simple and straightforward way that makes it hard to break the rules and easy for everyone else to see if any node did so.

You’ll see that blockchain uses distributed processing to handle data storage and trust issues, and doesn’t focus on performance.

Describing Blockchains

At its core, a blockchain is pretty simple: It is a bunch of blocks of data linked into a chain. All blockchains start with a genesis block. The only thing that makes a genesis block special is that it isn’t linked to a previous block. The genesis block contains header info and contents data. All other blocks also contain header and contents data, but they also contain a link to their predecessor block.
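The block structure just described can be sketched in Python. This is a simplified illustration, not any real blockchain’s format: the field names are invented for the example, and SHA-256 over a JSON-encoded header stands in for whatever hash function a real chain uses.

```python
import hashlib
import json
import time

def make_block(number, contents, previous_hash=None):
    """Build a block; the genesis block passes previous_hash=None."""
    header = {
        "number": number,
        "timestamp": int(time.time()),
        "previous_hash": previous_hash,  # link to the predecessor block
    }
    return {"header": header, "contents": contents}

def block_hash(block):
    """Hash of the block's header, used as the link by the next block."""
    encoded = json.dumps(block["header"], sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()

genesis = make_block(0, {"note": "first block"})
block_1 = make_block(1, {"payment": 4.50}, previous_hash=block_hash(genesis))
```

Notice that the genesis block has no previous block link, while every later block stores the hash of its predecessor’s header.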

Tip Each block’s data can have different contents, in different formats. Blockchain block contents aren’t constrained in the same way as database records are. The structure of data that you store in each block can be dynamic, to fit the data you're storing.

Examining blockchain details

You can think of a blockchain as being a big spreadsheet, except that each row may have different columns and a different number of columns. Instead of being identified by row number and column letter, each data value is identified by a key. That makes it easy to identify data in each block.

At a higher level, a blockchain can be viewed as a big spreadsheet that is shared with every node in the blockchain network. Every copy of the blockchain is identical, and all nodes must agree before any new blocks are added to the blockchain (think of adding new rows to the spreadsheet). That way, the blockchain always stays in sync.

All blocks, except the genesis block, include a previous block link. This link is a cryptographic hash of the previous block’s header metadata. A cryptographic hash is a number that uniquely represents a block of data. It is the output of a mathematical function given the data of the block as input. Hash functions make it easy to calculate a fixed-length number that represents a large amount of input data. And even though a hash function returns a number far shorter than its input, the returned hash value is effectively unique for the data used as input.
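You can see the fixed-length property for yourself with Python’s standard hashlib module (SHA-256 here is just a stand-in for whatever hash function a given blockchain uses):

```python
import hashlib

short_input = b"Mary bought a cappuccino"
long_input = b"x" * 1_000_000  # a megabyte of input data

short_hash = hashlib.sha256(short_input).hexdigest()
long_hash = hashlib.sha256(long_input).hexdigest()

# Both digests are 64 hex characters (256 bits) long, no matter how
# large the input is, and different inputs yield different digests.
```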

Different blockchains use different hash functions. For example, Ethereum uses the Keccak-256 hash function to calculate the hash value on the previous block. Ethereum uses that hash value as the link to attach a block to the previous block on the chain. The link Ethereum uses is the result of the Keccak-256 calculation of the previous block’s header information and a random number, called the nonce value. Ethereum nodes compete to be the first to find the right nonce that results in a hash value matching the current complexity target. Figure 2-2 shows the blockchain architecture.


FIGURE 2-2: Blockchain architecture.

Technical stuff Blockchain uses data from a block, along with a nonce, to calculate a hash value that represents the block. The word nonce means “a number that is used only once.” A nonce is used to increase the uniqueness of a hash value for a block. Calculating a hash on a block using two different nonces will return two different hashes.
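Here’s a quick Python sketch of the nonce effect (again using SHA-256 as a stand-in hash function, and a made-up way of combining the data with the nonce): hashing identical block data with two different nonces produces two different hash values.

```python
import hashlib

block_data = b"block 51 header data"

def hash_with_nonce(data, nonce):
    """Combine the block data with a nonce, then hash the result."""
    return hashlib.sha256(data + str(nonce).encode()).hexdigest()

hash_a = hash_with_nonce(block_data, 1)
hash_b = hash_with_nonce(block_data, 2)
# hash_a and hash_b differ even though the block data is identical.
```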

Each Ethereum block contains some header information, including a timestamp, a block number, a version number, and other descriptive information, and content data. The content data can be any data that makes up the contents of the block, which can be plaintext data, encrypted data, or even executable code. A blockchain, when described only in terms of the blocks, looks like a straightforward data structure. But the real power of blockchain is how the data structure is created, extended, and used in applications.

Current blockchain implementations define blockchains as immutable data structures, which means that after each block is added to the blockchain, it can never be changed. This immutability property helps to solve one of the more difficult problems of storing distributed data in multiple locations. If the blockchain cannot be changed after a block has been added to the chain, the only remaining problem with data synchronization is how to control when blocks are added to the chain. All blockchains have clear rules that control the process of adding blocks.

Protecting blockchain visibility

You can build two types of blockchain: public and private. Your choice depends on what you’re trying to do with your blockchain. Public blockchains are available to pretty much anyone, but private blockchains are only available to users authorized by the blockchain owner, as shown in Figure 2-3.


FIGURE 2-3: Public versus private blockchains.

Public blockchain

Anyone can interact with a public blockchain, also called a permissionless blockchain. All you need is a valid address, and you can read the blockchain and even submit transactions. This is the most popular type of blockchain, and one that most people think of when associating blockchain with cryptocurrency. Public blockchains ensure that no one organization controls the blockchain because any computer can become a node and each computer maintains a full copy of the blockchain.

Technical stuff Not all nodes store full copies of blockchain blocks. Full nodes do maintain complete copies of the blockchain, but lightweight nodes store just some blocks of the blockchain. Lightweight nodes often store recent blocks and provide transaction validation services for clients.

Private blockchain

Prior authorization is required before you can access a private blockchain, also called a permissioned blockchain. Private blockchains are almost always owned by a single organization or a small group. The blockchain owner requires that each blockchain user request authorization to interact with the blockchain data and provide access credentials with each access request. Private blockchains provide organizations the features of blockchain applications without having to expose all of their data to the general public.

Building Blockchains

You’ve already learned that blockchains are immutable and all nodes have to agree before new blocks can be added to the blockchain. Let’s look at how those two requirements are enforced.

Agreeing to add blocks

The first rule blockchain nodes must agree to is how to allow new blocks to be added to the blockchain. Because no node has more authority than any other node, the nodes use consensus to agree to add new blocks. Consensus in this sense simply means that when enough nodes agree to take some action, the action is approved and agreed upon by all nodes. Most consensus strategies use simple majorities to succeed. So, as long as more than half of the nodes agree to take some action, the action is approved.
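The simple-majority rule boils down to one comparison, sketched here in Python (the list-of-booleans vote representation is an assumption for illustration):

```python
def majority_approves(votes):
    """votes holds one boolean per node: True = approve the action."""
    return sum(votes) > len(votes) / 2

# Five nodes, three in favor: more than half agree, so it's approved.
approved = majority_approves([True, True, True, False, False])

# Five nodes, only two in favor: no majority, so it's rejected.
rejected = majority_approves([True, True, False, False, False])
```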

Several consensus approaches are in use or proposed:

  • Proof of Work: Proof of Work (PoW) is the most popular consensus protocol used today, and is used by both bitcoin and Ethereum. Proof of Work means that some nodes compete to try to be the first to solve a mathematical puzzle. The puzzle is to find a random value to combine with a block’s header, such that the hash of the combined data matches a pattern. Solving the puzzle is hard, but verifying the solution to the puzzle is easy. The first node to solve the puzzle receives a reward for doing the work, and gets to add the new block to the blockchain. The block, along with the value the winning node found to solve the puzzle, is sent to all nodes. Each node quickly verifies the block and then adds it to their local blockchain. Although Proof of Work is the most popular consensus protocol and works well, it takes enormous computing power to complete. That means Proof of Work requires computers to use lots of energy, which produces a lot of heat.
  • Proof of Stake: The Proof of Stake (PoS) consensus protocol will likely replace Proof of Work. The developers of Ethereum already have plans to move to this protocol. The Proof of Stake protocol provides a similar level of consistency as the current Proof of Work protocol without using so much computing power (and wasting energy). Each node that wants to compete to add a new block locks some of its cryptocurrency and submits it as a bet. The “winning” node that gets to add the new block to the blockchain is chosen based on the size of the bet and other criteria intended to randomize the selection. The random part of the selection criteria keeps the richest node from always adding new blocks.
  • Delegated Proof of Stake: Delegated Proof of Stake (DPoS) is a modified PoS protocol. Most of the pool of candidate nodes are selected as in the PoS protocol, but a small number of additional nodes are added to the pool based on votes. All nodes in the network can vote for some nodes to be included in the selection pool. The nodes receiving the highest number of votes are added to the selection pool, and the winner is randomly selected from all nodes in the pool. DPoS makes PoS fairer and less likely to favor the richest nodes.
  • Delegated Byzantine Fault Tolerance (dBFT): The last consensus protocol is based on a dilemma encountered in all distributed systems: the Byzantine Generals’ Problem. This problem is a hypothetical situation that makes it easy to see how hard it is to get a consensus. Suppose nine generals and their armies from the Byzantine Empire have surrounded Rome and are waiting to attack. The generals have agreed that they must all attack or retreat in unison. If any general breaks rank and doesn’t do what the other generals do, they all will be defeated. In this case, consensus is necessary for survival. Because generals can communicate only through couriers, any courier could be bribed or even captured. Either of these actions would cause a message to be lost or changed. Any general could also be bribed to lie or just become scared and make the wrong decision. It is difficult for any general to trust that all other generals agree on any decision. The dBFT protocol ensures that all generals agree on a single course of action, even when some messages are changed or lost.

    The dBFT protocol is based on groups of nodes electing a delegate to represent them. Each time a new block is proposed for the blockchain, a speaker is randomly selected from the delegates. The speaker calculates the block’s hash and sends that to all other delegates. If at least two-thirds of the delegates agree with the calculated hash, the block is added to the blockchain. Otherwise, the block is discarded and the process starts over with new delegates and a new speaker being selected.
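To make the Proof of Work idea from the list above concrete, here is a toy version of the puzzle in Python. Real networks use their own hash functions and far harder targets; this sketch uses SHA-256 and a very easy difficulty of two leading zeros. The puzzle: find a nonce such that the hash of the header plus the nonce matches the target pattern.

```python
import hashlib

def proof_of_work(header, difficulty=2):
    """Search nonces until the hash meets the difficulty target."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(header + str(nonce).encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest  # hard to find...
        nonce += 1

def verify(header, nonce, difficulty=2):
    """...but a single hash is enough to check someone else's answer."""
    digest = hashlib.sha256(header + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce, digest = proof_of_work(b"block 52 header")
```

Notice the asymmetry: finding the nonce takes many hash attempts, but any node can verify a claimed solution with just one hash. That asymmetry is what makes PoW work, and the brute-force search is also why it burns so much energy.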

Making blocks immutable

The reason why so much effort is put into ensuring consensus is that after a block is added to the blockchain, it never changes. Well, that’s the goal. Technically, it is possible to change blockchain data, but it is very, very hard to do and very easy for anyone to detect the change. With the PoW consensus protocol, the level of effort alone makes changing blocks pretty close to impossible. Let’s see why.

Before you add a block to the blockchain, you must calculate a cryptographic hash of the previous block. That is the link to the previous block and the guarantee that it will never change. When you calculate the hash value of the previous block, that block’s header (which is part of the data used to calculate the hash value) includes the hash of its predecessor block. So if anyone ever changes any block, all blocks in the blockchain after that one are invalid. They’re invalid because the hash values for all subsequent blocks don’t match up.

It is easy to validate a blockchain. All you have to do is step through the blockchain, block by block, and make sure that the hash value stored in each block is the correct hash of the previous block header. As long as all the hashes match, the blockchain is valid. That’s why blockchains are called immutable. You can change blockchain data, but doing so immediately invalidates that copy of the blockchain.
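The validation walk just described can be sketched in Python (as before, SHA-256 over a JSON-encoded header is a stand-in for a real implementation’s hashing scheme):

```python
import hashlib
import json

def header_hash(block):
    encoded = json.dumps(block["header"], sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()

def chain_is_valid(chain):
    """Step through block by block, checking each stored link."""
    for previous, current in zip(chain, chain[1:]):
        if current["header"]["previous_hash"] != header_hash(previous):
            return False
    return True

# Build a tiny three-block chain, then tamper with the middle block.
chain = [{"header": {"number": 0, "previous_hash": None}}]
for n in (1, 2):
    chain.append({"header": {"number": n,
                             "previous_hash": header_hash(chain[-1])}})

valid_before = chain_is_valid(chain)  # the untouched chain validates
chain[1]["header"]["number"] = 99     # change anything in block 1...
valid_after = chain_is_valid(chain)   # ...and validation now fails
```

Changing even one field in block 1 changes its header hash, so the link stored in block 2 no longer matches and the whole copy of the chain is exposed as tampered.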

Reviewing the building process

Now that you know about consensus and immutability, let’s look at the steps used to build a blockchain:

  1. Users submit requests to a blockchain node. Requests can be financial transactions, code to run, documents, or really any data.
  2. When a node has enough data to create a new blockchain block, it organizes the data and adds header information, including block number, timestamp, and other descriptive details.
  3. The complete block is submitted for a consensus decision. Blockchain nodes carry out the steps in the consensus protocol to determine whether the new block should be added.
  4. Each node validates that the block adheres to all requirements, and then adds it to its local copy of the blockchain.
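The first two steps can be sketched in a few lines of Python. The three-item block size and the field names are assumptions made for illustration; real blockchains limit blocks by size or gas and carry much richer headers:

```python
import time

class Node:
    """Sketch of steps 1 and 2: accept user requests, then assemble a
    new block once enough data has accumulated."""
    BLOCK_SIZE = 3  # assumed threshold, purely for illustration

    def __init__(self):
        self.pending = []  # requests waiting to be placed in a block
        self.height = 0    # number of the last block built

    def submit(self, request):
        """Step 1: a user submits a transaction, code, or document."""
        self.pending.append(request)
        if len(self.pending) >= self.BLOCK_SIZE:
            return self.build_block()
        return None  # not enough data for a block yet

    def build_block(self):
        """Step 2: organize the data and add header information."""
        self.height += 1
        block = {
            "number": self.height,
            "timestamp": time.time(),
            "data": self.pending[: self.BLOCK_SIZE],
        }
        self.pending = self.pending[self.BLOCK_SIZE :]
        return block
```

Steps 3 and 4 would then hand the finished block to the network’s consensus protocol before any node appends it to its local copy.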

Keeping all blockchains consistent

After following these steps, every copy of the blockchain should contain the same blocks, but it doesn’t always work out that way. Although I've said that a blockchain is just a linked chain of blocks, there is more to it. The blocks in a blockchain are stored in a tree structure for efficient processing, and the actual list of blocks on the blockchain is stored in the linked (or chained) list called the active chain. If two separate nodes solve the consensus puzzles for two different blocks at about the same time, both nodes transmit their blocks to the entire set of blockchain nodes. Some nodes add block A and others add block B. Now the blockchain is not the same across the network.

This can happen in real life, but it lasts only a short while. Within minutes, a new block is added to the blockchain. The node that solved the puzzle solved it for its own copy of the blockchain, which means the new block depends on the previous block being either block A or block B. Let’s assume this new block is based on the blockchain that previously ended with block B. When the winning node sends its block to all other nodes, the nodes with A as the last block fail to verify the new block (because its hash was calculated from block B). Those nodes reject the block, and now blockchains of different lengths exist on different nodes. Although digging into the details of blockchain construction is interesting, the topic is beyond the scope of this book.

Ethereum defines a consistency rule that states that when blockchains of different lengths exist on different nodes, the longest blockchain is the correct one. So everyone discards the blockchain that ends with block A and uses the longer blockchain. Block A may go away, but all of its transactions are put back into the pool to be placed into the next block on the blockchain. So even though block A didn’t make it onto the stable blockchain, its contents may still show up in an upcoming block.
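Here’s a toy Python version of that consistency rule. Blocks are bare dictionaries holding a list of transactions, which is enough to show both the longest-chain choice and the re-pooling of orphaned transactions; the `resolve_fork` function is invented for this sketch:

```python
def resolve_fork(local_chain, peer_chain):
    """Longest-chain rule sketch: when two valid chains of different
    lengths exist, keep the longer one. Transactions from discarded
    (orphaned) blocks go back into the pool for a future block."""
    if len(peer_chain) <= len(local_chain):
        return local_chain, []  # our chain wins; nothing to re-pool
    # Blocks we accepted that are not on the winning chain are orphans.
    orphaned = [blk for blk in local_chain if blk not in peer_chain]
    repooled = [tx for blk in orphaned for tx in blk["data"]]
    return peer_chain, repooled
```

A node holding the shorter chain ending in block A would adopt the longer chain and return block A’s transactions to the pending pool.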

Understanding How Blockchains and Databases Store Data Differently

Up to now, it may seem that storing data in a blockchain is pretty much the same as storing it in a database. While the data is at rest (no one is accessing it), that may be the case. However, big differences exist in how data on a blockchain and data in a database are stored and used.

Storing data in a traditional database

Traditional databases store data in a central location. Clients connect to that central location to read and write data. Regardless of the architecture of the database, you can generally do four things with data: Create, Read, Update, and Delete. These are called the CRUD operations:

  • Create: Add a new record to a database, possibly with some generated identifying data.
  • Read: Locate an existing database record, generally through a search of key or index data, and then copy the record into a memory buffer for local access.
  • Update: Copy local changes to data back into the original record in the database. The update operation saves updated data in the database.
  • Delete: Locate an existing database record, much like with the read operation, but then remove the record from the database. A deleted record no longer exists; you can't access the previous contents of a deleted record.
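Here’s what the four CRUD operations look like against a real (if tiny) database, using Python’s built-in sqlite3 module. The accounts table and its columns are invented purely for the example:

```python
import sqlite3

# In-memory database standing in for a central database server.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
)

# Create: add a new record; the id key is generated for us.
cur = db.execute(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("alice", 100.0)
)
account_id = cur.lastrowid

# Read: locate the record through its key and copy it into memory.
row = db.execute(
    "SELECT owner, balance FROM accounts WHERE id = ?", (account_id,)
).fetchone()

# Update: write locally changed data back to the original record.
db.execute("UPDATE accounts SET balance = ? WHERE id = ?", (250.0, account_id))

# Delete: remove the record; its previous contents are gone for good.
db.execute("DELETE FROM accounts WHERE id = ?", (account_id,))
```

Note that after the Delete, nothing in the database remembers the record ever existed, which is exactly the behavior a blockchain refuses to allow.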

Because data in a traditional database is stored in a central location, it is possible for multiple clients to read the same data, modify that data locally, and then write it back to the database in an update operation. If client A and client B access the same data at the same time, and both modify that data, only one client’s changes can survive. For example, if client A saves changes first, then when client B saves changes, client B’s changes overwrite client A’s.

This process illustrates a classic concurrency problem, one that database management systems (DBMSs) have long struggled with. Today’s databases generally use one of three techniques to avoid having clients overwrite other clients’ valid changes:

  • Locking: The DBMS locks a record, or group of records, as the client reads them. While that client keeps local copies of the records, no other clients can access those records for updating. The client releases its locks when it is finished with those records, which allows other clients to apply their own locks. This approach is safe, but it makes it hard for many clients to share common data because it forces clients to wait in line for data to update.
  • Timestamp ordering: Each transaction carries a timestamp, and so does each record’s most recent read and write. When a client wants to read or write a record, the DBMS compares these timestamps to determine whether the operation is safe from data collisions, and allows it only when it is. In this scenario, trying to read a dirty record (one that is being updated by another client) could cause your transaction to terminate, which makes it harder to write user-friendly applications.
  • Optimistic concurrency control: The previous two options assume that collisions will occur. Optimistic concurrency control assumes that collisions will not occur frequently. Clients can read records any time without restriction. When a client attempts to write a record, the DBMS compares the previously read record with the current record in the database. If these differ, another client has updated the record, and the write fails. If the record has not been changed, the write succeeds. This concurrency control technique generally supports the most scalable application design.
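Optimistic concurrency control is simple enough to sketch over a plain Python dictionary standing in for the database; `optimistic_write` is a made-up name for the compare-before-write check a DBMS performs:

```python
def optimistic_write(db, key, original, updated):
    """Optimistic concurrency sketch: the write succeeds only if the
    record still matches the value this client originally read.
    Otherwise another client got there first, and the caller must
    re-read the record and retry."""
    if db.get(key) != original:
        return False  # collision detected; the write fails
    db[key] = updated
    return True
```

If clients A and B both read version "v1" and B writes first, A’s later write fails, and A must re-read and retry rather than silently overwrite B’s change.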

Traditional databases make it easy for applications to share data, carry out CRUD operations, and maintain data consistency in high-throughput environments. They don’t do such a good job at maintaining audit trails of data changes. They also require substantial effort to avoid having a database failure crash the entire application.

A distinct advantage to storing data in a database is access performance. DBMSs take advantage of features such as indexes to decrease the time it takes to locate or sort records. Record access is often one of the critical indicators of overall database application performance. Because DBMSs are optimized for performance, this storage option works well where users demand quick response and high throughput.

Storing data in a blockchain

A blockchain handles data differently than a traditional database does. One of the biggest differences is that a blockchain does not support CRUD operations. The only operations are Write, which corresponds to Create (you populate the data and then write it), and Read. After data has been placed in a block and added to the blockchain, that data cannot change. A blockchain has no Update and Delete operations.

The other big difference between blockchain data storage and databases is their location. A complete copy of the blockchain is stored on every full blockchain node. Much of the difficulty in maintaining a blockchain network is ensuring that all blockchain nodes store the same data. Each blockchain implementation has strict rules for maintaining a synchronized blockchain across the network, and those rules make detecting differences between nodes easy (and quick).

This distributed storage property of a blockchain makes it extremely resilient, because the failure of any node or nodes will have a negligible effect on the rest of the blockchain network.

Blockchain storage was never designed for high-performance situations. The storage method does support fast traversals through the block tree, but accessing individual data items within blocks takes some time. Remember that blocks can contain data in different formats, which must be filtered for searching.

Table 2-1 summarizes the differences between storing data in a database and on a blockchain.

TABLE 2-1 Differences between Databases and Blockchain

Feature                Traditional Database                                    Blockchain

Location               One central database copy                               Each node stores a complete copy of the blockchain

Operations supported   Create, Read, Update, Delete (CRUD)                     Read, Write

Performance            Optimized for short response time and high throughput   Not optimized for performance

Integrity              Dependent on DBMS and application                       Consensus and immutability provide integrity

Transparency           As allowed by central DBMS                              Each node stores a complete copy of the blockchain

Control                Centralized                                             Decentralized

Effectively Using Blockchains

Blockchain offers some interesting features, but it might not be a good technology for every situation. Before jumping in and trying to design a blockchain application, think about how blockchain may meet some of your design goals but may not meet others. In this section, we look at some features that blockchain offers.

Transferring value without trust

One of the unique strengths of blockchain technology is that it supports transferring items of value between entities that do not trust one another. In fact, that’s the big pull for blockchain. You have to trust only the consensus protocol, not any other user. Your transactions are carried out in a verifiable and stable manner, so you can trust that they are being handled properly and securely.

Reducing transaction costs by eliminating middlemen

Whether you’re considering transferring money from one party to another or providing a product for payment, nearly all transactions need a middleman, such as a banker, an importer, a wholesaler, or even a media publisher. Because blockchain allows entities that don’t trust each other to interact directly, it eliminates most middlemen. Blockchain makes it possible for producers to interact directly with consumers. For instance, artists can offer their art directly to buyers, without needing a broker or a publisher, and these savings can be passed directly to the consumer. Although blockchain transaction handling does incur a small cost, it is generally much less than what middlemen charge. That’s good for producers and consumers.

Increasing efficiency through direct interaction

Lower fees aren’t the only benefit of eliminating middlemen. Any time you can remove one or more steps in a process, you increase efficiency. Greater efficiency generally means reduced time required for a process to complete. For example, suppose a musician decides to release her latest single directly to her fans by using a blockchain delivery model. Her fans can consume the new single the moment it drops. With a publisher, there is some delay while the content is delivered, approved, packaged, and then finally released.

Although the delay for digital media may be minimal, blockchain can eliminate any delays introduced by middlemen. The contrast becomes even clearer when looking at managing the process of delivering physical goods by using blockchain. If you buy strawberries from California, have you ever thought about how many processors stand between you and the grower? Blockchain can reduce the number of people who participate in the supply chain for pretty much anything.

Maintaining complete transaction history

Another design feature of blockchain is its immutability. Because you can’t change the data, anything written to the blockchain stays there forever. “What happens in blockchain, stays in blockchain.” That’s good news for any application that would benefit from a readily available transaction history. Let’s revisit the strawberries example. You may go to the grocery store today and buy strawberries with a label that says “Fresh from CA.” You really have no way of knowing whether the strawberries came directly from California or first from, say, Spain (the second-leading exporter of strawberries). But with blockchain, you could trace a pint of strawberries all the way back to the grower. You’d know exactly where your strawberries came from and when they were picked. This level of transaction history exists for every transaction in blockchain. You can always find any transaction’s complete history.
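Tracing that pint of strawberries amounts to a linear scan of the chain. This Python sketch invents the block and transaction fields; real implementations index this data for speed, but the idea is the same:

```python
def trace_history(chain, item_id):
    """Walk every block in the chain and collect each transaction
    that mentions a given item -- say, one pint of strawberries.
    The 'data', 'item', and 'event' fields are made up for this sketch."""
    return [
        tx
        for block in chain
        for tx in block["data"]
        if tx.get("item") == item_id
    ]
```

Because no block ever changes, the list this scan returns is the item’s complete, tamper-evident history from grower to grocery shelf.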

Increasing resilience through replication

Every full node in any blockchain network must maintain a copy of the entire blockchain. Therefore, all data on the blockchain is replicated to every full node, and no node depends on data that another node stores. This feature is a big deal for resilience. In a blockchain application, several nodes could crash or otherwise be unavailable without affecting the other users of the application. Fault tolerance is built into the blockchain architecture. In addition, distributing the entire blockchain to many nodes owned by many different organizations practically eliminates the possibility of any organization controlling the data.

Any application that benefits from high availability and freedom of ownership may be a good fit for blockchain. Many database applications go to great lengths to replicate their data to provide fault tolerance, and blockchain has it built right in!

Providing transparency

The last main category of blockchain features is directly related to the fact that the entire blockchain is replicated to every full blockchain node. Every full node can see the entire blockchain, which provides unparalleled transparency.

The data stored in blocks may be encrypted, but the data itself, encrypted or not, is available to any user of any node. To decrypt encrypted data, a user needs the proper decryption key(s); if the data is unencrypted, anyone with access to a node or the blockchain itself can read it. Blockchain transparency makes it possible to trust the integrity of the data. Nodes routinely verify the integrity of each block, and therefore of the whole blockchain, so any modification to the “immutable” blockchain data becomes immediately evident and easy to fix.
