7

How Facebook Goes Fast

LESSON 5:
Speed is a feature.

Background: The smartphone has made us feel that we have both more time (a minute to check the news, scores or Internet memes in line at Starbucks) and less (so much information, so little time), so the services we love must be fast.

Facebook’s Move: To deliver the most engaging services in the world, Facebook takes all the complexity of the software, hardware, networking, buildings and even electrical consumption on their own shoulders so that its users can take it all for granted and then gives away the vast majority of its technology to the industry to stimulate faster development. What is above can be only as good as what is below.

Thought Starter: What can you do with your infrastructure to be faster?

This is the story of how to build a cloud. A very big cloud.

Remember Chamath, Javier, Alex and Naomi and the Facebook growth team? Their focus on bringing new users to Facebook in 2008 and 2009 really paid off: they doubled the size of the service in each of those years (see Figure 7-1).

Figure 7-1. New Facebook monthly users (millions)

Heading into 2010, Facebook was staring down the barrel of a very real possibility: having to handle as many additional users in the year ahead as they had added to the service in its first five and a half years. All while each user was sharing more content than ever before.

The mysteries of how to make that happen are modern marvels and happen completely out of sight in something we’ve simply come to call “the cloud.”

Industrial Revolution → Information Revolution → Facebook Revolution

At the dawn of the 19th century, the Industrial Revolution was in full swing, its belching trains and factories upending the way things were made and distributed. At the time, the “technologies” of efficient steam engines, advanced metallurgy and power looms were a mystery to most people, and the workers who made it all possible were nameless, faceless cogs in a very literal machine.

Two hundred years later, at the dawn of the 21st century, the Information Revolution was also very much in full swing. Networks were its trains, and computers its factories, and they changed the way information was made and distributed. Its technologies of fiber optics, data centers and millions of lines of code were many times more sophisticated than those of the Industrial Revolution, but most people not only did not understand them, they couldn’t even see them or, for that matter, the people wrangling them.

Facebook, of course, is part of that Information Revolution, but—with apologies to Spinal Tap—“they have to go to 11” compared to the other guys. That personalized newspaper Chris Cox has been parenting for a decade? Turns out it’s very complicated when you have to deliver it to a billion people all over the world on tens of thousands of different devices every day, often dozens of times for each person.

To deliver a Web 1.0 portal like Yahoo circa 1996—which had a directory of links to other sites and Reuters news articles—you had to store one source of stuff and then send that stuff to the tens of millions of people who clicked on it. No algorithms. No personalization.

With the emergence of search—and especially Google—things got more complicated. To create a map of the Internet, Google’s computers constantly “crawl” the giant network and index its contents so that when you arrive looking for “When is the season 7 premiere of Game of Thrones?” its algorithm can point you on your way to Westeros. The billion of us who use Google’s PageRank-powered search represent a huge advance in scale, sophistication and speed over portals, but we all search the same source and get basically the same answer to that Game of Thrones question. (I’ll confess to slight oversimplification here, since Google results are somewhat personalized based on the country and locale of the search and some degree of prior search history.)

Your Facebook News Feed, however, is entirely personal to you: your friends and things you’ve connected to, their content, everyone’s privacy settings, and what interests you the most.

It takes a hundred servers in multiple locations and tens of thousands of pieces of data—objects and associations drawn from different databases—to assemble and deliver your personal newspaper in about a second.
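
To make that concrete, here is a minimal sketch of the fan-out-and-rank pattern such a request implies; the function names (fetch_friend_ids, fetch_recent_posts, score) are hypothetical stand-ins for illustration, not Facebook’s actual services.

```python
# Illustrative sketch only: a feed request fans out to many data fetches in
# parallel, then ranks what comes back for this particular viewer.
from concurrent.futures import ThreadPoolExecutor

def fetch_friend_ids(user_id):
    # Placeholder for a query against a social-graph store.
    return [101, 102, 103]

def fetch_recent_posts(friend_id):
    # Placeholder for a query against the storage tier holding that friend's content.
    return [{"author": friend_id, "text": f"post by {friend_id}", "engagement": friend_id % 7}]

def score(post, viewer_id):
    # Stand-in for a ranking model personalized to the viewer.
    return post["engagement"]

def build_feed(viewer_id, top_n=25):
    friends = fetch_friend_ids(viewer_id)
    with ThreadPoolExecutor(max_workers=16) as pool:
        # Fetch each friend's recent content in parallel, the way a real feed
        # request fans out across many servers at once.
        posts = [p for batch in pool.map(fetch_recent_posts, friends) for p in batch]
    posts.sort(key=lambda p: score(p, viewer_id), reverse=True)  # rank for this viewer
    return posts[:top_n]

print(build_feed(viewer_id=42)[:3])
```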

In the age of mobile and short attention spans, Facebook knows you don’t care how complicated all that is. Just that speed is a feature. And they work on every millisecond.

Table 7-1 compares the relative complexity of the key players in the Information Revolution.

Table 7-1. Scale and complexity of delivering various Internet services

Facebook has come a long way from its humble beginnings when they grew the site infrastructure by one leased server—for $85 a month—every time they added a few tens of thousands of students.

Today—and every day—they serve over a billion people, deliver hundreds of billions of individual stories, collect over ten billion likes, comments and shares, and take in more than a billion photos that require a few petabytes of new storage—a petabyte is a million gigabytes—per day, on top of the many exabytes of storage for all the pictures ever uploaded.

Facebook has to store, understand the complex relationships between, and deliver on a moment’s notice many trillions of things, all the while protecting them from threats as small as spam and as complex as state-sponsored cyber attacks.

No wonder, then, that by 2015, Facebook owned $3.6 billion of equipment and $3.9 billion of land and buildings and was spending $4.8 billion per year on the engineers who create and manage these systems.

All of those engineers and assets make possible that quick glance at your Facebook News Feed, your Instagram or your Messenger while in line at Starbucks. It’s more complicated than the Information Revolution of even just 10 years ago, and we don’t know how it works or who these people are either.

Which makes this a great time to introduce you to a couple of the cloud-builders.

Who Are Schrep and Jay?

There have been many thousands of contributors to Facebook’s digital factory, including Zuckerberg’s Harvard roommate, fellow computer science student and Facebook cofounder Dustin Moskovitz, who learned the necessary coding skills in a few days and jumped in to help Zuckerberg work through the technical challenges of expanding Facebook’s infrastructure in the early days, and Jonathan Heiliger, a prolific networking expert and Facebook’s vice president of infrastructure from 2007 to 2011. But no Facebook leaders have been cloud-builders longer than chief technology officer Mike Schroepfer and his vice president of engineering Jay Parikh.

Schroepfer, now in his early forties and affectionately known by all at Facebook as Schrep, arrived at Facebook in 2008 after stints as vice president of engineering at Mozilla—maker of Firefox, the number two Internet browser at the time—and at Sun Microsystems, which had acquired the data center provisioning company he founded. With a bespectacled, what-you-see-is-what-you-get demeanor that is calm but projects competence and genuine enthusiasm for his and Facebook’s work, Schrep—who holds a master’s in computer science from Stanford—can blend in completely with coders 20 years his junior at one of Facebook’s overnight Hackathons, where prototypes for the future are imagined and built. He can just as easily explain Facebook’s complex technology on stage in front of hundreds of Facebook’s multimillion-dollar advertising customers, who know little about the technicalities but want to know they are as safe with his teams’ creations as they would be with television.

Parikh, who is now in his mid-forties and clearly still doesn’t miss a workout, joined Facebook in 2009 after spending the majority of his previous career at Akamai, having started in the early days of the content delivery network responsible for 15–30% of the Internet’s total traffic—a crucial background for the challenges that would lie ahead of him at Facebook. He works in the boiler room of the Information Revolution, a place of mind-numbing scale, speed and complexity with an unforgiving nature that usually thrusts you into the spotlight only if something has gone wrong. Yet Parikh conveys a degree of energy and—more importantly—meaning that infects not just the others in the digital boiler room but thousands around him at Facebook and in the industry. For him, having to think about an entirely new generation of infrastructure when his teams are barely done building the current one is a feature of his job, not a bug.

Together, Schrep and Jay are responsible for every piece of hardware and software that plays a role in your happiness with Facebook’s offerings:

(Network performance + Server performance) × Code efficiency = a happy user

But what is all this unseen, unheard, unmoving hardware and software?

Facebook’s Wares: Hardware

Infra is the Latin prefix meaning “below” and also the internal shorthand at Facebook for the hardware infrastructure on which Facebook’s software and services run and the teams that build and maintain it.

It is quite literally what lies below, right down to the land. In 2010 Facebook launched the first of what will be, by 2018, seven data centers owned and operated by the company around the world (in addition to the leased spaces in California and Virginia in which they operate):

- Prineville, Oregon (opened January 2010): 310,000 square feet
- Forest City, North Carolina (opened April 2012): 300,000 square feet
- Luleå, Sweden (opened June 2013): 300,000 square feet and a planned future expansion
- Altoona, Iowa (opened November 2014): 300,000 square feet
- Fort Worth, Texas (opening 2016): up to 750,000 square feet over time
- Clonee, Ireland (opening 2017): 300,000 square feet and a planned future expansion
- Los Lunas, New Mexico (opening 2018): 510,000 square feet over time

By the end of 2018, that will be about 50 football fields’ worth of space for hardware—more than half the footprint of Disney’s Magic Kingdom—in buildings whose physical, mechanical and electrical configurations are custom-designed by Facebook, right down to the open-air evaporative cooling that avoids the energy inefficiency of air conditioning for removing the heat given off by the densely packed electrical equipment. Consequently, the design achieves an industry-leading power usage effectiveness (PUE) of 1.07, meaning roughly 93% of the power that arrives at the building makes it to the hardware rather than being wasted in electrical transformers, converters or air conditioning.
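
For readers who like to check the arithmetic, PUE converts to delivered power straightforwardly. A quick sketch; the 1.7 comparison figure is a commonly cited industry number of the era, not one from Facebook.

```python
# Power usage effectiveness (PUE) = total facility power / IT equipment power,
# so the share of incoming power that actually reaches the hardware is 1 / PUE.
facebook_pue = 1.07
print(f"{1 / facebook_pue:.1%} of power reaches the servers")   # ~93.5%

# For comparison, a conventional data center at a commonly cited PUE of ~1.7
# delivers only about 59% of its power to the computing equipment.
print(f"{1 / 1.7:.1%}")
```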

The sites—each of which costs more than $1 billion to build and requires as much as 50 megawatts of electricity—are chosen for their low cost and highly reliable and renewable energy (Sweden, Texas and New Mexico will be 100% renewable, using hydro, wind and wind/solar power, respectively), ample availability of water for evaporative cooling, rural locations that reduce real estate costs, and reasonable proximity both to the Internet’s largest exchanges—on-ramps to the widest parts of the information superhighway—and to Facebook’s users.

Once they’ve built the buildings, they need to connect to the Internet. With some big pipes. While Facebook doesn’t disclose their “egress” speeds—the amount of data leaving all Facebook data centers in one second—we can extrapolate from comments and data in late 2015 that they operate at many terabits per second.

Your high-definition stream of Arrested Development from Netflix runs at about 3 megabits per second, making Facebook’s bandwidth millions of times what’s coming into your house.
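
A rough sanity check of that comparison, assuming an egress figure in the “many terabits” range, since the real number isn’t disclosed:

```python
# Rough ratio of Facebook-scale egress to a single HD Netflix stream.
# The egress value is an assumed illustrative figure, not a disclosed one.
assumed_egress_bps = 5e12      # assume ~5 terabits per second leaving the data centers
hd_stream_bps = 3e6            # ~3 megabits per second for one HD stream
print(f"~{assumed_egress_bps / hd_stream_bps:,.0f} times your living-room stream")
```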

Facebook’s bandwidth needs are among its most rapidly growing, doubling in 2013 and astonishingly quintupling in 2014 with the adoption of video viewing in the News Feed. To give you some context, Table 7-2 shows Facebook’s services as a percentage of all mobile Internet traffic around the world compared to their largest rival, Google.

Table 7-2. Percentage of mobile Internet traffic (Sandvine, Q4 2015)

Incredible bandwidth also needs to be maintained within the buildings as data travels to pods of many server racks, to individual racks within those pods and eventually to individual servers within those racks. This is known as switching, and for the sake of unadulterated speed tailored to its own network architecture, Facebook has designed all of its own networking equipment.

Then, of course, come the servers themselves, where all those bits live and all the code that handles their relationships runs. You guessed it: Facebook designed these itself too, stripping out everything it didn’t need and building them to take server chips—each of which contains around a dozen CPU cores—from either Intel or AMD, creating flexibility and speed for upgrades.

Facebook doesn’t disclose the number of servers it uses, but we can make some rough calculations based on the total power draw of its data centers, their power efficiency and the documented power envelopes of Facebook’s server designs. It’s possible that by the time its first six data centers are built out at the end of 2017 and filled with servers, Facebook could be running as many as a million servers with more than 30 million CPU cores—making it one of the premier cloud operators in the world, along with Google, Apple, Microsoft, Amazon and others who are also building out tens of millions of square feet of data centers around the world.
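
Here is the flavor of that back-of-the-envelope calculation; every input below is an illustrative assumption, not a disclosed figure, so treat the output as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope server count from power draw. All inputs are
# illustrative assumptions, not figures Facebook has disclosed.
data_centers = 6
megawatts_per_site = 50       # "as much as 50 megawatts" per site (see above)
pue = 1.07                    # design PUE, so ~93% of power reaches the IT gear
watts_per_server = 250        # assumed average draw for a stripped-down server
cores_per_server = 30         # assumed: two chips with roughly 15 cores ("CPUs") each

it_power_watts = data_centers * megawatts_per_site * 1e6 / pue
servers = it_power_watts / watts_per_server
print(f"~{servers / 1e6:.1f} million servers, ~{servers * cores_per_server / 1e6:.0f} million CPUs")
```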

Facebook’s Wares: Software

Once you have all that hardware, you need software to breathe life into it. Tens of millions of lines of code—and that’s just the parts they’ve committed to the Open Source community, where code is openly shared and improved upon across the industry.

Much of that software lives on Facebook’s servers. On top of database technologies with unapproachable-sounding names like MySQL and memcache sit services like the Facebook-built TAO (The Associations and Objects), which keeps track of trillions of objects (e.g., people, places, comments) and the associations between them (e.g., Martin is friends with Steve, Steve checked in at the Sydney Opera House, Martin liked Steve’s check-in) and fields billions of requests every second, and the custom-built, aptly named Haystack, which stores and retrieves tens of billions of photos and videos.
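
A toy, in-memory sketch of the objects-and-associations idea behind TAO; the class and method names below are invented for illustration and do not mirror Facebook’s actual API.

```python
# Toy objects-and-associations store in the spirit of TAO (illustrative only).
# Objects are typed nodes; associations are typed, directed, time-ordered edges.
import time
from collections import defaultdict

class ToyGraphStore:
    def __init__(self):
        self.objects = {}                    # id -> (type, data)
        self.assocs = defaultdict(list)      # (id1, assoc_type) -> [(id2, timestamp)]

    def add_object(self, obj_id, obj_type, **data):
        self.objects[obj_id] = (obj_type, data)

    def add_assoc(self, id1, assoc_type, id2):
        self.assocs[(id1, assoc_type)].append((id2, time.time()))

    def get_assocs(self, id1, assoc_type, limit=10):
        # Most recent first, the feed-friendly way to query edges.
        edges = sorted(self.assocs[(id1, assoc_type)], key=lambda e: e[1], reverse=True)
        return [self.objects[id2] for id2, _ in edges[:limit]]

store = ToyGraphStore()
store.add_object("martin", "user", name="Martin")
store.add_object("steve", "user", name="Steve")
store.add_object("checkin1", "checkin", place="Sydney Opera House")
store.add_assoc("martin", "friend", "steve")
store.add_assoc("steve", "authored", "checkin1")
store.add_assoc("martin", "liked", "checkin1")
print(store.get_assocs("martin", "liked"))
```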

It would be overwhelming for Haystack if it had to go all the way back to one particular server at one particular data center for every photo you want to see, so Facebook uses a software technique called “caching” to distribute popular pieces of data more widely throughout the Internet. The first time a piece of data—that picture of your friend’s vacation—is requested and is not cached anywhere, it has to be retrieved from its originating server at a Facebook data center. But as it passes through the Internet back to you, it is cached along the way: first in an origin cache at the digital entrance of that data center, then at one of the thousands of edge caches around the world at an Internet intersection close to you, and finally on your computer or phone. The next time you want to look at that picture, it’s simply returned from your device. If a friend in your Internet neighborhood wants to look at it, it’s retrieved from the edge cache, and even someone elsewhere in the world won’t have to go back to the original backend server, because it will be returned from the origin cache at the entrance of the data center.
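
In code, the cascade looks something like this minimal sketch, with a dictionary standing in for each cache layer:

```python
# Illustrative cascade through cache layers (browser -> edge -> origin -> backend).
# Each dict stands in for a real cache; the backend is the source of truth.
browser_cache, edge_cache, origin_cache = {}, {}, {}
backend = {"photo:123": b"...jpeg bytes..."}

def get_photo(key):
    for layer in (browser_cache, edge_cache, origin_cache):
        if key in layer:
            return layer[key]              # served without touching deeper layers
    data = backend[key]                    # miss everywhere: go to the data center
    # On the way back, populate every layer so later requests stop earlier.
    origin_cache[key] = edge_cache[key] = browser_cache[key] = data
    return data

get_photo("photo:123")   # first request: hits the backend, then cached everywhere
get_photo("photo:123")   # second request: served straight from the browser cache
```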

Caching works particularly well for Facebook for two reasons:

1. People want to retrieve something from Facebook (known as a “read,” which caching can handle) 500 times as often as they want to deposit something (known as a “write,” which has to make it all the way back to Facebook’s one-source-of-truth database in the data centers).

2. Facebook’s most popular pieces of data are way more popular than the rest: in a Facebook study of a random sample of 2.6 million viewed photos, 75% of users wanted to look at just 3.73% of the photos!

No wonder, then, that caching is crucial for Facebook and contributes significantly to your perceived performance of the service. Table 7-3 shows an example from sample photo data of how often a request can be taken care of at each layer of the cache network.

 

Cache level                      Requests served
Browser (your device)            65%
Edge (Internet intersection)     34.5%
Origin (data center)             14.5%
Backend (server)                 9.9%

Table 7-3. Percentage of requests served by various Facebook caching levels

That’s why caching is so valuable: with the original backend servers handling only 9.9% of requests, they carry roughly one-tenth the load they would without any caching at all.
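
The offload factor falls straight out of the table; a quick check using the figures above:

```python
# How much work caching removes from the backend, using the figures in Table 7-3.
backend_share = 0.099                 # fraction of requests that reach the backend
offload_factor = 1 / backend_share    # ~10x fewer requests than with no caching at all
reads_per_write = 500                 # reads vastly outnumber writes, which is why caching pays off
print(f"Backend sees 1 request in ~{offload_factor:.0f}; {reads_per_write} reads per write")
```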

But, it’s not all about databases and servers. A significant amount of Schrep’s teams’ software resources also go to building and optimizing the experience we have on our phones, a challenge that’s exploded in complexity over the years as the team has to address over 10,000 different kinds of phones—most of that due to the huge number of devices in the Google Android ecosystem—in a wide variety of network conditions around the world.

From its first mobile-optimized website in 2007, to its first Apple iOS app in 2008 and Android app in 2009, to rewriting the iOS app in 2011 and completely rewriting it again in 2012 to double performance and significantly improve reliability, to launching the novel Facebook Lite for emerging markets in 2015, the team is obsessed with every 100 milliseconds it can shave off when you start your Facebook app, when it has to access the network, process the resulting data, calculate the new screen layout of your News Feed, or send and receive messages. The team has built automated systems that mimic a broad range of devices and networks so every change it makes can be tested automatically.

All of this hardware and software can get a little esoteric, so let’s take a look at a handful of specific projects that will feel a little more familiar to Facebook users.

Some of Facebook’s Digital Factory’s Greatest Technical Hits

485 million people in two years: Remember how at the end of 2009 Facebook was faced with growing more in a year than it had in its first five and a half years? To accomplish the feat, the team launched its first custom-designed data center, as well as its own high-performance optimization—HipHop for PHP and the HipHop Virtual Machine, or HHVM—of one of the Internet’s most popular development languages, increasing the efficiency of its servers by more than six times: the same effect as buying six times as many servers. Not only were they growing their hardware infrastructure, they were making it radically more efficient at the same time.

The outcome of all that effort? The team seamlessly supported the addition of 485 million users in 2010 and 2011, the largest 24-month period of growth in Facebook’s entire history—before or since (see Figure 7-2).

Figure 7-2. New Facebook monthly users (millions)

You would think that Facebook would guard all the technology that made this possible with the zeal Coke has for its original formula. Not so much. They gave it all away, contributing enabling software technologies like HipHop and HHVM to the Open Source software community and, in April 2011, announcing the Open Compute Project (the brainchild of Heiliger), in which—in a first for the hardware industry—they shared all the designs of their data center efforts, from servers to storage and networking and even building architecture, with the rest of the industry, including Apple, Microsoft and Google. These enabling technologies were important to Facebook, but not as important as the larger technical community’s ability to embrace them and—more importantly—advance their performance together.

In sharing these technologies for the greater good, Facebook joined the likes of Jonas Salk, who did not patent his polio vaccine, and Tesla, which open-sourced its electric vehicle patents.

Instagration: Remember that massive change in Instagram’s infrastructure back in 2013 when it had 200 million users and more than 20 billion photos? Me neither.

That’s because the year-long project to migrate Instagram from its original infrastructure on Amazon Web Services—infrastructure sold as a service to companies that don’t have their own teams—to Facebook’s infrastructure happened without a hitch.

It was a textbook example of an infrastructure team—the Instagram engineers and their new Facebook colleagues—working through a very complex process (they had to move the service through a two-step transition while it was live) with none of us on our phones any the wiser, gaining scale, speed and reliability advantages along the way and avoiding site outages like the one that hit Instagram during an East Coast storm in 2012. By December 2016, Instagram had grown to more than 600 million users and 40 billion photos (much more on the Instagram acquisition in Chapter 9).

Live video: We all know how Facebook revolutionized the way photos are shared and the huge storage and serving infrastructure that requires. Now imagine the impact on the size and complexity of that infrastructure when you add more than 8 billion video views per day, plus artificial intelligence that automatically analyzes different sections of uploaded videos and compresses each with the encoding technique that will yield the highest quality and smallest size. Or the complexity of finding innovative ways to compress, store and deliver view-dependent adaptive-bit-rate streaming—say that three times fast—for much larger 360-degree videos in virtual reality.

According to Cisco, video is slated to become more than 80% of the Internet’s traffic by 2019, and it’s complicated, to be sure. But perhaps none of it is as complicated as figuring out how to help Vin Diesel live-broadcast from his phone—simultaneously, to over a million of his nearly 100 million Facebook fans—his thoughts about the latest movie scripts he’s reading.

For a live broadcast, Facebook’s systems no longer have the relative luxury of uploading a video, encoding that video carefully, storing it away and then retrieving it when someone wants to see it.

Instead, all of that has to happen in just two to three seconds, so that there doesn’t appear to be a big lag between what Vin is talking about, his adoring fans seeing it and commenting on it, and him reacting back to them. Because Vin and so many other celebrities and public figures on Facebook are so popular, their fans’ enthusiasm for live broadcasts—which they can watch right in their Facebook News Feeds—causes what the Facebook engineering team lovingly refers to as the “thundering herd” problem, as viewers threaten to flood the live streaming server. So the team went back to its caching architecture and built a system that holds fans’ requests to subscribe to the live stream at their local edge server until the streaming server can push the video out to all those edge servers—each of which can handle about 200,000 simultaneous viewers—and ensure that each subsequent fan’s request is handled by the edge server. That relieves the streaming server of about 98% of the requests it would otherwise have seen and makes sure there’s nothing standing in the way between Vin and his fans.
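
The hold-and-coalesce idea at a single edge server can be sketched in a few lines; the lock-and-shared-fetch mechanics below are illustrative, not Facebook’s implementation.

```python
# Illustrative request coalescing at one edge server: when many viewers ask for
# the same live-stream segment at once, only the first request goes back to the
# streaming server; the rest wait briefly and reuse the result.
import threading

class CoalescingEdgeCache:
    def __init__(self, fetch_from_streaming_server):
        self.fetch = fetch_from_streaming_server
        self.cache = {}
        self.locks = {}
        self.guard = threading.Lock()

    def get_segment(self, segment_id):
        if segment_id in self.cache:
            return self.cache[segment_id]
        with self.guard:
            lock = self.locks.setdefault(segment_id, threading.Lock())
        with lock:                                # the "thundering herd" waits here
            if segment_id not in self.cache:      # only the first holder actually fetches
                self.cache[segment_id] = self.fetch(segment_id)
        return self.cache[segment_id]

edge = CoalescingEdgeCache(lambda seg: f"video bytes for {seg}")
threads = [threading.Thread(target=edge.get_segment, args=("vin-live-0001",)) for _ in range(200)]
for t in threads: t.start()
for t in threads: t.join()
print(edge.get_segment("vin-live-0001"))   # one backend fetch served 200 concurrent viewers
```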

The Future of Facebook’s Infrastructure: Sharks with Lasers

Around Facebook, Schroepfer and Parikh are fond of joking that their teams’ ever expanding efforts know no boundaries, a notion that in its farcical extreme is best known as Dr. Evil’s “sharks with frickin’ laser beams” from the 1990s comedy Austin Powers: International Man of Mystery.

The farther the teams go in pursuing Facebook’s mission, however, the less farcical those lasers are becoming.

Connectivity: Having revolutionized the way Facebook connects to the Internet with their own data centers, network, servers and software, Facebook is now turning their attention to revolutionizing the way people connect to the Internet.

What’s the only thing more complicated than building your own infrastructure here on earth? Building your own infrastructure in the sky.

We already heard in Chapter 5 about the AMOS-6 satellite that Facebook’s coalition had planned to launch in 2016 to expand connectivity in sub-Saharan Africa, but Facebook is not stopping there. To cover less densely populated rural areas—not served cost-effectively by satellites—Facebook acquired the British unmanned aerial vehicle (UAV, or drone) maker Ascenta and is building solar-powered drones. With the wingspan of a 737. That weigh only about a third as much as a Toyota Prius. And use lasers to transmit data through the air to each other. At 10 gigabits per second. While in flight. At 60,000 to 90,000 feet. For three months at a time.

No, not sharks just yet, but damn good lasers accurate to the size of a dime over a distance of 11 miles while in flight.

A group of these drones, each covering a patch about 100 miles in diameter, will connect to a single terrestrial base station at a big Internet connectivity intersection using radio signals, relay the signal from drone to drone using lasers, and then beam it down to many smaller terrestrial base stations, which turn the radio signal into local WiFi or cellular connectivity for rural communities. The effort is intended to avoid expensive ground-based cable infrastructure and instead create new, more cost-effective modes of connectivity for operators or governments that don’t have the means, or appetite, for these kinds of innovation risks themselves.

Drones, however, are not the only way Facebook is working with network operators to discover new, more efficient approaches to connecting the next few billion people (more in Chapter 14). In 2016, Facebook joined with the likes of AT&T, Verizon, Deutsche Telekom and SK Telecom to form the Telecom Infra Project, which extends the move-fast-and-share-things playbook of the Open Compute Project for data centers to telecommunications infrastructure, from access (the place where you connect to a network, such as a cellular tower or WiFi access point) to the core and the backhaul network in between.

Artificial intelligence: What drones are to communications infrastructure, artificial intelligence is to software. The notion of teaching computers to learn—and think—will play an important role throughout Facebook’s offerings in understanding photos, videos, speech and text, including its virtual assistant Facebook M (more in Chapter 13). As they have with much of their other software and hardware, Facebook is contributing its neural network training code and purpose-designed servers to the open source communities—artificial intelligence software learns better the more data it has to work with, making computing performance a fundamental limiter to progress.

Virtual and augmented reality: And if all that reality weren’t enough, Facebook’s virtual reality efforts—the Oculus VR acquisition (more in Chapter 15) also reports to Schrep—will deliver consumer hardware and a software ecosystem of experiences for our screens of tomorrow. Live 3D broadcasts from Vin Diesel, anyone?

Between the experiences on our phones, the mind-boggling data center and telecommunications infrastructure innovations and future consumer interfaces, Facebook’s digital factory has put its money and effort where its make-the-world-more-open-and-connected mouth is.
