Chapter 7

Data Center Architectures and Opportunities for Silicon Photonics

Abstract

Networking an ever-growing number of servers presents a key challenge in large-scale data centers. The capacity of switch silicon continues to grow, but it is not keeping pace with demand as the number of switches and amount of associated interconnect are rising exponentially. Silicon photonics offers solutions at several levels: It can benefit interconnect, it can be used to tackle input–output issues of large chips like switch silicon and processors, and it can enable new and higher-performance system designs. Interestingly, the use of optics to enable new system designs has already been demonstrated, and while the early examples have not been commercially successful, silicon photonics is the optical technology with the most to offer to solve these data center challenges.

Keywords

Silicon photonics; data center; leaf and spine; server; optical transceiver; IP core router; on-board optics; copackaged optics; VCSELs; disaggregated platforms

Everything that has happened in the telecom network is now being replicated inside the data center. And then everything that is happening in the data center is going to be on the board, and then everything on the board is going to be in the package, and then everything in the package is going to be on the chip.

Lionel Kimerling

Data analytics is driving so much bandwidth and so much traffic within data centers, and it has such commercial value these days. I don’t want to be too clichéd but it acts almost as the ‘killer app’ for driving photonics.

Keren Bergman

7.1 Introduction

In his book, Thing Explainer, Randall Munroe uses diagrams and labels to explain how complicated systems work [1]. A novelty of the book is that the author confines himself to a 1000-word vocabulary. For Munroe, the data center is a computer building, a server rack is a holder, and the server’s processor is a thinking box.

We view the data center as the factory of the information age. The raw material is data which, when processed using energy and computation, is transformed into information. The factory’s output may be the result of a complex data analytics algorithm or a service delivered to an enterprise or a consumer, e.g., the WolframAlpha application discussed in Chapter 2, Layers and the Evolution of Communications Networks.

As with all factories, improving efficiency and adding an assembly line benefit the factory owner’s bottom line. The computing power, switching, and storage capacity needed to deliver the product, or service, are the equivalent of a factory assembly line, and the software that runs on the servers can be created and scaled dramatically in ways traditional manufacturing cannot match.

The data center also has few moving parts and is highly automated. Instead of rhythmic machinery, it has ventilation noise, blinking lights, and heat generated by rows of equipment. But the lack of an assembly line buzz should not be mistaken for inactivity—data centers are hugely complex, productive, and challenging environments.

Data centers continue to grow in size as they house ever more processors and racks, as discussed in Chapter 6, The Data Center: A Central Cog in the Digital Economy. The tasks the computing resources perform are also evolving. Such demands challenge the networking of servers. For data center managers, scaling operations cost-effectively remains a major headache.

The rising demand for video services is one development driving the need for more bandwidth. Fig. 7.1 shows the projected increase in cloud-based video and the growing appetite for viewing video by end users [2]. The bandwidth needed, and the uncertainty as to when and where video demand may originate, tax a data center’s performance and its networking.

Figure 7.1 The ongoing transition from traditional TV programs (linear TV) to Internet-delivered video. Note: OTT refers to over-the-top—in this case video, typically from an Internet content provider that rides over an operator’s network—while VOD refers to video on demand. Adapted from Market Realist.

This chapter details the challenges the Internet content providers face in expanding their data centers. Data center scaling is not just a matter of adding more server racks and switches over time in a pay-as-you-grow fashion. For the Internet content providers, scaling implies the raw ability to reach the massive scale of computing they need. The basic networking architecture used to link servers and other equipment is described, as are its shortfalls.

Photonics plays an important role here in the form of pluggable optical transceivers, optical devices whose requirements continue to be pushed. But the demands of the data center are causing a more fundamental change in the use of photonics. The current partnership of chips and pluggable optics can only go so far, requiring photonics to be brought inside systems and closer to the chips. Moreover, there are already signs that bringing the optics onto the printed circuit board, closer to the chip, is itself insufficient longer term. The demands of the data center mean that optics, sooner or later, will be brought inside chips, or more accurately copackaged with them. Silicon photonics is the technology for such copackaging.

7.2 Internet Content Providers Are the New Drivers of Photonics

Internet content providers may not yet spend as much as telcos annually on optical gear (as discussed in Chapter 5, Metro and Long-Haul Network Growth Demands Exponential Progress), but their fast growth and pressing requirements are beginning to shape the evolution of optical technology. The 100-Gb transceivers, now state of the art, are being purchased in considerable volumes for their data centers. The role of the optical transceiver in enabling the Internet content providers’ businesses should not be underestimated.

The Internet content providers’ optical communication requirements differ from those of the telcos. The Internet players are putting up huge data center complexes that can house more than 100,000 servers. The Internet content providers do not balk at using proprietary solutions if needed, and they will buy equipment from a sole supplier—practices eschewed by the telcos, which as regulated companies must adhere to industry standards and rely on several suppliers.

The Internet content providers are also more relaxed than telcos when it comes to equipment specifications. This is not because they have lower standards than the telcos. Rather, their equipment must meet less stringent demands given the controlled environment of the data center. The data center equipment also has a shorter life, with data center operators upgrading their systems as often as every 2 to 3 years. In contrast, telco equipment is deployed in harsher environments and must operate across a broader temperature range. Telco equipment can reside in the network for two decades or more.

As large-scale data centers can require hundreds of thousands of high-bandwidth optical transceivers, Internet content providers want inexpensive, power-efficient designs to minimize their costs. Such requirements represent a great commercial opportunity for silicon photonics, and not just because of bandwidth. Optics accounts for an increasing proportion of overall networking cost, so new technologies are being sought to better meet market needs.

The market for 100-Gb/s data center transceivers kicked off in the second half of 2016. To achieve this throughput, data is sent over several lanes requiring multiple lasers and receivers. Single-mode fiber is considered the best approach for future traffic growth due to its ability to support longer distances and higher bandwidth using wavelength-division multiplexing. But single-mode fiber and associated transceivers are relatively expensive when compared with links using multimode fiber.

As multiple active photonic components are required for each link over single-mode fiber, photonic integration—whereby multiple components in the transceiver are copackaged or even monolithically designed—promises to reduce costs. Silicon photonics is already being used to deliver such transceivers [3] (see Fig. 7.2).

Figure 7.2 Wafer of Luxtera’s 100G PSM4 silicon photonics-based optical engines. From Luxtera.

It should be noted that the Internet content providers first requested 100-Gb transceivers in 2010. It took 6 years for the optical module makers to deliver products that addressed their needs. The industry’s belated response reflects the traditional optical component market’s development cycle. Now, with the growing role of the Internet content providers, things are changing. They demand much shorter development times and require products with a predictable delivery date. This is a role that silicon photonics, with its commonalities with the chip industry, can exploit.

Optical transceivers are but one part of what data center operators need. Silicon photonics’ attractiveness stems from the fact that it can play a role across many levels of interconnect, a focus of this chapter.

7.3 Data Center Networking Architectures and Their Limitations

The challenge facing data center networking engineers is how to link efficiently an ever-growing number of servers. This is especially tricky given how tasks and data are distributed across the vast computing resources within a data center. Whereas, in a traditional campus-based local area network, a user request would typically be fulfilled by one server, today’s data centers spread data sets and processing across multiple servers, requiring constant communication between them.

Such server traffic flows are referred to as East–West traffic, which means that the traffic stays within the data center, as opposed to North–South, which refers to traffic entering and exiting the data center. East–West traffic accounts for the bulk of the traffic flow within a data center. When a user logs into a social media site such as Facebook, for example, the incoming request generates a near 1000-fold increase in traffic within the data center [4]. All the feeds unique to the user are aggregated, including what Facebook calls anniversary posts—images or videos a user may have uploaded years ago—that have been archived and need to be retrieved from longer-term storage.

To get a sense of the scale of Facebook’s operations, hundreds of petabytes are stored in its data centers—a petabyte being one million gigabytes—while 1.79 billion users visit the site every month [5]. The world’s population in 2016 is estimated to be 7.40 billion.

Scaling issues are not unique to Facebook. All the Internet content providers are grappling with how to ensure that they can keep growing servers, storage, and switching to meet their massive and growing individual operational requirements.

7.3.1 The Leaf-and-Spine Switching Architecture

Data center managers have adopted a standard hierarchical arrangement of network switches that they replicate across the data center to connect more and more servers and accommodate the traffic flows between them. A commonly used building-block switch architecture is known as leaf and spine [6].

Leaf and spine is a layered architecture. Leaf switches typically reside on the top of a server rack, hence they are also known as top-of-rack switches. The spine switches, typically having a higher switching capacity, link several leaf switches to enable them to talk to each other.

A data center manager wants the switch architecture to connect the end devices (e.g., typically servers, but it could also be storage units) as efficiently as possible. Efficiency refers to performance metrics such as latency—the time it takes for a data packet to travel from one server port to another—which ideally should be as low as practicable. Efficiency also means using the least amount of switch hardware to meet the networking goals: more switches mean more floor space, more power consumption, and more transceivers and cabling, all of which add cost. If a data center manager is going to use multiple leaf-and-spine arrangements to connect hundreds of thousands of servers, the basic leaf-spine unit had better be as efficient as possible.

Fig. 7.3 shows a simplistic leaf-and-spine arrangement to highlight how switch count and interconnect grow as more servers are added.

Figure 7.3 A simple leaf-and-spine architecture linking eight servers.

The basic leaf switch shown is a four-port device that can support up to four servers. Doubling the server count to eight increases the number of switches from one to six and sends the number of interconnects linking the switches up from nil to eight.

Fig. 7.4 shows the exponential impact of increasing the number of servers further. Note that the interconnects grow more rapidly than the switch count. The cost of scaling a leaf-and-spine architecture is thus a function of the number of switches and the interconnects they use. Fig. 7.4 also makes clear why an individual switch with a high port count is desirable.

Figure 7.4 Impact of server count on the number of four-port switches and interconnects in a leaf-and-spine architecture.

Table 7.1 further shows the effect on a network of scaling the number of servers. The number of switch tiers needed increases, which complicates the cabling and increases cost. The number of servers that can be linked is determined by the switch port count in the spine. Adding servers to meet end-user requests requires a significant increase in switches and ports—all expensive items.

Table 7.1

The Number of Switches, Interconnects, and Switch Tiers Needed to Support End Devices in a Leaf-and-Spine Architecture

End-Device Ports    Four-Port Switches    Interconnects    Switch Tiers
4                   1                     0                1
8                   6                     8                2
16                  20                    32               3
32                  56                    192              4

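To make the scaling concrete, the sketch below counts device ports, switches, and inter-switch links for an idealized nonblocking folded-Clos fabric built from four-port switches. The model and its full-bisection assumption are illustrative rather than taken from the figures: it reproduces Fig. 7.3 and the first three rows of Table 7.1, while for deeper fabrics the exact counts depend on how uplinks are provisioned.

```python
def folded_clos(radix, tiers):
    """Counts for an idealized nonblocking folded-Clos fabric.

    Assumptions (illustrative only): every switch has `radix` ports,
    each non-top layer uses half its ports for downlinks and half for
    uplinks, and the fabric offers full bisection bandwidth.
    """
    devices = radix * (radix // 2) ** (tiers - 1)        # end-device ports
    if tiers == 1:
        return devices, 1, 0
    per_layer = devices // (radix // 2)                  # switches in each lower layer
    switches = (tiers - 1) * per_layer + per_layer // 2  # top layer is half-sized
    links = (tiers - 1) * devices                        # one device-count of links per tier boundary
    return devices, switches, links

for tiers in range(1, 5):
    devices, switches, links = folded_clos(4, tiers)
    print(f"{tiers} tier(s): {devices:>3} device ports, "
          f"{switches:>3} switches, {links:>3} inter-switch links")
```

Running the sketch shows the interconnect count growing faster than the switch count at each doubling of the server population, the same trend Fig. 7.4 illustrates.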

In practice, data center networks are far more complicated than the example above. For a start, other networking elements are required besides servers and switches. One is edge routers that support Internet Protocol (IP) routing to handle external connectivity to the backbone and wide area network (WAN)—i.e., communication beyond just the leaf and spine. Another is the load balancer that, as suggested by the name, has a view of workloads and ensures that one cluster of servers is not overworked while another is relatively idle.

Fig. 7.5 shows a schematic of a network described by Facebook [4]. Consider a 128-port leaf switch where 96 devices such as servers are connected using 10-Gb links. The remaining 32 leaf switch ports are used for uplinks. For example, ports can be grouped in units of four to create eight ports, each at 40 Gb/s, to connect the leaf switches to the spine, or fabric, switches. A three-tiered architecture is used in this example, with the second spine used to enable more end devices to be connected on the network.

Figure 7.5 A three-tier leaf-and-spine architecture as described by Facebook. Modified from Facebook.

Top-of-rack and spine switches, with 96 ports each, enable a total of 221,184 ports to be used to connect servers and other appliances. How is this number arrived at? The number of devices that can be attached is a function of the edge or top-of-rack switch port count, Pe, and the core or spine switch port count, Pc.

For a three-level design, the maximum number of devices or servers that can be connected is Pe × (Pc/2)^(L−1), where L is the number of layers [7].

The issue with such a topology, says Facebook, is the huge number of cables and optical transceivers needed. In this example, there are 442,368 links between the switches, twice the 221,184 device ports calculated above: in a nonblocking design, each of the two switch-to-switch tiers (leaf to spine 1, and spine 1 to spine 2) carries as many links as there are attached devices.
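A quick sanity check of these figures (the link count uses the same full-bisection assumption, one device-count of links per switch-to-switch tier):

```python
# Illustrative check of the three-tier example: Pe = Pc = 96, L = 3.
Pe, Pc, L = 96, 96, 3

device_ports = Pe * (Pc // 2) ** (L - 1)       # 96 * 48**2 = 221,184
inter_switch_links = (L - 1) * device_ports    # full-bisection assumption: 442,368

print(device_ports, inter_switch_links)        # 221184 442368
```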

Not surprisingly, interconnect is a considerable part of the overall cost of scaling leaf-and-spine architectures. Moreover, the moves to 100-Gb links and to single-mode transceivers each push the interconnect cost higher still. It is this rising cost that is driving the need for cheaper interconnect.

The attraction of using the leaf-and-spine architecture is that it is scalable, predictable, and resilient. The switches are built using merchant silicon, and while these are complex chips, they are relatively inexpensive. Adding leaf switches and appliances does not require software changes as long as there are free spine ports. Multiple paths and spine-switch routing algorithms provide predictable transit times between the servers, ensuring good service for end users. And spine-switch failures can be circumvented with minor network performance degradation. All these factors are positives as far as data center managers are concerned.

The distance between the server and the top-of-rack switch is typically under 10 m, enabling copper cabling to be used. The link between the top-of-rack switch and the spine can be a few hundred meters in length, but link lengths of less than 50 m are typical. Although optics is used, the link cost is modest due to inexpensive optical transceivers and the short lengths.

As the network size increases and the number of servers and switches grows, so does the distance between the fabric (the spine 1 switches in Fig. 7.5) and the cluster (spine 2) switches, requiring more single-mode fiber and more costly single-mode transceivers.

Indeed, data center buildings can be distributed tens of kilometers apart to serve a metro area. Considering that redundancy is often desirable, such data center clusters can require a large number of links. Microsoft talks of distances up to 70 km, and optical engineers are developing solutions for this application. In this case, tiered spine switches need to talk to each other over such distances to make the servers in the distributed data center buildings appear as a single shared resource (see Section 5.5.1).

7.3.2 Higher-Order Radix Switches

One way to scale data center networking is to increase the number of ports—or radix—of the building-block switch. Using a high-radix switch reduces the overall interconnects, switch tiers, and switch boxes needed. This reduces overall equipment costs and the cost of operating them, as less power is consumed and less floor space is needed overall. A high-radix switch also benefits network performance by decreasing latency as fewer hops between switch stages are needed when two servers chat.
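A rough sketch of this effect, reusing the device-count formula from Section 7.3.1 with Pe = Pc = radix (an idealization of ours): the higher the switch radix, the fewer tiers, and hence the fewer hops and interconnects, needed to reach a given server count.

```python
def tiers_needed(servers, radix):
    """Smallest tier count for which radix * (radix/2)**(tiers-1) covers `servers`."""
    tiers = 1
    while radix * (radix // 2) ** (tiers - 1) < servers:
        tiers += 1
    return tiers

for radix in (16, 32, 64, 128):
    print(f"{radix:>3}-port switches: {tiers_needed(100_000, radix)} tiers for 100,000 servers")
```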

A market leader in 100-Gb switch silicon is Broadcom. The fabless chip company’s flagship Ethernet switch integrated circuit (IC) is the StrataXGS Tomahawk with a switch capacity of 3.2 Tb: 32 ports at 100 Gb or 128 ports at 25 Gb [8]. While the bandwidth is higher than the top-of-rack switch shown in Fig. 7.5, the number of servers and other end devices supported is the same.

Ethernet switch equipment vendors have copackaged multiples of these switch chips to increase overall port count. Internally, these switches use a multitier Clos architecture (named after its inventor, Charles Clos), in effect a leaf and spine in a box, and the overall platform is typically used as a spine switch. But such switches are costly and power-hungry. Fig. 7.6 shows various vendors’ 100-Gb high-port-count switches.

Figure 7.6 A selection of the highest-port-count 100-Gb Ethernet switches available commercially. *Juniper announced, not shipping as of 3Q2016. Data from company reports.

7.4 Embedding Optics to Benefit Systems

Although high-radix switches are one way to support data center scaling, as detailed above, current data center requirements are prompting a rethink of system design and architecture.

Research is under way into how electronics and optics can be combined to deliver more efficient data center networking. Such research seeks to improve the data center’s computation and processing speed while reducing latency, cost, and power consumption. The systems being proposed require new combinations of photonics with electronics, bringing optics inside systems and placing it ever closer to the application-specific integrated circuit (ASIC). An example of such research is now discussed.

7.4.1 An Electronic-Optic Switch Architecture for Large Data Centers

Keren Bergman’s career has always been at the intersection of computing systems and photonics, first with high-performance computing and now data centers. A professor of electrical engineering at Columbia University, Bergman’s current focus is on how photonics can improve overall computational performance of the data center while minimizing energy consumption.

A key challenge Professor Bergman and her research team are tackling is how photonics can best link 100,000 or more servers. Instead of developing a large high-radix—320 or even 1000-port—optical switch that would connect to the top-of-rack or leaf switches to offer an optical complement to the existing electrical switching architecture [9], Bergman and her team are taking a different approach: embedding photonics inside the electronic switches. Such an approach will be enabled by silicon photonics, says Bergman: “That is key.”

Bringing silicon photonics onto the switch chip board will create a new entity, a combination of electrical routing and optical switching. This results in the best of both worlds: all the benefits of an electronic switch chip, such as programmability and packet buffering, with much denser optical input–output enabled by silicon photonics. In turn, by adding silicon photonics switching, traffic can be sent through either the electrical or the optical switching.

Embedded photonic switching offers a fundamentally different way of moving data across the data center, says Bergman. Photonics lends itself to data being broadcast to multiple locations—a networking approach commonly used for data packets known as multicasting—while individual transmitted wavelengths carrying data can be picked off at locations as required. Multicasting is much harder to do electrically and is costly in terms of latency and energy, especially over data center spans.

These optical-enabled data-transfer techniques offer a powerful way to deal with the huge growth in data traffic transiting the data center. Applications such as data analytics, used to extract valuable information from large data sets, are generating vast amounts of traffic between servers. Bergman sees a direct link between data analytics, which has great commercial value to companies, and embedded optics: “This is very recent, only in the last few months have we converged on this in our research.”

One goal of the team’s work is that a data center manager will not even know that the networking systems include embedded optics; the system will look and be controlled as current systems are today, except it will run faster and consume less energy. This is extremely attractive to data center managers, whereas adding a large external optical switch would require them to change the data center controller software, she says.

But once silicon photonics becomes embedded within such systems, things will change. The resulting bandwidth of the networking architecture will become much higher and more uniform. Then, the need for hierarchical switching architectures will start to diminish. Professor Bergman predicts that we will see differently architected data centers in the next 3 to 5 years.

Silicon photonics start-up Rockley Photonics is also developing a switch architecture for the data center based on optics and a custom ASIC. As of this writing, Rockley is still in stealth mode. But it has said that its switch concept will scale with Moore’s law, meaning it will support a doubling of capacity every 2 years [10].

The idea of embedding optics in systems is not new. Intel announced a disaggregated server architecture enabled using silicon photonics interconnects incorporated within the system. And start-up Compass Electro Optical Systems (Compass EOS) went a step further and copackaged optics with electronics to create a complex ASIC with optical input–output. While both designs faltered commercially, they provide valuable case studies in highlighting the potential of bringing optics inside systems. Both systems are now described.

7.4.2 Intel’s Disaggregated Server Rack Scale Architecture

Intel introduced its Rack Scale Architecture in 2013, a disaggregated server design enabled by silicon photonics. A disaggregated design physically separates the various parts of a server—processors, memory, storage, and switching—linking them using high-speed, low-cost interconnects instead [11].

Using such an approach, the server’s performance can be tailored to a given task. If more memory is needed, it can be added without upgrading the entire server. And as the elements making up the server have different upgrade cycles, when a more powerful processor is released, it can be slotted in without having to upgrade all the other elements. This can extend the life of the platform to as long as 20 years [12]. There are also environmental cost benefits: if the processors generate more heat than the flash memory, they can be treated as a separate microenvironment and cooled accordingly (see Fig. 7.7).

Figure 7.7 Device-specific environmental conditioning enabled by disaggregation.

However, the disaggregated server design only works if the various elements can be connected with sufficiently low latency, and that requires high-speed, cheap interconnect. Intel chose to use silicon photonics. The silicon photonics transceiver operated at 1310 nm and was coupled to multimode fiber. Intel worked with leading fiber manufacturer Corning to develop the multimode fiber and demonstrated it transmitting at 28 Gb/s over nearly a 1-km distance [13], although distances of a few hundred meters are sufficient for these applications.

Intel also worked with US Conec to develop a custom connector that had high bandwidth and was robust. The resulting connector used a fiber ribbon, 16 strands wide and 4 rows high, yielding a cable, called the MXC, with 64 lanes, each at 28 Gb/s, for a total duplex bandwidth of 1.6 Tb/s [14].

Intel developed an expanded beam connector to ensure the MXC’s ease of use. This enabled robust coupling between the light source and the fiber strands ensuring a tolerance to dust and dirt, the cause of most optical link failures.

By choosing multimode fiber and developing a low-loss, terabit-plus connector, Intel hoped to offer a novel server architecture for the data center with silicon photonics at its core. However, in early 2015, Intel announced a delay in its silicon photonics modules and said they would ship by the year’s end [15]. In August 2016, Intel did indeed announce silicon photonics modules, but not for the Rack Scale Server. Rather, these are pluggable 100-Gb PSM4 and CWDM4 QSFP28 transceivers for switch equipment in the data center [16].

So where does that leave Intel’s Rack Scale Architecture?

According to an Intel spokesperson, silicon photonics is not being used in Intel’s Rack Scale design, primarily because optical connectivity is not required at the data rates and distances used inside a rack today. Copper continues to push back the technical boundaries and keeps an all-optical interconnection the subject of “future” plans. But the spokesperson added that as server resources disaggregate, and throughput and distance requirements increase beyond the capability of current interconnect technologies such as copper and vertical-cavity surface-emitting lasers (VCSELs), then silicon photonics is expected to provide the high-speed optical connectivity between a new generation of pooled resources.

So despite Intel’s technological heft and the design’s development effort and ambition, the company concluded that a silicon photonics design is premature for this application. Moreover, copper already enables such disaggregated designs. Equipment makers such as Cisco Systems and Dell have server disaggregation products that use copper-based high-speed connections. Ericsson has also announced a disaggregated hardware system based on Intel’s Rack Scale Architecture where the connectivity is optical but it is not based on silicon photonics [17].

7.4.3 Compass-EOS: Copackaging Optics and Silicon

Compass-EOS was arguably the first company to offer optics copackaged with a complex chip. The company, which later became Compass Networks, provides a valuable case study from a technology and commercial perspective.

The ambitious Israeli start-up developed an IP core router to compete with the likes of Cisco Systems, Juniper Networks, Alcatel-Lucent (now Nokia), and Chinese giant Huawei.

It developed a way to integrate optics with its complex traffic manager chip design, resulting in a simpler and lower-power optically enabled core router platform. However, despite the novel chip, the company ultimately failed commercially, largely because its software team of 60 engineers could not compete with its much larger IP core router rivals.

Simply put, a router is a networking platform at the heart of the Internet. It takes IP traffic in the form of packets on its input ports and forwards them to their destination via its output ports. To do this, two chip types are used: a network processor and a traffic manager.

The network processor does all the packet processing—it takes the packet’s header, uses a lookup routing table to determine its destination, and inserts an updated header into the packet.

The traffic manager oversees billions of packets. The chip implements the queueing protocols and, based on a set of rules, determines which packets have priority on what ports. In a conventional IP router, there are also switch fabric chips that send the packets between the cards to the right router output ports.

Compass-EOS designed its router between 2007 and 2008 and used a merchant chip for the network processor. But it designed its own complex traffic manager ASIC and added a twist by figuring out a way to add optics to the chip. As a result, no switch fabric was needed. Instead of the traffic manager going via the router’s electrical backplane to a traffic manager on another card, each optically enabled traffic manager had sufficient bandwidth to connect to all the other traffic managers. Eight traffic manager chips in total were used on four line cards—all linked in a fully connected optical mesh.

As lead engineer Kobi Hasharoni put it, there was no backplane, which is why the Compass-EOS routers were so much more compact than those of its competitors.

At the time of the design, Compass-EOS did not consider using silicon photonics, which was deemed too immature. Instead, Compass-EOS used its ingenuity to figure out how to couple multiple VCSELs and photodetectors onto the chip.

VCSELs in 2007 were at 10 Gb/s, and Compass-EOS chose to operate them at a more relaxed rate of 8 Gb/s. In total, 168 VCSELs and 168 photodetectors were used on the traffic manager chip, enabling 1.344 Tb of traffic for transmit and the same for receive. The resulting chip with optical input–output consumed one-fifth of the power of a chip using electrical-only connections and achieved a 12-fold bandwidth-density improvement compared with electrical for the same chip area. Compass-EOS was developing a second-generation chip with a 16-Tb input–output before the project was canceled.

Hasharoni says that were he and his hardware colleagues to tackle a similar design today, they would use silicon photonics instead of VCSELs. The design would better support single-mode fiber, and the packaging would be easier with both the ASIC and optics being silicon.

What Compass-EOS demonstrated in 2010—arguably at least a decade ahead of its time—is how optics can be integrated alongside a complex chip to benefit the system architecture. In this case, it resulted in a more compact, lower-power IP router that was less costly to operate [18]. But despite all the router’s hardware ingenuity, the venture ultimately failed.

The start-up’s experiences were different from Intel’s with its Rack Scale Architecture. Compass-EOS’ hardware was a more ambitious copackaged chip, and the company did sell its IP router to several leading telcos. But the technical benefits of integrated optics within a system did not guarantee commercial success.

Before discussing the particular component technologies that will enable wide adoption of optics, first on the printed circuit board, then copackaged with silicon, the issues associated with pluggable optical modules are the subject of the next section.

7.5 Data Center Input–Output Challenges

Fig. 7.8 shows the evolution of optics: how it will move from the faceplate of systems in the form of pluggable modules to the printed circuit board and then will be combined with silicon in the same package. Ultimately, optics and complementary metal-oxide semiconductor (CMOS) circuitry will coexist on the same die. The wider issues associated with these four approaches are discussed in the remaining part of the chapter. We start with the current approach based on pluggable optical modules.

Figure 7.8 The evolution of system optics: From pluggable modules to on-board optics to copackaged optics to on-chip optics.

7.5.1 Pluggable Optical Transceiver Evolution: Photonic Integration for 100-Gb

The 100-Gb transceivers being adopted to connect leaf and spine switches, and between the spine switches, use multiple data lanes, each lane having a transmitter and receiver. The main two approaches are to use wavelength-division multiplexing or parallel optics, as described in Appendix 1, Optical Communications Primer. For the former, four lasers are used to generate 4 wavelengths, each at 25 Gb/s, which are multiplexed onto a single-mode fiber, while with parallel optics one laser can be shared to generate the four physical transmit and four receive lanes, each at 25 Gb/s.

The cost of a laser is important. A single-mode transceiver’s cost is dominated by the laser, which can account for over 30% of the total bill of materials.

The parallel optics 100-Gb PSM4 optical module specification has a cost advantage because it can share a single laser across all four lanes, each lane being independently modulated. But link cost increases with distance because a fiber ribbon cable, eight lanes wide, is used. In contrast, a single fiber pair is all that is needed for wavelength-division multiplexing using the 100-Gb CWDM4 or CLR4 modules. Photonic integration is the only technological approach that provides optical module vendors with the opportunity to meet the demanding price goals.
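The distance trade-off can be sketched with a toy cost model. The prices below are placeholders invented purely for illustration; only the structure (PSM4: cheaper module, eight fibers; CWDM4/CLR4: costlier module, two fibers) reflects the text.

```python
def link_cost(module_cost, fibers, fiber_cost_per_m, length_m):
    """One link = two transceivers plus the fiber plant between them."""
    return 2 * module_cost + fibers * fiber_cost_per_m * length_m

# Hypothetical prices, purely illustrative.
for length in (20, 100, 500, 2000):
    psm4 = link_cost(module_cost=250, fibers=8, fiber_cost_per_m=0.15, length_m=length)
    cwdm4 = link_cost(module_cost=400, fibers=2, fiber_cost_per_m=0.15, length_m=length)
    print(f"{length:>4} m: PSM4 ${psm4:7.0f}  CWDM4 ${cwdm4:7.0f}")
```

With any such numbers the pattern is the same: PSM4 wins at short reach, while the single fiber pair of the wavelength-multiplexed modules wins as the span grows.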

For these 100-Gb modules, there is ostensibly no difference in optical performance between a silicon photonics-based transceiver design and an indium phosphide one. But as happened with 10-Gb devices, which went from an initial parallel design to a serial one, developers are planning a 100-Gb serial design that uses one transmitter and one receiver only. Such a design requires moving the electrical lanes to 50-GBd signaling coupled with the multilevel signaling scheme of four-level pulse-amplitude modulation (PAM4), which transmits 2 bits per symbol [19]. PAM4 is becoming an important modulation scheme for addressing evolving data center requirements.
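The lane-rate arithmetic behind these options is simple: bit rate equals symbol rate times bits per symbol. A minimal illustration:

```python
import math

def lane_rate_gbps(baud_gbd, levels):
    """Line rate = symbol rate x bits per symbol (log2 of the number of levels)."""
    return baud_gbd * math.log2(levels)

print(lane_rate_gbps(25, 2))   # 25 GBd NRZ  ->  25 Gb/s (today's 4 x 25-Gb lanes)
print(lane_rate_gbps(25, 4))   # 25 GBd PAM4 ->  50 Gb/s
print(lane_rate_gbps(50, 4))   # 50 GBd PAM4 -> 100 Gb/s serial lane
```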

Moreover, 100-Gb is the latest but not the last speed stop in the data center. Internet content providers are already eyeing 200- and 400-Gb links and even 1-Tb ones. All these designs require photonic integration and high-speed electronics that will benefit from a close coupling between the optics and electronics, whether the optics are embedded on the printed circuit board near the chip or are copackaged with the chip.

Using an advanced CMOS packaging process for silicon photonics provides the most promising path for high performance and low cost.

7.5.2 The Move to Single-mode Fiber

Both single-mode and multimode optical transceivers are used in data centers. Traditionally, links from top-of-rack switches to spine switches use multimode fiber while links between spine switches use single-mode fiber. Microsoft is one Internet content provider that plans to use only single-mode fiber for all its future data centers. Many other Internet content providers are following Microsoft, but enterprise players, for example, continue to use multimode.

Multimode fiber links are the cheapest because they use inexpensive VCSEL lasers and hence inexpensive transceivers. However, with each increase in data rate, the transmission distance standardized over a multimode fiber diminishes. This, in part, explains Microsoft’s decision to adopt single-mode fiber: it supports long spans within its data centers, simplifies its networking decisions, and future-proofs its fiber investment as data rates inevitably rise. Using only single-mode fiber also reduces the module types that a data center manager must keep as inventory, and hence costs.

Data center managers’ adoption of single-mode fiber is good news for photonic integration and for silicon photonics because it increases the overall volume for single-mode transceivers, essential for reducing costs.

7.5.3 Pluggable Transceivers for the Server–to–Top-of-Rack Switch Optical Opportunity

Current links between servers and top-of-rack switches are at 10 Gb/s because that is the link speed needed by the servers used by the largest Internet content providers. Copper cabling is commonly used to connect the two, given the link distance is typically under 3 m.

Optical cables have started replacing copper for such links as costs have come down. And optics beats copper when it comes to power consumption: for a 10-Gb port, e.g., copper consumes 4–6 W whereas optics consumes 0.5 W [20].
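Scaled across a large facility, the per-port difference adds up quickly. A back-of-the-envelope tally (the 100,000-port count is an illustrative assumption, not a figure from the text):

```python
ports = 100_000       # assumed server-to-switch port count, for illustration only
copper_w = 5.0        # midpoint of the 4-6 W per 10-Gb copper port quoted above
optics_w = 0.5        # per optical port, as quoted above

saving_kw = ports * (copper_w - optics_w) / 1000
print(f"Roughly {saving_kw:.0f} kW saved by moving these links to optics")  # ~450 kW
```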

Another trend benefiting the switchover to optics is that as transmission speeds increase, the distance supported by a copper link decreases. As network interface cards for servers move to 25 Gb/s, copper can support link lengths of up to 3 m. But as links move to 50 Gb and then 100 Gb, copper’s reach may become too short for server-to-leaf-switch connections.

Optics will play an increasingly important role connecting servers to switches. Here, silicon photonics is battling multimode VCSEL-based links. For silicon photonics to replace VCSELs, it must match VCSELs’ optical performance and low power consumption and be lower cost—a challenging set of requirements. Silicon photonics has shown it can be competitive with VCSELs when used in active optical cables.

7.5.4 Transition One: From Pluggables to On-Board Optics

Given how switch capacity will continue to grow, a key issue that needs to be addressed is increasing the capacity that can be supported on the faceplate of platforms. The platform itself will not get any larger so what is required is to pack more, higher-capacity pluggable modules on a switch’s front panel.

The market currently uses 100-Gb pluggable modules, and the smallest optical transceiver package available for 100-Gb is the QSFP28 form factor. Thirty-two QSFP28s are typically used on the front panel, matching the 3.2 Tb of capacity of current state-of-the-art switch chips, and this can be stretched to as many as 36 QSFP28 modules.

Mellanox is one equipment company that is confident it can get to 200 Gb/s in a QSFP28 module using its silicon photonics technology. The module is expected in 2017 and will enable a faceplate density of 7.2 Tb/s using 36 200-Gb modules [21].

According to Mellanox’s Mehdi Asghari, the company’s 25-Gb silicon photonics modulator and photodetector can already operate at 50 Gb/s. This means it can use the simpler nonreturn-to-zero modulation scheme rather than PAM4.

The challenge using nonreturn-to-zero signaling is getting the associated electronics to work at 50 Gb. This is a complex radio frequency design challenge that requires skilled analog design engineers. The design challenges include getting the drive electronics, the assembly, the wirebonds used to connect the chip to the packaging, and the printed circuit board tracks to all work with good signal integrity at 50 Gb. The benefit of going from 25 to 50 Gb/s using nonreturn-to-zero is that there is less signal loss than using PAM4 [22]. Using PAM4 also introduces extra latency. Nonreturn-to-zero is always the best option if you can do it, says Asghari.

The next-generation data rate is 400 Gb/s, and the proposed pluggable optical transceiver packages to house the optics are the quad small form-factor pluggable double density (QSFP-DD) and the µQSFP. Fig. 7.9 highlights these new pluggable form factor options being developed for next-generation switches as summarized by the industry organization, the Ethernet Alliance [23].

Figure 7.9 Solutions being developed for next-generation switches. From Ethernet Alliance.

The QSFP-DD uses eight lanes that can operate at 25 Gb using nonreturn-to-zero modulation and 50 Gb using PAM4 to support 200- or 400-Gb ports. Thirty-six modules can be packed on a one-rack-unit front panel, for an aggregate bandwidth of 14.4 Tb.

The µQSFP is smaller—approximately the same size as the small form-factor pluggable (SFP) module. As in the QSFP, there are four electrical channels, each lane using 25-Gb nonreturn-to-zero signaling. The multisource agreement is intended to support 50 Gb using PAM4 as well. Operating at 25 Gb, an aggregate one-rack-unit faceplate density of 7.2 Tb is achieved using 72 ports.
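The faceplate densities quoted here all follow from modules per panel times lanes per module times lane rate, as the short calculation below shows (lane counts and rates taken from the text):

```python
def faceplate_tbps(modules, lanes, gbps_per_lane):
    """Aggregate one-rack-unit front-panel bandwidth in Tb/s."""
    return modules * lanes * gbps_per_lane / 1000

print(faceplate_tbps(32, 4, 25))   # QSFP28 today:         3.2 Tb/s
print(faceplate_tbps(36, 8, 50))   # QSFP-DD, PAM4 lanes: 14.4 Tb/s
print(faceplate_tbps(72, 4, 25))   # uQSFP at 25 Gb/s:     7.2 Tb/s
```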

Optical transceiver vendors are well on the way to defining and preparing the next high-bandwidth transceivers with a small enough size to support next-generation 12.8-Tb switches. Internet content providers are already clamoring for these switches and associated optical interconnects.

One way to tackle the issue of limited faceplate density is to adopt on-board optics. Embedded, mid-board or on-board optics moves the optical transceiver off the front plate and onto the printed circuit board, closer to the ASIC.

Several advantages result. First, the high-speed electrical signal path from the ASIC to the optics is shorter, simplifying the printed circuit board design. And with on-board optics, only a fiber connector is fed to the front of the equipment. This means the number of transceivers depends on how many embedded optics modules can be fitted on the board, not how many pluggable modules can be fitted on the equipment’s front panel. Freeing the front panel of pluggable modules also allows for more ventilation holes, improving the air flow used to cool the equipment.

However, on-board optics has its own issues. One is that all the switch slots are populated from the start with optics, something that can be avoided with pluggable modules. This may be acceptable for the Internet content providers, but many enterprises prefer a pay-as-you-grow model.

Another is that if an optical link fails, the entire board has to be replaced—an expensive proposition. Solutions lie in optical redundancy, in understanding the failure mechanism, and in design and planning, such as making the optical modules field-replaceable.

There is also no industry standard for embedded optics. COBO, the Consortium for On-Board Optics, is an industry initiative backed by Microsoft that is working to develop the first standard solutions for embedded optics [24]. COBO is working on 400-Gb/s designs and a way of combining two side-by-side to deliver 800-Gb/s interfaces. The embedded optics will support multimode and single-mode fiber over different reaches. COBO is considering enabling mid-board optics to support distances of 100 km using coherent technology, as discussed in Section 5.5.1.

Luxtera is providing 2-by-100-Gb PSM4 embedded optics modules in volume to several equipment makers. Luxtera is also a member of COBO, as is Mellanox. Yet while both companies welcome the advent of a standard, they admit there is caution among companies about embracing on-board optical technology.

Luxtera cites the issue of field serviceability as one concern customers have. “Front faceplate pluggable modules are much easier to replace but come with a higher implementation cost,” says Luxtera’s Peter De Dobbelaere. He claims that the greater reliability of silicon photonics when used for embedded optics also helps allay customers’ concerns.

Mellanox says the key issue is the input–output bottleneck on the ASIC: embedded optics, while increasing the input–output of the equipment, does nothing to address the chip’s pinch point, as is explained in the next section. Asghari also warns against dismissing pluggable optics, which he expects to continue to evolve for at least two more generations.

The use of silicon photonics suits mid-board optics design due to the small size, low power, and low cost needed. The potential for high volumes is also attractive. However, embedded modules can use any optical technology as long as it provides adequate performance. And while silicon photonics may be well suited, it is up against traditional VCSEL technology for short-link lengths and indium phosphide–based offerings for longer reach ones.

The advent of on-board optics is an important milestone in the development of photonics in the data center. Brad Booth, the chair of COBO, says that the deployment of technology in systems will help the industry learn what embedded optics brings and what some of the challenges are. There is no revolutionary change that happens with technology, he says, it all has to be evolutionary. In other words, embedded optics represents an important stepping-stone for the next development—and the most significant development for silicon photonics—that of copackaged optics.

7.5.5 Transition Two: From On-Board Optics to Copackaged Optics

The introduction of higher-capacity switch chips is driving pluggable module developments and a need for optics inside systems.

Current generation switch chips support 3.2 Tb of capacity, and the 6.4 Tb Tomahawk II from Broadcom, at the time of writing, is already sampling. The QSFP28 pluggable module developments will be able to support these next-generation chips, as discussed in the previous section. The next switch chip advance—to 12.8-Tb, a capacity already sought by the Internet content providers—is expected in 2018. Here too, next-generation pluggables such as the QSFP56, QSFP-DD, and the µQSFP will deliver sufficient input–output capacity at the system’s front panel to support such switch silicon.

But faceplate density—having sufficient input–output capacity to support the faster switch silicon—is only one of the system design issues. To go from 3.2 to 6.4 Tb, the input–output of the switch chip must double. Broadcom’s 6.4 Tb switch chip crams 256 25 Gb/s signals on-chip by using a more advanced CMOS process node.

Each of these 25-Gb input–output signals is generated using a serdes (serializer/deserializer) circuit. The role of the serdes is to serialize the switch data and drive it across electrical lines on the printed circuit board to the pluggable optical transceiver. Clearly, going from 3.2 to a 6.4-Tb chip doubles the number of 25-Gb lines driving signals across the printed circuit board to the 100-Gb pluggable modules. Inside the optical module, there is a retimer circuit that cleans up the received electrical signal before being passed to the optics for transmission.

According to Luxtera’s De Dobbelaere, switch vendors have to sometimes put an additional retimer chip between the switch ASIC and module to address signal integrity issues associated with longer printed circuit board traces. Retimer circuitry adds to the board’s design complexity and dissipates power, and the problem only worsens as channel count and data rate go up.

And of course that is just what will happen with the advent of 12.8-Tb switch silicon. To achieve 12.8 Tb, two approaches are possible: either go to an even more advanced—and expensive—CMOS process node such as 7 nm to get the serdes to 50 Gb/s, or continue to use 25-GBd serdes but add PAM4 to double the switch capacity. According to De Dobbelaere, the industry has settled on 25 GBd and PAM4 for now. Note that whichever approach is chosen, it only exacerbates the chip’s power consumption—the serdes are expected to consume over half the chip’s total power—and the issue of retimers.
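The serdes arithmetic driving these choices is straightforward: capacity divided by the per-lane rate, where the per-lane rate is baud times bits per symbol. A minimal sketch (the 25.6-Tb line assumes the 50-GBd PAM4 option discussed later in this section):

```python
def serdes_lanes(switch_tbps, baud_gbd, bits_per_symbol):
    """Number of serdes lanes needed to expose the full switch capacity."""
    lane_gbps = baud_gbd * bits_per_symbol
    return int(switch_tbps * 1000 // lane_gbps)

print(serdes_lanes(3.2, 25, 1))    # 128 lanes of 25-Gb NRZ
print(serdes_lanes(6.4, 25, 1))    # 256 lanes of 25-Gb NRZ
print(serdes_lanes(12.8, 25, 2))   # 256 lanes of 50-Gb PAM4 (25 GBd x 2 bits/symbol)
print(serdes_lanes(25.6, 50, 2))   # 256 lanes of 100-Gb PAM4 (50 GBd x 2 bits/symbol)
```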

This is where using on-board optics close to the switch chip helps. The shorter traces between the chip and the on-board optics consume less power. But embedded optics does nothing to solve the fundamental problem associated with switch ASICs: the finite number of input–output lines a chip can support.

Simply put, the chip’s high-speed input–output signaling must be placed on the chip’s edge due to printed circuit board design considerations. The chip’s perimeter is finite, and the input–output signals must be spaced a certain distance apart. The serdes circuitry connects to the printed circuit board through little balls on the chip packaging’s base, an arrangement known as a ball grid array. There are only so many balls that can be supported on a chip, and only so much of the ASIC’s area that can be used for input–output without exceeding the maximum chip size.

The next switch chip after 12.8 Tb will be 25.6 Tb, scheduled for 2020. For the chip to support such a capacity, 100-Gb signals will be required using 50-GBd and PAM4. Such a solution will burn an even greater proportion of the chip’s power, complicate the packaging and printed circuit board designs, and diminish the transmission distances possible over the board’s traces. On-board optics simplifies the signaling between an ASIC and the on-board optics and lowers the power consumed, but it cannot address the chip’s input–output bottleneck.

What is required is a way to expand the number of input–output signals off the chip and simplify the drive requirements on the individual serdes. This is where interposer technology—developed by the chip industry—can help. By using interposer technology to copackage the ASIC with silicon photonics, both these goals are achieved.

The semiconductor industry is adopting interposer technology as part of its work developing 2.5D- and 3D-packaging technology. The chip industry has long used 2D-packaging techniques whereby more than one chip die—the bare silicon chip before it is packaged—is bonded onto a common substrate. The substrate material includes laminates (a form of printed circuit board with fine copper lines), ceramic, or silicon. 2D packaging is known by a variety of names such as system in package and multichip module. The advantage of 2D packaging is the ability to combine different chip technologies in one package. This benefits yield and allows device evolution without having to redesign a single more complex chip [25].

The 2.5D chip extends the concept by adding a silicon interposer between the different dice and the substrate, as shown in Fig. 7.10. The interposer is a slice of silicon which has metal traces on both its surfaces. The chips are bonded to the upper surface while the lower surface is connected to the common substrate using standard flip-chip bumps. Through-silicon via (TSV) technology is used to connect the two metal layers to allow the dice to communicate with the printed circuit board. For completeness, 3D packaging extends the concept by having dice stacked on top of each other and connected to the interposer.

Figure 7.10 Photonic integrated circuit interposer. TSV is a through silicon via, a vertical electrical connection approach, EIC refers to an electronic integrated circuit, PIC is photonic integrated circuit, and BGA is a ball grid array. From © STMicroelectronics. Used with permission.

What Fig. 7.10 shows is that the chip dice are connected to the interposer’s upper surface using micro-bumps. These are a tenth the size of the flip-chip bumps and the bumps used by the ball grid array. And this is the key benefit: the interconnect on the interposer allows for much greater interconnect bandwidth between the dice, enabling a tremendous increase in capacity and device performance.

For silicon photonics, the technology holds huge potential. By adding a silicon photonics die on the interposer, an additional input–output route using fiber off the 2.5D chip becomes available.

Using 2.5D technology, silicon photonics vendors can thus benefit from yet another technology that has been developed by the chip industry. Using such a technology, designers can develop designs similar to the ASIC with optical input–output pioneered by Compass-EOS.

For the data center, 2.5D packaging offers a solution that overcomes the input–output bottleneck of the switch chips, something that optical pluggable modules and on-board optics cannot address. It is also a technology that silicon photonics is inherently able to exploit. This is not the case with the established optical technologies.

And it doesn’t stop there: high-performance computing processors and future field-programable gate arrays (FPGAs) can benefit from being copackaged with silicon photonics. An FPGA is an important chip, with the more advanced versions containing huge numbers of logic gates as well as 25-Gb/s serdes that can be customized to implement datacom and telecom chip designs [26].

Professor Bergman points out that her team’s research work includes ways to embed photonics within electronic switches to improve networking performance while reducing the energy required, as discussed in Section 7.4.1. This will be enabled using 2.5D integration technology, and silicon photonics allows the embedding of optical input–output right next to the electronics switch chip.

Another part of her data center work is looking at computing nodes. Such nodes comprise general multicore devices, specialist graphic processing units, and high bandwidth memory. Chip-to-memory communications are becoming a key pinch point in these systems and are ripe for the use of interposer technology.

Bergman points out that only so many dice can fit on an interposer but that the bandwidth that silicon photonics brings means that 2.5D packages can be linked. “The really beautiful thing with photonics is that you add more interposers because the bandwidth off that 2.5D chip is just as high as on the chip,” she says.

Silicon photonics luminary Professor Lionel Kimerling, at MIT’s Department of Materials Science and Engineering, agrees that board-level integration is an important area of focus and the question is what solution will scale for the next 20 years. The chip industry will not make a big investment in something that will not scale for more than a couple of generations, he says.

In his view, the chip is becoming almost irrelevant given it will no longer scale with the demise of Moore’s law. “So it is all about how can I scale the number of chips in a package,” he says. This is promoting system-in-package technology. Once this starts, there is no doubt that there will need to be optical interconnect coming out of such a package, he says: “And no one really doubts that before long, you will need optical interconnect between the chips inside the package.”

This is the potential interposer technology offers. But combining photonics and electronics chips using an interposer is not without challenges. The semiconductor industry itself has only recently started to embrace the technology, and adding photonics presents its own issues. Thermal management and where to place the laser are two major ones designers must resolve.

Will companies pass on on-board optics and adopt a 2.5D design when they transition their systems away from pluggable optical transceivers? It is possible but a more likely path is that the industry will embrace on-board optics first. New technologies take time, and there is much to be learned from embedded optics before copackaged optics will be ready and the industry will be willing to make the jump.

Meanwhile, start-ups are now becoming active in this area.

Ayar Labs is a US start-up that is developing a 3.2-Tb optical interconnect chip designed to sit alongside switch silicon. The chip aims to replace 32 100-Gb pluggable modules or, in future, eight more complex 400-Gb modules, and it is expected to emerge starting in 2018.

Another start-up developing a silicon photonics device to sit alongside chips in the data center is Sicoya from Germany. It is targeting servers first but says the technology can be used for other equipment in the data center such as switches and routers. It says its silicon photonics device is designed for chip-to-chip communications and could be placed very close to the processor or even copackaged in a system-in-package design [27].

7.6 Adding Photonics to Ultralarge-Scale Chips

The benefits of copackaging include higher-performance chips, lower power consumption, and greater signal integrity, although considerable development work will be required before it is commercially deployed in volume.

Will the union of optics and electronics end at copackaging? Extrapolating, could optics not end up inside the chip itself, rather than just bolted alongside? Integrating optics within the chip would enable optical communications inside the chip and to other such chips.

A group of US academics has demonstrated just that in a paper published in the scientific journal Nature: a microprocessor that integrates logic and silicon photonics in one chip, with the optics enabling communications between chips [28].

Vladimir Stojanovic, one of the academics involved in the project and based at the University of California, Berkeley, claims it is the first time a microprocessor has communicated with the external world using something other than electronics.

The chip features two processor cores, as shown in Fig. 7.11, and 1 MB of on-chip memory, and comprises 70 million transistors and 850 optical components. The chip is also notable in that the researchers achieved their goal of fabricating it on a standard IBM 45-nm CMOS line without any alteration. They managed to implement the photonics functions using a CMOS process tuned for digital logic—what they call “zero-change” silicon photonics.

Figure 7.11 Optical interconnection integrated with microprocessors. Reprinted by permission from Macmillan Publishers Ltd: Chen Sun et al., Single-chip microprocessor that communicates directly using light, Nature 528, 534–538 (24 December 2015), copyright 2015.

Pursuing a zero-change process was initially met with skepticism, says Stojanovic. People thought that making no changes to the process would prove too restrictive and lead to very poor optical device performance. Indeed, the first designs didn’t work. But the team slowly mastered the process, making simple optical devices before moving onto more complex designs.

The chip uses a micro-ring resonator for modulation, while the laser source is external to the chip. The micro-ring is known to be sensitive to small temperature changes, and its resonance is corrected with on-chip electronics.
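How such a correction works can be sketched in outline. The sketch below is not the circuit used in the demonstration; it is a generic illustration of thermal locking, written in Python for readability, and the helper functions read_avg_photocurrent() and set_heater_code() are hypothetical stand-ins for a monitor photodiode and an integrated heater driver.

    # Generic illustration of thermal tuning for a micro-ring modulator.
    # This is NOT the circuit used in the Nature demonstration; it only sketches
    # the idea of servoing a heater so the ring's temperature-sensitive resonance
    # stays aligned with the laser. read_avg_photocurrent() and set_heater_code()
    # are hypothetical stand-ins for the real monitor and heater-driver hardware.

    def thermal_lock(read_avg_photocurrent, set_heater_code,
                     setpoint, gain=0.5, iterations=1000, code=128.0):
        """Integral controller: drive the monitored average photocurrent
        towards `setpoint`, compensating slow thermal drift of the ring."""
        for _ in range(iterations):
            error = setpoint - read_avg_photocurrent()
            code += gain * error                   # integrate the error
            code = max(0.0, min(255.0, code))      # clamp to an 8-bit heater DAC
            set_heater_code(int(round(code)))      # apply the new heater setting
        return code

In practice the sign of the gain depends on which slope of the ring resonance the lock sits on, and a real controller runs continuously rather than for a fixed number of iterations.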

The team has demonstrated two of its microprocessor chips talking to each other, with one processor talking to the memory of a second chip 4 m away. Two chips were used rather than one, with the signal going off-chip before returning, to prove that the communication was indeed optical, since each chip has an internal electrical bus linking its processor and memory.

For the demonstration, a single laser operating at 1183 nm feeds the two paths linking the memory and processor. Each link operates at 2.5 Gb/s for a total bandwidth of 5 Gb/s.

However, for the demonstration, the microprocessor was clocked at one-eightieth of its 1.65-GHz clock speed—20.7 MHz—because only one wavelength was used to carry data and a higher clock speed would have flooded the link.

The microprocessor design can support 11 wavelengths for a total bandwidth of 55 Gb/s, while the silicon photonics technology itself will support between 16 and 32 wavelengths overall.

The group is lab-testing a new iteration of the chip that promises to run the processor at its full clock speed. The factor-of-80 speedup comes from using 10 wavelengths instead of one, each at 10 Gb/s, with the design also supporting duplex communications. The latest chip features improved optical functions too. “It has better devices all over the place: better modulators, photodetectors, and gratings. It keeps evolving,” says Stojanovic.
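To make the numbers concrete, the short calculation below reproduces the bandwidth arithmetic using only the figures quoted above: the 5-Gb/s demonstration, the 55-Gb/s capability of the current design, and how 10 wavelengths at 10 Gb/s with duplex operation are consistent with the quoted factor-of-80 speedup.

    # Bandwidth arithmetic for the zero-change photonic microprocessor
    # (all figures as quoted in the text).

    per_wavelength_gbps = 2.5
    links = 2                                          # processor-to-memory paths

    demo_total = per_wavelength_gbps * links           # 5 Gb/s with one wavelength
    design_total = 11 * per_wavelength_gbps * links    # 55 Gb/s with 11 wavelengths

    # Demo clock derating: roughly one-eightieth of the 1.65-GHz clock.
    demo_clock_mhz = 1650 / 80                         # ~20.6 MHz (text quotes 20.7 MHz)

    # Next iteration: 10 wavelengths, each at 10 Gb/s, duplex operation.
    speedup = (10 * 10 / per_wavelength_gbps) * 2      # 40x per link, doubled by duplex = 80

    print(demo_total, design_total, round(demo_clock_mhz, 1), speedup)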

The microprocessor development is hugely impressive; the demonstration is in effect two generations ahead of what’s being developed today with transceivers. But the work already shows two things: on-chip photonics can work alongside complex logic, in the form of a two-core processor and memory, and the photonics design was achieved without altering a standard CMOS process.

Optical communications using silicon photonics inside a chip is a long way off. The researchers deliberately chose to demonstrate chip-to-chip communications because they recognize that is the next big opportunity. As Stojanovic says, that is where the biggest bang for the buck is.

7.7 Pulling It All Together

Data centers are factories of profit for the Internet giants. Such “computer buildings” require scalable networks to connect a huge and growing number of servers. The networks need to be predictable, power-efficient, and cost-effective. Silicon photonics is best suited to support these demands.

Reducing optical transceiver cost is the first identifiable opportunity for silicon photonics. Transceiver cost is a key issue given that the number of interconnects is growing exponentially as servers are added to the network.

Silicon photonics is a good candidate technology for optical transceivers for other reasons too. The Internet content providers need 100-Gb single-mode transceivers to support their hyperscale data centers, and such designs need photonic integration. That suits silicon photonics, which is a single-mode technology, thereby matching end-customer needs.

Delivering low-cost transceivers is one approach to enable highly scalable data centers. Another is to make high-radix switches. But what is really required is bringing optics inside the switch platform, to help switch chips grow their port count and the ports’ speed, and to embed complementary optical switching to further enhance overall performance and reduce the energy used.

This is leading to optics moving off the faceplate and onto the board hosting the switch chip, and from there even closer to the switch silicon using 2.5D copackaging technology. Silicon photonics is being used for pluggables and can play a role in embedded optics, but it comes into its own with copackaging.

Silicon photonics has even been shown working within a complex multicore microprocessor. The market may not be ready for all these solutions, but silicon photonics has already crossed the finishing line.

Key Takeaways

• The leaf-spine switching architecture has become the default approach to link servers, but as the number of servers grows, the number of switches and the interconnect between them grows exponentially.

• As data rates grow, the trend is towards single-mode integrated transceivers, a shift that benefits silicon photonics. But to get better network scaling efficiencies, higher-radix, higher-capacity switches are needed.

• To improve systems performance in data centers, optics is being moved closer to the electronics. A disaggregated server using silicon photonics and a novel IP router have already demonstrated such benefits. But these examples were not successful commercially, partly because they were too early for the marketplace.

• On-board optics and copackaged optics alongside silicon represent key milestones in the evolution of silicon photonics that promise to benefit many of the systems used in data centers: server nodes, switch silicon, networking platforms and storage.

• The end game is optics and electronics in one chip. This has already been demonstrated with optics piggybacking on a standard CMOS process. We can think of no better example that shows the long-term potential of silicon photonics.
