Datacenter background

25 Top Papers

25 years is a very long time in technology, and warehouse-scale computing has gone from being a niche optimization specialized for Google workloads to being commonplace. This evolution is reflected in a vibrant collection of academic research papers.

Picking the top 25 for this chapter was challenging. To help with our selection, we focused on papers from Google. While this reflects our familiarity with work at Google (and is a nod to the outsized impact Google has had in inventing and scaling warehouse-scale computing), there is a lot of excellent work from outside Google, as demonstrated in the numerous references in this book.

We then asked prominent authors to each rank their top papers, and combined these rankings to identify the top 50 papers. Finally (with great difficulty), we narrowed the list down to the 25 you are going to read about next. Our goal was to give you, the readers, a sense of the top papers in the area, but also to provide some breadth across different areas of WSCs.

In this chapter, we also experimented with AI by using Gemini to read through our top paper list and provide summaries. We combined that with our own notes and wrote the final summaries you see below. Each paper includes a short paragraph on what the paper is about and then a short (subjective) comment on why we picked it for this top-25 list. We hope you enjoy reading these excellent papers and have as much fun as we had in compiling this list.


A Selection of 25 Top Papers for 25 Years of WSC

Selected Papers

01

The Anatomy of Search: Google's Early Architecture

The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page, Computer Networks, 30 (1998), pp. 107-117

This paper marked the beginning of modern web search, demonstrating that the web's hyperlink graph can be used to measure page authority (captured by the PageRank algorithm) and thereby greatly improve ranking. The Google prototype also revealed the significant descriptive power of anchor text, successfully using these external labels to improve accuracy. We all take ranking for granted, but this paper showed how to achieve search relevance superior to content-only methods. Last but not least, the prototype demonstrated one of the key WSC approaches by sharding its web index across inexpensive commodity hardware.
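
As a toy illustration of the PageRank idea, here is a power-iteration sketch over a three-page link graph (the function and graph are our own illustrative construction, not code from the paper):

```python
# Toy power-iteration PageRank (illustrative only).
# graph maps each page to the list of pages it links to.
def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in graph.items():
            share = rank[u] / len(outs) if outs else rank[u] / n
            targets = outs if outs else nodes  # dangling page: spread evenly
            for v in targets:
                new[v] += damping * share
        rank = new
    return rank

# "a" is linked from both "b" and "c", so it ends up with the highest rank.
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

Each iteration redistributes rank along links, so a page is important when important pages link to it; the damping factor models a surfer who occasionally jumps to a random page.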

Why we picked this paper:

"It's the classic search engine paper. And it got rejected by the major conference in the field."

02

Web Search for a Planet: The Google Cluster Architecture

Web Search for a Planet: The Google Cluster Architecture, Luiz André Barroso, Jeffrey Dean, Urs Hölzle, IEEE Micro, 23 (2003), pp. 22-28

By 2003, Google's architecture had evolved quite a bit from the 1998 prototype that maxed out at two queries per second, and the conceptual ideas of the original prototype had been proven to work in practice. Clusters of 15,000 commodity PCs used sharding to provide horizontal scaling and redundancy. Some of the more frequent hardware failures were now handled automatically. The servers themselves were managed in a relatively manual way because Borg and Linux containers had not been conceived yet. Last but not least, the paper is one of the first ones to discuss power and energy as important system attributes.

Why we picked this paper:

"It's a short but detailed description of the first real-world WSC stack, optimized for web search."

03

The Google File System

The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, Proceedings of the 19th ACM Symposium on Operating Systems Principles, ACM, Bolton Landing, NY (2003), pp. 20-43

The Google File System (GFS) was the first real storage system designed at Google, and the first horizontally scalable file system tailored for large-scale, data-intensive applications. Before GFS, the only high-capacity storage systems available required putting expensive high-performance disks with RAID controllers into a rack. In contrast, GFS assembled a sea of disks, distributed across thousands of machines, into a file system. Files were split into individual chunks, and each chunk was placed on multiple disks to avoid data loss and increase read throughput. A single master node managed metadata. GFS presented a POSIX-like API but relaxed consistency semantics to simplify the system and enhance performance for its target applications. GFS was instrumental in allowing Google to significantly increase its search index size above that of other search engines, and prepared it for storage-heavy future applications like Gmail and YouTube.

Why we picked this paper:

"How do you write a file system capable of managing tens of thousands of unreliable disks while providing ultrahigh throughput, and how do you implement it in just 18 months? By making judicious choices to simplify key aspects without giving up too much."

04

MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean, Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (2004), pp. 137-150

MapReduce complements GFS: whereas GFS made it easy to store large amounts of data across many devices, MapReduce made it easy to process large datasets across many commodity servers. Users specify “map” functions that process key/value pairs to generate intermediate pairs, and “reduce” functions that merge intermediate values associated with the same key. The runtime system transparently handles parallelization, fault tolerance, data distribution, and load balancing. This abstraction allowed programmers without distributed systems expertise to leverage large clusters effectively. The paper demonstrated MapReduce's utility across diverse tasks like distributed sorting, web access log analysis, and inverted index construction, significantly simplifying large-scale data processing.
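
The programming model is easy to sketch in miniature. The following in-process word-count example is our own illustration of the API's shape; the real system distributes the map, shuffle, and reduce phases across thousands of machines with fault tolerance:

```python
# Minimal in-process sketch of the map/reduce programming model.
from collections import defaultdict

def map_fn(doc):                  # emit (word, 1) for each word in a document
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):      # merge all counts for one word
    return word, sum(counts)

def run_mapreduce(docs):
    shuffle = defaultdict(list)   # the "shuffle" phase groups values by key
    for doc in docs:
        for k, v in map_fn(doc):
            shuffle[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in shuffle.items())

counts = run_mapreduce(["the quick fox", "the lazy dog"])
```

The user writes only `map_fn` and `reduce_fn`; everything between them (grouping, distribution, retries) is the runtime's job, which is exactly why the abstraction scaled so well.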

Why we picked this paper:

"Lisp invented the map-reduce pattern, and MapReduce (later entering open source as Hadoop) quickly became the default choice for large-scale batch processing."

05

Bigtable: A Distributed Storage System for Structured Data

Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX (2006), pp. 205-218

Complementing GFS and MapReduce, Bigtable was the initial solution to hyperscale databases and the start of the NoSQL movement that sacrificed strong transactional consistency in exchange for horizontal scalability. Bigtable stores its data across thousands of commodity servers and exposes a sparse, distributed, persistent multi-dimensional sorted map data model. Data is indexed by row key, column key, and timestamp. Bigtable uses GFS for underlying storage and provides high scalability, availability, and performance. Its strength in providing low-latency reads/writes and efficient scans over massive datasets strongly influenced subsequent NoSQL designs.

Why we picked this paper:

"It started the NoSQL movement, which is still going strong today (e.g., MongoDB, Redis, Firestore, DynamoDB). Over the past decade, however, Spanner has replaced BigTable for many applications at Google because it offers fully transactional storage with comparable performance."

06

The Chubby Lock Service for Loosely-coupled Distributed Systems

The Chubby lock service for loosely-coupled distributed systems, Mike Burrows, 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USENIX (2006)

WSC services need a way to bootstrap their configuration data, and multiple replicas of a single service need to agree who's the leader. Chubby provides this functionality in a single, simple API. It uses Paxos for fault-tolerant consensus among its own replica servers. Its client-side caching mechanism, coupled with keep-alive sessions and event notification, minimizes server load while maintaining consistency.
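
A minimal sketch of leader election on top of a lock service (a hypothetical in-process API of our own; real Chubby clients also deal with sessions, keep-alives, and cache invalidation):

```python
# Leader election via an advisory lock: whoever acquires the lock file
# is the leader. Toy in-process stand-in for a lock service.
import threading

class TinyLockService:
    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None

    def try_acquire(self, path, client):
        # Non-blocking acquire: losers back off and watch for changes.
        if self._lock.acquire(blocking=False):
            self._holder = client
            return True
        return False

    def holder(self, path):
        return self._holder

svc = TinyLockService()
# Two replicas race to become leader; exactly one wins.
r1_is_leader = svc.try_acquire("/ls/cell/service/leader", "replica-1")
r2_is_leader = svc.try_acquire("/ls/cell/service/leader", "replica-2")
```

The losing replica can read the lock holder to discover the current leader, which is how a lock service doubles as a small, highly available name service.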

Why we picked this paper:

"Chubby is a classic example of a WSC service with a simple but indispensable API that solves common problems. It inspired today's popular OSS equivalents, Apache ZooKeeper and Kubernetes' etcd."

07

The Case for Energy-Proportional Computing

The Case for Energy-Proportional Computing, Luiz André Barroso, Urs Hölzle, IEEE Computer, 40 (2007)

Computer systems, particularly servers in data centers, should consume power in proportion to the amount of work performed. Most servers aren't energy proportional: they consume a significant fraction of their peak power even when idle or operating at low utilization. This short paper defined the problem and argued that better energy proportionality would significantly reduce data center operational costs and environmental impact.
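
A back-of-the-envelope calculation shows why this matters. Assume, purely for illustration (the numbers are ours, not the paper's), a server that draws half its peak power at idle, with power rising linearly to peak at full load:

```python
# Illustrative non-proportional server: 50% of peak power at idle.
def power_watts(utilization, peak=500.0, idle_fraction=0.5):
    return peak * (idle_fraction + (1 - idle_fraction) * utilization)

def relative_efficiency(utilization):
    # Useful work per watt, normalized to running flat out at peak load.
    return utilization * power_watts(1.0) / power_watts(utilization)

# Servers typically run at low utilization, where efficiency is worst.
eff_at_30 = relative_efficiency(0.3)   # roughly 0.46 of peak efficiency
```

At 30% load this server still burns 325 W of its 500 W peak, so it delivers less than half of its best-case work per joule, which is precisely the waste that energy proportionality would eliminate.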

Why we picked this paper:

"Sometimes a simple observation, phrased in a way everyone can understand, can influence the direction of an industry."

08

Dremel: Interactive Analysis of Web-Scale Datasets

Dremel: Interactive Analysis of Web-Scale Datasets, Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339

Dremel implements interactive, ad-hoc querying of extremely large, read-only, nested datasets. Its high-throughput parallel query execution engine may use thousands of servers for a single query. Columnar storage allows better compression and significantly reduces I/O load because queries read only partial rows, skipping unused record fields. Its query engine uses a massively parallel tree architecture to execute SQL-like queries over petabytes of data in seconds. For many analytics tasks, Dremel reduced the programming effort and the execution time by several orders of magnitude compared to MapReduce.
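
The core benefit of columnar storage is easy to see in a toy sketch (our own illustration, not Dremel's record-shredding algorithm, which also handles nested and repeated fields):

```python
# Row-oriented records, as a storage engine might receive them.
rows = [
    {"url": "a.com", "clicks": 10, "country": "US"},
    {"url": "b.com", "clicks": 3,  "country": "DE"},
    {"url": "c.com", "clicks": 7,  "country": "US"},
]

# Columnar "shredding": one contiguous array per field.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query like SELECT SUM(clicks) now scans a single column and never
# touches urls or countries, cutting I/O and compressing far better.
total_clicks = sum(columns["clicks"])
```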

Why we picked this paper:

"Dremel was the precursor to today's leading data lakehouse product, BigQuery, and rapidly displaced MapReduce for many use cases. If you'd like to understand horizontal scalability, read this paper."

09

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag, Google Technical Report dapper-2010-1, Apr. 2010 (2010)

The Dapper tracing infrastructure observes cluster-scale applications with low overhead, application-level transparency, and scalability. It provides developers visibility into requests as they propagate through large microservice-based architectures. By assigning globally unique IDs to requests and propagating context, Dapper reconstructs causal paths and timings across different services. The system employs adaptive sampling to manage data volume while capturing representative traces. Without Dapper (whose ideas live on in open source through OpenTelemetry), it would be difficult to understand why a service is slow today, or why some requests are much slower than others.
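
The core mechanism, propagating a trace ID and span IDs with every request so that spans recorded by independent services can be reassembled into one causal tree, can be sketched in a few lines (a simplified, in-process illustration; the names are ours, not Dapper's API):

```python
# Toy sketch of trace-context propagation.
import time
import uuid

spans = []   # stand-in for the trace collection pipeline

def traced_call(name, trace_id=None, parent_span_id=None):
    trace_id = trace_id or uuid.uuid4().hex   # new trace at the root
    span_id = uuid.uuid4().hex
    start = time.time()
    # ... do work; pass (trace_id, span_id) along on any downstream RPC ...
    spans.append({"trace": trace_id, "span": span_id,
                  "parent": parent_span_id, "name": name,
                  "start": start, "end": time.time()})
    return trace_id, span_id

root_trace, root_span = traced_call("frontend")
traced_call("backend", trace_id=root_trace, parent_span_id=root_span)
```

Because every span carries the same trace ID plus its parent's span ID, a collector can later rebuild the full request tree and its timings without any service knowing about any other.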

Why we picked this paper:

"Dapper is an elegant solution to a problem that is much harder to solve than it looks at first sight."

10

Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers

Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers, Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, Robert Hundt, IEEE Micro (2010), pp. 65-79

Google-Wide Profiling (GWP) provides continuous, low-overhead profiling for WSC applications. It collects performance data (CPU usage, memory allocation, synchronization contention) from running jobs with minimal performance impact, using sampling. Its ubiquitous data collection identifies performance bottlenecks, attributes resource consumption accurately across and inside services, and tracks performance regressions over time. GWP provides the fleet-wide visibility essential for improving the efficiency of warehouse-scale applications without requiring application-specific instrumentation. GWP's in-depth, accurate profiling data supports many use cases, including understanding hardware differences, enabling feedback-directed compiler optimization, and discovering frequently used library code across all binaries.

Why we picked this paper:

"You'll be surprised at how essential this information is, and that it's possible to collect it with just 0.01% overhead."

11

Spanner: Google's Globally-Distributed Database

Spanner: Google's Globally-Distributed Database, James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes et al, Symposium on Operating Systems Design and Implementation (OSDI) (2012)

Spanner is the first database to combine transactional consistency (external consistency, the highest consistency standard) with high availability and horizontal scalability across geographically distributed data centers. It achieves this through features like automatic sharding and synchronous replication using Paxos. A key innovation is the TrueTime API, which uses GPS and atomic clocks to generate tightly synchronized timestamps with bounded uncertainty, enabling consistent distributed transactions, consistent snapshot reads across the database, and lock-free reads. Spanner supports a relational data model and a SQL-like query language, providing familiar database semantics at a global scale.
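
The commit-wait idea behind TrueTime can be sketched as follows (illustrative only; in the real system the uncertainty bound is derived from GPS and atomic-clock synchronization, not a constant):

```python
# Sketch of TrueTime-style commit wait. TT.now() returns an interval
# [earliest, latest] guaranteed to contain the true time. A transaction
# takes its commit timestamp s = latest, then waits until earliest > s,
# so s is guaranteed to be in the past on every machine.
import time

EPSILON = 0.002   # assumed clock-uncertainty bound, in seconds

def tt_now():
    t = time.time()
    return t - EPSILON, t + EPSILON   # (earliest, latest)

def commit_wait():
    s = tt_now()[1]                   # pessimistic commit timestamp
    while tt_now()[0] <= s:           # wait out the uncertainty window
        time.sleep(EPSILON / 4)
    return s

ts = commit_wait()
```

The wait costs roughly twice the uncertainty bound per transaction, which is why Spanner invests so heavily in keeping that bound to a few milliseconds.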

Why we picked this paper:

"At first glance, Spanner seems to violate the CAP theorem which originally motivated NoSQL systems to forgo full consistency to get availability and scalability. Many see this paper as the #1 database paper of the past 25 years."

12

B4: Experience with a Globally Deployed Software Defined WAN

B4: Experience with a Globally Deployed Software Defined WAN, Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong et al, Proceedings of the ACM SIGCOMM Conference, Hong Kong, China (2013), pp. 3–14

The B4 paper details Google's experience designing, deploying, and operating a private Software-Defined Wide Area Network (SDN WAN) to connect its global data centers. The system separates the control plane from the data plane, using centralized traffic engineering controllers to manage bandwidth allocation and routing across the physical network. B4 employs custom switch hardware and OpenFlow protocols, demonstrating SDN's effectiveness in managing large, complex WANs for improved efficiency, flexibility, and cost-effectiveness compared to traditional WAN architectures.

Why we picked this paper:

"In many ways, B4 is to large networks what Borg is to large server pools: the only way to manage tens of thousands of devices without losing your mind."

13

The Tail at Scale

The Tail at Scale, Jeffrey Dean, Luiz André Barroso, Communications of the ACM, 56 (2013), pp. 74-80

Tail latency—high-percentile latency—in large-scale distributed systems has a disproportionate impact on user experience. In systems involving thousands of servers for a single request, even rare, small delays on individual components often dominate overall response time due to fan-out. Many factors cause latency variability, including resource sharing, background activities, and queueing delays. The paper outlines several techniques to mitigate tail latency, such as hedging requests (sending the same request to multiple replicas), latency-induced probation, and micro-partitioned data.
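
The fan-out effect is simple arithmetic: if each leaf server is independently "slow" (say, above its own 99th percentile) with probability p, a request that must wait for all n leaves is slow with probability 1 - (1 - p)^n. A short sketch, including an idealized version of hedged requests (in practice the hedge is sent only after a short delay, to cap the extra load):

```python
# Tail amplification under fan-out.
def prob_request_slow(p, n):
    return 1 - (1 - p) ** n

one_leaf = prob_request_slow(0.01, 1)     # 1% of requests slow
fan_100  = prob_request_slow(0.01, 100)   # ~63% of requests slow!

# Idealized hedged requests: duplicate each leaf request to a second
# replica and take the first answer, so a leaf is slow only if both
# replicas are slow (probability p * p).
def prob_hedged_slow(p, n):
    return 1 - (1 - p * p) ** n
```

With 100-way fan-out, a 1-in-100 per-server hiccup dominates almost two-thirds of user requests, while hedging pushes the request-level tail back below the single-server tail.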

Why we picked this paper:

"Horizontal scaling is awesome until it isn't, and tail latency can spoil the party."

14

Profiling a Warehouse-scale Computer

Profiling a warehouse-scale computer, Svilen Kanev, Juan Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, David Brooks, ISCA '15 Proceedings of the 42nd Annual International Symposium on Computer Architecture, ACM (2015), pp. 158-169

This study complemented Google-Wide Profiling by collecting significantly more data, mostly at the instruction level. The authors analyze instruction mix, CPI (Cycles Per Instruction), cache misses, branch mispredictions, and front-end stalls across diverse workloads. They demonstrate how microarchitectural features interact with software at scale, revealing bottlenecks like front-end stalls that are often masked in smaller studies.

Why we picked this paper:

"Real-world WSC loads don't resemble common benchmarks at all. While hardware has changed dramatically over the past decade, most of the observations first made in this paper still hold today."

15

Large-scale Cluster Management at Google with Borg

Large-scale cluster management at Google with Borg, Abhishek Verma, Luis Pedrosa, Madhukar R. Korupolu, David Oppenheimer, Eric Tune, John Wilkes, Proceedings of the European Conference on Computer Systems (EuroSys), ACM, Bordeaux, France (2015) pp. 18:1–18:17

Borg is a large-scale cluster manager responsible for scheduling, deploying, monitoring, and managing applications across tens of thousands of machines. It handles both long-running services and batch jobs, focusing on high reliability, scalability, and efficient resource utilization through techniques like resource reclamation and fine-grained sharing. Borg uses a logically centralized primary scheduler that is replicated for fault tolerance. Distributed agents (Borglets) provide fine-grained resource management on each machine. Borg allocates resources based on application priorities and quotas, supports declarative job specifications, and provides features like rolling updates and health checking.

Why we picked this paper:

"Borg laid the groundwork for Kubernetes. Its message to manually managed servers was simple: you will be assimilated!"

16

Site Reliability Engineering: How Google Runs Production Systems

Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, O'Reilly (2016)

Managing thousands of services across myriad servers shouldn't be an art; it should be a science. Site Reliability Engineering (SRE) embeds software engineering principles into infrastructure and operations tasks, focusing on automation, reliability, and scalability. It provides a blueprint for organizations aiming to build and maintain highly reliable, scalable software systems. Crucially, absolute reliability is a non-goal; instead, SRE is about balancing reliability with velocity, and learning from failures via blameless postmortems.

Why we picked this paper:

"While we don't expect everyone to read the whole book, you should absolutely read the introduction by Ben Treynor Sloss: “Hope is not a strategy.”"

17

In-Datacenter Performance Analysis of a Tensor Processing Unit

In-Datacenter Performance Analysis of a Tensor Processing Unit, Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al, ISCA (2017), pp. 1–12

This paper disclosed Google's first-generation Tensor Processing Unit (TPU), the first custom ASIC deployed at scale to accelerate AI inference workloads. The authors describe the TPU's architecture, especially its large matrix multiplier unit and on-chip memory optimized for deep learning computations. TPUs demonstrate significantly higher performance and energy efficiency on production AI workloads relative to contemporary CPUs and GPUs. The analysis highlights the benefits of domain-specific hardware acceleration for critical large-scale workloads, and summarizes key learnings underpinning the substantial improvements in throughput and energy efficiency for machine learning inference tasks.

Why we picked this paper:

"The original TPU paper concluded: “Order-of-magnitude differences between products are rare in computer architecture, which may lead to the TPU becoming an archetype for domain-specific architectures.”"

18

Attack of the Killer Microseconds

Attack of the killer microseconds, Luiz André Barroso, Mike Marty, David Patterson, Parthasarathy Ranganathan, Communications of the ACM, 60(4) (2017), pp. 48-54

WSCs need to optimize for microsecond-level latencies, driven by faster data center networks, new memory hierarchies, and accelerators. Using two case studies (how to waste a fast data center network, and how to waste a fast data center processor), the paper points out how system optimizations targeting nanosecond- and millisecond-scale events are inadequate for events in the microsecond range. New techniques can achieve high performance at microsecond latencies but need co-design across hardware, kernel, and application layers.

Why we picked this paper:

"Much like the energy-proportionality paper discussed earlier, this paper highlights a simple yet important challenge for WSCs, and it led to a lot of follow-on optimizations in the WSC systems stack."

19

Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization

Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization, Mike Dalton, David Schultz, Ahsan Arefin, Alex Docauer et al, 15th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2018, pp. 373–387.

Andromeda describes Google Cloud Platform's network virtualization stack. Its SDN (software-defined network) design provides high-performance, secure networking for virtual machines. Andromeda decouples the control plane from a distributed data plane implemented partially on-host and partially offloaded to hardware. This architecture enables rapid provisioning of network services (VPCs, firewalls, load balancers), offers performance approaching bare-metal networks, and ensures strong security isolation between tenants. The design prioritizes product velocity and operational agility, simplifying new network features and scaling infrastructure efficiently.

Why we picked this paper:

"Network virtualization wants to be implemented in hardware, since it touches every single packet in high-speed cluster networks. By carefully splitting functionality across hardware and software, Andromeda combines the benefits of both. That's easier said than done."

20

Snap: a Microkernel Approach to Host Networking

Snap: a Microkernel Approach to Host Networking, Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld et al, In ACM SIGOPS 27th Symposium on Operating Systems Principles, ACM, New York, NY, USA (2019), pp. 416–431

Network hardware provides hundreds of Gbps, but traditional kernel APIs impose too high an overhead to actually use this capacity. Snap implements a high-performance host networking stack outside the kernel using a microkernel-inspired architecture. This modularity enhances security by isolating components and allows deploying specialized network functions (e.g., custom protocols, hardware offloads) without modifying the core kernel.

Why we picked this paper:

"An excellent overview of modern host networking, and how user-space packet handling can be used at scale."

21

Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale

Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale, Shaohong Li, Xi Wang, Xiao Zhang, Vasileios Kontorinis, Sreekumar Kodakara, David Lo, Parthasarathy Ranganathan, 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1241-1255

Thunderbolt enforces power caps in heavily oversubscribed large data centers. It uses workload classification and QoS feedback to apply power caps intelligently, prioritizing latency-sensitive tasks and selectively throttling background or batch jobs. This approach maximizes useful work and hides the existence of power capping from the most important users, improving data center efficiency by safely running power infrastructure near its limits when necessary. (If you have time, you should pair your reading of this paper with the original “Power provisioning for a warehouse-sized computer” paper that introduced power oversubscription at Google.)
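
The prioritization logic can be sketched abstractly (our own toy illustration, not Thunderbolt's actual control loop): when measured power exceeds the cap, throttle the lowest-QoS jobs first until the excess is covered:

```python
# Toy QoS-aware victim selection for power capping.
def choose_victims(jobs, power_reading, cap):
    excess = power_reading - cap
    victims = []
    # Lowest-QoS (batch) jobs are throttled before latency-sensitive ones.
    for job in sorted(jobs, key=lambda j: j["qos"]):
        if excess <= 0:
            break
        victims.append(job["name"])
        excess -= job["watts"]
    return victims

jobs = [{"name": "websearch", "qos": 2, "watts": 120},
        {"name": "batch-1",   "qos": 0, "watts": 80},
        {"name": "batch-2",   "qos": 0, "watts": 60}]
victims = choose_victims(jobs, power_reading=1030, cap=900)
```

Here the 130 W excess is covered entirely by throttling the two batch jobs, so the latency-sensitive service never notices that a cap was enforced, which is the "hide capping from the most important users" property described above.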

Why we picked this paper:

"Oversubscription lets you have your cake and eat it too, but that requires a very carefully designed control plane. The effort is worthwhile since it leads to dramatic power utilization improvements."

22

Swift: Delay is Simple and Effective for Congestion Control in the Datacenter

Swift: Delay is Simple and Effective for Congestion Control in the Datacenter, Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan Wassel, et al, SIGCOMM 2020 (2020), pp. 583–598

TCP has dealt with internet congestion effectively by using packet loss to detect congestion. However, loss-based detection doesn't work well in data center networks, which combine very high speeds with very low latencies and small-buffer switches. Swift shows that a very simple protocol based on accurate measurements of packet delays outperforms all others. Its simple, end-host-based approach achieves extremely low network queues, high link utilization, and fast convergence.
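
The essence of delay-based control can be sketched as an AIMD rule on measured RTT (a simplified illustration in the spirit of Swift; the real protocol also scales its target delay with topology and the number of competing flows, among other refinements):

```python
# Delay-based AIMD sketch: grow the congestion window while measured
# RTT is at or below a target delay, back off multiplicatively above it.
def update_cwnd(cwnd, rtt, target_rtt,
                additive_inc=1.0, mult_decr=0.8, min_cwnd=1.0):
    if rtt <= target_rtt:
        return cwnd + additive_inc / cwnd    # additive increase per RTT
    return max(min_cwnd, cwnd * mult_decr)   # multiplicative decrease

cwnd = 10.0
cwnd = update_cwnd(cwnd, rtt=40e-6, target_rtt=50e-6)   # under target: grow
cwnd = update_cwnd(cwnd, rtt=80e-6, target_rtt=50e-6)   # over target: back off
```

Because queueing shows up as extra delay long before packets are dropped, reacting to delay keeps switch buffers nearly empty while still filling the links.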

Why we picked this paper:

"Swift is so simple, and its performance so good, that it completely changed people's expectations of what a WSC network can do. In fact, initially it caused some confusion: twice, bugs were incorrectly filed for monitoring failures because highly-utilized links reported zero loss."

23

Warehouse-Scale Video Acceleration: Co-design and Deployment in the Wild

Warehouse-Scale Video Acceleration: Co-design and Deployment in the Wild, Parthasarathy Ranganathan, Danner Stodolsky, Jeff Calow, Jeremy Dorfman et al, Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, USA (2021), pp. 600-615

Video Coding Units (VCUs) dramatically accelerate video processing, and represent the most popular WSC accelerator after AI accelerators. The authors describe the VCU hardware, software drivers, server integration, and job management systems, as well as the challenges of integrating specialized hardware into a massive, general-purpose computing environment, including fault tolerance, monitoring, and programming models.

Why we picked this paper:

"VCUs are a poster child for hardware-software codesign for warehouse-scale accelerators. And they won a technical Emmy award!"

24

Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking

Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking, Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, et al, Proceedings of ACM SIGCOMM 2022, pp. 1–17

Starting from a traditional Clos topology, the Jupiter network evolved to a more flexible direct-connect architecture, with MEMS-based Optical Circuit Switches (OCS) that provide dedicated optical paths and allow the network topology to be reconfigured dynamically. WSC networks are continually growing and changing, and thus many design aspects of Jupiter support flexibility: incremental updates, sophisticated traffic engineering, and dynamically reconfigured traffic flows. Automated network operations support incremental capacity delivery and topology engineering without disrupting live services.

Why we picked this paper:

"If we had to pick one paper about Google's data center network, this is that paper. Following up from the original “Jupiter Rising” paper, this paper is excellent both in explaining technical details as well as retrospectives explaining how the design evolved to where it is today."

25

Pathways: Asynchronous Distributed Dataflow for ML

Pathways: Asynchronous Distributed Dataflow for ML, Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, et al, MLSys 2022 (2022)

Traditional ML training runs a large cluster's accelerators in a highly synchronous, lock-step way that requires all accelerators to be the same. However, models have become too large to fit into a single cluster. Pathways is a new orchestration layer for accelerator systems to train and serve even larger-scale AI models. Pathways' architecture utilizes asynchronous distributed dataflow and routes computation across heterogeneous hardware accelerators (CPUs, GPUs, TPUs), in combination with a single-controller model and centralized scheduling. This approach allows moving beyond current dense, single-task models towards sparse, multi-task models that can learn many tasks simultaneously and leverage sparsity for efficiency.

Why we picked this paper:

"This paper is an excellent example of model-systems codesign and distributed systems innovation needed to scale deep learning."