-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fathom summary and update schedule page
- Loading branch information
Max Liu
committed
Jun 6, 2024
1 parent
4c4b5ef
commit 4a8ebbd
Showing
2 changed files
with
20 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
--- | ||
title: "Fathom: Understanding Datacenter Application Network Performance" | ||
layout: post | ||
--- | ||
|
||
|
||
Research Question: How to design a system for debugging and understanding network performance in cloud-scale datacenters with visibility, interpretability and scalability? | ||
|
||
Key Contributions: This paper presents the design and usage of Fathom, a system identifying the network performance bottlenecks of any service running in the Google fleet. There is not much research novelty in this paper, but it provides empirical experience, engineering techniques and case studies. The design goal of Fathom is to use and build upon existing telemetry data at Google to do fine-grained network performance analysis that covers multiple and different layers of abstractions. Fathom breaks down a RPC's latency into subcomponents, time spent on client application, RPC's queue, buffer, network stack from TCP queueing delay to WAN rate limiter, to NIC etc. Fathom achieves this fine granularity by 1) tracking the byte boundaries of an RPC in the serialization buffers in use space, 2) collecting kernel timestamps at various stages for payload of the RPC on the end-to-end path, 3) using aggregation techniques that preserves data distributions, especial at the tail, 4)using a Gaussian Mixture Model to project high-dimensional metrics data onto interested features to get a few blobs for easy analysis. All the extended kernel timestamp changes have been upstreamed to Linux v.3.17. Fathom incurs only 0.4% fleet-wide total RPC/TCP/kernel cycles. | ||
|
||
The two major use cases of Fathom are 1) at micro-level, diagnosing application performance issues for a specific application/service and 2) at macro-level, characterizing applications' network performance before and after a roll-out. | ||
|
||
Opportunities for future work: With a set of internal monitoring systems at Google, a future work is to synthesize and to combine Fathom data, switch data, topology data, and CPU profiling data to further pinpoint resource bottlenecks in Google's datacenters. | ||
|
||
Presenter: Max Liu |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters