Fathom summary and update schedule page
Max Liu committed Jun 6, 2024
1 parent 4c4b5ef commit 4a8ebbd
Showing 2 changed files with 20 additions and 0 deletions.
15 changes: 15 additions & 0 deletions _posts/2024-06-05-Fathom.md
@@ -0,0 +1,15 @@
---
title: "Fathom: Understanding Datacenter Application Network Performance"
layout: post
---


Research Question: How can one design a system for debugging and understanding network performance in cloud-scale datacenters that offers visibility, interpretability, and scalability?

Key Contributions: This paper presents the design and usage of Fathom, a system that identifies the network performance bottlenecks of any service running in the Google fleet. The paper offers little research novelty, but it contributes empirical experience, engineering techniques, and case studies. Fathom's design goal is to build on existing telemetry data at Google to perform fine-grained network performance analysis that spans multiple layers of abstraction. Fathom breaks down an RPC's latency into subcomponents: time spent in the client application, the RPC queue, buffers, and the network stack, from TCP queueing delay to the WAN rate limiter to the NIC. Fathom achieves this fine granularity by 1) tracking the byte boundaries of an RPC in the serialization buffers in user space, 2) collecting kernel timestamps at various stages for the RPC's payload along the end-to-end path, 3) using aggregation techniques that preserve data distributions, especially at the tail, and 4) using a Gaussian Mixture Model to project high-dimensional metrics data onto features of interest, yielding a few clusters ("blobs") that are easy to analyze. All the extended kernel timestamp changes have been upstreamed to Linux v3.17. Fathom's overhead is only 0.4% of fleet-wide total RPC/TCP/kernel cycles.
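
To make the latency breakdown concrete, here is a minimal Python sketch of how per-stage timestamps for one RPC could be turned into per-component durations. The stage names and the sample values are hypothetical, chosen only for illustration; they are not Fathom's actual schema or data.

```python
from typing import Dict, List, Tuple

# Hypothetical stage names, ordered along the send path (an assumption,
# not the stages Fathom actually records).
STAGES = ["app_send", "rpc_queue", "tcp_enqueue", "qdisc_sched", "nic_tx", "peer_ack"]


def breakdown(timestamps_ns: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (component, duration_ns) for each pair of consecutive recorded stages."""
    present = [s for s in STAGES if s in timestamps_ns]
    parts = []
    for earlier, later in zip(present, present[1:]):
        delta = timestamps_ns[later] - timestamps_ns[earlier]
        if delta < 0:
            raise ValueError(f"timestamps out of order between {earlier} and {later}")
        parts.append((f"{earlier}->{later}", delta))
    return parts


def dominant(parts: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Return the component that accounts for the most time."""
    return max(parts, key=lambda p: p[1])


# Made-up timestamps (nanoseconds since the application issued the send).
sample = {
    "app_send": 0,
    "rpc_queue": 120_000,
    "tcp_enqueue": 150_000,
    "qdisc_sched": 900_000,
    "nic_tx": 950_000,
    "peer_ack": 1_400_000,
}
parts = breakdown(sample)
bottleneck = dominant(parts)
```

In this made-up sample, the largest component is the gap between `tcp_enqueue` and `qdisc_sched`, which is the kind of signal that would point an engineer at queueing in the host network stack rather than at the application.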

Fathom has two major use cases: 1) at the micro level, diagnosing performance issues for a specific application or service, and 2) at the macro level, characterizing applications' network performance before and after a roll-out.
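
A macro-level before/after comparison depends on aggregates that keep the tail visible. The sketch below uses a mergeable log-bucketed histogram as a stand-in; this is an assumed technique for illustration, not the aggregation scheme the paper describes, and the class name, bucket resolution, and sample latencies are all hypothetical.

```python
import math
from collections import Counter


class LogHistogram:
    """Mergeable histogram with log-spaced buckets (bounded relative error)."""

    def __init__(self, buckets_per_octave: int = 8):
        self.k = buckets_per_octave
        self.counts = Counter()
        self.total = 0

    def add(self, value_us: float) -> None:
        # value_us must be positive; the bucket index grows with log2(value).
        idx = math.floor(math.log2(value_us) * self.k)
        self.counts[idx] += 1
        self.total += 1

    def merge(self, other: "LogHistogram") -> None:
        # Counter.update adds counts, so merging per-host histograms is lossless.
        self.counts.update(other.counts)
        self.total += other.total

    def quantile(self, q: float) -> float:
        """Return the upper edge of the bucket containing the q-th quantile."""
        if self.total == 0:
            raise ValueError("empty histogram")
        target = math.ceil(q * self.total)
        seen = 0
        for idx in sorted(self.counts):
            seen += self.counts[idx]
            if seen >= target:
                return 2 ** ((idx + 1) / self.k)
        return 2 ** ((max(self.counts) + 1) / self.k)


# Made-up latencies (microseconds) before and after a hypothetical roll-out.
before, after = LogHistogram(), LogHistogram()
for _ in range(100):
    before.add(100.0)
for _ in range(90):
    after.add(100.0)
for _ in range(10):
    after.add(5000.0)

median_unchanged = before.quantile(0.5) == after.quantile(0.5)
tail_regressed = after.quantile(0.99) > before.quantile(0.99)
```

Here the medians match while the 99th percentile differs, which a mean-only aggregate would blur. Log-spaced buckets keep the relative error of any reported quantile within one bucket width, regardless of how many hosts' histograms are merged.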

Opportunities for future work: Google already runs a set of internal monitoring systems, so future work could combine Fathom data with switch data, topology data, and CPU profiling data to further pinpoint resource bottlenecks in Google's datacenters.

Presenter: Max Liu
5 changes: 5 additions & 0 deletions schedule.html
@@ -189,6 +189,11 @@
<td>Zhaoqi Zhang</td>
<td><a href="https://www.usenix.org/conference/atc23/presentation/huye">Lifting the veil on Meta's microservice architecture: Analyses of topology and request workflows</a></td>
</tr>
<tr>
<td>6/6/24</td>
<td>Max Liu</td>
<td><a href="https://dl.acm.org/doi/pdf/10.1145/3603269.3604815">Fathom: Understanding Datacenter Application Network Performance</a></td>
</tr>
</table>
</div>
</body>