Fathom summary and update schedule page

docc-lab · Jun 6, 2024 · 4a8ebbd
1 parent 4c4b5ef
commit 4a8ebbd

Showing 2 changed files with 20 additions and 0 deletions.
diff --git a/_posts/2024-06-05-Fathom.md b/_posts/2024-06-05-Fathom.md
@@ -0,0 +1,15 @@
+---
+title: "Fathom: Understanding Datacenter Application Network Performance"
+layout: post
+---
+
+
+Research Question: How can one design a system for debugging and understanding network performance in cloud-scale datacenters that offers visibility, interpretability, and scalability?
+
+Key Contributions: This paper presents the design and usage of Fathom, a system that identifies the network performance bottlenecks of any service running in the Google fleet. The paper offers little research novelty, but it provides empirical experience, engineering techniques, and case studies. Fathom's design goal is to build on existing telemetry data at Google to perform fine-grained network performance analysis across multiple layers of abstraction. Fathom breaks an RPC's latency down into subcomponents: time spent in the client application, RPC queues, buffers, and the network stack, from TCP queueing delay and the WAN rate limiter down to the NIC. Fathom achieves this granularity by 1) tracking the byte boundaries of an RPC in the serialization buffers in user space, 2) collecting kernel timestamps at various stages for the RPC's payload along the end-to-end path, 3) using aggregation techniques that preserve data distributions, especially at the tail, and 4) using a Gaussian Mixture Model to project high-dimensional metrics data onto features of interest, yielding a few blobs that are easy to analyze. All the extended kernel timestamp changes have been upstreamed to Linux v.3.17. Fathom's overhead is only 0.4% of fleet-wide total RPC/TCP/kernel cycles.
+
+The two major use cases of Fathom are 1) at the micro level, diagnosing performance issues for a specific application or service, and 2) at the macro level, characterizing applications' network performance before and after a roll-out.
+
+Opportunities for future work: Given the set of internal monitoring systems at Google, future work could combine Fathom data with switch data, topology data, and CPU profiling data to further pinpoint resource bottlenecks in Google's datacenters.
+
+Presenter: Max Liu
diff --git a/schedule.html b/schedule.html
@@ -189,6 +189,11 @@
       <td>Zhaoqi Zhang</td>
       <td><a href="https://www.usenix.org/conference/atc23/presentation/huye">Lifting the veil on Meta's microservice architecture: Analyses of topology and request workflows</a></td>
     </tr>
+    <tr>
+      <td>6/6/24</td>
+      <td>Max Liu</td>
+      <td><a href="https://dl.acm.org/doi/pdf/10.1145/3603269.3604815">Fathom: Understanding Datacenter Application Network Performance</a></td>
+    </tr>
 </table>
   </div>
 </body>