Add summary for BufScope
RoyZhang7 committed Aug 9, 2024
1 parent 3b6c4d4 commit 260ec91
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions _posts/2024-08-07-BufScope.md
@@ -0,0 +1,19 @@
---
title: "Buffer-based End-to-end Request Event Monitoring in the Cloud"
layout: post
---


Research Question: How to accurately diagnose request latency anomalies (RLAs) in cloud environments by monitoring the end-to-end datapath of requests with consistent request-level semantics and low overhead?

Key Contributions: This paper presents the design and usage of BufScope, a high-coverage request event monitoring system that models the end-to-end datapath of requests as a buffer chain and monitors buffer-related abnormal events. It consists of several novel approaches: 1) a buffer event modeling approach that defines a complete event library based on buffer properties, 2) a concise request-level semantic injection mechanism implemented on SmartNICs to achieve consistent semantics with low overhead, and 3) an implementation and evaluation on commodity programmable switches and SmartNICs, showing that BufScope can diagnose 95% of RLAs with <0.07% network bandwidth overhead. A high-level takeaway is that this paper is similar to Theo's Foxhound, except that Foxhound targets the real request ID in network packets while BufScope targets the RPC ID in packets.
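To make the buffer-chain idea concrete, here is a minimal sketch in Python. It is our own illustration, not BufScope's implementation: the `Buffer` class, the `overflow_drop` event name, and the hop names are all hypothetical, and the paper's actual event library is not reproduced here. The sketch only shows how a datapath modeled as a chain of bounded buffers can emit an abnormal event tagged with the RPC ID.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Buffer:
    """One hop on the datapath, modeled as a bounded FIFO (hypothetical model)."""
    name: str
    capacity: int
    queue: deque = field(default_factory=deque)

    def enqueue(self, rpc_id, events):
        # A full buffer drops the item and records an abnormal event
        # tagged with the RPC ID carried by the packet.
        if len(self.queue) >= self.capacity:
            events.append((self.name, "overflow_drop", rpc_id))
            return False
        self.queue.append(rpc_id)
        return True


def send_through(chain, rpc_id, events):
    """Push one RPC through each buffer in order; stop at the first drop."""
    for buf in chain:
        if not buf.enqueue(rpc_id, events):
            return False
    return True


# Nothing is ever dequeued in this toy example, so the second RPC
# finds the one-slot switch queue already full.
events = []
chain = [Buffer("nic_tx", capacity=2), Buffer("switch_queue", capacity=1)]
delivered = [send_through(chain, rpc_id, events) for rpc_id in (1, 2)]
```

After this runs, `events` holds a single record that names the buffer where the drop happened and the RPC it affected, which is the kind of per-hop, per-RPC signal the summary above describes.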

However, some of the assumptions this paper builds upon, and some of the arguments the authors make, do not hold up well. First, the request ID that the authors manipulate in packets is actually an RPC ID, which has a many-to-one relationship with the request ID unless a one-to-one mapping is enforced and the application is simple enough to complete each request with a single RPC call. Based on what we learned from other Alibaba observability research, their internal practice does not enforce a one-to-one mapping between RPC ID and request ID. Maintaining a map between RPC IDs and request IDs requires extra overhead, which weakens the scalability of BufScope. Second, the evaluation section is not clear enough. 1) Some of the evaluated metrics are essentially meaningless but are plotted as if they were useful; the most representative one is bandwidth overhead. The more data an observability infrastructure collects, the more bandwidth it uses (basically linearly), so we don't think it is a good metric for measuring overhead. 2) Metrics like collector overhead and data persistence cost are missing, which amplifies our concern about this method. Third, the authors make some unreasonable arguments when discussing the usage of BufScope. According to the extensive research literature on distributed tracing, performance issue localization and root cause analysis in the application layer are extremely challenging: each small step involves a research paper. But the authors fail to explain clearly how to localize and diagnose issues after BufScope collects the observability data. Another odd argument the authors make is that contention in other resources could also be detected. We are confused about 1) how events from other resources would be related, since BufScope works on the idea of connecting events across layers with the RPC ID in network packets, and 2) how contention would be diagnosed, since that requires an understanding of concurrent requests, which BufScope is incapable of.
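The first concern can be illustrated with a small sketch of the state a collector would need in order to roll RPC-level events up to requests. This is an assumption on our side, not something described in the paper or taken from Alibaba's systems: the `RequestIndex` class, its methods, and the ID strings are all hypothetical.

```python
from collections import defaultdict


class RequestIndex:
    """Maps RPC IDs to their owning request (hypothetical collector-side state)."""

    def __init__(self):
        self.rpc_to_request = {}
        self.request_to_rpcs = defaultdict(set)

    def register(self, request_id, rpc_id):
        # Many RPC IDs may point at the same request ID.
        self.rpc_to_request[rpc_id] = request_id
        self.request_to_rpcs[request_id].add(rpc_id)

    def group_events(self, events):
        """Group (rpc_id, event) pairs by request; unknown RPC IDs fall under None."""
        grouped = defaultdict(list)
        for rpc_id, event in events:
            grouped[self.rpc_to_request.get(rpc_id)].append(event)
        return dict(grouped)


index = RequestIndex()
index.register("req-A", "rpc-1")
index.register("req-A", "rpc-2")
index.register("req-B", "rpc-3")

observed = [
    ("rpc-1", "overflow"),
    ("rpc-3", "timeout"),
    ("rpc-2", "overflow"),
    ("rpc-9", "overflow"),  # never registered, so it cannot be attributed
]
grouped = index.group_events(observed)
```

The index grows with the number of in-flight RPCs and has to be populated by something that knows both IDs, which is the extra overhead the paragraph above refers to. Events whose RPC ID was never registered end up under `None` and cannot be tied to any request.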


Opportunities for future work:

1. A solid approach to linking observability data across layers
2. A reasonable overhead control mechanism for observability tools, wherever they are deployed

Presenter: Zhaoqi (Roy) Zhang
