
Consider documenting key Telemetry events #6976

Closed

markfields opened this issue Aug 3, 2021 · 13 comments
Labels: design-ready, design-required, documentation

@markfields (Member) commented Aug 3, 2021

CC: @tylerbutler @vladsud @anthony-murphy @timtwang

The Plan

Well, it's more of a position we're taking than a plan of action:

The Fluid Runtime's execution will be observable primarily via first-class APIs and error contracts, not the logs it emits. FF APIs will carefully express failure cases to empower the host to handle and log them in the context of the user experience.

In contrast, telemetry data sent from the Runtime to the host's ITelemetryBaseLogger is a supplement for in-depth diagnostics and debugging. Its exact shape is nuanced and subject to change, and the logging events may lack context relative to the user experience. For these reasons it is not very well-suited for consumption in a data pipeline for monitoring the health of an experience built on Fluid.

From here:

  • Instead let's direct documentation efforts towards our error contracts (specifically the errorType field, since this is currently the only property that we expect consumers to key off of programmatically; see the sketch after this list)
  • A partner team should not build a monitor or high-fidelity dashboard involving queries against our logs
  • A partner team may choose to document (at their own risk) certain events we log, to aid in diagnostics/debugging. But we will not guarantee the shape of any log or document them on our docs site.
  • Note - It's still in our own best interest to avoid churn! But we retain license to change these events when needed.
  • If the current API surface and error contracts are not sufficient for a partner's ability to observe how the FF code is operating, we can discuss how to fill the gaps.
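
As an illustration of the first bullet, here is a minimal sketch of host code keying off errorType. The error shape, the errorType string values, and the scheduleRetry/hostLogger helpers are all assumptions made for this example, not a guaranteed contract surface.

```typescript
// Minimal sketch of a host-side handler keyed off errorType.
// The error shape, the errorType values, and the helpers below are illustrative
// assumptions for this example, not a guaranteed contract.
interface IFluidErrorLike {
    readonly errorType: string;
    readonly message: string;
}

// Hypothetical host-provided facilities.
declare function scheduleRetry(): void;
declare const hostLogger: {
    logError(name: string, props: Record<string, string>): void;
};

function onContainerClosed(error?: IFluidErrorLike): void {
    if (error === undefined) {
        return; // clean close, nothing to handle
    }
    switch (error.errorType) {
        case "throttlingError":
            // Back off and retry later rather than hammering the service.
            scheduleRetry();
            break;
        case "dataCorruptionError":
            // Surface a non-recoverable state to the user and record it in host telemetry.
            hostLogger.logError("fluid:dataCorruption", { message: error.message });
            break;
        default:
            hostLogger.logError("fluid:containerClosed", { errorType: error.errorType });
            break;
    }
}
```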

Old Content

Update 2021-10-27

Per discussion below and some offline follow up conversations, we are leaning away from making any guarantees about the shape of data logged by the FF. Exact position/guidance for partners forthcoming... In the meantime, striking out parts of this description that we don't necessarily agree with at this point.

Problem statement

The data logged throughout the Fluid Framework via the host-provided ITelemetryBaseLogger has a particular schema, but it's undocumented and fairly volatile. We should identify key telemetry events and "harden" them, adding documentation and taking measures to prevent unintentional changes.

This is important because telemetry systems consume this data programmatically (e.g. alerts or queries). Someone may change a property name in the code and unknowingly disable - or trigger - a live site alert.

Telemetry may also be a useful tool for someone ramping up on the Fluid Framework, when paired with documentation and code samples. Imagine reading about how the Summarizer works, spinning up a project from our code samples, and then watching the key Summarizer events flow by in the console.

Of course, care should be taken to avoid a draconian approach that would destroy the flexibility expected from a telemetry logging system. The flexibility in schema is an important design element of our logging APIs, so the hardening of events should be an "opt-in" model.

One possible approach

Update the sendFoo methods on ITelemetryLogger to take an optional generic type parameter, specifying the type of the event passed in (default would be ITelemetryBaseEvent). Then do API generation on those types (identify the types via convention or configuration) and check that in. Now they are documented, and we see when they change.
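
A minimal sketch of what that opt-in typing could look like; the interface shapes and the ISummarizeEndEvent example are loose approximations invented for illustration, not the actual Fluid definitions.

```typescript
// Sketch of the proposed opt-in typing. The shapes below loosely mirror the
// Fluid interfaces; treat them (and the example event) as illustrative only.
interface ITelemetryBaseEvent {
    category: string;
    eventName: string;
    [key: string]: string | number | boolean | undefined;
}

// A "hardened" event type that could be run through API report generation.
interface ISummarizeEndEvent extends ITelemetryBaseEvent {
    eventName: "SummarizeEnd";
    duration: number;
    summarizeCount: number;
}

interface ITelemetryLogger {
    // The default type parameter keeps existing call sites unchanged;
    // passing an explicit type opts the event into documentation/API tracking.
    sendTelemetryEvent<T extends ITelemetryBaseEvent = ITelemetryBaseEvent>(event: T): void;
}

declare const logger: ITelemetryLogger;
logger.sendTelemetryEvent<ISummarizeEndEvent>({
    category: "performance",
    eventName: "SummarizeEnd",
    duration: 1234,
    summarizeCount: 7,
});
```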

A related concern - naming consistency

Even for non-documented events, there's a ton of value in consistency in naming of properties across the repo. It makes analyzing data across related events easier, and even saves COGS in terms of our telemetry DBs. See #6395.

A similar telemetry type could also be used as a generic type param on ChildLogger to give canonical names to common fields that may be added. i.e. on ChildLogger<T>, the sendFoo methods would expect the event param to implement Partial<T>, giving some intellisense hints about spelling of those common properties.
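
And a rough sketch of that ChildLogger<T> idea, reusing the minimal ITelemetryBaseEvent shape from the sketch above; the common-property names are purely illustrative.

```typescript
// Rough sketch of ChildLogger<T> hinting at canonical common-property names.
// Reuses the minimal ITelemetryBaseEvent shape from the previous sketch;
// the field names below are purely illustrative.
interface ICommonContainerProps {
    docId: string;
    clientId: string;
    connectionMode: "read" | "write";
}

class ChildLogger<TCommon = {}> {
    constructor(private readonly send: (event: ITelemetryBaseEvent) => void) {}

    // Intellisense suggests the canonical spellings of the common fields,
    // without requiring any event to actually supply them.
    public sendTelemetryEvent(event: ITelemetryBaseEvent & Partial<TCommon>): void {
        this.send(event);
    }
}

const containerLogger = new ChildLogger<ICommonContainerProps>((e) => console.log(e));
containerLogger.sendTelemetryEvent({
    category: "generic",
    eventName: "ConnectionStateChange",
    connectionMode: "write", // spelling hinted by ICommonContainerProps
});
```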

@markfields markfields added documentation Improvements or additions to documentation design-required This issue requires design thought labels Aug 3, 2021
@markfields markfields added this to the Next milestone Aug 3, 2021
@markfields (Member Author)

Another good idea from @tylerbutler (here) is to wrap any log we want documented in a function, and document the function in such a way that we can pipe those comments to a telemetry section in the docs site.

This is more natural and comprehensive than the idea I shared above about providing documented types to send methods.

@vladsud (Contributor) commented Oct 23, 2021

Do you see most developers doing it most of the time? I don't (due to laziness) :)
It also does not work well for something like Container.logConnectionStateChangeTelemetry - that's already a function to do logging, but it synthesizes a lot of properties - asking for one extra function just to call sendPerformanceEvent() would be unnatural.

Unrelated to this suggestion directly, but how do we deal with composition of loggers (i.e. all the properties that are added by the chain of parent loggers)? There is a substantial set of properties that Container adds to every event that we rely on.

@markfields (Member Author)

Great questions.

Do you see most developers doing it most of the time? I don't (due to laziness) :)

This is about hand-picking specific events to call out and document, so there is no expectation that every log line will get documented. I agree with you, that won't happen, and isn't necessarily desirable.

asking for one extra function just to call sendPerformanceEvent() would be unnatural

We will iterate on the approach. For Container.logConnectionStateChangeTelemetry that you mentioned, of course we would not then have yet another function that just wraps the single call to log. But that function itself can be annotated with an explanation of the event that's logged.

There's even more gray area when you consider the PerfEvent.exec paradigm, where the developer passes a callback which receives the event and then marks it up along the way. In this case there is no single line to wrap in a function; the logging is intertwined with the business logic, which is fine. In that case we'd want the smallest chunk of code around that call to exec to be factored into a method, which would include documentation of the telemetry event along with the functionality (see the sketch below).
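
To make that concrete, here's a hedged sketch of the factoring; execPerfEvent, the markable event shape, and fetchSnapshotWithRetry are stand-ins invented for the example rather than the actual Fluid helpers.

```typescript
// Stand-in for the exec-style perf-event pattern described above. The helper,
// the event shape, and fetchSnapshotWithRetry are illustrative, not real Fluid APIs.
interface IMarkableEvent {
    addProperty(name: string, value: string | number | boolean): void;
    end(props?: Record<string, string | number | boolean>): void;
}

declare function execPerfEvent<T>(
    eventName: string,
    callback: (event: IMarkableEvent) => Promise<T>,
): Promise<T>;

declare function fetchSnapshotWithRetry(url: string): Promise<{ snapshot: ArrayBuffer; attempts: number }>;

/**
 * Downloads the latest snapshot.
 *
 * Telemetry: emits a "GetSnapshot" performance event; `attempts` records how many
 * retries were needed and `sizeBytes` the size of the downloaded snapshot.
 */
async function getSnapshotWithTelemetry(url: string): Promise<ArrayBuffer> {
    return execPerfEvent("GetSnapshot", async (event) => {
        const { snapshot, attempts } = await fetchSnapshotWithRetry(url); // hypothetical helper
        event.addProperty("attempts", attempts);
        event.end({ sizeBytes: snapshot.byteLength });
        return snapshot;
    });
}
```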

properties that are added by chain of parent loggers

I was just pondering the child logger common properties too. Those would have to be documented separately, but if nothing else, they could be mentioned in the docs for a particular event when instructional/relevant to do so. Not going to venture a guess at anything fancier than that. The primary concern at first is getting words on a page.

@markfields (Member Author)

@tylerbutler and I chatted this morning and came up with a general direction to take this:

  1. Tactically speaking, the words for the docs will be typed into a TSDoc comment for a function that is "very local" to the instrumentation. This is subjective. Sometimes it will warrant introducing a helper just for this purpose, other times a relevant function will be available already.
  2. Today, these functions must not be private, because private functions are omitted from the docs pipeline. Marking the function @internal works instead, but it's not the right long-term solution here.
  3. We'll come up with some light syntax to denote the portion of the TSDoc comment that is documenting the telemetry, not the function itself. Then we can update the docs pipeline to strip that content out and send it over to a new section in the docs.
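
For item 3, one hypothetical shape for that light syntax (the @telemetryDocs tag name, the event, and its properties are all placeholders invented here):

```typescript
/**
 * Runs a single summarize attempt.
 *
 * @telemetryDocs
 * Event: "Summarize_end" (category: performance).
 * Properties: `duration` (ms spent summarizing), `summarizeReason` (why this
 * summary was triggered), `opsSinceLastSummary` (ops included in the summary).
 * @endTelemetryDocs
 *
 * @internal
 */
export async function runSummarizeAttempt(): Promise<void> {
    // ...instrumentation lives here, "very local" to the comment above...
}
```

The docs pipeline would strip everything between the markers and route it to the telemetry section, leaving the rest of the TSDoc for the regular API docs.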

PS - Tyler also had a futuristic idea about validating and/or supplementing the telemetry docs by taking output from test runs and inspecting the actual data found on documented events. Something to revisit later...

@markfields markfields self-assigned this Oct 25, 2021
@anthony-murphy (Contributor) commented Oct 25, 2021

I don't really like the idea of telemetry becoming an API; I'd almost go so far as to say it's explicitly unsupported. I'd rather we expose the necessary observability so consumers can observe their scenarios, and keep our logs for diagnostic purposes only.

I think having more structure in our logging infra is goodness in general, but I'm very concerned about back compat, and about our ability to make internal changes if we start providing guarantees around what we log. I can't think of a single other system that exposes this kind of thing.

I would be highly against any approach that required code changes, or increased our exported surface area, in order to document events.

@markfields (Member Author)

Good to push back; maybe the phrase "Telemetry becoming an API" which we have thrown around needs qualification. Telemetry is not an API, but it is sometimes used like one. There's tons of gray area, so let's avoid either extreme.

I've used the example of a 1P or 3P partner building monitoring on our telemetry events, and to your point, that might be beyond what is advisable. But even "just" for diagnostics, being able to document some key events will be a big help to us and our partners. Of course we should never introduce formal backward compatibility requirements/guarantees, but we'll do ourselves a favor if we make "breaking" changes to our telemetry with forethought, and this will help keep us honest.

increasing our exported surface area to do document events

I agree in an ideal sense, but I'm not afraid to break the rules to get some traction, especially today when our exported surface is already relatively large and unrefined. As for changing the code, the kinds of changes being proposed (introducing helpers around key logs) are in line with how code is already written in places. We're not talking about dogmatic sweeping changes, but guidance as we try to bootstrap a reasonable approach.

@markfields markfields added the epic An issue that is a parent to several other issues label Oct 25, 2021
@markfields markfields modified the milestones: Next, November 2021 Oct 25, 2021
@anthony-murphy (Contributor)

There is substantial work being planned to tighten our surface area for 1.0 in #6272; the fact that things are not perfect right now should not be an excuse to make them worse.

@tylerbutler (Member)

being able to document some key events will be a big help to us and our partners.

The more I think about this, the more I like the approach of consuming the output of the telemetry system, and basing our documentation on that. That seems closer to what developers will actually consume.

Some advantages to this approach:

  1. How the telemetry data is packaged and sewn together becomes irrelevant - we're consuming the output, just like downstream developers are.
  2. We can track how the shape of telemetry data changes over time. This has proven useful for tracking API changes. Note that I'm not advocating for "telemetry is an API," rather I'm saying it's useful to track changes to the telemetry output in addition to changes in the code that produce that output.
  3. Related to 2, if the telemetry changes in some way, it's easy to detect that change and take appropriate action, automated or otherwise.

An example of how this might work for a single package -- it generalizes, I think:

  1. We create a new telemetry logger that consumes telemetry events and logs data about them in a structured format (probably JSON). Note that this telemetry logger doesn't log the data itself, but rather logs metadata about the event -- the fields it has, their types, etc. (Open question: Is the metadata we want to know about the event available from runtime inspection, or is the type system needed?)
  2. A build task runs some tests with the telemetry logger attached, and outputs the JSON.
  3. Combine that data + <some other "sidecar" data> to produce the desired output. The sidecar data would contain the documentation words -- but could also be sourced from the code with more work.
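
A sketch of the metadata-logging logger from step 1, assuming runtime inspection is enough; the interface shapes approximate the Fluid ones and the names are illustrative.

```typescript
// Sketch of step 1: a logger that records the *shape* of each event rather than
// its values. The ITelemetryBaseEvent/ITelemetryBaseLogger shapes below are
// approximations of the Fluid interfaces, used here for illustration.
interface ITelemetryBaseEvent {
    category: string;
    eventName: string;
    [key: string]: string | number | boolean | undefined;
}

interface ITelemetryBaseLogger {
    send(event: ITelemetryBaseEvent): void;
}

interface IEventShape {
    eventName: string;
    category: string;
    fields: Record<string, string>; // field name -> runtime type
}

class EventShapeLogger implements ITelemetryBaseLogger {
    private readonly shapes = new Map<string, IEventShape>();

    public send(event: ITelemetryBaseEvent): void {
        const fields: Record<string, string> = {};
        for (const [key, value] of Object.entries(event)) {
            fields[key] = typeof value; // runtime inspection only; see the open question above
        }
        this.shapes.set(`${event.category}:${event.eventName}`, {
            eventName: event.eventName,
            category: event.category,
            fields,
        });
    }

    // Emitted by the build task after the test run, then merged with the "sidecar" docs.
    public toJSON(): IEventShape[] {
        return [...this.shapes.values()];
    }
}
```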

@vladsud (Contributor) commented Oct 25, 2021

Are we re-inventing the wheel? This isn't a problem unique to Fluid, right?
How is it solved elsewhere?

I think in most cases, telemetry within a component is strictly for that component's consumption.
This may be direct or indirect (imagine that FF publishes a tool that can consume FF telemetry and draw some nice charts - that's indirect consumption; or the component also provides a dashboard; or maybe the repo itself ships a query that changes with product versions but ensures it presents the same visuals over time).

I think most cross-component / cross-repo consumption should come through APIs where the appropriate workflows are applied (i.e. if we promise them to be stable, then the existing workflows we use for APIs apply here).
If we do not do that, then we are essentially duplicating effort on two API fronts using different tooling and processes - it adds complexity.

@markfields (Member Author) commented Oct 26, 2021

Definitely want to take a step back given the feedback here. The problem to be solved can be framed by example: when the Office Fluid team is deploying real-world, production-scale experiences built on Fluid, the data we publish to their ITelemetryBaseLogger - plus our team's expertise - is regularly critical in diagnosing and debugging. We know this does not scale even to Microsoft's 1P partners, let alone the OSS community of partnerships we dream of.

If documenting telemetry is not the right next step for empowering partners to diagnose and debug when things go wrong, then let's identify what is and prioritize that work. I personally don't know what other tools are in the toolbelt when it comes to analyzing real customer telemetry data. Maybe a new tool to analyze the logging output? That seems like a cool project, but it misses the short-term need to communicate some basics to our partners.

I am feeling that the generalized scheme/guidance/process that Tyler and I are brainstorming about may be a little "out there", so I'll personally be focused on getting stuff written down somewhere* that our partners can use. And then iterate with this landscape in mind.

* And the closer to the code where it's instrumented, the better.

@markfields (Member Author)

To Vlad's point, the other approach I can imagine is that we start thinking about where our logs have been critical in diagnosing live site and other partner issues, and where we could introduce API changes to communicate state/status so partners can log it in their own way and be empowered to debug from it.

That's quite a large departure from the way we've ever approached this, as far as I know, but worth considering/exploring.

@vladsud (Contributor) commented Oct 26, 2021

There are problems of two sorts here - the shape of logging (changing over time), but also assumptions about when and how logs are produced, i.e. internal behavior. Forcing no changes on both is essentially pinning (see http://wiki.c2.com/?PinningTests): it requires writing pinning tests on our side and disallows future changes in behavior (or makes them extremely expensive). Internal behavior becomes part of the contract to maintain.

I think we should not confuse "empowering partners to diagnose and debug" with pinning. We can empower diagnostics / debugging without pinning ourselves into the corner.

I'll use an example here: kernel ETW events from Windows are mostly easy to understand and reason about, but only if you have enough background in the specific area (like file IO, registry, threading & scheduling, etc.). That is, general background info on the specific area, access to source code, and good naming of events and properties is sufficient (in my experience) to diagnose and debug most problems. In fact, there were plenty of cases where no amount of individual event documentation would have been more useful than inspecting code (areas like writing to a file blocking due to extension of the VDL - now I can at least find some articles on this topic; back in the day there was only source code, and official docs mention it only in passing).
On the other hand, I found the areas where I lacked general understanding (the networking stack) hardest to analyze, and per-event documentation would not have closed that gap.

For fun, I searched for Microsoft-Windows-Kernel-Registry / 70EB4F03-C1DE-4F73-A051-33D13D5413BD and could barely find any info on the web (this seems to be the only doc I can find; nothing like it existed back in the day).

Translating to Fluid: we should document higher-level flows (e.g. the ODSP driver uses snapshots; use a better name for the "TreesLatest" event) and make the code more readable.

There is another topic - places where contracts need to be stricter / there is a clear down-level dependency in the form of code (of some sort). I'll break it into a couple of examples:

  1. Someone (ideally the ODSP team) owning client & server dashboards for snapshot download. The same team (ideally; that's not true today) owns the code and the dashboard, i.e. there is no crossing of team lines
    • FFX can track higher level boot perf (based on how long it takes to create Container) and could leverage other dashboards to get deeper insights, or do ad-hoc diagnostics / debugging using existing "full" self-explanatory telemetry
  2. Summarizer errors - our team owns the code and any dashboards in this space.
  3. FFX wants to track connectivity perf. Instead of relying on our telemetry, they should have their own (in their code base) relying on public APIs only (events).

I.e. I think there should be very few cases where telemetry flows across team / component boundaries with a requirement of being pinned with respect to shape and behavior.
I'm personally fine with saying that there are 10 or so events that we guarantee will be properly documented and maintained exactly as they are today forever, with appropriate tests ensuring that.
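
A rough illustration of example 3 above: the partner measures connection perf in their own code, keyed off public container events. The minimal event surface and the "connected"/"disconnected" names are simplifying assumptions about the API made for this sketch.

```typescript
// Rough illustration of example 3: the partner tracks connection time in their own
// code and logs it with their own pipeline, rather than consuming Runtime telemetry.
// The minimal event-emitter surface and event names are simplifying assumptions.
interface IContainerLike {
    on(event: "connected" | "disconnected", listener: () => void): void;
}

function trackConnectionPerf(
    container: IContainerLike,
    log: (name: string, props: Record<string, number>) => void,
): void {
    let disconnectedAt = Date.now();

    container.on("disconnected", () => {
        disconnectedAt = Date.now();
    });

    container.on("connected", () => {
        // Partner-owned telemetry, with the partner's own schema and destination.
        log("partner:connectionEstablished", { durationMs: Date.now() - disconnectedAt });
    });
}
```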

@markfields markfields removed the epic An issue that is a parent to several other issues label Oct 26, 2021
@markfields markfields changed the title Document key Telemetry events Consider documenting key Telemetry events Oct 28, 2021
@markfields (Member Author)

Planning to close this today or tomorrow, key stakeholders on my mind are aware of the plan and on board.
