Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics instrumentation #522

Closed
6 tasks
vyzo opened this issue Jan 20, 2019 · 19 comments
Closed
6 tasks

Metrics instrumentation #522

vyzo opened this issue Jan 20, 2019 · 19 comments

Comments

@vyzo
Copy link
Contributor

vyzo commented Jan 20, 2019

We need to instrument the various subsystems for metrics collection.
This issue is here to track progress towards instrumentation.

Subsystems that need to be instrumented:

  • dialer
  • swarm
  • relay
  • DHT
  • pubsub
  • daemon -- metrics collection endpoint

cc @Stebalien @raulk @bigs @anacrolix

@anacrolix
Copy link
Contributor

What are the metrics interfaces to be supported? Where is this recorded?

@Stebalien
Copy link
Member

We use prometheus in ipfs and it works pretty well. We use the interfaces in go-metrics-interface and the implementation in go-metrics-prometheus so we can swap out metrics implementations.

@bigs
Copy link
Contributor

bigs commented Jan 22, 2019

yeah, let's continue down the prometheus/grafana train. as a note, i'd love to see our metrics interface extended to support "labels", arbitrary KV pairs. these labels will allow us to get deeper insight into our relatively abstract gauges/counters/timers. two things i'd love to see instrumented off the bat:

  • dht: outbound dials (by peer) being made, tied back (via label) to the key they're being made in pursuit of
  • dialer: outbound dials (by addr) being made, tied back (via label) to the peer they're being made in pursuit of

@anacrolix
Copy link
Contributor

anacrolix commented Mar 27, 2019

I was initially in favour of OpenCensus, however after experimentation, here are my observations on it:

  • Ultimately if we're trying to export Prometheus, the mental overhead of figuring out how to have things exposed in the way I wanted in Prometheus quickly became tedious: There are differences in recommendations on naming conventions, namespacing etc., and various "automatic" conversions that occur.
  • The OC form of things is much longer winded, and favours exposing a lot more detail. (To justify separate metrics from views, we felt we needed a separate package so as not to "waste" them. At least all views needed to be exposed, we couldn't just provide a bundle, or automatically register them like in Prometheus. There are also interactions with contexts, various forms of tag mutators, multiple packages etc.)
  • No support yet for Gauges, or GaugeFunc in particular. It's coming, but I'm sure it will be several months before it settles, or longer before it reaches parity with Prometheus. It was tricky enough to expose GaugeFunc with labels, I can't imagine how much harder it will be to do a dance from OC to get this to work for Prometheus, especially without there even being a spec for it in OC yet.
  • No support for histograms summaries. Maybe that's not such a problem, they're sort of not recommended, but it's a thing.
  • Prometheus has much better community penetration. There are various packages that expose things for you through Prometheus, which would require starting all over with OC, or juggling multiple metrics systems to try to get everything exported correctly.
  • OC includes tracing: I'm not sure that's actually a good thing, and perhaps we can ignore it. I'm not sure all this stuff needs to be in a single implementation. At least it appears that the tracing will build on the standard library stuff. Different backends are required to export traces anyway, to my understanding.
  • Prometheus has good support for conversion to other systems. By comparison, if you look at the stats backends OC supports for Go you have Datadog, Stackdriver, and Prometheus. Datadog does not support other language implementations of libp2p. Stackdriver has support for Prometheus.

@raulk
Copy link
Member

raulk commented Mar 27, 2019

OC includes tracing: I'm not sure that's actually a good thing, and perhaps we can ignore it. I'm not sure all this stuff needs to be in a single implementation.

One advantage of using OC for both metrics and traces is that it automatically saves "exemplar" traces for values in each histogram bucket.

I also asked for a clarification of the direction they're taking with the whole metrics and stats duality: census-instrumentation/opencensus-specs#253. You didn't mention this, but the opacity there is unnerving, and is definitely a drawback for me.

No support yet for Gauges, or GaugeFunc in particular. It's coming,

It's not obvious from the issue that they understood you were asking about gauge functions. I've explicitly asked in a follow-up question.


Representing {Phantom}Drift (libp2p visualiser), we can make the metrics export for the introspection protocol work with whichever system.

I do think that lazily computed gauge functions are important to us, and that doesn't even appear to be on the roadmap or sensical for OC because they aspire to be primarily push-based (encouraging use of OpenCensus Agent/Service). Not sure if I'd tag them as a dealbreaker, but certainly close to it.

Also, the maturity and stability of Prometheus should not go undervalued. OpenCensus has to mature a great deal, I'm sure there will be significant API changes (as evidenced by the stats vs. metrics thing now), and if we choose it, it would be more of a risky bet on the grounds of seeing potential that we're willing to support with time and effort, versus choosing the proven, robust option.

OTOH, if we want to incorporate tracing, automatic "examplars" are a nicety we only get with OC, but not the panacea either.

@raulk
Copy link
Member

raulk commented Mar 28, 2019

OC does support lazy metrics calculation via thunks: census-instrumentation/opencensus-go#1080 (comment)

@raulk
Copy link
Member

raulk commented Mar 28, 2019

OC Metrics/Stats difference here: census-instrumentation/opencensus-specs#253 (comment)

@lanzafame
Copy link
Contributor

lanzafame commented Mar 28, 2019

No support for histograms. Maybe that's not such a problem, they're sort of not recommended, but it's a thing.

An OpenCensus Distribution provides you with the data buckets to create a histogram, example prometheus query histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Prometheus has much better community penetration. There are various packages that expose things for you through Prometheus, which would require starting all over with OC, or juggling multiple metrics systems to try to get everything exported correctly.

Please link to something that would take no more than 2 hours to port from Prometheus to OpenCensus, as evidence of this point. On top of this, you can still register these tools when using OC. Snippet from cluster:

	// setup Prometheus
	registry := prom.NewRegistry()
	goCollector := prom.NewGoCollector()
	procCollector := prom.NewProcessCollector(prom.ProcessCollectorOpts{})
	registry.MustRegister(goCollector, procCollector, prommod.NewCollector("ipfs-cluster"))
	pe, err := prometheus.NewExporter(prometheus.Options{
		Namespace: "cluster",
		Registry:  registry,
	})
	if err != nil {
		return err
	}

The OC form of things is much longer winded, and favours exposing a lot more detail. (To justify separate metrics from views, we felt we needed a separate package so as not to "waste" them. At least all views needed to be exposed, we couldn't just provide a bundle, or automatically register them like in Prometheus. There are also interactions with contexts, various forms of tag mutators, multiple packages etc.)

This may be a pain for you as developers of an upstream package, but that is what makes the metrics of your package documented and accessible to your downstream users.

primarily push-based

Regardless of the OC Agent/Service aspect, which is secondary to this discussion, OC is neither push nor pull as that is based on the backend that you configure it to use. If you configure prometheus as the metrics backend, you still end up with the metrics endpoint that prometheus then scrapes.

it would be more of a risky bet on the grounds of seeing potential that we're willing to support with time and effort

I don't think OC is going to need your support, its come out of Google and now has Microsoft support. Not a guarantee of success by any means but this isn't some vapourware project that we are talking about.

choosing the proven, robust option

Yes, it is proven, and robust hence work like OpenMetrics, a Cloud Native Foundation Sandbox project.

Edit: removed points addressed between refreshes of issue. Added code snippet.

@raulk
Copy link
Member

raulk commented Mar 28, 2019

@lanzafame

I don't think OC is going to need your support, its come out of Google and now has Microsoft support.

This sounds like appeal to authority, and it's a poor interpretation of my argument. They won't need us, but it's undeniable that we are investing more time and effort in wrapping our heads around OC than around Prometheus. The takeaway for several of us is that it's immature and heavily in evolution, and the communication around ongoing changes isn't great (e.g. stats vs. metrics thing). So yes, it's an investment from our side to keep up with their evolution, much more than just going with stable Prometheus.

Yes, it is proven, and robust hence work like OpenMetrics, a Cloud Native Foundation Sandbox project.

Can you point me to a project that relies – or builds upon – OpenMetrics?

Just to put some magnitude on this. Number of packages importing each:

@raulk
Copy link
Member

raulk commented Mar 28, 2019

So far, these are my conclusions. With a few details clarified (e.g. lazy gauges), they appear to be at feature parity. The devil might be in the details, though.

Note that this evaluation is for the METRICS functionality. For tracing, I think we're onboard with using OpenCensus.

Prometheus:

👍 Safe bet, mature. Widely used in Go. Mature instrumentation for Node, Rust, Java, Python, C++.

👎 Ties us to a specific metrics backend (even if there are converters out there), export format is textual.

OpenCensus:

👍 Open standard not tied to any particular solution (good for libp2p; as a library we should not impose a world view onto our users); segways synergistically into tracing; push support.

👎 In heavy evolution, little traction since 2018, heavily WIP, configuring to export to Prometheus in a specific way proved debatably challenging, verbose configuration, not great DX due to lack of docs and opacity around changes (e.g. stats vs. metrics thing), will require active participation and involvement upstream. Multi-language support still in the works (many libraries are alpha; not good for libp2p). Seems like a leap of faith (which we might want to take); all backends supported by OpenCensus already support Prometheus scraping; key features we want to use are still experimental and with big fat warnings of API instability.


Let's hear another round of feedback, then propose a way forward by Friday. Whichever one it is, let's set a checkpoint in 1 month to evaluate how it's playing out. We will mark this feature as experimental to convey to our users things are subject to change under the hood.

If in 1 month we find that we're subjectively unhappy with the choice, the system gets in our way, certain features don't work as advertised, DX is bad, we feel frustrated, etc. We can either get involved upstream, or make a switch if we lack the capacity. I doubt we'll have any of these issues with Prometheus.


EDIT: added new pros/cons; clarified this evaluation is for the metrics aspect.

@lanzafame
Copy link
Contributor

This sounds like appeal to authority, and it's a poor interpretation of my argument.

I clearly added a caveat to clarify this wasn't an appeal to authority but a rebuttal to the implication that OC was some small-time project.

Yes, it is proven, and robust hence work like OpenMetrics, a Cloud Native Foundation Sandbox project.

Can you point me to a project that relies – or builds upon – OpenMetrics?

You misunderstood, I am agreeing that Prometheus is robust hence OpenMetrics taking and extending their data model, with OC implementing the more generalised standard.

Just to put some magnitude on this. Number of packages importing each:

OpenCensus stats package: 137 importers.
Prometheus: 8047 importers.

To add context to your numbers:

OpenCensus released: Janurary 2018
Prometheus release: November 2012

configuring to export to Prometheus in a specific way proved debatably challenging

Evidence of this please? I had no issue doing so in Cluster...

(e.g. stats vs. metrics thing)

Can you please explain what the issue is here? I don't have any trouble understanding the difference.

@raulk
Copy link
Member

raulk commented Mar 28, 2019

Just to put some magnitude on this. Number of packages importing each:

OpenCensus stats package: 137 importers.
Prometheus: 8047 importers.

To add context to your numbers:

OpenCensus released: Janurary 2018
Prometheus release: November 2012

Can't say I'm impressed by the traction. EDIT: Not implying it's not a useful investment, I'm just pointing out we'd be very early adopters, and we don't have a seat at the table.

configuring to export to Prometheus in a specific way proved debatably challenging

Evidence of this please? I had no issue doing so in Cluster...

@anacrolix, could you point to concrete code samples? I was summarising the first bullet point from your feedback.

(e.g. stats vs. metrics thing)

Can you please explain what the issue is here? I don't have any trouble understanding the difference.

Contextualising the citation:

not great DX due to lack of docs and opacity around changes (e.g. stats vs. metrics thing)

What I'm saying is that these are two disjoint ways of generating metrics, and if I were an OC user that had invested into complying with their stats API, I'd be blind-sided. I understand the difference now after asking and digging into source: their docs aren't clear at all, and this area is not spec'ed. That's what I mean with my remark.

@anacrolix
Copy link
Contributor

No support for histograms. Maybe that's not such a problem, they're sort of not recommended, but it's a thing.

I meant summaries, which was in the linked document.

Please link to something that would take no more than 2 hours to port from Prometheus to OpenCensus, as evidence of this point. On top of this, you can still register these tools when using OC. Snippet from cluster:

That's not a good argument, as much as I love implementing stuff.

Regardless of the OC Agent/Service aspect, which is secondary to this discussion, OC is neither push nor pull as that is based on the backend that you configure it to use. If you configure prometheus as the metrics backend, you still end up with the metrics endpoint that prometheus then scrapes.

https://godoc.org/go.opencensus.io/stats/view#SetReportingPeriod and https://github.com/anacrolix/go-libp2p-dht-tool/blob/feat/oc-metrics/main.go#L374

@anacrolix
Copy link
Contributor

@raulk there are a lot of code links further down in my summary. Regarding the overhead, it took me most of a day to convert a small number of metrics, from OC to Prometheus. Here are some conversions from OC metrics to Prometheus for go-libp2p-dht-tool, and for go-libp2p-kad-dht. Note that the Prometheus side of things has a few more features in the DHT, as I've continued adding metrics there. Some of the file diffs are collapsed.

@raulk
Copy link
Member

raulk commented Mar 29, 2019

After carefully weighing pros and cons, and speaking to a number of stakeholders (incl. @frrist and @scout today) for alignment, I recommend to go with OpenCensus for both metrics and tracing, and really strive to make it work.

We talked about a 1-month evaluation period, but I want us to try as hard as possible to make this a success. For me, that pathway is an ultimate backstop/escape hatch, and we'll need to justify it clearly and quantitively if we use that card.

Here are my closing notes:

  • libp2p is a low-level library for apps, and we should not impose our world views onto others. OpenCensus is an abstraction layer like slf4j in Java, so users can easily pick the backend they prefer.
  • OC and Prometheus are at feature parity. The things we thought were unsupported have made it into OC recently (e.g. lazy gauge evaluation).
  • Interplanetary adoption: Filecoin and IPFS cluster have adopted OC, and @frrist and @lanzafame are willing to collaborate with us. Converging makes life easier for Infra too, and the push model is much nicer for cloud-hosted metrics than Prometheus' pull model which requires port opening, NAT manipulation, reverse proxy tweaking, and exposes a security surface.
  • Yes, the ergonomics for stats/view configuration is questionable in OC, and exporting things like we want to see them in Prometheus may be tricky. Let's build interplanetary API sugar to alleviate the pain. @frrist is happy to work shoulder-to-shoulder with us.
  • OC metrics and tracing synergyse pretty nicely. An example are exemplars (sounds like a pun, but it isn't).
  • By all accounts, we have to acknowledge we are an early adopter of the stats/metrics part of OC. After deduping and removing Google-related projects like knative, istio, GCP, we are probably project number 15 or so. That makes me somewhat nervous, as we're betting on a future that we do not influence.
  • Cross-language libraries are not as solid as Prometheus (e.g. node is still alpha), but by the time we spec out the metrics we want to standardise across implementations, they will have probably caught up.
  • Kubernetes uses Prometheus, and there is no talk about adopting OC there. That worries me.
  • While OC is spec-first, some processes like decision-making, discussion and prioritisation seem to be conducted behind closed doors. I don't like this. Let's try to reach out to the OC leads, and pick their brains about future direction.

Thank you all for the constructive input! Let's instrument all the things! 💪

@bigs
Copy link
Contributor

bigs commented Mar 29, 2019

thanks for collecting all of this, @raulk! sounds good to me!

@lanzafame
Copy link
Contributor

@raulk on the k8s front, searching the org does turn up results, but only on the tracing front, granted k8s just rebuilt their metrics collection tooling, heapster -> metrics-server.

Also, I have mentioned this elsewhere but to reiterate, I am more than happy to help out on any rough edges and getting this implemented smoothly across all the orgs.

Thanks for diving deep on this, sorry if the discussion got too passionate at times 🙏

@marten-seemann
Copy link
Contributor

Closing. The current metrics effort is tracked in #1356.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants