Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(host-metrics): solve some TODO comments #124

Merged
merged 6 commits into from
Apr 10, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 39 additions & 23 deletions packages/opentelemetry-node/lib/metrics/host.js
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,9 @@
* should export all the necessary building blocks like `LastValueAccumulation`
* class and `AggregatorKind` enum
* - we might look for an alternate solution in this case (no proposal for now)
*
* Most of this code will be removed when https://github.com/open-telemetry/opentelemetry-js/issues/4616
* is solved.
*/
/**
* @typedef {import('@opentelemetry/api').HrTime} HrTime
Expand All @@ -52,10 +55,12 @@ const {Aggregation, DataPointType, View} = metrics;
const {HostMetrics} = require('@opentelemetry/host-metrics');

/**
* Copied from `@opentelemetry/sdk-metrics` since it's not exported
* TODO: Copied from `@opentelemetry/sdk-metrics` since it's not exported
* https://github.com/open-telemetry/opentelemetry-js/blob/f86251d40fbf615be87319c8a1f5643afb820076/packages/sdk-metrics/src/aggregator/LastValue.ts#L34
*
* @todo remoce this class and require it when exported
* Remove when https://github.com/open-telemetry/opentelemetry-js/issues/4616 is fixed
*
* @todo remove this class when sdk.metrics exports it
* @class
* @implements {Accumulation}
*/
Expand Down Expand Up @@ -100,8 +105,9 @@ class LastValueAccumulation {
* @implements {Aggregator<LastValueAccumulation>}
*/
class SystemCpuUtilizationAggregator {
// TODO: hardcoded the value of `AggregatorKind` enum for GAUGE
// remove when exported
// TODO: Hardcoded the value of `AggregatorKind` enum for GAUGE. Remove
// when issue below is fixed
// Issue: https://github.com/open-telemetry/opentelemetry-js/issues/4616
// https://github.com/open-telemetry/opentelemetry-js/blob/f86251d40fbf615be87319c8a1f5643afb820076/packages/sdk-metrics/src/aggregator/types.ts#L23
kind = 2;

Expand Down Expand Up @@ -158,8 +164,19 @@ class SystemCpuUtilizationAggregator {
}

/**
* Does the sum of data points grouping by `system.cpu.logical_number` so we have the total
* utilization per CPU. Basically the value would be 1 - idle_value
* Groups data points by `system.cpu.logical_number` so we have the total
* utilization per CPU.
*
* We cannot sum up the utilization of all the states since `os.cpus()` is
* not returning all of the possible states but limited to: user, nice, sys, idle, irq
* https://nodejs.org/api/all.html#all_os_oscpus
*
* where in linux we have more: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice
* https://man7.org/linux/man-pages/man5/proc.5.html
*
* So in order to have the most accurate metric of utilization we use
* the formula 1 - (idle utilization)
*
* As an example given the data points:
* - { value: 0.1, attributes: { 'system.cpu.logical_number': 0, 'system.cpu.state': 'idle' } }
* - { value: 0.5, attributes: { 'system.cpu.logical_number': 0, 'system.cpu.state': 'system' } }
Expand Down Expand Up @@ -188,15 +205,6 @@ class SystemCpuUtilizationAggregator {
accumulationByAttributes,
endTime
) {
// We cannot sum up the utilization of all the states since `os.cpus()` is
// not returning all of the possible states but limited to: user, nice, sys, idle, irq
// https://nodejs.org/api/all.html#all_os_oscpus
//
// where in linux we have more: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice
// https://man7.org/linux/man-pages/man5/proc.5.html
//
// So in order to have the most accurate metric of utilization we use
// the formula 1 - (idle utilization)
return {
descriptor,
aggregationTemporality,
Expand Down Expand Up @@ -225,27 +233,35 @@ class SystemCpuUtilizationAggregation extends Aggregation {
/** @type {HostMetrics} */
let hostMetricsInstance;
function enableHostMetrics() {
// TODO: make this configurable, user might collect host metrics with a separate utility
hostMetricsInstance = new HostMetrics({
name: '',
});
// @ts-ignore - config interface expects a `name` property but there is a default value
hostMetricsInstance = new HostMetrics({});
hostMetricsInstance.start();
}

// It is known that host metrics sends a lot of data so for now we drop some
// instruments that are not handled by Kibana and doing aggregations
// for others that we want to include shorly (CPU metrics)
// Ref (data amount issue): https://github.com/elastic/elastic-otel-node/issues/51
// Ref (metrics in Kibana): https://github.com/elastic/kibana/pull/174700
/** @type {metrics.View[]} */
const HOST_METRICS_VIEWS = [
// drop `system.network.*` metrics for now
// drop `system.network.*` (not in Kibana)
new View({
instrumentName: 'system.network.*',
aggregation: Aggregation.Drop(),
}),
// drop `system.cpu.time` also
// TODO: check if we can do an aggregation here
// drop `system.cpu.time` (not in Kibana)
new View({
instrumentName: 'system.cpu.time',
aggregation: Aggregation.Drop(),
}),
// use the aggregation we craeted above
// drop `process.*` (not in Kibana)
new View({
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewer: dropping process.* metrics for now since there is no UI showing them. We probably want an aggregation for process.cpu.utilization but I think it would be easier and shorter once open-telemetry/opentelemetry-js#4616 is done

instrumentName: 'process.*',
aggregation: Aggregation.Drop(),
}),
// Do an aggregation to avoid cardinality problems because of the possible
// permutations of state & logical_number attributes
new View({
instrumentName: 'system.cpu.utilization',
aggregation: new SystemCpuUtilizationAggregation(),
Expand Down
7 changes: 4 additions & 3 deletions packages/opentelemetry-node/lib/sdk.js
Original file line number Diff line number Diff line change
Expand Up @@ -83,14 +83,15 @@ class ElasticNodeSDK extends NodeSDK {
// TODO metrics exporter should do for metrics what `TracerProviderWithEnvExporters` does for traces, does that include `url` export endpoint?
// TODO what `temporalityPreference`?
// TODO make the Millis values configurable. What would otel java do?
// TODO config to disable metrics? E.g. otherwise `http.server.duration` will send every period forever and data can be distracting
const metricsDisabled = process.env.ETEL_METRICS_DISABLED === 'true'; // TODO hack for now

// Disable metrics by config
const metricsDisabled =
process.env.ELASTIC_OTEL_METRICS_DISABLED === 'true';
if (!metricsDisabled) {
const metricsInterval =
Number(process.env.ETEL_METRICS_INTERVAL_MS) || 30000;
defaultConfig.metricReader =
new metrics.PeriodicExportingMetricReader({
// exporter: new metrics.ConsoleMetricExporter(),
exporter: new OTLPMetricExporter(),
exportIntervalMillis: metricsInterval,
exportTimeoutMillis: metricsInterval, // TODO same val appropriate for timeout?
Expand Down