-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: OpenTelemetry Metrics and Traces #2607
feat: OpenTelemetry Metrics and Traces #2607
Conversation
3f96dad
to
3409cf1
Compare
44c0e36
to
39f9a7e
Compare
google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Outdated
Show resolved
Hide resolved
.setUnit(COUNT) | ||
.buildWithCallback( | ||
measurement -> { | ||
measurement.record(this.numSessionsInUse, attributesInUseSessions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Access to properties like this.numSessionsInUse
and this.sessions
is normally guarded with a synchronized (lock)
block to prevent concurrent reads/writes to take place. Do we need that for this as well? (When are these callbacks being called?)
And if we were to add synchronized (lock)
in these callbacks: Do the threads that call these callbacks take any other locks as well? If yes, can we guarantee that there won't be a deadlock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The frequency of callback will depend on what exporter customers will be using with what export interval.
For instance cloudmonitoringexporter will be pushing data every 10 sec as default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not necessarily worried about the frequency, more about potential race conditions, and maybe also not so much about these specific cases. But we also call sessions.size()
for example without taking a lock. sessions.size()
is undefined if it is called while other threads are modifying the collection.
google-cloud-spanner/src/test/java/com/google/cloud/spanner/SessionPoolTest.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/SessionPoolTest.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/spi/v1/SpannerRpcMetricsTest.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/spi/v1/SpannerRpcMetricsTest.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/spi/v1/SpannerRpcMetricsTest.java
Outdated
Show resolved
Hide resolved
a40db75
to
854b673
Compare
google-cloud-spanner/src/main/java/com/google/cloud/spanner/spi/v1/HeaderInterceptor.java
Outdated
Show resolved
Hide resolved
e0e8292
to
f5353f5
Compare
google-cloud-spanner/src/main/java/com/google/cloud/spanner/AbstractReadContext.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/AbstractReadContext.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/AbstractResultSet.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/TransactionRunnerImpl.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/spi/v1/SpannerRpcViews.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/OpenTelemetrySpanTest.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/OpenTelemetrySpanTest.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/OpenTelemetrySpanTest.java
Outdated
Show resolved
Hide resolved
bb094a0
to
050b164
Compare
*/ | ||
@VisibleForTesting | ||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, please use @ObsoleteApi instead, see go/introduce-obsolete-api-java-sdk for details.
google-cloud-spanner/pom.xml
Outdated
@@ -16,6 +16,7 @@ | |||
<properties> | |||
<site.installationModule>google-cloud-spanner</site.installationModule> | |||
<opencensus.version>0.31.1</opencensus.version> | |||
<opentelemetry.version>1.30.1</opentelemetry.version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, once this PR is finalized, please let us know so we can add the OpenTelemetry dependencies to java-shared-dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
@@ -98,27 +114,20 @@ private void processHeader(Metadata metadata, TagContext tagContext) { | |||
measureMap.put(SPANNER_GFE_LATENCY, latency); | |||
measureMap.put(SPANNER_GFE_HEADER_MISSING_COUNT, 0L); | |||
measureMap.record(tagContext); | |||
|
|||
SpannerRpcMetrics.recordGfeLatency(latency, attributes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are recording OpenTelemetry metrics in addition to OpenCensus, this is fine if we use either OpenCensus exporters or OpenTelemetry exporters. But would it cause duplicated metrics if we use the shim?
Ideally, we can have an option(client options, environment variables etc.) for customers to choose either OpenCensus or OpenTelemetry instrumentation as suggested by OpenTelemetry team.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. For OpenTelemetry metrics and Spans to works , customers would need to create OpenTelemetry object and register as global else it will be no-op.
For Shim customers need not create opentelemetry object.
private Attributes getMetricAttributes(String method, SpannerProperties spannerProperties) { | ||
AttributesBuilder attributesBuilder = Attributes.builder(); | ||
attributesBuilder.put("database", spannerProperties.databaseId); | ||
attributesBuilder.put("instance_id", spannerProperties.instanceId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's a good practice for Traces, but for Metrics,database
, instance_id
and project_id
are not good choices for labels because of cardinality issue.
See what Cloud Monitoring suggested that
Do not use high-resolution values like timestamps, any kind of unique identifier, user IDs, IP addresses, unparameterized URLs, and so forth, for metric labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. But here we are migrating existing metrics, these attributes were already present. So if we remove these attributes now, it will be a breaking change for customers. We will keep this in mind with new metrics
.upDownCounterBuilder(NUM_SESSIONS_IN_POOL_NAME) | ||
.setDescription(NUM_SESSIONS_IN_POOL_DESCRIPTION) | ||
.setUnit(COUNT) | ||
.buildWithCallback( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a counter has to be built with call back, is it due to async counter has better performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not performance, but the values will keep changing. So async callbacks makes more sense here.
No region tags are edited in this PR.This comment is generated by snippet-bot.
|
Warning: This pull request is touching the following templated files:
|
075f3b9
to
026030e
Compare
d7cf07f
to
59f999c
Compare
google-cloud-spanner/pom.xml
Outdated
@@ -16,6 +16,8 @@ | |||
<properties> | |||
<site.installationModule>google-cloud-spanner</site.installationModule> | |||
<opencensus.version>0.31.1</opencensus.version> | |||
<opentelemetry.version>1.32.0</opentelemetry.version> | |||
<graalvm.version>22.3.3</graalvm.version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This seems unrelated, and also does not appear to be used. Remove?
"Id", | ||
transaction.getId().toStringUtf8(), | ||
"Timestamp", | ||
Timestamp.fromProto(transaction.getReadTimestamp()).toString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think you can just use timestamp
here, instead of decoding the proto again (that already happens on line 382)
if (val == null || val instanceof String) { | ||
String strVal = (String) val; | ||
otAttributesBuilder.put(key, strVal); | ||
ocAttributeValues.put(key, AttributeValue.stringAttributeValue(strVal)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that AttributeValue.stringAttributeValue(strValue)
requires strValue
to be non-null, meaning that this will fail when val == null
. Can we otherwise add a test that proves that a null value is OK?
|
||
@Override | ||
public void setStatus(Status status) { | ||
openTelemetrySpan.setStatus(StatusCode.ERROR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this always be set to StatusCode.ERROR
?
return ImmutableMap.of(); | ||
} | ||
|
||
private Attributes getOpenTelemetryExceptionAnnotations(Throwable e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
private Attributes getOpenTelemetryExceptionAnnotations(Throwable e) { | |
private Attributes createOpenTelemetryExceptionAnnotations(Throwable e) { |
private static boolean sessionMetricsEnabled = false; | ||
private static boolean rpcMetricsEnabled = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that all of this is static because we only use a global OpenTelemetry instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That is correct
a1cdc2b
to
a5aaedb
Compare
a5aaedb
to
644f64d
Compare
644f64d
to
c850e91
Compare
This PR contains opentelemetry Session metrics and grpc metrics implementation