CBG-4187: Add stat to track number of assertion failures #7127

Open
wants to merge 1 commit into
base: main
1 change: 1 addition & 0 deletions base/devmode.go
@@ -21,5 +21,6 @@ func IsDevMode() bool {
// AssertfCtx panics when compiled with the `cb_sg_devmode` build tag, and just warns otherwise.
// Callers must be aware that they are responsible for handling returns to cover the non-devmode warn case.
func AssertfCtx(ctx context.Context, format string, args ...any) {
SyncGatewayStats.GlobalStats.ResourceUtilization.AssertionFailCount.Add(1)
assertLogFn(ctx, format, args...)
}
6 changes: 6 additions & 0 deletions base/stats.go
@@ -299,6 +299,10 @@ func (g *GlobalStat) initResourceUtilizationStats() error {
if err != nil {
return err
}
resUtil.AssertionFailCount, err = NewIntStat(ResourceUtilizationSubsystem, "assertion_fail_count", StatUnitNoUnits, AssertionFailCountDesc, StatAddedVersion3dot2dot1, StatDeprecatedVersionNotDeprecated, StatStabilityCommitted, nil, nil, prometheus.CounterValue, 0)
if err != nil {
return err
}
resUtil.CpuPercentUtil, err = NewFloatStat(ResourceUtilizationSubsystem, "process_cpu_percent_utilization", StatUnitPercent, ProcessCPUPercentUtilDesc, StatAddedVersion3dot0dot0, StatDeprecatedVersion3dot2dot0, StatStabilityCommitted, nil, nil, prometheus.GaugeValue, 0)
if err != nil {
return err
@@ -379,6 +383,8 @@ type ResourceUtilization struct {
SystemMemoryTotal *SgwIntStat `json:"system_memory_total"`
// The total number of warnings logged.
WarnCount *SgwIntStat `json:"warn_count"`
// The total number of assertion failures logged.
AssertionFailCount *SgwIntStat `json:"assertion_fail_count"`
// The total uptime.
Uptime *SgwDurStat `json:"uptime"`
}
3 changes: 2 additions & 1 deletion base/stats_descriptions.go
@@ -64,7 +64,8 @@ const (

SystemMemoryTotalDesc = "The total memory available on the system in bytes."

WarnCountDesc = "The total number of warnings logged."
WarnCountDesc = "The total number of warnings logged."
AssertionFailCountDesc = "The total number of assertion failures logged."
Collaborator

Should we say what an assertion failure is? Something like:

Assertion failures are already included in the warning count. They mark violations of software preconditions that are not caused by users.

This message isn't good, but for these failures there's no meaningful action a user could take. That's the difference I see between these and other warnings, which a user could potentially act upon.

Calling AssertfCtx will call WarnfCtx directly, so every assertion failure already increments the WarnCount metric. We also have no way to tell from a log file whether a message was an assertion failure or a standard warning.
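
To make the overlap concrete, here is a minimal sketch of how the two counters would relate if the assertion path counts and then warns. The `stats` type and the `warnf`/`assertf` helpers are hypothetical stand-ins, not the Sync Gateway code; only the call order (assert, then warn) is taken from the description above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stats is a hypothetical stand-in for the WarnCount and
// AssertionFailCount stats; it is not the real stats struct.
type stats struct {
	warnCount          atomic.Int64
	assertionFailCount atomic.Int64
}

// warnf mimics a warning logger: it bumps the warning counter and
// returns the formatted line instead of writing it to a log.
func (s *stats) warnf(format string, args ...any) string {
	s.warnCount.Add(1)
	return "[WRN] " + fmt.Sprintf(format, args...)
}

// assertf mimics the non-devmode assertion path: it bumps the
// assertion counter, then delegates to warnf, so both counters move.
func (s *stats) assertf(format string, args ...any) string {
	s.assertionFailCount.Add(1)
	return s.warnf(format, args...)
}

func main() {
	var s stats
	s.warnf("disk usage at %d%%", 91)
	s.assertf("unexpected nil value for %s", "doc")
	// The assertion shows up in both counters.
	fmt.Println(s.warnCount.Load(), s.assertionFailCount.Load())
}
```

Under these assumptions the assertion count is always a subset of the warning count, and the two returned lines look identical in shape, which is the log-identification problem raised here.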

I don't know how useful this stat is if we can't easily identify assertion failures in the log files without searching for the individual messages.

I don't know if it makes sense to just add a prefix like "[WRN] Assertion Error: msg", or to change AssertfCtx -> WarnfCtx -> logTo to use a new log level. I don't think a log level change is right, because I think it would break the loose log compatibility we have.
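
A rough sketch of the prefix option, to show what it would buy us when reading logs. The `formatWarn`, `formatAssert`, and `countAssertions` helpers, the exact prefix wording, and the "[WRN] msg" line layout are all assumptions for illustration; none of them are existing functions.

```go
package main

import (
	"fmt"
	"strings"
)

// assertionPrefix is a hypothetical marker; the wording is undecided.
const assertionPrefix = "Assertion Error: "

// formatWarn builds a warning line in an assumed "[WRN] msg" layout.
func formatWarn(format string, args ...any) string {
	return "[WRN] " + fmt.Sprintf(format, args...)
}

// formatAssert builds the same warning line with the marker added,
// leaving the log level unchanged.
func formatAssert(format string, args ...any) string {
	return formatWarn(assertionPrefix+format, args...)
}

// countAssertions counts the log lines that carry the marker.
func countAssertions(lines []string) int {
	n := 0
	for _, line := range lines {
		if strings.Contains(line, "[WRN] "+assertionPrefix) {
			n++
		}
	}
	return n
}

func main() {
	lines := []string{
		formatWarn("slow query took %dms", 1200),
		formatAssert("nil revision for %q", "doc1"),
	}
	fmt.Println(countAssertions(lines))
}
```

With a fixed marker, the count recovered from a log file could be compared against the stat without knowing each assertion message in advance, and the level stays "[WRN]" so existing log parsing should be unaffected.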

Finally, does it actually make sense to log these as errors?

I'm not opposed to the stat, I just want it to be clear from the description how dev, support, and end users are intended to use it. Should this be specially flagged by nutshell, etc. as something that needs immediate dev attention? Do we expect special alerting from Capella?


UptimeDesc = "The total uptime."
)