Problems related to pruning old AppUsageEvent and ServiceUsageEvent records #4182

Open
joyvuu-dave opened this issue Jan 21, 2025 · 0 comments

Problem

We keep running into a problem with how old AppUsageEvent and ServiceUsageEvent records are pruned. Start records are often removed before the corresponding Apps or Services have actually stopped, which makes it difficult to determine how long those resources have been running. A consumer that starts polling after a start record has been pruned may never learn the true start time of that App or Service.

Challenges and Use of Purge/Seed

A specific pain point relates to the destructively_purge_all_and_reseed endpoints for App and Service Usage Events. These endpoints are sometimes used when event tables become inconsistent, often because start records were removed prematurely. While destructively_purge_all_and_reseed recreates records for running resources in the database, it assigns them new start timestamps that do not reflect the actual creation or launch times. As a result, usage metrics derived from those records can be misleading.
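To make the effect on metrics concrete, here is a minimal Python sketch. The timestamps and the `running_hours` helper are hypothetical, and it assumes a consumer derives runtime from the start timestamp it can see in the events table.

```python
from datetime import datetime, timezone


def running_hours(start_at: datetime, now: datetime) -> float:
    """Hours a resource appears to have been running, judged from its start record."""
    return (now - start_at).total_seconds() / 3600


# Hypothetical timeline for a single long-running App.
actual_start = datetime(2025, 1, 1, tzinfo=timezone.utc)   # when the App really started
reseeded_at = datetime(2025, 1, 20, tzinfo=timezone.utc)   # when purge-and-reseed ran
now = datetime(2025, 1, 21, tzinfo=timezone.utc)

true_hours = running_hours(actual_start, now)       # based on the real start time
reported_hours = running_hours(reseeded_at, now)    # based on the reseeded record

print(f"true: {true_hours:.0f}h, reported after reseed: {reported_hours:.0f}h")
```

In this made-up timeline the consumer would account for one day of runtime instead of twenty.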

Core Problems

  • Pruning Before Completion
    • The system prunes old records to manage database growth. However, if an App/Service remains running for a long period, its start record may be deleted before a stop record exists.
    • A newly added or recovering consumer then has no way to see the accurate start time.
  • Extended Downtime Leading to Missed Events
    • Sometimes, a usage-event polling service may go offline for an extended period (e.g. an unnoticed crash). By the time it resumes polling, older events may have been pruned, leaving gaps in historical data
  • Accurate State Visibility
    • It becomes challenging to piece together which Apps or Services are still running once critical events have been removed. This forces reliance on destructively_purge_all_and_reseed to reset the data, at the cost of accurate historical start times.
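The first of these problems can be reproduced with a small, simplified Python model. This is only a sketch of age-only pruning, not the Cloud Controller's actual code: the `UsageEvent` shape, the two-state model, the helper names, and the 31-day cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class UsageEvent:
    resource_guid: str
    state: str            # simplified to "STARTED" / "STOPPED" for this sketch
    created_at: datetime


def prune_by_age(events, now, cutoff_age):
    """Age-only pruning: keep an event only if it is not older than the cutoff."""
    return [e for e in events if now - e.created_at <= cutoff_age]


def running_without_start(events, running_guids):
    """Running resources whose start record is no longer in the table."""
    with_start = {e.resource_guid for e in events if e.state == "STARTED"}
    return sorted(set(running_guids) - with_start)


now = datetime(2025, 1, 21, tzinfo=timezone.utc)
events = [
    # app-a started 51 days ago and is still running (no stop record).
    UsageEvent("app-a", "STARTED", datetime(2024, 12, 1, tzinfo=timezone.utc)),
    UsageEvent("app-b", "STARTED", datetime(2025, 1, 15, tzinfo=timezone.utc)),
    UsageEvent("app-b", "STOPPED", datetime(2025, 1, 18, tzinfo=timezone.utc)),
]

remaining = prune_by_age(events, now, cutoff_age=timedelta(days=31))
orphaned = running_without_start(remaining, running_guids={"app-a"})
print(orphaned)
```

After pruning, `app-a` is still running but no record of its start is left for a new consumer to read.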

Potential Approaches

After running into this issue repeatedly, I’ve put together a set of code changes built around the following ideas:

  • Keep Start Records for Active Apps/Services
    • Records remain in place until the corresponding stop event is encountered, preventing the loss of essential lifecycle information.
  • Consumer Registration
    • By including consumer_guid and after_guid in usage-event requests, consumers can register themselves, allowing the Cloud Controller to avoid pruning events they have not yet processed
  • Threshold-Based Pruning
    • A configurable limit (threshold_for_keeping_unprocessed_records) ensures the database does not grow indefinitely if a registered consumer stays offline. If the record count exceeds this threshold, older entries can still be pruned
  • Endpoints for Managing Consumers
    • Operators or automated systems can view, remove, or otherwise manage registered consumers. Consumers can also deregister themselves, and operators can make more informed decisions about when to request destructively_purge_all_and_reseed.
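The first three ideas could combine into a single pruning decision roughly as follows. This Python sketch is not the proposed code changes themselves. It assumes events carry a monotonically increasing `id`, that each registered consumer's position is the id of the last event it processed (resolved from its after_guid), and that the threshold is compared against the number of unprocessed records. `select_prunable`, `consumer_positions`, and the event shape are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    id: int               # monotonically increasing, like a primary key
    resource_guid: str
    state: str            # simplified to "STARTED" / "STOPPED" for this sketch
    age_days: int


def select_prunable(events, cutoff_days, consumer_positions, threshold):
    """Return the ids of events that may be deleted under the sketched rules."""
    stopped = {e.resource_guid for e in events if e.state == "STOPPED"}

    # Position of the slowest registered consumer (None if nobody is registered).
    min_processed = min(consumer_positions.values(), default=None)
    unprocessed = [
        e for e in events if min_processed is not None and e.id > min_processed
    ]
    # Assumption: exceeding the threshold lifts consumer protection only.
    over_threshold = len(unprocessed) > threshold

    prunable = []
    for e in events:
        if e.age_days <= cutoff_days:
            continue  # not old enough to prune
        if e.state == "STARTED" and e.resource_guid not in stopped:
            continue  # resource still running: keep its start record
        if min_processed is not None and e.id > min_processed and not over_threshold:
            continue  # a registered consumer has not processed this event yet
        prunable.append(e.id)
    return prunable


events = [
    UsageEvent(1, "app-a", "STARTED", 60),  # still running
    UsageEvent(2, "app-b", "STARTED", 50),
    UsageEvent(3, "app-b", "STOPPED", 45),
    UsageEvent(4, "app-c", "STARTED", 40),
    UsageEvent(5, "app-c", "STOPPED", 35),
    UsageEvent(6, "app-d", "STARTED", 5),
]

# A consumer registered as "billing" has processed everything up to event 3.
normal = select_prunable(events, 31, {"billing": 3}, threshold=10)
# Same consumer, but the unprocessed backlog exceeds a much lower threshold.
backlog = select_prunable(events, 31, {"billing": 3}, threshold=2)
print(normal, backlog)
```

In both calls the start record for the still-running `app-a` survives. The threshold only decides whether events the `billing` consumer has not yet read are protected.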

Questions for the Community

  • Have folks run into a similar challenge with start events being pruned prematurely, leading to confusion about how long resources have been running?
  • Have you had to use destructively_purge_all_and_reseed in a similar manner?
  • Does retaining usage events of running Apps and Services sound like a beneficial idea?
  • Do consumer registration and threshold-based pruning strike a reasonable balance between data retention and database size management?
  • Are there alternative approaches that could better manage event pruning while preserving critical usage data?