diff --git a/requests-for-comments/2024-07-01-alerting.md b/requests-for-comments/2024-07-01-alerting.md
new file mode 100644
index 0000000..9eec76d
--- /dev/null
+++ b/requests-for-comments/2024-07-01-alerting.md
@@ -0,0 +1,160 @@
+### Request for comments (RFC): Threshold-based alerting
+
+#### Introduction
+
+This RFC proposes an architecture for an alerting feature in PostHog, which allows users to define conditions that
+trigger alerts based on their data. The system will evaluate these conditions at regular intervals and notify users
+when conditions are met. The design includes detailed logging for debugging and auditing, and efficient handling of
+task execution using Celery.
+
+#### Goals
+
+1. **Condition Evaluation**: Define and evaluate alert conditions at specified intervals.
+2. **Notification System**: Notify users when alert conditions are met.
+3. **Logging and Auditing**: Store detailed logs of each alert check.
+4. **Efficiency and Scalability**: Handle tasks efficiently to reduce load on the system.
+
+#### Architecture
+
+1. **Models**:
+   - **Alert**: Represents the alert configuration.
+   - **AlertCheck**: Logs each evaluation of an alert, including results and notification statuses.
+
+2. **Task Execution**:
+   - Use Celery to handle periodic checks and notifications.
+   - Implement concurrency limits and task chaining for efficient processing.
+
+#### Models
+
+##### Alert Model
+
+```python
+from django.db import models
+
+class Alert(models.Model):
+    team = models.ForeignKey("Team", on_delete=models.CASCADE)
+    insight = models.ForeignKey("posthog.Insight", on_delete=models.CASCADE)
+    name = models.CharField(max_length=100)
+    target_value = models.TextField()
+    anomaly_condition = models.JSONField(default=dict)
+```
+
+Credits: @nikitaevg in https://github.com/PostHog/posthog/pull/22554
+
+This is the main model, responsible for storing alert configurations.
+We may consider adding fields for alert frequency,
+notification preferences, and other settings based on user requirements. Future integration with CDP is also TBD.
+
+##### AlertCheck Model
+
+The `AlertCheck` model logs each evaluation of an alert:
+- `alert`: Foreign key to the `Alert` model.
+- `created`: Time of the check.
+- `calculated_value`: Result of the check.
+- `anomaly_condition`: Copy of the alert's condition at check time, in case the original is updated later.
+- `threshold_met`: Whether the condition was met.
+- `notification_sent`: Whether a notification was sent.
+- `error_message`: Any errors encountered.
+
+This model provides detailed logging for debugging and auditing purposes. It gives accountability and transparency to
+users by recording the results of each alert evaluation.
+Users may ask why an alert was triggered and whether the system is working as expected. The `AlertCheck` model helps
+answer these questions by providing a detailed history of alert evaluations.
+
+We might consider discarding old `AlertCheck` records after a certain period to manage database size.
+
+#### Task Execution with Celery
+
+##### Why Use Groups, Chains, and Chunks
+
+**Groups**: Allow multiple tasks to be executed in parallel, providing the ability to run concurrent checks for
+different alerts. By creating groups of tasks, we can limit the number of parallel operations to control system load
+and ensure efficient resource utilization.
+
+**Chains**: Enable sequential execution of tasks, ensuring that each alert check is followed by its corresponding
+notification task. This sequential execution is crucial for maintaining the logical flow of operations, ensuring that
+notifications are only sent after checks are completed.
+
+**Chunks**: Divide large lists of tasks into smaller, more manageable chunks. This method is particularly useful for
+handling bulk operations without overwhelming the system.
+Chunks ensure that large sets of alert checks are processed
+in smaller batches, which are then sequentially handled by chains within groups. Celery does not launch a separate
+task for each item in the list; it launches one task per chunk.
+
+Combining groups, chains, and chunks lets us bound parallelism, preserve the check-then-notify ordering, and keep the
+number of queued tasks small, which should make the design suitable for handling potentially large volumes of alerts
+and checks in PostHog.
+
+##### Check Alert Task
+
+The `check_alert` task evaluates an alert condition and logs the result. If the condition is met, it triggers a
+notification task.
+
+Rough code outline:
+
+```python
+from celery import shared_task, chain
+from .models import Alert, AlertCheck
+from django.utils import timezone
+
+@shared_task  # Think about expiration time and timeout
+def check_alert(alert_id):
+    alert = Alert.objects.get(id=alert_id)
+
+    # Keep in mind idempotency in case of retries - e.g., check if check is already in the database for this interval
+
+    calculated_value = 42  # Example calculated value - hand off calculation to existing insight code
+    threshold_met = True  # Example threshold check
+
+    alert_check = AlertCheck.objects.create(
+        alert=alert,
+        calculated_value=calculated_value,
+        threshold_met=threshold_met,
+        # ... (remaining fields elided in this outline)
+    )
+
+    if threshold_met:
+        send_notification.delay(alert_check.id)  # Launch notification task based on check object
+
+    return alert.id
+```
+
+##### Send Notification Task
+
+The `send_notification` task handles sending notifications and updates the `AlertCheck` with the notification status.
+
+```python
+@shared_task  # Think about expiration time and timeout
+def send_notification(alert_check_id):
+    alert_check = AlertCheck.objects.get(id=alert_check_id)
+
+    # Keep in mind idempotency in case of retries - e.g., check if notification was already sent
+
+    success = True  # Assume success
+    notification_status = {"status": "success"}
+
+    ...  # Send notification logic
+
+    alert_check.notification_sent = success  # Record the actual outcome rather than always True
+    alert_check.save()
+
+    return success
+```
+
+##### Scheduling Alert Checks
+
+The `schedule_alert_checks` task runs at regular intervals, creating groups and chains of alert check tasks to manage
+concurrency and sequence.
+
+```python
+from celery import chain, group, shared_task
+
+@shared_task
+def schedule_alert_checks():
+    alert_ids = list(Alert.objects.values_list("id", flat=True))
+
+    # Split the IDs into batches of 10
+    id_batches = [alert_ids[i:i + 10] for i in range(0, len(alert_ids), 10)]
+
+    # chunks() expects an iterable of argument tuples, one tuple per check_alert call
+    group_of_chains = group(
+        chain(check_alert.chunks([(alert_id,) for alert_id in id_batch], 10)) for id_batch in id_batches
+    )
+
+    group_of_chains.apply_async()
+```
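
To make the batching arithmetic behind `schedule_alert_checks` concrete, here is a minimal sketch that runs without Celery. `batch_alert_args` is a hypothetical helper name, not part of Celery or PostHog. It only illustrates how a list of alert IDs could be turned into batches of one-element argument tuples, which is the shape `chunks` expects, so that one task is launched per batch rather than per alert.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_alert_args(alert_ids: Iterable[int], size: int) -> Iterator[List[Tuple[int]]]:
    """Yield lists of one-element argument tuples, at most `size` per list.

    Hypothetical helper for illustration only: each tuple holds the positional
    arguments for a single check_alert(alert_id) call.
    """
    if size < 1:
        raise ValueError("size must be at least 1")

    batch: List[Tuple[int]] = []
    for alert_id in alert_ids:
        batch.append((alert_id,))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # Emit the final, possibly shorter batch
        yield batch


if __name__ == "__main__":
    # 25 alert IDs with a batch size of 10 give three batches: 10, 10, and 5 items
    for number, batch in enumerate(batch_alert_args(range(1, 26), 10), start=1):
        print(f"batch {number}: {len(batch)} checks, first args {batch[0]}")
```

With this shape, the number of scheduled tasks grows with the number of batches rather than the number of alerts. The batch size of 10 is taken from the outline above and would likely need tuning against real load.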