# Add alerting models, task architecture RFC #216
### Request for comments (RFC): Threshold-based alerting

#### Introduction

This RFC proposes an architecture for an alerting feature in PostHog, which allows users to define conditions that
trigger alerts based on their data. The system will evaluate these conditions at regular intervals and notify users
when conditions are met. The design includes detailed logging for debugging and auditing, and efficient handling of
task execution using Celery.

#### Goals

1. **Condition Evaluation**: Define and evaluate alert conditions at specified intervals.
2. **Notification System**: Notify users when alert conditions are met.
3. **Logging and Auditing**: Store detailed logs of each alert check.
4. **Efficiency and Scalability**: Handle tasks efficiently to reduce load on the system.

#### Architecture

1. **Models**:
   - **Alert**: Represents the alert configuration.
   - **AlertCheck**: Logs each evaluation of an alert, including results and notification statuses.

2. **Task Execution**:
   - Use Celery to handle periodic checks and notifications.
   - Implement concurrency limits and task chaining for efficient processing.

> **Review comment** (on the choice of Celery): just checking whether it should be temporal rather than celery (i don't even know how we'd choose 🙈)

#### Models

##### Alert Model

```python
class Alert(models.Model):
    team = models.ForeignKey("Team", on_delete=models.CASCADE)
    insight = models.ForeignKey("posthog.Insight", on_delete=models.CASCADE)
    name = models.CharField(max_length=100)
    target_value = models.TextField()
    anomaly_condition = models.JSONField(default=dict)
```

Credits: @nikitaevg in https://github.com/PostHog/posthog/pull/22554

This main model is responsible for storing alert configurations. We may consider adding fields for alert frequency,
notification preferences, and other settings based on user requirements. Future integration with CDP is also TBD.
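
For illustration only, the `anomaly_condition` JSON might hold an absolute threshold; the exact schema is still open, and the key names below are placeholders rather than a decided format:

```python
# Hypothetical example of the JSON an Alert's anomaly_condition field could store.
# The schema is not decided in this RFC; "absolute_threshold" and its keys are
# placeholder names used purely for illustration.
example_anomaly_condition = {
    "absolute_threshold": {
        "lower": 100,  # alert if the calculated value drops below 100
        "upper": 1000,  # alert if the calculated value rises above 1000
    }
}
```
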
##### AlertCheck Model

The `AlertCheck` model logs each evaluation of an alert:

- `alert`: Foreign key to the Alert model.
- `created`: Time of the check.
- `calculated_value`: Result of the check.
- `anomaly_condition`: Copy of the condition for the alert, in case the original gets updated.
- `threshold_met`: Whether the condition was met.
- `notification_sent`: Whether a notification was sent.
- `error_message`: Any errors encountered.

> **Review discussion** (on `threshold_met` and the "anomaly" terminology):
>
> - nit: maybe `is_anomaly`? My first impression was that `threshold_met=true` means the value is below the threshold. Also, we might consider other anomaly conditions, not only threshold. On second thought, maybe it shouldn't be a boolean, but an enum. If there's an error in calculation, I don't know what boolean value can be assigned here. So what about `check_result = ANOMALY | NORMAL_VALUE | ERROR`?
> - maybe
> - 🤷 sounds good
> - I'd prefer having an enum here, IMO it will need more values than 2. You think `check_result` is not good?
> - Regarding "anomaly": is there any background or reasoning for this term? I think it evokes other expectations (anomaly detection) and is not quite what we have here with strict set thresholds. I would actually vote to use another term.
> - We need to decide whether to send out a notification (= threshold met), yes or no. That is boolean.
> - It just popped into my head. What term would you prefer? I would say a value that is above or below a user-defined threshold is an anomaly. Sorry, missed this comment last time.
> - i've built a couple of similar matchers in the past and they always evolve to support
>   i'd definitely have
>   in other parts of posthog we ended up having
>   tbh there's a predicate condition and a predicate result
>   maybe it's better to have it that generic
>   this might not always be an anomaly to the user - "tell me when an order value is about $1k" is a happy thing
> - but i should leave you @webjunkie to it 😊
> - That worries me a bit, I usually find it difficult to interpret optional booleans. My rule of thumb is that if something is slightly more difficult than yes/no then I default to an enum - it's more explicit and more extensible. I'm not advocating for different error types in the check_result. I'm just suggesting we use an enum to express this absence of the outcome more explicitly, have a required enum with three states. And still "Enums are cheap, IMO provide better readability, and are extensible". It's better to have a door open to extensions, especially with databases, isn't it?
> - One other thing that comes to mind, which might influence the modeling here: we currently would alert each time we check, if the threshold continues to be breached, right? Regarding the boolean discussion: I'm happy to talk about a specific alternative if you present one.
> - This was also my thinking.. anomaly is a too strong word, and we can leave it more generic.. But naming is hard 🤷🏻

This model provides detailed logging for debugging and auditing purposes. It gives users accountability and
transparency by recording the results of each alert evaluation. Users may ask why an alert was triggered and whether
the system is working as expected; the `AlertCheck` model helps answer these questions by providing a detailed history
of alert evaluations.
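
As a rough sketch, the model described above could look like this in Django; the field types and options here are assumptions and would need to be confirmed (for example, the review discussion above questions whether `threshold_met` should stay a boolean):

```python
from django.db import models


class AlertCheck(models.Model):
    """One evaluation of an Alert (sketch; field types are assumed, not final)."""

    alert = models.ForeignKey("Alert", on_delete=models.CASCADE, related_name="checks")
    created = models.DateTimeField(auto_now_add=True)
    calculated_value = models.FloatField(null=True, blank=True)
    # Snapshot of the condition at check time, in case the Alert is edited later.
    anomaly_condition = models.JSONField(default=dict)
    threshold_met = models.BooleanField(default=False)
    notification_sent = models.BooleanField(default=False)
    error_message = models.TextField(null=True, blank=True)
```
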
We might consider discarding old `AlertCheck` records after a certain period to manage database size.
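
If we do prune old records, a periodic cleanup task could be as simple as the sketch below; the 90-day retention window is an arbitrary placeholder:

```python
from datetime import timedelta

from celery import shared_task
from django.utils import timezone

from .models import AlertCheck


@shared_task
def cleanup_old_alert_checks(days: int = 90) -> int:
    """Delete AlertCheck rows older than the retention window (sketch)."""
    cutoff = timezone.now() - timedelta(days=days)
    deleted_count, _ = AlertCheck.objects.filter(created__lt=cutoff).delete()
    return deleted_count
```
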
#### Task Execution with Celery

##### Why Use Groups, Chains, and Chunks

**Groups**: Allow multiple tasks to be executed in parallel, providing the ability to run concurrent checks for
different alerts. By creating groups of tasks, we can limit the number of parallel operations to control system load
and ensure efficient resource utilization.

> **Review discussion** (on lines +67 to +69, about limiting concurrency):
>
> - I wonder maybe `limit_concurrency` (that you introduced) would work here? Basically that's all we need - limit the number of tasks in flight, no? If we add `limit_concurrency` and simply process all alerts in a `for` loop?
> - the trick here (however we do it) is to limit the number of running tasks such that if the number of celery pods scales, the max number of tasks does not. e.g. 10 per worker is fine with 4 pods, then under load we scale to 100 pods and we're running 1000 concurrent, not 40 (maybe teaching everyone to suck eggs 🙈)
> - IIUC the `limit_concurrency` implementation, it looks at the number of tasks in flight and drops the new one if it's above the limit, so I'd say it will keep the limit if we increase the number of pods, right? I haven't worked with Celery before, so not at all! What an odd expression though.
> - 🤣 i have a bad habit of speaking in idioms - sorry!
> - The limit concurrency implementation I did recently acts more as a safeguard. The reason is that if you put 100 tasks in the queue and they hit the limit, they will start to retry and retry, but only do so a couple of times and then fail. So this system isn't good for "slam stuff in the queue and wait for it to be done". Setting up groups and chains allows us to control that there are 10 groups launched (= 10 running at the same time) and they sequentially check the alerts. That's a simple way, I would say, to control how much is done at a time.
> - You are right, this would also break the chain 🤔 I also thought the same about whether we can limit worker concurrency... the problem is, this concurrency setting is for one worker. We might spin up several pods though. And they also might autoscale to even more. Unfortunately we currently cannot control the worker amount and autoscaling granularly per worker type. I can think about it more, but maybe the case that alerts fail from time to time is not that severe for a first version?
> - I suppose there's a way to create a dedicated pod with an exact number of CPUs for this queue. But yeah, I see it's not that simple. Let's try the chains and see how it goes.
> - It is indeed possible, but someone would need to dig into the configs and templates in the other repo 😓
> - I just noticed that our infra team pushed an unrelated adjustment to auto scaling, which now allows for exact worker/pod configuration of maximum instances. I'll test this with some other tasks and we might end up just using the same mechanism here after all 🎉
> - Oh, cool! What do you say we keep the mechanism with chains for now (to unblock development of other features and to merge that old PR) and if the configuration works out I'll adjust it?

**Chains**: Enable sequential execution of tasks, ensuring that each alert check is followed by its corresponding
notification task. This sequential execution is crucial for maintaining the logical flow of operations, ensuring that
notifications are only sent after checks are completed.

**Chunks**: Divide large lists of tasks into smaller, more manageable chunks. This method is particularly useful for
handling bulk operations without overwhelming the system. Chunks ensure that large sets of alert checks are processed
in smaller batches, which are then sequentially handled by chains within groups. Celery will make sure to not launch a
full task for each item in the list, but rather a task for each chunk.
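
As a toy illustration of the chunking behaviour (this assumes the `check_alert(alert_id)` task sketched later in this RFC, and is not part of the proposed code):

```python
# Ten alert ids, expressed as the argument tuples that chunks() expects.
alert_id_args = [(alert_id,) for alert_id in range(1, 11)]

# Instead of enqueueing 10 individual check_alert tasks, this enqueues 2 worker
# tasks, each of which runs check_alert sequentially for 5 of the ids.
check_alert.chunks(alert_id_args, 5).apply_async()
```
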

Combining these three constructs (groups, chains, and chunks) provides a robust framework for managing the complexity
of alert checks and notifications. This approach ensures scalability, efficiency, and maintainability, making it
suitable for handling potentially large volumes of alerts and checks in PostHog.
##### Check Alert Task

The `check_alert` task evaluates an alert condition and logs the result. If the condition is met, it triggers a
notification task.

Rough code outline:
```python
from celery import shared_task, chain
from .models import Alert, AlertCheck
from django.utils import timezone


@shared_task  # Think about expiration time and timeout
def check_alert(alert_id):
    alert = Alert.objects.get(id=alert_id)

    # Keep in mind idempotency in case of retries - e.g., check if a check is already in the database for this interval

    calculated_value = 42  # Example calculated value - hand off calculation to existing insight code
    threshold_met = True  # Example threshold check

    alert_check = AlertCheck.objects.create(
        alert=alert,
        calculated_value=calculated_value,
        threshold_met=threshold_met,
        # ... further fields from the AlertCheck model above
    )

    if threshold_met:
        send_notification.s(alert_check.id).delay()  # Launch notification task based on check object

    return alert.id
```

> **Review discussion** (on thresholds, attached to `threshold_met`):
>
> - since we're adding thresholds... i'd love to be able to add thresholds in the UI that don't alert, just so I can get some nice visuals on insights without needing alerts. that has the benefit we can ship it as soon as it's made and start benefitting and getting user feedback on the UX
> - Uh, that's a nice idea... So basically a threshold in the UI could be an alert that doesn't notify anyone (yet) and doesn't even need to be checked/calculated. But you could easily turn on notifications for that threshold.
> - yep, i want to see the threshold on the graph because that's useful, but don't need an alert for every threshold. e.g. measuring web vitals, i want to colour the graphs because it's hard to remember the 4 different thresholds (well, 8 because they have good, medium, poor), but I don't need an alert when they change cos they don't always indicate something i need to immediately react to

> **Review discussion** (on launching `send_notification` as a separate task):
>
> - I wonder maybe we don't extract it into a separate task? The only profit I see is that if a notification fails and we want to retry it, we won't calculate the metric value again. I'd say the profit is negligible assuming that notifications fail rarely 🤷 I'm not saying notifications should be in the same function with alerts though, I might as well extract it for readability
> - importantly
>   that's a really valuable thing
> - we absolutely should be able to retry a notification N times but only calculate the value once (or at least <N times, since we don't want to retry for 6 hours without checking if the alert has recovered 🙈)
> - The other reason I put this is that we can easily launch other things based on the trigger, e.g. hand it off to CDP, launch a Slack notification task, and so on. In this way it's prepared to be pluggable.
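
The actual value would come from the existing insight calculation code. As a hedged sketch of the threshold comparison itself, reusing the hypothetical `absolute_threshold` shape from the Alert model section (none of this is decided):

```python
def is_threshold_met(calculated_value: float, anomaly_condition: dict) -> bool:
    """Return True if the value breaches the configured threshold.

    Sketch only: assumes the hypothetical
    {"absolute_threshold": {"lower": ..., "upper": ...}} shape used earlier
    in this RFC; the real condition schema is still undecided.
    """
    threshold = anomaly_condition.get("absolute_threshold", {})
    lower = threshold.get("lower")
    upper = threshold.get("upper")
    if lower is not None and calculated_value < lower:
        return True
    if upper is not None and calculated_value > upper:
        return True
    return False
```
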
##### Send Notification Task

The `send_notification` task handles sending notifications and updates the `AlertCheck` with the notification status.
```python
@shared_task  # Think about expiration time and timeout
def send_notification(alert_check_id):
    alert_check = AlertCheck.objects.get(id=alert_check_id)

    # Keep in mind idempotency in case of retries - e.g., check if notification was already sent

    success = True  # Assume success
    notification_status = {"status": "success"}

    ...  # Send notification logic

    alert_check.notification_sent = True
    alert_check.save()

    return success
```
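
If we want the retry behaviour discussed earlier (retry the notification a few times without recalculating the insight), Celery's built-in retry options on the task decorator should cover it. A sketch, with placeholder numbers:

```python
from celery import shared_task


@shared_task(
    bind=True,
    autoretry_for=(Exception,),  # placeholder: would be narrowed to notification/transport errors
    retry_backoff=True,          # exponential backoff between attempts
    max_retries=3,               # retries the notification only, not the insight calculation
)
def send_notification(self, alert_check_id):
    ...  # same body as above
```
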
##### Scheduling Alert Checks

The `schedule_alert_checks` task runs at regular intervals, creating groups and chains of alert check tasks to manage
concurrency and sequence.
```python
from celery import group, chain


@shared_task
def schedule_alert_checks():
    alerts = Alert.objects.all()
    alert_ids = [alert.id for alert in alerts]

    # Split the full list of alert ids into batches of 10.
    alert_id_groups = [alert_ids[i:i + 10] for i in range(0, len(alert_ids), 10)]

    # One chain per batch: the batches run in parallel (the outer group), while
    # the checks inside a batch run sequentially (a single chunk task of up to
    # 10 ids inside each chain). Note that chunks() expects an iterable of
    # argument tuples, so the ids may need wrapping, e.g. [(id,) for id in batch].
    group_of_chains = group(
        chain(check_alert.chunks(batch, 10)) for batch in alert_id_groups
    )

    group_of_chains.apply_async()
```

> **Review discussion** (on the group/chain/chunks construction):
>
> - when we do implement this it definitely needs a comment - i think i understand what it's doing but i bet i'm wrong 🙈
> - That's my concern as well, it looks a bit difficult, and that's why I suggest `for` + `limit_concurrency`
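
To run the scheduler "at regular intervals", a Celery beat entry along these lines would work; the cadence and task path below are placeholders, not decisions:

```python
from celery.schedules import crontab

# Hypothetical beat entry, to be merged into the existing Celery app's beat_schedule.
# The module path and the 15-minute cadence are placeholders.
beat_schedule_entry = {
    "schedule-alert-checks": {
        "task": "posthog.tasks.alerts.schedule_alert_checks",
        "schedule": crontab(minute="*/15"),
    }
}
```
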

---

**PR conversation:**

> we already have a set of scheduled tasks which try and keep the cache warm, and we're not monitoring them well (IMHO as the person who left them not well monitored last time they were changed).
> this is effectively the same mechanism for a different reason.
> we could overlap these mechanisms so that caching overall gets better too.
> but we should definitely learn from the past and make sure that we know how often these tasks run, when they fail, and have appropriate alerting in place.

> Worth noting that it might not be a good idea to use the cache for alerts; users probably want the freshest value to be tested, at most 5-10 minutes of staleness. I should look into it actually, because alerts that use cached stale values are not very good alerts.

> mostly I don't think we should have a cache task processing X per minute and an alert task processing X per minute, since we're going to overload processing infrastructure.
> or at least we shouldn't ignore that we have both.
> the more i think about it, i think this might need limits - free customers can have 2 alerts, paid can have 10 (fill in numbers as you see fit).
> we can always allow more, but can't restrict them (AWS has soft and hard limits to protect service, but you can ask them to increase soft limits).
> (similarly we should consider if this is going to be in the open source or EE licensed portion of the code)

> Yeah, I agree with that, Mixpanel allows 50 alerts per client, probably there's a reason for that.
> 2 and 10 alerts seems a bit small though :) maybe anything above some number should be paid?
> I'd prefer keeping it in the open source, at least while I'm participating - I might require an extra approval from my employer to contribute to EE directories.

> We overhauled the query runners cache recently, so to get fresh data (and fill the cache) you can just use the execution mode `CALCULATE_BLOCKING_ALWAYS` in the alert task.