# Add alerting models, task architecture RFC #216
### Request for comments (RFC): Threshold-based alerting

#### Introduction

This RFC proposes an architecture for an alerting feature in PostHog, which allows users to define conditions that
trigger alerts based on their data. The system will evaluate these conditions at regular intervals and notify users
when conditions are met. The design includes detailed logging for debugging and auditing, and efficient handling of
task execution using Celery.

#### Goals

1. **Condition Evaluation**: Define and evaluate alert conditions at specified intervals.
2. **Notification System**: Notify users when alert conditions are met.
3. **Logging and Auditing**: Store detailed logs of each alert check.
4. **Efficiency and Scalability**: Handle tasks efficiently to reduce load on the system.

#### Architecture

1. **Models**:
   - **Alert**: Represents the alert configuration.
   - **AlertCheck**: Logs each evaluation of an alert, including results and notification statuses.

2. **Task Execution**:
   - Use Celery to handle periodic checks and notifications.
   - Implement concurrency limits and task chaining for efficient processing.

> **Review comment** (on the choice of Celery): just checking whether it should be temporal rather than celery (i don't even know how we'd choose 🙈)

#### Models

##### Alert Model

```python
class Alert(models.Model):
    team = models.ForeignKey("Team", on_delete=models.CASCADE)
    insight = models.ForeignKey("posthog.Insight", on_delete=models.CASCADE)
    name = models.CharField(max_length=100)
    target_value = models.TextField()
    anomaly_condition = models.JSONField(default=dict)
```

Credits: @nikitaevg in https://github.com/PostHog/posthog/pull/22554

This main model is responsible for storing alert configurations. We may consider adding fields for alert frequency,
notification preferences, and other settings based on user requirements. Future integration with CDP is also TBD.
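
For illustration only, the `anomaly_condition` JSON might hold an absolute threshold; the exact schema is still open, and the key names below are placeholders rather than a decided format:

```python
# Hypothetical example of the JSON an Alert's anomaly_condition field could store.
# The schema is not decided in this RFC; "absolute_threshold" and its keys are
# placeholder names used purely for illustration.
example_anomaly_condition = {
    "absolute_threshold": {
        "lower": 100,  # alert if the calculated value drops below 100
        "upper": 1000,  # alert if the calculated value rises above 1000
    }
}
```
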
##### AlertCheck Model

The `AlertCheck` model logs each evaluation of an alert:

- `alert`: Foreign key to the Alert model.
- `created`: Time of the check.
- `calculated_value`: Result of the check.
- `anomaly_condition`: Copy of the condition for the alert, in case the original gets updated.
- `threshold_met`: Whether the condition was met.
- `notification_sent`: Whether a notification was sent.
- `error_message`: Any errors encountered.

> **Review discussion** (on `threshold_met` and the "anomaly" terminology):
>
> - nit: maybe `is_anomaly`? My first impression was that `threshold_met=true` means the value is below the threshold. Also, we might consider other anomaly conditions, not only threshold. On second thought, maybe it shouldn't be a boolean, but an enum. If there's an error in calculation, I don't know what boolean value can be assigned here. So what about `check_result = ANOMALY | NORMAL_VALUE | ERROR`?
> - maybe
> - 🤷 sounds good
> - I'd prefer having an enum here, IMO it will need more values than 2. You think `check_result` is not good?
> - Regarding "anomaly": is there any background or reasoning for this term? I think it evokes other expectations (anomaly detection) and is not quite what we have here with strict set thresholds. I would actually vote to use another term.
> - We need to decide whether to send out a notification (= threshold met), yes or no. That is boolean.
> - It just popped into my head. What term would you prefer? I would say a value that is above or below a user-defined threshold is an anomaly. Sorry, missed this comment last time.
> - i've built a couple of similar matchers in the past and they always evolve to support
>   i'd definitely have
>   in other parts of posthog we ended up having
>   tbh there's a predicate condition and a predicate result
>   maybe it's better to have it that generic
>   this might not always be an anomaly to the user - "tell me when an order value is about $1k" is a happy thing
> - but i should leave you @webjunkie to it 😊
> - That worries me a bit, I usually find it difficult to interpret optional booleans. My rule of thumb is that if something is slightly more difficult than yes/no then I default to an enum - it's more explicit and more extensible. I'm not advocating for different error types in the check_result. I'm just suggesting we use an enum to express this absence of the outcome more explicitly, have a required enum with three states. And still "Enums are cheap, IMO provide better readability, and are extensible". It's better to have a door open to extensions, especially with databases, isn't it?
> - One other thing that comes to mind, which might influence the modeling here: we currently would alert each time we check, if the threshold continues to be breached, right? Regarding the boolean discussion: I'm happy to talk about a specific alternative if you present one.
> - This was also my thinking.. anomaly is a too strong word, and we can leave it more generic.. But naming is hard 🤷🏻

This model provides detailed logging for debugging and auditing purposes. It gives users accountability and
transparency by recording the results of each alert evaluation. Users may ask why an alert was triggered and whether
the system is working as expected; the `AlertCheck` model helps answer these questions by providing a detailed history
of alert evaluations.
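
As a rough sketch, the model described above could look like this in Django; the field types and options here are assumptions and would need to be confirmed (for example, the review discussion above questions whether `threshold_met` should stay a boolean):

```python
from django.db import models


class AlertCheck(models.Model):
    """One evaluation of an Alert (sketch; field types are assumed, not final)."""

    alert = models.ForeignKey("Alert", on_delete=models.CASCADE, related_name="checks")
    created = models.DateTimeField(auto_now_add=True)
    calculated_value = models.FloatField(null=True, blank=True)
    # Snapshot of the condition at check time, in case the Alert is edited later.
    anomaly_condition = models.JSONField(default=dict)
    threshold_met = models.BooleanField(default=False)
    notification_sent = models.BooleanField(default=False)
    error_message = models.TextField(null=True, blank=True)
```
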
We might consider discarding old `AlertCheck` records after a certain period to manage database size.
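
If we do prune old records, a periodic cleanup task could be as simple as the sketch below; the 90-day retention window is an arbitrary placeholder:

```python
from datetime import timedelta

from celery import shared_task
from django.utils import timezone

from .models import AlertCheck


@shared_task
def cleanup_old_alert_checks(days: int = 90) -> int:
    """Delete AlertCheck rows older than the retention window (sketch)."""
    cutoff = timezone.now() - timedelta(days=days)
    deleted_count, _ = AlertCheck.objects.filter(created__lt=cutoff).delete()
    return deleted_count
```
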
#### Task Execution with Celery

##### Why Use Groups, Chains, and Chunks

**Groups**: Allow multiple tasks to be executed in parallel, providing the ability to run concurrent checks for
different alerts. By creating groups of tasks, we can limit the number of parallel operations to control system load
and ensure efficient resource utilization.

> **Review discussion** (on lines +67 to +69, about limiting concurrency):
>
> - I wonder maybe `limit_concurrency` (that you introduced) would work here? Basically that's all we need - limit the number of tasks in flight, no? If we add `limit_concurrency` and simply process all alerts in a `for` loop?
> - the trick here (however we do it) is to limit the number of running tasks such that if the number of celery pods scales, the max number of tasks does not. e.g. 10 per worker is fine with 4 pods, then under load we scale to 100 pods and we're running 1000 concurrent, not 40 (maybe teaching everyone to suck eggs 🙈)
> - IIUC the `limit_concurrency` implementation, it looks at the number of tasks in flight and drops the new one if it's above the limit, so I'd say it will keep the limit if we increase the number of pods, right? I haven't worked with Celery before, so not at all! What an odd expression though.
> - 🤣 i have a bad habit of speaking in idioms - sorry!
> - The limit concurrency implementation I did recently acts more as a safeguard. The reason is that if you put 100 tasks in the queue and they hit the limit, they will start to retry and retry, but only do so a couple of times and then fail. So this system isn't good for "slam stuff in the queue and wait for it to be done". Setting up groups and chains allows us to control that there are 10 groups launched (= 10 running at the same time) and they sequentially check the alerts. That's a simple way, I would say, to control how much is done at a time.
> - You are right, this would also break the chain 🤔 I also thought the same about whether we can limit worker concurrency... the problem is, this concurrency setting is for one worker. We might spin up several pods though. And they also might autoscale to even more. Unfortunately we currently cannot control the worker amount and autoscaling granularly per worker type. I can think about it more, but maybe the case that alerts fail from time to time is not that severe for a first version?
> - I suppose there's a way to create a dedicated pod with an exact number of CPUs for this queue. But yeah, I see it's not that simple. Let's try the chains and see how it goes.
> - It is indeed possible, but someone would need to dig into the configs and templates in the other repo 😓
> - I just noticed that our infra team pushed an unrelated adjustment to auto scaling, which now allows for exact worker/pod configuration of maximum instances. I'll test this with some other tasks and we might end up just using the same mechanism here after all 🎉
> - Oh, cool! What do you say we keep the mechanism with chains for now (to unblock development of other features and to merge that old PR) and if the configuration works out I'll adjust it?

**Chains**: Enable sequential execution of tasks, ensuring that each alert check is followed by its corresponding
notification task. This sequential execution is crucial for maintaining the logical flow of operations, ensuring that
notifications are only sent after checks are completed.

**Chunks**: Divide large lists of tasks into smaller, more manageable chunks. This method is particularly useful for
handling bulk operations without overwhelming the system. Chunks ensure that large sets of alert checks are processed
in smaller batches, which are then sequentially handled by chains within groups. Celery will make sure to not launch a
full task for each item in the list, but rather a task for each chunk.
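
As a toy illustration of the chunking behaviour (this assumes the `check_alert(alert_id)` task sketched later in this RFC, and is not part of the proposed code):

```python
# Ten alert ids, expressed as the argument tuples that chunks() expects.
alert_id_args = [(alert_id,) for alert_id in range(1, 11)]

# Instead of enqueueing 10 individual check_alert tasks, this enqueues 2 worker
# tasks, each of which runs check_alert sequentially for 5 of the ids.
check_alert.chunks(alert_id_args, 5).apply_async()
```
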

Combining these three constructs (groups, chains, and chunks) provides a robust framework for managing the complexity
of alert checks and notifications. This approach ensures scalability, efficiency, and maintainability, making it
suitable for handling potentially large volumes of alerts and checks in PostHog.
##### Check Alert Task

The `check_alert` task evaluates an alert condition and logs the result. If the condition is met, it triggers a
notification task.

Rough code outline:
```python
from celery import shared_task, chain
from .models import Alert, AlertCheck
from django.utils import timezone


@shared_task  # Think about expiration time and timeout
def check_alert(alert_id):
    alert = Alert.objects.get(id=alert_id)

    # Keep in mind idempotency in case of retries - e.g., check if a check is already in the database for this interval

    calculated_value = 42  # Example calculated value - hand off calculation to existing insight code
    threshold_met = True  # Example threshold check

    alert_check = AlertCheck.objects.create(
        alert=alert,
        calculated_value=calculated_value,
        threshold_met=threshold_met,
        # ... further fields from the AlertCheck model above
    )

    if threshold_met:
        send_notification.s(alert_check.id).delay()  # Launch notification task based on check object

    return alert.id
```

> **Review discussion** (on thresholds, attached to `threshold_met`):
>
> - since we're adding thresholds... i'd love to be able to add thresholds in the UI that don't alert, just so I can get some nice visuals on insights without needing alerts. that has the benefit we can ship it as soon as it's made and start benefitting and getting user feedback on the UX
> - Uh, that's a nice idea... So basically a threshold in the UI could be an alert that doesn't notify anyone (yet) and doesn't even need to be checked/calculated. But you could easily turn on notifications for that threshold.
> - yep, i want to see the threshold on the graph because that's useful, but don't need an alert for every threshold. e.g. measuring web vitals, i want to colour the graphs because it's hard to remember the 4 different thresholds (well, 8 because they have good, medium, poor), but I don't need an alert when they change cos they don't always indicate something i need to immediately react to

> **Review discussion** (on launching `send_notification` as a separate task):
>
> - I wonder maybe we don't extract it into a separate task? The only profit I see is that if a notification fails and we want to retry it, we won't calculate the metric value again. I'd say the profit is negligible assuming that notifications fail rarely 🤷 I'm not saying notifications should be in the same function with alerts though, I might as well extract it for readability
> - importantly
>   that's a really valuable thing
> - we absolutely should be able to retry a notification N times but only calculate the value once (or at least <N times, since we don't want to retry for 6 hours without checking if the alert has recovered 🙈)
> - The other reason I put this is that we can easily launch other things based on the trigger, e.g. hand it off to CDP, launch a Slack notification task, and so on. In this way it's prepared to be pluggable.
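
The actual value would come from the existing insight calculation code. As a hedged sketch of the threshold comparison itself, reusing the hypothetical `absolute_threshold` shape from the Alert model section (none of this is decided):

```python
def is_threshold_met(calculated_value: float, anomaly_condition: dict) -> bool:
    """Return True if the value breaches the configured threshold.

    Sketch only: assumes the hypothetical
    {"absolute_threshold": {"lower": ..., "upper": ...}} shape used earlier
    in this RFC; the real condition schema is still undecided.
    """
    threshold = anomaly_condition.get("absolute_threshold", {})
    lower = threshold.get("lower")
    upper = threshold.get("upper")
    if lower is not None and calculated_value < lower:
        return True
    if upper is not None and calculated_value > upper:
        return True
    return False
```
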
##### Send Notification Task

The `send_notification` task handles sending notifications and updates the `AlertCheck` with the notification status.
```python
@shared_task  # Think about expiration time and timeout
def send_notification(alert_check_id):
    alert_check = AlertCheck.objects.get(id=alert_check_id)

    # Keep in mind idempotency in case of retries - e.g., check if notification was already sent

    success = True  # Assume success
    notification_status = {"status": "success"}

    ...  # Send notification logic

    alert_check.notification_sent = True
    alert_check.save()

    return success
```
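
If we want the retry behaviour discussed earlier (retry the notification a few times without recalculating the insight), Celery's built-in retry options on the task decorator should cover it. A sketch, with placeholder numbers:

```python
from celery import shared_task


@shared_task(
    bind=True,
    autoretry_for=(Exception,),  # placeholder: would be narrowed to notification/transport errors
    retry_backoff=True,          # exponential backoff between attempts
    max_retries=3,               # retries the notification only, not the insight calculation
)
def send_notification(self, alert_check_id):
    ...  # same body as above
```
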
##### Scheduling Alert Checks

The `schedule_alert_checks` task runs at regular intervals, creating groups and chains of alert check tasks to manage
concurrency and sequence.
```python
from celery import group, chain


@shared_task
def schedule_alert_checks():
    alerts = Alert.objects.all()
    alert_ids = [alert.id for alert in alerts]

    # Split the full list of alert ids into batches of 10.
    alert_id_groups = [alert_ids[i:i + 10] for i in range(0, len(alert_ids), 10)]

    # One chain per batch: the batches run in parallel (the outer group), while
    # the checks inside a batch run sequentially (a single chunk task of up to
    # 10 ids inside each chain). Note that chunks() expects an iterable of
    # argument tuples, so the ids may need wrapping, e.g. [(id,) for id in batch].
    group_of_chains = group(
        chain(check_alert.chunks(batch, 10)) for batch in alert_id_groups
    )

    group_of_chains.apply_async()
```

> **Review discussion** (on the group/chain/chunks construction):
>
> - when we do implement this it definitely needs a comment - i think i understand what it's doing but i bet i'm wrong 🙈
> - That's my concern as well, it looks a bit difficult, and that's why I suggest `for` + `limit_concurrency`
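
To run the scheduler "at regular intervals", a Celery beat entry along these lines would work; the cadence and task path below are placeholders, not decisions:

```python
from celery.schedules import crontab

# Hypothetical beat entry, to be merged into the existing Celery app's beat_schedule.
# The module path and the 15-minute cadence are placeholders.
beat_schedule_entry = {
    "schedule-alert-checks": {
        "task": "posthog.tasks.alerts.schedule_alert_checks",
        "schedule": crontab(minute="*/15"),
    }
}
```
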

---

**PR conversation:**

> we already have a set of scheduled tasks which try and keep the cache warm, and we're not monitoring them well (IMHO as the person who left them not well monitored last time they were changed).
> this is effectively the same mechanism for a different reason.
> we could overlap these mechanisms so that caching overall gets better too.
> but we should definitely learn from the past and make sure that we know how often these tasks run, when they fail, and have appropriate alerting in place.

> Worth noting that it might not be a good idea to use the cache for alerts; users probably want the freshest value to be tested, at most 5-10 minutes of staleness. I should look into it actually, because alerts that use cached stale values are not very good alerts.

> mostly I don't think we should have a cache task processing X per minute and an alert task processing X per minute, since we're going to overload processing infrastructure.
> or at least we shouldn't ignore that we have both.
> the more i think about it, i think this might need limits - free customers can have 2 alerts, paid can have 10 (fill in numbers as you see fit).
> we can always allow more, but can't restrict them (AWS has soft and hard limits to protect service, but you can ask them to increase soft limits).
> (similarly we should consider if this is going to be in the open source or EE licensed portion of the code)

> Yeah, I agree with that, Mixpanel allows 50 alerts per client, probably there's a reason for that.
> 2 and 10 alerts seems a bit small though :) maybe anything above some number should be paid?
> I'd prefer keeping it in the open source, at least while I'm participating - I might require an extra approval from my employer to contribute to EE directories.

> We overhauled the query runners cache recently, so to get fresh data (and fill the cache) you can just use the execution mode `CALCULATE_BLOCKING_ALWAYS` in the alert task.