Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add alerting models, task architecture RFC #216

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
160 changes: 160 additions & 0 deletions requests-for-comments/2024-07-01-alerting.md
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have a set of scheduled tasks which try and keep the cache warm and we're not monitoring them well (IMHO as the person who left them not well monitored last time they were changed)

this is effectively the same mechanism for a different reason

we could overlap these mechanisms so that caching overall gets better too
but we should definitely learn from the past and make sure that we know how often these tasks run, when they fail, and have appropriate alerting in place

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that it might be not a good idea to use cache for alerts, users probably want the freshest value to be tested, at most 5-10 minutes of staleness.

I should look into it actually, because alerts that use cached stale values are not very good alerts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly I don't think we should have a cache task processing X per minute and an alert task processing X for per minute, since we're going to overload processing infrastructure

or at least we shouldn't ignore that we have both

the more i think about it i think this might need limits - free customers can have 2 alerts, paid can have 10, (fill in numbers as you see fit)

we can always allow more, but can't restrict them (AWS has soft and hard limits to protect service, but you can ask them to increase soft limits)

(similarly we should consider if this is going to be in the open source or EE licensed portion of the code)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the more i think about it i think this might need limits - free customers can have 2 alerts, paid can have 10, (fill in numbers as you see fit)

Yeah, I agree with that, Mixpanel allows 50 alerts per client, probably there's a reason for that.

2 and 10 alerts seems a bit small though:) maybe anything above some number should be paid?

(similarly we should consider if this is going to be in the open source or EE licensed portion of the code)

I'd prefer keeping it in the open source, at least while I'm participating - I might require an extra approval from my employer to contribute to EE directories .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that it might be not a good idea to use cache for alerts, users probably want the freshest value to be tested, at most 5-10 minutes of staleness.

We overhauled the query runners cache recently, so to get fresh data (and fill the cache) you can just use the execution mode CALCULATE_BLOCKING_ALWAYS in the alert task.

Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
### Request for comments (RFC): Threshold-based alerting

#### Introduction

This RFC proposes an architecture for an alerting feature in PostHog, which allows users to define conditions that
trigger alerts based on their data. The system will evaluate these conditions at regular intervals and notify users
when conditions are met. The design includes detailed logging for debugging and auditing, and efficient handling of
task execution using Celery.

#### Goals

1. **Condition Evaluation**: Define and evaluate alert conditions at specified intervals.
2. **Notification System**: Notify users when alert conditions are met.
3. **Logging and Auditing**: Store detailed logs of each alert check.
4. **Efficiency and Scalability**: Handle tasks efficiently to reduce load on the system.

#### Architecture

1. **Models**:
- **Alert**: Represents the alert configuration.
- **AlertCheck**: Logs each evaluation of an alert, including results and notification statuses.

2. **Task Execution**:
- Use Celery to handle periodic checks and notifications.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just checking whether it should be temporal rather than celery (i don't even know how we'd choose 🙈)

- Implement concurrency limits and task chaining for efficient processing.

#### Models

##### Alert Model

```python
class Alert(models.Model):
team = models.ForeignKey("Team", on_delete=models.CASCADE)
insight = models.ForeignKey("posthog.Insight", on_delete=models.CASCADE)
name = models.CharField(max_length=100)
target_value = models.TextField()
anomaly_condition = models.JSONField(default=dict)
```

Credits: @nikitaevg in https://github.com/PostHog/posthog/pull/22554

This main model is responsible for storing alert configurations. We may consider adding fields for alert frequency,
notification preferences, and other settings based on user requirements. Future integration with CDP is also tbd.

##### AlertCheck Model

The `AlertCheck` model logs each evaluation of an alert:
- `alert`: Foreign key to the Alert model.
- `created`: Time of the check.
- `calculated_value`: Result of the check.
- `anomaly_condition`: Copy of the condition for the alert in case original got updated.
- `threshold_met`: Whether the condition was met.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe is_anomaly? My first impression was that threshold_met=true means the value is below the threshold.

Also, we might consider other anomaly conditions, not only threshold.

On the second thought, maybe it shouldn't be a boolean, but an enum. If there's an error in calculation, I don't know what boolean value can be assigned here. So what about check_result = ANOMALY | NORMAL_VALUE | ERROR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe trigger_condition and trigger_condition_met ??

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trigger_condition

🤷 sounds good

trigger_condition_met

I'd prefer having an enum here, IMO it will need more values than 2. You think check_result is not good?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding "anomaly": Is there any background or reasoning for this term? I think it evokes other expectations (anomaly detection) and is not quite what we have here with strict set thresholds. I would actually vote to use another term.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the second thought, maybe it shouldn't be a boolean, but an enum. If there's an error in calculation, I don't know what boolean value can be assigned here.

We need to decide whether to send out a notification (= threshold met) yes or no. That is boolean.
Errors can be stored separately (see error_message).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any background or reasoning for this term?

It just popped in my head. What term would you prefer? I would say a value that is above or below a user-defined threshold is an anomaly.

We need to decide whether to send out a notification (= threshold met) yes or no.

  1. What if there's a problem with the alert or insight configuration, for example the user misconfigured something, we should notify users about that, no? So whether we send out a notification might depend not only on meeting the threshold.
  2. If there's an error with the calculation, what value would the "threshold_met" have? I would say it's neither true nor false, and I wouldn't like to have an optional boolean as it doesn't help readability.
  3. In the future alerts might get other anomaly detection mechanisms, so it won't be only about thresholds. That's why I think it's better not mention "threshold" here.
  4. Enums are cheap, IMO provide better readability, and are extensible, why do you think it's better to use a boolean?

Sorry, missed this comment last time.

Copy link
Member

@pauldambra pauldambra Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've built a couple of similar matchers in the past and they always evolve to support && and || and i always ended up returning something like {matches: bool, explanation: string } - so familiarity with something similar is altering my opinion here (doesn't mean i'm right 😊)

i'd definitely have error as a separate case, if there's an error the condition was neither met nor not met

in other parts of posthog we ended up having error_count and last_error_date so that we could decide whether to carry on trying something


tbh there's a predicate condition and a predicate result

maybe it's better to have it that generic


this might not always be an anomaly to the user - "tell me when an order value is about $1k" is a happy thing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but i should leave you @webjunkie to it 😊

Copy link

@nikitaevg nikitaevg Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there's an error the condition was neither met nor not met

That worries me a bit, I usually find difficult to interpret optional booleans. My rule of thumb is that if something is slightly more difficult than yes/no then I default to an enum - it's more explicit and more extensible. I'm not advocating for different error types in the check_result. I'm just suggesting we use an enum to express this absence of the outcome more explicitly, have a required enum with three states.

And still "Enums are cheap, IMO provide better readability, and are extensible". It's better to have a door open to extensions, especially with databases, isn't it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thing that comes to mind, which might influence the modeling here: We currently would alert each time we check, if the threshold is continued to be breached, right?
If we find the threshold is met, we might check previous AlertChecks to see if this was already the cases there and if a notification has been sent already.

Regarding the boolean discussion: I'm happy to talk about a specific alternative if you may present one.

this might not always be an anomaly to the user - "tell me when an order value is about $1k" is a happy thing

This was also my thinking.. anomaly is a too strong word, and we can leave it more generic.. But naming is hard 🤷🏻

- `notification_sent`: Whether notification was sent.
- `error_message`: Any errors encountered.

This model provides detailed logging for debugging and auditing purposes. It gives accountability and transparency to
users by recording the results of each alert evaluation.
Users may ask why an alert was triggered and if the system is working as expected. The `AlertCheck` model helps answer
these questions by providing a detailed history of alert evaluations.

We might consider discarding old `AlertCheck` records after a certain period to manage database size.

#### Task Execution with Celery

##### Why Use Groups, Chains, and Chunks

**Groups**: Allow multiple tasks to be executed in parallel, providing the ability to run concurrent checks for
different alerts. By creating groups of tasks, we can limit the number of parallel operations to control system load
and ensure efficient resource utilization.
Comment on lines +67 to +69

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder maybe limit_concurrency (that you introduced) would work here? Basically that's all we need - limit the number of tasks in flight, no?

If we add limit_concurrency and simply process all alerts in a for, IMO it would have the same effect as if we used chains + groups + chunks, but it would be simpler, or do I miss something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the trick here (however we do it) is to limit the number of running tasks such that if the number of celery pods scale the max number of tasks does not.

e.g. 10 per worker is fine with 4 pods, then under load we scale to 100 pods and we're running 1000 concurrent not 40

(maybe teaching everyone to suck eggs 🙈)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the trick here (however we do it) is to limit the number of running tasks such that if the number of celery pods scale the max number of tasks does not.

IIUC the limit_concurrency implementation, it looks at the number of tasks in flight and drops the new one if it's above the limit, so I'd say it will keep the limit if we increase the number of pods, right?

maybe teaching everyone to suck eggs 🙈

I haven't worked with Celery before, so not at all! What an odd expression though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What an odd expression though.

🤣

i have a bad habit of speaking in idioms - sorry!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit concurrency implementation I did recently acts more as a safeguard. The reason is that if you put 100 tasks in the queue and they hit the limit, they will start to retry and retry but only do so a couple of times and then fail. So this system isn't good for "slam stuff in the queue and wait for it to be done".

Setting up groups and chains allows us to control that there are 10 groups launched (= 10 running at the same time) and they sequentially check the alerts. That's a simple way I would say to control how much is done at a time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right this would also break the chain 🤔

I also thought the same if we can limit worker concurrency... the problem is, this concurrency setting is for one worker. We might spin up several pods though. And they also might autoscale to even more. Unfortunately we currently cannot control the worker amount and autoscaling granular per worker type.

I can think about it more, but maybe the case that alerts fail from time to time is not that severe for a first version?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there's a way to create a dedicated pod with exact number of CPUs for this queue. But yeah, I see it's not that simple. Let's try the chains and see how it goes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed possible, but someone would need to dig into the configs and templates in the other repo 😓

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that our infra team pushed an unrelated adjustment to auto scaling, which now allows for exact worker/pod configuration of maximum instances. I'll test this with some other tasks and we might end up just using the same mechanism here then after all 🎉

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, cool!

What do you say we keep the mechanism with chains for now (to unblock development of other features and to merge that old PR) and if the configuration works out I'll adjust it?


**Chains**: Enable sequential execution of tasks, ensuring that each alert check is followed by its corresponding
notification task. This sequential execution is crucial for maintaining the logical flow of operations, ensuring that
notifications are only sent after checks are completed.

**Chunks**: Divide large lists of tasks into smaller, more manageable chunks. This method is particularly useful for
handling bulk operations without overwhelming the system. Chunks ensure that large sets of alert checks are processed
in smaller batches, which are then sequentially handled by chains within groups. Celery will make sure to not launch a
full task for each item in the list, but rather a task for each chunk.

Combining these three constructs (groups, chains, and chunks) provides a robust framework for managing the complexity
of alert checks and notifications. This approach ensures scalability, efficiency, and maintainability, making it
suitable for handling potentially large volumes of alerts and checks in PostHog.

##### Check Alert Task

The `check_alert` task evaluates an alert condition and logs the result. If the condition is met, it triggers a
notification task.

Rough code outline:

```python
from celery import shared_task, chain
from .models import Alert, AlertCheck
from django.utils import timezone

@shared_task # Think about expiration time and timeout
def check_alert(alert_id):
alert = Alert.objects.get(id=alert_id)

# Keep in mind idempotency in case of retries - e.g., check if check is already in the database for this interval

calculated_value = 42 # Example calculated value - hand off calculation to existing insight code
threshold_met = True # Example threshold check

alert_check = AlertCheck.objects.create(
alert=alert,
calculated_value=calculated_value,
threshold_met=threshold_met,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we're adding thresholds... i'd love to be able to add thresholds in the UI that don't alert just so I can get some nice visuals on insights without needing alerts

that has the benefit we can ship it as soon as its made and start benefitting and getting user feedback on the UX

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, that's a nice idea... So basically a threshold in the UI could be an alert that doesn't notify anyone (yet) and doesn't even need to be checked/calculated. But you could easily turn on notifications for that threshold.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, i want to see the threshold on the graph because that's useful but don't need an alert for every threshold

e.g. measuring web vitals, i want to colour the graphs because its hard to remember the 4 different thresholds (well, 8 because they have good, medium, poor) but I don't need an alert when they change cos they don't always indicate something i need to immediately react to

...
)

if threshold_met:
send_notification.s(alert_check.id).delay() # Launch notification task based on check object

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder maybe we don't extract it into a separate task? The only profit I see is that if a notification fails and we want to retry it, we won't calculate the metric value again. I'd say profit is negligible assuming that notifications fail rarely 🤷

I'm not saying notifications should be in the same function with alerts though, I might as well extract it for readability

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only profit I see is that if a notification fails and we want to retry it, we won't calculate the metric value again

importantly

we won't calculate the metric value again.

that's a really valuable thing
some queries are inherently very slow, if we can avoid running them too often we should

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we absolutely should be able to retry a notification N times but only calculate the value once (or at least <N times since we don't want to retry for 6 hours without checking if the alert has recovered 🙈)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other reason I put this is that we can easily launch other things based on the trigger, e.g. hand it off to CDP, launch a Slack notification task, and so on. In this way it's prepared to be pluggable.


return alert.id
```

##### Send Notification Task

The `send_notification` task handles sending notifications and updates the `AlertCheck` with the notification status.

```python
@shared_task # Think about expiration time and timeout
def send_notification(alert_check_id):
alert_check = AlertCheck.objects.get(id=alert_check_id)

# Keep in mind idempotency in case of retries - e.g., check if notification was already sent

success = True # Assume success
notification_status = {"status": "success"}

... # Send notification logic

alert_check.notification_sent = True
alert_check.save()

return success
```

##### Scheduling Alert Checks

The `schedule_alert_checks` task runs at regular intervals, creating groups and chains of alert check tasks to manage
concurrency and sequence.

```python
from celery import group, chain

@shared_task
def schedule_alert_checks():
alerts = Alert.objects.all()
alert_ids = [alert.id for alert in alerts]

alert_id_groups = [alert_ids[i:i+10] for i in range(0, len(alert_ids), 10)]

group_of_chains = group(
chain(check_alert.chunks(group, 10)) for group in alert_id_groups
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we do implement this it definitely needs a comment - i think i understand what it's doing but i bet i'm wrong 🙈

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my concern as well, it looks a bit difficult, and that's why I suggest for + limit_concurrency

)

group_of_chains.apply_async()
```