feat: Add person distinct ID overrides squash job (as dagster job) #27710
Conversation
```python
def wait(self, client: Client) -> None:
    while not self.is_done(client):
        time.sleep(15.0)
```
Arbitrarily chosen number, could be more, could be less.
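For context, a poll-and-sleep wait like the one above can be exercised against a stub client. This is a minimal, hypothetical sketch (the `MutationWaiter` and `FakeClient` names are not from the PR; in practice `is_done` would query ClickHouse's `system.mutations`):

```python
import time


class MutationWaiter:
    """Polls until a mutation finishes (hypothetical sketch, not the PR's class)."""

    def __init__(self, poll_interval: float = 15.0) -> None:
        self.poll_interval = poll_interval

    def is_done(self, client) -> bool:
        # A real implementation would check system.mutations for is_done = 1;
        # here the client is any object that reports completion directly.
        return client.mutation_is_done()

    def wait(self, client) -> None:
        while not self.is_done(client):
            time.sleep(self.poll_interval)


class FakeClient:
    """Reports the mutation as done after a fixed number of polls."""

    def __init__(self, polls_until_done: int) -> None:
        self.remaining = polls_until_done

    def mutation_is_done(self) -> bool:
        self.remaining -= 1
        return self.remaining <= 0


waiter = MutationWaiter(poll_interval=0.0)  # no sleeping needed for the demo
client = FakeClient(polls_until_done=3)
waiter.wait(client)
```

A fixed interval keeps the loop simple; backoff would only matter if polling load on the cluster became a concern.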
```python
def get_cluster() -> ClickhouseCluster:
    extra_hosts = []
    for host_config in map(copy, CLICKHOUSE_PER_TEAM_SETTINGS.values()):
        extra_hosts.append(ConnectionInfo(host_config.pop("host")))
        assert len(host_config) == 0, f"unexpected values: {host_config!r}"
    return ClickhouseCluster(default_client(), extra_hosts=extra_hosts)
```
Moved this to the non-ee package.
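The pop-and-assert pattern in that snippet validates that each per-team host config contains only recognized keys, failing loudly on anything unexpected. A standalone sketch with hypothetical settings values:

```python
from copy import copy

# Hypothetical per-team settings mapping, mirroring the shape the snippet iterates over.
CLICKHOUSE_PER_TEAM_SETTINGS = {
    "team_a": {"host": "ch-a.internal"},
    "team_b": {"host": "ch-b.internal"},
}


def collect_extra_hosts(settings: dict) -> list[str]:
    hosts = []
    # Copy each config so popping keys doesn't mutate the original settings.
    for host_config in map(copy, settings.values()):
        hosts.append(host_config.pop("host"))
        # Any keys left after popping the ones we understand are a
        # configuration error, so fail loudly instead of ignoring them.
        assert len(host_config) == 0, f"unexpected values: {host_config!r}"
    return hosts


print(collect_extra_hosts(CLICKHOUSE_PER_TEAM_SETTINGS))  # ['ch-a.internal', 'ch-b.internal']
```

The `copy` matters: popping from the original dicts would silently strip keys from the settings on first use.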
```diff
@@ -156,8 +157,7 @@ def default_client(host=settings.CLICKHOUSE_HOST):
     )


 @cache
-def make_ch_pool(**overrides) -> ChPool:
+def _make_ch_pool(*, client_settings: Mapping[str, str] | None = None, **overrides) -> ChPool:
```
This rename was necessary to allow bypassing the pool cache: the client settings map isn't hashable and therefore not usable in a cache key.
```diff
@@ -7,7 +7,7 @@
 from posthog.test.base import PostHogTestCase, run_clickhouse_statement_in_parallel


 def create_clickhouse_tables(num_tables: int):
```
This change (and the other test changes to/around `create_clickhouse_tables`) wasn't actually necessary: I thought I was going to need to do more invasive fixture surgery than just the `dag_tests/conftest.py` edit, and found the coupling to `django_db_setup` confusing, so I simplified the logic here before realizing I wouldn't need to call this function directly.
Looking good!
I tried to follow along with the current manual squash process. Love your extremely clean code.
Left some questions to understand the process better, but nothing blocking!
* master: (103 commits)
  - feat(postgres-estimated-rows): pg Estimated Rows on Data Warehouse Sync (#27634)
  - fix: revert darkmode class toggle, updated content on fills (#27783)
  - chore: upgrade posthog-js (#27790)
  - chore(editor-3001): add back join actions (#27740)
  - feat: Add person distinct ID overrides squash job (as dagster job) (#27710)
  - fix(created-by-sources): Adding `created_by` to sources (#27751)
  - Revert "feat(data-warehouse): V2 pipeline release" (#27791)
  - fix: typo for feature flags (#27786)
  - fix(defer-unmounting): Defer unmounting of react elements (#27742)
  - feat(data-warehouse): V2 pipeline release (#27732)
  - fix(data-warehouse): Ensure dates are actual datetime formats (#27777)
  - fix: enable hot reload for the products dir (#27746)
  - fix: assignee selector when null (#27737)
  - chore: clarify rrweb imports (#27776)
  - chore(deps): Update posthog-js to 1.207.3 (#27779)
  - feat(retention): filters on start/return event (#27770)
  - fix(experiments): only show supported math functions (#27589)
  - feat(web-analytics): Set unique conversions graph when adding conversions goal (#27774)
  - chore: color design system part 1: banner and accents (#27756)
  - chore(experiments): Add tests for funnel attribution options (#27752)
  - ...
Problem

Whenever a person merge occurs, we write the latest `person_id` that should be associated with the affected `distinct_id`(s) to the `person_distinct_id_overrides` table, and use these records to "override" the existing value in the `person_id` column of the `events` (or `sharded_events`) table. Another helpful way to think of this overrides table is as a queue of pending `person_id` updates that haven't yet been applied to the `events` table.

The number of overrides in this table can accumulate over time to the point where read performance is negatively impacted by the size of the overrides join used to update stale values at query time. If the number of overrides for a team grows large enough, all of that team's queries that use `person_id` can run into `MemoryLimitExceeded` errors when an overrides-based PoE mode is in use.

Changes
This automates the manual squash process as a Dagster job.
Briefly, a dictionary of the current overrides is built and used to update the `person_id` column for all events with a `distinct_id` in the overrides dictionary. Afterwards, the same overrides dictionary is used to identify the rows that can be safely deleted from the main overrides table, as they are no longer needed to "override" anything (the override rows now match the values in the base table.)

A few implementation notes:
Mutations are scheduled and tracked on each host individually, rather than with `ON CLUSTER` distributed DDL. This allows partial failure modes to be cleanly handled on retries without inadvertently scheduling redundant mutations. Mutations that are killed can still be restarted without needing to restart the whole job from the beginning.

I also tried to keep a lot of the design and implementation relatively orchestrator-agnostic in case we need to replatform this again.
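The overall squash flow described above can be sketched with plain dictionaries standing in for the ClickHouse tables. This is a deliberately simplified illustration of the idea (snapshot, mutate, prune), not the PR's actual queries:

```python
# Stand-ins for the ClickHouse tables: events keyed by event id, and the
# overrides table as a distinct_id -> latest person_id mapping.
events = {
    "e1": {"distinct_id": "alice", "person_id": "p-old"},
    "e2": {"distinct_id": "alice", "person_id": "p-old"},
    "e3": {"distinct_id": "bob", "person_id": "p-bob"},
}
overrides = {"alice": "p-new"}

# Step 1: snapshot the current overrides into a dictionary. Later steps
# use this snapshot, so overrides written mid-run are left alone.
snapshot = dict(overrides)

# Step 2: "mutate" the events table, rewriting person_id wherever the
# distinct_id has an entry in the snapshot.
for event in events.values():
    if event["distinct_id"] in snapshot:
        event["person_id"] = snapshot[event["distinct_id"]]

# Step 3: delete override rows that are now fully applied -- i.e. every
# event for that distinct_id already carries the overridden person_id,
# and no newer override has superseded the snapshotted one.
for distinct_id, person_id in snapshot.items():
    applied = all(
        e["person_id"] == person_id
        for e in events.values()
        if e["distinct_id"] == distinct_id
    )
    if applied and overrides.get(distinct_id) == person_id:
        del overrides[distinct_id]

print(events["e1"]["person_id"], overrides)  # p-new {}
```

Using the same snapshot for both the update and the delete is what makes the prune safe: a row is only removed once the base table provably matches it.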
Does this work well for both Cloud and self-hosted?
Yes.
How did you test this code?
Added a test (we also ran this manually to verify the queries and the general idea work).