feat(llm-observability): query runner for LLM traces #27509
Conversation
    columns?: string[]
}

export interface TracesQuery extends DataNode<TracesQueryResponse> {
What casing convention do we use for the query schemas?
As we discussed in the sync, it's a mix… but new additions should be camelCase, basically.
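For illustration, a minimal sketch of what that looks like on the generated Python side, where field names mirror the TypeScript schema (the exact model shape here is an assumption, not the PR's generated code):

from typing import Optional
from pydantic import BaseModel

class TracesQuery(BaseModel):
    # Sketch only: new fields keep camelCase, matching the TypeScript schema.
    kind: str = "TracesQuery"
    traceId: Optional[str] = None
    columns: Optional[list[str]] = None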
Size Change: -150 B (-0.01%). Total Size: 1.13 MB.
@@ -313,6 +313,11 @@ export const QUERY_TYPES_METADATA: Record<NodeKind, InsightTypeMetadata> = {
        icon: IconHogQL,
        inMenu: false,
    },
    [NodeKind.TracesQuery]: {
        name: 'LLM Observability Traces',
        icon: IconHogQL,
Let's do IconAI, like in the navbar.
""" | ||
select | ||
properties.$ai_trace_id as trace_id, | ||
min(timestamp) as trace_timestamp, | ||
max(person.properties) as person, | ||
sum(properties.$ai_latency) as total_latency, | ||
sum(properties.$ai_input_tokens) as input_tokens, | ||
sum(properties.$ai_output_tokens) as output_tokens, | ||
sum(properties.$ai_input_cost_usd) as input_cost, | ||
sum(properties.$ai_output_cost_usd) as output_cost, | ||
sum(properties.$ai_total_cost_usd) as total_cost, | ||
arraySort(x -> x.1, groupArray(tuple(timestamp, properties))) as events | ||
from events | ||
where | ||
event = '$ai_generation' | ||
group by | ||
trace_id | ||
order by | ||
trace_timestamp desc | ||
""" |
Looks like dead code!
ast.Alias(
    expr=ast.Call(
        name="arraySort",
        args=[
            ast.Lambda(
                args=["x"],
                expr=ast.Call(name="tupleElement", args=[ast.Field(chain=["x"]), ast.Constant(value=2)]),
            ),
            ast.Call(
                name="groupArray",
                args=[
                    ast.Tuple(
                        exprs=[
                            ast.Field(chain=["uuid"]),
                            ast.Field(chain=["timestamp"]),
                            ast.Field(chain=["properties"]),
                        ]
                    )
                ],
            ),
        ],
    ),
    alias="events",
),
The rest of those columns are pretty readable in direct AST form, but in this one's case I propose parse_expr – it's a chonker.
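For illustration, a minimal sketch of what the parse_expr version could look like, assuming this module can import posthog.hogql.parser.parse_expr; x.2 corresponds to tupleElement(x, 2), i.e. the timestamp element of the tuple:

from posthog.hogql import ast
from posthog.hogql.parser import parse_expr

# Sketch only: build the same `events` column from a HogQL string instead of
# hand-assembled AST nodes.
events_column = ast.Alias(
    alias="events",
    expr=parse_expr("arraySort(x -> x.2, groupArray(tuple(uuid, timestamp, properties)))"),
)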
def _get_where_clause(self):
    event_expr = ast.CompareOperation(
        left=ast.Field(chain=["event"]),
        op=ast.CompareOperationOp.Eq,
        right=ast.Constant(value="$ai_generation"),
    )
    if self.query.traceId is not None:
        return ast.And(
            exprs=[
                event_expr,
                ast.CompareOperation(
                    left=ast.Field(chain=["id"]),
                    op=ast.CompareOperationOp.Eq,
                    right=ast.Constant(value=self.query.traceId),
                ),
            ]
        )
    return event_expr
This MUST have a time range, otherwise we're going to be scanning ClickHouse parts furiously and slowly. A tricky-ish thing here, because we also must ensure traces are selected in full, without missing any generations at the beginning or end.

With traces we can make some reasonable assumptions to make this straightforward. Let's assume a trace can never be longer than 10 minutes (won't always be true, but almost always true is enough). With this assumption in mind, we'll filter this way in the SQL: timestamp > {date_from} - INTERVAL 10 MINUTE AND timestamp < {date_to} + INTERVAL 10 MINUTE. Then in _map_results we'll cut off traces that started before date_from. This way we will be skipping traces that would be selected partially at the beginning of the range, and selecting full traces at the end of the range. The result: consistent results and pagination as simple as always.
Thanks! I missed this part, but you're totally right. One addition I've made is also filtering out traces that first appeared after the date_to, as they are outside the date range.
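For illustration, a rough sketch of the combined approach from this exchange (padded fetch window plus trimming in post-processing). The helper names, the placeholders usage, and the shape of the result rows are assumptions, not this PR's actual code:

from datetime import datetime

from posthog.hogql import ast
from posthog.hogql.parser import parse_expr

def padded_timestamp_clause(date_from: datetime, date_to: datetime) -> ast.Expr:
    # Fetch a window padded by the assumed maximum trace length (10 minutes) on
    # both sides, so traces straddling either boundary are selected in full.
    return parse_expr(
        "timestamp > {date_from} - INTERVAL 10 MINUTE AND timestamp < {date_to} + INTERVAL 10 MINUTE",
        placeholders={
            "date_from": ast.Constant(value=date_from),
            "date_to": ast.Constant(value=date_to),
        },
    )

def trim_partial_traces(traces: list[dict], date_from: datetime, date_to: datetime) -> list[dict]:
    # Drop traces whose first event falls outside the requested range:
    # started before date_from, or first appeared after date_to.
    return [t for t in traces if date_from <= t["trace_timestamp"] <= date_to]

Under this scheme the query can still order and paginate on trace_timestamp as before, since every returned trace is complete.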
Some flaky tests.
027d646 to 42ea56d
should be ok now
Looks great
Suspect Issues
This pull request was deployed and Sentry observed the following issues:
Problem
We need a query runner for LLM traces retrieval.
Changes
This PR introduces the TracesQueryRunner, which supports pagination and filtering by a specific trace ID. It groups all $ai_generation events by their $ai_trace_id property, so generations are presented as an array sorted by the event timestamp.

SQL Query
Does this work well for both Cloud and self-hosted?
Yes
How did you test this code?
Unit tests
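For orientation, a hedged sketch of how a runner like this is typically exercised in a unit test; the import path, constructor arguments, and response fields are assumptions rather than code from this PR:

from posthog.schema import TracesQuery
from posthog.hogql_queries.ai.traces_query_runner import TracesQueryRunner  # assumed module path

def test_returns_grouped_traces(team):
    # Sketch only: run the query for a team and check that results come back,
    # one row per trace, with generations grouped by $ai_trace_id.
    runner = TracesQueryRunner(query=TracesQuery(), team=team)
    response = runner.calculate()
    assert isinstance(response.results, list)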