Replies: 11 comments
-
We actually tried to use Timescale before moving to ClickHouse. The issue we ran into was querying time spent on a site, which required subqueries. Timescale didn't allow us to create a hypertable from a hypertable, which would have solved the problem. But we are not completely ruling out using TS in the future. The reason event_data is a separate table is because 1) it's unstructured data and 2) it should never be returned as part of a list query. So you would never join it. You would either directly fetch a single record or query all the data as a whole, which is what we did in the latest release v1.39. This also allows us to potentially move the data into something more performant like Elasticsearch.
-
Have you at any point considered continuous aggregates, which can be created from a hypertable? Although maybe it's been a while since you last evaluated it, because I can't think of a case where tsdb wouldn't allow me to run a subquery/CTE over a hypertable/cagg in a similar context. Also, if you're considering Elasticsearch, something like ZomboDB (zdb) or a RUM index could both be relevant.
-
And yes, if you otherwise were to revert the following:

```js
async function relationalQuery(
{ websiteId },
{ session: { id: sessionId }, eventUuid, url, eventName, eventData },
) {
const data = {
websiteId,
sessionId,
url: url?.substring(0, URL_LENGTH),
eventName: eventName?.substring(0, EVENT_NAME_LENGTH),
eventUuid,
};
if (eventData) {
data.eventData = {
create: {
eventData: eventData,
},
};
}
return prisma.client.event.create({
data,
});
}
```

I have been looking here and wondering if I could simply turn the relation into a normal column and expect the generation to proceed correctly, is that right? You mentioned that the reason you separated it in the first place was that it "should never be returned as part of a list query"; does this allude to a limitation of the ORM here? I.e., if by default it would query the data when it otherwise shouldn't, what's stopping you from simply not querying this column? What are the side effects (including performance-related) that I'm looking at here?
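For concreteness, this is the shape I have in mind at the database level. A sketch only; the backfill assumes the old event_data table keys on event_id, which may not match the actual layout:

```sql
-- a sketch of the denormalised shape: event_data becomes a nullable jsonb
-- column on the event table instead of a separate related table
ALTER TABLE event ADD COLUMN event_data jsonb;

-- hypothetical backfill from the old relation
UPDATE event e
SET event_data = d.event_data
FROM event_data d
WHERE d.event_id = e.event_id;
```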
-
The decision to separate out event_data was made for Prisma: it only needs the relation so it can do joins or dependent queries. If you merge event_data back into the event table, the relation is no longer needed. Sidenote: were you able to resolve the session time query in TS? This is where we couldn't get the performance to acceptable levels.
-
The session time query, can you link to it directly? I've been meaning to figure out what you mean by that; however, if I'm right, let me show how you would create a "hypertable of a hypertable" using something called a continuous aggregate. First, to be clear, I'm assuming you are looking for a query that looks something like this?

```sql
SELECT
min(created_at) as a,
max(created_at) as b
FROM umami.event
GROUP BY session_id;
```

This is the subquery, I would assume? You can calculate it with a continuous aggregate:

```sql
CREATE MATERIALIZED VIEW umami.session_time
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', created_at) as bucket,
session_id,
max(created_at)-min(created_at) as delta
FROM umami.event
GROUP BY bucket, session_id;
```

The continuous aggregate is the best of both worlds of normal and materialised views, insofar as it retrieves past data as materialised partials it can compress, and recent data like a normal view would; this combined approach makes the aggregate available for retrieval very efficiently. To enable this functionality, you would need to attach a policy (see the sketch at the end of this comment).

This query is not exact, insofar as it will break at the bucket boundary; increasing the bucket size makes this effect less pronounced, as the boundaries occur less often. However, I'd say that for the majority of analytical queries, where these session times will likely be aggregated further by some other quality, it doesn't matter. Continuous aggregates are hypertables, so naturally they may be compressed and retention-controlled; in typical time-series fashion, once the event, pageview, or session data is written, it doesn't really change.

```sql
SELECT
s.session_id,
sum(s0.delta) as total_time
FROM umami.session s
JOIN umami.session_time s0 on s0.session_id = s.session_id
GROUP BY s.session_id;
```

Many more can be made, and this is what makes Timescale attractive to us in this context. Continuous aggregates don't support JOIN, so all of the data must be in the parent hypertable when the aggregate is created; this is the reason why we had to bring event_data back into the event table. The extent to which you can push this is vast, so I would suggest you re-evaluate Timescale and seriously reconsider supporting it. I would be willing to contribute back some of our schema & config code, as well as help with leveraging it.
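As for the policy mentioned above, a minimal sketch; the offsets here are assumptions you would tune, not recommendations:

```sql
-- refresh the materialised portion every hour, leaving the newest hour "live"
SELECT add_continuous_aggregate_policy('umami.session_time',
  start_offset      => INTERVAL '3 days',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

-- optionally compress and expire old partials, since written data never changes
ALTER MATERIALIZED VIEW umami.session_time SET (timescaledb.compress = true);
SELECT add_compression_policy('umami.session_time', INTERVAL '30 days');
SELECT add_retention_policy('umami.session_time', INTERVAL '12 months');
```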
-
Also, would you be willing to comment re: #1470 and loss-of-focus, tab close, delayed conversion and the like? How would you recommend going about collecting these types of events?
-
This is the query we were working with:

```sql
select
time_bucket('1 day', created_at),
sum(c) as "total_views",
sum(time) as "total_time"
from (
select
pageview.session_id,
time_bucket('1 hour', created_at) as created_at,
count(*) c,
floor(extract(epoch from max(pageview.created_at) - min(pageview.created_at))) as "time"
from pageview
where pageview.website_id=2
and pageview.created_at between now() - interval '7 days' and now()
group by 1, 2
) t
group by 1
```

It's not as simple as getting min and max. If a user appears on Monday and again on Tuesday, they didn't spend 24 hours on the site. So we have to bucket sessions per hour in the subquery, which means the subquery would need to be a cagg as well as the outer query. Ideally we would need something like timescale/timescaledb#4668, but this issue is still open. Regarding #1470, we found that
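For context, what that issue asks for, a cagg built on top of another cagg, later shipped as hierarchical continuous aggregates in TimescaleDB 2.9. A sketch of how the query above could be layered, assuming that feature is available:

```sql
-- inner rollup: per-session hourly buckets (mirrors the subquery above)
CREATE MATERIALIZED VIEW hourly_session_views
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', created_at) AS created_at,
  website_id,
  session_id,
  count(*) AS c,
  floor(extract(epoch FROM max(created_at) - min(created_at))) AS "time"
FROM pageview
GROUP BY 1, 2, 3;

-- outer rollup: a cagg over the cagg (hierarchical, TimescaleDB 2.9+)
CREATE MATERIALIZED VIEW daily_views
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 day', created_at) AS bucket,
  website_id,
  sum(c) AS total_views,
  sum("time") AS total_time
FROM hourly_session_views
GROUP BY 1, 2;
```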
-
So the session id persists over time instead of uniquely identifying distinct sessions; I didn't know that to be the case. I would guess that this is the primary issue at hand (and the fact that it doesn't rely on events, which I now realise is a given if neither of the options below is adopted).

1. Whenever a page view occurs, assign a new session_id if it occurs late enough for it to be a distinct session

For this, we would have to encode a bit more data in the cache token:

```js
export async function getSession(req) {
const { payload } = getJsonBody(req);
if (!payload) {
throw new Error('Invalid request');
}
const cache = req.headers['x-umami-cache'];
if (cache) {
const result = await parseToken(cache, secret());
if (result) {
return result;
}
}
  // ...
}
```

Correct me if I'm wrong, but wouldn't this resolve the issue fair and square? As long as the session ids are never outdated, you would be able to group by them reliably, within or without a time bucket. I would take this to be the most reliable way forward, as long as the header can be produced reliably over time... granted, of course, that I understand the issue at hand.

2. Create a custom aggregate

Even though the type of subquery you were trying to write can't be done due to technical limitations of the current cagg implementation, since 2.7 ordered-set aggregates such as mode() and percentile_disc() are fully supported, meaning that you're also able to secure the ordering within any given group. Namely, if all you need is to calculate the time spent on the site while accounting for gaps, you can do away without a lag window:

```sql
-- the custom aggregate state
CREATE TYPE elapsed__t AS (acc interval, cur timestamptz);
-- state transition function
CREATE OR REPLACE FUNCTION elapsed__sf(s elapsed__t, x interval, cur timestamptz)
RETURNS elapsed__t IMMUTABLE LANGUAGE sql
AS $$ SELECT
  -- gaps longer than x contribute nothing; shorter gaps add their delta;
  -- the inner coalesce keeps a NULL delta (first row) from poisoning the sum
  coalesce(s.acc, '0'::interval)
    + coalesce(case when cur - s.cur > x then '0'::interval
               else cur - s.cur end,
               '0'::interval),
  cur $$;
-- final function
CREATE OR REPLACE FUNCTION elapsed__ff(s elapsed__t)
RETURNS interval IMMUTABLE LANGUAGE sql
AS $$ select s.acc $$;
-- elapsed('15 minutes', created_at order by created_at)
CREATE OR REPLACE AGGREGATE elapsed(interval, timestamptz) (
STYPE = elapsed__t,
SFUNC = elapsed__sf,
  FINALFUNC = elapsed__ff);
```

Now that this aggregate is in place, all you need to do is put it into a continuous aggregate in place of a sum:

```sql
CREATE MATERIALIZED VIEW umami.session_elapsed
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', created_at) bucket,
website_id,
session_id,
elapsed('15 minutes', created_at order by created_at) elapsed
FROM umami.event e
GROUP BY bucket, website_id, session_id
ORDER BY bucket DESC, website_id;
```

This is currently the solution to many issues whenever a JOIN seems inevitable.

3. For session-time keeping, rely on events instead

This sounds to me like the most appropriate solution to the problem. We're running Umami on a set of landing pages, and most of the time users won't load a single other page, yet we would like to look at the session time nonetheless. This is a bit of a pickle: as long as the session calculation is page- and not event-dependent, you're simply not going to get good enough resolution for websites with limited navigation. In such a case, we would need mouse/scroll activity events. However, let's say I have no intention of capturing these mouse/scroll events, what's left? What's left is either some sort of long-polling heartbeat setup, or a websocket connection. With a websocket connection, we can always tell that the client is gone if they don't restore a hanging connection within a minute. Other than that, the events can be written directly to the websocket, which is going to be substantially more efficient client-side, with the back-end taking a slight hit from having to juggle all the connections. I wonder if this is a good compromise. Either way, page views don't seem to work for single-page sites, so some kind of "implicit" events like load/unload would be required.

Upd: brought in a much simpler version of the aggregate
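To show where this lands, reading the cagg afterwards is a plain rollup; a sketch, assuming the view above:

```sql
-- total time per session over the last week, read straight off the cagg
SELECT website_id, session_id, sum(elapsed) AS total_time
FROM umami.session_elapsed
WHERE bucket > now() - interval '7 days'
GROUP BY website_id, session_id;
```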
-
Thanks for the informative reply. We may be able to do something with your custom aggregate example. The difficult part is finding the time to dive back into Timescale. Session ids are calculated in real time for each user, and that's how we identify unique visitors. This is then mixed with a rotating monthly salt to bucket uniqueness at the month level. So your first proposal won't work because we can't just issue a new session id. For session time keeping only, you are right that we would need to insert event records that simply track time elapsed, so they would be easy to sum up. I don't think web sockets or long-polling would be feasible because of the many different hosting setups people are using; for example, we use Vercel, which runs on serverless functions. I think client-side handling from the tracker script is the best option; the tracker can be made to send these events.

I should probably mention that we are actively working on a v2 of Umami. In the new data model there is only going to be a single events table; we are merging pageviews with events and just segmenting events by name. I think this will enable a lot of the functionality you mentioned.
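For illustration, if the v2 single events table carried such elapsed-time events, the rollup could be as simple as this sketch; the 'elapsed' event name and the 'seconds' field in the event data are assumptions, not the actual v2 schema:

```sql
-- hypothetical: tracker emits an 'elapsed' event with seconds in its data
SELECT website_id, session_id,
       sum((event_data->>'seconds')::int) AS total_time
FROM event
WHERE event_name = 'elapsed'
GROUP BY website_id, session_id;
```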
-
You did mention that you're actively working on v2 and that there are going to be significant changes to the schema, et cetera. Well, I would like to offer you my help, and the help of my people, potentially for bringing in some new features, including perhaps timescale/promscale support, which would benefit us a lot. I know you've made the switch to ClickHouse, and it's kind of okay, but the truth is it's lacking in several respects: it's not real SQL, and you don't get things like Promscale, or PostgresML for live regression/classification training and prediction. We have deployed Umami at a reasonably large scale (think 100k events a day on average) and we're really looking to grow what Umami is capable of. I was very glad to see that there are patches to add role capability, which is definitely going to help people use Umami in larger organisations such as ours.

Timescale is getting cagg joins, nested caggs, potentially support for set-returning functions such as unnest, and more, especially in the realm of Promscale; it functions like a Prometheus/Jaeger collector and is very good for lots of other things. We see Umami as bridging the gap to the front-end here, so it's in our primary interest to develop its capability and suit it to best utilise the current and upcoming features we have been exploring.

You mentioned that there's going to be a new event-only schema; I'm not sure that's the best choice, really. The concept of a session as it's currently set up is just fine, and page views are useful for all reasons of normalisation. A revised pageview (hit) hypertable is best referenced from the events table via hit_id, with its schema planned in advance to better leverage the benefits that come with caggs, namely to optimise them for joins. You're getting another grouping dimension this way, and it can be super beneficial if you wish to evaluate particular url/referer pairs based on a certain type of event data. Think A/B testing, where you're doing a cagg over hits HAVING only A/B events in them, aggregating by website_id, hit_id, version; this is the cagg that you're later going to join with the hits cagg to compare the different versions. Perhaps this is something that could land in the Umami UI, which would be a powerful feature all things considered. A sketch of the idea follows below.
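Here is that A/B sketch, with heavy assumptions: a hypothetical hit_id column on the events table and a variant name stored in the event data. Treat it as a shape, not guaranteed-valid cagg SQL:

```sql
-- per-day, per-hit variant rollup over A/B events only (hypothetical schema)
CREATE MATERIALIZED VIEW umami.ab_hits
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 day', created_at) AS bucket,
  website_id,
  hit_id,
  event_data->>'version' AS version,
  count(*) AS events
FROM umami.event
WHERE event_name = 'ab'
GROUP BY bucket, website_id, hit_id, event_data->>'version';

-- compare variants at query time; joining against a per-hit rollup cagg
-- would follow the same pattern
SELECT version, count(DISTINCT hit_id) AS hits
FROM umami.ab_hits
GROUP BY version;
```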
-
Update: in its current incarnation, will Umami utilise all the system cores available to it for concurrency? If not, what would be the steps to set it up in a way that allows it? We've enabled Redis and are now attempting to determine whether additional autoscaling is ever required.
-
Hello,
@mikecao I couldn't help but notice that you had, at some point at least, considered bringing Timescale to the table.
We've adapted Umami for Timescale, somewhat in the crudest way possible; see schema.tsdb.sql for the schema we have employed. We had to turn the migrations off, unfortunately, because they didn't work out for some wicked reason, and because Timescale doesn't really support foreign keys on hypertables, which is unfortunate, but hey, whatever needs to be done! In our case, we have partitioned the event, page view and session tables and added the respective compression & retention policies to go with them.

My biggest peeve with the schema and the query structure currently is with `event_data` as it stands: is there any particular reason you want to do that JOIN? There are a few issues with it that suggest further de-normalisation, which would lead to much better performance and cleaner retention in the absence of cascade deletes. How hard do you think it would be to de-normalise `event_data` to bring it back to the `event` table, if there's no particular reason to have it separate?

We're looking to bring Umami to a range of our sites with hundreds of thousands of users and billions of events total per annum. In our particular case, distributed hypertable sharding is definitely something we would like to explore, and I'm sure there are plenty of further use cases to massively speed up the dashboards and make it easier to extend the data. By default, Timescale only partitions by the time column (chunks), but it can also partition by space (e.g. session id), which is a much more scalable way; see the sketch at the end of this message.

Also, something that isn't clear to us at the moment: is there a way to treat multiple sites in a single session, or would we have to group by `session.hostname` to achieve that? Would you like to continue this conversation?
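For reference, the space partitioning mentioned above, as a sketch against the event table; the partition count is an assumption:

```sql
-- time partitioning plus hash ("space") partitioning on session_id
SELECT create_hypertable('event', 'created_at',
  partitioning_column => 'session_id',
  number_partitions   => 4);
```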
Best regards
Ilya