Replies: 11 comments
-
We actually tried to use Timescale before moving to ClickHouse. The issue we ran into was querying time spent on a site, which required subqueries. Timescale didn't allow us to create a hypertable from a hypertable, which would have solved the problem. But we are not completely ruling out using TS in the future. The reason event_data is a separate table is because 1) it's unstructured data and 2) it should never be returned as part of a list query. So you would never join it. You would either directly fetch a single record or query all the data as a whole, which is what we did in the latest release v1.39. This also allows us to potentially move the data into something more performant like Elasticsearch.
-
Have you at any point considered continuous aggregates, which can be created from a hypertable? Although maybe it's been a while since you last evaluated it, because I can't think of a case where tsdb wouldn't allow me to run a subquery/CTE over a hypertable/cagg in a similar context. Also, if you're considering Elasticsearch, something like ZomboDB (zdb) or a RUM index could both be relevant.
-
And yes, if you otherwise were to revert the following:

```js
async function relationalQuery(
{ websiteId },
{ session: { id: sessionId }, eventUuid, url, eventName, eventData },
) {
const data = {
websiteId,
sessionId,
url: url?.substring(0, URL_LENGTH),
eventName: eventName?.substring(0, EVENT_NAME_LENGTH),
eventUuid,
};
if (eventData) {
data.eventData = {
create: {
eventData: eventData,
},
};
}
return prisma.client.event.create({
data,
});
}
```

I have been looking here and wondering if I could simply turn the relation into a normal column and expect the generation to proceed correctly, is that right? You mentioned that the reason you separated it in the first place was that it "should never be returned as part of a list query"; does this allude to a limitation of the ORM here? I.e., if by default it would query the data when it otherwise shouldn't, what's stopping you from simply not querying this column? What are the side effects (including performance-related) that I'm looking at here?
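For concreteness, this is the shape I have in mind at the database level. A sketch only; the backfill assumes the old event_data table keys on event_id, which may not match the actual layout:

```sql
-- a sketch of the denormalised shape: event_data becomes a nullable jsonb
-- column on the event table instead of a separate related table
ALTER TABLE event ADD COLUMN event_data jsonb;

-- hypothetical backfill from the old relation
UPDATE event e
SET event_data = d.event_data
FROM event_data d
WHERE d.event_id = e.event_id;
```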
-
The decision to separate out event_data was made for Prisma: it only needs the relation so it can do joins or dependent queries. If you merge event_data back into the event table, the relation is no longer needed. Sidenote: were you able to resolve the session time query in TS? This is where we couldn't get the performance to acceptable levels.
-
The session time query, can you link to it directly? I've been meaning to figure out what you mean by that; however, if I'm right, let me show how you would create a "hypertable of a hypertable" using something called a continuous aggregate. First, to be clear, I'm assuming you are looking for a query that looks something like this?

```sql
SELECT
min(created_at) as a,
max(created_at) as b
FROM umami.event
GROUP BY session_id;
```

This is the subquery, I would assume? You can calculate it with a continuous aggregate:

```sql
CREATE MATERIALIZED VIEW umami.session_time
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', created_at) as bucket,
session_id,
max(created_at)-min(created_at) as delta
FROM umami.event
GROUP BY bucket, session_id;
```

The continuous aggregate is the best of both worlds of normal and materialised views, insofar as it retrieves past data as materialised partials it can compress, and recent data like a normal view would; this combined approach makes the aggregate available for retrieval very efficiently. To enable this functionality, you would need to attach a policy (see the sketch at the end of this comment).

This query is not exact, insofar as it will break at the bucket boundary; increasing the bucket size makes this effect less pronounced, as the boundaries occur less often. However, I'd say that for the majority of analytical queries, where these session times will likely be aggregated further by some other quality, it doesn't matter. Continuous aggregates are hypertables, so naturally they may be compressed and retention-controlled; in typical time-series fashion, once the event, pageview, or session data is written, it doesn't really change.

```sql
SELECT
s.session_id,
sum(s0.delta) as total_time
FROM umami.session s
JOIN umami.session_time s0 on s0.session_id = s.session_id
GROUP BY s.session_id;
```

Many more can be made, and this is what makes Timescale attractive to us in this context. Continuous aggregates don't support JOIN, so all of the data must be in the parent hypertable when the aggregate is created; this is the reason why we had to bring event_data back into the event table. The extent to which you can push this is vast, so I would suggest you re-evaluate Timescale and seriously reconsider supporting it. I would be willing to contribute back some of our schema & config code, as well as help with leveraging it.
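As for the policy mentioned above, a minimal sketch; the offsets here are assumptions you would tune, not recommendations:

```sql
-- refresh the materialised portion every hour, leaving the newest hour "live"
SELECT add_continuous_aggregate_policy('umami.session_time',
  start_offset      => INTERVAL '3 days',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

-- optionally compress and expire old partials, since written data never changes
ALTER MATERIALIZED VIEW umami.session_time SET (timescaledb.compress = true);
SELECT add_compression_policy('umami.session_time', INTERVAL '30 days');
SELECT add_retention_policy('umami.session_time', INTERVAL '12 months');
```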
-
Also, would you be willing to comment re: #1470 and loss-of-focus, tab close, delayed conversion and the like? How would you recommend going about collecting these types of events?
-
This is the query we were working with:

```sql
select
time_bucket('1 day', created_at),
sum(c) as "total_views",
sum(time) as "total_time"
from (
select
pageview.session_id,
time_bucket('1 hour', created_at) as created_at,
count(*) c,
floor(extract(epoch from max(pageview.created_at) - min(pageview.created_at))) as "time"
from pageview
where pageview.website_id=2
and pageview.created_at between now() - interval '7 days' and now()
group by 1, 2
) t
group by 1
```

It's not as simple as getting min and max. If a user appears on Monday and again on Tuesday, they didn't spend 24 hours on the site. So we have to bucket sessions per hour in the subquery, which means the subquery would need to be a cagg as well as the outer query. Ideally we would need something like timescale/timescaledb#4668, but this issue is still open. Regarding #1470, we found that
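For context, what that issue asks for, a cagg built on top of another cagg, later shipped as hierarchical continuous aggregates in TimescaleDB 2.9. A sketch of how the query above could be layered, assuming that feature is available:

```sql
-- inner rollup: per-session hourly buckets (mirrors the subquery above)
CREATE MATERIALIZED VIEW hourly_session_views
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', created_at) AS created_at,
  website_id,
  session_id,
  count(*) AS c,
  floor(extract(epoch FROM max(created_at) - min(created_at))) AS "time"
FROM pageview
GROUP BY 1, 2, 3;

-- outer rollup: a cagg over the cagg (hierarchical, TimescaleDB 2.9+)
CREATE MATERIALIZED VIEW daily_views
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 day', created_at) AS bucket,
  website_id,
  sum(c) AS total_views,
  sum("time") AS total_time
FROM hourly_session_views
GROUP BY 1, 2;
```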
-
So the session id persists over time instead of uniquely identifying distinct sessions; I didn't know that to be the case. I would guess that this is the primary issue at hand (and the fact that it doesn't rely on events, which I now realise is a given if neither of the options below is adopted).

1. Whenever a page view occurs, assign a new session_id if it occurs late enough for it to be a distinct session

For this, we would have to encode a bit more data in the cache token:

```js
export async function getSession(req) {
const { payload } = getJsonBody(req);
if (!payload) {
throw new Error('Invalid request');
}
const cache = req.headers['x-umami-cache'];
if (cache) {
const result = await parseToken(cache, secret());
if (result) {
return result;
}
}
  // ...
}
```

Correct me if I'm wrong, but wouldn't this resolve the issue fair and square? As long as the session ids are never outdated, you would be able to group by them reliably, within or without a time bucket. I would take this to be the most reliable way forward, as long as the header can be produced reliably over time... granted, of course, that I understand the issue at hand.

2. Create a custom aggregate

Even though the type of subquery you were trying to write can't be done due to technical limitations of the current cagg implementation, since 2.7 ordered-set aggregates such as mode() and percentile_disc() are fully supported, meaning that you're also able to secure the ordering within any given group. Namely, if all you need is to calculate the time spent on the site while accounting for gaps, you can do away without a lag window:

```sql
-- the custom aggregate state
CREATE TYPE elapsed__t AS (acc interval, cur timestamptz);
-- state transition function
CREATE OR REPLACE FUNCTION elapsed__sf(s elapsed__t, x interval, cur timestamptz)
RETURNS elapsed__t IMMUTABLE LANGUAGE sql
AS $$ SELECT
  -- gaps longer than x contribute nothing; shorter gaps add their delta;
  -- the inner coalesce keeps a NULL delta (first row) from poisoning the sum
  coalesce(s.acc, '0'::interval)
    + coalesce(case when cur - s.cur > x then '0'::interval
               else cur - s.cur end,
               '0'::interval),
  cur $$;
-- final function
CREATE OR REPLACE FUNCTION elapsed__ff(s elapsed__t)
RETURNS interval IMMUTABLE LANGUAGE sql
AS $$ select s.acc $$;
-- elapsed('15 minutes', created_at order by created_at)
CREATE OR REPLACE AGGREGATE elapsed(interval, timestamptz) (
STYPE = elapsed__t,
SFUNC = elapsed__sf,
  FINALFUNC = elapsed__ff);
```

Now that this aggregate is in place, all you need to do is put it into a continuous aggregate in place of a sum:

```sql
CREATE MATERIALIZED VIEW umami.session_elapsed
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', created_at) bucket,
website_id,
session_id,
elapsed('15 minutes', created_at order by created_at) elapsed
FROM umami.event e
GROUP BY bucket, website_id, session_id
ORDER BY bucket DESC, website_id;
```

This is currently the solution to many issues whenever a JOIN seems inevitable.

3. For session-time keeping, rely on events instead

This sounds to me like the most appropriate solution to the problem. We're running Umami on a set of landing pages, and most of the time users won't load a single other page, yet we would like to look at the session time nonetheless. This is a bit of a pickle: as long as the session calculation is page- and not event-dependent, you're simply not going to get good enough resolution for websites with limited navigation. In such a case, we would need mouse/scroll activity events. However, let's say I have no intention of capturing these mouse/scroll events, what's left? What's left is either some sort of long-polling heartbeat setup, or a websocket connection. With a websocket connection, we can always tell that the client is gone if they don't restore a hanging connection within a minute. Other than that, the events can be written directly to the websocket, which is going to be substantially more efficient client-side, with the back-end taking a slight hit from having to juggle all the connections. I wonder if this is a good compromise. Either way, page views don't seem to work for single-page sites, so some kind of "implicit" events like load/unload would be required.

Upd: brought in a much simpler version of the aggregate
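To show where this lands, reading the cagg afterwards is a plain rollup; a sketch, assuming the view above:

```sql
-- total time per session over the last week, read straight off the cagg
SELECT website_id, session_id, sum(elapsed) AS total_time
FROM umami.session_elapsed
WHERE bucket > now() - interval '7 days'
GROUP BY website_id, session_id;
```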
-
Thanks for the informative reply. We may be able to do something with your custom aggregate example. The difficult part is finding the time to dive back into Timescale. Session ids are calculated in real time for each user, and that's how we identify unique visitors. This is then mixed with a rotating monthly salt to bucket uniqueness at the month level. So your first proposal won't work because we can't just issue a new session id. For session time keeping only, you are right that we would need to insert event records that simply track time elapsed, so they would be easy to sum up. I don't think web sockets or long-polling would be feasible because of the many different hosting setups people are using; for example, we use Vercel, which runs on serverless functions. I think client-side handling from the tracker script is the best option; the tracker can be made to send these events.

I should probably mention that we are actively working on a v2 of Umami. In the new data model there is only going to be a single events table; we are merging pageviews with events and just segmenting events by name. I think this will enable a lot of the functionality you mentioned.
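For illustration, if the v2 single events table carried such elapsed-time events, the rollup could be as simple as this sketch; the 'elapsed' event name and the 'seconds' field in the event data are assumptions, not the actual v2 schema:

```sql
-- hypothetical: tracker emits an 'elapsed' event with seconds in its data
SELECT website_id, session_id,
       sum((event_data->>'seconds')::int) AS total_time
FROM event
WHERE event_name = 'elapsed'
GROUP BY website_id, session_id;
```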
-
You did mention that you're actively working on v2 and that there are going to be significant changes to the schema, et cetera. Well, I would like to offer you my help, and the help of my people, potentially for bringing in some new features, including perhaps timescale/promscale support, which would benefit us a lot. I know you've made the switch to ClickHouse, and it's kind of okay, but the truth is it's lacking in several respects: it's not real SQL, and you don't get things like Promscale, or PostgresML for live regression/classification training and prediction. We have deployed Umami at a reasonably large scale (think 100k events a day on average) and we're really looking to grow what Umami is capable of. I was very glad to see that there are patches to add role capability, which is definitely going to help people use Umami in larger organisations such as ours.

Timescale is getting cagg joins, nested caggs, potentially support for set-returning functions such as unnest, and more, especially in the realm of Promscale; it functions like a Prometheus/Jaeger collector and is very good for lots of other things. We see Umami as bridging the gap to the front-end here, so it's in our primary interest to develop its capability and suit it to best utilise the current and upcoming features we have been exploring.

You mentioned that there's going to be a new event-only schema; I'm not sure that's the best choice, really. The concept of a session as it's currently set up is just fine, and page views are useful for all reasons of normalisation. A revised pageview (hit) hypertable is best referenced from the events table via hit_id, with its schema planned in advance to better leverage the benefits that come with caggs, namely to optimise them for joins. You're getting another grouping dimension this way, and it can be super beneficial if you wish to evaluate particular url/referer pairs based on a certain type of event data. Think A/B testing, where you're doing a cagg over hits HAVING only A/B events in them, aggregating by website_id, hit_id, version; this is the cagg that you're later going to join with the hits cagg to compare the different versions. Perhaps this is something that could land in the Umami UI, which would be a powerful feature all things considered. A sketch of the idea follows below.
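Here is that A/B sketch, with heavy assumptions: a hypothetical hit_id column on the events table and a variant name stored in the event data. Treat it as a shape, not guaranteed-valid cagg SQL:

```sql
-- per-day, per-hit variant rollup over A/B events only (hypothetical schema)
CREATE MATERIALIZED VIEW umami.ab_hits
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 day', created_at) AS bucket,
  website_id,
  hit_id,
  event_data->>'version' AS version,
  count(*) AS events
FROM umami.event
WHERE event_name = 'ab'
GROUP BY bucket, website_id, hit_id, event_data->>'version';

-- compare variants at query time; joining against a per-hit rollup cagg
-- would follow the same pattern
SELECT version, count(DISTINCT hit_id) AS hits
FROM umami.ab_hits
GROUP BY version;
```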
-
Update: in its current incarnation, will Umami utilise all the system cores available to it for concurrency? If not, what would be the steps to set it up in a way that allows it? We've enabled Redis and are now attempting to determine whether additional autoscaling is ever required.
-
Hello,
@mikecao I couldn't help but notice that you had, at some point at least, considered bringing Timescale to the table.
We've adapted Umami for Timescale, somewhat in the crudest way possible; see schema.tsdb.sql for the schema we have employed. We had to turn the migrations off, unfortunately, because they didn't work out for some wicked reason, and because Timescale doesn't really support foreign keys on hypertables, which is unfortunate, but hey, whatever needs to be done! In our case, we have partitioned the event, page view and session tables and added the respective compression & retention policies to go with them.

My biggest peeve with the schema and the query structure currently is with `event_data` as it stands: is there any particular reason you want to do that JOIN? There are a few issues with it that suggest further de-normalisation, which would lead to much better performance and cleaner retention in the absence of cascade deletes. How hard do you think it would be to de-normalise `event_data` to bring it back to the `event` table, if there's no particular reason to have it separate?

We're looking to bring Umami to a range of our sites with hundreds of thousands of users and billions of events total per annum. In our particular case, distributed hypertable sharding is definitely something we would like to explore, and I'm sure there are plenty of further use cases to massively speed up the dashboards and make it easier to extend the data. By default, Timescale only partitions by the time column (chunks), but it can also partition by space (e.g. session id), which is a much more scalable way; see the sketch at the end of this message.

Also, something that isn't clear to us at the moment: is there a way to treat multiple sites in a single session, or would we have to group by `session.hostname` to achieve that? Would you like to continue this conversation?
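For reference, the space partitioning mentioned above, as a sketch against the event table; the partition count is an assumption:

```sql
-- time partitioning plus hash ("space") partitioning on session_id
SELECT create_hypertable('event', 'created_at',
  partitioning_column => 'session_id',
  number_partitions   => 4);
```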
Best regards
Ilya