Avoiding fat partitions #162
-
@smiklosovic For instance, in the case when bucketing is done by date, the query would look like this:
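Something like the following, a sketch assuming a day-bucketed table along the lines of the `metrics_by_day` example later in this thread (table and column names are illustrative):

```cql
SELECT registered_at, temperature
FROM metrics_by_day
WHERE sensor_id = 123
  AND day = '2024-01-15'
  AND registered_at >= '2024-01-15 00:00:00'
  AND registered_at <  '2024-01-16 00:00:00';
```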
Thank you for pointing out that particular case - it is quite important and has to be added to the datasource documentation.
-
@smiklosovic
However, as you mentioned, it will work for only one day. If a time range includes more than one day/month/etc., then we have to add all of these days/months/etc. to the query predicate.
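For example, a range spanning several days could list every day bucket in an `IN` clause (again using the illustrative `metrics_by_day` table):

```cql
SELECT registered_at, temperature
FROM metrics_by_day
WHERE sensor_id = 123
  AND day IN ('2024-01-15', '2024-01-16', '2024-01-17')
  AND registered_at >= '2024-01-15 00:00:00'
  AND registered_at <  '2024-01-18 00:00:00';
```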
-
Yes. Consider this table:
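A sketch along the lines of the day-bucketed design described later in this thread (names are illustrative):

```cql
CREATE TABLE metrics_by_day (
    sensor_id int,
    day date,
    registered_at timestamp,
    temperature double,
    PRIMARY KEY ((sensor_id, day), registered_at)
);
```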
The advantage of this is that each partition holds at most one day of data, so partitions stay bounded in size and are spread evenly across the cluster.
-
Dear participants, despite @smiklosovic being technically absolutely right, the purpose of this plugin is not to teach Cassandra users how to create data models the right way. The purpose of this plugin is to visualize data. @unflag If the developers don't know how to design a data model, we can only wish them luck. Overcomplicating the sample code is the wrong way to educate them. @smiklosovic If you would like to improve the documentation, we will gladly review a pull request.
-
Well, technically this plugin does not support the propagation of date ranges into the query.
-
@HadesArchitect
If I have a table like this:
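A minimal sketch of such a table, where `sensor_id` alone is the partition key and `registered_at` is the clustering column (column types are assumptions):

```cql
CREATE TABLE metrics (
    sensor_id int,
    registered_at timestamp,
    temperature double,
    PRIMARY KEY (sensor_id, registered_at)
);
```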
The problem with that is that sensor_id is the partition key and registered_at is the clustering column. So, if I have a sensor at home which takes a temperature reading each minute and I run it for 1 year, then after 1 year that partition would have 60x24x365 = 525,600 records. If each record is 1 KB, every query has to deal with a partition of roughly 513 MB. This can easily time out your query.
If you have a 3-node cluster you want to save your metrics to and the replication factor is 1, then all these metrics would end up stored on one node only. You would have two nodes basically empty while the third node holds all the data, so the data would be spread very unevenly. If that node goes down, you have no metrics at all.
This example is not performance friendly at all. It would be more appropriate to apply some kind of bucketing, which is a common technique for Cassandra; for this example, like this:
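A sketch of what that bucketed table could look like, with a `day` column added to the partition key (the exact columns are an assumption):

```cql
CREATE TABLE metrics_by_day (
    sensor_id int,
    day date,
    registered_at timestamp,
    temperature double,
    PRIMARY KEY ((sensor_id, day), registered_at)
);
```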
That means there would be at most 60x24 = 1440 records per day, and if each record is 1 KB, a query would read at most about 1.4 MB for each day.
However, there would need to be some support for this in the plugin as well. Otherwise, a user is forced to construct his queries with a static day value, which basically renders his dashboard valid for 1 day only; after that, he needs to reconfigure it.
A dashboard valid for 1 month would use a table like this:
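The same idea bucketed by month instead (again, illustrative names):

```cql
CREATE TABLE metrics_by_month (
    sensor_id int,
    month int,            -- e.g. 202401 for January 2024
    registered_at timestamp,
    temperature double,
    PRIMARY KEY ((sensor_id, month), registered_at)
);
```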
But that is the same story: it helps with bucketing, since all records for a particular month sit in one partition, but the dashboard would be valid for 1 month only. This is all true unless this plugin dynamically propagated the number of the current month into the query.
I think having a placeholder for the number of the current month and day would make this way more performance friendly.
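For illustration, such a query might look like this; the `$__currentDay` placeholder is purely hypothetical, not an existing plugin feature:

```cql
SELECT registered_at, temperature
FROM metrics_by_day
WHERE sensor_id = 123
  -- hypothetical placeholder, not provided by the plugin today
  AND day = '$__currentDay';
```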