Avoiding fat partitions #162
-
@smiklosovic For instance, in the case when bucketing is done by date, the query would look like this:
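Something like the following, a sketch assuming a day-bucketed table along the lines of the `metrics_by_day` example later in this thread (table and column names are illustrative):

```cql
SELECT registered_at, temperature
FROM metrics_by_day
WHERE sensor_id = 123
  AND day = '2024-01-15'
  AND registered_at >= '2024-01-15 00:00:00'
  AND registered_at <  '2024-01-16 00:00:00';
```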
Thank you for pointing out that particular case - it is quite important and has to be added to the datasource documentation.
-
@smiklosovic
However, as you mentioned, it will work for only one day. If a time range includes more than one day/month/etc., then we have to add all of these days/months/etc. to the query predicate.
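For example, a range spanning several days could list every day bucket in an `IN` clause (again using the illustrative `metrics_by_day` table):

```cql
SELECT registered_at, temperature
FROM metrics_by_day
WHERE sensor_id = 123
  AND day IN ('2024-01-15', '2024-01-16', '2024-01-17')
  AND registered_at >= '2024-01-15 00:00:00'
  AND registered_at <  '2024-01-18 00:00:00';
```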
-
Yes. Consider this table:
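A sketch along the lines of the day-bucketed design described later in this thread (names are illustrative):

```cql
CREATE TABLE metrics_by_day (
    sensor_id int,
    day date,
    registered_at timestamp,
    temperature double,
    PRIMARY KEY ((sensor_id, day), registered_at)
);
```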
The advantage of this is that each partition holds at most one day of data, so partitions stay bounded in size and are spread evenly across the cluster.
-
Dear participants, despite @smiklosovic being technically absolutely right, the purpose of this plugin is not to teach Cassandra users how to create data models the right way. The purpose of this plugin is to visualize data. @unflag If the developers don't know how to design a data model, we can only wish them luck. Overcomplicating the sample code is the wrong way to educate them. @smiklosovic If you would like to improve the documentation, we will gladly review a pull request.
-
Well, technically this plugin does not support the propagation of date ranges into the query.
-
@HadesArchitect
If I have a table like this:
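A minimal sketch of such a table, where `sensor_id` alone is the partition key and `registered_at` is the clustering column (column types are assumptions):

```cql
CREATE TABLE metrics (
    sensor_id int,
    registered_at timestamp,
    temperature double,
    PRIMARY KEY (sensor_id, registered_at)
);
```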
The problem with that is that sensor_id is the partition key and registered_at is the clustering column. So, if I have a sensor at home which takes a temperature reading each minute and I run it for 1 year, then after 1 year that partition would have 60x24x365 = 525,600 records. If each record is 1 KB, every query has to deal with a partition of roughly 513 MB. This can easily time out your query.
If you have a 3-node cluster you want to save your metrics to and the replication factor is 1, then all these metrics would end up stored on one node only. You would have two nodes basically empty while the third node holds all the data, so the data would be spread very unevenly. If that node goes down, you have no metrics at all.
This example is not performance friendly at all. It would be more appropriate to apply some kind of bucketing, which is a common technique for Cassandra; for this example, like this:
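A sketch of what that bucketed table could look like, with a `day` column added to the partition key (the exact columns are an assumption):

```cql
CREATE TABLE metrics_by_day (
    sensor_id int,
    day date,
    registered_at timestamp,
    temperature double,
    PRIMARY KEY ((sensor_id, day), registered_at)
);
```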
That means there would be at most 60x24 = 1440 records per day, and if each record is 1 KB, a query would read at most about 1.4 MB for each day.
However, there would need to be some support for this in the plugin as well. Otherwise, a user is forced to construct his queries with a static day value, which basically renders his dashboard valid for 1 day only; after that, he needs to reconfigure it.
A dashboard valid for 1 month would use a table like this:
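The same idea bucketed by month instead (again, illustrative names):

```cql
CREATE TABLE metrics_by_month (
    sensor_id int,
    month int,            -- e.g. 202401 for January 2024
    registered_at timestamp,
    temperature double,
    PRIMARY KEY ((sensor_id, month), registered_at)
);
```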
But that is the same story: it helps with bucketing, since all records for a particular month sit in one partition, but the dashboard would be valid for 1 month only. This is all true unless this plugin dynamically propagated the number of the current month into the query.
I think having a placeholder for the number of the current month and day would make this way more performance friendly.
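For illustration, such a query might look like this; the `$__currentDay` placeholder is purely hypothetical, not an existing plugin feature:

```cql
SELECT registered_at, temperature
FROM metrics_by_day
WHERE sensor_id = 123
  -- hypothetical placeholder, not provided by the plugin today
  AND day = '$__currentDay';
```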