
Multi-dimensional analysis scenario: with roughly 30 dimension combinations, the data explosion produces a large number of bitmaps and GC times become very long. Is there a good solution? #15

Open
17521708864 opened this issue Jun 17, 2024 · 5 comments

Comments

@17521708864

No description provided.

@lihuigang
Owner

1. Before the explosion, you can first pre-aggregate the detail data: to_bitmap -> bitmap_union_count
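
For illustration only, a pre-aggregation along these lines might look like the sketch below. The table and column names (dwd_event_detail, dm_user_bitmap_agg, dt, dim_a, dim_b) are hypothetical placeholders, and the functions are assumed to behave like the usual Doris-style bitmap UDFs.

```sql
-- Hypothetical light-aggregation step: collapse the detail rows into one bitmap
-- per fine-grained dimension group before any cube/explosion happens.
CREATE TABLE dm_user_bitmap_agg AS
SELECT
    dt,
    dim_a,
    dim_b,
    bitmap_union(to_bitmap(user_id)) AS user_bitmap
FROM dwd_event_detail
GROUP BY dt, dim_a, dim_b;

-- A distinct count on the detail data can then go through bitmap_union_count
-- instead of COUNT(DISTINCT user_id):
SELECT dt, bitmap_union_count(to_bitmap(user_id)) AS uv
FROM dwd_event_detail
GROUP BY dt;
```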

@lihuigang
Owner

Are you storing the computed result as a Bitmap, or as the deduplicated value after bitmap_count?

@17521708864
Author

1. Before the explosion, you can first pre-aggregate the detail data: to_bitmap -> bitmap_union_count

That is what we do: we first build a lightly aggregated table, map user_id through a mapping table, and persist to_bitmap(user_id) to disk. We then run a cube operation on top of this lightly aggregated layer; since there are quite a few dimensions, Spark GC time is very long.

@17521708864
Author

Are you storing the computed result as a Bitmap, or as the deduplicated value after bitmap_count?

The lightly aggregated layer stores bitmaps; the cube is then built on top of that layer using bitmap_union and bitmap_count.
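
For concreteness, a cube over such a lightly aggregated layer might look roughly like the sketch below; dm_user_bitmap_agg and its columns are hypothetical placeholders continuing the earlier sketch, not names from this project.

```sql
-- Hypothetical cube over the light-aggregate layer: bitmap_union merges the
-- stored bitmaps for each dimension combination, bitmap_count extracts the
-- distinct count. With ~30 dimension combinations every input row is expanded
-- roughly 30x, which is where the large number of intermediate bitmaps (and
-- the GC pressure) comes from.
SELECT
    dim_a,
    dim_b,
    bitmap_count(bitmap_union(user_bitmap)) AS uv
FROM dm_user_bitmap_agg
GROUP BY dim_a, dim_b WITH CUBE;
```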

@lihuigang
Owner

1. Before the explosion, you can first pre-aggregate the detail data: to_bitmap -> bitmap_union_count

That is what we do: we first build a lightly aggregated table, map user_id through a mapping table, and persist to_bitmap(user_id) to disk. We then run a cube operation on top of this lightly aggregated layer; since there are quite a few dimensions, Spark GC time is very long.

1. Are the values in the mapping table contiguous? Ideally they should be contiguous.
2. Long GC times may indicate insufficient memory; you can increase spark.executor.memory and spark.yarn.executor.memoryOverhead.
3. You can also increase the partition count via spark.sql.shuffle.partitions, which reduces the amount of data each stage has to process (see the sketch after this list for where these settings go).
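
As a rough sketch of where these settings go (the value below is illustrative only, not a recommendation): spark.sql.shuffle.partitions can be changed at runtime in Spark SQL, while the executor memory settings have to be supplied when the application is submitted.

```sql
-- Raise the shuffle partition count at runtime (illustrative value):
SET spark.sql.shuffle.partitions=2000;

-- spark.executor.memory and spark.yarn.executor.memoryOverhead cannot be changed
-- from SQL at runtime; they are passed at submit time, e.g.
--   spark-submit --conf spark.executor.memory=... \
--                --conf spark.yarn.executor.memoryOverhead=...
```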
