[Doc] reconstruct bitmap index #46061
Signed-off-by: hellolilyliuyi <[email protected]>
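### How to design Bitmap indexes to accelerate queries

The primary considerations when choosing a Bitmap index are **the cardinality of the column and how effectively the Bitmap index filters data for the query.** Contrary to common belief, Bitmap indexes are better suited to **queries on relatively high-cardinality columns and combined queries on multiple low-cardinality columns, where the Bitmap index filters data effectively** and can filter out more Pages of data.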
Review comment: I think that in general there is no need to expose the concept of a Page.
Review comment: Without exposing the Page concept, it is hard to explain why low-cardinality columns perform poorly.
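:::

However, if the cardinality is too high, other problems arise, such as **higher disk space usage**. In addition, because the Bitmap index must be built during data loading, **load performance is affected** when loads are frequent.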
Review comment: How large is the impact on "load performance"? If it is not large, there is no need to mention it. The main focus here is query performance.
Review comment: It will have a greater impact on import performance and capacity.
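You also need to consider **the overhead of loading the Bitmap index at query time**. A query loads the Bitmap index only on demand, that is, `column values involved in the query conditions / cardinality x Bitmap index`. The larger this value, the greater the overhead of loading the Bitmap index at query time.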
Review comment: What is `column values involved in the query conditions`? The number of column values? And what does `x Bitmap index` mean? Its size? Its row count?
Review comment: Number of values involved in the query conditions / cardinality * size of a single Bitmap index
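Read literally, the formula in this comment can be sketched as follows. This is only an illustrative calculation: the function name `estimated_bitmap_load_bytes`, the reading of "size of a single Bitmap index" as the total index size of one column, and the sample numbers are all assumptions, not StarRocks internals or measured values.

```python
def estimated_bitmap_load_bytes(values_in_predicate: int,
                                cardinality: int,
                                bitmap_index_bytes: int) -> float:
    """Estimate how many bytes of Bitmap index one query loads.

    Implements the formula from the comment:
    values in the query conditions / cardinality * size of a single Bitmap index.
    """
    if cardinality <= 0:
        raise ValueError("cardinality must be positive")
    if not 0 <= values_in_predicate <= cardinality:
        raise ValueError("values_in_predicate must be between 0 and cardinality")
    # Multiply before dividing so that whole-number cases stay exact.
    return values_in_predicate * bitmap_index_bytes / cardinality


# Hypothetical example: an IN list with 2 values on a column with
# 100 distinct values whose Bitmap index occupies 50,000,000 bytes.
example_bytes = estimated_bitmap_load_bytes(2, 100, 50_000_000)
print(example_bytes)  # 1000000.0
```

Under this reading, the estimate grows with the number of values in the predicate and shrinks as cardinality rises for a fixed index size.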
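Therefore, to determine which column cardinalities and queries suit a Bitmap index, we recommend that you refer to [Bitmap index performance test](#Bitmap 索引性能测试) in this topic and run performance tests based on your actual business data and queries: **create Bitmap indexes on columns of different cardinalities, then analyze and weigh the filtering effect of the Bitmap index on queries against its additional costs, such as disk space usage, the impact on load performance, and the overhead of loading the Bitmap index at query time.**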
Review comment: See whether we can give a relatively concrete range, such as a range of cardinalities. Otherwise, this paragraph tells users nothing they can act on.
Review comment: The Bitmap index should be able to filter out at least 999/1000 of the data.
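If this reply is read as a rule of thumb, it can be checked against row counts taken from a query profile. The sketch below is an assumption-laden illustration: the helper names and the sample row counts are hypothetical, and only the 999/1000 threshold comes from the comment.

```python
from fractions import Fraction


def bitmap_filter_ratio(filtered_rows: int, total_rows: int) -> Fraction:
    """Fraction of rows that the Bitmap index filtered out."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if not 0 <= filtered_rows <= total_rows:
        raise ValueError("filtered_rows must be between 0 and total_rows")
    return Fraction(filtered_rows, total_rows)


def meets_rule_of_thumb(filtered_rows: int, total_rows: int,
                        threshold: Fraction = Fraction(999, 1000)) -> bool:
    """True when the index filters out at least `threshold` of the rows."""
    return bitmap_filter_ratio(filtered_rows, total_rows) >= threshold


# Hypothetical row counts, not taken from the tests in this document.
print(meets_rule_of_thumb(999_500, 1_000_000))  # True
print(meets_rule_of_thumb(990_000, 1_000_000))  # False
```

`Fraction` keeps the comparison exact at the boundary, which avoids floating-point surprises when the ratio is exactly 999/1000.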
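## Usage notes
- It can quickly locate the row numbers of the data rows that contain 1 value, which makes it suitable for point queries and small-range queries.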
Review comment: Why emphasize "1 value"?
Review comment: "1 value" can be removed.
![figure](../../assets/3.6.1-2.png)
The total time is about 0.91 ms**, of which loading data took 0.47 ms**, low-cardinality-optimization dictionary decoding took 0.31 ms, and predicate filtering took 0.23 ms.
Review comment: This `**` seems to be a bit off.
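1. Build the dictionary: StarRocks builds a dictionary from the values of the `Gender` column, mapping `female` and `male` to INT-typed encoded values: `0` and `1`, respectively.
2. Generate bitmaps: StarRocks generates bitmaps based on the encoded values in the dictionary. Because `female` appears in the first three rows, the bitmap of `female` is `1110`; `male` appears in the 4th row, so the bitmap of `male` is `0001`.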
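The two steps above can be sketched in a few lines of Python. This is a toy model for illustration, not the StarRocks implementation: the function name `build_bitmap_index` is hypothetical, codes are assumed to be assigned in order of first appearance, and bitmaps are shown as strings of `0` and `1` to match the notation in the text.

```python
def build_bitmap_index(column):
    """Return (dictionary, bitmaps) for a list of column values.

    dictionary maps each distinct value to an integer code;
    bitmaps maps each code to a string with one bit per row.
    """
    # Step 1: build the dictionary (assumed: codes follow first appearance).
    dictionary = {}
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)

    # Step 2: generate one bitmap per code; bit i is 1 if row i holds the value.
    bits = {code: ["0"] * len(column) for code in dictionary.values()}
    for row, value in enumerate(column):
        bits[dictionary[value]][row] = "1"
    bitmaps = {code: "".join(b) for code, b in bits.items()}
    return dictionary, bitmaps


gender = ["female", "female", "female", "male"]
dictionary, bitmaps = build_bitmap_index(gender)
print(dictionary)  # {'female': 0, 'male': 1}
print(bitmaps)     # {0: '1110', 1: '0001'}
```

A lookup for `male` then reads code `1` from the dictionary and scans bitmap `0001` for set bits, which points at the 4th row.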
`` ```SQL ``

Review comment: Let's not use SQL syntax highlighting for this one; it looks too colorful. How about changing it to bash?
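DictDecode: 329.696ms // Because the number of output rows is the same, low-cardinality-optimization dictionary decoding takes about the same time.
BitmapIndexFilter: 419.308ms // Time the Bitmap index spent filtering data.
BitmapIndexFilterRows: 123.433M (123432975) // Number of rows filtered out by the Bitmap index.
ZoneMapIndexFiter: 171.580ms // The ZoneMap index spent 0.17s filtering data.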
Review comment: Why is there an extra ZoneMap time here compared with the profile above?
Review comment: This part is fairly complicated. Let's just write it this way for now; I'll think about how to explain it later.
**Query statement**:

```SQL
SELECT count(1) FROM lineorder_without_index WHERE lo_shipmode="MAIL" AND lo_quantity=10 AND lo_discount=9 AND lo_tax=8;
```
Review comment: I suggest splitting it across lines; otherwise it is tiring to read.
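**Query performance analysis**: Because this is a combined query on multiple low-cardinality columns, the Bitmap index works well and can filter out some Pages, so the time spent reading data drops noticeably.

The total time is 0.68s, **of which loading the data and the Bitmap index took 0.54s,**Bitmap index filtering took 0.14s.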
Review comment: This `**` also seems to be a bit off.
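Why a combined query on several low-cardinality columns can filter well is easier to see with a toy example. The sketch below is an assumption-based illustration only: the 8-row sample data is made up, bitmaps are modeled as Python integers, and StarRocks itself filters at Page granularity with its own implementation. Only the column names and predicate values are taken from the query quoted above.

```python
def value_bitmap(column, wanted):
    """Bitmap as an int: bit i is set when row i equals `wanted`."""
    bits = 0
    for row, value in enumerate(column):
        if value == wanted:
            bits |= 1 << row
    return bits


# Hypothetical 8-row sample; each column has only a few distinct values.
lo_shipmode = ["MAIL", "AIR", "MAIL", "MAIL", "SHIP", "MAIL", "AIR", "MAIL"]
lo_quantity = [10, 10, 5, 10, 10, 10, 10, 7]
lo_discount = [9, 9, 9, 1, 9, 9, 2, 9]
lo_tax = [8, 3, 8, 8, 8, 8, 8, 2]

# One bitmap per predicate; each alone still keeps most of the rows.
per_predicate = [
    value_bitmap(lo_shipmode, "MAIL"),
    value_bitmap(lo_quantity, 10),
    value_bitmap(lo_discount, 9),
    value_bitmap(lo_tax, 8),
]
rows_kept_each = [bin(b).count("1") for b in per_predicate]

# AND the bitmaps together: only rows matching every predicate survive.
combined = per_predicate[0]
for bitmap in per_predicate[1:]:
    combined &= bitmap
matching_rows = [row for row in range(len(lo_shipmode)) if (combined >> row) & 1]

print(rows_kept_each)  # [5, 6, 6, 6]
print(matching_rows)   # [0, 5]
```

In this sample each predicate on its own keeps 5 or 6 of the 8 rows, while the intersection keeps 2, which is the effect the analysis describes for multi-column conditions.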
[FE Incremental Coverage Report]✅ pass : 0 / 0 (0%)
[BE Incremental Coverage Report]✅ pass : 0 / 0 (0%)
@Mergifyio backport branch-3.3
✅ Backports have been created
Signed-off-by: hellolilyliuyi <[email protected]> (cherry picked from commit ae01369)
@mergify backport branch-3.2
✅ Backports have been created
Signed-off-by: hellolilyliuyi <[email protected]> (cherry picked from commit ae01369)
Co-authored-by: hellolilyliuyi <[email protected]>
@mergify backport branch-3.1 branch-3.0
✅ Backports have been created
Signed-off-by: hellolilyliuyi <[email protected]> (cherry picked from commit ae01369) # Conflicts: # docs/zh/using_starrocks/Bitmap_index.md
Signed-off-by: hellolilyliuyi <[email protected]> (cherry picked from commit ae01369) # Conflicts: # docs/zh/using_starrocks/Bitmap_index.md
Signed-off-by: hellolilyliuyi <[email protected]> Co-authored-by: hellolilyliuyi <[email protected]>
Signed-off-by: hellolilyliuyi <[email protected]> Co-authored-by: hellolilyliuyi <[email protected]>
Co-authored-by: hellolilyliuyi <[email protected]>
@mergify backport branch-2.5
✅ Backports have been created
Signed-off-by: hellolilyliuyi <[email protected]> (cherry picked from commit ae01369) # Conflicts: # docs/zh/using_starrocks/Bitmap_index.md
Signed-off-by: hellolilyliuyi <[email protected]> Co-authored-by: hellolilyliuyi <[email protected]>