-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[improvement](segment) reduce memory usage when open segments #46570
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
e971b1a
to
36370c7
Compare
run buildall |
061c170
to
a2e2bcf
Compare
run buildall |
a2e2bcf
to
6921b73
Compare
run buildall |
6921b73
to
842ce50
Compare
run buildall |
TPC-H: Total hot run time: 34231 ms
|
if (seg_start == seg_end) { | ||
seg_start = 0; | ||
seg_end = segments.size(); | ||
_segments_rows)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
把这个 _segment_rows 属性去掉吧,实际这个就这里用了一次,我们如果缓存下来,我怕有问题
if (seg_start == seg_end) { | ||
seg_start = 0; | ||
seg_end = segments.size(); | ||
_segments_rows)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should also check rowid_conversion != nullptr, to avoid core
@@ -40,7 +40,7 @@ struct RowSetSplits { | |||
// if segment_offsets is not empty, means we only scan | |||
// [pair.first, pair.second) segment in rs_reader, only effective in dup key | |||
// and pipeline | |||
std::pair<int, int> segment_offsets; | |||
std::pair<int64_t, int64_t> segment_offsets; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove get_segment_num_rows method from this class
const auto& tmp_segments = segment_cache_handle.get_segments(); | ||
_segments_rows[i] = tmp_segments[0]->num_rows(); | ||
if (i >= seg_start && i < seg_end) { | ||
segments[i] = tmp_segments[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这行代码是什么意思?
} | ||
if (_read_context->record_rowids) { | ||
_segments_rows.resize(segment_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里需要加一个comment,解释一下什么时候会走到这里
auto segment_count = _rowset->num_segments(); | ||
std::vector<segment_v2::SegmentSharedPtr> segments(segment_count); | ||
auto [seg_start, seg_end] = _segment_offsets; | ||
if (seg_start == seg_end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里也需要加一下注释,为什么== 的时候end =segment_count 了
TPC-DS: Total hot run time: 198259 ms
|
ClickBench: Total hot run time: 31.54 s
|
aedcfb8
to
bfdaa6f
Compare
run buildall |
TPC-H: Total hot run time: 32928 ms
|
TPC-DS: Total hot run time: 189023 ms
|
ClickBench: Total hot run time: 31.11 s
|
bfdaa6f
to
df41457
Compare
run buildall |
TPC-H: Total hot run time: 32999 ms
|
TPC-DS: Total hot run time: 195802 ms
|
ClickBench: Total hot run time: 32.19 s
|
TeamCity be ut coverage result: |
df41457
to
9688288
Compare
run buildall |
be/src/olap/rowset/beta_rowset.cpp
Outdated
// the segments in this rowset will be loaded by calling load_segments() explicitly. | ||
auto segment_count = num_segments(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不要在这改,这里改有问题。
// do not use cache to load index
// because the index file may conflict
// and the cached fd may be invalid
RETURN_IF_ERROR(org_rowset->load(false));
有一些备份恢复的地方会调用这个函数,此时就会带来很多不必要的IO。
我们给get_segment_num_rows 这个函数加一个lock,让他去加载,然后返回。
然后这个do load 函数,可能没啥用了,你顺道删了把。之前的实现都是空的。
be/src/olap/rowset/beta_rowset.cpp
Outdated
|
||
DCHECK(seg_id >= 0); | ||
auto seg_path = DORIS_TRY(segment_path(seg_id)); | ||
io::FileReaderOptions reader_options { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个地方,我们直接调用segment loader 的load segment 方法,把cache 设置为false,是不是等价的?
RETURN_IF_ERROR(_read_context->rowid_conversion->init_segment_map(rowset()->rowset_id(), | ||
segment_num_rows)); | ||
} | ||
const bool use_lazy_init_iterators = !_is_merge_iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个为啥不用lazy?
@@ -86,12 +86,24 @@ std::string file_cache_key_str(const std::string& seg_path) { | |||
return file_cache_key_from_path(seg_path).to_string(); | |||
} | |||
|
|||
Status Segment::get_num_rows(io::FileSystemSPtr fs, const std::string& path, int64_t tablet_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
没意义
TPC-H: Total hot run time: 9985 ms
|
9688288
to
ca2ea85
Compare
ca2ea85
to
edb84dc
Compare
run buildall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR approved by at least one committer and no changes requested. |
PR approved by anyone and no changes requested. |
TPC-H: Total hot run time: 32583 ms
|
TPC-DS: Total hot run time: 188004 ms
|
ClickBench: Total hot run time: 31.55 s
|
TeamCity be ut coverage result: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: When there are a lot of segments in one rowset, it will consume plenty of memory if open all the segments all at once. This PR open segments one by one and release the `Segment` object immediately if it's not need to be kept for later use, thus reduce memory footprints dramatically.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
When there are a lot of segments in one rowset, it will consume plenty of memory if open all the segments all at once. This PR open segments one by one and release the
Segment
object immediately if it's not need to be kept for later use, thus reduce memory footprints dramatically.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)