[Refactor] add hive/hudi connector scan range source #49451
Conversation
    private void tryPopulateBuffer() {
        while (buffer.isEmpty() && fileIndex < files.size()) {
            splitFile(files.get(fileIndex));
buffer is not empty after splitFile, so this only populates one file per call?
Yes. We try to split a single file into scan ranges each time.
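The one-file-at-a-time buffering discussed above can be sketched as follows. This is a simplified illustration, not the PR's actual code: the class, its fields, and the string-based "scan ranges" are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Simplified sketch: each file is split lazily into fixed-size ranges,
// and the buffer is refilled only when it runs empty, so at most one
// file's ranges are materialized at a time.
class LazyScanRangeBuffer {
    private final List<Long> fileSizes;   // stand-in for real file metadata
    private final long splitSize;
    private final Deque<String> buffer = new ArrayDeque<>();
    private int fileIndex = 0;

    LazyScanRangeBuffer(List<Long> fileSizes, long splitSize) {
        this.fileSizes = fileSizes;
        this.splitSize = splitSize;
    }

    // Split exactly one file per loop iteration; stop as soon as the
    // buffer becomes non-empty.
    private void tryPopulateBuffer() {
        while (buffer.isEmpty() && fileIndex < fileSizes.size()) {
            splitFile(fileIndex, fileSizes.get(fileIndex));
            fileIndex++;
        }
    }

    private void splitFile(int index, long size) {
        for (long off = 0; off < size; off += splitSize) {
            long len = Math.min(splitSize, size - off);
            buffer.add("file" + index + "[" + off + "," + (off + len) + ")");
        }
    }

    boolean hasNext() {
        tryPopulateBuffer();
        return !buffer.isEmpty();
    }

    String next() {
        tryPopulateBuffer();
        return buffer.poll();
    }
}

public class Main {
    public static void main(String[] args) {
        // A 256-byte file and a 100-byte file with 128-byte splits
        // yield 2 + 1 = 3 scan ranges, produced incrementally.
        LazyScanRangeBuffer src = new LazyScanRangeBuffer(Arrays.asList(256L, 100L), 128L);
        int count = 0;
        while (src.hasNext()) {
            System.out.println(src.next());
            count++;
        }
        System.out.println("total=" + count);
    }
}
```

The point of stopping after one file is that memory stays bounded by a single file's split count, rather than by the whole partition.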
        backendSplitFile = (backendSplitCount > 2 * nodes);
    }

    private void updateIterator() {
Would getOneParitionFiles be a better name?
OK. I have no problem with the naming; I think updateIterator is also good enough.
            return;
        }
        if (remoteFileInfoSource == null) {
            init();
Why not call init() in setup()? This method has too much logic; we should try to simplify it as soon as possible.
Two reasons:
- I'd like this operation to be as lazy as possible.
- It's compatible with our UT cases.
Also, I'll rename it to initRemoteFileInfoSource, which is clearer.
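The lazy initialization being discussed (deferring creation of the remote file info source until it is first needed) is a standard idiom. A minimal sketch, with names that are illustrative assumptions rather than the PR's actual classes:

```java
// Sketch of the lazy-init idiom under discussion; RemoteFileInfoSourceHolder
// and its members are invented for this example.
class RemoteFileInfoSourceHolder {
    private Object remoteFileInfoSource;  // created on first use
    int initCalls = 0;                    // exposed only to demonstrate laziness

    private void initRemoteFileInfoSource() {
        initCalls++;
        remoteFileInfoSource = new Object();  // stand-in for the real source
    }

    Object get() {
        // Defer creation until the source is actually needed. This keeps
        // construction cheap, which also helps unit tests that never
        // touch the remote side.
        if (remoteFileInfoSource == null) {
            initRemoteFileInfoSource();
        }
        return remoteFileInfoSource;
    }
}

public class Main {
    public static void main(String[] args) {
        RemoteFileInfoSourceHolder h = new RemoteFileInfoSourceHolder();
        h.get();
        h.get();
        // Initialization happens exactly once, on the first access.
        System.out.println("initCalls=" + h.initCalls);
    }
}
```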
        }
        // if splits is small comparing to nodes, then better not let backend do split.
        int nodes = connectContext.getAliveComputeNumber() + connectContext.getAliveBackendNumber();
        backendSplitFile = (backendSplitCount > 2 * nodes);
Will backendSplitFile be modified as splits accumulate?
Yes. If there are already many splits (more than the number of BE nodes), we can let the BEs do the splitting without worrying that some BE gets no splits. But if there are only a few splits, it's better to split files on the FE side using a small split size, so that all BE nodes receive splits.
In the old code, we iterated over all files to check that the split count was large enough. In the new code, since we emit splits incrementally, this heuristic is the best we can do.
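The decision rule described above matches the quoted code: delegate splitting to backends only once the running split count exceeds twice the number of alive nodes. A small self-contained sketch (the helper method and its parameters are assumptions for illustration):

```java
// Sketch of the FE-vs-BE split decision from the quoted snippet.
public class SplitDecision {
    // Returns true when backends can safely do their own file splitting:
    // enough splits already exist that every node is likely to get work.
    static boolean shouldBackendSplit(int backendSplitCount,
                                      int aliveComputeNodes,
                                      int aliveBackendNodes) {
        int nodes = aliveComputeNodes + aliveBackendNodes;
        return backendSplitCount > 2 * nodes;
    }

    public static void main(String[] args) {
        // With 3 alive nodes total, the threshold is strictly more than 6 splits.
        System.out.println(shouldBackendSplit(6, 1, 2));  // still FE-side splitting
        System.out.println(shouldBackendSplit(7, 1, 2));  // hand off to BEs
    }
}
```

Because the check runs on an accumulating counter, the flag can flip from false to true as more files are emitted, which is exactly the incremental behavior the reviewer asked about.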
Signed-off-by: yanz <[email protected]>
Quality Gate failed: failed conditions. See analysis details on SonarCloud.
[Java-Extensions Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass: 268 / 287 (93.38%)
[BE Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
Why I'm doing:
What I'm doing:
The main purpose is to make scanNode.getScanRangeLocations return ConnectorScanRangeSource instead of List<TScanRangeLocations>.

And here is the interface of ConnectorScanRangeSource. With it, you can get scan ranges incrementally, instead of getting all of them in one batch.

I implement HiveConnectorScanRangeSource and HudiConnectorScanRangeSource: each fetches RemoteFileInfo from RemoteInfoSource and splits it into scan ranges. The code that split files into scan ranges used to live in the RemoteScanRangeLocations.java file, but it is no longer needed.

And here is the architecture of HiveConnectorScanRangeSource (HudiConnectorScanRangeSource is almost the same, but with different file-split logic).

Fixes #issue
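To make the "incremental instead of one batch" idea concrete, here is a hedged sketch of what such a source interface could look like. The method names, the batching parameter, and the toy implementation are assumptions for illustration; the PR's actual interface may differ, and TScanRangeLocations is stubbed out (in StarRocks it is a Thrift struct).

```java
import java.util.ArrayList;
import java.util.List;

class TScanRangeLocations { }  // stub for the real Thrift struct

// Hypothetical incremental scan-range source: callers pull bounded
// batches instead of receiving the full List<TScanRangeLocations>.
interface ConnectorScanRangeSource {
    boolean hasMoreOutput();
    // Return at most maxSize scan ranges per call.
    List<TScanRangeLocations> getOutputs(int maxSize);
}

// Toy source producing a fixed number of ranges, to show batched consumption.
class FixedSource implements ConnectorScanRangeSource {
    private int remaining;
    FixedSource(int total) { this.remaining = total; }

    public boolean hasMoreOutput() { return remaining > 0; }

    public List<TScanRangeLocations> getOutputs(int maxSize) {
        List<TScanRangeLocations> batch = new ArrayList<>();
        while (remaining > 0 && batch.size() < maxSize) {
            batch.add(new TScanRangeLocations());
            remaining--;
        }
        return batch;
    }
}

public class Main {
    public static void main(String[] args) {
        ConnectorScanRangeSource src = new FixedSource(5);
        int batches = 0;
        while (src.hasMoreOutput()) {
            src.getOutputs(2);  // pull 5 ranges in batches of at most 2
            batches++;
        }
        System.out.println("batches=" + batches);
    }
}
```

The design benefit is that the planner can start scheduling as soon as the first batch is ready, and memory is bounded by the batch size rather than by the table's total file count.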
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: