[Refactor] add hive/hudi connector scan range source #49451

Merged
merged 9 commits into from
Aug 15, 2024

Conversation

@dirtysalt dirtysalt commented Aug 6, 2024

Why I'm doing:

What I'm doing:

The main purpose is to make scanNode.getScanRangeLocations return ConnectorScanRangeSource instead of List<TScanRangeLocations>.

Here is the ConnectorScanRangeSource interface. With it, you can get scan ranges incrementally instead of getting all of them in one batch.

public interface ConnectorScanRangeSource {
    List<TScanRangeLocations> getOutputs(int maxSize);
    boolean hasMoreOutput();
}
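
For illustration, a caller could drain such a source in batches roughly as follows. This is only a minimal sketch: `RangeSource<T>`, `ListRangeSource`, and `RangeSourceDemo` are hypothetical stand-ins (generic over the range type so the example compiles without the Thrift-generated TScanRangeLocations class) and are not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the shape of ConnectorScanRangeSource.
interface RangeSource<T> {
    List<T> getOutputs(int maxSize);
    boolean hasMoreOutput();
}

// Hypothetical list-backed source: hands out at most maxSize items per call.
class ListRangeSource<T> implements RangeSource<T> {
    private final List<T> items;
    private int next = 0;

    ListRangeSource(List<T> items) {
        this.items = new ArrayList<>(items);
    }

    @Override
    public List<T> getOutputs(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        int end = Math.min(items.size(), next + maxSize);
        List<T> batch = new ArrayList<>(items.subList(next, end));
        next = end;
        return batch;
    }

    @Override
    public boolean hasMoreOutput() {
        return next < items.size();
    }
}

class RangeSourceDemo {
    // Drains the source batch by batch and records the size of each batch.
    static <T> List<Integer> drainBatchSizes(RangeSource<T> source, int maxSize) {
        List<Integer> sizes = new ArrayList<>();
        while (source.hasMoreOutput()) {
            sizes.add(source.getOutputs(maxSize).size());
        }
        return sizes;
    }
}
```

With five items and a batch size of 2, the caller sees three batches, and nothing beyond the current batch has to be materialized up front.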

I implement HiveConnectorScanRangeSource and HudiConnectorScanRangeSource: each fetches RemoteFileInfo from RemoteInfoSource and splits the files into scan ranges. The code that split files into scan ranges used to live in RemoteScanRangeLocations.java, but that file is no longer needed.

Here is the architecture of HiveConnectorScanRangeSource (HudiConnectorScanRangeSource is almost the same, but with different file split logic):

image

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@dirtysalt dirtysalt requested a review from a team as a code owner August 6, 2024 12:13
@dirtysalt dirtysalt force-pushed the connector-split-source-3 branch 2 times, most recently from fb74dbd to 3eec261 August 9, 2024 03:49

private void tryPopulateBuffer() {
while (buffer.isEmpty() && fileIndex < files.size()) {
splitFile(files.get(fileIndex));

Contributor

The buffer is not empty after splitFile, so does this populate from just one file at a time?

Contributor Author

Yes. It tries to split a single file into scan ranges each time.
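
A minimal sketch of that behavior, under stated assumptions: file lengths stand in for RemoteFileInfo, each `long[]{offset, length}` pair stands in for a scan range, and `BufferedFileSplitter` and the fixed `splitSize` are hypothetical names, not the PR's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: one file is split per refill, and ranges are emitted from a buffer.
class BufferedFileSplitter {
    private final List<Long> fileLengths;
    private final long splitSize;
    private final Deque<long[]> buffer = new ArrayDeque<>();
    private int fileIndex = 0;

    BufferedFileSplitter(List<Long> fileLengths, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        this.fileLengths = new ArrayList<>(fileLengths);
        this.splitSize = splitSize;
    }

    // Cuts one file into {offset, length} ranges of at most splitSize bytes.
    private void splitFile(long length) {
        for (long offset = 0; offset < length; offset += splitSize) {
            buffer.add(new long[] {offset, Math.min(splitSize, length - offset)});
        }
    }

    // Splits one file per refill; the loop only continues past a file
    // that produced no ranges (for example, an empty file).
    private void tryPopulateBuffer() {
        while (buffer.isEmpty() && fileIndex < fileLengths.size()) {
            splitFile(fileLengths.get(fileIndex));
            fileIndex++;
        }
    }

    List<long[]> getOutputs(int maxSize) {
        List<long[]> out = new ArrayList<>();
        while (out.size() < maxSize) {
            tryPopulateBuffer();
            if (buffer.isEmpty()) {
                break; // all files consumed
            }
            out.add(buffer.poll());
        }
        return out;
    }

    boolean hasMoreOutput() {
        tryPopulateBuffer();
        return !buffer.isEmpty();
    }
}
```

Because the buffer is refilled only when it runs dry, a batch can span a file boundary while at most one file's ranges are held in memory at a time.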

backendSplitFile = (backendSplitCount > 2 * nodes);
}

private void updateIterator() {

Contributor

Would getOneParitionFiles be a better name?

Contributor Author

OK. I have no problem with naming. I think updateIterator is also good enough.

return;
}
if (remoteFileInfoSource == null) {
init();

Contributor

Why not call init() in setup()? This method has too much logic; we should try to simplify it as soon as possible.

Contributor Author

Two reasons:

  1. I'd like this operation to be as lazy as possible.
  2. it's compatible with our UT cases.

Contributor Author

Also, I'll rename it to initRemoteFileInfoSource, which is clearer.
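
The lazy initialization described in this thread could look roughly like the sketch below. It is an assumption-laden illustration: `LazyFileSource`, the `Supplier`-based loader, and `loadCount` are hypothetical and only show the "initialize on first use" pattern, not the PR's actual fields.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: the expensive listing is deferred until the first call that needs it.
class LazyFileSource {
    private final Supplier<List<String>> loader; // stands in for the remote file listing
    private Iterator<String> files;              // stays null until first access
    private int loadCount = 0;                   // counts how often the loader ran

    LazyFileSource(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    private void initRemoteFileInfoSource() {
        files = loader.get().iterator();
        loadCount++;
    }

    boolean hasMoreOutput() {
        if (files == null) {
            initRemoteFileInfoSource();
        }
        return files.hasNext();
    }

    String next() {
        if (files == null) {
            initRemoteFileInfoSource();
        }
        return files.next();
    }

    int loadCount() {
        return loadCount;
    }
}
```

Constructing the object does no remote work, which also makes it easy to build in unit tests that never read from it.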

}
// if splits is small comparing to nodes, then better not let backend do split.
int nodes = connectContext.getAliveComputeNumber() + connectContext.getAliveBackendNumber();
backendSplitFile = (backendSplitCount > 2 * nodes);

Contributor

Will backendSplitFile be modified as splits accumulate?

Contributor Author

Yes. If there are already many splits (more than the number of BE nodes), we can let the BE do the splitting without worrying that some BE gets no splits.

But if there are only a few splits, it's better to split files on the FE side using a small split size, so that all BE nodes get splits.

In the old code, we iterated over all files to check whether the split count was large enough. In the new code, since we emit splits incrementally, this is the best we can do.
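
As a rough sketch of this running decision: the class below re-evaluates the flag each time more splits are emitted, using the same `backendSplitCount > 2 * nodes` comparison as the quoted line. `BackendSplitDecider` and `onSplitsEmitted` are hypothetical names, and treating backendSplitCount as a running total of emitted splits is an assumption.

```java
// Hypothetical sketch: the flag is recomputed against a running total,
// so it can flip from false to true as more splits are emitted.
class BackendSplitDecider {
    private final int nodes;
    private long backendSplitCount = 0;
    private boolean backendSplitFile = false;

    BackendSplitDecider(int nodes) {
        this.nodes = nodes;
    }

    // Records newly emitted splits and re-evaluates the decision.
    void onSplitsEmitted(int count) {
        backendSplitCount += count;
        backendSplitFile = (backendSplitCount > 2L * nodes);
    }

    boolean backendSplitFile() {
        return backendSplitFile;
    }
}
```

With 3 nodes, the flag stays false through a total of 6 splits and turns true once the total reaches 7, so early files are split finely on the FE side and later ones can be left to the BE.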

sonarcloud bot commented Aug 14, 2024

Quality Gate failed

Failed conditions
9.4% Duplication on New Code (required ≤ 3%)
B Reliability Rating on New Code (required ≥ A)

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

[FE Incremental Coverage Report]

pass : 268 / 287 (93.38%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 com/starrocks/connector/hudi/HudiConnectorScanRangeSource.java | 57 | 66 | 86.36% | [99, 100, 104, 125, 128, 129, 132, 133, 134] |
| 🔵 com/starrocks/connector/hive/HiveConnectorScanRangeSource.java | 191 | 201 | 95.02% | [181, 182, 322, 323, 324, 348, 365, 393, 407, 408] |
| 🔵 com/starrocks/connector/hive/RemoteFileInputFormat.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/planner/FileTableScanNode.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/planner/HudiScanNode.java | 4 | 4 | 100.00% | [] |
| 🔵 com/starrocks/connector/ConnectorScanRangeSource.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/qe/ShortCircuitExecutor.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/planner/HdfsScanNode.java | 6 | 6 | 100.00% | [] |

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@stephen-shelby stephen-shelby merged commit 328e4c5 into StarRocks:main Aug 15, 2024
47 of 48 checks passed
@dirtysalt dirtysalt deleted the connector-split-source-3 branch August 15, 2024 03:35