
[Refactor] add hive/hudi connector scan range source #49451

Merged

stephen-shelby merged 9 commits into StarRocks:main from dirtysalt:connector-split-source-3

Aug 15, 2024

Contributor

dirtysalt commented Aug 6, 2024 •

edited


Why I'm doing:

What I'm doing:

The main purpose is to make scanNode.getScanRangeLocations return a ConnectorScanRangeSource instead of a List<TScanRangeLocations>.

Here is the ConnectorScanRangeSource interface. With it, you can fetch scan ranges incrementally instead of getting all of them in one batch.

public interface ConnectorScanRangeSource {
    List<TScanRangeLocations> getOutputs(int maxSize);
    boolean hasMoreOutput();
}
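
As a rough illustration of how a caller might consume such a source, here is a minimal sketch. It is not code from this PR: RangeSource is a hypothetical generic stand-in for ConnectorScanRangeSource (the real interface returns List<TScanRangeLocations>), and ListRangeSource, RangeSourceDemo and drain are made-up names for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ConnectorScanRangeSource, made generic so the
// sketch compiles without the Thrift type TScanRangeLocations.
interface RangeSource<T> {
    List<T> getOutputs(int maxSize);
    boolean hasMoreOutput();
}

// Hypothetical in-memory source used only to exercise the consumer loop.
final class ListRangeSource<T> implements RangeSource<T> {
    private final List<T> items;
    private int next = 0;

    ListRangeSource(List<T> items) {
        this.items = new ArrayList<>(items);
    }

    @Override
    public List<T> getOutputs(int maxSize) {
        // Return at most maxSize items and advance the cursor.
        int end = Math.min(items.size(), next + Math.max(0, maxSize));
        List<T> batch = new ArrayList<>(items.subList(next, end));
        next = end;
        return batch;
    }

    @Override
    public boolean hasMoreOutput() {
        return next < items.size();
    }
}

final class RangeSourceDemo {
    // Pulls batches until the source reports it is exhausted.
    static <T> List<List<T>> drain(RangeSource<T> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        while (source.hasMoreOutput()) {
            batches.add(source.getOutputs(batchSize));
        }
        return batches;
    }

    public static void main(String[] args) {
        RangeSource<String> source =
                new ListRangeSource<>(List.of("r0", "r1", "r2", "r3", "r4"));
        System.out.println(drain(source, 2)); // prints [[r0, r1], [r2, r3], [r4]]
    }
}
```

The point of the two-method shape is that the caller decides how many scan ranges to pull at a time, so nothing forces the source to materialize every range up front.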

I implemented HiveConnectorScanRangeSource and HudiConnectorScanRangeSource: each fetches RemoteFileInfo from RemoteInfoSource and splits the files into scan ranges. The code that split files into scan ranges used to live in RemoteScanRangeLocations.java, which is no longer needed.

Here is the architecture of HiveConnectorScanRangeSource (HudiConnectorScanRangeSource is almost the same, but uses different file split logic):

[architecture diagram not preserved]

Fixes #issue

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function
This is a backport pr

Bugfix cherry-pick branch check:

dirtysalt requested a review from a team as a code owner

August 6, 2024 12:13

mergify bot assigned dirtysalt

dirtysalt force-pushed the connector-split-source-3 branch 2 times, most recently from fb74dbd to 3eec261 Compare

August 9, 2024 03:49

stephen-shelby reviewed

View reviewed changes

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java Outdated

stephen-shelby reviewed

View reviewed changes

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java

Youngwb reviewed

View reviewed changes

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java

+                      private void tryPopulateBuffer() {
+                          while (buffer.isEmpty() && fileIndex < files.size()) {
+                              splitFile(files.get(fileIndex));

Contributor

Youngwb Aug 14, 2024

buffer is not empty after splitFile, so this populates just one file at a time?

Contributor Author

dirtysalt Aug 14, 2024

Yes. It tries to split a single file into scan ranges each time.
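
The behavior discussed in this thread can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: BufferedSplitSketch, the fixed splitSize, and the long[] {fileIndex, offset, length} range encoding are all assumptions, and the point where fileIndex advances is a guess because the quoted diff is cut off before it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: files are represented only by their lengths, and a
// scan range is encoded as {fileIndex, offset, length}.
final class BufferedSplitSketch {
    private final long[] fileLengths;
    private final long splitSize;
    private final Deque<long[]> buffer = new ArrayDeque<>();
    private int fileIndex = 0;

    BufferedSplitSketch(long[] fileLengths, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        this.fileLengths = fileLengths.clone();
        this.splitSize = splitSize;
    }

    // Cuts one file into fixed-size ranges; an empty file yields nothing.
    private void splitFile(int index) {
        long length = fileLengths[index];
        for (long offset = 0; offset < length; offset += splitSize) {
            buffer.add(new long[] {index, offset, Math.min(splitSize, length - offset)});
        }
    }

    // Splits files one at a time and stops as soon as the buffer has ranges.
    // The loop only continues past a file when that file produced no ranges.
    private void tryPopulateBuffer() {
        while (buffer.isEmpty() && fileIndex < fileLengths.length) {
            splitFile(fileIndex);
            fileIndex++; // assumption: the index advances after each split
        }
    }

    List<long[]> getOutputs(int maxSize) {
        List<long[]> out = new ArrayList<>();
        while (out.size() < maxSize) {
            tryPopulateBuffer();
            if (buffer.isEmpty()) {
                break; // every file has been consumed
            }
            out.add(buffer.poll());
        }
        return out;
    }

    boolean hasMoreOutput() {
        tryPopulateBuffer();
        return !buffer.isEmpty();
    }
}
```

With file lengths {250, 0, 100} and a split size of 100, the first call to getOutputs(2) only touches the first file; the empty second file is skipped when the buffer next runs dry.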

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java

+                      backendSplitFile = (backendSplitCount > 2 * nodes);
+                  }
+                  private void updateIterator() {

Contributor

Youngwb Aug 14, 2024

Would getOnePartitionFiles be a better name?

Contributor Author

dirtysalt Aug 14, 2024

OK. I have no problem with naming. I think updateIterator is also good enough.

stephen-shelby reviewed

View reviewed changes

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java Outdated

+                          return;
+                      }
+                      if (remoteFileInfoSource == null) {
+                          init();

Contributor

stephen-shelby Aug 14, 2024

Why not call init() in setup()? This method has too much logic; we should try to simplify it as soon as possible.

Contributor Author

dirtysalt Aug 14, 2024

Two reasons:

1. I'd like this operation to be as lazy as possible.
2. It's compatible with our UT cases.

Contributor Author

dirtysalt Aug 14, 2024

Also, I'll rename it to initRemoteFileInfoSource, which is clearer.
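
The lazy initialization argued for here could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's implementation: LazyFileSourceSketch, the Supplier-based factory, and the use of Iterator<String> in place of the real remote file info source are all hypothetical.

```java
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical sketch of creating the remote file info source on first use.
final class LazyFileSourceSketch {
    private final Supplier<Iterator<String>> factory;
    private Iterator<String> remoteFileInfoSource; // stays null until first use

    LazyFileSourceSketch(Supplier<Iterator<String>> factory) {
        this.factory = factory;
    }

    // Runs the (possibly expensive) factory exactly once.
    private void initRemoteFileInfoSource() {
        remoteFileInfoSource = factory.get();
    }

    boolean hasMoreOutput() {
        if (remoteFileInfoSource == null) {
            initRemoteFileInfoSource();
        }
        return remoteFileInfoSource.hasNext();
    }

    String nextFile() {
        if (remoteFileInfoSource == null) {
            initRemoteFileInfoSource();
        }
        return remoteFileInfoSource.next();
    }
}
```

Constructing the object does no remote work, so a plan that never asks for scan ranges never pays for the listing.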

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java Outdated

stephen-shelby reviewed

View reviewed changes

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java

stephen-shelby reviewed

View reviewed changes

fe/fe-core/src/main/java/com/starrocks/connector/hive/HiveConnectorScanRangeSource.java

+                      }
+                      // if splits is small comparing to nodes, then better not let backend do split.
+                      int nodes = connectContext.getAliveComputeNumber() + connectContext.getAliveBackendNumber();
+                      backendSplitFile = (backendSplitCount > 2 * nodes);

Contributor

stephen-shelby Aug 14, 2024

Will backendSplitFile be modified as split accumulates?

Contributor Author

dirtysalt Aug 14, 2024

Yes. If there are already many splits (enough to exceed the number of BE nodes), we can let the BE do the splitting without worrying that some BE nodes get no splits.

But if there are only a few splits, it's better to split files on the FE side using a small split size, so that all BE nodes can get splits.

In the old code, we iterated over all files to check whether the number of splits was large enough. In the new code, since we emit splits incrementally, this is the best we can do.
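
A small sketch of this running decision, built around the quoted condition backendSplitCount > 2 * nodes. The class name BackendSplitHeuristic and the addSplits method are hypothetical, and treating backendSplitCount as a running total of splits seen so far is an assumption drawn from the explanation above rather than from the full source.

```java
// Hypothetical sketch of re-evaluating the backend-split decision as splits
// accumulate, instead of scanning every file up front.
final class BackendSplitHeuristic {
    private final int aliveNodes;
    private long backendSplitCount = 0;
    private boolean backendSplitFile = false;

    BackendSplitHeuristic(int aliveNodes) {
        this.aliveNodes = aliveNodes;
    }

    // Records newly seen splits and re-checks the threshold from the diff:
    // only let backends split once there are more than 2 splits per node.
    void addSplits(long count) {
        backendSplitCount += count;
        backendSplitFile = backendSplitCount > 2L * aliveNodes;
    }

    boolean backendSplitFile() {
        return backendSplitFile;
    }
}
```

For example, with 3 alive nodes the flag stays false after 4 splits (4 is not greater than 6) and flips to true once the total reaches 7.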

dirtysalt added 9 commits

August 14, 2024 22:02


          [Refactor] add hive/hudi connector scan range source

0e1a4bb

Signed-off-by: yanz <[email protected]>


          replace with scan range source

8a1e5d2

Signed-off-by: yanz <[email protected]>


          fix ut

Signed-off-by: yanz <[email protected]>


          fix unused import

d45c592

Signed-off-by: yanz <[email protected]>


          lazily to create remote file info source

d3cf0df

Signed-off-by: yanz <[email protected]>


          fix ut

cf216ec

Signed-off-by: yanz <[email protected]>


          remove remote scan range locations

4acc2fc

Signed-off-by: yanz <[email protected]>


          add shuffle comment

0415e50

Signed-off-by: yanz <[email protected]>


          fix for comment

cf5b2b6

Signed-off-by: yanz <[email protected]>

dirtysalt force-pushed the connector-split-source-3 branch from 04028d5 to cf5b2b6 Compare

August 14, 2024 14:03

sonarcloud bot commented Aug 14, 2024

Quality Gate failed

Failed conditions
9.4% Duplication on New Code (required ≤ 3%)
B Reliability Rating on New Code (required ≥ A)

See analysis details on SonarCloud


Youngwb approved these changes

View reviewed changes

github-actions bot commented Aug 14, 2024

[Java-Extensions Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions bot commented Aug 14, 2024

[FE Incremental Coverage Report]

✅ pass : 268 / 287 (93.38%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
🔵	com/starrocks/connector/hudi/HudiConnectorScanRangeSource.java	57	66	86.36%	[99, 100, 104, 125, 128, 129, 132, 133, 134]
🔵	com/starrocks/connector/hive/HiveConnectorScanRangeSource.java	191	201	95.02%	[181, 182, 322, 323, 324, 348, 365, 393, 407, 408]
🔵	com/starrocks/connector/hive/RemoteFileInputFormat.java	3	3	100.00%	[]
🔵	com/starrocks/planner/FileTableScanNode.java	3	3	100.00%	[]
🔵	com/starrocks/planner/HudiScanNode.java	4	4	100.00%	[]
🔵	com/starrocks/connector/ConnectorScanRangeSource.java	1	1	100.00%	[]
🔵	com/starrocks/qe/ShortCircuitExecutor.java	3	3	100.00%	[]
🔵	com/starrocks/planner/HdfsScanNode.java	6	6	100.00%	[]

github-actions bot commented Aug 14, 2024

[BE Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

stephen-shelby approved these changes

View reviewed changes

stephen-shelby merged commit 328e4c5 into StarRocks:main

47 of 48 checks passed

dirtysalt deleted the connector-split-source-3 branch

August 15, 2024 03:35

dirtysalt mentioned this pull request

To support incremental scan ranges deployment. #50196

Closed

3 tasks


Labels

None yet