TEZ-4554: Counter for used nodes within a DAG #362

Merged (3 commits) on Jun 25, 2024

@abstractdog abstractdog commented Jun 19, 2024

Tested on a small Tez cluster in container mode.

With 2 healthy, running YARN NodeManagers:

INFO  : org.apache.tez.common.counters.DAGCounter:
...
INFO  :    NODE_USED_COUNT: 2
INFO  :    NODE_TOTAL_COUNT: 2

After stopping 1 NodeManager:

INFO  : org.apache.tez.common.counters.DAGCounter:
...
INFO  :    NODE_USED_COUNT: 1
INFO  :    NODE_TOTAL_COUNT: 2

After decommissioning 1 NodeManager:

INFO  : org.apache.tez.common.counters.DAGCounter:
...
INFO  :    NODE_USED_COUNT: 1
INFO  :    NODE_TOTAL_COUNT: 1

Also tested with LLAP on Cloudera CDW (3 LLAP daemons):

INFO  :    NODE_USED_COUNT: 3
INFO  :    NODE_TOTAL_COUNT: 3
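
The outputs above suggest that NODE_USED_COUNT counts distinct nodes rather than containers. As a rough illustration only, here is a minimal sketch of that bookkeeping, assuming a per-DAG set of node identifiers. `UsedNodeTracker` and its string-based IDs are hypothetical stand-ins, not the actual Tez classes or the code in this patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for per-DAG bookkeeping; not the real Tez implementation.
class UsedNodeTracker {
  // Sets deduplicate, so several containers on one node count that node once.
  private final Set<String> containersUsed = new HashSet<>();
  private final Set<String> nodesUsed = new HashSet<>();

  // Record a container together with the node it ran on (IDs are plain strings here).
  void addUsedContainer(String containerId, String nodeId) {
    containersUsed.add(containerId);
    nodesUsed.add(nodeId);
  }

  int usedContainerCount() {
    return containersUsed.size();
  }

  int usedNodeCount() {
    return nodesUsed.size();
  }

  public static void main(String[] args) {
    UsedNodeTracker tracker = new UsedNodeTracker();
    tracker.addUsedContainer("container_1", "node-a:8041");
    tracker.addUsedContainer("container_2", "node-a:8041");
    tracker.addUsedContainer("container_3", "node-b:8041");
    // Three containers ran, but only two distinct nodes were used.
    System.out.println("usedContainers=" + tracker.usedContainerCount()
        + ", usedNodes=" + tracker.usedNodeCount());
  }
}
```

With this kind of tracking, the used count can never exceed the number of nodes that actually hosted a container, which matches the "stopped NodeManager" case above where 1 of 2 nodes was used.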

@ayushtkn ayushtkn left a comment

LGTM

public void addUsedContainer(Container container) {
  containersUsedByCurrentDAG.add(container.getId());
  nodesUsedByCurrentDAG.add(container.getNodeId());
  nodeHostsUsedByCurrentDAG.add(container.getNodeId().getHost());

Member

can container.getNodeId() be null in any scenario?

Contributor Author

Hm, it should not be; if it is, that's a YARN bug.

Member

Ohhk

@abstractdog

Thanks @ayushtkn for the review.
I don't think the empty findbugs alert makes sense, so I'll rerun the precommit test.

@abstractdog

abstractdog commented Jun 21, 2024

Thanks @ayushtkn for the review so far.

It seems I have to change this patch slightly before moving on to HIVE-28201. The patch currently gets the total node count from TaskSchedulerManager.getNumClusterNodes, which is served from a cachedNodeCount local to the TaskSchedulerManager, so it isn't accessible from the LlapTaskSchedulerService.
I'm going to change this method to call getClusterNodeCount on the individual TaskSchedulers (e.g. the LLAP one) and accumulate the results, so that LLAP can report a real value:
https://github.com/apache/hive/blob/1c9969a003b09abc851ae7e19631ad208d3b6066/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java#L1062

In the best case, HIVE-28201 won't need any implementation at all.

I'll let you know later.
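
To make the idea of accumulating per-scheduler node counts concrete, here is a minimal sketch under stated assumptions. `NodeCountSource` is a hypothetical stand-in for a task scheduler that exposes getClusterNodeCount, and `AggregatedNodeCount` is an illustrative helper; neither is the real Tez or Hive API, and the final patch may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a task scheduler that can report its node count.
interface NodeCountSource {
  int getClusterNodeCount();
}

// Illustrative helper: asks every scheduler for its own count instead of
// relying on a single cached value held by the manager.
class AggregatedNodeCount {
  static int totalNodeCount(List<NodeCountSource> schedulers) {
    int total = 0;
    for (NodeCountSource scheduler : schedulers) {
      total += scheduler.getClusterNodeCount();
    }
    return total;
  }

  public static void main(String[] args) {
    // Assumed values for illustration: one scheduler with 2 nodes, one with 3.
    NodeCountSource containerScheduler = () -> 2;
    NodeCountSource llapScheduler = () -> 3;
    List<NodeCountSource> schedulers = Arrays.asList(containerScheduler, llapScheduler);
    System.out.println("total nodes: " + totalNodeCount(schedulers));
  }
}
```

The design point is that each scheduler plugin remains the source of truth for its own cluster view, so a plugin such as the LLAP one can contribute its daemon count without access to the manager's internal cache.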

@abstractdog

abstractdog commented Jun 21, 2024

With the new commit, I was able to handle LLAP as well, and I updated the PR description with the beeline outputs.
I also removed NODE_HOSTS_USED_COUNT: handling it separately from NODE_USED_COUNT felt a bit useless and confusing.

@ayushtkn: I would appreciate a second look when you have the chance.
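
As background on how a host-level counter could differ from a node-level one: a YARN NodeId combines a host and a port, so two NodeManagers on the same host are two nodes but one host. The sketch below only illustrates that distinction; it assumes node IDs written as "host:port" strings, and `NodeVersusHostCount` is a hypothetical helper, not code from this patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper contrasting distinct nodes with distinct hosts.
class NodeVersusHostCount {
  // Returns {distinct node IDs, distinct hosts}; each ID is assumed to be "host:port".
  static int[] distinctCounts(String... nodeIds) {
    Set<String> nodes = new HashSet<>();
    Set<String> hosts = new HashSet<>();
    for (String nodeId : nodeIds) {
      nodes.add(nodeId);
      // Strip the trailing ":port" to get the host part.
      hosts.add(nodeId.substring(0, nodeId.lastIndexOf(':')));
    }
    return new int[] {nodes.size(), hosts.size()};
  }

  public static void main(String[] args) {
    int[] counts = distinctCounts("host1:8041", "host1:8042", "host2:8041");
    // Two NodeManagers share host1, so there are 3 nodes but only 2 hosts.
    System.out.println("nodes=" + counts[0] + ", hosts=" + counts[1]);
  }
}
```

On clusters with one NodeManager per host the two numbers coincide, which is consistent with the comment that tracking them separately adds little.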

Change-Id: Ic72fbe3d490e2729ece74f07b42891d1c191b1c5

@tez-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 🆗 | reexec | 22m 31s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | No case conflicting files found. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
| | | | _ master Compile Tests _ |
| +0 🆗 | mvndep | 5m 47s | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 10m 2s | master passed |
| +1 💚 | compile | 1m 22s | master passed with JDK Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1 |
| +1 💚 | compile | 1m 15s | master passed with JDK Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08 |
| +1 💚 | checkstyle | 1m 29s | master passed |
| +1 💚 | javadoc | 1m 15s | master passed with JDK Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1 |
| +1 💚 | javadoc | 1m 6s | master passed with JDK Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08 |
| +0 🆗 | spotbugs | 1m 20s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 💚 | findbugs | 2m 59s | master passed |
| | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 10s | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 0m 50s | the patch passed |
| +1 💚 | compile | 0m 53s | the patch passed with JDK Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1 |
| +1 💚 | javac | 0m 53s | the patch passed |
| +1 💚 | compile | 0m 48s | the patch passed with JDK Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08 |
| +1 💚 | javac | 0m 48s | the patch passed |
| +1 💚 | checkstyle | 0m 34s | the patch passed |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | javadoc | 0m 34s | the patch passed with JDK Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1 |
| +1 💚 | javadoc | 0m 36s | the patch passed with JDK Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08 |
| +1 💚 | findbugs | 2m 16s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | unit | 2m 19s | tez-api in the patch passed. |
| +1 💚 | unit | 5m 13s | tez-dag in the patch passed. |
| +1 💚 | asflicense | 0m 26s | The patch does not generate ASF License warnings. |
| | | 63m 48s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-362/9/artifact/out/Dockerfile |
| GITHUB PR | #362 |
| JIRA Issue | TEZ-4554 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux 7f89bbf01fa4 5.15.0-106-generic #116-Ubuntu SMP Wed Apr 17 09:17:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 0ac505b |
| Default Java | Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.23+9-post-Ubuntu-1ubuntu122.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_412-8u412-ga-1~22.04.1-b08 |
| Test Results | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-362/9/testReport/ |
| Max. process+thread count | 367 (vs. ulimit of 5500) |
| modules | C: tez-api tez-dag U: . |
| Console output | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-362/9/console |
| versions | git=2.34.1 maven=3.6.3 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@ayushtkn ayushtkn left a comment

LGTM

@abstractdog abstractdog merged commit 19b2351 into apache:master Jun 25, 2024
4 checks passed