-
Notifications
You must be signed in to change notification settings - Fork 751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GOBBLIN-1841] Move disabling of current live instances to the GobblinClusterManager startup #3708
[GOBBLIN-1841] Move disabling of current live instances to the GobblinClusterManager startup #3708
Conversation
gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinClusterManagerTest.java
Outdated
Show resolved
Hide resolved
gobblin-cluster/src/test/java/org/apache/gobblin/cluster/GobblinClusterManagerTest.java
Outdated
Show resolved
Hide resolved
@@ -504,6 +506,18 @@ private static void printUsage(Options options) { | |||
formatter.printHelp(GobblinClusterManager.class.getSimpleName(), options); | |||
} | |||
|
|||
public void disableLiveHelixInstances() { | |||
HelixManager helixManager = this.multiManager.getJobClusterHelixManager(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check that this is actually connected before using it? If you check the multimanager, it only connects if we are in dedicated mode.
So would this throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further investigation, this job cluster manager should be connected elsewhere. But I see another issue where you are fetching all live instances (include itself!). Which means that when the AM calls this method, it will disable itself.
Please take a look at the below code https://github.com/apache/gobblin/blob/5af6bca57df909e44b995e5b2d667c70e0399877/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java#L187C1-L196C6
This code fetches all live participants that are task runners. We should only disable taskrunners, so let's do that instead.
And then we can also rename this method to disableTaskRunnersFromPreviousExecutions
.
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinClusterManager.java
Outdated
Show resolved
Hide resolved
gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinClusterManager.java
Outdated
Show resolved
Hide resolved
d20bf51
to
e75a235
Compare
e75a235
to
0bb15ad
Compare
@@ -504,6 +506,18 @@ private static void printUsage(Options options) { | |||
formatter.printHelp(GobblinClusterManager.class.getSimpleName(), options); | |||
} | |||
|
|||
public void disableLiveHelixInstances() { | |||
HelixManager helixManager = this.multiManager.getJobClusterHelixManager(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further investigation, this job cluster manager should be connected elsewhere. But I see another issue where you are fetching all live instances (include itself!). Which means that when the AM calls this method, it will disable itself.
Please take a look at the below code https://github.com/apache/gobblin/blob/5af6bca57df909e44b995e5b2d667c70e0399877/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java#L187C1-L196C6
This code fetches all live participants that are task runners. We should only disable taskrunners, so let's do that instead.
And then we can also rename this method to disableTaskRunnersFromPreviousExecutions
.
Codecov Report
@@ Coverage Diff @@
## master #3708 +/- ##
============================================
+ Coverage 46.97% 49.23% +2.25%
+ Complexity 10794 9314 -1480
============================================
Files 2138 1760 -378
Lines 84132 68853 -15279
Branches 9356 7845 -1511
============================================
- Hits 39518 33897 -5621
+ Misses 41015 31808 -9207
+ Partials 3599 3148 -451
... and 431 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
f497a70
to
c5d09c6
Compare
…inApplicationMaster
c5d09c6
to
7914810
Compare
@@ -504,6 +516,7 @@ private static void printUsage(Options options) { | |||
formatter.printHelp(GobblinClusterManager.class.getSimpleName(), options); | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove random whitespace. This will probably fail the linter
@@ -272,7 +260,7 @@ void runInternal() { | |||
} | |||
// Find all participants appearing in this cluster. Note that Helix instances can contain cluster-manager | |||
// and potentially replanner-instance. | |||
Set<String> allParticipants = getParticipants(HELIX_YARN_INSTANCE_NAME_PREFIX); | |||
Set<String> allParticipants = HelixUtils.getParticipants(helixDataAccessor,HELIX_YARN_INSTANCE_NAME_PREFIX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add whitespace after the comma. This will probably fail the linter
int instanceCount = 3; | ||
Map<String, HelixProperty> mockChildValuesMap = new HashMap<>(); | ||
for (int i = 0; i < instanceCount; i++) { | ||
mockChildValuesMap.put("GobblinYarnTaskRunner_TestInstance_" + i, Mockito.mock(HelixProperty.class)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also be adding non GobblinYarnTaskRunner
s as part of the cluster. E.g. GobblinClusterManager
. Because we'd only want to disable those that start with GobblinYarnTaskRunner
.
And then in your mockito verify, you should verify we never call disable on GobblinClusterManager
Set<String> taskRunners = HelixUtils.getParticipants(helixDataAccessor, | ||
GobblinYarnTaskRunner.HELIX_YARN_INSTANCE_NAME_PREFIX); | ||
LOGGER.warn("Found {} task runners in the cluster.", taskRunners.size()); | ||
for (String taskRunner: taskRunners) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace before the :
. But this will probably be caught by the linter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Great work 🚢
/** | ||
* A method to disable pre-existing live instances in a Helix cluster. This can happen when a previous Yarn application | ||
* leaves behind orphaned Yarn worker processes. Since Helix does not provide an API to drop a live instance, we use | ||
* the disable instance API to fence off these orphaned instances and prevent them from becoming participants in the | ||
* new cluster. | ||
* | ||
* NOTE: this is a workaround for an existing YARN bug. Once YARN has a fix to guarantee container kills on application | ||
* completion, this method should be removed. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add this comment back to the new impl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make sure to add back the comment from the previous impl
…utions implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
* upstream/master: Fix bug with total count watermark whitelist (apache#3724) [GOBBLIN-1858] Fix logs relating to multi-active lease arbiter (apache#3720) [GOBBLIN-1838] Introduce total count based completion watermark (apache#3701) Correct num of failures (apache#3722) [GOBBLIN- 1856] Add flow trigger handler leasing metrics (apache#3717) [GOBBLIN-1857] Add override flag to force generate a job execution id based on gobbl… (apache#3719) [GOBBLIN-1855] Metadata writer tests do not work in isolation after upgrading to Iceberg 1.2.0 (apache#3718) Remove unused ORC writer code (apache#3710) [GOBBLIN-1853] Reduce # of Hive calls during schema related updates (apache#3716) [GOBBLIN-1851] Unit tests for MysqlMultiActiveLeaseArbiter with Single Participant (apache#3715) [GOBBLIN-1848] Add tags to dagmanager metrics for extensibility (apache#3712) [GOBBLIN-1849] Add Flow Group & Name to Job Config for Job Scheduler (apache#3713) [GOBBLIN-1841] Move disabling of current live instances to the GobblinClusterManager startup (apache#3708) [GOBBLIN-1840] Helix Job scheduler should not try to replace running workflow if within configured time (apache#3704) [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched (apache#3711) [GOBBLIN-1842] Add timers to GobblinMCEWriter (apache#3703) [GOBBLIN-1844] Ignore workflows marked for deletion when calculating container count (apache#3709) [GOBBLIN-1846] Validate Multi-active Scheduler with Logs (apache#3707) [GOBBLIN-1845] Changes parallelstream to stream in DatasetsFinderFilteringDecorator to avoid classloader issues in spark (apache#3706) [GOBBLIN-1843] Utility for detecting non optional unions should convert dataset urn to hive compatible format (apache#3705) [GOBBLIN-1837] Implement multi-active, non blocking for leader host (apache#3700) [GOBBLIN-1835]Upgrade Iceberg Version from 0.11.1 to 1.2.0 (apache#3697) Update CHANGELOG to reflect changes in 0.17.0 Reserving 0.18.0 version for next release [GOBBLIN-1836] Ensuring Task Reliability: Handling Job Cancellation and Graceful Exits for Error-Free Completion (apache#3699) [GOBBLIN-1805] Check watermark for the most recent hour for quiet topics (apache#3698) [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail (apache#3687) [GOBBLIN-1823] Improving Container Calculation and Allocation Methodology (apache#3692) [GOBBLIN-1830] Improving Container Transition Tracking in Streaming Data Ingestion (apache#3693) [GOBBLIN-1833]Emit Completeness watermark information in snapshotCommitEvent (apache#3696)
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
The method
disableLiveHelixInstances()
originally exists inGobblinYarnAppLauncher
to disable pre-existing live instances in a Helix cluster. However, there can be an edge case when theGobblinYarnAppLauncher
is bypassed due to a lost Application Master, like an automated replacement of the Application Master by Yarn, which will bypass the Azkaban Yarn App Launcher. In such cases, thedisableLiveHelixInstances()
will be not invoked.Thus, we moved this method to
GobblinClusterManager
, which is at the Application Master level, and invokes it at the start stage. In this way, we can guarantee that it won't miss the edge case, and each time these pre-existing live instances can be disabled.Tests
Commits