Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-1841] Move disabling of current live instances to the GobblinClusterManager startup #3708

Conversation

Peiyingy
Copy link
Contributor

@Peiyingy Peiyingy commented Jun 21, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

The method disableLiveHelixInstances() originally exists in GobblinYarnAppLauncher to disable pre-existing live instances in a Helix cluster. However, there can be an edge case when the GobblinYarnAppLauncher is bypassed due to a lost Application Master, like an automated replacement of the Application Master by Yarn, which will bypass the Azkaban Yarn App Launcher. In such cases, the disableLiveHelixInstances() will be not invoked.
Thus, we moved this method to GobblinClusterManager, which is at the Application Master level, and invokes it at the start stage. In this way, we can guarantee that it won't miss the edge case, and each time these pre-existing live instances can be disabled.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
Screenshot 2023-06-21 at 10 21 26 AM

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@@ -504,6 +506,18 @@ private static void printUsage(Options options) {
formatter.printHelp(GobblinClusterManager.class.getSimpleName(), options);
}

public void disableLiveHelixInstances() {
HelixManager helixManager = this.multiManager.getJobClusterHelixManager();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check that this is actually connected before using it? If you check the multimanager, it only connects if we are in dedicated mode.

So would this throw an exception?

Copy link
Contributor

@homatthew homatthew Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further investigation, this job cluster manager should be connected elsewhere. But I see another issue where you are fetching all live instances (include itself!). Which means that when the AM calls this method, it will disable itself.

Please take a look at the below code https://github.com/apache/gobblin/blob/5af6bca57df909e44b995e5b2d667c70e0399877/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java#L187C1-L196C6

This code fetches all live participants that are task runners. We should only disable taskrunners, so let's do that instead.

And then we can also rename this method to disableTaskRunnersFromPreviousExecutions.

@Peiyingy Peiyingy force-pushed the py-move-disabling-of-current-live-instances-GOBBLIN-1841 branch from d20bf51 to e75a235 Compare June 23, 2023 00:33
@Peiyingy Peiyingy force-pushed the py-move-disabling-of-current-live-instances-GOBBLIN-1841 branch from e75a235 to 0bb15ad Compare June 23, 2023 00:35
@@ -504,6 +506,18 @@ private static void printUsage(Options options) {
formatter.printHelp(GobblinClusterManager.class.getSimpleName(), options);
}

public void disableLiveHelixInstances() {
HelixManager helixManager = this.multiManager.getJobClusterHelixManager();
Copy link
Contributor

@homatthew homatthew Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further investigation, this job cluster manager should be connected elsewhere. But I see another issue where you are fetching all live instances (include itself!). Which means that when the AM calls this method, it will disable itself.

Please take a look at the below code https://github.com/apache/gobblin/blob/5af6bca57df909e44b995e5b2d667c70e0399877/gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnAutoScalingManager.java#L187C1-L196C6

This code fetches all live participants that are task runners. We should only disable taskrunners, so let's do that instead.

And then we can also rename this method to disableTaskRunnersFromPreviousExecutions.

@codecov-commenter
Copy link

codecov-commenter commented Jun 23, 2023

Codecov Report

Merging #3708 (fabb39f) into master (51a852d) will increase coverage by 2.25%.
The diff coverage is 77.77%.

@@             Coverage Diff              @@
##             master    #3708      +/-   ##
============================================
+ Coverage     46.97%   49.23%   +2.25%     
+ Complexity    10794     9314    -1480     
============================================
  Files          2138     1760     -378     
  Lines         84132    68853   -15279     
  Branches       9356     7845    -1511     
============================================
- Hits          39518    33897    -5621     
+ Misses        41015    31808    -9207     
+ Partials       3599     3148     -451     
Impacted Files Coverage Δ
...in/java/org/apache/gobblin/cluster/HelixUtils.java 47.87% <0.00%> (+0.35%) ⬆️
...rg/apache/gobblin/yarn/GobblinYarnAppLauncher.java 26.08% <ø> (+0.58%) ⬆️
.../apache/gobblin/cluster/GobblinClusterManager.java 53.70% <75.00%> (+0.43%) ⬆️
.../apache/gobblin/yarn/GobblinApplicationMaster.java 31.11% <100.00%> (+12.69%) ⬆️
...rg/apache/gobblin/yarn/YarnAutoScalingManager.java 57.03% <100.00%> (+0.41%) ⬆️

... and 431 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Peiyingy Peiyingy force-pushed the py-move-disabling-of-current-live-instances-GOBBLIN-1841 branch 2 times, most recently from f497a70 to c5d09c6 Compare June 26, 2023 21:01
@Peiyingy Peiyingy force-pushed the py-move-disabling-of-current-live-instances-GOBBLIN-1841 branch from c5d09c6 to 7914810 Compare June 26, 2023 21:04
@@ -504,6 +516,7 @@ private static void printUsage(Options options) {
formatter.printHelp(GobblinClusterManager.class.getSimpleName(), options);
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove random whitespace. This will probably fail the linter

@@ -272,7 +260,7 @@ void runInternal() {
}
// Find all participants appearing in this cluster. Note that Helix instances can contain cluster-manager
// and potentially replanner-instance.
Set<String> allParticipants = getParticipants(HELIX_YARN_INSTANCE_NAME_PREFIX);
Set<String> allParticipants = HelixUtils.getParticipants(helixDataAccessor,HELIX_YARN_INSTANCE_NAME_PREFIX);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add whitespace after the comma. This will probably fail the linter

int instanceCount = 3;
Map<String, HelixProperty> mockChildValuesMap = new HashMap<>();
for (int i = 0; i < instanceCount; i++) {
mockChildValuesMap.put("GobblinYarnTaskRunner_TestInstance_" + i, Mockito.mock(HelixProperty.class));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also be adding non GobblinYarnTaskRunners as part of the cluster. E.g. GobblinClusterManager. Because we'd only want to disable those that start with GobblinYarnTaskRunner.

And then in your mockito verify, you should verify we never call disable on GobblinClusterManager

Set<String> taskRunners = HelixUtils.getParticipants(helixDataAccessor,
GobblinYarnTaskRunner.HELIX_YARN_INSTANCE_NAME_PREFIX);
LOGGER.warn("Found {} task runners in the cluster.", taskRunners.size());
for (String taskRunner: taskRunners) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace before the :. But this will probably be caught by the linter

Copy link
Contributor

@homatthew homatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great work 🚢

Comment on lines -543 to -551
/**
* A method to disable pre-existing live instances in a Helix cluster. This can happen when a previous Yarn application
* leaves behind orphaned Yarn worker processes. Since Helix does not provide an API to drop a live instance, we use
* the disable instance API to fence off these orphaned instances and prevent them from becoming participants in the
* new cluster.
*
* NOTE: this is a workaround for an existing YARN bug. Once YARN has a fix to guarantee container kills on application
* completion, this method should be removed.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this comment back to the new impl

Copy link
Contributor

@homatthew homatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make sure to add back the comment from the previous impl

Copy link
Contributor

@ZihanLi58 ZihanLi58 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ZihanLi58 ZihanLi58 merged commit 3d14b86 into apache:master Jun 30, 2023
phet added a commit to phet/gobblin that referenced this pull request Aug 15, 2023
* upstream/master:
  Fix bug with total count watermark whitelist (apache#3724)
  [GOBBLIN-1858] Fix logs relating to multi-active lease arbiter (apache#3720)
  [GOBBLIN-1838] Introduce total count based completion watermark (apache#3701)
  Correct num of failures (apache#3722)
  [GOBBLIN- 1856] Add flow trigger handler leasing metrics (apache#3717)
  [GOBBLIN-1857] Add override flag to force generate a job execution id based on gobbl… (apache#3719)
  [GOBBLIN-1855] Metadata writer tests do not work in isolation after upgrading to Iceberg 1.2.0 (apache#3718)
  Remove unused ORC writer code (apache#3710)
  [GOBBLIN-1853] Reduce # of Hive calls during schema related updates (apache#3716)
  [GOBBLIN-1851] Unit tests for MysqlMultiActiveLeaseArbiter with Single Participant (apache#3715)
  [GOBBLIN-1848] Add tags to dagmanager metrics for extensibility (apache#3712)
  [GOBBLIN-1849] Add Flow Group & Name to Job Config for Job Scheduler (apache#3713)
  [GOBBLIN-1841] Move disabling of current live instances to the GobblinClusterManager startup (apache#3708)
  [GOBBLIN-1840] Helix Job scheduler should not try to replace running workflow if within configured time (apache#3704)
  [GOBBLIN-1847] Exceptions in the JobLauncher should try to delete the existing workflow if it is launched (apache#3711)
  [GOBBLIN-1842] Add timers to GobblinMCEWriter (apache#3703)
  [GOBBLIN-1844] Ignore workflows marked for deletion when calculating container count (apache#3709)
  [GOBBLIN-1846] Validate Multi-active Scheduler with Logs (apache#3707)
  [GOBBLIN-1845] Changes parallelstream to stream in DatasetsFinderFilteringDecorator  to avoid classloader issues in spark (apache#3706)
  [GOBBLIN-1843] Utility for detecting non optional unions should convert dataset urn to hive compatible format (apache#3705)
  [GOBBLIN-1837] Implement multi-active, non blocking for leader host (apache#3700)
  [GOBBLIN-1835]Upgrade Iceberg Version from 0.11.1 to 1.2.0 (apache#3697)
  Update CHANGELOG to reflect changes in 0.17.0
  Reserving 0.18.0 version for next release
  [GOBBLIN-1836] Ensuring Task Reliability: Handling Job Cancellation and Graceful Exits for Error-Free Completion (apache#3699)
  [GOBBLIN-1805] Check watermark for the most recent hour for quiet topics (apache#3698)
  [GOBBLIN-1825]Hive retention job should fail if deleting underlying files fail (apache#3687)
  [GOBBLIN-1823] Improving Container Calculation and Allocation Methodology (apache#3692)
  [GOBBLIN-1830] Improving Container Transition Tracking in Streaming Data Ingestion (apache#3693)
  [GOBBLIN-1833]Emit Completeness watermark information in snapshotCommitEvent (apache#3696)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants