LIVY-231: Multi node HA for batch sessions #222

Open: meisam wants to merge 2 commits into master

Contributor

@meisam meisam commented Nov 4, 2016

This is a preliminary PR for LIVY-231 (https://issues.cloudera.org/browse/LIVY-231) and it has known issues, but we can use it to discuss the design of multi-node HA for Livy.

The PR uses a cache on each Livy node. The cache keeps session metadata in sync with ZooKeeper: any change to the data in ZooKeeper signals the cache, which then updates the local copy of the data on that Livy node.

The cache is implemented using Apache Curator's "Path Cache" recipe: http://curator.apache.org/curator-recipes/path-cache.html.
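
To make the design easier to discuss, here is a minimal in-memory sketch of the sync idea, not the code in this PR. It assumes a hypothetical `SharedStore` standing in for ZooKeeper and a hypothetical `NodeCache` standing in for the per-node cache; `SessionMeta` and the event types are illustrative names as well, and no Curator API is used.

```scala
import scala.collection.mutable

// Hypothetical metadata record; real session metadata would carry more fields.
final case class SessionMeta(id: Int, state: String)

// Change events, loosely modeled on added/updated/removed notifications.
sealed trait StoreEvent
final case class Added(meta: SessionMeta) extends StoreEvent
final case class Updated(meta: SessionMeta) extends StoreEvent
final case class Removed(id: Int) extends StoreEvent

// In-memory stand-in for the shared store (ZooKeeper in the PR).
class SharedStore {
  private val data = mutable.Map.empty[Int, SessionMeta]
  private val listeners = mutable.ListBuffer.empty[StoreEvent => Unit]

  def subscribe(listener: StoreEvent => Unit): Unit = {
    listeners += listener
  }

  def put(meta: SessionMeta): Unit = {
    val event = if (data.contains(meta.id)) Updated(meta) else Added(meta)
    data(meta.id) = meta
    listeners.foreach(l => l(event))
  }

  def remove(id: Int): Unit = {
    // Notify only when something was actually removed.
    if (data.remove(id).isDefined) listeners.foreach(l => l(Removed(id)))
  }
}

// Per-node local copy, kept in sync by the store's change notifications.
class NodeCache(store: SharedStore) {
  private val local = mutable.Map.empty[Int, SessionMeta]

  store.subscribe {
    case Added(m) => local(m.id) = m
    case Updated(m) => local(m.id) = m
    case Removed(id) => local.remove(id)
  }

  def get(id: Int): Option[SessionMeta] = local.get(id)
  def size: Int = local.size
}
```

Two `NodeCache` instances subscribed to the same `SharedStore` see the same session metadata after every `put` or `remove`. In this sketch a node that subscribes late does not receive entries written before it subscribed, so a real implementation would also need an initial load.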

This PR should be revised based on #220 (JIRA ticket: https://issues.cloudera.org/browse/LIVY-239)

@meisam meisam changed the title from "LIVY-231: Multi server batch recovery" to "LIVY-231: Multi node HA for batch sessions" on Nov 4, 2016
@alex-the-man
Contributor

We are trying to get a stable build for 0.3 first for Spark 2.0 and session recovery. Do you mind if we handle HA after 0.3 is released?

@meisam
Contributor Author

meisam commented Nov 8, 2016

That would work.

@meisam meisam force-pushed the multi-server-batch-recovery branch from 62e1cde to b6e52fd on November 8, 2016 19:26
@codecov-io

codecov-io commented Nov 8, 2016

Codecov Report

Merging #222 into master will decrease coverage by 0.79%.

@@            Coverage Diff             @@
##           master     #222      +/-   ##
==========================================
- Coverage   71.53%   70.74%   -0.79%     
==========================================
  Files          91       89       -2     
  Lines        4697     4601      -96     
  Branches      811      780      -31     
==========================================
- Hits         3360     3255     -105     
- Misses        861      910      +49     
+ Partials      476      436      -40
| Impacted Files | Coverage Δ |
|---|---|
| .../com/cloudera/livy/server/batch/BatchSession.scala | 81.92% <ø> (-4.15%) |
| ...la/com/cloudera/livy/sessions/SessionManager.scala | 62.5% <ø> (-18.66%) |
| ...com/cloudera/livy/server/recovery/StateStore.scala | 65.51% <ø> (-10.49%) |
| ...era/livy/server/recovery/BlackholeStateStore.scala | 83.33% <ø> (-16.67%) |
| ...oudera/livy/server/batch/BatchSessionServlet.scala | 87.5% <100%> (+0.54%) |
| ...in/scala/com/cloudera/livy/server/LivyServer.scala | 34.53% <100%> (-0.26%) |
| ...ra/livy/server/recovery/FileSystemStateStore.scala | 66.66% <100%> (+2.83%) |
| ...m/cloudera/livy/server/recovery/SessionStore.scala | 76% <75%> (+4.57%) |
| ...era/livy/server/recovery/ZooKeeperStateStore.scala | 39.36% <8.16%> (-36.73%) |
| ...rc/main/scala/com/cloudera/livy/repl/Session.scala | 56.16% <ø> (-13.07%) |
| ... and 43 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 69ac11e...6e663a8.

@meisam
Contributor Author

meisam commented Nov 8, 2016

I updated the pull request to fix the merge errors.

@alex-the-man alex-the-man added this to the HA milestone Nov 8, 2016
1. Stores the session ID in ZooKeeper/Filesystem.

2. Builds a cache on top of ZooKeeper. The cache keeps the metadata of all
batch sessions across all Livy servers connected to it and notifies every
Livy server of changes in the cache.

3. Adds callback methods from SessionStore to SessionManager.
SessionStore watches events in the ZooKeeper cache and calls the
appropriate callback methods in SessionManager.

Task-url: https://issues.cloudera.org/browse/LIVY-231
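
The callback wiring in step 3 can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `SessionCallbacks`, `SketchSessionManager`, `SketchSessionStore`, the method names, and the string event types are all hypothetical.

```scala
import scala.collection.mutable

// Hypothetical callback interface; method names are illustrative only.
trait SessionCallbacks {
  def onSessionAdded(id: Int): Unit
  def onSessionRemoved(id: Int): Unit
}

// Stand-in for SessionManager: tracks the batch sessions this server knows about.
class SketchSessionManager extends SessionCallbacks {
  private val known = mutable.Set.empty[Int]

  def onSessionAdded(id: Int): Unit = { known += id }
  def onSessionRemoved(id: Int): Unit = { known -= id }

  def sessionIds: Set[Int] = known.toSet
}

// Stand-in for SessionStore: turns raw cache events into callback invocations.
class SketchSessionStore(callbacks: SessionCallbacks) {
  // eventType mimics the added/removed notifications of a watched cache.
  def handleCacheEvent(eventType: String, id: Int): Unit = eventType match {
    case "added" => callbacks.onSessionAdded(id)
    case "removed" => callbacks.onSessionRemoved(id)
    case _ => () // events this sketch does not model are ignored
  }
}
```

Keeping the dispatch in the store and the bookkeeping in the manager means the manager never needs to know where an event originated, which is one possible reading of the split described above.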
Mocks batchSessionsCache in ZooKeeperStateStore.
Without mocking, the Livy tests fail to start a ZooKeeperStateStore.

Task-url: https://issues.cloudera.org/browse/LIVY-231
@meisam meisam force-pushed the multi-server-batch-recovery branch from b6e52fd to 6e663a8 on November 9, 2016 18:27
@alex-the-man alex-the-man modified the milestones: HA, 0.4 Dec 16, 2016
@shenh062326

Hi @meisam, @alex-the-man, is there any progress on this issue? Livy HA is important to us.

@pkasinathan

pkasinathan commented Jun 7, 2017

Hi @shenh062326,

We enabled multi-node HA at PayPal a long time ago and have been using it for close to a year. We have already submitted 1.3 million Spark jobs through Livy multi-node HA.

Just today we presented our updates at Spark Summit and committed to open-sourcing all our enhancements. @meisam will send an updated PR soon, and we will merge it soon after.

Thanks
Prabhu

@harsh829

harsh829 commented Apr 2, 2018

Is there any progress on multi-node HA for interactive sessions?

6 participants