Use Flink API to retrieve savepoint name #22

Nick-Triller · 2018-11-09T07:35:20Z

Thank you for this Project.

The PR changes the code in such a way that the Flink API endpoint /jobs/:jobid/savepoints/:triggerid is used to retrieve the name of the latest savepoint when updating a job.

How I imagine the deployer API should be used:

Intent	Deployer command(s)	Deployer behaviour
Update running job with new version (using an up-to-date savepoint)	`update` with job-name-base	Deployer cancels job and creates a savepoint in one operation. Deployer starts new job with savepoint.
Restart a job from an existing savepoint	`cancel` (not implemented yet) + `deploy` with savepoint-path	Deployer cancels the running job. Deployer starts a new job with the savepoint defined by a CLI argument.
Start a job without using a savepoint	`deploy`	Deployer starts new job without savepoint.

I removed the logic for retrieval of the most current savepoint through the file system completely. I think using the Flink API is a more flexible approach because you don't need to mount a volume with the snapshots, therefore allowing the usage of different storage solutions for savepoints / checkpoints such as blob stores like AWS S3 or Google Cloud Storage.

Also, the job name base mechanism didn't seem functional yet. It should work as advertised now.

Docker image: https://hub.docker.com/r/nicktriller/flink-deployer/

codecov-io · 2018-11-09T08:10:22Z

Codecov Report

Merging #22 into master will decrease coverage by 2.44%.
The diff coverage is 95.23%.

@@            Coverage Diff            @@
##           master     #22      +/-   ##
=========================================
- Coverage   55.85%   53.4%   -2.45%     
=========================================
  Files          12      11       -1     
  Lines         478     440      -38     
=========================================
- Hits          267     235      -32     
+ Misses        172     168       -4     
+ Partials       39      37       -2

Impacted Files	Coverage Δ
cmd/cli/main.go	`27.18% <0%> (-0.78%)`	⬇️
cmd/cli/operations/update_job.go	`88.73% <100%> (-0.61%)`	⬇️
cmd/cli/operations/deploy.go	`75% <100%> (-1.32%)`	⬇️
cmd/cli/flink/savepoint.go	`77.77% <100%> (+0.63%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56a4d94...7b7556a. Read the comment docs.

mrooding · 2018-11-23T10:06:20Z

Hi Nick,

Thanks for the contribution. Both me and Niels have had quite a few debates about how to properly implement the savepoint retrieval. Referring to your table, option 2 would work if cancel would return the full savepoint-path. If not, restarting from an existing savepoint would require the caller of the deploy action to know the full path to the savepoint. That means that an external system would need to store the savepoint path and pass it to the deployer. This is something we didn't want to impose on our users immediately.

This problem also applies to supporting a deploy with a savepoint path without cancelling first. There's no way of knowing the full path to the savepoint.

joshuavijay · 2019-09-17T15:46:42Z

@mrooding I think Nicks intentions were correct. We should be retrieving savepoint path from Flink API. There are some edge cases where two consecutive savepoints are triggered, or savepoints are triggered for different jobs (parallel invocation of flink-deployer, or one from flink-deployer and one from flinkUI), where simply retrieving the latest savepoint will not give the correct savepoint required to restart the job. Also, Nick's change support all savepoint schemes that flink supports and will support in the future (s3, hdfs) without having to implement specifically for each scheme like I saw in the other PR created by kerinin. Also currently flink-deployer only supports local file system, where as must flink deployments these days run in k8s where there is no local storage. What we need is part of Nick's changes merged into the latest code base:

Leave the deploy action alone
Update action will not call terminate, but trigger a savepoint and cancel the job as part of that trigger (Nick's change)
Monitor savepoint will return in the response, the location of savepoint (because we monitor by triggerid); there is no chance of retrieving the wrong savepoint.
Update need not call retrieveLatestSavepoint

What do you think? If Nick is busy I can do it.

Nick-Triller · 2019-09-17T21:13:44Z

Hi @joshuavijay!
Thanks for your input. I haven't been working much with Flink in 2019, therefore the details of the relevant mechanisms aren't on the top of my mind. Feel free to reuse any parts that are useful to you.

Nick-Triller added 2 commits November 9, 2018 08:17

Use API to retrieve latest savepoint name

e9f7c40

Fix tests

7b7556a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Flink API to retrieve savepoint name #22

Use Flink API to retrieve savepoint name #22

Nick-Triller commented Nov 9, 2018 •

edited

Loading

codecov-io commented Nov 9, 2018

mrooding commented Nov 23, 2018

joshuavijay commented Sep 17, 2019

Nick-Triller commented Sep 17, 2019 •

edited

Loading

Use Flink API to retrieve savepoint name #22

Are you sure you want to change the base?

Use Flink API to retrieve savepoint name #22

Conversation

Nick-Triller commented Nov 9, 2018 • edited Loading

codecov-io commented Nov 9, 2018

Codecov Report

mrooding commented Nov 23, 2018

joshuavijay commented Sep 17, 2019

Nick-Triller commented Sep 17, 2019 • edited Loading

Nick-Triller commented Nov 9, 2018 •

edited

Loading

Nick-Triller commented Sep 17, 2019 •

edited

Loading