Add design for only restore volume data. #7481
Conversation
@blackpiglet, can you please rebase the PR? It is currently showing what appear to be many unrelated commits. A few comments/questions:
I personally think that in-place restore is going to be more useful, especially if individual files can be selected for restore. If the whole PVC is being created anyway, users will need to restart pods to attach the PVCs, so it is almost like restoring the pods themselves.
@anshulahuja98
@draghuram First, could you share some thoughts about why in-place restore is more useful? I recently also heard some requirements about GitOps and DevOps pipeline scenarios. In those cases, some tools guarantee that the workload in k8s is always running as expected. In-place restore is needed in those cases, but I don't know whether those are common cases for k8s usage. Second, to answer your question.
design/data_only_restore.md
Outdated
- Can filter the volumes that need data restore.

## Non Goals
- Will not support in-place volume data restore. To achieve data integrity, a new PVC and PV are created, and the restore waits for the user to mount the restored volume.
Heard from the US community meeting that this will be moved into Goals.
We would have to discuss how we're handling volumes that are already mounted, or do we say you have to pre-delete/scale down so volumes are not mounted by restore time?
@kaovilai For in-place, we can't scale down unless we create a dummy pod to mount the PVC, since there must be a mounting pod for kopia to have access to it from the node agent.
We already have a dummy-pod pattern from the data mover work to follow.
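To make the pattern concrete, here is a minimal sketch (not Velero's actual implementation; the helper name, image, and mount path are illustrative) of a pause-style pod that exists only to mount the target PVC so the node agent can reach the volume:

package restorehelper

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newDummyPod builds a pod whose only job is to mount the target PVC so the
// node agent can access the volume data. Names, image, and mount path are
// illustrative, not Velero's actual values.
func newDummyPod(namespace, pvcName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "restore-helper-",
			Namespace:    namespace,
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "registry.k8s.io/pause:3.9",
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "target-volume",
					MountPath: "/restore-target",
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "target-volume",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: pvcName,
					},
				},
			}},
		},
	}
}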
// DataOnly specifies whether to only restore volume data and related k8s resources,
// without any other k8s resources.
DataOnly *bool `json:"dataOnly"`
Does this work via CSI only, file system backup only, or both?
We could use a data-mover-like approach where we write to a new PVC using file system backup through a dummy pod.
If we go with the in-place restore, only filesystem backup is supported.
An advanced CSI in-place restore could auto-patch deployments to a new PVC restored from snapshots.
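As a hedged usage sketch only: if the proposed DataOnly field lands as quoted above, a data-only restore request might look roughly like this (the backup and namespace names are made up, and the field does not exist in the current API):

package restorehelper

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// newDataOnlyRestore sketches a Restore that asks for volume data only,
// using the DataOnly field proposed in this design (not yet in the API).
func newDataOnlyRestore() *velerov1.Restore {
	dataOnly := true
	return &velerov1.Restore{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "data-only-restore-example",
			Namespace: "velero",
		},
		Spec: velerov1.RestoreSpec{
			BackupName:         "nightly-backup",
			IncludedNamespaces: []string{"app-ns"},
			DataOnly:           &dataOnly, // proposed field from this design
		},
	}
}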
design/data_only_restore.md
Outdated
// DataOnlyPvcMap is a map of the backed-up PVC name to the data-only
// restored PVC name.
DataOnlyPvcMap map[types.NamespacedName]types.NamespacedName `json:"dataOnlyPvcMap,omitempty"`
It should be possible to add a labelSelector to all resources that the user expects to swap to the restored PVC name, and have Velero do the swapping on labeled (core k8s only for now) resources in the namespace.
kubectl label deployment deployName velero.io/sync-restored-pvc-name=<original-pvc-in-backup-name>
Any PVC restored using this method with a new name will have the annotation velero.io/restored-pvc-original-name=<original name>,
which Velero can use as a reference for subsequent backup/restore PVC name syncing.
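A minimal sketch of what that swapping could look like, assuming the label and annotation keys proposed above (the function and its placement are hypothetical, not part of Velero today):

package restorehelper

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

const (
	// Keys as proposed in this comment thread.
	syncLabelKey           = "velero.io/sync-restored-pvc-name"
	originalNameAnnotation = "velero.io/restored-pvc-original-name"
)

// swapPVCReferences rewrites a labeled Deployment's volumes so that volumes
// pointing at the original PVC name point at the restored PVC instead,
// matching restored PVCs by their original-name annotation.
func swapPVCReferences(deploy *appsv1.Deployment, restoredPVCs []corev1.PersistentVolumeClaim) {
	originalName, ok := deploy.Labels[syncLabelKey]
	if !ok {
		return // this Deployment did not opt in to name syncing
	}
	for _, pvc := range restoredPVCs {
		if pvc.Annotations[originalNameAnnotation] != originalName {
			continue
		}
		volumes := deploy.Spec.Template.Spec.Volumes
		for i := range volumes {
			src := volumes[i].PersistentVolumeClaim
			if src != nil && src.ClaimName == originalName {
				src.ClaimName = pvc.Name
			}
		}
	}
}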
Since we are going with the in-place approach, the name-mapping logic is unnecessary.
Right, for FSB with deleting the PVC prior to restoring to the same name.
@blackpiglet, GitOps is certainly one reason for in-place restore. Another use case is the need to periodically update data in an alternate standby cluster. Finally, if some application files are deleted or corrupted, the user may only want to restore those files.
@draghuram Is this acceptable? Data-only restore cannot support snapshot-based backups.
FSB only is acceptable, though I think restoring from a CSI snapshot to a new PVC and then patching workloads to use the new PVC could be considered in-place, with the caveat that it requires a pod restart.
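For reference, the "restore a CSI snapshot to a new PVC" step uses the standard Kubernetes dataSource mechanism; a rough sketch (the snapshot name, size, and access mode are illustrative) could look like this:

package restorehelper

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvcFromSnapshot builds a new PVC sourced from an existing CSI VolumeSnapshot.
// The workload would then be patched to reference pvcName, which requires a
// pod restart as noted above.
func pvcFromSnapshot(namespace, pvcName, snapshotName string) *corev1.PersistentVolumeClaim {
	apiGroup := "snapshot.storage.k8s.io"
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pvcName,
			Namespace: namespace,
		},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			DataSource: &corev1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     snapshotName,
			},
			// VolumeResourceRequirements requires k8s.io/api v0.29+.
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Gi"),
				},
			},
		},
	}
}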
Thanks for the quick response; I will follow the filesystem-only way for now.
Codecov Report: all modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
##            main    #7481   +/-   ##
=======================================
  Coverage  61.71%   61.71%
=======================================
  Files        263      263
  Lines      28869    28869
=======================================
  Hits       17816    17816
  Misses      9793     9793
  Partials    1260     1260
=======================================
View the full report in Codecov by Sentry.
I went through the discussion further. With that caveat, I want to push for CSI snapshot-based in-place restore. Here, in-place restore refers mainly to detaching the PVC from the workload, deleting the PVC, and re-creating the PVC. This is useful for disaster recovery scenarios.
This proposal also cannot avoid pod downtime, so I suppose it aims to make data-only restore support more types of backups, right?
If no downtime is a must-have feature, we can achieve that by ignoring the PodVolumeRestore's related pod InitContainer check logic, although that would compromise data integrity, and data writes could fail due to conflicts.
I just noticed that this design doesn't cover the consideration for WaitForFirstConsumer volumes; some discussion is here.
Thank you for contributing to Velero!
Please add a summary of your change: this is the design document for issue #7345.
Does your change fix a particular issue? This PR adds the design for issue #7345.
Please indicate you've done the following:
- Added /kind changelog-not-required as a comment on this pull request.
- Updated the corresponding documentation in site/content/docs/main.