Include additional helpful information in default DataUpload/DataDownload display. #8095

mrnold · 2024-08-08T00:26:39Z

Describe the problem/challenge you have
We had users evaluating backup performance by watching the progress of DataUploads, basically like this:

$ oc get dataupload --all-namespaces -w
NAMESPACE NAME STATUS STARTED BYTES DONE TOTAL BYTES STORAGE LOCATION AGE NODE
openshift-adp test ts-velero-test-1 0s
openshift-adp test ts-velero-test-1 0s
openshift-adp test Accepted ts-velero-test-1 0s
openshift-adp test Prepared ts-velero-test-1 62s ip.lan
openshift-adp test InProgress 0s ts-velero-test-1 62s ip.lan
openshift-adp test InProgress 13s 555745280 ts-velero-test-1 75s ip.lan
openshift-adp test InProgress 23s 1073741824 ts-velero-test-1 85s ip.lan
openshift-adp test InProgress 24s 1073741824 1073741824 ts-velero-test-1 86s ip.lan
openshift-adp test Completed 24s 1073741824 1073741824 ts-velero-test-1 86s ip.lan

It would have been helpful to see a little more information about the transfer, to help identify where to allocate resources for their cluster.

Describe the solution you'd like
I would like to add another field or two to this output so we can see elapsed time and transfer rates at a glance. This would help them estimate what backups can fit in their time window. It would also be good to see a count of the actual bytes moved over the network, taking kopia's incremental snapshots into account.

Anything else you would like to add:
I have not yet researched what kind of progress reporting kopia provides, please chime in if this sounds infeasible.

Environment:

Velero version (use velero version):
Kubernetes version (use kubectl version):
Kubernetes installer & version:
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):

Vote on this issue!

👍 for "The project would be better with this feature added"
👎 for "This feature will not enhance the project in a meaningful way"

The text was updated successfully, but these errors were encountered:

reasonerjt · 2024-08-12T06:30:00Z

elapsed time - This is equivalent to the value in STARTED field
transfer rate - This is more straight-forward, but I'm not sure if we can easily map it to a field in the CRD.

Given the information has been provided in the CR of dataupload I wanna tentatively move it to icebox

sseago · 2024-08-12T13:32:00Z

@reasonerjt elapsed time is only equivalent to STARTED before the DU completes. The main point here is to make it easy for a user looking at kubectl get du output to see how long each DU took to process.

mrnold · 2024-08-20T02:26:50Z

I opened #8128 to hold some work that will eventually show the difference in a clearer way.

Lyndon-Li · 2024-08-20T10:52:12Z

@mrnold
I don't think we need to add any new fields in DUCR/DDCR for this issue:
elapsed time = [current time] - startTimeStamp = STARTED
transfer rate = doneBytes / elapsed time

What we need here -- startTimeStamp and doneBytes have both included in the DUCR/DDCR.

Therefore, we can see minor gap for transfer rate (elapsed time is already there as STARTED column). But the real gap is how to use the kubuilder marker +kubebuilder:printcolumn to show the calculation result.
So please check with this direction first, we don't want to add redundant info into DUCR/DDCR.

Lyndon-Li · 2024-08-20T10:58:42Z

@sseago

elapsed time is only equivalent to STARTED before the DU completes.

I think the output of kubectl get du is only meaningful to show realtime info during the running of DU/DD. Once DU/DD finishes, to check the static info, velero backup describe is a better choice.
E.g., DU/DD is not a necessary resource after it completes, so there is no guarantee that users could see it.

mrnold · 2024-08-21T01:13:37Z

Just to summarize a few comments from the community meeting: one of the goals is to allow users to see that incremental backups are working. This is more important than showing a transfer rate. I think doneBytes might not be enough to show that successive backups transfer less data than an initial backup. I am looking into what options Kopia has for showing bytes moved.

I will see if I can get printcolumn to show calculated results, but I haven't yet figured out a way to do this without changing the CRD.

Lyndon-Li · 2024-08-21T09:18:40Z

@mrnold

allow users to see that incremental backups are working

The cachedBytes as the code here could fulfill this purpose. It means the bytes that has been skipped due to the incremental behavior, it monotonically increases along the running of the backup.

The current code has retrieved this info but not set to DUCR, so we need to add one more field to DUCR's DataUploadStatus and also add a new column through +kubebuilder:printcolumn

sseago · 2024-08-21T14:50:55Z

@sseago

elapsed time is only equivalent to STARTED before the DU completes.

I think the output of kubectl get du is only meaningful to show realtime info during the running of DU/DD. Once DU/DD finishes, to check the static info, velero backup describe is a better choice. E.g., DU/DD is not a necessary resource after it completes, so there is no guarantee that users could see it.

Hmm. OK -- so maybe the answer is to enhance backup describe to include the elapsed time for each du? What we want here is some way for a user to see in one place, how long each of the du operations took. This could be in the output of kubectl get du or it could be in the output of velero backup describe. Same for the other new field we want to add around transferred and/or cached bytes.

@Lyndon-Li Does moving this functionality to backup describe resolve your objections?
@mrnold Does backup describe work for your use case as well as kubectl get du?

sseago · 2024-08-21T15:44:00Z

Also, even during current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed completionTimestamp - startTimestamp else current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.

Lyndon-Li · 2024-08-21T16:03:51Z

@sseago Let me conclude the info I think we need to have for the backup/restore performance observability:

elapsed time

It means the total elapsed time for the data transfer (it starts to count only after data transfer starts)
If we want to check it realtime during the backup running though kubectl get du, we already have it, that is, the STARTED column
We also want to have a permanent statistic after backup completes, then we need to have it in velero backup describe because DU doesn't guarantee existing after backup completes and it is not so meaningful to check static info through kubectl get du

cached bytes

It means the total bytes for the files that have been skipped by the incremental backup
For full backup, it should be equal to total bytes and for incremental backup, it should normally be less than totalBytes
It changes from time to time during the backup running
We probably need to change the name. "cached bytes" is not straightforward

throughput

It gives the most intuitive observability for how the backup performs
Some software make it as doneBytes/elapsedTime; some others make it as (doneBytes - cachedBytes) / elapsedTime. Both is acceptable, the latter just makes a trick to make the performance looks better

For all these info:

First thing first is to have them in velero backup describe. To make it, I think we can add the info to VolumeBackupInfo/VolumeRestoreInfo, so that we don't need to download every DU/DD during the describe.
It is icing on the cake to have them with kubectl get du. Because kubectl get du is for realtime observability only, so it improves user experience since users see the info dynamically changed and earlier.

Lyndon-Li · 2024-08-21T16:11:12Z

Also, even during current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed completionTimestamp - startTimestamp else current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.

We don't need to use startTimestamp to calculate elapsedTime. The STARTED column is already elapsedTime (before the backup/restore completes).

sseago · 2024-08-21T16:16:20Z

Also, even during current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed completionTimestamp - startTimestamp else current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.

We don't need to use startTimestamp to calculate elapsedTime. The STARTED column is already elapsedTime (before the backup/restore completes).

It isn't though. If you have 2 DUs that run sequentially in a backup (on same node, with parallel execution disabled), and each takes 10 minutes, at minute 19, you'll see DU1 showing STARTED at "19m" and DU2 showing STARTED at "9m" -- but elapsed time should tell us how long the DU has been in progress until completion -- so those fields would need to show "10m" and "9m".

Lyndon-Li · 2024-08-21T16:22:40Z

U1 showing STARTED at "19m" and DU2 showing STARTED at "9m"

Yes, it is. but just keep my point in mind that kubectl get du is only meaningful during that DU/DD running. Once it completes, we should check with velero backup describe. If you agree, we will not need to add one more field in DUCR/DDCR.

sseago · 2024-08-21T16:31:38Z

U1 showing STARTED at "19m" and DU2 showing STARTED at "9m"

Yes, it is. but just keep my point in mind that kubectl get du is only meaningful during that DU/DD running. Once it completes, we should check with velero backup describe. If you agree, we will not need to add one more field in DUCR/DDCR.

OK, lets step back a bit. If I'm a user monitoring a running backup and my main concern is DU progress, I want to look in one place to see:

which DUs are currently in progress
for this backup, how long has each DU taken (either to completion or to now, still running), how much is done (for still running ones), and total bytes (and eventually some data indicating skipped/transferred for incremental volume backup)

Currently, the only place to see any of this is kubectl get du. There we see in progress status, bytesDone, totalBytes, but we don't see the correct elapsed time for completed, only for in-progress.

One possibility is to enhance the du display columns. Of course, to get elapsed time would require a new field.

Another would be to enhance velero backup describe -- perhaps a du summary which gives similar information to kubectl get du, with proper calculated field for elapsed time. If done here, we wouldn't need the CRD change. But the important point is that this would need to work for in-progress backups as well as completed ones, since this is needed for real-time observability.

Either way, we need a real-time view (changes while backup and du progress) that shows completed, in-progress, and queued DUs for the current backup with elapsed time, bytes progress, etc. as appropriate.

Lyndon-Li · 2024-08-21T16:44:29Z

@sseago After a further thinking, below may be better for elapsedTime:

Add the elapsedTime field into DUCR/DDCR
Make the STARTED column show the time since the DU/DD is Accepted

Use elapsedTime over STARTED because it is updated simultaneously with doneBytes and cachedBytes. Then when we calculate throughput, the result is always correct; while STARTED or currentTime - startTimestamp is updated by kubernetes, if we use it to calculate throughput the number may be weirdly jumping if doneBytes / cachedBytes is not updated timely.

Changing STARTED because it is duplicated with elapsedTime. But rather than simply removing it, we can change it to express other info --- the time to prepare the snasphot/restored volume.

Lyndon-Li added area/datamover kind/requirement labels Aug 8, 2024

Lyndon-Li self-assigned this Aug 8, 2024

reasonerjt added the Icebox We see the value, but it is not slated for the next couple releases. label Aug 12, 2024

mrnold linked a pull request Aug 20, 2024 that will close this issue

[WIP] Include total transfer time in DataUpload/DataDownload columns. #8128

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Include additional helpful information in default DataUpload/DataDownload display. #8095

Include additional helpful information in default DataUpload/DataDownload display. #8095

mrnold commented Aug 8, 2024

reasonerjt commented Aug 12, 2024

sseago commented Aug 12, 2024

mrnold commented Aug 20, 2024

Lyndon-Li commented Aug 20, 2024 •

edited

Loading

Lyndon-Li commented Aug 20, 2024 •

edited

Loading

mrnold commented Aug 21, 2024

Lyndon-Li commented Aug 21, 2024 •

edited

Loading

sseago commented Aug 21, 2024

sseago commented Aug 21, 2024

Lyndon-Li commented Aug 21, 2024 •

edited

Loading

Lyndon-Li commented Aug 21, 2024

sseago commented Aug 21, 2024

Lyndon-Li commented Aug 21, 2024

sseago commented Aug 21, 2024 •

edited

Loading

Lyndon-Li commented Aug 21, 2024

Include additional helpful information in default DataUpload/DataDownload display. #8095

Include additional helpful information in default DataUpload/DataDownload display. #8095

Comments

mrnold commented Aug 8, 2024

reasonerjt commented Aug 12, 2024

sseago commented Aug 12, 2024

mrnold commented Aug 20, 2024

Lyndon-Li commented Aug 20, 2024 • edited Loading

Lyndon-Li commented Aug 20, 2024 • edited Loading

mrnold commented Aug 21, 2024

Lyndon-Li commented Aug 21, 2024 • edited Loading

sseago commented Aug 21, 2024

sseago commented Aug 21, 2024

Lyndon-Li commented Aug 21, 2024 • edited Loading

Lyndon-Li commented Aug 21, 2024

sseago commented Aug 21, 2024

Lyndon-Li commented Aug 21, 2024

sseago commented Aug 21, 2024 • edited Loading

Lyndon-Li commented Aug 21, 2024

Lyndon-Li commented Aug 20, 2024 •

edited

Loading

Lyndon-Li commented Aug 20, 2024 •

edited

Loading

Lyndon-Li commented Aug 21, 2024 •

edited

Loading

Lyndon-Li commented Aug 21, 2024 •

edited

Loading

sseago commented Aug 21, 2024 •

edited

Loading