
Backup (BR) fails for huge table count #58513

Open
kabileshKabi opened this issue Dec 24, 2024 · 5 comments
Labels
component/br This issue is related to BR of TiDB. type/bug The issue is confirmed as a bug.

Comments

@kabileshKabi

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Issue Description:
Our application is a SaaS-based multi-tenancy application where each tenant has its own database; we have more than 14k databases and more than 600k tables.

We have a strict backup requirement, but when we run a BR full backup it shows no progress and fails with an RPC error.

Command and Log:

```
Starting component br: /home/ec2-user/.tiup/components/br/v8.1.1/br backup full --pd 10.1.3.190:2379 --storage s3://us-chat-db-tidb-backup/test_backupdec_232024 --log-file backupdec232024.log
Detail BR log in backupdec232024.log

Full Backup <..................................................................................................................................................................> 0.00%
```


I am also attaching the detailed logs.


### 2. What did you expect to see? (Required)

We expect the backup to run completely and consistently

### 3. What did you see instead (Required)
We see backup failures; the BR backup does not complete.

### 4. What is your TiDB version? (Required)

```
mysql> select tidb_version();
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version()                                                                                                                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v8.1.1
Edition: Community
Git Commit Hash: a7df4f9845d5d6a590c5d45dad0dcc9f21aa8765
Git Branch: HEAD
UTC Build Time: 2024-08-22 05:49:03
GoVersion: go1.21.13
Race Enabled: false
Check Table Before Drop: false
Store: tikv |
```

[backup_log.txt](https://github.com/user-attachments/files/18239868/backup_log.txt)



@kabileshKabi kabileshKabi added the type/bug The issue is confirmed as a bug. label Dec 24, 2024
@BornChanger
Contributor

BornChanger commented Dec 24, 2024

@kabileshKabi thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?

@BornChanger BornChanger added the component/br This issue is related to BR of TiDB. label Dec 24, 2024
@BornChanger
Contributor

BornChanger commented Dec 24, 2024

Also, your cluster has 600k tables, so the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal. @kabileshKabi

@kabileshKabi
Author

> @kabileshKabi thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?

BR is running on a VM node. The backup is not being killed; it just exits.

@kabileshKabi
Author

> Also, your cluster has 600k tables, so the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal.

This seems to be the issue: BR needs to load the schema and hits an RPC timeout.
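If loading the schema for all 600k tables at once is the bottleneck, one workaround sketch is to back up databases in smaller batches with `br backup db` (a real BR subcommand); the tenant database names, bucket layout, and dry-run loop below are illustrative, not taken from this thread:

```shell
#!/bin/sh
# Sketch: back up each tenant database separately so a single BR invocation
# only has to load one database's schema. DRY_RUN=1 just prints the commands
# instead of running them.
PD="10.1.3.190:2379"
BUCKET="s3://us-chat-db-tidb-backup"
DRY_RUN=1

for db in tenant_0001 tenant_0002 tenant_0003; do   # placeholder names
  cmd="br backup db --db $db --pd $PD --storage $BUCKET/$db --log-file br_$db.log"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
  else
    $cmd || { echo "backup failed for $db" >&2; exit 1; }
  fi
done
```

Note the trade-off: per-database backups are not a single consistent snapshot across all tenants, so this only fits if cross-tenant consistency is not required.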

@kabileshKabi
Author

kabileshKabi commented Dec 25, 2024

I am sharing the log again.

Command execution log:

```
Starting component br: /home/ec2-user/.tiup/components/br/v8.1.1/br backup full --pd 10.1.3.190:2379 --storage s3://us-chat-db-tidb-backup/test_backupdec252024 --log-file backupdec252024.log
Detail BR log in backupdec252024.log

Full Backup <...................................................................................................................................................................> 0.00%

[ec2-user@ip-10-1
```

The above command is run from the tiup node and exits as shown above without printing any error; we only get INFO lines like the ones below. While the command is running we also observed high memory and CPU usage on the tiup node.

```
[2024/12/25 04:53:54.515 +00:00] [INFO] [backup.go:491] ["current backup safePoint job"] [safePoint="{ID=br-cfab8f42-1743-412b-b509-4e8661991690,TTL=1h12m0s,BackupTime=\"2024-12-25 04:53:54.48 +0000 UTC\",BackupTS=454846692584325250}"]
[2024/12/25 04:53:54.563 +00:00] [INFO] [manager.go:261] ["break campaign loop, NewSession failed"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 72161622-c91b-4405-857a-a5553d977546"] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/errors.go:178\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tidb/pkg/util.contextDone\n\t/workspace/source/tidb/pkg/util/etcd.go:90\ngithub.com/pingcap/tidb/pkg/util.NewSession\n\t/workspace/source/tidb/pkg/util/etcd.go:50\ngithub.com/pingcap/tidb/pkg/owner.(*ownerManager).campaignLoop\n\t/workspace/source/tidb/pkg/owner/manager.go:259\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197"]
[2024/12/25 04:53:55.840 +00:00] [INFO] [manager.go:317] ["revoke session"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 72161622-c91b-4405-857a-a5553d977546"] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"]
[2024/12/25 04:54:57.889 +00:00] [INFO] [client.go:531] ["backup empty database"] [db=test]
[2024/12/25 04:55:19.481 +00:00] [INFO] [client.go:531] ["backup empty database"] [db=testplay]
[2024/12/25 04:55:51.300 +00:00] [INFO] [backup.go:578] ["get placement policies"] [count=0]
```

I am attaching the full logs here as well.

Thank you. [backup_log_new.txt](https://github.com/user-attachments/files/18243431/backup_log_new.txt)

