
Backup (BR) fails for huge table count #58513

Open
kabileshKabi opened this issue Dec 24, 2024 · 5 comments
Labels
component/br This issue is related to BR of TiDB. type/bug The issue is confirmed as a bug.

Comments

@kabileshKabi

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Issue Description:
Our application is a SaaS-based multi-tenancy application where each tenant has its own database; we have more than 14k databases and more than 600k tables.

We have a strict backup requirement, but when we run a BR full backup it shows no progress and fails with an RPC error.

Command and Log:

```
Starting component br: /home/ec2-user/.tiup/components/br/v8.1.1/br backup full --pd 10.1.3.190:2379 --storage s3://us-chat-db-tidb-backup/test_backupdec_232024 --log-file backupdec232024.log
Detail BR log in backupdec232024.log

Full Backup <..................................................................................................................................................................> 0.00%
```


I am also attaching the detailed logs.


### 2. What did you expect to see? (Required)

We expect the backup to run completely and consistently

### 3. What did you see instead (Required)
We see backup failures; the BR backup does not complete.

### 4. What is your TiDB version? (Required)

```
mysql> select tidb_version();
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version()                                                                                                                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v8.1.1
Edition: Community
Git Commit Hash: a7df4f9845d5d6a590c5d45dad0dcc9f21aa8765
Git Branch: HEAD
UTC Build Time: 2024-08-22 05:49:03
GoVersion: go1.21.13
Race Enabled: false
Check Table Before Drop: false
Store: tikv |
```

[backup_log.txt](https://github.com/user-attachments/files/18239868/backup_log.txt)



@kabileshKabi kabileshKabi added the type/bug The issue is confirmed as a bug. label Dec 24, 2024
@BornChanger
Contributor

BornChanger commented Dec 24, 2024

@kabileshKabi thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?

@BornChanger BornChanger added the component/br This issue is related to BR of TiDB. label Dec 24, 2024
@BornChanger
Contributor

BornChanger commented Dec 24, 2024

Also, your cluster has 600k tables, so the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal. @kabileshKabi

@kabileshKabi
Author

> @kabileshKabi thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?

BR is running on a VM node. The backup is not being killed; it just exits.

@kabileshKabi
Author

> Also, your cluster has 600k tables, so the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal.

This seems to be the issue: BR needs to load the schema and hits an RPC timeout.
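If loading the schema for all 600k tables at once is the bottleneck, one workaround sketch is to back up databases in smaller batches with `br backup db` (a real BR subcommand); the tenant database names, bucket layout, and dry-run loop below are illustrative, not taken from this thread:

```shell
#!/bin/sh
# Sketch: back up each tenant database separately so a single BR invocation
# only has to load one database's schema. DRY_RUN=1 just prints the commands
# instead of running them.
PD="10.1.3.190:2379"
BUCKET="s3://us-chat-db-tidb-backup"
DRY_RUN=1

for db in tenant_0001 tenant_0002 tenant_0003; do   # placeholder names
  cmd="br backup db --db $db --pd $PD --storage $BUCKET/$db --log-file br_$db.log"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
  else
    $cmd || { echo "backup failed for $db" >&2; exit 1; }
  fi
done
```

Note the trade-off: per-database backups are not a single consistent snapshot across all tenants, so this only fits if cross-tenant consistency is not required.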

@kabileshKabi
Author

kabileshKabi commented Dec 25, 2024

I am sharing the log again.

Command execution log:

```
Starting component br: /home/ec2-user/.tiup/components/br/v8.1.1/br backup full --pd 10.1.3.190:2379 --storage s3://us-chat-db-tidb-backup/test_backupdec252024 --log-file backupdec252024.log
Detail BR log in backupdec252024.log

Full Backup <...................................................................................................................................................................> 0.00%

[ec2-user@ip-10-1
```

The above command is run from the tiup node and exits as shown above without printing any error; we only get INFO lines like the ones below. While the command is running we also observed high memory and CPU usage on the tiup node.

```
[2024/12/25 04:53:54.515 +00:00] [INFO] [backup.go:491] ["current backup safePoint job"] [safePoint="{ID=br-cfab8f42-1743-412b-b509-4e8661991690,TTL=1h12m0s,BackupTime=\"2024-12-25 04:53:54.48 +0000 UTC\",BackupTS=454846692584325250}"]
[2024/12/25 04:53:54.563 +00:00] [INFO] [manager.go:261] ["break campaign loop, NewSession failed"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 72161622-c91b-4405-857a-a5553d977546"] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/errors.go:178\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tidb/pkg/util.contextDone\n\t/workspace/source/tidb/pkg/util/etcd.go:90\ngithub.com/pingcap/tidb/pkg/util.NewSession\n\t/workspace/source/tidb/pkg/util/etcd.go:50\ngithub.com/pingcap/tidb/pkg/owner.(*ownerManager).campaignLoop\n\t/workspace/source/tidb/pkg/owner/manager.go:259\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197"]
[2024/12/25 04:53:55.840 +00:00] [INFO] [manager.go:317] ["revoke session"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 72161622-c91b-4405-857a-a5553d977546"] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"]
[2024/12/25 04:54:57.889 +00:00] [INFO] [client.go:531] ["backup empty database"] [db=test]
[2024/12/25 04:55:19.481 +00:00] [INFO] [client.go:531] ["backup empty database"] [db=testplay]
[2024/12/25 04:55:51.300 +00:00] [INFO] [backup.go:578] ["get placement policies"] [count=0]
```

I am attaching the full logs here as well.

Thank you. [backup_log_new.txt](https://github.com/user-attachments/files/18243431/backup_log_new.txt)

