Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

on owner change, when DDL with higher schema version commits before a DDL with lower schema version, MDL fails to update version for it #58483

Open
D3Hunter opened this issue Dec 23, 2024 · 0 comments
Assignees
Labels
component/ddl This issue is related to DDL of TiDB. severity/moderate type/bug The issue is confirmed as a bug.

Comments

@D3Hunter
Copy link
Contributor

D3Hunter commented Dec 23, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

this case only happens on owner change, as there can only be 1 schema version update in progress at the same time.

we use create-table as a example, timeline for the case:

  • owner A runs create table aaa J1, after alloc schema version X, but before it commit the txn with diff, owner change to B
  • B might run create table bbb J2 first as we run jobs without dependency concurrently, so it alloc schema version X+1, and commit the txn with diff
  • B start to wait schema version synced for J2, and all node update their version to X+1, and they all set jobNeedToSync=false

    tidb/pkg/domain/domain.go

    Lines 1098 to 1100 in 042a332

    if len(jobsVerMap) == jobNeedToCheckCnt {
    jobNeedToSync = false
    }
  • A might hasn't retire completely, so it keeps running, and commit the txn with diff, J1 state changed to done
  • A retire completely before it start waiting schema synced
  • B starts to run J1, and fails with create the table as it's already created by A, but we didn't cancel it in here, so it keeps running
    err = createTableOrViewWithCheck(metaMut, job, schemaID, tbInfo)
    if err != nil {
    return tbInfo, errors.Trace(err)
    }
    .
    • Also due to the bug DDL job state change might be overridden #52747, the state of J1 is overridden to running, because:
      • we set to running in here
        job.State = model.JobStateRunning
      • we fail at here as it's already set by A
        err = createTableOrViewWithCheck(metaMut, job, schemaID, tbInfo)
        if err != nil {
        return tbInfo, errors.Trace(err)
        }
      • the runJobErr is saved, and retried later, in this case the job never ends until it used up ddl-error-count-limit, it's another issue,

        tidb/pkg/ddl/job_worker.go

        Lines 589 to 600 in 042a332

        if runJobErr != nil && !job.IsRollingback() && !job.IsRollbackDone() {
        // If the running job meets an error
        // and the job state is rolling back, it means that we have already handled this error.
        // Some DDL jobs (such as adding indexes) may need to update the table info and the schema version,
        // then shouldn't discard the KV modification.
        // And the job state is rollback done, it means the job was already finished, also shouldn't discard too.
        // Otherwise, we should discard the KV modification when running job.
        w.sess.Reset()
        // If error happens after updateSchemaVersion(), then the schemaVer is updated.
        // Result in the retry duration is up to 2 * lease.
        schemaVer = 0
        }
  • B tries to run J1 again, in this time, J1 state is running, but found that it has un-synced schema version change in here
    if jobCtx.isUnSynced(job.ID) || (job.Started() && !jobCtx.maybeAlreadyRunOnce(job.ID)) {
    if variable.EnableMDL.Load() {
    version, err := s.sysTblMgr.GetMDLVer(s.schCtx, job.ID)
    if err == nil {
    err = waitVersionSynced(s.schCtx, jobCtx, job, version)
  • but all nodes have already synced to X+1, so MDL check loop keeps continue in here

    tidb/pkg/domain/domain.go

    Lines 1063 to 1070 in 042a332

    maxVer := do.mdlCheckTableInfo.newestVer
    if maxVer > saveMaxSchemaVersion {
    saveMaxSchemaVersion = maxVer
    } else if !jobNeedToSync {
    // Schema doesn't change, and no job to check in the last run.
    do.mdlCheckTableInfo.mu.Unlock()
    continue
    }

it also introduced another bug #58486 that even through all node have synced to X+1, they all think X is a hole and skip it, so no one knows aaa is created even we fix all above issues.

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/ddl This issue is related to DDL of TiDB. severity/moderate type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant