[BUG] Race condition in JobScheduler#schedule and JobScheduler#deschedule #275

Open
xiaoyuan0821 opened this issue Nov 29, 2022 · 1 comment
Labels
bug Something isn't working

xiaoyuan0821 commented Nov 29, 2022

What is the bug?
We encountered a managed ISM index that was stuck in the initializing status for several weeks.

How can one reproduce the bug?
There is a race condition between JobScheduler#schedule and JobScheduler#deschedule. The following unit test, when added to JobSchedulerTests, fails:

public void testRaceCondition() throws InterruptedException {

        String indexName = ".opendistro-ism-config";
        String docId = "test-doc-id";
        ScheduledJobParameter jobParameter = buildScheduledJobParameter(docId, "dummy job name",
                Instant.now(), Instant.now(), new IntervalSchedule(Instant.now(), 5, ChronoUnit.MINUTES), true);
        ScheduledJobRunner runner = Mockito.mock(ScheduledJobRunner.class);
        Scheduler.ScheduledCancellable cancellable = Mockito.mock(Scheduler.ScheduledCancellable.class);
        Mockito.when(this.threadPool.schedule(Mockito.any(), Mockito.any(), Mockito.anyString())).thenReturn(cancellable);
        Mockito.when(cancellable.cancel()).thenReturn(true);

        for (int i = 0; i < 10000; i++) {
            logger.info("start iteration {}", i);
            // schedule thread
            Thread scheduleThread = new Thread(() -> scheduler.schedule(indexName, docId, jobParameter, runner, dummyVersion, jitterLimit));
            // deschedule thread
            Thread descheduleThread = new Thread(() -> scheduler.deschedule(indexName, docId));
            // start them
            scheduleThread.start();
            descheduleThread.start();
            // wait for them to end
            scheduleThread.join();
            descheduleThread.join();
            // deschedule again to make sure the job is removed from scheduler#scheduledJobInfo
            scheduler.deschedule(indexName, docId);
            // after the final deschedule, scheduledJobInfo should no longer contain the job
            assertNull(scheduler.getScheduledJobInfo().getJobInfo(indexName, docId));
            logger.info("end iteration {}", i);
        }
    }

On my desktop, the test fails after 1200+ iterations:

// ... many log lines omitted
[2022-11-29T07:36:23,512][INFO ][o.o.j.s.JobSchedulerTests] [testRaceCondition] start iteration 1265
[2022-11-29T07:36:23,512][INFO ][o.o.j.s.JobScheduler     ] [[Thread-2535]] Scheduling job id test-doc-id for index .opendistro-ism-config .
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobScheduler     ] [testRaceCondition] Descheduling jobId: test-doc-id
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobSchedulerTests] [testRaceCondition] end iteration 1265
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobSchedulerTests] [testRaceCondition] start iteration 1266
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobScheduler     ] [[Thread-2537]] Scheduling job id test-doc-id for index .opendistro-ism-config .
[2022-11-29T07:36:23,513][INFO ][o.o.j.s.JobScheduler     ] [[Thread-2538]] Descheduling jobId: test-doc-id
[2022-11-29T07:36:23,514][INFO ][o.o.j.s.JobScheduler     ] [[Thread-2537]] not scheduled because already removed
[2022-11-29T07:36:23,514][INFO ][o.o.j.s.JobScheduler     ] [testRaceCondition] Descheduling jobId: test-doc-id
[2022-11-29T07:36:23,546][INFO ][o.o.j.s.JobSchedulerTests] [testRaceCondition] after test
REPRODUCE WITH: gradlew ':test' --tests "org.opensearch.jobscheduler.scheduler.JobSchedulerTests.testRaceCondition" -Dtests.seed=E33377BC38A3CD99 -Dtests.security.manager=false -Dtests.locale=ar-JO -Dtests.timezone=Africa/Nouakchott -Druntime.java=12

expected null, but was:<org.opensearch.jobscheduler.scheduler.JobSchedulingInfo@2a85bf7a>
java.lang.AssertionError: expected null, but was:<org.opensearch.jobscheduler.scheduler.JobSchedulingInfo@2a85bf7a>
	at __randomizedtesting.SeedInfo.seed([E33377BC38A3CD99:4CEC8FE92BF01C19]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotNull(Assert.java:756)
	at org.junit.Assert.assertNull(Assert.java:738)
	at org.junit.Assert.assertNull(Assert.java:748)
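
The test above needed more than a thousand iterations before the two threads overlapped. A start gate may make the overlap more likely on each iteration, because both threads are released at the same moment instead of one after the other. The sketch below is only an illustration: `StartGate` and `raceOnce` are hypothetical names and are not part of JobScheduler or its test suite.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class StartGate {

    /** Starts both tasks on fresh threads, releases them together, and waits for both to finish. */
    public static void raceOnce(Runnable first, Runnable second) {
        CountDownLatch gate = new CountDownLatch(1);
        Thread a = new Thread(() -> { awaitQuietly(gate); first.run(); });
        Thread b = new Thread(() -> { awaitQuietly(gate); second.run(); });
        a.start();
        b.start();
        gate.countDown(); // both threads are now free to run
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void awaitQuietly(CountDownLatch gate) {
        try {
            gate.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        AtomicInteger ran = new AtomicInteger();
        raceOnce(ran::incrementAndGet, ran::incrementAndGet);
        System.out.println("tasks run: " + ran.get());
    }
}
```

In the test, the two lambdas passed to the schedule and deschedule threads could be handed to `raceOnce` instead. Whether this actually shortens the time to failure depends on the machine and is not something the report measures.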

In ISM, if a user adds a policy to an index and removes it immediately, there is a chance of triggering this bug.
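
One interleaving that would be consistent with the log line "not scheduled because already removed" followed by a lingering JobSchedulingInfo can be replayed deterministically in a simplified model. The classes below are hypothetical stand-ins, not the JobScheduler source: they assume that schedule first publishes a bookkeeping entry and only later attaches a cancellable handle, and that deschedule removes the entry only when a handle is present. Whether the real code has exactly this shape is an assumption. The second registry shows one possible remedy, guarding both operations with the same lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StrandedEntryDemo {

    /** Hypothetical per-job record; `handle` stands in for a cancellable timer. */
    static final class Entry {
        volatile boolean descheduled;
        volatile Object handle;
    }

    /** Assumed check-then-act shape; schedule is split in two so the race can be replayed. */
    static final class RacyRegistry {
        final Map<String, Entry> jobs = new ConcurrentHashMap<>();

        Entry register(String id) {                 // schedule, step 1: publish the entry
            return jobs.computeIfAbsent(id, k -> new Entry());
        }

        boolean arm(Entry e) {                      // schedule, step 2: attach a handle
            if (e.descheduled) {
                return false;                       // "not scheduled because already removed"
            }
            e.handle = new Object();
            return true;
        }

        void deschedule(String id) {
            Entry e = jobs.get(id);
            if (e == null) {
                return;
            }
            e.descheduled = true;
            if (e.handle != null) {                 // removal happens only if a handle exists
                e.handle = null;
                jobs.remove(id);
            }
        }
    }

    /** One possible remedy: both operations are atomic with respect to each other. */
    static final class LockedRegistry {
        private final Map<String, Entry> jobs = new HashMap<>();

        synchronized void schedule(String id) {
            Entry e = jobs.computeIfAbsent(id, k -> new Entry());
            if (e.handle == null) {
                e.handle = new Object();
            }
        }

        synchronized void deschedule(String id) {
            Entry e = jobs.remove(id);              // always remove, handle or not
            if (e != null) {
                e.handle = null;
            }
        }

        synchronized boolean contains(String id) {
            return jobs.containsKey(id);
        }
    }

    /** Replays the interleaving by hand; returns true if the entry is stranded. */
    public static boolean racyReplayLeavesStaleEntry() {
        RacyRegistry r = new RacyRegistry();
        Entry e = r.register("job");   // schedule thread publishes the entry
        r.deschedule("job");           // deschedule thread: flag set, no handle, no removal
        r.arm(e);                      // schedule thread: sees the flag and skips arming
        r.deschedule("job");           // cleanup call: still no handle, still no removal
        return r.jobs.containsKey("job");
    }

    /** Runs real threads against the locked variant; returns true if an entry is ever stranded. */
    public static boolean lockedStressLeavesStaleEntry(int iterations) {
        LockedRegistry r = new LockedRegistry();
        for (int i = 0; i < iterations; i++) {
            Thread s = new Thread(() -> r.schedule("job"));
            Thread d = new Thread(() -> r.deschedule("job"));
            s.start();
            d.start();
            try {
                s.join();
                d.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(ex);
            }
            r.deschedule("job");
            if (r.contains("job")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("racy replay strands entry: " + racyReplayLeavesStaleEntry());
        System.out.println("locked variant strands entry: " + lockedStressLeavesStaleEntry(1000));
    }
}
```

In this model the stranded entry keeps its descheduled flag and never receives a handle, so no later deschedule call can remove it. That would match a job that stays stuck rather than recovering on its own. A coarse lock is only one option; the sketch does not claim it is the fix the maintainers would choose.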

What is your host/environment?

  • All versions of both the Open Distro JobScheduler and the OpenSearch JobScheduler have this bug.

saratvemulapalli (Member) commented

@joshpalis @vibrantvarun tagging you along; this looks like a genuine bug. Let's take care of this?
