Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(task): fixed the issue where the default startup timeout was producing agreements and activities on many providers #73

Merged
merged 19 commits into from
May 10, 2024

Conversation

mgordel
Copy link
Contributor

@mgordel mgordel commented May 7, 2024

This pr includes fixes to add additional checks when adding a task to the queue. Also, when starting a task in the service, task status is checked before starting.

@mgordel mgordel requested review from SewerynKras and grisha87 May 8, 2024 11:07
@mgordel mgordel self-assigned this May 8, 2024
@grisha87
Copy link
Contributor

grisha87 commented May 8, 2024

I did test the implementation and there are few issues present:

Test code:

import { TaskExecutor } from "@golem-sdk/task-executor";

import { pinoPrettyLogger } from "@golem-sdk/pino-logger";

(async function main() {
  const executor = await TaskExecutor.create({
    package: "golem/alpine:latest",
    logger: pinoPrettyLogger({
      level: "info",
    }),
    maxTaskRetries: 3,
    // taskStartupTimeout: 100, // ms - FIXME, THIS DOESN'T  WORK - not used by TE
    startupTimeout: 100, // ms
    maxParallelTasks: 100, // ms
  });
  try {
    await Promise.allSettled(
      [1].map((val) =>
        executor.run((ctx) => ctx.run(`echo 'Hello World - ${val}'`)).then((res) => console.log(res.stdout)),
      ),
    );
  } catch (error) {
    console.error("Computation failed:", error);
  } finally {
    await executor.shutdown();
  }
})();

Output:

[21:51:39.381] INFO: Agreement Pool Service has started
[21:51:39.383] INFO: Stats Service has started
[21:51:39.387] INFO: Payment Service has started
[21:51:39.390] INFO: GFTP server spawned
[21:51:39.391] INFO: GFTP Version: 0.15.0 (d2d88ad3 2024-04-03 build #420)
[21:51:39.684] INFO: Demand published on the market
[21:51:39.685] INFO: Market Service has started
[21:51:39.685] INFO: Task Service has started
[21:51:39.686] INFO: Task Executor has started {"subnet":"public","network":"holesky","driver":"erc20"}
Could not start any work on Golem. Processed 0 initial proposals from yagna, filters accepted 0. Check your demand if it's not too restrictive or restart yagna.
[21:51:40.936] WARN: Task execution failed. Trying to redo the task. {"taskId":"1","attempt":1,"reason":"Task startup 1 timeout. Failed to sign an agreement with the provider within the specified time."}
[21:51:41.938] WARN: Task execution failed. Trying to redo the task. {"taskId":"1","attempt":2,"reason":"Task startup 1 timeout. Failed to sign an agreement with the provider within the specified time."}
[21:51:42.027] INFO: Agreement confirmed by provider {"providerName":"UK1-6.h"}
[21:51:42.102] WARN: Task 1 cannot be started because it is in retry state
[21:51:42.493] INFO: Invoice has been accepted {"invoiceId":"8453d5ea-e193-428f-a122-fd88f11d3ab6","agreementId":"056561ff1600c2d7d39177d50c40a12b6efc63e1c3893f9cbf0481ee636e5ab7","providerName":"UK1-6.h"}
[21:51:42.939] WARN: Task execution failed. Trying to redo the task. {"taskId":"1","attempt":3,"reason":"Task startup 1 timeout. Failed to sign an agreement with the provider within the specified time."}
[21:51:42.957] INFO: Agreement confirmed by provider {"providerName":"bristlecone.h"}
[21:51:43.006] WARN: Task 1 cannot be started because it is in retry state
[21:51:43.065] INFO: Agreement confirmed by provider {"providerName":"DE1-8.h"}
[21:51:43.158] WARN: Task 1 cannot be started because it is in retry state
[21:51:43.487] INFO: Invoice has been accepted {"invoiceId":"d0b26ba0-2d6a-4bcd-b9cf-0eadc5f9f8b5","agreementId":"bfde48399cba66639ff955022c58b1b1c938dc2f036b976c4979cf0583694c68","providerName":"bristlecone.h"}
[21:51:43.520] INFO: Invoice has been accepted {"invoiceId":"045cd657-3a58-4cee-8dfe-b9fcce3336ca","agreementId":"1336ab42a40803baf81ff62d6381a1ce5c976fc4ce020b00877b6a7e47b0f350","providerName":"DE1-8.h"}
[21:51:43.940] ERROR: Task has been rejected {"taskId":"1","reason":"Task startup 1 timeout. Failed to sign an agreement with the provider within the specified time.","retries":3}
[21:51:44.052] INFO: Agreement confirmed by provider {"providerName":"DE1-4.h"}
[21:51:44.143] WARN: Task 1 cannot be started because it is in rejected state
[21:51:44.525] INFO: Invoice has been accepted {"invoiceId":"a40bb24a-5d5e-4c5c-b0c5-65276e45faec","agreementId":"f580298039d3ea2ccfb06d951893385385779bade38ab44219681435d13b554b","providerName":"DE1-4.h"}
[21:51:45.688] INFO: Task Service has been stopped
[21:51:45.688] INFO: Agreement Pool Service has been stopped
[21:51:45.690] INFO: Market Service has been stopped
[21:51:45.690] INFO: Waiting for all agreement processes to be completed. {"numberOfProcesses":4}
[21:51:49.443] INFO: Allocation has been released
[21:51:49.443] INFO: Payment service has been stopped
[21:51:49.448] INFO: Computation finished in 6.0 sec.
[21:51:49.448] INFO: Negotiation summary: {"agreements":4,"providers":4}
[21:51:49.448] INFO: Agreement 1: {"agreement":"056561ff16","provider":"UK1-6.h","tasks":0,"cost":0,"paymentStatus":"paid"}
[21:51:49.448] INFO: Agreement 2: {"agreement":"bfde48399c","provider":"bristlecone.h","tasks":0,"cost":0,"paymentStatus":"paid"}
[21:51:49.448] INFO: Agreement 3: {"agreement":"1336ab42a4","provider":"DE1-8.h","tasks":0,"cost":0,"paymentStatus":"paid"}
[21:51:49.448] INFO: Agreement 4: {"agreement":"f580298039","provider":"DE1-4.h","tasks":0,"cost":0,"paymentStatus":"paid"}
[21:51:49.448] INFO: Cost summary: {"totalCost":0,"totalPaid":0}
[21:51:49.448] INFO: Task Service has stopped
[21:51:49.448] INFO: Task Executor has shut down

Actual result

The task timeout was reached pretty fast (100ms). We poked 4 providers in total, generating 4 invoices as a consequence. If these providers had a starting price set, the payments would be made even if we didn't schedule any work (we didn't create an activity at all because of the bail-out).

Expected result

These invoices shouldn't be generated at all. In general, the expected results described in JST-881 are not achieved with this implementation.

Additional issues

When testing, I found out that:

  • We print out Failed to sign an agreement with the provider within the specified time., but that's a misleading lie. All that we know is that we didn't start the task on time (X ms). This piece of log should be removed to not mislead the users.
  • ExecutorOptions.taskStartupTimeout property is not used anywhere in the codebase (ineffective setting to 5ms), I had to use startupTimeout to trigger the behavior. It seems that the logic is bound to the wrong config setting.
  • I wanted to attach a listener via Task.onStateChange when the task was started to see how the statuses transition over time. The only transition I was able to spot was entering the "Pending" state - none of the "Rejected", and "Retry" were passed to my listener. It seems that the resource-cleaning logic on the task is doing an aggressive cleanup.

What's more, increasing the startupTimeout to 2 seconds can also produce an outcome where more providers is getting paid even due to timeouts and retries:

Code:

const executor = await TaskExecutor.create({
    package: "golem/alpine:latest",
    logger: pinoPrettyLogger({
      level: "info",
    }),
    maxTaskRetries: 3,
    // taskStartupTimeout: 100, // ms - FIXME, THIS DOESN'T  WORK - not used by TE
    startupTimeout: 2000, // ms
    maxParallelTasks: 5, // ms
  });

Output:

[22:06:05.600] INFO: Agreement Pool Service has started
[22:06:05.601] INFO: Stats Service has started
[22:06:05.605] INFO: Payment Service has started
[22:06:05.608] INFO: GFTP server spawned
[22:06:05.609] INFO: GFTP Version: 0.15.0 (d2d88ad3 2024-04-03 build #420)
[22:06:05.886] INFO: Demand published on the market
[22:06:05.887] INFO: Market Service has started
[22:06:05.887] INFO: Task Service has started
[22:06:05.888] INFO: Task Executor has started {"subnet":"public","network":"holesky","driver":"erc20"}
[22:06:09.038] WARN: Task execution failed. Trying to redo the task. {"taskId":"1","attempt":1,"reason":"Task startup 1 timeout. Failed to sign an agreement with the provider within the specified time."}
[22:06:09.317] INFO: Agreement confirmed by provider {"providerName":"WAW1-9.h"}
[22:06:09.377] INFO: Agreement confirmed by provider {"providerName":"BHS5-7.h"}
[22:06:10.280] INFO: Task started {"taskId":"1","providerName":"BHS5-7.h","activityId":"df6dd339a236494eacc554351f4f6756"}
[22:06:12.780] WARN: Task execution failed. Trying to redo the task. {"taskId":"1","attempt":2,"reason":"You cannot start a task that is not queued"}
[22:06:13.243] INFO: Agreement confirmed by provider {"providerName":"DE1-2.h"}
[22:06:13.784] INFO: Invoice has been accepted {"invoiceId":"a67183e3-c4ce-43a7-ad64-3afe572bd0e6","agreementId":"77c4a4a5a9115e0b0e933b7349b3c2e8b44df8c9c1b38879b1718dcd488b2856","providerName":"BHS5-7.h"}
[22:06:14.838] INFO: Task started {"taskId":"1","providerName":"DE1-2.h","activityId":"789674d9ddda4032831c20a9b2fd6391"}
[22:06:22.776] INFO: Task computed {"taskId":"1","retries":2,"providerName":"DE1-2.h"}
Hello World - 1

[22:06:23.899] INFO: Market Service has been stopped
[22:06:24.030] INFO: Agreement Pool Service has been stopped
[22:06:24.753] INFO: Task Service has been stopped
[22:06:24.753] INFO: Waiting for all agreement processes to be completed. {"numberOfProcesses":3}
[22:06:24.756] INFO: Invoice has been accepted {"invoiceId":"5b1db29b-07b9-408f-9b16-42d50afb1a1f","agreementId":"ba955a3b2ff70012b404459c8e36404426a29faf4f595b0953caa473be93f8a3","providerName":"DE1-2.h"}
[22:06:24.757] WARN: DebitNote rejected {"reason":"DebitNote 3934fbc9-fa2a-429d-8bfa-0bd193481dc7 rejected because the agreement ba955a3b2ff70012b404459c8e36404426a29faf4f595b0953caa473be93f8a3 is already covered with a final invoice that should be paid instea
d of the debit note"}
[22:06:24.778] INFO: Invoice has been accepted {"invoiceId":"52d5aa6a-efa4-4374-bd21-28419bed7d74","agreementId":"cccff7beb5103e6ee18219e0a11792e15f278ae9171b144ed172f7d060e2fbfb","providerName":"WAW1-9.h"}
[22:06:24.778] WARN: DebitNote rejected {"reason":"DebitNote 1c2aa178-1980-41d7-9872-f452d049fc42 rejected because the agreement cccff7beb5103e6ee18219e0a11792e15f278ae9171b144ed172f7d060e2fbfb is already covered with a final invoice that should be paid instea
d of the debit note"}
[22:06:29.673] INFO: Allocation has been released
[22:06:29.673] INFO: Payment service has been stopped
[22:06:29.678] INFO: Computation finished in 18.0 sec.
[22:06:29.678] INFO: Negotiation summary: {"agreements":3,"providers":3}
[22:06:29.678] INFO: Agreement 1: {"agreement":"77c4a4a5a9","provider":"BHS5-7.h","tasks":0,"cost":0.0002030522428,"paymentStatus":"paid"}
[22:06:29.678] INFO: Agreement 2: {"agreement":"cccff7beb5","provider":"WAW1-9.h","tasks":0,"cost":0.0006554719435,"paymentStatus":"paid"}
[22:06:29.678] INFO: Agreement 3: {"agreement":"ba955a3b2f","provider":"DE1-2.h","tasks":1,"cost":0.00085927482475,"paymentStatus":"paid"}
[22:06:29.678] INFO: Cost summary: {"totalCost":0.00171779901105,"totalPaid":0.00171779901105}
[22:06:29.678] INFO: Task Service has stopped
[22:06:29.678] INFO: Task Executor has shut down

Copy link
Contributor

@grisha87 grisha87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per previous comment.

@mgordel mgordel changed the base branch from master to beta May 9, 2024 13:46
@mgordel
Copy link
Contributor Author

mgordel commented May 10, 2024

It has been fixed:

  • fixed passing taskStartupTimeout parameter as ExecutorOptions.
  • removed default timeouts for taskTimeout and taskStartupTimeout - by default there will be no timeout if user does not explicitly specify these values.
  • fixed log message in case of task startup timeout
  • only retry a task if a GolemWorkError occurs, in any other type of error (including TimeoutError) the task will be rejected.
  • added additional task status check during execution
  • TaskService is started only when at least one proposal is confirmed

@mgordel mgordel requested a review from grisha87 May 10, 2024 12:14
src/config.ts Outdated Show resolved Hide resolved
@@ -36,7 +36,7 @@ export class ExecutorConfig {
readonly package?: Package | string;
readonly maxParallelTasks: number;
readonly taskTimeout: number;
readonly startupTaskTimeout: number;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In one way that's a BC-breaking change (removal of the field). We will have to point this out in the release notes. Probably that will bump the version of the TE :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, this is just a change in the config file. The public ExecutorOptions remain unchanged as the correct taskStartupTimeout.

src/executor.ts Outdated
@@ -540,4 +544,16 @@ export class TaskExecutor {
}
}, this.options.startupTimeout);
}

private async waitForAtLeastOneConfirmedProposal() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more elegant solution would be to move this logic to TaskService.run. You have a loop there already, you can check the marketService in that loop for proposals, and if there are none, sleep and don't take a task from the queue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it will be much simpler in TaskService. I moved this

@grisha87 grisha87 changed the title fix(queue): fixed adding existing task to the queue when retrying fix(task): fixed the issue where the default startup timeout was producing agreements and activities on many providers May 10, 2024
@grisha87
Copy link
Contributor

Results after A/B testing:

  • A - new implementation, taskTimeout and taskStartupTimeout are no longer with 2m and 5m defaults + we don't retry tasks automatically if they fail due to the user specified timeout (controlled by taskRetryOnTimeout
  • B - old version with 2m task startup timeout and 5m of execution timeout defaults

Test

10x runs which execute 20 tasks on the network

import { TaskExecutor } from "@golem-sdk/task-executor";
import { pinoPrettyLogger } from "@golem-sdk/pino-logger";

type TestConfig = {
  name: string;
  tasks: number;
  maxParallelTasks: number;
  maxTaskRetries: number;
  taskStartupTimeoutSec: number | null;
  taskTimeoutSec: number | null;
  taskRetryOnTimeout: boolean | null;
  useReputation: boolean | null;
};

/**
 * Configs:
 *
 * - maxParallelTasks, 5
 * - maxTaskRetries: 3
 * - taskStartupTimeout: undefined, 5s, 15s, 30s, 60s, 90s, 120s
 * - taskTimeout: undefined, 5s, 15s, 30s, 60s, 90s, 120s
 * - taskRetryOnTimeout: false, true
 * - use reputation
 *
 * Metrics:
 *
 * - number of tasks
 * - number of retries
 * - number of successful
 * - number of failed
 * - number of providers
 * - number of agreements
 * - number of paid invoices
 * - number of unpaid invoices
 * - total execution time
 */

async function testRun(config: TestConfig) {
  const {
    name,
    tasks,
    maxParallelTasks,
    maxTaskRetries,
    taskStartupTimeoutSec,
    taskTimeoutSec,
    taskRetryOnTimeout,
    useReputation,
  } = config;

  const stats = {
    tasks,
    retries: 0,
    successful: 0,
    failed: 0,
    successRate: 0,
    providers: 0,
    agreements: 0,
    invoicesReceived: 0,
    invoicesPaid: 0,
    invoicesUnpaid: 0,
    invoicesMissing: 0,
    invoicePaymentRate: 0,
  };

  const timeStartMs = Date.now();

  const executor = await TaskExecutor.create({
    package: "golem/alpine:latest",
    logger: pinoPrettyLogger({
      level: "silent",
    }),
    maxTaskRetries,
    taskStartupTimeout: taskStartupTimeoutSec ? taskStartupTimeoutSec * 1000 : undefined,
    taskTimeout: taskTimeoutSec ? taskTimeoutSec * 1000 : undefined,
    taskRetryOnTimeout,
    maxParallelTasks,
  });

  try {
    const results = await Promise.allSettled(
      new Array(tasks)
        .fill(0)
        .map((val, index) =>
          executor.run((ctx) => ctx.run(`echo 'Hello World - ${index}.${val}'`)).then((res) => res.stdout),
        ),
    );

    stats.successful = results.filter((p) => p.status === "fulfilled").length;
    stats.failed = results.filter((p) => p.status === "rejected").length;
    stats.successRate = stats.successful / (stats.successful + stats.failed);
  } catch (error) {
    console.error("Computation failed:", error);
  } finally {
    await executor.shutdown();
  }

  const timeStopMs = Date.now();
  const testDurationSec = (timeStopMs - timeStartMs) / 1000;

  const teStats = executor.getStats();

  stats.retries = teStats.retries;
  stats.providers = teStats.providers;
  stats.agreements = teStats.agreements;
  stats.invoicesReceived = teStats.invoicesReceived;
  stats.invoicesUnpaid = teStats.invoicesUnpaid;
  stats.invoicesPaid = teStats.invoicesPaid;
  stats.invoicesMissing = teStats.invoicesMissing;
  stats.invoicePaymentRate = teStats.invoicePaymentRate;

  return {
    ...config,
    ...stats,
    testDurationSec,
  };
}

(async () => {
  const variants: TestConfig[] = [
    {
      name: "New defaults - no timeout and timeout retries",
      tasks: 20,
      taskTimeoutSec: null,
      taskStartupTimeoutSec: null,
      maxParallelTasks: 5,
      maxTaskRetries: 3,
      taskRetryOnTimeout: false,
      useReputation: null,
    },
    {
      name: "Short (10s) timeouts without retries",
      tasks: 20,
      taskTimeoutSec: 10,
      taskStartupTimeoutSec: 10,
      maxParallelTasks: 5,
      maxTaskRetries: 3,
      taskRetryOnTimeout: false,
      useReputation: null,
    },
    {
      name: "Short (10s) timeouts with retries",
      tasks: 20,
      taskTimeoutSec: 10,
      taskStartupTimeoutSec: 10,
      maxParallelTasks: 5,
      maxTaskRetries: 3,
      taskRetryOnTimeout: true,
      useReputation: null,
    },
    {
      name: "Old default timeouts (2m + 5m) timeouts without retries",
      tasks: 20,
      taskTimeoutSec: 5 * 60,
      taskStartupTimeoutSec: 2 * 60,
      maxParallelTasks: 5,
      maxTaskRetries: 3,
      taskRetryOnTimeout: false,
      useReputation: null,
    },
    {
      name: "Old default timeouts (2m + 5m) timeouts with retries",
      tasks: 20,
      taskTimeoutSec: 5 * 60,
      taskStartupTimeoutSec: 2 * 60,
      maxParallelTasks: 5,
      maxTaskRetries: 3,
      taskRetryOnTimeout: true,
      useReputation: null,
    },
  ];

  const attempts = 10;

  for (const variant of variants) {
    const summary: Awaited<ReturnType<typeof testRun>>[] = [];

    for (let i = 0; i < attempts; i++) {
      console.log("Running attempt", i + 1, variant.name);
      const report = await testRun(variant);
      console.log("Test Run Report");
      console.table(report);
      summary.push(report);
    }

    console.log("Test Summary:", variant.name);
    console.table(summary);
  }
})().catch(console.error);

Results

Average test duration

  • A = 65.2s (3.26 s/task)
  • B = 55.4s (2.77 s/task)

Note: The new implementation required 0.5s per task more to complete.

Average number of agreements created

  • A = 5.1
  • B = 10.3

Note: The new implementation produced 50% fewer agreements and invoices.

Task success rate

  • A = 100%
  • B = 97.50%

Note: The new implementation has a better success rate as it's not so aggressive with retries.

Payment rate

  • A = 98%
  • B = 95.80%

Note: The new implementation has a slightly better invoice payment rate. This mostly results from the fact that it produces 50% fewer invoices so the probability for failure decreases.

@mgordel mgordel merged commit cb571c6 into beta May 10, 2024
16 checks passed
@mgordel mgordel deleted the bugfix/JST-881/task-retrying branch May 10, 2024 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants